DEV Community

Cover image for Semantic Layer Showdown: Cube vs dbt Semantic Layer vs Looker LookML
Gowtham Potureddi
Gowtham Potureddi

Posted on

Semantic Layer Showdown: Cube vs dbt Semantic Layer vs Looker LookML

A semantic layer is the part of the modern data stack that decides what "active user" means — once — so every dashboard, notebook, embedded chart, and LLM agent that asks the question receives the same answer. Skip it and every BI tool ships its own definition; ship it and the warehouse becomes the canonical source of metric truth instead of the source of metric disagreement.

This guide compares the three engines analytics engineers actually shortlist in 2026: cube.dev as the standalone open-source headless-BI engine, the dbt semantic layer powered by MetricFlow, and lookml as the original semantic model inside Looker. We walk through where each one sits between the warehouse and the consumer surface, how the data models map onto each other, how to define the same Weekly Active Users metric three ways, and how the metrics layer routes queries from Tableau, Power BI, Hex, Mode, embedded apps, and LLM agents.

PipeCode blog header for a semantic layer comparison — bold white headline 'Semantic Layer Showdown' with subtitle 'cube · dbt semantic layer · lookml' and a stylised middle-tier diagram showing warehouse → semantic layer → BI / LLM consumers on a dark gradient with a small pipecode.ai attribution.

When you want hands-on reps the moment you finish reading, drill the aggregation practice library → for the measure / metric foundations, rehearse on joins problems → for the entity-resolution patterns that semantic layers automate, and stack the group-by drills → for the granularity reasoning every measure definition leans on.


On this page


1. What a semantic layer actually solves

The "every dashboard redefines active user" problem — and why a governed metric layer is the antidote

The mental model in one line: a semantic layer is a single place where business metrics, dimensions, and joins are defined as objects — the warehouse stays the storage, the BI tool stays the surface, and the metric definition lives in between as code that every consumer reads from. Without it, the same metric is re-invented in every dashboard, every SQL snippet, and every notebook — and the numbers diverge.

The four symptoms of a missing semantic layer.

  • Five definitions of "active user." Marketing counts a session-open; product counts a feature-event; finance counts a paid event; data science counts a 30-day retention bucket; the CEO sees a different number on every dashboard.
  • Joins re-written by every analyst. The orders to customers to regions join chain lives in a Tableau workbook, a Looker explore, a Hex notebook, and a one-off SQL snippet — four copies, four chances to drift.
  • Filters re-implemented per surface. "Exclude internal users" lives as a WHERE email NOT LIKE '%@acme.internal' in some places and a WHERE is_internal = FALSE in others. Numbers drift the day a new internal domain appears.
  • No diffable governance. When the metric definition lives in a BI workbook's hidden calculated field, you cannot review it in a pull request — and you cannot tell which definition matches the "official" one.

One metric definition, many consumers — the headless-BI premise.

  • A semantic layer publishes metric definitions as code. Cube ships them as YAML / JS cubes; dbt SL ships them as YAML semantic_models + metrics; LookML ships them as view / explore files.
  • The same definition resolves into a query when a consumer asks for it. The consumer never writes the join, the GROUP BY granularity, or the filter — the layer does.
  • Every consumer — Tableau, Power BI, Hex, Mode, embedded apps, and now LLM agents — sees the same number because they all read from the same definition.
  • This is the headless BI premise: a layer that is a query API but is not a visualisation tool. The viz layer becomes interchangeable.

Where a semantic layer sits.

  • Below it: the warehouse (Snowflake, BigQuery, Databricks, Redshift, Postgres) holds the marts produced by dbt or any ELT tool.
  • The semantic layer itself: Cube / dbt SL / LookML translate metric requests into warehouse SQL and (often) cache the results.
  • Above it: the BI / notebook / embedded / LLM surfaces fan out, each one calling a SQL, REST, or GraphQL endpoint that the layer exposes.

Dimensions, measures, metrics, and joins as first-class objects.

  • Dimensions are the columns you group by — region, signup_month, device_type.
  • Measures are the aggregations on a single table — count(distinct user_id), sum(amount). Measures are the LEGO bricks.
  • Metrics are the named, business-level expressions built from measures — WAU = count_distinct_users filtered to the last 7 days. Metrics are what the dashboard asks for.
  • Joins are declared once and re-used by every metric. The consumer never has to know that orders.region_id joins to regions.id.

Why this didn't take off until dbt + Cube made it cheap.

  • The pre-2020 attempts (Looker, Power BI Datasets, Microstrategy schemas) were bundled into a single BI tool — adopting them locked you into that vendor's viz layer.
  • Cube.dev (2019) decoupled the layer from the viz tool by exposing REST / GraphQL / SQL APIs — every BI tool, notebook, and embedded app could now consume the same definitions.
  • The dbt Semantic Layer (powered by MetricFlow, 2023) put metric definitions next to dbt models — the same git repo, the same review process, the same CI.
  • LLM agents (2024–2026) finally made the governance argument concrete: a model that hallucinates a join or a filter ships a confident wrong answer; one that calls the semantic layer ships a verifiable one.

Worked example — the "five definitions of active user" failure mode

Detailed explanation. A consumer-app company has five dashboards labelled "active users." Each was built independently by a different team using a different SQL definition. The CEO opens all five in the Monday meeting and sees five different numbers. This is the canonical symptom that pushes a team to adopt a semantic layer.

Question. Given a shared events table, write the five drifted definitions of "weekly active users" and show how a single semantic-layer definition collapses them into one number. Trace why each ad-hoc version returns a different count.

Input — events.

event_id user_id event_name event_ts platform
1 100 login 2026-06-08 10:00 web
2 100 search 2026-06-08 10:01 web
3 200 login 2026-06-09 12:00 mobile
4 300 login 2026-06-09 13:00 web
5 300 purchase 2026-06-09 13:05 web
6 400 login 2026-05-30 09:00 web
7 100 login 2026-06-09 09:00 mobile

Code.

-- Definition 1: marketing — anyone with any event in last 7 days
SELECT COUNT(DISTINCT user_id) AS wau_marketing
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '7 days';

-- Definition 2: product — logged in at least once
SELECT COUNT(DISTINCT user_id) AS wau_product
FROM events
WHERE event_name = 'login'
  AND event_ts >= CURRENT_DATE - INTERVAL '7 days';

-- Definition 3: finance — purchased in last 7 days
SELECT COUNT(DISTINCT user_id) AS wau_finance
FROM events
WHERE event_name = 'purchase'
  AND event_ts >= CURRENT_DATE - INTERVAL '7 days';

-- Definition 4: data science — at least 2 distinct event days
SELECT COUNT(DISTINCT user_id) AS wau_ds
FROM (
    SELECT user_id, DATE(event_ts) AS d
    FROM events
    WHERE event_ts >= CURRENT_DATE - INTERVAL '7 days'
    GROUP BY user_id, DATE(event_ts)
) t
GROUP BY user_id
HAVING COUNT(*) >= 2;

-- Definition 5: mobile-only WAU
SELECT COUNT(DISTINCT user_id) AS wau_mobile
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '7 days'
  AND platform = 'mobile';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Each query asks "weekly active users" but folds in a different qualifying event. None is wrong — they answer different business questions — but they are all labelled identically on five dashboards.
  2. Marketing counts every distinct user with any event = users 100, 200, 300 → 3.
  3. Product counts users with a login = users 100, 200, 300 → 3 (coincidentally the same here, drifts on other weeks).
  4. Finance counts users with a purchase = user 300 → 1.
  5. Data-science counts users active on two or more days = user 100 (active 2026-06-08 and 2026-06-09) → 1.
  6. Mobile-only WAU restricts to platform = 'mobile' → users 100 and 200 → 2.
  7. A semantic layer collapses this by publishing one named metric per business question: weekly_active_users (default — any event), weekly_logged_in_users, weekly_purchasers, weekly_engaged_users (2+ days), weekly_active_mobile. Each dashboard asks for the named metric, not for an ad-hoc SQL string.

Output.

Metric label Count
wau_marketing (any event) 3
wau_product (login) 3
wau_finance (purchase) 1
wau_ds (2+ days active) 1
wau_mobile (mobile-only) 2

Rule of thumb. If five dashboards labelled the same metric show five different numbers, the fix is not "agree on the definition once." It is "move the definition into a semantic layer so the next dashboard reads from the same file." Without the layer the next disagreement is one PR away.

Worked example — joins re-written by every analyst

Detailed explanation. A multi-table join chain (orders → customers → regions → tax tables) lives as boilerplate at the top of every Looker workbook, every Hex notebook, and every ad-hoc SQL snippet. The chain is re-typed every time. A semantic layer declares the joins once and lets every metric reference them by entity name.

Question. Given orders, customers, and regions, show the boilerplate every analyst types, then show how a semantic-layer entity declaration removes the join from the query surface entirely.

Input — schema.

table columns
orders order_id, customer_id, amount, order_date
customers customer_id, name, region_id
regions region_id, region_name, country_code

Code.

-- Every analyst types this join chain — by hand — once per query
SELECT r.region_name,
       SUM(o.amount) AS total_revenue
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN regions   r ON c.region_id   = r.region_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY r.region_name;
Enter fullscreen mode Exit fullscreen mode
# Semantic-layer style — declare the joins once, query the metric
# Cube.dev model (simplified)
cubes:
  - name: Orders
    sql_table: orders
    joins:
      - name: Customers
        relationship: many_to_one
        sql: "{Orders}.customer_id = {Customers}.customer_id"
    measures:
      total_revenue:
        sql: amount
        type: sum
    dimensions:
      order_date:
        sql: order_date
        type: time
  - name: Customers
    sql_table: customers
    joins:
      - name: Regions
        relationship: many_to_one
        sql: "{Customers}.region_id = {Regions}.region_id"
  - name: Regions
    sql_table: regions
    dimensions:
      region_name:
        sql: region_name
        type: string
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The ad-hoc SQL re-types INNER JOIN customers ... INNER JOIN regions ... every time. Five analysts, five copies. The day a new tax-region table is added, all five copies have to be edited.
  2. The Cube schema declares the same joins as relationship: many_to_one lines, one time, per cube. The semantic layer now knows how to traverse from Orders to Regions whenever a query mentions a column from both cubes.
  3. The consumer query becomes: "give me Orders.total_revenue grouped by Regions.region_name, filtered to the last 30 days." Cube generates the join chain on the fly — and it is the same chain every time.
  4. The dbt SL equivalent uses entities and relationships on each semantic_model; LookML uses joins declared inside each explore. The shape is the same: declare once, traverse forever.

Output.

region_name total_revenue (semantic) total_revenue (ad-hoc, copy 1) total_revenue (ad-hoc, copy 2)
EU 12,400 12,400 12,400
US 9,800 9,800 9,800
APAC 6,200 6,200 6,200

The numbers match — but only because every ad-hoc copy happens to be in sync today. The semantic-layer version is in sync by construction.

Rule of thumb. Any join chain that appears in three or more queries should live as a declared entity / join in the semantic layer. Re-typing the same INNER JOIN ... ON ... block is the analytics-engineering equivalent of cargo-cult code copy-paste.

Worked example — the headless-BI contract

Detailed explanation. "Headless BI" is the marketing term; the engineering contract behind it is "expose a query API that any front end can consume." A semantic layer that ships REST, GraphQL, and SQL endpoints can serve a Tableau dashboard, a React embedded chart, and a Slack-bot LLM agent from the same metric file. The metric author writes once; the surfaces fan out.

Question. Show how the same Weekly Active Users metric is requested over (1) a SQL endpoint, (2) a REST endpoint, and (3) a GraphQL endpoint — and explain why the output is byte-for-byte the same number on every surface.

Input. A semantic layer that has published a metric named weekly_active_users over the Events cube.

Code.

-- 1. SQL endpoint (Cube SQL API, dbt SL JDBC, Looker SQL Runner)
SELECT MEASURE(weekly_active_users)
FROM events
WHERE event_date >= DATE_TRUNC('week', CURRENT_DATE);
Enter fullscreen mode Exit fullscreen mode
# 2. REST endpoint (Cube REST, Looker API, dbt Cloud Semantic Layer API)
curl -s -H "Authorization: $TOKEN" \
  "https://semantic.example.com/cubejs-api/v1/load" \
  --data '{
    "query": {
      "measures": ["Events.weekly_active_users"],
      "timeDimensions": [{
        "dimension": "Events.event_date",
        "granularity": "week",
        "dateRange": "this week"
      }]
    }
  }'
Enter fullscreen mode Exit fullscreen mode
# 3. GraphQL endpoint (Cube GraphQL API)
query {
  cube(where: { eventDate: { inDateRange: "this week" } }) {
    events {
      weeklyActiveUsers
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Each endpoint translates the request into the same underlying warehouse SQL. The SQL endpoint is the most direct; REST and GraphQL wrap the request in JSON.
  2. The metric definition for weekly_active_users lives in one file — events.yml (Cube) or events.sql + metrics.yml (dbt SL) or events.view.lkml + events.model.lkml (LookML). Every endpoint reads from that file.
  3. The semantic layer caches the result of the compiled SQL. If three surfaces ask the same question within the cache window, the warehouse runs the query once and the layer serves the cached answer thrice.
  4. Authorization is enforced at the layer — the user's JWT or session is the same across endpoints, and row-level-security rules in the layer rewrite the SQL before it hits the warehouse.

Output (one number, three surfaces).

Surface Endpoint Response field Value
Tableau dashboard SQL MEASURE(weekly_active_users) 18,432
React embedded chart REST data[0].Events.weekly_active_users 18,432
Slack LLM agent GraphQL cube[0].events.weeklyActiveUsers 18,432

Rule of thumb. If your team needs the same metric on more than one surface, the cheapest path to consistency is the semantic layer — not a metrics_macros.sql file shared across BI tools, not a "single source of truth" doc, not a Slack thread. Code, one file, three endpoints.

Semantic layer interview question on metric governance

A senior analytics-engineering interviewer often opens with: "Walk me through how you'd give a CEO confidence that the 'active users' number on the executive dashboard is the same one the ML team trains on, the embedded customer-portal chart renders, and the Slack LLM agent quotes back when asked."

Solution Using a layered semantic-layer governance pattern

# 1. A single metric definition (Cube / dbt SL / LookML pseudocode)
metric: weekly_active_users
description: Distinct users with any event in the trailing 7 days.
type: count_distinct
target_field: user_id
filter: event_ts >= dateadd('day', -7, current_date)
owner: data-platform-team
review: PR-required
sla: <30s p95
cache_window: 1h

# 2. The metric is exposed on three endpoints (SQL / REST / GraphQL).
# 3. Three consumers register subscriptions to the metric:
consumers:
  - executive_dashboard (Looker / Tableau, SQL endpoint)
  - embedded_customer_portal (REST endpoint, RLS on tenant_id)
  - slack_llm_agent (GraphQL endpoint, RLS on slack_user.email)
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Action Outcome
1 Analytics engineer opens PR editing weekly_active_users.yml CI runs metric tests + freshness check
2 Reviewer approves; merge to main Semantic layer redeploys metric definition
3 Executive dashboard auto-reloads (cache invalidated) New number visible in <60s
4 Embedded chart polls REST endpoint Same new number, same cache key
5 LLM agent grounds prompt with semantic-layer schema Returns the new definition + new number, citing the layer
6 Auditor diffs git log for weekly_active_users.yml One source of truth, one commit history

Output:

Surface Definition version Value Cache state
Executive dashboard v17 (post-merge) 18,432 hot
Embedded portal v17 18,432 hot
LLM Slack agent v17 18,432 hot
Auditor PR-log v1 → v17 full diff visible n/a

Why this works — concept by concept:

  • Single source of truth — the metric file in version control is the only place the definition exists. No duplicate calculated fields, no hidden Tableau formulas, no LLM hallucinations of joins.
  • PR-required review — metric changes flow through the same code review as any other code change. The "five-dashboards-disagree" failure mode cannot recur because there is no other place to edit the definition.
  • Cache invalidation on deploy — the semantic layer flushes its cache for the affected metric the moment the new definition lands. Surfaces converge to the new number within the cache-refresh window.
  • RLS at the layer — row-level-security predicates live in the metric definition. The embedded portal automatically scopes to the tenant; the Slack agent automatically scopes to the asking user. The consumer code carries no security logic.
  • LLM grounding — the agent calls the semantic layer instead of generating SQL from scratch. Hallucinated joins become impossible because the layer publishes the schema (cubes, measures, dimensions, joins) as the agent's tool surface.
  • Cost — one warehouse query per cache window (often 1 hour), regardless of how many surfaces poll. The semantic layer is a strict cost reducer relative to N independent BI tools each running their own SQL.

SQL
Topic — aggregation
Measure and metric aggregation problems (SQL)

Practice →


2. The role of the semantic layer in a modern stack

Below the warehouse, beside the BI tool, above the consumer — and right where LLM agents finally need it

The mental model in one line: the semantic layer sits between the warehouse marts and every consumer (BI tool, notebook, embedded app, or LLM agent), translating a metric request into governed warehouse SQL with caching, row-level security, and access control along the way. Get the placement right and every downstream surface becomes interchangeable.

Three-tier stack with bottom tier 'Warehouse + dbt marts', middle tier 'Semantic layer' (highlighted with a glowing band), top tier 'BI · notebooks · embedded · LLM agents'; thin glowing arrows flow upward through all tiers and a small caching ring orbits the middle tier, on a light PipeCode card.

The three tiers in one paragraph.

  • Bottom — Warehouse + dbt marts. Snowflake, BigQuery, Databricks, Redshift, or Postgres holding the cleaned, tested, joined tables. dbt models do the row-level transforms — staging → marts. The semantic layer reads from the marts; it does not own them.
  • Middle — Semantic layer. Cube / dbt SL / LookML hold the metric definitions, dimension hierarchies, joins, and security rules. The layer compiles requests to SQL and (often) caches results.
  • Top — Consumers. Tableau, Power BI, Hex, Mode, Sigma, Streamlit, custom React dashboards, embedded analytics, and now LLM agents. Each consumer speaks SQL, REST, or GraphQL — and gets the same metric value.

Caching, query optimisation, and query rewriting.

  • Pre-aggregations (Cube). Cube can materialise a roll-up table — e.g. "daily active users by region by platform" — and route the incoming request to the pre-agg when the granularity matches. Sub-second queries against trillion-row fact tables.
  • dbt SL caching. The dbt Semantic Layer (dbt Cloud) ships a query cache keyed by metric + filters + granularity. Repeat requests within the TTL hit the cache, not the warehouse.
  • Looker PDTs (persistent derived tables). LookML's derived_table blocks can be persisted on a schedule, turning expensive transforms into a pre-computed warehouse table. The explore reads from the PDT, not from the live mart.
  • Query rewriting. Modern semantic layers detect when a roll-up table can answer the query and rewrite the SQL transparently — the consumer never knows the query plan changed.

Governed metrics vs ad-hoc SQL — the contract boundary.

  • Inside the layer: governed, named metrics with descriptions, owners, SLAs, and PR-review history.
  • Outside the layer: ad-hoc SQL still works (and is sometimes the right answer for exploratory analysis), but it is not the dashboard's source.
  • The boundary is enforced operationally: BI tools and embedded apps point only at the semantic layer's endpoint. Ad-hoc SQL is a separate Snowflake / BigQuery role that does not feed any production dashboard.

Multi-tenant security and row-level access.

  • Row-level security (RLS). A B2B SaaS company has 5,000 tenants. The semantic layer rewrites every query to add WHERE tenant_id = :current_tenant. Consumers cannot bypass it because they never write the SQL.
  • Column masking. Salary or PII columns can be selectively masked at the layer based on the caller's role.
  • Tenant isolation across consumers. The same metric file works for the internal dashboard and the customer-facing embedded chart, because the security predicate is bound to the request context, not the query.
  • Auth pass-through. The layer accepts a JWT (or OAuth token) and resolves the user's permissions to row predicates at query time. No "service account that sees everything" pattern.

Why LLM agents finally make semantic layers strategic.

  • A general-purpose LLM that sees a raw warehouse schema guesses joins and filters — and ships confident wrong answers. A semantic layer publishes the governed schema (cubes, measures, dimensions, joins, allowed filters) as the agent's tool surface.
  • Grounded queries become deterministic — the same prompt yields the same SQL because the agent is constrained to the cubes that exist.
  • Audit trails become possible — every agent call resolves to a named metric, not to an opaque generated SQL string. Compliance and finance teams can review what the agent asked for.
  • The semantic layer is now the interface the LLM agent uses, not a competing surface. This is the inversion that pushed every major BI vendor to ship a semantic-layer story in 2024–2026.

Worked example — without vs with a semantic layer, side by side

Detailed explanation. Two stacks ship the same Weekly Active Users number to a Tableau dashboard. The "without" stack has Tableau pointing directly at the Snowflake mart with a workbook-local calculated field. The "with" stack has Tableau pointing at the semantic layer's SQL endpoint and asking for the named metric.

Question. Compare the two architectures end-to-end. Show where the metric definition lives, who can edit it, and what happens when a second consumer (a Hex notebook) is added.

Input — comparison table.

Concern Without semantic layer With semantic layer
Where metric SQL lives Tableau workbook calculated field weekly_active_users.yml in git
Who can edit it Anyone with Tableau workbook access PR-required code review
How a second consumer adopts it Copy-paste the SQL into Hex Point Hex at the semantic-layer SQL endpoint
What happens when underlying schema changes Tableau workbook breaks silently CI catches the break in the next PR
RLS / multi-tenant security Per-workbook plumbing Declared once at the layer
LLM agent integration Hallucinated SQL Tool calls to named metrics

Code (without).

-- Tableau workbook calculated field — not in version control
COUNT_DISTINCT(
  IIF(
    DATE(event_ts) >= TODAY() - 7
    AND event_name IN ('login','search','purchase'),
    user_id,
    NULL
  )
)
Enter fullscreen mode Exit fullscreen mode

Code (with).

# weekly_active_users.yml — single source of truth, in git
- name: weekly_active_users
  description: Distinct users with any event in the trailing 7 days.
  type: count_distinct
  target_field: user_id
  semantic_model: events
  filters:
    - "{{ Dimension('events__event_ts') }} >= dateadd('day', -7, current_date)"
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. In the "without" world, the metric is locked inside a Tableau workbook. The Hex team copies the SQL because there is no API to call. Now two definitions exist; they drift the day someone changes the filter on one side.
  2. In the "with" world, the metric lives in a YAML file. Tableau queries the semantic layer's SQL endpoint with SELECT MEASURE(weekly_active_users) FROM events. Hex points at the same endpoint. Both surfaces see the same number by construction.
  3. When the schema changes (e.g. event_ts renamed to event_timestamp), the dbt CI / Cube CI breaks on the next PR — the change is caught before any dashboard sees a stale number.
  4. RLS in the "without" world is a per-workbook setting; in the "with" world it is a layer-level rule that applies uniformly.

Output. Two stacks ship the same number today — but the "with" stack ships the same number on Tuesday at 4pm when a new consumer onboards. The "without" stack ships a divergent number the same week.

Rule of thumb. Adopt the semantic layer the moment you have a second consumer of the same metric. One dashboard can live with an ad-hoc definition; two dashboards cannot — they will drift, and the cost of drift is one quarterly business review with conflicting numbers.

Worked example — caching: pre-aggregations, dbt SL cache, and Looker PDTs

Detailed explanation. Each platform's caching story differs. Cube's pre-aggregations roll up to a smaller table; dbt SL caches the result of a metric request; Looker PDTs persist a derived table on a schedule. All three reduce warehouse load — but they hit different layers.

Question. Given a fact table with 10 billion rows and a dashboard that asks "DAU by region by day for the last 90 days," design a caching strategy for each platform. Show the storage / freshness trade-off.

Input.

Asset Detail
Fact table events — 10B rows, ~3M new/day
Dashboard DAU by region by day, 90-day window
SLA <2s p95
Freshness <1 hour stale acceptable

Code.

# Cube — pre-aggregation
cubes:
  - name: Events
    pre_aggregations:
      dau_by_region_by_day:
        measures:
          - daily_active_users
        dimensions:
          - region
        time_dimension: event_date
        granularity: day
        partition_granularity: month
        refresh_key:
          every: 1 hour
Enter fullscreen mode Exit fullscreen mode
# dbt SL — cache settings (dbt Cloud)
saved_queries:
  - name: dau_by_region_by_day
    query_params:
      metrics: [daily_active_users]
      group_by:
        - Dimension('events__region')
        - TimeDimension('events__event_date', 'day')
      where:
        - "{{ TimeDimension('events__event_date') }} >= dateadd('day', -90, current_date)"
    cache:
      ttl: 3600  # seconds
Enter fullscreen mode Exit fullscreen mode
# Looker — PDT
view: events_dau_pdt {
  derived_table: {
    sql:
      SELECT DATE(event_ts) AS event_date,
             region,
             COUNT(DISTINCT user_id) AS dau
      FROM events
      WHERE event_ts >= CURRENT_DATE - INTERVAL '90 DAY'
      GROUP BY 1, 2 ;;
    sql_trigger_value: SELECT DATE_TRUNC('hour', CURRENT_TIMESTAMP) ;;
  }
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The Cube pre-agg materialises a roll-up table (events__dau_by_region_by_day) partitioned by month and refreshed hourly. The dashboard's SQL hits this table — ~90 rows × number-of-regions, not 10B rows. Sub-second.
  2. The dbt SL cache stores the result of the saved query in dbt Cloud's cache. Repeated requests with the same parameters get the cached row set. Cache misses re-run against the warehouse mart.
  3. Looker's PDT persists a derived table in the warehouse. The sql_trigger_value rebuilds the PDT every hour. Every explore that references this view reads from the PDT.
  4. All three trade some staleness for massive speed-up. The freshness SLA (<1 hour) fits each pattern. The storage cost is similar — one roll-up table per critical metric grain.

Output.

Platform Query latency Storage overhead Freshness Where the cache lives
Cube pre-agg <500ms 1 small table per pre-agg hourly refresh Warehouse + Cube metadata
dbt SL cache <2s (warm), seconds (cold) result set in dbt Cloud TTL-based dbt Cloud
Looker PDT <1s 1 derived table per PDT hourly trigger Warehouse

Rule of thumb. Pre-compute when the query pattern is predictable and high-volume; let the cache TTL absorb the long tail. Picking the wrong layer (e.g. caching the result for an exploratory cube where every consumer asks a different grain) wastes the cache hit ratio.

Worked example — multi-tenant row-level security at the semantic layer

Detailed explanation. A B2B SaaS analytics product needs to scope every chart to the calling tenant. Without a semantic layer, every BI tool re-implements the WHERE tenant_id = ? predicate — and the day someone forgets it, tenant A sees tenant B's revenue. With a semantic layer, the predicate lives in the cube definition and is enforced for every consumer.

Question. Show the RLS rule for the same metric in Cube, dbt SL, and LookML. Trace how a single user's JWT routes to the right tenant scope.

Input. A multi-tenant orders table with a tenant_id column. The user alice@tenantA.com has a JWT with tenant_id = "tenant_A".

Code.

# Cube — query rewrite
cubes:
  - name: Orders
    sql_table: orders
    public: true
    # security context injected from JWT claim
    sql: SELECT * FROM orders WHERE tenant_id = '{COMPILE_CONTEXT.securityContext.tenant_id}'
    measures:
      total_revenue:
        sql: amount
        type: sum
Enter fullscreen mode Exit fullscreen mode
# dbt SL — runtime filter via dbt Cloud security context
semantic_models:
  - name: orders
    model: ref('fct_orders')
    entities:
      - name: order_id
        type: primary
    defaults:
      agg_time_dimension: order_date
    # MetricFlow respects access controls declared in dbt_project.yml
metrics:
  - name: total_revenue
    type: simple
    type_params:
      measure: revenue_sum
    filter: "{{ Dimension('orders__tenant_id') }} = '{{ session.tenant_id }}'"
Enter fullscreen mode Exit fullscreen mode
# Looker — access_filter
explore: orders {
  access_filter: {
    field: orders.tenant_id
    user_attribute: tenant_id
  }
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Alice opens her tenant-A dashboard. Her JWT carries tenant_id = "tenant_A".
  2. The semantic layer extracts the claim into its security context.
  3. The cube / semantic_model / explore wraps every emitted SQL with WHERE tenant_id = 'tenant_A'. There is no path to send a query that skips the predicate.
  4. The warehouse executes the scoped query; only tenant-A rows are returned. The metric total_revenue is computed over tenant-A rows only.
  5. When Bob from tenant B opens the same dashboard URL, his JWT carries tenant_id = "tenant_B". The same metric definition resolves to a different SQL — and a different number — without any code change.

Output.

User tenant_id claim SQL emitted total_revenue
Alice (tenant A) tenant_A WHERE tenant_id = 'tenant_A' 412,300
Bob (tenant B) tenant_B WHERE tenant_id = 'tenant_B' 158,900
Internal data team (no tenant) denied / explicit override n/a

Rule of thumb. If your product is multi-tenant, the semantic layer is the only place tenant isolation belongs. Every other location (BI workbook filter, embedded chart query, app-side SDK call) is a cross-tenant data-leak waiting to ship.

Semantic layer interview question on caching and security

A senior interviewer might frame it as: "Your CEO dashboard polls 'DAU by region by day' every minute and runs into Snowflake credit overspend. The same dashboard is also embedded into a customer-facing portal where 200 tenants need their own scoped numbers. Design the semantic-layer caching and RLS strategy."

Solution Using pre-aggregations + tenant-scoped cache keys

cubes:
  - name: Events
    sql: |
      SELECT * FROM events
      WHERE tenant_id = '{COMPILE_CONTEXT.securityContext.tenant_id}'
    measures:
      dau:
        type: count_distinct
        sql: user_id
    dimensions:
      event_date:
        type: time
        sql: event_ts
      region:
        type: string
        sql: region
    pre_aggregations:
      tenant_dau_region_day:
        measures: [dau]
        dimensions: [region]
        time_dimension: event_date
        granularity: day
        partition_granularity: month
        refresh_key:
          every: 5 minute
        # tenant_id is added to the cache key automatically because
        # it appears in the cube SQL via COMPILE_CONTEXT
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Caller Cache key contains tenant_id? Outcome
1 Internal CEO dashboard polls yes (internal tenant) hits warm pre-agg, 200ms
2 Tenant A's portal polls yes (tenant_A) hits warm pre-agg for tenant A, 250ms
3 Tenant B's portal polls yes (tenant_B) hits warm pre-agg for tenant B, 250ms
4 Refresh tick (every 5 min) per-tenant rebuild one Snowflake query per active tenant
5 Cold tenant (no traffic for 1 day) pre-agg expires next request rebuilds — 2s once
6 Audit query for credit usage total Snowflake spend = N tenants × 12 queries/hour ~95% reduction vs naive

Output:

Metric Before After
Snowflake credits / day 240 14
p95 latency (CEO dashboard) 6.2s 0.21s
p95 latency (tenant portal) 4.8s 0.25s
Cross-tenant leak risk high (per-app plumbing) zero (layer-enforced)

Why this works — concept by concept:

  • Pre-aggregations as the cost killer — Snowflake credits scale with scanned bytes, not with queries served. Routing every poll to a roll-up table of ~10K rows instead of 10B drops the bytes scanned by 6 orders of magnitude.
  • Per-tenant cache key — the tenant_id appears in the cube SQL via COMPILE_CONTEXT, so Cube partitions the pre-agg storage by tenant. Tenant A's data is physically in a different row set from tenant B's.
  • Refresh granularity matches SLA — a 5-minute refresh is "fresh enough" for product analytics. Tightening to 1 minute would re-spend Snowflake credits; loosening to 1 hour would break the freshness contract.
  • Cold-tenant elasticity — pre-aggs for inactive tenants expire and only get rebuilt on demand. Pay for what you query.
  • Single security predicate — the WHERE tenant_id = ... line lives once. Every consumer (CEO dashboard, embedded portal, LLM agent) inherits it without writing tenant logic.
  • Cost — O(active_tenants × refresh_ticks) warehouse queries per day, each scanning O(1 day of events). Independent of consumer poll rate.

SQL
Topic — group-by
GROUP BY and granularity problems (SQL)

Practice →


3. The three platforms compared

Cube vs dbt Semantic Layer vs LookML — the scoring rubric every analytics-engineering lead should keep at hand

The mental model in one line: Cube is the standalone OSS engine with the widest BI fan-out, the dbt Semantic Layer is the dbt-native option that piggybacks on the model layer you already own, and LookML is the original — tightly coupled to Looker as the consumer. The right pick is mostly a function of which consumers your team must serve, not the model file's syntax.

Three side-by-side product cards labelled Cube.dev, dbt Semantic Layer, and Looker LookML, each card showing 3-4 strength badges and a tiny architecture sketch — Cube has REST/GraphQL/SQL pills, dbt SL has a 'next to dbt models' badge, LookML has a Looker-only badge, on a light PipeCode card.

Cube.dev (formerly Cube.js) — the standalone OSS engine.

  • What it is. An open-source semantic engine, written in Node.js + Rust, that publishes metric definitions over REST, GraphQL, and SQL APIs. Started as Cube.js in 2019, rebranded to Cube.dev.
  • Strengths. Widest BI fan-out (any tool that speaks SQL or HTTP can consume it). Pre-aggregations are best-in-class. Self-hostable via Docker; managed via Cube Cloud. Excellent for embedded analytics and LLM agents because of the REST / GraphQL surfaces.
  • Tradeoffs. Maintains its own model files (cubes) — duplication if you already model in dbt. The OSS edition lacks some governance features that ship in Cube Cloud (lineage, RBAC UI, query history).
  • Data model. cubemeasures, dimensions, joins, segments, pre_aggregations. A cube maps roughly to a fact / dim table.

dbt Semantic Layer (powered by MetricFlow) — the dbt-native option.

  • What it is. Metric definitions that live next to dbt models as semantic_model and metric YAML, compiled by MetricFlow into SQL. Available in dbt Cloud (managed) and dbt Core (CLI / open-source).
  • Strengths. Lives in your existing dbt repo — same PR review, same CI, same lineage. Best-in-class time-spine and cumulative metrics. Direct integration with Tableau, Hex, Mode, Power BI, Sigma, and Lightdash via the dbt SL JDBC connector.
  • Tradeoffs. The premium hosted Semantic Layer is gated to dbt Cloud Team / Enterprise plans (dbt Core has MetricFlow but not the cached server). Smaller embedded / API surface than Cube. AI / agent fan-out is improving but lags Cube's GraphQL story.
  • Data model. semantic_modelentities, dimensions, measures; metric → simple / ratio / derived / cumulative. Entities are the primary / foreign key declarations that drive joins.

Looker LookML — the original.

  • What it is. The semantic modelling language that ships inside Looker. Mature since 2014; the reference implementation of "metric definitions live next to the BI tool."
  • Strengths. Mature governance (Git-integrated workspace, content validation, IDE). Persistent derived tables (PDTs) are battle-tested. Deep integration with Looker's explore experience.
  • Tradeoffs. Tightly coupled to Looker as the consumer. Other BI tools cannot natively consume LookML — you would expose Looker's SQL Runner or pipe via API. The license cost scales with Looker user seats, which can dominate the BI budget.
  • Data model. viewdimensions, measures, filters; explorejoins; model → packages explores together.

Scoring rubric — five axes that decide the pick.

Axis Cube.dev dbt Semantic Layer LookML
Openness (consumer fan-out) ★★★★★ ★★★★ ★★
Time-to-value (if no existing model) ★★★ ★★★★ (if dbt already in place) ★★
Cost (TCO) $ — OSS, Cube Cloud paid $$ — dbt Cloud Team/Enterprise $$$ — per Looker seat
Embedded / LLM ★★★★★ ★★★
Governance & lineage ★★★★ (Cube Cloud) ★★★★★ (lives in dbt repo) ★★★★★

Common interview probes on platform choice.

  • "When would you pick Cube over dbt SL?" — when the consumer mix is heavily embedded analytics, LLM agents, or non-dbt BI tools, and when pre-aggregations are the dominant cost saver.
  • "When would you pick dbt SL over Cube?" — when the team already lives in a dbt repo and the consumers are mostly Tableau / Power BI / Hex / Mode through the JDBC connector.
  • "When would you stay on LookML?" — when Looker is already the standard BI tool, the team values mature governance, and there is no near-term need to serve embedded or AI consumers.
  • "What is the migration cost LookML → dbt SL?" — typically rewriting view / explore files as semantic_models, plus carefully porting calculated fields and access filters. Usually staged metric by metric, not as a big-bang.

Worked example — score the three platforms for an embedded analytics SaaS

Detailed explanation. A B2B SaaS company sells an analytics product embedded in customer apps. The consumer mix is: 80% embedded React charts (REST/GraphQL), 15% internal Tableau dashboards, 5% an early LLM agent. They already use dbt for upstream models.

Question. Score the three platforms against this consumer mix and pick the primary semantic layer.

Input — weighted axes.

Axis Weight Cube dbt SL LookML
Embedded REST/GraphQL 40% 5 3 1
Internal BI (Tableau) 15% 4 5 3
LLM agent 5% 5 3 2
dbt integration 20% 3 5 2
Cost / OSS option 10% 5 3 1
Governance 10% 4 5 5

Code.

weights = {
    "embedded":   0.40,
    "internal_bi": 0.15,
    "llm":        0.05,
    "dbt":        0.20,
    "cost":       0.10,
    "governance": 0.10,
}
scores = {
    "Cube":   {"embedded":5, "internal_bi":4, "llm":5, "dbt":3, "cost":5, "governance":4},
    "dbt_SL": {"embedded":3, "internal_bi":5, "llm":3, "dbt":5, "cost":3, "governance":5},
    "LookML": {"embedded":1, "internal_bi":3, "llm":2, "dbt":2, "cost":1, "governance":5},
}
weighted = {p: round(sum(s[a]*weights[a] for a in weights), 2) for p, s in scores.items()}
print(weighted)
# {'Cube': 4.4, 'dbt_SL': 3.85, 'LookML': 2.0}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Embedded REST/GraphQL is the dominant axis (40%). Cube's REST + GraphQL APIs score a 5; dbt SL's JDBC + API score a 3; LookML's "no native embedded API" scores a 1.
  2. Internal Tableau is well served by all three but easiest to wire through dbt SL's JDBC.
  3. The LLM-agent axis favours Cube because of the GraphQL schema and the published meta endpoint that lists all cubes / dimensions / measures as a tool surface.
  4. dbt integration favours dbt SL — it lives in the same repo.
  5. The weighted scores collapse to ~4.4 for Cube, 3.85 for dbt SL, 2.0 for LookML — Cube wins this scenario.
  6. The team picks Cube as the primary semantic layer. dbt continues to own the model layer (staging / marts), and Cube reads from the dbt marts.

Output.

Platform Weighted score Decision
Cube.dev 4.40 primary
dbt SL 3.85 alternative if embedded shrinks
LookML 2.00 not a fit (no Looker consumer)

Rule of thumb. Score by consumer mix, not by syntax preference. The semantic layer's job is to serve consumers; the file format matters only to the small team of analytics engineers maintaining it.

Worked example — Cube data model in detail

Detailed explanation. A Cube schema is built from cube blocks. Each cube wraps a SQL table (or view), exposes measures and dimensions, declares joins to other cubes, and optionally defines segments (named filters) and pre_aggregations (materialised roll-ups).

Question. Translate a small star schema (orders, customers, regions) into a Cube schema with one revenue measure and one region dimension. Show the joins declared once and re-used.

Input — star schema.

Table Role Key columns
orders fact order_id, customer_id, amount, order_date
customers dim customer_id, region_id, name
regions dim region_id, region_name

Code.

cubes:
  - name: Orders
    sql_table: analytics.fct_orders
    joins:
      - name: Customers
        relationship: many_to_one
        sql: "{Orders}.customer_id = {Customers}.customer_id"
    measures:
      total_revenue:
        sql: amount
        type: sum
        format: currency
      order_count:
        type: count
    dimensions:
      order_id:
        sql: order_id
        type: number
        primary_key: true
      order_date:
        sql: order_date
        type: time
    segments:
      paid_orders:
        sql: "{CUBE}.status = 'paid'"

  - name: Customers
    sql_table: analytics.dim_customers
    joins:
      - name: Regions
        relationship: many_to_one
        sql: "{Customers}.region_id = {Regions}.region_id"
    dimensions:
      customer_id:
        sql: customer_id
        type: number
        primary_key: true
      name:
        sql: name
        type: string

  - name: Regions
    sql_table: analytics.dim_regions
    dimensions:
      region_id:
        sql: region_id
        type: number
        primary_key: true
      region_name:
        sql: region_name
        type: string
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Each cube maps to a warehouse table. Orders is the fact; Customers and Regions are dims.
  2. The joins block on Orders declares the join to Customers once. The joins block on Customers declares the join to Regions once. Cube composes the chain: a query for Orders.total_revenue grouped by Regions.region_name traverses both joins automatically.
  3. measures define aggregations on the cube's table. dimensions define group-by axes; primary_key: true marks the row identity for the cube.
  4. segments are named filters. A dashboard can ask "give me Orders.total_revenue for segment paid_orders" without re-typing the WHERE status = 'paid' predicate.
  5. Adding a new metric is one new measure block. Adding a new dim is one new dimension. Cubes scale linearly with metric count.

Output. A consumer can now ask: "Orders.total_revenue grouped by Regions.region_name filtered to Orders.paid_orders segment, last 30 days." The semantic layer composes the join chain and the segment filter — the consumer writes no SQL.

Rule of thumb. Declare every join once, every segment once, every measure once. The rule "don't repeat the SQL" is what makes the layer pay for itself within the first six metrics.

Worked example — dbt Semantic Layer data model in detail

Detailed explanation. A dbt Semantic Layer schema is built from semantic_models (which define entities, dimensions, and measures on a dbt model) and metrics (which express the business-level KPIs computed from those measures). MetricFlow turns metric requests into SQL.

Question. Re-express the same orders / customers / regions star schema as dbt Semantic Layer YAML. Show the entity-based joins and the simple metric for total revenue.

Input. dbt models fct_orders, dim_customers, dim_regions exist and have unique_key columns.

Code.

# models/semantic/orders.yml
semantic_models:
  - name: orders
    model: ref('fct_orders')
    defaults:
      agg_time_dimension: order_date
    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign
    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: revenue_sum
        agg: sum
        expr: amount

  - name: customers
    model: ref('dim_customers')
    entities:
      - name: customer_id
        type: primary
      - name: region_id
        type: foreign
    dimensions:
      - name: name
        type: categorical

  - name: regions
    model: ref('dim_regions')
    entities:
      - name: region_id
        type: primary
    dimensions:
      - name: region_name
        type: categorical

metrics:
  - name: total_revenue
    label: Total Revenue
    type: simple
    type_params:
      measure: revenue_sum
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Each semantic_model wraps an existing dbt model. entities declare the primary / foreign keys MetricFlow uses to auto-resolve joins.
  2. customer_id is primary in customers and foreign in orders — MetricFlow knows that orders joins to customers on customer_id without an explicit JOIN ... ON block.
  3. Similarly, region_id is primary in regions and foreign in customers — the chain orders → customers → regions is implicit.
  4. dimensions declare group-by axes; agg_time_dimension defaults the time grain for time-series queries.
  5. measures are the aggregation building blocks. A metric of type simple wraps a single measure into a named, dashboard-facing KPI.
  6. Other metric types compose more complex KPIs: ratio (numerator / denominator), derived (an arithmetic expression over other metrics), cumulative (rolling totals with date-spine support).

Output. A consumer can ask: "total_revenue grouped by region__region_name for the last 30 days." MetricFlow resolves the join chain via entities and emits SQL — no JOIN ... ON typed by the consumer.

Rule of thumb. dbt SL's entity model is most powerful when your dbt marts already follow Kimball-style conventions (one primary key per table, foreign keys named consistently). Greenfield dbt + dbt SL projects converge faster than retrofits.

Worked example — LookML data model in detail

Detailed explanation. A LookML schema is built from view files (mapping to tables) and explore files (declaring joins between views). Each view exposes dimensions and measures; each explore lists the join blocks. The Looker UI shells out queries against the explore.

Question. Re-express the same star schema in LookML — a view per table, an explore that joins them, a measure for total revenue.

Input. Looker connection to the same warehouse with fct_orders, dim_customers, dim_regions.

Code.

# views/orders.view.lkml
view: orders {
  sql_table_name: analytics.fct_orders ;;

  dimension: order_id {
    primary_key: yes
    type: number
    sql: ${TABLE}.order_id ;;
  }
  dimension_group: order {
    type: time
    timeframes: [date, week, month, quarter, year]
    sql: ${TABLE}.order_date ;;
  }
  dimension: customer_id {
    type: number
    sql: ${TABLE}.customer_id ;;
  }
  measure: total_revenue {
    type: sum
    sql: ${TABLE}.amount ;;
    value_format_name: usd
  }
  measure: order_count {
    type: count
  }
}

# views/customers.view.lkml
view: customers {
  sql_table_name: analytics.dim_customers ;;
  dimension: customer_id {
    primary_key: yes
    type: number
    sql: ${TABLE}.customer_id ;;
  }
  dimension: region_id {
    type: number
    sql: ${TABLE}.region_id ;;
  }
  dimension: name { type: string sql: ${TABLE}.name ;; }
}

# views/regions.view.lkml
view: regions {
  sql_table_name: analytics.dim_regions ;;
  dimension: region_id {
    primary_key: yes
    type: number
    sql: ${TABLE}.region_id ;;
  }
  dimension: region_name { type: string sql: ${TABLE}.region_name ;; }
}

# models/sales.model.lkml
explore: orders {
  join: customers {
    type: left_outer
    relationship: many_to_one
    sql_on: ${orders.customer_id} = ${customers.customer_id} ;;
  }
  join: regions {
    type: left_outer
    relationship: many_to_one
    sql_on: ${customers.region_id} = ${regions.region_id} ;;
  }
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Each view wraps a warehouse table. dimension and measure blocks declare the columns and aggregations.
  2. The explore: orders block declares the join chain. join: customers joins orders → customers; join: regions joins customers → regions. The relationship hint (many_to_one) lets Looker pick the right SQL form.
  3. The dimension_group shortcut auto-generates order_date, order_week, order_month, etc. — every common time-grain dimension for free.
  4. value_format_name: usd formats the measure as currency in the BI surface.
  5. A Looker user opens the orders explore, selects regions.region_name and orders.total_revenue, and Looker emits the join chain transparently.

Output. Same number, same governance, same star schema — expressed in LookML files inside a Git-integrated Looker workspace.

Rule of thumb. LookML's per-view structure is more verbose than YAML alternatives but reads beautifully in code review. If Looker is the only BI surface, the verbosity is offset by IDE features (autocomplete, content validation, LookML test runner). When the BI surface fans out beyond Looker, the verbosity becomes a tax.

Semantic layer interview question on platform selection

A senior analytics-engineering interviewer might ask: "You inherit a company on Looker with 200 dashboards. The CTO wants to add an embedded analytics product and an LLM agent within 12 months. Walk me through the semantic layer migration plan."

Solution Using a staged "10 top metrics" migration to Cube alongside Looker

# Stage 1: stand up Cube alongside Looker (Cube reads from the same warehouse marts).
# Stage 2: identify the 10 top metrics by Looker query volume.
# Stage 3: rewrite those 10 metrics in Cube (or dbt SL), reference dbt marts.
# Stage 4: point the new embedded product and LLM agent at Cube.
# Stage 5: dual-publish — Looker continues to serve the 200 dashboards;
#          Cube serves the new surfaces.
# Stage 6: as Looker dashboards retire or get rebuilt, port them to Cube one by one.

migration_plan:
  parallel_run_months: 12
  metric_priority:
    - weekly_active_users
    - daily_active_users
    - total_revenue
    - new_signups
    - churn_rate
    - retention_d7
    - retention_d30
    - average_order_value
    - conversion_rate
    - net_promoter_score
  consumers:
    - embedded_react_charts -> Cube REST
    - llm_slack_agent       -> Cube GraphQL
    - existing_looker       -> LookML (untouched)
    - new_internal_bi       -> Cube SQL API
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Stage Action Risk Mitigation
1 Cube deployed alongside Looker low both read same warehouse, no migration
2 Top-10 metrics catalogued from Looker query history low weighted by query volume + dashboard count
3 Top-10 rewritten in Cube YAML medium unit-test each metric against Looker output for 30 days
4 Embedded product + LLM agent ship on Cube medium new code path, no impact on Looker
5 Dual-publish, no Looker dashboards touched low full backwards compatibility
6 Looker dashboards ported as part of normal roadmap spread over months each port is one PR, reviewable

Output:

Surface Pre-migration Post-migration
Looker dashboards 200 200 (unchanged)
Embedded React charts 0 new on Cube
LLM Slack agent 0 new on Cube
New internal BI n/a on Cube SQL API
Source of truth metrics LookML (200 metric defs) Cube (10 top metrics) + LookML (long tail)

Why this works — concept by concept:

  • Parallel run, not big-bang — Looker and Cube co-exist for 12 months. No "stop-the-world" cutover. Risk distributed across the migration window.
  • Top-10 metric priority — analytics-engineering effort focuses on the metrics that power the new consumer surfaces. The long tail of 190 dashboards stays on Looker until a natural rebuild.
  • Embedded + LLM on Cube — the new surfaces are wired only to the new layer. Their existence does not depend on Looker uptime, license seats, or LookML rewrites.
  • Same warehouse marts — both layers read from the dbt-managed marts. The underlying data is one copy; the metric definitions are layered above.
  • Unit-test against Looker — for each ported metric, dual-run for 30 days and compare numbers daily. Drift > 0.5% blocks the migration of that metric.
  • Cost — incremental dual-platform cost during the 12-month run, offset by the new revenue streams (embedded product, LLM agent) that depend on Cube.

SQL
Topic — joins
JOIN problems for entity resolution (SQL)

Practice →


4. Defining a metric in each platform

One Weekly Active Users metric, three vocabularies — the side-by-side comparison every interviewer probes

The mental model in one line: the same Weekly Active Users metric is count_distinct user_id over the trailing 7 days, but each platform asks you to spell it differently — Cube's measure type: countDistinct, dbt SL's simple metric over a count_distinct measure, and LookML's measure type: count_distinct. Translating between the three is a syntax exercise once the semantics are clear.

Horizontal flow showing one metric 'Weekly Active Users' split into three parallel definition lanes — Cube cube, dbt semantic_model, LookML explore — each lane shows the same components (entity, dimensions, measures, joins) using brand-tinted pill chips, on a light PipeCode card.

The same metric, three vocabularies.

  • Cube. cube exposes a count_distinct measure on user_id. The "weekly" granularity is supplied by the consumer via the timeDimensions block, or pre-baked as a segment.
  • dbt SL. A semantic_model exposes a measure count_distinct(user_id). A metric of type simple wraps the measure and is queried with a granularity: week time dimension.
  • LookML. A view exposes a measure type: count_distinct. The dimension_group auto-generates a *_week dimension that the explore groups by.

Where joins are declared.

  • Cube. Per-cube joins block: relationship: many_to_one + sql: "{Cube}.k = {Other}.k".
  • dbt SL. Per-semantic_model entities: declare primary and foreign entities; MetricFlow infers the join.
  • LookML. Per-explore join: type: left_outer + sql_on: ${a.k} = ${b.k}.

Granularity, time dimensions, and date-spine handling.

  • Cube. Time dimensions support granularity: day | week | month | quarter | year. Pre-aggregations can be partitioned by month or week for cost control. Date-spine is handled implicitly by the consumer query.
  • dbt SL. Best-in-class time spine — declare a time_spine model once in semantic_models.yml, and MetricFlow uses it to fill in zero rows for missing dates in cumulative metrics. Granularities: second | minute | hour | day | week | month | quarter | year.
  • LookML. dimension_group auto-generates every common timeframe. PDTs can be partitioned by date. No native date-spine — analysts hand-roll a "calendar" view if needed.

Derived and ratio metrics.

  • Cube. Compose by referencing other measures in a derived measure's sql: e.g. conversion_rate: sql: "{purchases} * 1.0 / NULLIF({sessions}, 0)".
  • dbt SL. First-class metric types: ratio (numerator + denominator) and derived (expression over named metrics). The cleanest of the three for compound KPIs.
  • LookML. Compose with measure type: number and an expression referencing other measures: ${purchases} * 1.0 / NULLIF(${sessions}, 0).

Filters, segments, and parameter inputs.

  • Cube. segments are named filters reusable across queries. Templated parameters via { FILTER_PARAMS } for runtime injection.
  • dbt SL. filter blocks on metrics for static filters; where clauses on saved queries for runtime filters. Less templating, more declarative.
  • LookML. filter blocks on views and parameter blocks for runtime input. liquid template language for advanced rewrites.

Caching, materialisation, and roll-ups.

  • Cube. pre_aggregations are the headline feature — declare the roll-up grain and refresh schedule; Cube auto-routes matching queries.
  • dbt SL. dbt Cloud Semantic Layer cache; saved queries can be persisted as tables via dbt's materialized: table.
  • LookML. PDTs persisted on a schedule via sql_trigger_value or datagroup. Most mature of the three but warehouse-coupled.

Versioning a metric definition.

  • All three live in Git. Cube has meta.version per cube; dbt SL inherits dbt's version and defined_in semantics; LookML has Looker Workspaces (Git-backed branches) with content validation.

Worked example — Weekly Active Users in Cube

Detailed explanation. A Cube schema exposes a weekly_active_users measure as a count_distinct of user_id over a 7-day rolling window. The consumer queries it with a dateRange filter on the time dimension.

Question. Write the Cube definition for weekly_active_users and show the consumer query that returns the WAU per week for the trailing 12 weeks.

Input. An events table with user_id, event_ts, event_name.

Code.

cubes:
  - name: Events
    sql_table: analytics.fct_events
    measures:
      active_users:
        type: countDistinct
        sql: user_id
        title: Active users
    dimensions:
      event_ts:
        sql: event_ts
        type: time
      event_name:
        sql: event_name
        type: string
      user_id:
        sql: user_id
        type: number
    segments:
      logged_in:
        sql: "{CUBE}.event_name = 'login'"
    pre_aggregations:
      weekly_active_rollup:
        measures: [active_users]
        time_dimension: event_ts
        granularity: week
        partition_granularity: month
        refresh_key:
          every: 1 hour
Enter fullscreen mode Exit fullscreen mode
// Consumer query  REST/GraphQL/SQL all hit the same definition
{
  "measures": ["Events.active_users"],
  "timeDimensions": [{
    "dimension": "Events.event_ts",
    "granularity": "week",
    "dateRange": ["2026-03-22", "2026-06-14"]
  }]
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The active_users measure is countDistinct(user_id). Cube treats it as a measure usable at any granularity.
  2. The consumer asks for Events.active_users grouped by Events.event_ts at weekly granularity, over a 12-week date range.
  3. Cube checks pre-aggregations; the weekly_active_rollup is partitioned by month at weekly granularity, so it matches. The query reads the pre-agg, not the raw fact.
  4. The result is one row per week; "WAU" emerges naturally from count_distinct at the week grain. No bespoke "WAU" formula is needed because the measure + granularity combination is WAU.

Output (sample).

week active_users
2026-03-22 18,210
2026-03-29 18,405
2026-04-05 18,890
... ...
2026-06-14 19,432

Rule of thumb. In Cube, the same count_distinct measure can serve as DAU, WAU, MAU depending on the granularity the consumer asks for. Don't define three measures — define one and let the time dimension do the work.

Worked example — Weekly Active Users in dbt Semantic Layer

Detailed explanation. The dbt Semantic Layer defines the same metric as a simple metric over a count_distinct measure on the events semantic model. MetricFlow injects the granularity at query time.

Question. Write the dbt SL YAML for weekly_active_users and the MetricFlow CLI / Python call to return WAU per week for the trailing 12 weeks.

Input. A dbt model fct_events with user_id, event_ts, event_name. A time_spine model is configured.

Code.

# models/semantic/events.yml
semantic_models:
  - name: events
    model: ref('fct_events')
    defaults:
      agg_time_dimension: event_ts
    entities:
      - name: user_id
        type: foreign
      - name: event_id
        type: primary
    dimensions:
      - name: event_ts
        type: time
        type_params:
          time_granularity: day
      - name: event_name
        type: categorical
    measures:
      - name: distinct_users
        agg: count_distinct
        expr: user_id

metrics:
  - name: weekly_active_users
    label: Weekly Active Users
    type: simple
    type_params:
      measure: distinct_users
Enter fullscreen mode Exit fullscreen mode
# MetricFlow CLI consumer call
mf query \
  --metrics weekly_active_users \
  --group-by metric_time__week \
  --start-time 2026-03-22 \
  --end-time   2026-06-14
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. distinct_users is the measure (the count_distinct LEGO brick). weekly_active_users is the simple metric that wraps it as a dashboard-facing KPI.
  2. The metric_time__week group-by tells MetricFlow to aggregate the measure at the week grain using agg_time_dimension: event_ts.
  3. MetricFlow joins to the time spine (declared once in semantic_models.yml) so weeks with zero activity still appear as zero rows — not as missing rows.
  4. The query returns one row per week. Consumers can be Tableau (via JDBC), Hex (native dbt SL integration), or any tool that speaks the SL API.

Output (sample).

metric_time__week weekly_active_users
2026-03-22 18,210
2026-03-29 18,405
2026-04-05 18,890
... ...
2026-06-14 19,432

Rule of thumb. When the metric is "count something distinct at a time grain," reach for a simple metric over a count_distinct measure. Reserve ratio, derived, and cumulative for the metrics that genuinely need them.

Worked example — Weekly Active Users in LookML

Detailed explanation. A LookML view exposes a count_distinct measure on user_id. A dimension_group auto-generates event_week. The explore groups by event_week and pivots on active_users.

Question. Write the LookML view + explore for the same WAU metric and the Looker query (or SQL Runner equivalent) for the trailing 12 weeks.

Input. A Looker connection to the same fct_events warehouse table.

Code.

# views/events.view.lkml
view: events {
  sql_table_name: analytics.fct_events ;;

  dimension: event_id {
    primary_key: yes
    type: number
    sql: ${TABLE}.event_id ;;
  }
  dimension: user_id {
    type: number
    sql: ${TABLE}.user_id ;;
    hidden: yes
  }
  dimension_group: event {
    type: time
    timeframes: [date, week, month, quarter, year]
    sql: ${TABLE}.event_ts ;;
  }
  dimension: event_name {
    type: string
    sql: ${TABLE}.event_name ;;
  }
  measure: active_users {
    type: count_distinct
    sql: ${user_id} ;;
    label: "Active users"
  }
}

# models/events.model.lkml
explore: events {
  description: "Activity events fact for WAU/DAU/MAU"
}
Enter fullscreen mode Exit fullscreen mode
-- SQL emitted by Looker for the trailing 12-week WAU query
SELECT DATE_TRUNC('week', ${TABLE}.event_ts) AS event_week,
       COUNT(DISTINCT ${TABLE}.user_id)      AS active_users
FROM   analytics.fct_events AS events
WHERE  ${TABLE}.event_ts >= CURRENT_DATE - INTERVAL '84 day'
GROUP BY 1
ORDER BY 1;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The dimension_group: event auto-generates event_date, event_week, event_month, event_quarter, event_year. Selecting event_week in the Looker UI drives the DATE_TRUNC('week', ...) in the emitted SQL.
  2. The measure: active_users is count_distinct ${user_id}. Looker pairs it with the chosen time grain to produce DAU / WAU / MAU.
  3. The explore is intentionally minimal — events is a single-table fact, so no joins are needed for this metric. Joins to users, regions, etc. would be added in the same explore.
  4. The Looker UI generates the emitted SQL automatically. Power users can drop to SQL Runner; the explore is the typical surface.

Output (sample, identical to Cube / dbt SL).

event_week active_users
2026-03-22 18,210
2026-03-29 18,405
2026-04-05 18,890
... ...
2026-06-14 19,432

Rule of thumb. In LookML, the dimension_group is the productivity unlock. Define one time field, get every common grain for free. The verbosity tax is paid up front; the daily authoring tax is small.

Worked example — derived and ratio metrics across all three

Detailed explanation. A "conversion rate" metric is purchases / sessions. Each platform expresses it differently — Cube via a number-typed measure referencing two sum measures, dbt SL via a ratio metric, LookML via a number measure that references two count measures.

Question. Define conversion_rate = purchases / sessions in all three platforms. Show the safe-division pattern (NULLIF on the denominator).

Input. Cubes / semantic_models / views for sessions (with a purchases flag column).

Code.

# Cube — derived measure
cubes:
  - name: Sessions
    sql_table: analytics.fct_sessions
    measures:
      sessions_count:
        type: count
      purchases_count:
        type: count
        filters:
          - sql: "{CUBE}.purchase_flag = true"
      conversion_rate:
        type: number
        sql: "{purchases_count} * 1.0 / NULLIF({sessions_count}, 0)"
        format: percent
Enter fullscreen mode Exit fullscreen mode
# dbt SL — first-class ratio metric
metrics:
  - name: conversion_rate
    type: ratio
    type_params:
      numerator: purchases_count
      denominator: sessions_count
Enter fullscreen mode Exit fullscreen mode
# LookML — number measure with safe division
view: sessions {
  measure: sessions_count {
    type: count
  }
  measure: purchases_count {
    type: count
    filters: [purchase_flag: "yes"]
  }
  measure: conversion_rate {
    type: number
    sql: ${purchases_count} * 1.0 / NULLIF(${sessions_count}, 0) ;;
    value_format_name: percent_2
  }
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Cube and LookML express the ratio as a derived measure — purchases / sessions with NULLIF to protect against zero division.
  2. dbt SL provides a first-class ratio type. MetricFlow generates the NULLIF-style protection automatically and ensures both metrics are aggregated at the same granularity before the ratio is computed.
  3. All three return the same number for any given granularity. The dbt SL form is the most concise; the Cube and LookML forms make the safe-division explicit.
  4. Formatting (format: percent) lives in the layer so every consumer renders the metric the same way (e.g. 12.4%, not 0.124).

Output.

platform conversion_rate (this week)
Cube 12.4%
dbt SL 12.4%
LookML 12.4%

Rule of thumb. For first-order ratios, dbt SL's typed metric is the cleanest. For ratios with conditional filters on numerator and denominator (e.g. "conversion rate of paid users"), all three platforms let you wrap the measure in a filter — choose the one whose syntax your team is already fluent in.

Semantic layer interview question on cross-platform metric translation

A senior interviewer might frame: "We are migrating from LookML to dbt SL. Translate this LookML measure into the dbt SL equivalent: a count_distinct user_id filtered to purchase events, grouped at weekly granularity, with a 30-day rolling window option."

Solution Using a dbt SL simple metric plus a cumulative overlay

# models/semantic/events.yml
semantic_models:
  - name: events
    model: ref('fct_events')
    defaults:
      agg_time_dimension: event_ts
    entities:
      - name: event_id
        type: primary
      - name: user_id
        type: foreign
    dimensions:
      - name: event_ts
        type: time
        type_params: { time_granularity: day }
      - name: event_name
        type: categorical
    measures:
      - name: distinct_purchasers
        agg: count_distinct
        expr: user_id
        agg_time_dimension: event_ts
        # measure filter — only purchase events feed this measure
        filter: "{{ Dimension('events__event_name') }} = 'purchase'"

metrics:
  - name: weekly_purchasers
    label: Weekly purchasers
    type: simple
    type_params:
      measure: distinct_purchasers

  - name: purchasers_30d_rolling
    label: 30-day rolling purchasers
    type: cumulative
    type_params:
      measure: distinct_purchasers
      window: 30 days
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Action Notes
1 LookML measure count_distinct(user_id) filter purchase → dbt SL measure distinct_purchasers filter expressed in YAML, not Looker liquid
2 LookML event_week dimension → dbt SL metric_time__week group-by granularity injected at query time
3 New requirement: 30-day rolling window dbt SL cumulative metric with window: 30 days
4 Time spine joined automatically gaps filled with zero rows
5 Consumer queries: weekly_purchasers OR purchasers_30d_rolling one schema, two surfaces

Output:

metric_time__week weekly_purchasers purchasers_30d_rolling
2026-05-31 4,120 14,800
2026-06-07 4,395 15,210
2026-06-14 4,602 15,690

Why this works — concept by concept:

  • Measure-level filter — declaring the event_name = 'purchase' filter on the measure means every metric built from it inherits the filter. No risk that a downstream metric forgets the filter.
  • Metric vs measure separation — measures are LEGO bricks; metrics are the dashboard-facing KPIs. The same distinct_purchasers measure powers both the weekly and the rolling-30-day metric.
  • Cumulative metric type — MetricFlow handles the rolling-window math (joining each row to a 30-day lookback range) without the analyst hand-rolling a window function.
  • Time spine join — gaps in activity become zero rows. Dashboards don't render "missing" weeks; they render "zero" weeks, which is what executives expect.
  • Granularity-agnostic measuredistinct_purchasers works at any grain. Query at metric_time__day for DAU-style; query at metric_time__week for WAU-style; query at metric_time__month for MAU-style. One measure, three dashboards.
  • Cost — one warehouse pass per granularity per refresh; MetricFlow plans the SQL once and reuses across consumers.

SQL
Topic — case-expression
Conditional metric and CASE expression problems (SQL)

Practice →


5. Consumer fan-out — who queries the semantic layer

One query API, many surfaces — Tableau, Power BI, Hex, Mode, embedded apps, and LLM agents from the same metric file

The mental model in one line: a well-placed semantic layer is consumed by every analytics surface in your stack — BI tools through SQL or JDBC, notebooks through the SQL endpoint, embedded apps through REST or GraphQL, and LLM agents through the published schema as a tool surface. Each consumer gets the same metric definition, the same RLS predicates, and the same cache benefits.

Hub-and-spoke diagram with a central semantic-layer hub and six consumer satellites — Tableau, Power BI, Hex, Mode, Embedded app, LLM agent — each linked by a glowing spoke, plus a small ring around the hub labelled 'cache · RLS · auth', on a light PipeCode card.

The consumer map in one paragraph.

  • Heavy BI tools — Tableau, Power BI, Looker, Sigma, MicroStrategy — speak JDBC / ODBC or a vendor-specific connector.
  • Notebooks — Jupyter, Hex, Deepnote, Mode, Databricks notebooks — speak SQL via JDBC or a Python client.
  • Embedded analytics — React / Vue / Angular charts inside SaaS products — speak REST or GraphQL.
  • LLM agents — Slack bots, custom GPTs, Anthropic / OpenAI tools — speak the semantic layer's schema as a tool surface, then the layer emits SQL.

BI tools.

  • Looker. Native LookML consumer. The semantic layer is Looker for Looker.
  • Tableau. Reads dbt SL via the dbt Cloud connector. Reads Cube via the Cube SQL API. Reads LookML indirectly via Looker SQL Runner or by exporting an extract.
  • Power BI. Reads dbt SL via the dbt Cloud / Tableau-style connector. Reads Cube via SQL API.
  • Hex / Mode / Sigma. Best-in-class native dbt SL integrations; also speak Cube SQL API.

Notebooks.

  • Jupyter. A cube-jupyter-client Python package or a generic JDBC driver loads metric results into a DataFrame.
  • Hex. First-class dbt SL integration — drag a metric into a cell, get a DataFrame.
  • Deepnote. Speaks SQL over JDBC against the layer.
  • Databricks notebooks. Useful for ML feature engineering — read the metric, train on it.

Embedded analytics — Cube's GraphQL/REST surface.

  • REST. A POST to /cubejs-api/v1/load with a JSON query. Returns a JSON result with rows and metadata.
  • GraphQL. A cube(where: ...) { events { weeklyActiveUsers } } query against the published schema. Strongly typed.
  • JWT auth. Every embedded request carries a JWT signed by the host app. The layer extracts claims (tenant_id, user_id, role) and binds them to RLS predicates.
  • Front-end SDKs. @cubejs-client/react, @cubejs-client/vue give you <QueryRenderer> components that take a query and render charts.

AI/LLM agents — text-to-SQL using the semantic layer as the grounding context.

  • The agent receives the question: "What was the WAU last week broken down by region?"
  • The agent calls a tool that returns the published cubes / measures / dimensions / joins as a structured schema.
  • The agent constructs a semantic-layer query (not raw SQL): Events.active_users grouped by Regions.region_name, last week.
  • The layer compiles to SQL, runs against the warehouse, returns the result.
  • The agent renders the answer in natural language with the metric name and definition cited — auditable.

Cost-control: caching layers, materialised aggregates, query budgets.

  • Caching. Per-consumer or global; TTL or invalidation on metric deploy.
  • Materialised aggregates. Cube pre-aggs; dbt-materialised saved queries; Looker PDTs.
  • Query budgets. Per-tenant or per-consumer rate limits — the embedded React chart cannot accidentally DDoS the warehouse because the layer enforces a poll interval.

Authentication, RLS, tenant isolation across consumers.

  • Auth. Each consumer attaches a JWT or service-account token. The layer resolves the identity once.
  • RLS. Per-tenant or per-user predicates injected before the warehouse SQL is emitted.
  • Tenant isolation. Cache keys partition by tenant claim. No cross-tenant cache pollution.
  • Audit. Every query is logged with the calling identity, the metric requested, the SQL emitted, and the result row count.

Migration patterns.

  • LookML → dbt SL. Port the 10 top metrics first; keep Looker live for the long tail. Dual-run for 30 days per metric; reconcile numbers daily.
  • LookML → Cube. Same staged pattern; Cube becomes the primary for embedded + LLM, Looker keeps the existing dashboards.
  • Greenfield — pick by consumer mix. Mostly embedded + LLM → Cube. Already on dbt + mostly Tableau/Hex/Mode → dbt SL. Already on Looker with no near-term embedded plans → stay on LookML.

Worked example — wiring the same metric into Tableau, Hex, and a React app

Detailed explanation. Three consumers want the same weekly_active_users metric. Each speaks a different protocol. The semantic layer's published definitions are the same file; the wiring is per-consumer.

Question. Show the connection / query snippet for Tableau (JDBC), Hex (native SL integration), and a React embedded chart (REST).

Input. A semantic layer with weekly_active_users published. Endpoints: SQL at :13306, REST at /cubejs-api/v1/load, GraphQL at /cubejs-api/v1/graphql.

Code.

-- Tableau / JDBC consumer
SELECT MEASURE(weekly_active_users) AS wau,
       event_week
FROM events
GROUP BY event_week
ORDER BY event_week;
Enter fullscreen mode Exit fullscreen mode
# Hex notebook — dbt SL integration (Hex syntax)
import pandas as pd
from hex_sl import query

df = query(
    metrics=["weekly_active_users"],
    group_by=["metric_time__week"],
    start_time="2026-03-22",
    end_time="2026-06-14",
)
df.head()
Enter fullscreen mode Exit fullscreen mode
// React embedded chart — Cube REST via @cubejs-client/react
import { CubeProvider, useCubeQuery } from "@cubejs-client/react";
import { Line } from "react-chartjs-2";

function WauChart() {
  const { resultSet } = useCubeQuery({
    measures: ["Events.active_users"],
    timeDimensions: [{
      dimension: "Events.event_ts",
      granularity: "week",
      dateRange: ["2026-03-22", "2026-06-14"],
    }],
  });
  if (!resultSet) return <p>Loading…</p>;
  return <Line data={resultSet.chartPivot()} />;
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Tableau opens a JDBC connection to the layer. The query reads like normal SQL, but the MEASURE(weekly_active_users) syntax tells the layer to resolve the named metric (rather than treat weekly_active_users as a column).
  2. Hex's native integration takes Python-style arguments and translates them into the dbt SL query API. The result is a DataFrame ready for further analysis.
  3. The React component uses the @cubejs-client/react hook. The query object is the same JSON shape as a REST request; the component re-renders on result.
  4. All three consumers see the same number because they all read from the same metric definition on the layer.

Output (one number, three surfaces).

Consumer Protocol Code surface Returned (this week)
Tableau JDBC SQL SELECT MEASURE(weekly_active_users) 19,432
Hex SL Python query(metrics=[...]) 19,432
React chart REST/JS useCubeQuery(...) 19,432

Rule of thumb. Pick the consumer protocol that matches the host environment — JDBC for heavy BI, native SDK for first-class notebook integration, REST/GraphQL for embedded. The metric definition is invariant; only the wiring changes.

Worked example — RLS across a B2B SaaS embedded chart

Detailed explanation. A B2B SaaS product embeds a chart of "your active users this week" in every customer's tenant dashboard. The same chart code ships to 200 tenants; each tenant must see only their own number. RLS at the semantic layer makes the front-end code identical for every tenant.

Question. Show the JWT flow, the layer's RLS rewrite, and the resulting SQL for two different tenants.

Input. Two tenants: tenant_A (Alice) and tenant_B (Bob). Both load the same React embedded component.

Code.

// Host app generates a JWT with the tenant_id claim
const token = jwt.sign(
  { tenant_id: currentUser.tenant_id, sub: currentUser.id },
  process.env.CUBE_API_SECRET,
  { expiresIn: "10m" }
);

// Front-end attaches the token to every request
<CubeProvider
  cubeApi={cubejs(token, { apiUrl: "https://semantic.example.com/cubejs-api/v1" })}
>
  <WauChart />
</CubeProvider>
Enter fullscreen mode Exit fullscreen mode
# semantic layer — cube definition with RLS
cubes:
  - name: Events
    sql: |
      SELECT *
      FROM analytics.fct_events
      WHERE tenant_id = '{COMPILE_CONTEXT.securityContext.tenant_id}'
    measures:
      active_users:
        type: countDistinct
        sql: user_id
Enter fullscreen mode Exit fullscreen mode
-- SQL emitted for Alice (tenant_A)
SELECT COUNT(DISTINCT user_id) AS active_users
FROM   (SELECT * FROM analytics.fct_events WHERE tenant_id = 'tenant_A') events
WHERE  event_ts >= DATE_TRUNC('week', CURRENT_DATE);

-- SQL emitted for Bob (tenant_B)
SELECT COUNT(DISTINCT user_id) AS active_users
FROM   (SELECT * FROM analytics.fct_events WHERE tenant_id = 'tenant_B') events
WHERE  event_ts >= DATE_TRUNC('week', CURRENT_DATE);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The host app signs a JWT containing tenant_id. The layer's API secret verifies the signature.
  2. The React component is the same for both tenants — no tenant logic in front-end code.
  3. The semantic layer's cube SQL templates the tenant_id from the verified JWT into the FROM clause. The warehouse only ever sees scoped data.
  4. The cache key partitions by tenant_id. Tenant A's cache and tenant B's cache are physically distinct entries.
  5. Adding a third tenant requires zero code changes — just a new JWT with tenant_id = "tenant_C".

Output.

Caller tenant_id Returned active_users
Alice tenant_A 412
Bob tenant_B 158
Internal (no JWT) request denied

Rule of thumb. Multi-tenant SaaS without semantic-layer RLS is a leak waiting to ship. Front-end tenant logic will be forgotten — at some endpoint, in some refactor — and a customer will see another customer's data. The layer-side predicate is the only one that holds.

Worked example — LLM agent grounding via the semantic layer

Detailed explanation. An LLM agent receives the question "What was WAU last week by region, and how does it compare to the week before?" The agent's tools include cube_meta (returns published cubes / measures / dimensions / joins) and cube_query (executes a structured query). The agent never writes raw SQL.

Question. Walk through the tool calls, the structured query the agent constructs, the layer's compiled SQL, and the natural-language answer.

Input. Cube with Events.active_users measure, Regions.region_name dimension, joined via Events → Customers → Regions.

Code.

// Tool: cube_meta (called once at session start)
{
  "cubes": [
    {
      "name": "Events",
      "measures": ["Events.active_users"],
      "dimensions": ["Events.event_ts", "Events.event_name"],
      "joins": [{"to": "Customers", "on": "customer_id"}]
    },
    {
      "name": "Customers",
      "joins": [{"to": "Regions", "on": "region_id"}]
    },
    {
      "name": "Regions",
      "dimensions": ["Regions.region_name"]
    }
  ]
}
Enter fullscreen mode Exit fullscreen mode
// Tool: cube_query (called by the agent to answer the question)
{
  "measures": ["Events.active_users"],
  "dimensions": ["Regions.region_name"],
  "timeDimensions": [{
    "dimension": "Events.event_ts",
    "granularity": "week",
    "dateRange": ["2026-06-01", "2026-06-14"]
  }]
}
Enter fullscreen mode Exit fullscreen mode
-- The semantic layer compiles to (sketch)
SELECT r.region_name,
       DATE_TRUNC('week', e.event_ts) AS event_week,
       COUNT(DISTINCT e.user_id) AS active_users
FROM   analytics.fct_events e
JOIN   analytics.dim_customers c ON e.customer_id = c.customer_id
JOIN   analytics.dim_regions   r ON c.region_id  = r.region_id
WHERE  e.event_ts >= '2026-06-01'
  AND  e.event_ts <  '2026-06-15'
GROUP BY r.region_name, DATE_TRUNC('week', e.event_ts)
ORDER BY event_week, r.region_name;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The agent loads cube_meta to see the published schema. It learns: there is an Events.active_users measure, a Regions.region_name dimension, and an implicit join chain between them.
  2. The agent constructs a structured cube_querynot SQL — using the measures and dimensions it just learned. The query asks for Events.active_users grouped by Regions.region_name at weekly granularity over the last 14 days.
  3. The semantic layer compiles the structured query to SQL, including the join chain Events → Customers → Regions. The agent never wrote a JOIN clause.
  4. The result rows come back per region per week. The agent computes the week-over-week delta in natural language and presents the answer.
  5. Critically: the agent could not have hallucinated a wrong join. The schema published by cube_meta is the only surface it has access to.

Output (sample).

region_name event_week active_users wow_delta
EU 2026-06-01 9,210 n/a
EU 2026-06-08 9,440 +2.5%
US 2026-06-01 7,890 n/a
US 2026-06-08 8,150 +3.3%
APAC 2026-06-01 1,210 n/a
APAC 2026-06-08 1,332 +10.1%

The agent renders: "WAU last week was 18,922, up 4.0% from 18,310 the prior week. APAC grew fastest at +10.1%." — and cites Events.active_users as the metric.

Rule of thumb. Ground LLM agents on the semantic layer's schema, not on the raw warehouse schema. The constrained surface eliminates the entire category of "hallucinated JOIN" failures and makes every agent response auditable.

Worked example — migration from LookML to dbt SL, staged by metric

Detailed explanation. A company on Looker decides to migrate to dbt SL. Rather than a big-bang rewrite, they identify the 10 top metrics by query volume and rewrite those first. The other 190 LookML measures stay live until they are naturally retired or rebuilt.

Question. Outline the staged migration plan with a 30-day dual-run reconciliation per metric.

Input. 200 LookML measures; top 10 cover 80% of dashboard query volume.

Code.

migration:
  approach: staged
  parallel_run_days: 30
  top_metrics:
    - weekly_active_users
    - daily_active_users
    - total_revenue
    - new_signups
    - churn_rate
    - retention_d7
    - retention_d30
    - average_order_value
    - conversion_rate
    - lifetime_value
  reconciliation:
    cadence: daily
    tolerance_pct: 0.5
    block_threshold_pct: 1.0
    owner: analytics-engineering
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Inventory all LookML measures and rank by query volume from Looker's i__looker.history system table. Pick the top 10 — they cover most of the business.
  2. Rewrite each measure as a dbt SL metric (often a simple over a count_distinct or sum measure). For derived measures, use ratio or derived types.
  3. Dual-run for 30 days. Every day, a reconciliation job queries both LookML and dbt SL for the same metric, same filters, same granularity. If the numbers differ by > 0.5%, raise a warning; > 1.0%, block the migration of that metric.
  4. After 30 days of clean reconciliation, mark the dbt SL metric as primary. New dashboards point at dbt SL; the LookML measure stays read-only for one quarter, then is deleted.
  5. Repeat for the next batch of 10 metrics. The long tail (190 measures) is migrated opportunistically as dashboards are rebuilt.

Output.

Stage Metrics migrated LookML measures live Dual-run pass rate
Stage 1 (months 1–2) 10 (top 10) 200 100%
Stage 2 (months 3–4) 30 (next 20) 170 100%
Stage 3 (months 5–6) 80 (next 50) 120 98%
Stage 4 (months 7–12) 200 (all) 0 99% overall

Rule of thumb. Stage the migration by query volume, not alphabetically and not by team ownership. The top 10 metrics carry the migration's business value; the long tail can sit on LookML for as long as Looker is licensed.

Semantic layer interview question on cross-consumer reliability

A senior interviewer might frame it as: "Your CTO walks in with a complaint: 'The number on the executive dashboard is different from the number in the embedded customer portal and different again from what the Slack bot says.' Walk me through the diagnosis and the fix."

Solution Using semantic-layer consolidation as the audit-and-fix pattern

# Step 1: diagnose — list every place "weekly_active_users" is computed
diagnosis:
  exec_dashboard:
    source: Looker calculated field
    formula: count_distinct user_id where event_ts >= ...
  embedded_portal:
    source: Tableau workbook
    formula: count_distinct user_id where event_name = 'login' and event_ts >= ...
  slack_bot:
    source: ad-hoc Snowflake query in Python
    formula: count_distinct user_id where event_ts >= dateadd(day, -7, current_date)

# Step 2: fix — publish one metric, point every surface at it
metrics:
  - name: weekly_active_users
    description: Distinct users with any event in the trailing 7 days.
    type: simple
    type_params:
      measure: distinct_users
    filter: "{{ TimeDimension('events__event_ts') }} >= dateadd('day', -7, current_date)"

# Step 3: rewire consumers
consumers:
  exec_dashboard:  semantic_layer_sql_endpoint
  embedded_portal: semantic_layer_rest_endpoint
  slack_bot:       semantic_layer_graphql_endpoint
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Consumer Pre-fix number Why it differed Post-fix number
1 Exec dashboard 19,221 included all events 18,922
2 Embedded portal 14,650 filtered to login only 18,922
3 Slack bot 18,990 off-by-one in date window 18,922
4 All three after fix 18,922 all reading from same metric 18,922

Output:

Consumer Source of weekly_active_users Discrepancy after consolidation
Exec dashboard semantic layer SQL endpoint 0
Embedded portal semantic layer REST endpoint 0
Slack LLM bot semantic layer GraphQL endpoint 0

Why this works — concept by concept:

  • Diagnosis first — every "numbers don't match" incident starts as a survey of every place the metric is computed. The semantic layer's value proposition is precisely this consolidation.
  • One file, one definition — the migrated metric lives once in weekly_active_users.yml. There is no other place to edit it.
  • Consumer rewiring — each surface stops computing the metric in-tool and starts reading from the layer's endpoint. The viz / chart / bot code shrinks.
  • Cache convergence — within one cache TTL of the fix landing, every consumer reports the same number. The "numbers different by consumer" symptom physically cannot reproduce.
  • Audit trail — every future change to the metric flows through a PR. The CTO's "why did this number change?" question is answered by git log weekly_active_users.yml.
  • Cost — usually a reduction — three separate consumer-side computations collapse into one cached semantic-layer query.

SQL
Topic — sql
End-to-end SQL practice for analytics engineering

Practice →


Cheat sheet — semantic layer recipes

  • All-in on Looker, no near-term embedded. Stay on LookML, integrate dbt for the upstream model layer, and let Looker continue to be the consumer. The semantic layer is Looker.
  • dbt-native, multiple BI tools (Tableau / Power BI / Hex / Mode). Adopt the dbt Semantic Layer + MetricFlow. Definitions live next to dbt models; CI catches metric breaks at PR time.
  • Embedded analytics or AI agents are the dominant consumer. Choose Cube.dev for the REST / GraphQL surface. The semantic layer becomes the agent's tool surface.
  • Need pre-aggregated cubes for sub-second BI on billion-row facts. Use Cube's pre_aggregations — declare the roll-up grain, partition by month, refresh on a TTL.
  • Need date spine + cumulative / rolling-window metrics. dbt SL has best-in-class time-spine support; cumulative metrics with window: 30 days are first-class.
  • Need to ground an LLM agent on governed metrics. Publish the semantic layer's schema as the agent's tool surface. The agent constructs structured queries; the layer compiles to SQL.
  • Multi-tenant SaaS embedded analytics. Route every consumer through the semantic layer with JWT-driven RLS. The cube sql template inlines tenant_id from securityContext.
  • Migrating off LookML. Stage by query volume — port the top 10 metrics first, dual-run for 30 days, reconcile daily, and let the long tail retire naturally over the next 12 months.
  • Picking between Cube and dbt SL. If consumers are 60%+ embedded / API / LLM, lean Cube. If consumers are 60%+ existing BI tools and you already own a dbt repo, lean dbt SL.
  • Defining a metric once. Pick the LEGO-brick measure (count_distinct, sum, count) and let the platform compose the metric. Avoid hand-rolling separate measures for DAU, WAU, MAU — use one and vary granularity.
  • Safe division in derived metrics. Wrap the denominator in NULLIF(..., 0) (Cube and LookML) or use the typed ratio metric (dbt SL).
  • Cache eviction on metric edit. Make sure the semantic layer flushes the cache on deploy — otherwise consumers see the old number for the duration of the TTL after a definition change.
  • Audit trail. Every metric definition lives in git. git log <metric.yml> is the answer to "why did this number change?" — for compliance, finance, and the CEO.

Frequently asked questions

Do I need a semantic layer if I use dbt?

You can run dbt without a semantic layer — your marts will be clean and well-tested, and analysts will write SQL on top of them. The semantic layer matters the moment more than one consumer asks for the same metric. Without it, the metric is re-implemented in each BI tool, each notebook, and each embedded chart — and the numbers drift. The dbt Semantic Layer (powered by MetricFlow) is the natural choice if you already use dbt, because metric definitions live in the same repo as your models and flow through the same PR review and CI. If your consumers include embedded analytics or LLM agents, Cube is often the better fit because of its REST/GraphQL surface; in that case you keep dbt for the model layer and put Cube on top.

Can I use the dbt Semantic Layer without dbt Cloud?

Partially. MetricFlow (the engine behind the dbt Semantic Layer) ships as open source and runs from dbt Core via the mf CLI — you can define semantic_models and metrics, validate them, and query metrics locally. What you do not get without dbt Cloud Team or Enterprise is the hosted Semantic Layer server, the caching tier, the JDBC connector for Tableau / Power BI / Hex / Mode, and the official integrations. For most teams that pay for any BI tool today, the dbt Cloud tier pays for itself in eliminated metric-duplication and reduced warehouse spend; for pure OSS shops, MetricFlow + a custom query gateway is a viable path but more work.

Is Cube.dev free?

The Cube Core engine is open source and free to self-host — Docker, your warehouse, your servers. Cube Cloud is the paid managed offering, which adds the IDE, hosted pre-aggregation runners, deployment automation, query history, role-based access control, and lineage. Teams typically start on Cube Core for evaluation and adopt Cube Cloud once metric count grows beyond 50 or embedded analytics SLAs require professional infrastructure. The OSS edition is genuinely usable in production — Cube was the first semantic layer with this OSS-with-paid-managed model and remains the most popular standalone semantic engine.

How does LookML compare to the dbt Semantic Layer?

LookML is the original semantic modelling language and is tightly coupled to Looker as the consumer surface — view and explore files map to the Looker UI, the IDE, and the SQL Runner. The dbt Semantic Layer is consumer-agnostic and lives in your existing dbt repo. LookML wins on IDE polish, content validation, and the fact that Looker users get a fully managed experience. dbt SL wins on portability (Tableau, Power BI, Hex, Mode all consume it natively), on cost (no per-seat Looker license), and on lineage (metrics live next to the models they read from). The most common 2026 migration pattern is staged: dbt SL for new dashboards and the top 10 metrics, LookML for the long tail until naturally retired.

Can LLM agents query a semantic layer?

Yes — and this is the use case that pushed every major BI vendor to ship a semantic-layer story in 2024–2026. The agent reads the published schema (cubes / semantic_models / views, plus measures, dimensions, and joins) as a tool surface, then constructs structured queries against the layer. The layer compiles to SQL. The agent never writes raw SQL itself, which eliminates the "hallucinated JOIN" failure mode that plagues text-to-SQL on raw warehouse schemas. Cube's GraphQL surface and dbt SL's API are both well-suited for this; the LookML route works via Looker's API but with more friction. For LLM-heavy roadmaps, Cube is the most common pick.

What's the difference between a metrics layer and a semantic layer?

The terms overlap heavily in 2026 marketing. A "metrics layer" emphasises the named KPIs (WAU, MAU, revenue, churn, retention) — the dashboard-facing numbers. A "semantic layer" emphasises the broader modelling surface — measures, dimensions, joins, entity relationships, segments — that lets you derive metrics. dbt SL, Cube, and LookML are all semantic layers in this fuller sense; LookML calls itself "semantic modelling," dbt SL talks about "semantic_models," and Cube talks about "cubes." The 2020-era "metrics layer" companies (Transform, Supergrain, Trace, GoodData) either pivoted into full semantic layers or were acquired by larger semantic-layer / BI vendors. Practically, when you read "metrics layer" today, assume "semantic layer" — they refer to the same engineering object.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every semantic-layer recipe above ships with hands-on practice rooms where you write the `count_distinct` measure, the entity-based join, and the safe-division ratio against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you can rehearse the analytics-engineering moves behind Cube, dbt SL, and LookML against the same SQL fundamentals every interviewer probes.

Practice aggregation now →
JOIN drills →

Top comments (0)