Alex Merced
Semantic Layer: The Definitive Guide

The term "semantic layer" has been part of the data industry's vocabulary for more than three decades. It first appeared in a 1991 patent filing by Business Objects, and it has since been reinvented, abandoned, and reinvented again across three distinct eras of data architecture. Today, it sits at the center of one of the most consequential design debates in the industry: should the semantic layer be a standalone product you bolt onto your stack, or a native capability of the platform that already manages your data?

This guide covers the full arc: what a semantic layer is, where it came from, how it split into two competing architectural approaches, and why the choice between them determines whether your AI agents produce accurate answers or hallucinated nonsense.

What a Semantic Layer Actually Is

The semantic layer sits between raw data sources and consumers, providing metric consistency, access governance, and query abstraction

A semantic layer is an abstraction that maps the physical structure of your data (table names, column names, join logic, filter conditions) to the business terms that people actually use (revenue, churn rate, active customer, cost per acquisition). It sits between the raw data and every consumer of that data: BI dashboards, AI agents, Python notebooks, Excel spreadsheets, and custom applications.

The semantic layer has three core responsibilities:

Metric consistency. When the finance team says "revenue," they mean recognized revenue net of refunds. When the sales team says "revenue," they mean bookings including pending deals. Without a semantic layer, both teams write their own SQL, get different numbers, and spend the next two weeks arguing about which dashboard is right. A semantic layer defines "revenue" once, in one place, and every downstream consumer uses that definition.

Access governance. The semantic layer controls who sees what. A marketing analyst querying customer data should not see Social Security numbers. A regional manager should only see data for their region. These rules (row-level security, column masking, role-based access) are defined at the semantic layer and enforced consistently regardless of which tool is doing the querying.

Query abstraction. Business users and AI agents should not need to know that "customer churn rate" requires joining three tables, filtering out test accounts, calculating a 90-day rolling window, and dividing by the active customer count from the prior period. The semantic layer encapsulates that logic in a reusable definition. Consumers ask for "churn rate" and get the right answer.

The Origin Story: Business Objects, 1991

The semantic layer was invented to solve a simple problem: business users could not write SQL.

In 1991, Business Objects filed a patent for a "relational database access system using semantically dynamic objects." The product feature was called "Universes." It worked like this: a data architect would build a metadata model that mapped physical database tables and join paths into business-friendly objects ("Customer," "Product," "Sales Amount"). Report builders could then drag and drop these objects to create queries without touching SQL.

This was a significant advance. Before Universes, generating a report from a relational database required either a developer who understood the schema or a business user willing to learn SQL. Business Objects eliminated that requirement entirely.

Cognos (later acquired by IBM) followed with "Framework Manager," which served the same purpose: map the physical database into a logical, business-friendly model. SAP built InfoProviders and BEx queries on top of SAP BW. Microsoft introduced SQL Server Analysis Services.

Every major enterprise BI vendor in the 1990s built some version of a semantic layer. But they all shared the same fundamental limitation: the semantic layer was proprietary and locked to a single vendor's BI tool. If you built your metrics in Business Objects Universes, those definitions did not carry over to Cognos. If you modeled your data in SSAS, Tableau could not read it. The semantic layer existed, but it was a walled garden.

OLAP Cubes: The Implicit Semantic Layer

Running parallel to the relational semantic layer was the OLAP (Online Analytical Processing) cube. Products like SQL Server Analysis Services, Cognos TM1, and Oracle Essbase pre-computed data into multidimensional structures: dimensions (Customer, Product, Time), measures (Revenue, Quantity, Profit), and hierarchies (Year > Quarter > Month > Day).

The cube itself functioned as a semantic layer. Business users did not query tables; they navigated dimensions. They did not write SQL; they used MDX (Multidimensional Expressions) or simply clicked through pivot-table interfaces. The business logic was baked into the cube's structure.

OLAP cubes worked well for their era. Pre-computing aggregations meant that analytical queries returned in seconds, even on the hardware of the early 2000s. But they had three fatal weaknesses:

  1. Rigidity. Adding a new dimension or changing a hierarchy required rebuilding the cube, which could take hours for large datasets. Business requirements change faster than cubes can be rebuilt.

  2. Cost. Cubes stored pre-aggregated copies of data. For large organizations, this meant maintaining terabytes of redundant, pre-computed data on expensive storage.

  3. Specialization. Operating an OLAP cube required specialized skills (MDX, cube design, aggregation strategies) that most data teams did not have.

As cloud data warehouses like Snowflake, BigQuery, and Redshift made raw compute cheap and fast, the need for pre-aggregation declined. You could run the analytical query directly against the detail data and get results in seconds. The cube's primary value proposition, speed through pre-computation, was no longer unique.

The Self-Service Era and the Loss of the Semantic Layer

The 2010s brought a dramatic shift. Self-service BI tools like Tableau and Power BI connected directly to databases, bypassing the semantic layer entirely. This was marketed as democratization: give every analyst direct access to the data, and they will find their own insights.

For small teams, this worked. For organizations with more than a handful of analysts, it created a problem that the industry calls "metric drift." Without a centralized semantic layer, each analyst wrote their own SQL. Each SQL query embedded its own business logic. Revenue was calculated five different ways by five different people, and no one could agree on which number was correct.

The first response to metric drift came from Looker, founded in 2012, which introduced LookML as a code-based semantic layer. You defined your metrics, dimensions, and relationships in version-controlled modeling files. This was a meaningful evolution because it separated the semantic logic from the BI tool's proprietary report format. Google acquired Looker for $2.6 billion in 2019, validating that the semantic layer was worth owning. But LookML was still tied to Looker's ecosystem. If you used Tableau or Power BI as your primary BI tool, LookML did not help.

The broader industry realization was clear: skipping the semantic layer does not eliminate the need for one. It just distributes the problem across every team and every dashboard, where it becomes harder to find and harder to fix.

Dremio: The Semantic Layer Built Into the Query Engine From Day One

While Looker was coupling the semantic layer to a BI tool, a different approach was emerging. Dremio was founded in 2015 by Tomer Shiran and Jacques Nadeau, creators of and contributors to the Apache Drill project. When Dremio publicly launched in July 2017, it introduced what it called a "governed, self-service semantic layer" as a core architectural component, not an add-on.

The key difference: Dremio's semantic layer was integrated directly into a high-performance query engine. From its first release, the platform shipped with:

  • Virtual Datasets (Views). SQL-defined business logic that users could create, share, and layer on top of any connected data source. No data movement required.
  • Reflections. Patented, transparent materialized views that the query optimizer substitutes automatically. Users query the governed view; Dremio serves the fastest available Reflection behind the scenes.
  • Federated access. The semantic layer worked across data sources (S3, HDFS, relational databases) from the start, not just against a single warehouse.

Dremio added Wikis and Labels (Tags) in subsequent releases, providing Markdown-formatted documentation and classification metadata directly attached to datasets in the catalog. This meant the semantic layer was not just a set of views; it included the context that made those views discoverable and understandable.

This was architecturally distinct from every other semantic layer on the market. AtScale (founded 2013) and Cube (open-sourced 2019) built the semantic layer as a separate product. Dremio built it into the same platform that executed the queries and managed the catalog. That design decision would become increasingly important as AI agents entered the picture.

The Modern Resurgence: Two Divergent Paths

The semantic layer evolved from 1991 through OLAP cubes and self-service BI into two divergent paths: standalone products and platform-integrated semantic layers

By the early 2020s, the semantic layer was firmly back. dbt Labs acquired Transform (the creators of MetricFlow) in February 2023 to build a code-based metrics layer. Cube had open-sourced its API-first semantic layer in 2019 and launched Cube Cloud commercially in 2021. AtScale had been building its enterprise virtualization layer since 2013.

The market had split into two fundamentally different architectural forms, and the choice between them has significant consequences for how your data platform operates.

Path 1: The semantic layer as a standalone product. Companies like AtScale (2013), Cube (2019), and dbt (MetricFlow, 2023) built the semantic layer as a separate service that sits between your data warehouse and your BI tools. You deploy it as its own infrastructure, manage it as its own system, and integrate it with your existing stack.

Path 2: The semantic layer as a platform feature. Dremio (2017) integrated the semantic layer directly into its query engine and data catalog from the start. There is no separate service to deploy. The semantic layer is a native capability of the same platform that stores, governs, and queries your data.

Both approaches solve the metric consistency problem. They differ in how they solve it, what they require operationally, and how well they extend to AI-driven analytics.

Path 1: The Semantic Layer as a Standalone Product

Three standalone semantic layer products dominate the current market. Each targets a different architecture and team profile.

AtScale (Founded 2013)

AtScale, founded by veterans of the Yahoo data team, positions itself as a "universal semantic layer" for large enterprises. It creates a virtualization layer across multiple data warehouses (Snowflake, BigQuery, Databricks), presenting a unified semantic model to BI tools. Its strongest feature is native MDX and DAX compatibility, which makes it the only option for organizations with heavy Excel and SSAS dependencies.

AtScale excels when you have data spread across multiple warehouses and need a single semantic model that works across all of them. The tradeoff is infrastructure complexity and licensing cost. AtScale requires dedicated infrastructure, and its enterprise pricing model reflects its positioning.

Cube (Open-Sourced 2019)

Cube started as Statsbot in 2016 before pivoting to become an open-source, API-first semantic layer in 2019. It provides REST, GraphQL, SQL, MDX, and DAX APIs, making it the most flexible option for embedded analytics and customer-facing dashboards. Cube Cloud launched commercially in 2021. Cube's pre-aggregation engine can deliver sub-second responses for complex queries by pre-computing results and caching them.

Cube excels when your primary consumer is a custom application, not a BI tool. The tradeoff is operational overhead: Cube runs as a separate server, requires its own infrastructure, and demands expertise in designing pre-aggregation strategies to achieve optimal performance.
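The idea behind pre-aggregation can be shown with a toy sketch (Cube's actual engine is far more sophisticated, with cache invalidation and refresh scheduling; the table and values here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (day TEXT, user_id INTEGER, amount REAL);
INSERT INTO events VALUES
  ('2024-01-01', 1, 10.0), ('2024-01-01', 2, 20.0),
  ('2024-01-02', 1, 5.0);

-- Pre-aggregation: roll detail rows up once, ahead of time...
CREATE TABLE rollup_daily AS
  SELECT day, COUNT(*) AS events, SUM(amount) AS amount
  FROM events GROUP BY day;
""")

# ...then serve repeated queries from the small rollup instead of
# rescanning the detail table on every request.
row = conn.execute(
    "SELECT events, amount FROM rollup_daily WHERE day = '2024-01-01'"
).fetchone()
print(row)  # (2, 30.0)
```

Designing which rollups to build, and keeping them fresh as detail data changes, is the expertise the section above refers to.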

dbt Semantic Layer (MetricFlow, 2023)

The dbt Semantic Layer is powered by MetricFlow, the technology dbt Labs acquired when it purchased Transform in February 2023. It lets teams define metrics as code in YAML files within their existing dbt projects. Metrics are version-controlled, reviewed via pull requests, and deployed alongside your dbt transformations. In late 2025, dbt Labs moved MetricFlow to an Apache 2.0 license, signaling a commitment to open, portable metric definitions.

The dbt Semantic Layer excels when your team is already a dbt shop and wants metrics managed in the same Git-based workflow as transformations. The tradeoff is that it requires dbt Cloud for the serving layer, lacks native caching (relying on the underlying warehouse for query execution), and is less suited for high-concurrency embedded applications.

The Structural Tradeoff of Standalone Products

All three standalone products share the same architectural limitation: they exist as a separate layer of infrastructure that must integrate with your data catalog, your governance system, and your query engine.

This means:

  • Another system to operate. You deploy it, monitor it, upgrade it, and debug it.
  • Governance is a separate concern. Access policies defined in your catalog or warehouse must be replicated or synced with the semantic layer. Any gap is a security risk.
  • No native execution. Standalone semantic layers define metrics but do not execute queries. They translate user requests into SQL and send that SQL to an external engine. If the engine and the semantic layer disagree on the data model, you get wrong results.
  • Sync lag. When you change a table schema, add a column, or update governance rules, the semantic layer must be updated separately. Until it is, your definitions are stale.

For teams with a single data warehouse, a strong DevOps practice, and a primary use case that matches one of these products, standalone semantic layers work well. For teams managing federated data across multiple sources, or teams building AI-driven analytics, the gap between "definition" and "execution" creates friction that compounds over time.

Path 2: The Semantic Layer as a Platform Feature

Dremio's architecture integrating semantic layer, query engine, and open catalog in a single platform

The alternative is to build the semantic layer into the same platform that manages your data catalog, governs access, and executes queries. This is the approach Dremio takes.

In Dremio, the semantic layer is not a separate product you bolt on. It is a native set of capabilities (views, wikis, labels, lineage, knowledge graph) that are integrated with the Open Catalog (built on Apache Polaris), the MPP query engine (built on Apache Arrow), and the governance system (Fine-Grained Access Control, row-level security, column masking).

This matters because the three activities that define a semantic layer (defining metrics, governing access, and executing queries) all happen in the same system. There is no handoff, no sync, no governance gap.

How Dremio's Semantic Layer Works

Dremio's AI Semantic Layer is built from five components that work together: views, wikis, labels, lineage, and the knowledge graph.

Views (Virtual Datasets)

Views are the foundation. A view is a SQL-defined virtual dataset that encapsulates business logic: joins, filters, calculations, and transformations. You write the SQL once, and every consumer (BI tool, AI agent, Python notebook) queries the view instead of the raw tables.

Dremio recommends a three-layer architecture for views:

  • Preparation Layer. One view per source table. Handles type casting, column renaming, null handling. A direct 1:1 mapping of raw data into clean, standardized form.
  • Business Layer. Shared business logic. This is where you define "active customer" (customers with at least one order in the last 90 days, excluding test accounts), "revenue" (order_amount minus refunds, in USD), and every other metric that needs a single definition.
  • Application Layer. Tailored datasets for specific consumers. A marketing dashboard view joins customer demographics with campaign performance. An AI agent view exposes the most commonly asked metrics with rich column-level documentation.

Because views are virtual, they do not copy or move data. They execute against the underlying data at query time, using Dremio's federated query engine to pull from S3, PostgreSQL, Snowflake, MongoDB, or any connected source. Change the underlying data, and the view reflects it immediately.
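The three-layer stack can be sketched as plain SQL views (names and data are invented; in Dremio these would be virtual datasets organized in the catalog rather than sqlite views):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_orders (ID INTEGER, AMT TEXT, ACCT_TYPE TEXT, REGION TEXT);
INSERT INTO raw_orders VALUES
  (1, '100.0', 'customer', 'EMEA'),
  (2, '40.0',  'test',     'EMEA'),
  (3, '60.0',  'customer', 'AMER');

-- Preparation layer: 1:1 cleanup of the source (casts, renames).
CREATE VIEW prep_orders AS
  SELECT ID AS order_id,
         CAST(AMT AS REAL) AS amount,
         ACCT_TYPE AS account_type,
         REGION AS region
  FROM raw_orders;

-- Business layer: shared logic, defined once (exclude test accounts).
CREATE VIEW biz_revenue AS
  SELECT region, SUM(amount) AS revenue
  FROM prep_orders
  WHERE account_type != 'test'
  GROUP BY region;

-- Application layer: a tailored slice for one consumer.
CREATE VIEW app_emea_dashboard AS
  SELECT revenue FROM biz_revenue WHERE region = 'EMEA';
""")

print(conn.execute("SELECT revenue FROM app_emea_dashboard").fetchone()[0])  # 100.0
```

Note that the test account's $40 never reaches the dashboard, because the exclusion lives in the business layer rather than in each consumer's query.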

Wikis

Wikis are Markdown-formatted documentation attached directly to spaces, sources, folders, tables, views, and columns. They serve two audiences: human analysts browsing the catalog, and AI agents generating SQL.

A wiki for a view called analytics.customer_health might contain:

```markdown
## Customer Health Score

Composite metric combining purchase frequency, support ticket volume,
and NPS survey responses over the trailing 90 days.

**Owner:** Customer Success team
**Refresh:** Updated daily by the ETL pipeline
**Filters:** Excludes test accounts (account_type != 'test')
**Churn definition:** Score below 30 for two consecutive months
```

Dremio can also auto-generate wiki content. The platform samples table data, analyzes column distributions, and produces context-rich descriptions using generative AI. This is particularly valuable for large data estates where manually documenting hundreds of tables is impractical.

Labels

Labels classify and organize data assets. You tag a table as PII, Finance, Marketing, Approved, or Draft. Labels serve two purposes: they improve discoverability (semantic search returns results filtered by label), and they integrate with governance rules (all PII-labeled columns automatically apply masking policies).

Like wikis, labels can be AI-suggested. Dremio analyzes column names, data patterns, and content to recommend labels like contains-email or likely-PII.
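A toy version of label suggestion might look like the sketch below. A real implementation would also sample data values and column distributions, not just names; the hint patterns and label strings here are invented:

```python
import re

# Columns whose names hint at personally identifiable information.
PII_HINTS = re.compile(r"(email|ssn|phone|dob|address)", re.IGNORECASE)

def suggest_labels(column_name: str) -> list[str]:
    """Recommend classification labels from a column name alone."""
    labels = []
    if PII_HINTS.search(column_name):
        labels.append("likely-PII")
    if "email" in column_name.lower():
        labels.append("contains-email")
    return labels

print(suggest_labels("cust_email_addr"))  # ['likely-PII', 'contains-email']
print(suggest_labels("order_total"))      # []
```

Suggestions like these are candidates for a human to approve; once applied, the labels feed both search and governance policies.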

Lineage

Dremio automatically tracks the flow of data from source to view to consumer. You can see which raw tables feed into which business views, and which dashboards or AI queries consume those views.

Lineage is critical for impact analysis. Before changing the schema of a source table, you can trace all downstream dependencies and understand exactly what will break. Without automated lineage, this analysis requires manually reading SQL definitions and hoping you did not miss one.
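Impact analysis over a lineage graph is essentially a graph traversal. A minimal sketch (the assets and edges are invented; Dremio tracks these automatically):

```python
# parent -> the views/dashboards/agents that read from it
lineage = {
    "raw.orders":  ["prep.orders"],
    "prep.orders": ["biz.revenue", "biz.churn"],
    "biz.revenue": ["dash.finance", "agent.revenue_qa"],
    "biz.churn":   ["dash.cs"],
}

def downstream(asset: str) -> set[str]:
    """Everything that transitively depends on `asset`."""
    impacted, stack = set(), [asset]
    while stack:
        for child in lineage.get(stack.pop(), []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

# Before altering raw.orders, list what a schema change would touch:
print(sorted(downstream("raw.orders")))
```

Everything the traversal returns is a candidate for breakage, which is exactly the list you want before approving a schema change.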

Knowledge Graph

The knowledge graph is the newest addition to Dremio's semantic layer. It operates at a higher level than individual wikis and labels, building a connected graph of entity relationships, metric definitions, and usage patterns.

The knowledge graph works in three ways:

  1. Pattern detection. It analyzes query patterns across your organization to detect implicit definitions. If 80% of queries that reference "active customers" use the same WHERE clause (last_order_date > CURRENT_DATE - INTERVAL '90' DAY AND account_type != 'test'), the knowledge graph surfaces that pattern as a candidate definition.

  2. User-defined context. You can provide business context as structured markdown files. These files define entities, relationships, and business rules that the knowledge graph ingests and makes available to AI agents.

  3. Relationship mapping. The knowledge graph connects related entities (customers are related to orders, orders contain products, products belong to categories) and exposes those relationships to AI agents, enabling more accurate multi-table SQL generation.
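Pattern detection (point 1 above) can be sketched as mining a query log for the filter most often attached to a metric. This toy version ignores SQL parsing entirely; the log entries and the 50% threshold are invented for illustration:

```python
from collections import Counter

# Invented query log: filters attached to "active customers" queries.
query_log = [
    "... WHERE last_order_date > CURRENT_DATE - INTERVAL '90' DAY AND account_type != 'test'",
    "... WHERE last_order_date > CURRENT_DATE - INTERVAL '90' DAY AND account_type != 'test'",
    "... WHERE last_order_date > CURRENT_DATE - INTERVAL '90' DAY AND account_type != 'test'",
    "... WHERE last_order_date > CURRENT_DATE - INTERVAL '30' DAY",
]

clauses = Counter(q.split("WHERE ", 1)[1] for q in query_log)
candidate, hits = clauses.most_common(1)[0]
share = hits / len(query_log)

if share >= 0.5:  # dominant pattern: surface as a candidate definition
    print(f"candidate definition ({share:.0%} of queries): {candidate}")
```

The surfaced candidate is not applied automatically; it is offered to a data steward who can promote it into an official, documented definition.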

Semantic Search

Semantic search lets users and AI agents discover data assets using natural language. Instead of browsing a schema tree looking for a table called dwh_fact_cust_ord_line_item, you search for "customer orders by product category" and find the relevant view, complete with its wiki documentation and labels.

Semantic search indexes wikis, labels, column names, table descriptions, and view definitions. It is the entry point for both human exploration and AI agent data discovery.
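A crude way to see why that index matters: even simple token overlap finds a cryptically named table once its wiki and labels are indexed alongside its name. Real semantic search uses embeddings rather than token matching; the catalog entries below are invented:

```python
catalog = {
    "dwh_fact_cust_ord_line_item": {
        "wiki": "customer orders broken down by product category",
        "labels": ["Approved", "Sales"],
    },
    "hr_employee_dim": {
        "wiki": "employee records",
        "labels": ["PII"],
    },
}

def search(query: str) -> list[str]:
    """Rank assets by token overlap across name, wiki, and labels."""
    q = set(query.lower().split())
    def score(name: str) -> int:
        meta = catalog[name]
        text = f"{name} {meta['wiki']} {' '.join(meta['labels'])}".lower()
        return len(q & set(text.split()))
    ranked = sorted(catalog, key=score, reverse=True)
    return [n for n in ranked if score(n) > 0]

print(search("customer orders by product category"))  # ['dwh_fact_cust_ord_line_item']
```

Without the wiki text in the index, the query "customer orders by product category" would match nothing, because the table name alone shares no tokens with it.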

Why the Integrated Approach Changes Everything for AI

How an AI agent uses the semantic layer to generate accurate SQL from a natural language question

The reason the platform-versus-product distinction matters more now than it did five years ago is AI. Specifically, AI agents that generate SQL from natural language questions.

An AI agent that receives the question "What was our customer churn rate by region last quarter?" needs three things to produce an accurate answer:

  1. Context. What does "churn rate" mean in this organization? What table contains the data? Which columns are relevant? What filters should be applied? The semantic layer's wikis, labels, views, and knowledge graph provide this context.

  2. Access. Can this user see the churn data? Are there row-level filters based on their role? Are any columns masked? The governance system enforces these rules.

  3. Execution speed. The user expects an answer in seconds, not minutes. The query engine needs to be fast enough for interactive use.

In a standalone semantic layer architecture, these three capabilities live in three different systems: the semantic layer product provides context, the data catalog or warehouse provides governance, and a separate query engine provides execution. The AI agent must coordinate across all three, and any mismatch between them produces wrong answers, security violations, or slow responses.

In Dremio's architecture, all three are co-located. The AI Agent reads the wikis, labels, and view definitions from the semantic layer, generates SQL that respects governance rules, and executes the query on the built-in MPP engine. The entire loop happens within a single governed platform.
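That loop (context, governance, execution) can be sketched in miniature. The SQL "generation" step is stubbed out, since that is the LLM's job; the tables, wiki text, and user-to-view mapping are all invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE churn (region TEXT, churn_rate REAL);
INSERT INTO churn VALUES ('EMEA', 0.04), ('AMER', 0.06);
-- Governance as a view: this user may only see EMEA rows.
CREATE VIEW churn_emea AS SELECT * FROM churn WHERE region = 'EMEA';
""")

wiki = {"churn": "Churn rate = customers lost / active customers, trailing 90 days."}
user_views = {"regional_manager_emea": "churn_emea"}  # row-level scope via view

def answer(user: str, question: str):
    context = wiki["churn"]                          # 1. context from the semantic layer
    view = user_views[user]                          # 2. governance: user-scoped view
    sql = f"SELECT region, churn_rate FROM {view}"   # 3. (stubbed) SQL generation
    return context, conn.execute(sql).fetchall()     # 4. execution on the engine

_, rows = answer("regional_manager_emea", "What was churn by region last quarter?")
print(rows)  # [('EMEA', 0.04)]
```

The point of the sketch is co-location: the same process that hands the agent its business context also scopes what it can see and runs the query, so there is no seam where the three can drift apart.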

Dremio's MCP Server extends this to external AI tools. ChatGPT, Claude Desktop, or any custom agent that supports the Model Context Protocol can connect to Dremio and query through the same governed semantic layer. The external AI agent receives the same business context, respects the same governance rules, and gets the same fast query execution as the built-in AI Agent.

The semantic layer teaches the AI your business language so it generates the right SQL, not generic SQL.

Platform vs Product: A Side-by-Side Comparison

| Dimension | Standalone Products (AtScale, Cube, dbt) | Platform-Integrated (Dremio) |
| --- | --- | --- |
| Infrastructure | Separate server or service to deploy | Built into the platform |
| Governance | Must integrate with external catalog and warehouse | Native FGAC, row-level security, column masking |
| Query execution | Depends on external warehouse or engine | Built-in MPP engine (Apache Arrow) |
| Metric definitions | YAML files, code, or GUI-based models | SQL views in the catalog |
| AI readiness | Requires separate MCP adapter or API integration | Native AI Agent + MCP Server |
| Data access | Single warehouse or requires federation setup | Federated queries across 30+ sources |
| Performance optimization | Pre-aggregation (Cube) or warehouse-dependent (dbt) | Reflections (autonomous, transparent acceleration) |
| Sync lag | Possible lag between definition changes and enforcement | Real-time; definitions and execution are the same system |
| Best for | Teams with a single warehouse and specific tooling needs | Teams with diverse data sources, AI-driven analytics, or multi-engine requirements |

When a Standalone Product Fits

  • You use a single data warehouse (Snowflake, BigQuery) and your semantic layer needs are limited to consistent BI metrics
  • Your team is already deeply invested in dbt and wants metrics alongside transformations
  • You are building customer-facing embedded analytics and need Cube's pre-aggregation performance
  • You have heavy Excel/MDX requirements that only AtScale supports

When the Platform Approach Fits

  • Your data lives across multiple sources (S3, PostgreSQL, Snowflake, MongoDB) and you need federated access
  • You want governance rules defined once and enforced everywhere, including for AI agents
  • You are building or planning AI-driven analytics (AI Agent, MCP, natural language querying)
  • You want to eliminate the operational overhead of managing a separate semantic layer product
  • You need the semantic layer, the catalog, and the query engine to operate as a single governed system

Building Your Semantic Layer: A Practical Framework

If you are starting from scratch or migrating from an ad-hoc metric landscape, here is a practical sequence:

Step 1: Identify your top 10 metrics. Not all metrics need to be in the semantic layer on day one. Start with the metrics that cause the most confusion: revenue, churn, active users, cost per acquisition, NPS. These are the metrics where two teams have two different SQL queries and two different numbers.

Step 2: Build the layered view architecture. For each metric, create the three-layer view stack in Dremio. Preparation views clean the source data. Business views encode the agreed-upon logic. Application views tailor the output for specific consumers.

Step 3: Add wikis and labels. Document each view and its columns. Define what the metric means, who owns it, how it is calculated, and what filters are applied. Tag columns with labels like PII, Finance, or Approved.

Step 4: Configure governance. Apply Fine-Grained Access Control: row-level security for multi-tenant data, column masking for sensitive fields, role-based access for views. These rules are enforced at query time for every consumer, including AI agents.

Step 5: Connect AI interfaces. Enable the Dremio AI Agent for your team. Set up the MCP Server for external AI tools. The wikis and labels you added in Step 3 become the context that makes AI-generated SQL accurate.

Step 6: Expand. Add the next 10 metrics. Build knowledge graph definitions for complex entity relationships. Let autonomous Reflections learn from query patterns and accelerate the most common queries automatically.

The semantic layer is not a one-time project. It is a living system that grows with your organization's data needs. Start small, prove value on the metrics that matter most, and expand from there.

Try Dremio Cloud free for 30 days to build your semantic layer on top of your existing data sources with zero data movement and native AI agent support.
