DEV Community

Alex Merced

Why Dremio's Value Is Unique to Apache Iceberg Lakehouses and Agentic Analytics

Most data teams have already made two decisions, even if they haven't written them down yet. The first is that Apache Iceberg will be the table format their analytical data lives in. The second is that AI agents will be querying that data, not just dashboards and analysts. The Apache Iceberg lakehouse and agentic analytics aren't separate initiatives. They're two halves of the same architecture, and the teams that treat them that way will get to trusted AI years ahead of the teams that don't.

Here's the problem. The path between "we run a warehouse and some databases" and "agents answer business questions against governed Iceberg tables" is full of blockers. Migration risk. Table maintenance. Semantic context for AI. Mountains of unstructured documents. Most vendors solve one of these and leave you to stitch together the rest from three or four other products.

Dremio is built to take you through all four. Its federated query engine lets you start before you migrate anything. Its autonomous management runs the Iceberg lakehouse for you. Its AI Semantic Layer, built-in AI Agent, MCP server, and CLI give agents governed access with real business meaning. And its AI Functions turn PDFs sitting in object storage into Iceberg tables with a single SQL statement.

This post walks through why the Iceberg lakehouse and agentic analytics matter, what blocks teams from getting there, and how Dremio removes each blocker in order.

Why You Want an Apache Iceberg Lakehouse and Agentic Analytics

Start with the lakehouse half. The argument for storing your analytical data in Apache Iceberg tables on your own object storage comes down to three things: interoperability, cost, and control.

Interoperability is the big one. Iceberg is an open table format with a published spec and a REST catalog standard. When your tables live in Iceberg, any compliant engine can read and write them. Dremio, Spark, Flink, Trino, Snowflake, and dozens of other tools all speak Iceberg now. That means you pick the best engine for each workload instead of the engine your storage vendor forces on you. Your streaming pipeline can write with Flink while your BI layer queries with Dremio, and both see the same consistent snapshots. No exports. No copies. No format conversion tax.

Cost follows directly from that. Object storage like S3, ADLS, or GCS costs a fraction of proprietary warehouse storage, and you only pay for it once. The traditional pattern of copying the same data into a warehouse, a BI extract, and three departmental marts multiplies your storage bill and your governance surface at the same time. One Iceberg copy on cheap object storage, queried in place by whatever engine needs it, collapses that sprawl. You also escape the lock-in math where leaving a vendor means re-platforming years of accumulated tables.

Control is the quieter benefit. Iceberg gives you warehouse-grade features (ACID transactions, schema evolution, partition evolution, time travel) on files you own, in buckets you control, governed by catalogs built on open standards like Apache Polaris. Your data stays in your storage. That's not a slogan. It's a negotiating position.

Now the agentic half. Agentic analytics is what happens when AI agents query and act on enterprise data directly instead of waiting for a human to build a dashboard. The payoff is a quicker and far more democratized path to insight. A product manager asks a question in plain language and gets a chart in seconds. An agent monitors revenue anomalies overnight and files a summary before anyone logs in. Amazon's SCOT Finance Analytics team saw what this direction looks like in practice with Dremio, cutting query times from 60 seconds to 4 to 6 seconds and eliminating 60 hours of work per project across more than 1,000 users. When the interface to data becomes a question instead of a ticket queue, the number of people who can get answers grows by an order of magnitude.

Iceberg is what makes agentic analytics safe to run at that scale. Agents generate far more queries than humans do, with far more variety. They need a substrate that's consistent (so two agents never see two versions of the truth), cheap to scan (because exploratory query volume explodes), and rich in metadata (because snapshot and partition statistics are what let engines and optimizers answer fast without rescanning everything). Iceberg's snapshot isolation, metadata tree, and open access model check every box. Proprietary formats check none of them, because every new agent framework needs a new integration into the walled garden.

It's worth being specific about which Iceberg features carry the load, because "open table format" undersells what the spec actually provides. Snapshot isolation means every query, human or agent, reads a consistent point-in-time view of a table even while writers commit. Hidden partitioning means consumers write natural predicates like WHERE order_date > '2026-01-01' and the format handles partition pruning, so agents don't need tribal knowledge about physical layout to write fast queries. Schema and partition evolution mean tables adapt to changing business needs without rewrites or broken readers. Time travel means an agent's answer from last Tuesday can be reproduced exactly, which turns out to matter enormously when an AI-generated number ends up in a board deck and someone asks where it came from. And the Iceberg REST catalog specification means catalogs and engines interoperate through a standard API rather than one-off connectors.
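To make snapshot isolation and time travel concrete, here is a minimal toy model in Python. It is a sketch of the idea only, assuming a table is a list of immutable snapshots; `ToyTable` and its methods are hypothetical names, not Iceberg's metadata layout or any engine's API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy stand-in for a snapshot-based table; not Iceberg's real metadata layout."""
    snapshots: list = field(default_factory=lambda: [()])  # snapshot 0 is empty

    def commit(self, new_rows):
        # A commit never mutates an old snapshot; it appends a new immutable one.
        self.snapshots.append(self.snapshots[-1] + tuple(new_rows))
        return len(self.snapshots) - 1  # id of the snapshot just created

    def current_snapshot_id(self):
        return len(self.snapshots) - 1

    def scan(self, snapshot_id):
        # Readers always name the snapshot they read, so results are repeatable.
        return self.snapshots[snapshot_id]


orders = ToyTable()
tuesday = orders.commit([("o-1", 120.0), ("o-2", 80.0)])

# An agent pins the current snapshot at the start of its query...
pinned = orders.current_snapshot_id()
# ...and a writer commits while that query is still running.
orders.commit([("o-3", 500.0)])

total_seen_by_agent = sum(amount for _, amount in orders.scan(pinned))
total_latest = sum(amount for _, amount in orders.scan(orders.current_snapshot_id()))
total_replayed = sum(amount for _, amount in orders.scan(tuesday))
```

The agent's total ignores the concurrent commit because it reads a pinned snapshot, and replaying Tuesday's snapshot later returns Tuesday's number. That is the property that lets a figure in a board deck be traced back to the exact data it was computed from.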

None of these are exotic features. They're the table stakes of a trustworthy analytical substrate. The difference is that Iceberg delivers them in the open, on your storage, for every engine at once, where warehouses deliver them inside one vendor's walls.

So the destination is clear: data in Iceberg, agents on top. The question is how you get there without a two-year replatforming project. That's where most teams stall, and it's where Dremio's design choices start to matter.

The Four Blockers Between You and the Agentic Lakehouse

Talk to any team that's attempted this move and the same four problems come up.

First, migration itself. Your data lives in a warehouse, a handful of operational databases, and a pile of Parquet folders. Moving it all to Iceberg means rewriting pipelines while hundreds of dashboards and downstream consumers keep depending on the old locations. Big-bang cutovers fail often enough that most architects won't sign off on them, and for good reason.

Second, ongoing management. An Iceberg lakehouse isn't a set-it-and-forget-it system. Streaming and frequent writes create thousands of small files. Metadata bloats. Old snapshots pile up. Someone has to schedule compaction, clustering, and vacuum jobs, and someone has to build and babysit the materialized views that keep dashboards fast.

Third, business meaning for AI. An agent pointed at raw tables named tbl_cust_ord_v3 will hallucinate joins and invent metric definitions. Agents need a semantic layer with documented, governed definitions, plus tooling to query it. Buying a separate semantic layer product and building custom agent tooling on top is a six-month project before the first useful answer.

Fourth, unstructured data. Contracts, invoices, support tickets, and scanned documents hold answers your agents need, but they're not rows in a table. The traditional fix is a separate OCR and extraction pipeline with its own infrastructure, its own failure modes, and its own team.

Dremio addresses each of these in sequence. Let's take them one at a time.

Problem 1: Migrating Your Data to the Lakehouse Without Breaking Anything

The standard migration playbook is brutal. Stand up the new platform, rebuild every pipeline, repoint every dashboard, run both systems in parallel for months, and pray the numbers match. Conventional modernization projects routinely run 6 to 18 months before users see any value, and the riskiest moment is the cutover itself.

Dremio replaces that playbook with two capabilities working together: Zero-ETL Federation and the semantic layer.

Zero-ETL Federation means Dremio queries data where it currently lives. Connect your existing PostgreSQL, SQL Server, Oracle, Snowflake, MongoDB, S3 buckets, and 35+ other source types, and Dremio presents them all behind one SQL interface. A single query can join a customer table still sitting in your warehouse with clickstream events already landed in Iceberg, and the person running it never knows the difference. Dremio pushes predicates and partial work down to each source so federated queries stay efficient rather than dragging full tables across the network.

The semantic layer is where the migration strategy actually lives. On top of those federated sources, you build virtual views in Dremio that model every one of your use cases: a raw layer of views that standardize each source, a business layer that applies logic and joins, and an application layer that serves specific dashboards, reports, and agents. Your BI tools, notebooks, and AI agents all connect to these views, never to the physical sources underneath.

That indirection is the whole trick. Once every consumer reads from views, the physical location of the data becomes an implementation detail you can change whenever you want. The migration pattern looks like this:

  1. Point a raw view at the legacy source (say, raw.orders reading from PostgreSQL) and build your business views on top of it.
  2. Migrate that one dataset to an Apache Iceberg table on object storage on your own schedule, validating row counts and values while the legacy path keeps serving production.
  3. Update the SQL definition of raw.orders to select from the new Iceberg table instead of PostgreSQL.

Every subsequent query, from every dashboard and every agent, now runs against Apache Iceberg. No consumer changed a connection string. No downtime window was negotiated. No end user noticed anything except that queries got faster. Then you move to the next dataset. Week one might be orders, week three might be customers, and the warehouse drains incrementally while production never blinks.
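The validation in step 2 is worth sketching, since it is what makes the swap safe. The example below is a minimal illustration, assuming SQLite as a stand-in for both the legacy source and the lakehouse; the `legacy` and `lakehouse` schema names, the sample rows, and the helper functions are all hypothetical. In practice you would run equivalent comparison queries through your own engine.

```python
import sqlite3

# SQLite stands in for both systems here; the schema names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS legacy")
conn.execute("ATTACH DATABASE ':memory:' AS lakehouse")

rows = [(1, 10, 25.00), (2, 11, 40.50), (3, 10, 12.25)]
for schema in ("legacy", "lakehouse"):
    conn.execute(
        f"CREATE TABLE {schema}.orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(f"INSERT INTO {schema}.orders VALUES (?, ?, ?)", rows)


def profile(schema):
    """Return cheap comparison metrics: row count, distinct keys, and amount total."""
    return conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT order_id), ROUND(SUM(amount), 2) FROM {schema}.orders"
    ).fetchone()


def rows_only_in(left, right):
    """Rows present in `left` but missing from `right` (a set difference via EXCEPT)."""
    return conn.execute(
        f"SELECT * FROM {left}.orders EXCEPT SELECT * FROM {right}.orders"
    ).fetchall()


counts_match = profile("legacy") == profile("lakehouse")
missing = rows_only_in("legacy", "lakehouse")
unexpected = rows_only_in("lakehouse", "legacy")
safe_to_swap = counts_match and not missing and not unexpected
```

Aggregate profiles catch gross problems cheaply, and the two set differences catch value-level drift that matching counts can hide. Only when both agree would you redefine the view.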

In SQL terms the swap is almost anticlimactic. Before the migration, the raw view reads from the legacy source:

CREATE OR REPLACE VIEW raw.orders AS
SELECT order_id, customer_id, amount, order_date
FROM postgres_prod.public.orders;

After you've landed and validated the Iceberg copy, you redefine the same view:

CREATE OR REPLACE VIEW raw.orders AS
SELECT order_id, customer_id, amount, order_date
FROM lakehouse.sales.orders;

Same name, same columns, same downstream views, new physical home. The next query against any view built on raw.orders resolves to the Iceberg table. If validation later turns up a discrepancy, rollback is the same one statement pointed back at PostgreSQL. Compare that to a traditional cutover, where rollback means a war room.
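If you want to convince yourself that the indirection really does isolate consumers, the pattern can be reproduced in a few lines on any SQL engine. The sketch below uses SQLite purely as a stand-in, with hypothetical table and view names, and it uses DROP VIEW plus CREATE VIEW because SQLite has no CREATE OR REPLACE VIEW. It shows the view pattern, not Dremio itself.

```python
import sqlite3

# SQLite is only a stand-in; table names are hypothetical, and SQLite needs
# DROP VIEW + CREATE VIEW where the article's SQL uses CREATE OR REPLACE VIEW.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE postgres_orders (order_id INTEGER, amount REAL);
    CREATE TABLE iceberg_orders  (order_id INTEGER, amount REAL);
    INSERT INTO postgres_orders VALUES (1, 25.0), (2, 40.5);
    INSERT INTO iceberg_orders  VALUES (1, 25.0), (2, 40.5);

    CREATE VIEW raw_orders AS SELECT order_id, amount FROM postgres_orders;
    -- Consumers only ever see this business view, never the physical tables.
    CREATE VIEW business_revenue AS SELECT SUM(amount) AS revenue FROM raw_orders;
""")


def point_raw_orders_at(table):
    """Redefine raw_orders over a different physical table; downstream views are untouched."""
    conn.executescript(f"""
        DROP VIEW raw_orders;
        CREATE VIEW raw_orders AS SELECT order_id, amount FROM {table};
    """)


def consumer_query():
    """What a dashboard or agent runs; it never changes during the migration."""
    return conn.execute("SELECT revenue FROM business_revenue").fetchone()[0]


before = consumer_query()
point_raw_orders_at("iceberg_orders")    # the swap
after_swap = consumer_query()
point_raw_orders_at("postgres_orders")   # rollback is the same call
after_rollback = consumer_query()
```

The consumer query text is identical in all three calls. Only the definition of `raw_orders` moved, which is why the swap and the rollback are both invisible downstream.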

During the transition, federation also means you're never stuck half-migrated. A query can join the already-migrated lakehouse.sales.orders Iceberg table against a payments table still in PostgreSQL, and it works exactly like a join between two Iceberg tables. The mixed state that kills most migrations is just another Tuesday for a federated engine.

Reflections make this migration phase faster than it has any right to be. A Reflection is a precomputed, optimized materialization that Dremio's optimizer substitutes into queries automatically, with no SQL changes from the user. Here's the detail most people miss: Dremio stores Reflections as Apache Iceberg tables on your data lake, even when the anchor dataset is a federated source like PostgreSQL or MongoDB. So during migration, a Reflection on a slow legacy source gives your users Iceberg-backed performance before you've migrated a single byte of that source. Dremio uses Iceberg to speed up your non-Iceberg data. The rest of the industry uses proprietary formats to speed up Iceberg. That inversion tells you a lot about where Dremio's focus sits.

There's a useful side effect, too. Those Reflections are themselves Iceberg tables built from your legacy sources, which means your acceleration layer doubles as a dress rehearsal for the migration. You learn how your data behaves in Iceberg while the source of truth is still the old system.

[Figure: Apache Iceberg Migration with Dremio]

Problem 2: Managing the Lakehouse So It Doesn't Manage You

Migration gets you to Iceberg. Staying fast on Iceberg is a different job, and historically it's been a thankless one. Tables fragment into small files as writes accumulate. Partition layouts drift away from query patterns. Snapshots and orphan files inflate storage. And acceleration turns into a part-time career: deciding which materialized views to build, scheduling their refreshes, rewriting queries to hit them, and tearing them down when workloads shift.

Dremio's answer is to make the lakehouse autonomous. The platform watches activity through its Active Metadata system, which continuously analyzes query patterns, data relationships, and usage trends, and then it acts on what it learns without waiting for a human.

On the storage side, Dremio runs Automated Table Optimization for Iceberg tables in its Open Catalog: compaction to merge small files into well-sized ones, clustering to physically reorganize data layouts around real access patterns, and vacuum to expire old snapshots and remove orphan files. These run as background maintenance jobs. You don't size them, and you don't get paged when a streaming table quietly accumulates 40,000 tiny files, because Dremio already merged them.

On the acceleration side, the Reflections you used during migration get a serious upgrade once your data is in Iceberg:

Autonomous Reflections remove the design work entirely. Dremio analyzes your query workload over a rolling seven-day window, figures out which materializations would help, then creates, refreshes, and drops Reflections on its own. It targets queries that take at least a second and skips ones already served by cache, so it spends compute exactly where users feel pain. No one on your team decides what to materialize anymore. The platform does, and it revises that decision as workloads change. That's perfect for a world where agent query patterns change faster than manual acceleration can keep up.

Live Reflections kill the staleness problem. Because Iceberg exposes table changes through snapshots, Dremio detects when an anchor table changes (polling as often as every 10 seconds) and triggers a refresh immediately. Scheduled refreshes against unchanged data get recognized as redundant and skipped, so you stop burning compute to rebuild things that didn't change.

Incremental Refresh makes those updates cheap. Dremio reads Iceberg's snapshot metadata to identify exactly which records were added, modified, or deleted since the last refresh, and processes only that delta instead of rebuilding the whole materialization. On a 10-billion-row table where last night's load touched 0.2% of rows, that's the difference between minutes and hours of compute.
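The arithmetic and the mechanism are both easy to sketch. The Python below is a toy illustration under stated assumptions: the materialization is a simple sum per region, a modified row is modeled as a delete plus an add, and the function names are hypothetical. It is not Dremio's refresh implementation, only the general delta-maintenance idea.

```python
from collections import defaultdict

# Worked arithmetic for the example in the text: rows touched by last night's load.
TABLE_ROWS = 10_000_000_000
rows_in_delta = TABLE_ROWS * 2 // 1000   # 0.2% of the table


def full_rebuild(rows):
    """Recompute revenue per region from every row (the expensive path)."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


def apply_delta(materialized, added, deleted):
    """Update the materialization using only rows that changed since the last refresh."""
    totals = defaultdict(float, materialized)
    for region, amount in added:
        totals[region] += amount
    for region, amount in deleted:
        totals[region] -= amount
    return dict(totals)


base = [("emea", 100.0), ("amer", 250.0), ("emea", 50.0)]
materialized = full_rebuild(base)

# The delta since the last refresh: two new rows and one removed row.
added = [("apac", 75.0), ("amer", 25.0)]
deleted = [("emea", 50.0)]
incremental = apply_delta(materialized, added, deleted)

# The table as it looks after the load, used to check the incremental result.
new_table = [("emea", 100.0), ("amer", 250.0), ("apac", 75.0), ("amer", 25.0)]
```

On the table from the text, the delta is 20 million rows instead of 10 billion, and the incremental result matches a full rebuild over the new table contents.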

Then there's the caching stack underneath. The query plan cache stores the physical plan of executed queries, so repeated queries (the lifeblood of BI dashboards) skip compilation and go straight to execution. The results cache goes further: deterministic queries on unchanged Iceberg data return prior results instantly, spooled as Arrow files to distributed storage and shared across coordinators and clients, whether the query arrives over the console, JDBC, ODBC, REST, or Arrow Flight. And the Columnar Cloud Cache (C3) keeps frequently accessed columnar data on local NVMe at the executor nodes, cutting up to 90% of object storage I/O costs and turning cloud-storage latency into local-disk speed.
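Why does "unchanged Iceberg data" matter for a results cache? One way to picture it, as a sketch and not a description of Dremio's internals, is a cache keyed on the query text plus the table's snapshot id. The class below is hypothetical, and the normalization is deliberately crude (a real engine would not lowercase string literals).

```python
class ToyResultsCache:
    """Illustrative only: a result is reusable while the table snapshot is unchanged."""

    def __init__(self):
        self._results = {}
        self.executions = 0

    def query(self, sql, snapshot_id, run):
        # Crude normalization: collapse whitespace and ignore case.
        key = (" ".join(sql.split()).lower(), snapshot_id)
        if key not in self._results:
            self.executions += 1          # cache miss: actually run the query
            self._results[key] = run()
        return self._results[key]


cache = ToyResultsCache()
run_query = lambda: 65.5  # stands in for real query execution

first = cache.query("SELECT SUM(amount) FROM orders", snapshot_id=7, run=run_query)
repeat = cache.query("select  sum(amount)  from orders", snapshot_id=7, run=run_query)
executions_on_same_snapshot = cache.executions

# A new commit produces a new snapshot id, so the old entry no longer applies.
cache.query("SELECT SUM(amount) FROM orders", snapshot_id=8, run=run_query)
executions_after_commit = cache.executions
```

Because a snapshot id changes only when the data changes, it gives the cache a precise invalidation signal instead of a time-based guess.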

Stack it up and the operational picture changes shape. Compaction, clustering, vacuum, materialization design, refresh scheduling, and cache management all move from your team's backlog to the platform's job description. Your engineers stop juggling materialized views and start shipping data products. Dremio's claim of 10x data engineering productivity is aggressive, but the mechanism behind it is concrete: the platform absorbed an entire category of recurring work.

Warehouse Speed on Iceberg, Because Dremio Is Iceberg-Native

A reasonable skeptic asks: can an engine reading open files on object storage really match a warehouse that controls its own proprietary format? With Dremio the answer is yes, and the reason is architectural rather than a bag of tricks. Apache Iceberg is the engine's first-class format. Dremio is Iceberg-native top to bottom.

That phrase gets thrown around loosely, so let's be precise about what it means here. Most platforms bolted Iceberg support onto an engine designed for something else. They read Iceberg by converting it, mirroring it, or treating it as an external table with reduced features, and you pay a performance tax at the boundary. Dremio took the opposite path. Its query engine reads Iceberg's metadata tree directly for planning, prunes partitions and files from Iceberg statistics before touching any data, executes on Apache Arrow's columnar in-memory format (which was co-created by Dremio founders and founding engineers along with projects like Apache Drill, Apache Parquet, and Apache Calcite) with LLVM code generation, and writes its own acceleration structures, the Reflections, as Iceberg tables. There is no translation layer because there's nothing to translate. Iceberg in, Arrow through, Iceberg out.

The numbers Dremio puts behind this: 20x performance on Iceberg tables at the lowest cost, up to 100x faster queries with Reflections, and sub-second response for interactive workloads. Shell processes 6 to 8 billion records in minutes for production forecasting on this stack, with more than 100 concurrent forecasting models running at enterprise scale.

The strategic point matters more than any single benchmark. Because Dremio's speed comes from Iceberg plus Arrow plus caching rather than from a proprietary format, every performance investment you make stays portable. Your fast tables are still just Iceberg tables that Spark, Flink, or any future engine can read. You never face the choice between performance and openness, which is exactly the choice proprietary-first platforms are designed to force.

Problem 3: Building AI Agents With Solid Business Meaning

Performance and migration are solvable engineering problems. The harder blocker for agentic analytics is meaning. An LLM agent handed raw schema names will guess, and it will guess confidently. Ask it for "monthly active customers" against undocumented tables and you'll get an answer. You just won't get the same answer twice, and neither will the agent your finance team runs.

The fix is a semantic layer: governed views, documented definitions, consistent metrics, lineage, and business vocabulary that both humans and agents read from the same place. And here's where the typical buying pattern goes wrong. Teams assemble a catalog from one vendor, a semantic layer from another, an agent framework from a third, then spend two quarters writing glue code so the agent can actually use the other two. Every integration is a seam where context leaks and governance breaks.

Dremio's position is that none of that should be a separate purchase. The AI Semantic Layer, the AI Agent, the MCP server, and the CLI are all parts of the same platform, sharing the same definitions and the same access controls.

Start with the AI Semantic Layer itself. It's virtual, built from SQL views rather than copies, which means it spans every source Dremio federates. That's worth pausing on, because it breaks a boundary every other semantic layer respects. A semantic layer tied to one warehouse can only give meaning to data inside that warehouse. Dremio's semantic layer gives one consistent set of definitions across your warehouse, your operational databases, your Iceberg lakehouse, and your object storage at the same time. "Monthly Revenue" means one thing whether the underlying bytes sit in Snowflake, PostgreSQL, or an Iceberg table on S3. Wikis document datasets and columns. Labels group related objects. Lineage tracks how every view derives from its sources. And Dremio uses generative AI to help maintain all of it, sampling tables to draft wiki descriptions and labels so the catalog becomes a living encyclopedia for the business rather than a documentation graveyard.

On top of that context sits the embedded Dremio AI Agent, built into the console and ready out of the box. It's a conversational interface that does real analytical work: it runs semantic search across the catalog (names, wikis, labels, metadata) to find the right datasets, writes and executes SQL grounded in the semantic layer's definitions, generates visualizations you can catalog and revisit, detects patterns and returns narrative insights alongside the charts, explains and optimizes existing SQL, and diagnoses slow jobs. It also helps with the unglamorous work that makes data teams effective: drafting documentation for datasets and working out the SQL to capture the data models you describe in plain language. Every action respects the privileges of the logged-in user, every tool call is auditable in the chat window, and none of it required you to integrate anything.

The same capabilities extend to agents you build or already use. The Dremio MCP Server exposes the platform through the Model Context Protocol, the open standard for connecting LLMs to tools. Each Dremio Cloud project includes its own built-in MCP server, so Claude, ChatGPT, Gemini, LangChain agents, or your custom agentic application can discover datasets, search the semantic layer for context, fetch schemas, and run governed SQL through tools like RunSqlQuery, GetSchemaOfTable, and RunSemanticSearch. You don't host a connector or design custom tooling. The agent inherits the user's identity and access controls automatically, so an agent can never see data its human couldn't.

For locally running and terminal-based agents, there's the Dremio CLI, an AI-agent-first command line interface built for coding agents like Claude Code and Codex, and equally at home with local agent runtimes like Claude Cowork, OpenClaw, or Hermes. The CLI covers queries, catalog operations, schemas, Reflections, jobs, and access management, with input validation designed for the reality that an AI will be constructing the commands. Pair it with Dremio's published agent skills and your coding agent becomes a competent lakehouse operator in an afternoon.

Two more pieces complete the agentic picture, and both are easy to underestimate until an agent program scales.

The first is governance that travels with the agent. Every path into Dremio (the embedded agent, MCP, the CLI, plain SQL) enforces the same fine-grained and role-based access controls, with OAuth tokens flowing through credential vending all the way to the underlying sources. An agent acting for a regional manager sees that region's rows and nothing else, not because someone wrote agent-specific policy, but because the agent literally is that user from the platform's perspective. When the compliance team asks how you govern AI access to customer data, the answer is one sentence: the same way you govern human access, in the same system, with audit trails on every query.
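The design point here is a single enforcement path shared by every entry point. A minimal sketch of that shape follows, with hypothetical users, policy data, and function names; it illustrates the architecture only and does not use Dremio's access-control API.

```python
ROWS = [
    {"region": "emea", "revenue": 100},
    {"region": "amer", "revenue": 250},
]
USER_REGIONS = {"regional_manager_emea": {"emea"}}  # hypothetical policy data


def run_governed(user, rows=ROWS):
    """Single enforcement point: every entry path calls this with the end user's identity."""
    allowed = USER_REGIONS.get(user, set())
    return [row for row in rows if row["region"] in allowed]


def via_sql_client(user):
    return run_governed(user)


def via_agent(acting_for_user):
    # The agent carries the human's identity rather than a privileged service account.
    return run_governed(acting_for_user)


agent_rows = via_agent("regional_manager_emea")
human_rows = via_sql_client("regional_manager_emea")
```

Since both paths funnel through the same check with the same identity, there is no agent-specific policy to write and none to forget.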

The second is performance under agent-scale load. Agents probe. They run schema discovery, sample data, try a query, refine it, and try again, generating a long tail of similar-but-not-identical queries that would flatten a manually tuned acceleration layer. This is precisely the workload Autonomous Reflections were built for: Dremio's analysis targets clusters of similar queries with slight variations, exactly the shape agent traffic takes, and the results cache absorbs the identical repeats. Sub-second answers aren't a luxury for agents. An agent that waits 40 seconds per query takes minutes per reasoning loop, and the experience dies. The acceleration stack from Problem 2 is what makes the agent experience from Problem 3 feel instant.
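The latency claim is simple multiplication, and it helps to see it written out. The sketch below assumes five sequential queries per reasoning loop and a half-second accelerated response. Both figures are illustrative assumptions, and only the 40 seconds comes from the text.

```python
def loop_seconds(queries_per_loop, seconds_per_query):
    """Wall-clock time one reasoning loop spends waiting on sequential queries."""
    return queries_per_loop * seconds_per_query


QUERIES_PER_LOOP = 5  # assumption: discover schema, sample, query, refine, re-run

slow_loop = loop_seconds(QUERIES_PER_LOOP, 40.0)   # 40 s per query, from the text
fast_loop = loop_seconds(QUERIES_PER_LOOP, 0.5)    # assumed sub-second response

slow_minutes = slow_loop / 60
```

Under those assumptions the slow loop spends 200 seconds waiting, a little over three minutes, while the accelerated loop spends 2.5 seconds. An agent that needs several loops to reach an answer multiplies that gap again.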

Now connect this back to the migration story, because this is the part that changes project plans. Dremio's semantic layer abstracts where data is physically stored. The AI Agent, the MCP server, and the CLI all operate on the semantic layer, not on storage. Which means agentic analytics works on day one, against your federated sources, before you've migrated anything to Iceberg. Your agents answer questions over data still sitting in PostgreSQL and Snowflake using the same governed definitions they'll use after the move. The Iceberg migration stops being a prerequisite for agentic analytics and becomes a performance and cost upgrade that happens underneath agents already in production. Most platforms make you finish the boring project before starting the exciting one. Dremio lets you run them in parallel, and the early agent wins are usually what get the migration funded.

[Figure: Dremio's Agentic Analytics Feature Set]

Problem 4: Unstructured Data Without a Separate OCR Pipeline

Somewhere in your object storage right now there's a folder of PDFs that matters more than half your tables. Invoices. Contracts. Inspection reports. Resumes. Support transcripts. Industry estimates put 80 to 90% of enterprise data in unstructured form, and almost none of it participates in analytics, because getting it into rows traditionally requires a separate extraction stack: OCR services, document parsers, orchestration, error handling, and a pipeline team to keep it all running.

Dremio's answer is to make documents queryable with SQL. The platform embeds LLM calls directly into the engine as AI Functions: AI_GENERATE, AI_CLASSIFY, AI_COMPLETE, and the table function LIST_FILES. No Python service, no external orchestration, no data leaving your governed environment.

LIST_FILES is the bridge. Point it at a directory in connected storage (S3, ADLS, GCS) and it returns the files as rows, each with metadata plus a file struct you can hand to the other functions. It handles PDFs, images, Word documents, text files, and scanned documents through multimodal vision models. AI_GENERATE then extracts whatever you ask for, and its WITH SCHEMA clause forces the LLM to return typed, named fields rather than a blob of prose.

Put them together and an extraction pipeline collapses into one statement:

CREATE TABLE gold.invoices AS
SELECT
  file['path'] AS source_file,
  invoice_data.vendor_name,
  invoice_data.invoice_number,
  invoice_data.total_amount
FROM (
  SELECT
    file,
    AI_GENERATE(
      ROW('Extract vendor name, invoice number, and total amount from this invoice.', file)
      WITH SCHEMA ROW(
        vendor_name VARCHAR,
        invoice_number VARCHAR,
        total_amount DECIMAL(12,2)
      )
    ) AS invoice_data
  FROM TABLE(LIST_FILES('@company_s3/invoices/2025'))
  WHERE file['path'] LIKE '%.pdf'
);

Read what that statement actually does. It scans a folder of invoice PDFs in S3, extracts three typed fields from each document, and materializes the results as a governed Apache Iceberg table. The documents become rows. The rows become part of the semantic layer. The semantic layer feeds your agents and dashboards. A workload that used to mean standing up a document-processing service now ships in a SQL Runner tab before lunch.
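It may help to see what "typed, named fields" buys you compared with free-form model output. The Python below is a sketch of the concept only: the schema dictionary, the `coerce_to_schema` helper, and the stubbed model output are all hypothetical, and this is not how Dremio enforces WITH SCHEMA internally.

```python
from decimal import Decimal

# Hypothetical schema mirroring the WITH SCHEMA clause above: field name -> converter.
INVOICE_SCHEMA = {
    "vendor_name": str,
    "invoice_number": str,
    "total_amount": lambda value: Decimal(str(value)).quantize(Decimal("0.01")),
}


def coerce_to_schema(raw, schema=INVOICE_SCHEMA):
    """Keep only declared fields, converted to declared types; fail loudly if one is missing."""
    missing = [name for name in schema if name not in raw]
    if missing:
        raise ValueError(f"model output is missing fields: {missing}")
    return {name: convert(raw[name]) for name, convert in schema.items()}


# Stubbed model output for one document; a real call would go to an LLM.
raw_output = {
    "vendor_name": "Acme Corp",
    "invoice_number": "INV-0042",
    "total_amount": "1234.5",
    "notes": "free text the schema does not ask for",
}
row = coerce_to_schema(raw_output)
```

The extra `notes` key is dropped and the amount comes back as a two-decimal `Decimal`, so downstream SQL can sum it without parsing prose. A document missing a declared field fails loudly instead of producing a silently incomplete row.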

The other functions round out the toolkit. AI_CLASSIFY constrains the model to one value from a list you supply, which makes it reliable for sentiment labeling, document triage, and routing. AI_COMPLETE handles free-form generation like summaries and descriptions. Model providers are pluggable (OpenAI, Anthropic, Google Gemini, AWS Bedrock, Azure OpenAI, or Dremio's hosted model), and neither Dremio nor the providers train on your data.

A few production habits make this scale well. Materialize extraction results with CTAS so you pay for each LLM call once instead of on every dashboard refresh. Layer Reflections on the output tables so downstream queries run at interactive speed with zero additional LLM cost. And use workload management rules to route AI-function queries to a dedicated engine so a big extraction job never slows your BI traffic. All three are configuration, not architecture.

[Figure: Dremio working with Unstructured Data]

One Platform Instead of a Stack of Point Products

Step back and notice what you didn't have to buy in any of the four solutions above.

You didn't buy a separate virtualization product for migration, then a separate semantic layer to give the data meaning, then a separate catalog to govern Iceberg, then a separate table-maintenance service, then an agent framework, then a text-to-SQL vendor, then a document AI platform. That seven-product stack is a real architecture being sold to real companies right now, and every seam in it is a place where definitions drift, permissions diverge, and projects die in integration.

Dremio bundles the whole path into one platform with one security model. The federated query engine, the Iceberg-native lakehouse with its Open Catalog powered by Apache Polaris, the AI Semantic Layer, the embedded AI Agent, the MCP server, the CLI, and the AI Functions all share the same views, the same wikis and labels, and the same fine-grained access controls. When the agent answers a question, it's reading the same governed definition your BI dashboard reads. When AI_GENERATE writes an Iceberg table, that table lands in the same catalog the rest of your data lives in, with the same lineage and the same permissions.

The consolidation shows up on the invoice, too. Every point product in that stack carries its own license, its own infrastructure, and its own specialist headcount, and the integration work between them is paid for in engineering quarters. A single platform on open storage flips the cost structure: one Iceberg copy on object storage instead of duplicated marts, C3 trimming up to 90% of I/O costs, autonomous features replacing manual tuning labor, and a 99.97% uptime SLA on the managed service so reliability isn't another thing your team builds. Lowest cost is part of Dremio's stated value proposition, and the architecture is why the claim holds: you're not paying anyone to store your data twice or to glue your own products together.

There's also a credibility angle that matters for anything built on open standards. Dremio co-created Apache Arrow and Apache Polaris and is a key contributor to Apache Iceberg. The claim "the only lakehouse built natively on Apache Iceberg, Polaris, and Arrow" isn't marketing applied after the fact. The company helped write the standards the platform runs on, which is the strongest assurance you can get that "open" won't quietly become "open, but" three renewals from now.

The Value, End to End

Run the whole arc back through the lens of a team that starts today.

Day one, you connect Dremio to your existing sources and build views. Analysts get one SQL interface across everything, and the built-in AI Agent starts answering natural-language questions against data that hasn't moved an inch. Agentic analytics is live before any migration begins, because the semantic layer abstracts storage.

Over the following months, you migrate dataset by dataset to Apache Iceberg using the view swap pattern. You update a view definition, every downstream query silently shifts to Iceberg, and no consumer experiences downtime. Reflections (stored as Iceberg tables even for legacy sources) keep everything fast through the transition.

As tables land in Iceberg, the platform takes over the operations work. Automated compaction, clustering, and vacuum keep storage healthy. Autonomous Reflections design and manage your acceleration layer from observed query patterns. Live and incremental refresh keep materializations current for pennies. The plan cache, results cache, and C3 squeeze latency and I/O cost out of every repeated workload. Your engine runs Iceberg as its first-class format, so you get up to 20x performance on Iceberg tables and up to 100x with Reflections without surrendering openness, and any other Iceberg engine can still read every table.

Meanwhile your agents multiply. The embedded AI Agent serves analysts in the console. The MCP server plugs Claude, ChatGPT, Gemini, and your custom applications into the same governed context. The CLI puts the lakehouse in reach of coding agents and local runtimes. And AI Functions keep folding the unstructured world (the PDFs, the scans, the contracts) into Iceberg tables those agents can query.

That's the agentic lakehouse: open Iceberg storage you own, a platform that manages itself, and AI agents with real business meaning, reached incrementally instead of through a leap of faith. Each of the four classic blockers (migration risk, maintenance burden, missing context, unstructured data) turns out to be a feature of fragmented architectures rather than a law of nature. Put the engine, the lakehouse, and the agent layer in one platform and the blockers mostly dissolve.

The honest caveat is that no platform removes the need for judgment. You still decide what your business metrics mean, which datasets deserve curation first, and where federation should give way to migrated Iceberg storage for heavy workloads. What Dremio removes is everything between those decisions and their execution.

Here's a concrete way to test the argument. Connect a database and an S3 bucket, build one view that joins them, and ask the AI Agent a business question about the result. That single exercise demonstrates federation, the semantic layer, and agentic analytics in under an hour, on data you haven't migrated. If you want to see what an Apache Iceberg lakehouse with built-in agentic analytics feels like before committing to a migration plan, start a free Dremio trial at dremio.com/get-started.

If the test holds up, the rollout sequence writes itself. Curate wikis and labels on your ten most-asked-about datasets first, because curating semantic context is the most valuable hour an agent program can spend. Hand the MCP connection to one team that already lives in Claude or ChatGPT and let their usage teach you what context is missing. Pick the slowest, most expensive workload in your warehouse as the first view-swap migration candidate, since that's where Iceberg plus Reflections pays back fastest. Then let Autonomous Reflections and Automated Table Optimization run for two weeks and compare your engineering backlog before and after. Each step is reversible, each one delivers value on its own, and none of them requires the others to finish first.

[Figure: Dremio end-to-end]
