Salesforce Data Cloud Zero Copy: A Practical Guide
If you've spent any time around Salesforce Data Cloud over the past year, you've probably heard the phrase "zero copy" thrown around at every Dreamforce keynote, partner webinar, and LinkedIn thought leader post. At first I rolled my eyes a bit. Every cloud vendor has its buzzwords, and "zero copy" sounded like another one. But after working with it on a couple of real customer projects, I'll admit it: this is actually one of the more interesting things Salesforce has shipped in the data space, and most teams aren't using it the way they should be.
So let's go through what zero copy actually is, when you should use it, and the parts the marketing pages tend to skip over.
What Zero Copy Really Means in Data Cloud
Zero copy data federation lets Data Cloud (now also called Data 360 in some places) read data sitting in your Snowflake, Databricks, BigQuery, or Redshift warehouse without first ETL-ing that data into Salesforce. The data physically stays in the source system. Data Cloud queries it in place.
That's the whole pitch. No nightly extract jobs. No duplicate copies. No reconciliation when the warehouse number disagrees with the Data Cloud number because someone's pipeline broke at 2am.
In traditional CDP setups, you'd ingest your customer data from your warehouse into the CDP, transform it, then activate it. Each step is a copy of the data. With zero copy, Data Cloud effectively becomes a query layer on top of your existing warehouse for the federated tables, and only stores its own metadata about how those tables map to your customer profile.
If you're new to some of these terms, the team over at salesforcedictionary.com has a pretty solid glossary that covers Data Cloud objects, DMOs, and the federation concepts in plain English. Worth bookmarking if you're getting up to speed.
The Two Flavors: Query Federation vs File Federation
This is where I see most people get confused. Zero copy isn't one feature, it's two, and they behave differently.
Query federation is the original version. Data Cloud sends a SQL query to your warehouse; the warehouse runs it and returns the result. Your warehouse does all the compute, which means you're paying for warehouse credits every time someone in Data Cloud runs a calculated insight, segment, or activation that touches a federated table.
File federation is the newer approach, which uses open table formats - Apache Iceberg, the Iceberg REST Catalog, and Parquet files sitting in S3 or Azure. Data Cloud reads the Parquet files directly. Your warehouse isn't even in the loop for the read. Right now this works with Snowflake, Databricks, IBM, and any generic Iceberg catalog.
In practice, if your data team already publishes Iceberg tables (and a lot of modern data platforms do), file federation is faster, cheaper, and more flexible. If you're an all-in Snowflake shop and don't want to deal with Iceberg, query federation is the friendlier path. The downside of file federation is the setup is more involved - you've got catalog credentials, storage credentials, and a few extra steps to get right.
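The architectural difference is easiest to see in a sketch. Here's a toy Python model of the two modes - nothing in it is a real Salesforce or warehouse API, it just illustrates where the compute happens: under query federation the warehouse serves every read, under file federation Data Cloud reads the files and the warehouse is never invoked.

```python
# Toy model of the two federation modes. All names are illustrative,
# not a real Salesforce or warehouse API.

class Warehouse:
    """Stands in for Snowflake/Databricks/etc. Counts queries it serves."""
    def __init__(self, rows):
        self.rows = rows
        self.queries_served = 0

    def run_query(self, predicate):
        self.queries_served += 1  # this is where warehouse credits burn
        return [r for r in self.rows if predicate(r)]

def query_federation(warehouse, predicate):
    # Data Cloud pushes the query down; the warehouse does the work.
    return warehouse.run_query(predicate)

def file_federation(parquet_rows, predicate):
    # Data Cloud reads the (Iceberg/Parquet) files directly;
    # the warehouse is not in the loop for the read.
    return [r for r in parquet_rows if predicate(r)]

rows = [{"id": 1, "ltv": 50}, {"id": 2, "ltv": 500}]
wh = Warehouse(rows)
high_value = lambda r: r["ltv"] > 100

query_federation(wh, high_value)   # warehouse compute used
file_federation(rows, high_value)  # same answer, warehouse untouched
print(wh.queries_served)           # 1 - only the query-federation read hit it
```

The point of the counter: every segment refresh or calculated insight under query federation shows up on your warehouse bill; under file federation it doesn't.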
When Zero Copy Is the Right Call
I've found zero copy works best in three situations.
The first is when you've already got a mature data warehouse with curated customer data. You've spent two years building dbt models, you've got data quality checks running, and your analytics team is the source of truth. Re-ingesting all of that into Data Cloud is a waste. Federate it.
The second is governance and residency. If you have data that legally cannot leave your warehouse - certain healthcare data, EU records under GDPR localization rules, or data that's contractually scoped - zero copy lets Data Cloud reason over the data without actually moving it. That's a much shorter conversation with your legal team than explaining a new copy of the data living in Salesforce's tenant.
The third is cost. Storage in Data Cloud isn't free, and large historical fact tables - clickstream, transactions, telemetry - are expensive to keep duplicated. Federate the cold and warm data, ingest only the hot data you actively need to score or segment on.
When Zero Copy Will Bite You
Now the part the marketing slides leave out. Zero copy has real limitations and you need to know them before you architect around it.
Audience and segmentation latency. As of right now, only a limited number of Data Cloud audiences can include BYOL (bring-your-own-lake) datasets, and those audiences only refresh every 12 hours. If your use case is real-time activation - "send a journey when this customer's account balance changes" - federation isn't the answer for that signal. Ingest it.
Query volume ceilings. Some customers have hit ceilings around the multi-million-record mark when querying federated datasets. Your warehouse can handle a billion-row scan. Data Cloud's federated query planner is more conservative. Test with realistic volumes before you commit your architecture to it.
Feature gaps. A bunch of Data Cloud features assume the data lives natively in Salesforce. Identity resolution, certain calculated insights, some Einstein features - they may not work at all, or may not work fully, on federated DMOs. The gap is closing every release, but check the current state for your specific use case.
Latency sensitivity. Even file federation has a network hop. If you're trying to do millisecond-level lookups for an Agentforce agent or a real-time Marketing Cloud journey, native ingestion is still faster. Federation is great for analytical and batch-style activations, less great when a customer is sitting on a page waiting.
Getting It Set Up
The actual setup is more straightforward than the documentation makes it sound. Here's the rough sequence I run through:
- In your warehouse, create a service user with read-only access to the tables you want to expose. Don't reuse a human's account, and don't grant more than you need.
- In Data Cloud Setup, go to Data Federation and create a new connection. Pick the source type (Snowflake, Databricks, etc.) and provide the credentials.
- Choose query or file federation based on what your warehouse supports and what your team is comfortable operating.
- Map external tables to DMOs (Data Model Objects). This is the step where you tell Data Cloud "this Snowflake table called prod_analytics.customers_v2 is a Customer DMO, and these columns map to these standard fields." Spend time here. Bad mappings will haunt you for months.
- Validate with a small dataset before you point any segments or insights at it. Run a calculated insight, check the results, compare to a query you ran directly in the warehouse. The numbers should match.
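The validation step lends itself to a tiny script. A minimal sketch, assuming you've pulled one row count from a direct warehouse query and one from the federated DMO yourself (no official API call is implied here):

```python
def counts_match(warehouse_count, federated_count, tolerance=0):
    """Compare a row count from a direct warehouse query with the count
    reported against the federated DMO. With zero copy these should be
    identical; any drift points at a bad mapping or a stray filter."""
    return abs(warehouse_count - federated_count) <= tolerance

# Counts pulled manually from each side (made-up numbers):
counts_match(1_204_337, 1_204_337)  # healthy mapping
counts_match(1_204_337, 1_198_002)  # drift - go look at the mapping
```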
A small tip: name your federated DMOs with a prefix like FED_ so it's obvious to admins down the line which DMOs are native and which are federated. When something behaves weirdly six months from now, that prefix is going to save someone an afternoon of confused debugging.
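If you adopt the prefix convention, auditing it later is trivial. A throwaway sketch (the DMO names are made up):

```python
def split_by_prefix(dmo_names, prefix="FED_"):
    """Partition DMO API names into federated vs native by naming convention."""
    federated = sorted(n for n in dmo_names if n.startswith(prefix))
    native = sorted(n for n in dmo_names if not n.startswith(prefix))
    return federated, native

fed, nat = split_by_prefix(["FED_Transactions", "Customer", "FED_Clickstream"])
# fed -> ["FED_Clickstream", "FED_Transactions"], nat -> ["Customer"]
```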
How This Plays With Agentforce
Worth flagging because it's the question I get most often. If you're building Agentforce agents that need customer context, federated data is queryable through the same DMOs your agent topics reference. The agent doesn't know or care whether the data is native or federated. It just sees the DMO.
That said, remember the latency point above. If your agent is in a real-time conversation, you don't want it stalling for two seconds while a federated query runs against a warehouse that's busy with an analytics workload. Either pre-aggregate the data your agent actually needs into a native DMO, or ingest the latency-sensitive fields and federate the rest.
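One way to picture the "ingest the hot fields, federate the rest" pattern: a lookup that serves latency-sensitive fields from native data and only falls back to the federated path for everything else. Purely illustrative - none of these functions are real Agentforce or Data Cloud APIs, and the field values are invented.

```python
NATIVE_FIELDS = {"account_balance", "loyalty_tier"}  # ingested: fast path

def native_lookup(customer_id, field):
    # Stand-in for a read against natively ingested DMO data (fast).
    return {"account_balance": 1250.0, "loyalty_tier": "gold"}[field]

def federated_lookup(customer_id, field):
    # Stand-in for a federated read (slower: network hop, warehouse/files).
    return f"federated:{field}"

def agent_context(customer_id, field):
    """Serve hot fields natively; everything else tolerates federation."""
    if field in NATIVE_FIELDS:
        return native_lookup(customer_id, field)
    return federated_lookup(customer_id, field)
```

The design point is that the agent code never branches on "native vs federated" for correctness, only for latency - which mirrors how the DMO abstraction hides the storage location.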
If you're new to how Agentforce reads from Data Cloud or how DMOs feed into agent topics, salesforcedictionary.com has a few useful entries on the agent side of the architecture too. The terminology between the two products can get tangled and a glossary helps.
A Quick Decision Framework
Here's how I think about whether to federate or ingest a given dataset:
- High volume, no real-time latency requirement, lives in the warehouse already? Federate it. File federation if your warehouse supports Iceberg, query federation otherwise.
- Real-time activations, Einstein scoring, or identity resolution? Ingest it. The feature support and latency are better natively.
- One-time historical lookback, regulatory data, or analytics-style segmentation? Federate. You'll save storage cost and avoid duplicating the source of truth.
- Mixed? That's most real customers. Ingest the hot signals, federate the cold history. You don't have to pick one for the whole org.
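The framework above collapses into a few lines of pseudologic. This is a sketch of my heuristic, not anything Salesforce ships, and the parameter names are my own:

```python
def place_dataset(real_time=False, needs_identity_resolution=False,
                  needs_einstein=False, in_warehouse=True,
                  warehouse_supports_iceberg=False):
    """Return where a dataset should live, per the decision framework above."""
    if real_time or needs_identity_resolution or needs_einstein:
        return "ingest"  # feature support and latency are better natively
    if in_warehouse:
        # Prefer file federation when the warehouse publishes Iceberg tables.
        return "federate (file)" if warehouse_supports_iceberg else "federate (query)"
    return "ingest"  # not in the warehouse yet, so land it natively

place_dataset(real_time=True)                   # -> "ingest"
place_dataset(warehouse_supports_iceberg=True)  # -> "federate (file)"
place_dataset()                                 # -> "federate (query)"
```

For the "mixed" case, you'd run this per dataset (or even per field group), which is the point: it's a per-dataset call, not an org-wide one.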
What's Next
Zero copy is going to keep maturing. The early version was query federation only with Snowflake. We now have Iceberg-based file federation, multiple warehouse support, and reverse-direction sharing where Salesforce data shows up in Snowflake and Databricks too. The next 12 months are going to be about closing the feature parity gaps - making sure things like identity resolution, more Einstein models, and faster audiences all work transparently against federated data.
If you're standing up Data Cloud right now, my honest advice: don't default to ingesting everything. Start with a federation-first mindset, identify the small set of fields that genuinely need to live in Data Cloud for latency or feature reasons, and federate the rest. You'll have a leaner, cheaper, and more maintainable architecture for it.
Have you tried zero copy on a real project yet? Hit any of the limitations I mentioned, or others I missed? Drop a comment - I'm curious how teams are actually using it in production versus how the slides suggest they should be.
For more Salesforce terminology, definitions, and quick reference guides, check out salesforcedictionary.com - it's become my go-to lookup when onboarding new team members onto the Data Cloud stack.