Disclaimer: These opinions are mine and mine alone, not a reflection on my employer. While I may offer criticism of specific data platforms below, please trust me when I say I truly love all of them. If there's criticism here, please interpret it as "room to grow" and coming from a place of love.
A bit of history
The first "Separation of Storage and Compute" happened in the 2010s, as database platforms learned to scale their compute up and down independently from the underlying storage. Spark promised this but didn't deliver the necessary portability and ease of use, likely due to tension between open source ubiquity and the monetary incentives of its sponsor company, Databricks. Snowflake delivered better, with auto spin-down of compute enabled by default. This meant users could pay for just the compute they needed, with dirt-cheap storage backed by S3. This first separation let you scale up and down according to your query needs, with a near-zero always-on cost.
The second revolution is already here
Today we are seeing a second revolution in the separation of storage and compute: this time, separating the vendor you use for compute from the vendor you use for storage. Now you can connect your BI tool to your lake without spinning up a Snowflake cluster. You can migrate from Databricks to Snowflake and back without paperwork and without any vendor lock-in.
As an analogy, consider Apple Music with its DRM versus DRM-free MP3s that you own free and clear. Every music app can play your MP3s, but no other music app can play your iTunes purchases. This is not a knock on Apple - it's simply an analogy for the revolution we are seeing today with the widespread acceptance and adoption of Iceberg. Now your lake lives in the cloud, and it's readable and writable by every tool in your toolbox.
Understanding the impact
To understand the impact of this new paradigm, just consider your situation: where you currently have vendor lock-in, and where you are paying vendors just for "SELECT *" access to your data...
- Is your org fully dependent on MS SQL and SQL Azure? No problem - it can read from Iceberg.
- Are you a hard-core Spark enthusiast? Again, no problem - you can read and write Iceberg tables with Spark, using whichever commodity hardware or commercial service you prefer in the moment.
- Does your BI tool want to query data every day at 5am, but you don't want to pay Snowflake just for being an intermediate "SELECT *" compute passthrough? No problem. Just bypass Snowflake entirely and have your BI tool read directly from Iceberg.
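To make the Spark scenario above concrete, here is a minimal configuration sketch for pointing Spark at an Iceberg table in plain object storage. It assumes pyspark and the Iceberg Spark runtime jar are installed; the catalog name (`lake`), warehouse path, and table name are hypothetical placeholders, not anything from this article.

```python
# A sketch of Spark-on-Iceberg setup, under the assumptions stated above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    # Register an Iceberg catalog named "lake", backed by a warehouse
    # path in object storage (no warehouse vendor involved).
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Any Iceberg-aware engine (Snowflake, Trino, DuckDB, a BI tool, ...)
# can read and write this same table - that's the point of the format.
df = spark.table("lake.analytics.orders")
df.show()
```

Because the table metadata and data files live in your bucket rather than behind a vendor's API, swapping Spark here for another engine is a configuration change, not a migration.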
What it means for database service providers
In short, database service providers like Snowflake, MS SQL, Redshift, and Spark now have to make the case that they are the compute you want to use. Emphasis on great UI/UX, great performance, and great features will make all the difference. Even so, they can no longer rely on being the only - or even the primary - query interface for their existing users. They should expect users to increasingly mix and match, and to (smartly) bypass them in simple "SELECT *" use cases where there's no reason to pay for compute spin-up. Wherever write operations are expensive or lack features, they should expect users to mix and match write providers as well, leaning on services - or self-managed commodity compute - that can write data more cheaply or effectively.
What it means for application service providers
While the rest of this article might read like a "race to the bottom" on pricing, there's another huge and positive impact of this second revolution: every application, web service, service provider, and startup can now offer a direct, fast, cheap, and best-in-class data architecture to their users. Rather than leaning entirely on REST APIs - which are slow, cumbersome, and expensive to build and maintain - they can offer each user a personal data lake: free to query however the user likes, "zero-copy interoperable" with every major DB platform, and easily scalable down to zero and up to near-infinite concurrency.
This last part is what gets me personally very excited about Iceberg and the other open table formats that transcend vendor lock-in.
What do you think?
Do you share my enthusiasm, or do you think this is just more complexity in an already complex space? Are you worried about the race to the bottom, or are you excited (like me) that we'll soon all be free from vendor lock-in?