Disclaimer: These opinions are mine and mine alone, not a reflection on my employer. While I may offer criticism of specific data platforms below, please trust me when I say I truly love all of them. If there's criticism here, please interpret it as "room to grow" and coming from a place of love.
A bit of history
The first "Separation of Storage and Compute" happened in the 2010s, as database platforms learned to scale their compute up and down independently from the underlying storage. Spark promised this but didn't deliver the necessary portability and ease of use, likely due to tension between open source ubiquity and the monetary incentives of its sponsor company, Databricks. Snowflake delivered better, with auto spin-down of compute enabled by default. This meant users could pay for just the compute they needed, with dirt-cheap storage backed by S3. This first separation let you scale up and down according to your query needs, with a near-zero always-on cost.
The second revolution is already here
Today we are seeing a second revolution in the separation of storage and compute: this time, separating the vendor you use for compute from the vendor you use for storage. Now you can connect your BI tool to your lake without spinning up a Snowflake cluster. You can migrate from Databricks to Snowflake and back without paperwork and without any vendor lock-in.
As an analogy, consider Apple Music with its DRM versus DRM-free MP3s that you own free and clear. Every music app can play your MP3s, but no other music app can play your iTunes purchases. This is not a knock on Apple - it's simply an analogy for the revolution we are seeing today with the widespread acceptance and adoption of Iceberg. Now your lake lives in the cloud, and it's readable and writable by every tool in your toolbox.
Understanding the impact
To understand the impact of this new paradigm, just consider your situation: where you currently have vendor lock-in, and where you are paying vendors just for "SELECT *" access to your data...
- Is your org fully dependent on MS SQL and SQL Azure? No problem - it can read from Iceberg.
- Are you a hard-core Spark enthusiast? Again, no problem - you can read and write Iceberg tables with Spark, using whichever commodity hardware or commercial service you prefer in the moment.
- Does your BI tool want to query data every day at 5am, but you don't want to pay Snowflake just for being an intermediate "SELECT *" compute passthrough? No problem. Just bypass Snowflake entirely and have your BI tool read directly from Iceberg.
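To make the Spark scenario above concrete, here is a minimal configuration sketch for pointing Spark at an Iceberg table in plain object storage. It assumes pyspark and the Iceberg Spark runtime jar are installed; the catalog name (`lake`), warehouse path, and table name are hypothetical placeholders, not anything from this article.

```python
# A sketch of Spark-on-Iceberg setup, under the assumptions stated above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    # Register an Iceberg catalog named "lake", backed by a warehouse
    # path in object storage (no warehouse vendor involved).
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Any Iceberg-aware engine (Snowflake, Trino, DuckDB, a BI tool, ...)
# can read and write this same table - that's the point of the format.
df = spark.table("lake.analytics.orders")
df.show()
```

Because the table metadata and data files live in your bucket rather than behind a vendor's API, swapping Spark here for another engine is a configuration change, not a migration.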
What it means for database service providers
In short, database service providers like Snowflake, MS SQL, Redshift, and Spark now have to make the case that they are the compute you want to use. Emphasis on great UI/UX, great performance, and great features will make all the difference. Even so, they can no longer rely on being the only - or even the primary - query interface for their existing users. They should expect users to increasingly mix and match, and to (smartly) bypass them in simple "SELECT *" use cases where there's no reason to pay for compute spin-up. Wherever write operations are expensive or lack features, they should expect users to mix and match write providers as well, leaning on services - or self-managed commodity compute - that can write data more cheaply or effectively.
What it means for application service providers
While the rest of this article might read like a "race to the bottom" on pricing, there's another huge and positive impact of this second revolution: every application, web service, service provider, and startup can now offer a direct, fast, cheap, and best-in-class data architecture to their users. Rather than leaning entirely on REST APIs - which are slow, cumbersome, and expensive to build and maintain - they can offer each user a personal data lake: free to query however the user likes, "zero-copy interoperable" with every major DB platform, and easily scalable down to zero and up to near-infinite concurrency.
This last part is what gets me personally very excited about Iceberg and the other open table formats that transcend vendor lock-in.
What do you think?
Do you share my enthusiasm, or do you think this is just more complexity in an already complex space? Are you worried about the race to the bottom, or are you excited (like me) that we'll soon all be free from vendor lock-in?