<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matt Fuller</title>
    <description>The latest articles on DEV Community by Matt Fuller (@mattfuller).</description>
    <link>https://dev.to/mattfuller</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1105855%2F4fd4387f-188e-4d43-98f2-e7c74bc8435b.png</url>
      <title>DEV Community: Matt Fuller</title>
      <link>https://dev.to/mattfuller</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mattfuller"/>
    <language>en</language>
    <item>
      <title>Iceberg, Delta Lake, and Hudi, oh my!</title>
      <dc:creator>Matt Fuller</dc:creator>
      <pubDate>Thu, 22 Jun 2023 16:01:07 +0000</pubDate>
      <link>https://dev.to/starburstdata/well-well-well-how-the-open-tables-have-turned-58og</link>
      <guid>https://dev.to/starburstdata/well-well-well-how-the-open-tables-have-turned-58og</guid>
      <description>&lt;p&gt;The rising popularity of the data lakehouse has led many to try to compare the merits of the open table formats underpinning this architecture: Apache Iceberg, Delta Lake, and Apache Hudi. If you look between the lines, the conversation is mostly driven by hype, making it hard to parse reality from marketing jargon. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xd_WKmR5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yfn6cougos205mk8uyqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xd_WKmR5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yfn6cougos205mk8uyqo.png" alt='Office meme of "how the turn tables" quote' width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article isn’t going to solve that problem. Instead, the goal is to introduce you to a new way of thinking about table formats – as a use case-level choice rather than an organization-level decision. &lt;/p&gt;

&lt;h2&gt;Choosing a table format&lt;/h2&gt;

&lt;p&gt;When deciding between table formats, it’s important to understand the similarities and differences that may impact performance and scalability. &lt;/p&gt;

&lt;p&gt;For example, Iceberg is currently the only table format with partition evolution support. This allows a table’s partitioning scheme to be changed without rewriting the table, and queries can still be optimized against every partition scheme the table has used.&lt;/p&gt;
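
&lt;p&gt;As a rough sketch of what partition evolution looks like in practice – the table and column names here are hypothetical – the Trino Iceberg connector can switch a table’s partitioning with a property update rather than a rewrite:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE orders
SET PROPERTIES partitioning = ARRAY['month(order_date)'];
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Files written before the change keep the old partition spec, while new writes use the new one – which is why queries can take advantage of both.&lt;/p&gt;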

&lt;p&gt;On the other hand, Iceberg’s streaming support lags behind Delta Lake and Hudi. So the question when picking a table format becomes – which is more important to your business: partitioning or streaming?&lt;/p&gt;

&lt;p&gt;Now, any seasoned data engineer knows that it’s not that simple. You don’t just have a single type of data in your systems or a single way you’re looking to interact with that data. Instead, you’re dealing with streaming pipelines, batch jobs, ad hoc queries, and more – all at the same time. And you don’t get to control what is added to that mix in the future.&lt;/p&gt;

&lt;p&gt;All of these factors make the binary decision – partitioning or streaming, Iceberg or Delta Lake – almost impossible to get right at the organization level. But most vendors require you to do just that.&lt;/p&gt;

&lt;h2&gt;Starburst’s approach&lt;/h2&gt;

&lt;p&gt;With Starburst, everything is built with openness in mind. We designed Starburst Galaxy to be interoperable with nearly any data environment, including first-class support for all modern open table formats.  &lt;/p&gt;

&lt;p&gt;This means that you can use the table format that is right for each of your workloads and change it when new needs emerge. You don’t need to worry about limited support for external tables or being locked into an old table format when new ones come along (and they will).&lt;/p&gt;

&lt;h2&gt;How it works&lt;/h2&gt;

&lt;p&gt;We wanted to make it as easy as possible to write to and read from different table formats, so we built Great Lakes connectivity – an under-the-hood process that abstracts away the details of using different table formats and file types. &lt;/p&gt;

&lt;p&gt;This connectivity is built into Starburst Galaxy, and is available to all users that are working with the following data sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon S3&lt;/li&gt;
&lt;li&gt;Azure Data Lake Storage&lt;/li&gt;
&lt;li&gt;Google Cloud Storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To create a table with one of these formats, you simply provide a “type” in the table DDL. Here is a simple example of creating an Iceberg table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE customer(
name varchar,
address varchar,
WITH (type='iceberg');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it! An Iceberg table has been created.&lt;/p&gt;
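
&lt;p&gt;Because the format is a per-table choice, a second table in the same schema can just as easily use a different format. A hypothetical Delta Lake table, for example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE orders (
    id bigint,
    customer_name varchar,
    total double
)
WITH (type = 'delta');
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;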

&lt;p&gt;To read a table using Great Lakes connectivity, you simply issue a SQL select query against it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM customer; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again… that’s it! End users shouldn’t need to worry about file types or table formats – they just want to query their data.&lt;/p&gt;
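
&lt;p&gt;And because the format is hidden behind the table, mixing formats in a single query works the same way. This hypothetical example joins the Iceberg customer table with a hypothetical Delta Lake orders table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT c.name, o.total
FROM customer c
JOIN orders o ON c.name = o.customer_name;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;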

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>iceberg</category>
      <category>deltalake</category>
    </item>
    <item>
      <title>Inside Starburst's hackathon: Into the great wide open(AI)</title>
      <dc:creator>Matt Fuller</dc:creator>
      <pubDate>Wed, 21 Jun 2023 15:50:12 +0000</pubDate>
      <link>https://dev.to/starburstdata/inside-starbursts-hackathon-into-the-great-wide-openai-3nl8</link>
      <guid>https://dev.to/starburstdata/inside-starbursts-hackathon-into-the-great-wide-openai-3nl8</guid>
      <description>&lt;p&gt;Generative AI has been taking the tech world by storm lately. Indie developers to large enterprises are experimenting with its impact not only in day-to-day jobs but also in driving new feature innovations. &lt;/p&gt;

&lt;p&gt;Starburst is no different. We’re always looking for ways to improve our technology and make the lives of data engineers easier. The current buzz around OpenAI lined up perfectly with our yearly Hack-a-Trino, and our team came up with some pretty cool concepts. &lt;/p&gt;

&lt;p&gt;Check out the top three projects below and let us know which one is your favorite by voting on r/dataengineering. Who knows – you might just see the winning concept become a reality.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Project A: Automatic Data Classification (LLM-based tagging)&lt;/li&gt;
&lt;li&gt;Project B: Trino AI Functions&lt;/li&gt;
&lt;li&gt;Project C: No Code Querying with ChatGPT&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Project A: Automatic Data Classification&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Built by Alex Breshears, Cody Zwiefelhofer, David Shea, Elvis Le&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://player.vimeo.com/video/838350523" width="710" height="399"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt; Automatic Data Classification is a feature that analyzes and predicts the content of registered data sources based on a subset of the data itself. This project uses OpenAI’s LLM APIs to take samples of datasets and propose potential tags to data admins. &lt;/p&gt;

&lt;p&gt;Tags in Starburst Galaxy allow users to associate attributes with one or more catalogs, schemas, tables, views, or columns. Tags can be combined with attribute-based access control policies to ensure that each role has the appropriate access rights to perform actions on entities in the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why did we build it?&lt;/strong&gt; Because the model automatically identifies the structures within the data, data admins spend less time learning what kind of data has been ingested and marking it appropriately – increasing both productivity and the accuracy of tags.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural Considerations:&lt;/strong&gt; A critical requirement for this feature is that the data itself must be analyzed – not just the metadata. There are many reasons for this, but the most obvious is that column names rarely describe what the columns actually contain. &lt;/p&gt;
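
&lt;p&gt;Sampling the data itself (rather than trusting column names) can be as simple as a TABLESAMPLE query whose rows are then passed to the LLM – the table name here is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *
FROM sales.public.transactions TABLESAMPLE BERNOULLI (1)
LIMIT 100;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;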

&lt;h2&gt;Project B: Trino AI Functions&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Built by Karol Sobczak, Mainak Ghosh&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://player.vimeo.com/video/838356293" width="710" height="399"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt; Trino AI Functions (aka LLMs in the context of complex aggregations) uses ChatGPT and Hugging Face to translate natural language text into complex aggregation logic that can be executed directly on the Trino engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why did we build it?&lt;/strong&gt; SQL is heavily used for data analysis, and at the same time the democratization of ML means that customers now want to do more with their Trino queries. Trino AI Functions will help customers write expressive queries that can do language translation, fraud detection, sentiment analysis, and other NLP tasks. &lt;/p&gt;
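
&lt;p&gt;A query using such a function might look like the sketch below – note that sentiment() is an imagined AI function from the hackathon concept, not shipped Trino syntax:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT review_id, sentiment(review_text) AS mood
FROM product_reviews
WHERE sentiment(review_text) = 'negative';
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;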

&lt;p&gt;&lt;strong&gt;Architectural Considerations:&lt;/strong&gt; One powerful feature of LLMs is that you can ask a model to return a structured answer (for example, a JSON document) from unstructured data. &lt;/p&gt;

&lt;h2&gt;Project C: No Code Querying with ChatGPT&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Built by Lukas Grubwieser and Rob Anderson&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://player.vimeo.com/video/838352967" width="710" height="399"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt; No code querying with ChatGPT is a feature that would let Galaxy users write natural language queries against data sources connected to Starburst Galaxy. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why did we build it?&lt;/strong&gt; This would allow business analysts to ask questions of Galaxy without having to write complex SQL or understand the underlying data architecture. It brings Starburst one step closer to our vision of data democratization. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural Considerations:&lt;/strong&gt; This project requires a three-part architecture: a frontend, an OpenAI engine (with a backend server), and Starburst. The frontend takes the question, OpenAI translates it to SQL, and Starburst then executes the query. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Honorary Mention:&lt;/strong&gt; Accelerating Cyber Analytics with Aderas, Starburst, and ChatGPT&lt;/p&gt;

&lt;p&gt;While this next effort wasn’t part of our internal hackathon, we thought it was too cool not to share. Our partner Aderas built a POC of an insider threat investigation model using Starburst and ChatGPT for cyber security analytics. &lt;/p&gt;

&lt;p&gt;Check out the demo:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/gnLeAsnkWU4?start=1620"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;Starburst’s Learnings on Gen AI&lt;/h2&gt;

&lt;p&gt;While we built a lot of cool things during the hackathon, we also learned a lot. We documented a couple of our team’s key learnings below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The hackathon really showed us how quickly we can generate new features and ideas that would traditionally have been much harder for a business like ours to innovate on.&lt;/li&gt;
&lt;li&gt;LLMs definitely won’t solve everything, but they’re a good starting point. &lt;/li&gt;
&lt;li&gt;It’s been fun to iterate on, but we’re also largely waiting for the models to become more correct around query generation (since we think correctness matters!). That said, it’s exceptionally cool to see what the newer models are capable of in terms of generating syntactically correct ANSI SQL.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>analytics</category>
      <category>development</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
