<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Starburst</title>
    <description>The latest articles on DEV Community by Starburst (@starburstdata).</description>
    <link>https://dev.to/starburstdata</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F7261%2F0ce36770-4918-4a75-a468-ddf4d604f2d8.jpg</url>
      <title>DEV Community: Starburst</title>
      <link>https://dev.to/starburstdata</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/starburstdata"/>
    <language>en</language>
    <item>
      <title>What data engineers can learn from lean manufacturing</title>
      <dc:creator>Isaac Obezo</dc:creator>
      <pubDate>Tue, 12 Sep 2023 14:15:11 +0000</pubDate>
      <link>https://dev.to/starburstdata/what-data-engineers-can-learn-from-lean-manufacturing-1ab9</link>
      <guid>https://dev.to/starburstdata/what-data-engineers-can-learn-from-lean-manufacturing-1ab9</guid>
      <description>&lt;p&gt;If you read my &lt;a href="https://dev.to/starburstdata/how-starbursts-data-engineering-team-builds-resilient-telemetry-data-pipelines-3m47"&gt;previous blog post&lt;/a&gt;, you’ll know that I’ve had a long career in data engineering – 20 years, maybe more. The vast majority of work that I have done can be summed up as  moving data from source A to target B, in order to ready it for analytics. I know that this might be a vast oversimplification, but the root of everything I have done in the past can be reduced to that simple process. &lt;/p&gt;

&lt;p&gt;I find myself thinking about this more often because working at Starburst has changed how I think about my work and my role. Starburst offers a shift – a shift left to a leaner model of delivering data for analytics, bringing the compute to the data instead of the data to the compute.&lt;/p&gt;

&lt;p&gt;The lean model is well known in the field of manufacturing, and I find it to be an interesting parallel with Starburst’s approach to data engineering. The lean principle defines waste as anything that doesn’t add value that customers are willing to pay for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lean Manufacturing vs. Lean Data Engineering
&lt;/h2&gt;

&lt;p&gt;Before lean manufacturing, manufacturers would: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build a warehouse – hoping that it is the right size and shape to allow them to scale their production&lt;/li&gt;
&lt;li&gt;Move raw materials into the warehouse – requiring lots of money up front in order to build up a reasonable amount of stock&lt;/li&gt;
&lt;li&gt;Build the “product” from the raw materials based on forecasting – fulfilling quotas and sales projections instead of actual orders&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this is done BEFORE the first order. (We did understand what the customer wanted, ahead of time, correct?) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jois.eu/files/DekierV_5_N1.pdf" rel="noopener noreferrer"&gt;Lean manufacturing&lt;/a&gt; changed the world. The new process was based on demand to minimize waste, meaning manufactures would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find a scalable warehouse solution – one that’s small but can grow with you&lt;/li&gt;
&lt;li&gt;“Pull” raw materials based on demand – keeping upfront costs down instead of stockpiling&lt;/li&gt;
&lt;li&gt;Build products only when an order is placed – responding to demand instead of predicting demand&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This new approach made manufacturers quicker and more efficient. They could now change their product and their supply based on shifting consumer preferences or fluctuating demand. &lt;/p&gt;

&lt;p&gt;The world of data engineering is very similar to traditional manufacturing. Data engineers typically: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buy a costly data warehousing solution&lt;/li&gt;
&lt;li&gt;Pipeline as much data as possible into the warehouse, based on assumptions (good and bad) about the data and how it will be used&lt;/li&gt;
&lt;li&gt;Serve the warehouse data to end users – and hope it was what they needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You’ll notice that in traditional data engineering, much like traditional manufacturing, customers are at the end of the process – after we transform those raw data materials into a product or service.&lt;/p&gt;

&lt;p&gt;Starburst Galaxy takes the lean approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connect to data sources&lt;/li&gt;
&lt;li&gt;Land data ONLY when you need to transform or process that data&lt;/li&gt;
&lt;li&gt;Serve up prepared data to end users via data products&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Putting Concept to Practice
&lt;/h2&gt;

&lt;p&gt;Let’s take a moment to think about this with an example: let’s revisit how to set up &lt;a href="https://www.starburst.io/blog/2022-11-30-dbt0-introduction-html/2022-11-30-dbt1-trino-setup-html/" rel="noopener noreferrer"&gt;Trino for dbt&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;In the demo provided by Michiel De Smet and Przemek Denkiewicz, they build a 360 view of the &lt;a href="https://github.com/dbt-labs/jaffle_shop" rel="noopener noreferrer"&gt;Jaffle Shop&lt;/a&gt; customer using dbt and Starburst Galaxy. When we think of how we would have managed this in a more traditional manner, these are the steps we as data engineers would have had to perform:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Land Your Relational Data&lt;/strong&gt;: We would need to move the relational data to the warehouse, and make sure that we are landing it as it needs to be landed for analytics. Sometimes we will need to cast fields as a timestamp, boolean, or integer, because we have to understand the data as we move it to the warehouse. Sometimes we miss the mark, but more often (as data engineers) we are exactly right; it is that “sometimes” that scares me a little.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Land Your Semi-Structured Data&lt;/strong&gt;: Well, we landed the data, and it was relational, which is such a blessing when moving data to the warehouse. But now we have to manage JSON, the enemy of structured data. We have this click data, and we need to manage it too. In this specific case the JSON data is not too unwieldy, but this is not true for all semi-structured data. When data engineers get to semi-structured data, we have to make a lot of decisions about the data. In a normal world, we are sharing the grief of these decisions with a group of peers, so at least we have that going for us. But we still have to make a decision, and we still have to rationalize how we relate the data to the customer 360 story.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quality Check&lt;/strong&gt;: We missed something, some assumption is wrong (you will hear a lot of this from me), and now we have to rebuild the models we assumed were right from that terrible JSON data, and rectify the assumptions we (as a community) have made.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
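
&lt;p&gt;The casting step above can be sketched in SQL. This is a hypothetical landing query, not code from the demo – the table and column names are made up for illustration:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT&lt;br&gt;
  CAST(event_time AS timestamp) AS event_time,&lt;br&gt;
  CAST(is_active AS boolean) AS is_active,&lt;br&gt;
  CAST(order_count AS integer) AS order_count&lt;br&gt;
FROM raw_orders;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Every one of those casts encodes an assumption about the source data – exactly the kind of assumption that sometimes turns out to be wrong.&lt;/p&gt;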

&lt;p&gt;The above steps were my life for 20 years. Repeat and repeat. &lt;/p&gt;

&lt;p&gt;However, with Starburst Galaxy, there’s a new path. Instead of warehousing the data, fixing the data, warehousing the new data, and then fixing the data again, Michiel De Smet and Przemek Denkiewicz show us how to connect directly to the original source of information, pull what we need to understand, and deal with misconceptions at that point of understanding. This allows us to respond to changes faster and connect that data to the business quickly and effectively. The response to problems is no longer a story of reconfiguring the warehouse, but of adjusting a connector.&lt;/p&gt;

&lt;p&gt;Because we are pulling data on demand, instead of pushing to a warehouse, we are able to respond to the changing topology of requests that our data customers have. As a data engineer, I am focused on building the right analytic models, instead of maintaining pipelines. I do enjoy maintaining a positive relationship with pipelines, but these relationships can become complicated quickly. And focusing on pipelines, and warehousing data, takes time and energy away from providing those curated models.&lt;/p&gt;

&lt;h2&gt;
  
  
  So should you adopt the lean approach to data engineering?
&lt;/h2&gt;

&lt;p&gt;On my team here at Starburst, I’ve seen that adopting the lean approach to data engineering can offer a range of significant benefits. Embracing the lean principles has led to streamlined workflows, reduced costs, quicker response times, and enhanced overall data quality.&lt;/p&gt;

&lt;p&gt;It has required me to change how I think about data engineering and how we deliver value to our customers – away from warehousing data and toward pulling data on demand.&lt;/p&gt;

</description>
      <category>lean</category>
      <category>starburstdata</category>
    </item>
    <item>
      <title>How Starburst’s data engineering team builds resilient telemetry data pipelines</title>
      <dc:creator>Isaac Obezo</dc:creator>
      <pubDate>Thu, 24 Aug 2023 14:17:13 +0000</pubDate>
      <link>https://dev.to/starburstdata/how-starbursts-data-engineering-team-builds-resilient-telemetry-data-pipelines-3m47</link>
      <guid>https://dev.to/starburstdata/how-starbursts-data-engineering-team-builds-resilient-telemetry-data-pipelines-3m47</guid>
      <description>&lt;p&gt;In a previous life, I had the privilege of being part of a team that brought a data warehousing solution to my company. It’s a tale as old as time. Data was stashed and stored everywhere. We needed to build a bridge between where that data was and push it to where it was not, so we adopted a &lt;a href="https://www.starburst.io/learn/data-fundamentals/data-warehouse/" rel="noopener noreferrer"&gt;data warehouse&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;And it worked well – until it didn’t. I’ll go into more details on my cloud data warehouse learnings in a future series, but for my first Starburst post, I wanted to start simple – how we are using Starburst at Starburst.&lt;/p&gt;

&lt;p&gt;As the &lt;a href="https://www.starburst.io/learn/data-fundamentals/data-engineering/" rel="noopener noreferrer"&gt;data engineering&lt;/a&gt; team, our goal is to provide a data analytics platform that enables self-service, provides clarity and insights, is trustworthy, and that quality and testing are built into the data for the end users (or systems). That’s a fancy way of saying we are obsessed with being data-driven and giving more people access to &lt;a href="https://www.starburst.io/learn/data-fundamentals/data-quality/" rel="noopener noreferrer"&gt;quality data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As with any good data engineering platform, our first step was to build out a data curation lifecycle. For this initiative, we chose to use &lt;a href="https://www.starburst.io/blog/build-and-run-scalable-transformation-pipelines-using-dbt-cloud-starburst/" rel="noopener noreferrer"&gt;Starburst Galaxy paired with dbt&lt;/a&gt;. The rest of this blog will talk about our experience with Starburst Galaxy and dbt compared to the traditional warehousing model. &lt;/p&gt;

&lt;h2&gt;
  
  
  Designing the curation lifecycle
&lt;/h2&gt;

&lt;p&gt;When architecting the data curation lifecycle, my team wanted to be able to organize the data and the curation layers by Topic. A Topic is the category or bucket of information that the data models represent. In this blog, Telemetry is our Topic, which means the data curation lifecycle stages for Telemetry become:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Telemetry_Source – where data is landed&lt;/li&gt;
&lt;li&gt;Telemetry_Stage – where data is structured&lt;/li&gt;
&lt;li&gt;Telemetry_Curation – where data is served&lt;/li&gt;
&lt;/ul&gt;
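
&lt;p&gt;In Trino SQL, standing up these stages is as simple as creating the schemas. This is a sketch using the stage names above – your catalog layout and naming will likely differ:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CREATE SCHEMA IF NOT EXISTS telemetry_source;&lt;br&gt;
CREATE SCHEMA IF NOT EXISTS telemetry_stage;&lt;br&gt;
CREATE SCHEMA IF NOT EXISTS telemetry_curation;&lt;/code&gt;&lt;/p&gt;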

&lt;p&gt;Sometimes this is referred to as the medallion architecture, where you have bronze (raw data), silver (cleaned data), and gold (curated data) tables. At Starburst, we like to use land, structure, and consume to communicate the same paradigm, which pairs perfectly with our data lifecycle stages: source, stage, and Data Product (AKA curation). &lt;/p&gt;

&lt;p&gt;Now that we have a methodology for categorizing data, the next step is building out the curations. But we want to create an efficient and scalable process that will allow us to iterate quickly; deploy, document, and test; and provide the best possible data to the curation layer. &lt;/p&gt;

&lt;p&gt;Ideally, we want to be able to deploy from our repositories on Git, with a CI/CD (continuous integration, continuous deployment) strategy, and provide lineage, test results, data dictionaries, and catalogs as part of this process. Our team goal was to build as much self-service as possible into the curation layer, from a single source of truth (Git).&lt;/p&gt;

&lt;h2&gt;
  
  
  The traditional warehousing approach
&lt;/h2&gt;

&lt;p&gt;In my previous life, I would’ve built these pipelines by pulling the data from a source of information and then landing that data (either raw or after some in-stream transformations) into a large warehouse. Pretty straightforward. Then, I would’ve implemented a service or orchestration layer, to help manage those connections. Only at that point would I be able to start asking questions of data that would normally span domains of information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7tfowpaj16g7thrrl4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7tfowpaj16g7thrrl4e.png" alt=" " width="764" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem with this approach is that warehousing pipelines are fragile and prone to human error. There are a couple of common breakpoints and inefficiencies I ran into over and over again:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incorrect data structure assumptions:&lt;/strong&gt; As the data engineer, I would need to make determinations on the structure of the data. Oftentimes, I chose to land this data as blobs or varchar and figure out the rest downstream. Most of the time this worked and the assumptions I made were right, but other times they were not, leading to a poor-quality pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time spent on pipeline maintenance:&lt;/strong&gt; &lt;a href="https://www.starburst.io/learn/data-fundamentals/data-pipeline/" rel="noopener noreferrer"&gt;Data pipelines&lt;/a&gt; require maintaining, updating, validating, and managing connectors. If anything breaks, you have to take the time to understand what the pipeline does, what transformations were performed, and what downstream systems it feeds. I often referred to this pipeline work as “data archeology”. The pipeline needed to be deciphered, the origins discovered, the transformations vetted and understood. For me, this was hard enough to do with my own pipelines, let alone deconstructing another data engineer’s work. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short-sighted naming conventions:&lt;/strong&gt; In a traditional warehouse, the reality is data has to live within the same schema, which means table and view names take on extra importance. Is a table always prefaced with “tbl_” or a view with “vx_”? We are constrained by our own ability to name things that are informative and forward compatible. This is a lot to manage, even without tracking lineage, writing documentation, and creating data dictionaries. The organization of data to build our analytic models alone can be a quagmire.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t meant to be a list of complaints against warehousing pipelines. I really thought these problems were just part of the job. Not only do we work with data, we work with the entire ecosystem (and all its good or bad) as well. &lt;/p&gt;

&lt;h2&gt;
  
  
  Starburst Galaxy and dbt
&lt;/h2&gt;

&lt;p&gt;After being introduced to Starburst, I realized that a lot of the day-to-day pipeline pains I had classified as inevitable could be fixed because of how Starburst operates. &lt;/p&gt;

&lt;p&gt;Starburst Galaxy, Starburst’s fully managed data lake analytics platform, acts as a non-destructive abstraction layer on top of your data sources. We can develop a strategy, hate it, and develop a new organization strategy without a lengthy migration process. While optimized for data lake analytics, Galaxy works across any data source, from data warehouses to operational databases to storage solutions as simple as Google Sheets. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6knv0png5bo4bg2923r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6knv0png5bo4bg2923r.png" alt=" " width="773" height="627"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our deployment of Starburst Galaxy, we arranged our telemetry data in our new data catalog “telemetry-catalog” which is connected to our AWS S3 data lake that contains all the curation stages either as S3 objects or external tables. &lt;/p&gt;

&lt;p&gt;In order to move our lake data through the curation stages, we embraced dbt. With dbt, we are able to build the models, build the dependencies, create tests and document our models. Said differently, we are building our telemetry pipelines with dbt and using Galaxy to manage our connections. This means that instead of switching between many systems, we can manage both source and destination data through a single system – Galaxy.&lt;/p&gt;

&lt;p&gt;Specifically, we use Galaxy’s federation capabilities (aka cross-schema joins) to create staged views – or materialized if necessary – from source information, meaning we are able to build the curation layer without creating any pipelines.&lt;/p&gt;
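
&lt;p&gt;As a sketch of what such a staged view can look like – the schema, table, and column names here are hypothetical, not our actual telemetry models – a federated join across two source schemas might be:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CREATE VIEW telemetry_stage.cluster_events AS&lt;br&gt;
SELECT e.cluster_id, e.event_type, e.event_time, a.account_name&lt;br&gt;
FROM telemetry_source.raw_events e&lt;br&gt;
JOIN crm_source.accounts a&lt;br&gt;
  ON e.account_id = a.account_id;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Because Galaxy resolves both sides of the join, no data has to be copied anywhere for this view to exist.&lt;/p&gt;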

&lt;p&gt;With dbt, we are able to connect to Galaxy and run all of our operations through Galaxy and define them with our dbt_project YAML. Managing the different objects, the transformations, and curation models through dbt and Galaxy means that we can also build out a complete lineage of the model, with comments and tests, and track the dependencies that each of our curation models are built from.&lt;/p&gt;
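
&lt;p&gt;A minimal dbt_project.yml fragment for this kind of layout – the model folders and schema names here are illustrative, not our exact production configuration – might look like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;models:&lt;br&gt;
&amp;nbsp;&amp;nbsp;telemetry:&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;stage:&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;+materialized: view&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;+schema: telemetry_stage&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;curation:&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;+materialized: table&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;+schema: telemetry_curation&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;With configuration like this, the folder a model lives in determines its curation stage, so the lifecycle is enforced by the project structure itself.&lt;/p&gt;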

&lt;p&gt;Here’s an example of what this looks like for our telemetry data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5j0js1wwq0zkfnhsp8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5j0js1wwq0zkfnhsp8j.png" alt=" " width="800" height="590"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you might have figured out, I’m a big fan of dbt and Galaxy. After building our data models with dbt and Galaxy, we’ve seen the following benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Increased data quality:&lt;/strong&gt; Testing is easy to define and build into the process with dbt. Tests can be run as part of our CI/CD pipelines, with results reported. And the artifacts generated by dbt give data engineers a copy of the tests as complete SQL code. Quality is a feature of good curations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain-oriented approach:&lt;/strong&gt; We are able to work within domains (or topics) instead of spending time trying to decipher the tools and processes. This allows us to bring a higher level of quality to the data, be more efficient with our workflows, and always be improving self-service with team members that are domain experts. We’ve found Galaxy’s data products and comment capabilities especially useful here. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More time spent building vs. maintaining:&lt;/strong&gt; Galaxy handles the integration for the models, so my team can spend more time building new models. With Galaxy, if the catalog exists and is assigned to the right permissions and clusters, the model is integrated within the data ecosystem. While dbt defines the model, Galaxy serves up the model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More agility and autonomy:&lt;/strong&gt; Ultimately, data engineers want to move quickly, without being discouraged by many thousands of dependencies. Galaxy and dbt, with a good and clear strategy, allow the DE team to work as autonomously and asynchronously as possible. We are a global company after all.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of the above makes it easier for my team to scale. Scalability is something I have struggled with in previous work, because data engineers can become specialists quickly. As mentioned previously, pipelines can become black boxes, and we then spend our time understanding where and what needs to be done. This type of workload does not scale; it is impossible to scale this work … and I like to take vacations. &lt;/p&gt;

&lt;p&gt;Galaxy manages the connections, but dbt really delivers a method to build a workflow that we can scale. We are able to treat many of our curation projects as a normal development lifecycle, and leverage the rigor from that workflow to build scalable solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concluding thoughts
&lt;/h2&gt;

&lt;p&gt;I am an imperfect engineer, and often fail at tasks. I like to tell my team that “I really try and do nothing that my future self will want to kill me for”. &lt;/p&gt;

&lt;p&gt;The reality is that, like my previous data warehouse implementation, not every strategy will work. Instead, my goal is to create an environment where analytical models are constantly being improved upon and the work that drives those models can change when change is (inevitably) necessary. &lt;/p&gt;

&lt;p&gt;Starburst Galaxy allows us to create, explore, understand and build pipelines, but more importantly it allows us to change and not destroy when we realize our approaches are not quite working. &lt;/p&gt;

&lt;p&gt;With dbt and Galaxy, my role has evolved from maintaining many different systems, configurations, and SQL into managing a single dbt project and spending more time building new projects. Often I find myself having eureka moments with Starburst Galaxy and how breaking some of our old paradigms allows us, the data engineers, to build and grow a data driven culture within Starburst and everywhere else. &lt;/p&gt;

</description>
      <category>starburstdata</category>
      <category>dbt</category>
      <category>dataengineering</category>
      <category>starburstgalaxy</category>
    </item>
    <item>
      <title>Iceberg Time Travel &amp; Rollbacks in Trino</title>
      <dc:creator>Tom Nats</dc:creator>
      <pubDate>Thu, 17 Aug 2023 18:54:46 +0000</pubDate>
      <link>https://dev.to/starburstdata/iceberg-time-travel-rollbacks-in-trino-j9e</link>
      <guid>https://dev.to/starburstdata/iceberg-time-travel-rollbacks-in-trino-j9e</guid>
      <description>&lt;p&gt;&lt;em&gt;I originally posted this on &lt;a href="https://www.starburst.io/blog/apache-iceberg-time-travel-rollbacks-in-trino/"&gt;Starburst's blog&lt;/a&gt;, as part of a series I've been publishing on Iceberg.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Time Travel
&lt;/h3&gt;

&lt;p&gt;Time travel in Trino using Iceberg is a handy feature to “look back in time” at a table’s history. As we covered &lt;a href="https://dev.to/starburstdata/introduction-to-apache-iceberg-in-trino-48g0"&gt;in this blog&lt;/a&gt;, each change to an Iceberg table creates a new “snapshot” which can be referred to by using standard SQL.&lt;/p&gt;

&lt;p&gt;As you can see from the diagram below, a new snapshot is created for the table creation, insert and update.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5JKgfTLU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ma0koe27hfiicqa7snp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5JKgfTLU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ma0koe27hfiicqa7snp2.png" alt="Snapshot example" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To see the snapshots on a table, you can use the handy metadata table that exists for each table:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM "customer_iceberg$snapshots";&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KbJD4tk4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhkddynxfl29cfhcw2zt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KbJD4tk4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhkddynxfl29cfhcw2zt.png" alt="Snapshot files returned from customer table" width="800" height="187"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above snapshot table shows the create, insert and update operations on the customer_iceberg table. You can see what type of operation was performed and when it was executed.&lt;/p&gt;
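
&lt;p&gt;For time travel you typically only need a few of those columns. A narrower query over the same metadata table makes it easy to grab the snapshot ID you want:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT snapshot_id, committed_at, operation&lt;br&gt;
FROM "customer_iceberg$snapshots"&lt;br&gt;
ORDER BY committed_at;&lt;/code&gt;&lt;/p&gt;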

&lt;p&gt;To select a certain snapshot, you use the “for version as of” syntax. In the following two examples, we show the customer name before and after an update:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT custkey,name&lt;br&gt;
FROM customer_iceberg FOR VERSION AS OF 5043425904354141100 where custkey = 2732;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vv_2C8iU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3fbdm6wsx96ojy5rj1c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vv_2C8iU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3fbdm6wsx96ojy5rj1c.png" alt="Snapshot before update to custkey 2732" width="562" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT custkey,name&lt;br&gt;
FROM customer_iceberg FOR VERSION AS OF 3117754680069542695 where custkey = 2732;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xvhokg4u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wxbyx0vv0xekk1mjg6y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xvhokg4u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wxbyx0vv0xekk1mjg6y7.png" alt="Snapshot after update to custkey 2732" width="434" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also specify a timeframe to retrieve an older snapshot of a table. For example, the following query brings back the data for the first snapshot on or before a given timestamp:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT custkey,name&lt;br&gt;
FROM s3lakehouse.demo_tpch.customer_iceberg FOR TIMESTAMP AS OF TIMESTAMP '2022-09-18 07:18:09.002 America/New_York' where custkey = 2732;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--asvfEXuU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gd2386kvm7q9pqkoizyz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--asvfEXuU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gd2386kvm7q9pqkoizyz.png" alt="Retrieving older snapshot of custkey" width="562" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Rolling back
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ddBBZ_Ji--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gpjgcjr810iunmz90usy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ddBBZ_Ji--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gpjgcjr810iunmz90usy.png" alt="Back to the Future meme captioned &amp;quot;Let's go back in time to a previous Iceberg snapshot&amp;quot;" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another great feature of Iceberg is the ability to roll back a table to a previous snapshot. Sometimes this is used when a row was accidentally deleted or updated. As long as the snapshot still exists (it hasn’t been cleaned up yet), you can roll back to any existing snapshot.&lt;/p&gt;

&lt;p&gt;For example, in the scenario above, if I wanted to roll back to the state of the table before the update on the customer, then I would issue the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CALL iceberg.system.rollback_to_snapshot('demo_tpch', 'customer_iceberg', 5043425904354141100)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then we can query the table again to see the customer’s name was “rolled back” to the previous version before the update:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT custkey,name&lt;br&gt;
FROM s3lakehouse.demo_tpch.customer_iceberg where custkey = 2732;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2LLX0ZJo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1k9w57j0fnh334yg6rvi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2LLX0ZJo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1k9w57j0fnh334yg6rvi.png" alt="Rolled back version of custkey" width="562" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both time travel and rollback are database functions that are now available in a modern data lake architecture. This is a game changer, as it brings database-type functionality to object stores – something that was not available until now.&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>bigdata</category>
      <category>trino</category>
      <category>datalake</category>
    </item>
    <item>
      <title>Iceberg Schema Evolution in Trino</title>
      <dc:creator>Tom Nats</dc:creator>
      <pubDate>Wed, 09 Aug 2023 13:34:00 +0000</pubDate>
      <link>https://dev.to/starburstdata/iceberg-schema-evolution-in-trino-2fpc</link>
      <guid>https://dev.to/starburstdata/iceberg-schema-evolution-in-trino-2fpc</guid>
      <description>&lt;p&gt;&lt;em&gt;I originally posted this on &lt;a href="https://www.starburst.io/blog/apache-iceberg-schema-evolution-in-trino/"&gt;Starburst's blog&lt;/a&gt;, as part of a series I've been publishing on Iceberg.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Schema evolution simply means the modification of tables as business rules and source systems change over time. Trino’s Iceberg connector supports different modifications to tables, including changing the table name itself as well as column and partition changes.&lt;/p&gt;

&lt;p&gt;Much like in a database, you perform “alters” on Iceberg tables to modify their structure. Since Iceberg is just a table format that stores metadata about a table, modifying the table is rather trivial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Table Changes
&lt;/h3&gt;

&lt;p&gt;Rename a table:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table customer_iceberg rename to customer_iceberg_new;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: This is a change to the table name in the metastore and no changes will be made in the storage. So, the location s3:///customer_iceberg will remain the same.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Column Changes
&lt;/h3&gt;

&lt;p&gt;Adding a column:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table customer_iceberg add column tier varchar(1);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Rename a column:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table customer_iceberg rename column address to fulladdress;&lt;/code&gt;&lt;/p&gt;
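&lt;p&gt;Dropping a column works the same way. A quick sketch, assuming the &lt;code&gt;tier&lt;/code&gt; column added above exists:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table customer_iceberg drop column tier;&lt;/code&gt;&lt;/p&gt;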

&lt;h3&gt;
  
  
  Partition Changes
&lt;/h3&gt;

&lt;p&gt;Oftentimes a table is initially partitioned by a column or set of columns, only to discover later that this choice isn’t optimal. With Iceberg, you can modify the partition columns at any time.&lt;/p&gt;

&lt;p&gt;For example, initially this table is partitioned by month:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;create table orders_iceberg&lt;br&gt;
with (partitioning=ARRAY['month(orderdate)']) as&lt;br&gt;
select * from tpch.sf1.orders;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After reviewing query patterns, it’s determined that partitioning by day would perform better, since the majority of queries filter on specific days. A simple alter table statement, as seen below, will change the partitioning on this table from month to day:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table orders_iceberg SET PROPERTIES partitioning = ARRAY['day(orderdate)'];&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After new data is inserted into the table, you will see a change in the data directory where the table data is stored:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kV8EYfvT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kdwj4tsa0qnaootluohf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kV8EYfvT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kdwj4tsa0qnaootluohf.png" alt="S3 screenshot showing orderdate_month is now orderdate_day" width="583" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice the orderdate_month is now orderdate_day. Note that queries filtering by day will prune at the daily partition level for new data, but the existing monthly partitions will still need to be scanned. If you would like the entire table partitioned by day, you can recreate it using a CTAS (create table as select) from the existing table.&lt;/p&gt;

&lt;p&gt;Example to create a new table partitioned by day from the existing table:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;create table orders_iceberg_new with (partitioning=ARRAY['day(orderdate)']) as select * from orders_iceberg;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Schema evolution in Trino’s Iceberg connector is very powerful and easy to use. These types of functions were not available in Hive and database veterans will be very happy to see them added to the data lake landscape.&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>trino</category>
      <category>schema</category>
      <category>datalake</category>
    </item>
    <item>
      <title>Iceberg DML &amp; Maintenance in Trino</title>
      <dc:creator>Tom Nats</dc:creator>
      <pubDate>Mon, 07 Aug 2023 12:39:00 +0000</pubDate>
      <link>https://dev.to/starburstdata/apache-iceberg-dml-maintenance-in-trino-opc</link>
      <guid>https://dev.to/starburstdata/apache-iceberg-dml-maintenance-in-trino-opc</guid>
      <description>&lt;p&gt;&lt;em&gt;I originally posted this on &lt;a href="https://www.starburst.io/blog/apache-iceberg-dml-update-delete-merge-maintenance-in-trino/"&gt;Starburst's blog&lt;/a&gt;, as part of a series I've been publishing on Iceberg.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One key feature of the Apache Iceberg connector is Trino’s ability to modify data that resides on object storage. As we all know, objects in storage like Amazon S3 are immutable, meaning they cannot be modified in place. This was a challenge in the Hadoop era, when data needed to be modified or removed at the individual row level. Trino allows for full DML (data manipulation language) using the Iceberg connector, which means full support for update, delete and merge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5w82anWw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pv2shh3cp0vq98dygvpc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5w82anWw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pv2shh3cp0vq98dygvpc.png" alt="Merge, Insert, Update, Delete diagram on top of object storage options" width="678" height="750"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since Iceberg is a table format, when DML commands are issued, transaction logs are created and stored in the same storage, so clients reading from the same table will see these changes reflected in subsequent queries.&lt;/p&gt;
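&lt;p&gt;You can inspect this transaction log yourself: Trino exposes Iceberg metadata through hidden tables such as &lt;code&gt;$snapshots&lt;/code&gt; and &lt;code&gt;$history&lt;/code&gt;. A quick sketch, using the customer table from the examples below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- list the snapshots created by DML against the table, newest first
SELECT snapshot_id, committed_at, operation
FROM "customer_iceberg$snapshots"
ORDER BY committed_at DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;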

&lt;h3&gt;
  
  
  Insert
&lt;/h3&gt;

&lt;p&gt;Inserts are among the most heavily used commands in a modern data lake. Data is constantly being added, and as you would expect, the Iceberg connector supports a standard insert statement:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;insert into customer_iceberg values&lt;br&gt;
(90000,'Testing','33 Main',3,'303-867-5309',323,'MACHINERY','Testing Iceberg');&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Note that updates, deletes and merges against a single table should be run serially or in batches. This is to ensure there are no write conflicts.&lt;/p&gt;
&lt;h3&gt;
  
  
  Update
&lt;/h3&gt;

&lt;p&gt;Updates in Trino with the Iceberg connector act just like ordinary updates. If there are current select statements executing against this table, they will see the data in the previous snapshot to ensure read integrity.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;update customer_iceberg set name = 'Tim Rogers' where custkey = 2732;&lt;/code&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Delete
&lt;/h3&gt;

&lt;p&gt;Delete statements are rarely used in a modern data lake; “soft” deletes are far more common, meaning the row is updated to flag it as deleted. There is usually a status column, or something similar, that select queries use to filter this data out. In the event a row or set of rows does need to be deleted, this can be done with a typical delete statement:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;delete from customer_iceberg where custkey = 2732;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: A delete statement doesn’t actually physically delete the data from storage. In order to ensure the data has been removed, an &lt;a href="https://trino.io/docs/current/connector/iceberg.html#expire-snapshots"&gt;expire_snapshots&lt;/a&gt; procedure needs to be executed with a date that is less than when the delete statement was run.&lt;/em&gt;&lt;/p&gt;
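&lt;p&gt;As a sketch, the cleanup for the delete above could look like this, removing all snapshots (and their underlying data files) older than seven days:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE customer_iceberg EXECUTE expire_snapshots(retention_threshold =&amp;gt; '7d');&lt;/code&gt;&lt;/p&gt;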
&lt;h3&gt;
  
  
  Merge
&lt;/h3&gt;

&lt;p&gt;The merge statement is a very handy tool for adding logic-based operations to a SQL statement. Merge is often used when new or modified data is first staged in a table. A good example is customer data pulled from an operational system: CDC (change data capture) data is extracted from a CRM system into a staging table in S3. Alternatively, with Trino, a merge can read directly from an existing table in the source system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t8xikmKP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/upxtmbqcw4midf1ldku3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t8xikmKP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/upxtmbqcw4midf1ldku3.png" alt="Diagram showing data extracted from a CRM system into a staging table in S3" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To use merge, you can either stage data that needs to be inserted or updated into your target table or you can use data directly from the source table(s).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example 1: If there are rows that don’t exist in the target table, insert them. This is a very basic merge statement. The &lt;code&gt;customer_land&lt;/code&gt; table below could be a staged table in object storage like S3 or it could be from a source system such as MySQL or SQL Server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MERGE INTO s3lakehouse.blog.customer_base AS b
USING (select * from s3lakehouse.blog.customer_land) AS l
ON (b.custkey = l.custkey)
WHEN NOT MATCHED
      THEN INSERT (custkey, name, state, zip, cust_since,last_update_dt)
            VALUES(l.custkey, l.name, l.state, l.zip, l.cust_since,l.last_update_dt);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example 2: With merge, we can issue a single statement to insert new rows and update existing ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MERGE INTO s3lakehouse.blog.customer_base AS b
USING s3lakehouse.blog.customer_land AS l
ON (b.custkey = l.custkey)
WHEN MATCHED and b.name != l.name
THEN UPDATE
SET name = l.name ,
          state = l.state,
          zip = l.zip,
          cust_since = l.cust_since
WHEN NOT MATCHED
      THEN INSERT (custkey, name, state, zip, cust_since,last_update_dt)
            VALUES(l.custkey, l.name, l.state, l.zip, l.cust_since,l.last_update_dt);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This statement will insert new rows where the custkey doesn’t exist in the target table. It will update rows in the target table if the custkey matches and the name has changed. Of course, in real-world situations there will be numerous columns checked for changes before issuing an update. I chose name for this simple example, but you can see the power of merge and why it’s a game changer for a modern data lake.&lt;/p&gt;

&lt;p&gt;Example 3: Slowly Changing Dimension (SCD Type 2)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MERGE INTO s3lakehouse.blog.customer_base as b
USING
( SELECT null as custkey_match, custkey, name, state, zip, cust_since, last_update_dt,'Y' as active_ind,current_timestamp as end_dt
FROM s3lakehouse.blog.customer_land
UNION ALL
SELECT
custkey as custkey_match,custkey, name, state, zip, cust_since, last_update_dt,active_ind,end_dt
FROM s3lakehouse.blog.customer_base
WHERE custkey IN
(SELECT custkey FROM s3lakehouse.blog.customer_land where active_ind = 'Y')
) as scdChangeRows
ON (b.custkey = scdChangeRows.custkey and b.custkey = scdChangeRows.custkey_match)
WHEN MATCHED and b.active_ind = 'Y' THEN
UPDATE SET end_dt = current_timestamp,active_ind = 'N'
WHEN NOT MATCHED THEN
        INSERT (custkey, name, state, zip, cust_since,last_update_dt,active_ind,end_dt)
            VALUES(scdChangeRows.custkey, scdChangeRows.name, scdChangeRows.state, scdChangeRows.zip,
                    scdChangeRows.cust_since,scdChangeRows.last_update_dt,'Y',null);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An SCD Type 2 simply means that in addition to inserting new rows, we “end date” existing rows and insert replacement rows. This allows history to be maintained in a single table. It’s a data warehousing technique that has been around for a long time, but the ability to do this in a data lake is new, and it opens up the option of providing data warehousing features directly on cloud storage.&lt;/p&gt;

&lt;p&gt;There is a lot going on in this merge, so we’ll cover a few points. We first select data from the landing table and union it with the active rows from our base table. From there, we insert any new rows in addition to any modified rows. Lastly, we “end date” the old row by setting active_ind to ‘N’ and populating the end_dt column.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimize
&lt;/h3&gt;

&lt;p&gt;As your Iceberg tables grow and accumulate operations, it’s a good idea to optimize them from time to time. The optimize command not only compacts small files into larger ones for better performance, it also cleans up metadata, which speeds up queries because less metadata needs to be read.&lt;/p&gt;

&lt;p&gt;To scan the table for small files and make them larger, you simply issue the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table &amp;lt;table&amp;gt; execute optimize;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will look for any files under 100MB and combine them into larger ones. You can also choose a file size threshold other than the default 100MB:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE &amp;lt;table&amp;gt; EXECUTE optimize(file_size_threshold =&amp;gt; '10MB')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If your Iceberg table becomes very large and the optimize command above is taking too long to run, you can just optimize the files that have arrived recently:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table &amp;lt;table&amp;gt; execute optimize where "$file_modified_time" &amp;gt; current_date - interval '1' day;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will look for files that have arrived since yesterday and optimize them. On a very active table where lots of changes are taking place, this will greatly reduce the amount of time the optimize command takes.&lt;/p&gt;

&lt;p&gt;For tables that are being modified constantly, it’s a good idea to optimize at regular intervals.&lt;/p&gt;

&lt;p&gt;We’ve covered one of the most powerful features of Iceberg with Trino which allows database type updates/deletes/merges to be performed on your modern data lake. This opens the door to increasing the use cases and reliability of the data lake on a variety of clouds and platforms.&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>trino</category>
      <category>datalake</category>
      <category>aws</category>
    </item>
    <item>
      <title>Iceberg Partitioning and Performance Optimizations in Trino</title>
      <dc:creator>Tom Nats</dc:creator>
      <pubDate>Tue, 01 Aug 2023 12:43:00 +0000</pubDate>
      <link>https://dev.to/starburstdata/iceberg-partitioning-and-performance-optimizations-in-trino-27a7</link>
      <guid>https://dev.to/starburstdata/iceberg-partitioning-and-performance-optimizations-in-trino-27a7</guid>
      <description>&lt;p&gt;&lt;em&gt;I originally posted this on &lt;a href="https://www.starburst.io/blog/iceberg-partitioning-and-performance-optimizations-in-trino-partitioning/"&gt;Starburst's blog&lt;/a&gt;, as part of a series I've been publishing on Iceberg.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Partitioning
&lt;/h3&gt;

&lt;p&gt;Partitioning is used to narrow down the scope of the data that needs to be read for a query. When dealing with big data, this can be crucial for performance: it can bring a query that would take minutes or even hours down to seconds!&lt;/p&gt;

&lt;p&gt;One of the advantages of Apache Iceberg is how it handles partitions. One of the biggest drawbacks of Hive-based tables was the way you had to partition your data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EARkfGjL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zgnjxm2d0fu126658iaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EARkfGjL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zgnjxm2d0fu126658iaz.png" alt="Hive vs. Iceberg logos" width="470" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, most tables that you would plan to partition have some sort of date or timestamp that indicates when the row of data was created. Example table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BC3JJTuh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rwk2e8ogv55ybdu6pjpj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BC3JJTuh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rwk2e8ogv55ybdu6pjpj.png" alt="Table showing timestamp" width="800" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For Hive, if you wanted to partition by day, you had to break the created_ts column out into separate year, month and day columns. Then you had to teach your users to always include these columns in their queries, even when they really wanted to filter on created_ts.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;create table hive.orders (event_id integer, created_ts timestamp, metric integer, year varchar, month varchar, day varchar);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;With Iceberg, you simply partition the data on created_ts using day and end users would query this table just like they would in a database. Here is an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- create iceberg table partitioned by day on the created_ts column
create table orders_iceberg 
(event_id integer, created_ts timestamp(6), metric double)
with (type='iceberg',partitioning=ARRAY['day(created_ts)']);
-- insert rows
insert into orders_iceberg values (1,timestamp '2022-09-10 10:45:38.527000',5.5);
insert into orders_iceberg values (1,timestamp '2022-09-11 03:12:23.522000',5.5);
insert into orders_iceberg values (1,timestamp '2022-09-12 10:46:13.516000',5.5);
insert into orders_iceberg values (1,timestamp '2022-09-13 04:34:05.577000',5.5);
insert into orders_iceberg values (1,timestamp '2022-09-14 09:10:23.517000',5.5);
-- query the table only looking for certain days
select * from orders_iceberg where created_ts BETWEEN date '2022-09-10' AND date '2022-09-12';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data in S3 for example looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oF7rZv4a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eoymtfi8n85g2hm0sucx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oF7rZv4a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eoymtfi8n85g2hm0sucx.png" alt="S3 screenshot showing partitioned files by day" width="483" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Trino is smart enough to read the Iceberg Manifest List and then only look at files that meet the partition requirement of the query. In the example above, it would only be 2022-09-10 and 2022-09-11. A list of functions to partition by can be found &lt;a href="https://trino.io/docs/current/connector/iceberg.html#partitioned-tables"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: Trino’s Iceberg implementation includes the timezone for the timestamp data type (timestamp(6)). This was a conscious decision based on the industry standard of supporting time zones within timestamp data types, which Hive didn’t support.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Although we’ll cover this in a separate schema evolution blog, you aren’t stuck with this partitioning scheme. At any time, you can modify your partition column. For example, if we decided that partitioning on day is too granular, we can modify the table to now be partitioned by month:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table orders_iceberg SET PROPERTIES partitioning = ARRAY['month(created_ts)'];&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;New data will be created in directories named: created_ts_month=2022-09 for example. The existing data will remain partitioned by day unless the table is recreated.&lt;/p&gt;
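&lt;p&gt;One way to confirm how a table is currently laid out is Trino’s hidden &lt;code&gt;$partitions&lt;/code&gt; metadata table. A quick sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- list each partition with its row and file counts
SELECT partition, record_count, file_count
FROM "orders_iceberg$partitions";
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;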

&lt;h3&gt;
  
  
  Performance and Optimizations
&lt;/h3&gt;

&lt;p&gt;When it comes to performance, Iceberg can be a very performant table format. This is because metadata is stored about all of the files that “belong” to a table for a given snapshot in time, along with statistics about each one, which enables “file skipping”, a fancy term for not reading files that aren’t needed for the query that is issued.&lt;/p&gt;

&lt;p&gt;With partitioning, the set of candidate files is narrowed down even further: partition pruning is completed first, and only then is the metadata of the remaining files examined. When the data is also ordered by the columns that appear in a where clause, this can greatly improve the performance of selective queries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--woneTW8_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kbhs7hqkp9bxhrkpbhj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--woneTW8_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kbhs7hqkp9bxhrkpbhj3.png" alt="Flow chart showing from current snapshot to file statistics" width="800" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The manifest file contains information about the different files that belong to the table. Each entry has the location of the file in addition to statistics such as the minimum and maximum value for each column, the number of nulls, and other useful information. Trino uses this per-file metadata to determine whether a file needs to be read at all. If the data is sorted by “id” and a where clause has a predicate similar to where id = 5, the query will see a large performance improvement because only a handful of files (if not just one) will need to be read.&lt;/p&gt;
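&lt;p&gt;You can see these per-file statistics for yourself through Trino’s hidden &lt;code&gt;$files&lt;/code&gt; metadata table. A quick sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- per-file row counts, sizes, and the min/max column bounds used for file skipping
SELECT file_path, record_count, file_size_in_bytes, lower_bounds, upper_bounds
FROM "orders_iceberg$files";
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;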

&lt;h3&gt;
  
  
  Optimizing for performance
&lt;/h3&gt;

&lt;p&gt;Iceberg includes some file management features that help with performance. Traditional data lakes often ingest data constantly, and that data is written in small files so it is available for querying immediately. Reading many small files hurts performance in any system, especially on cloud storage. Iceberg includes an optimize feature that combines small files into larger ones, ensuring maximum query performance.&lt;/p&gt;

&lt;p&gt;The idea here is you want to ingest data as fast as possible, making it available for queries even though it might not be of the highest performance, then offer the ability to combine those files into larger ones at a given interval.&lt;/p&gt;

&lt;p&gt;To scan the table for small files and make them larger, you simply issue the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;alter table &amp;lt;table&amp;gt; execute optimize;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will look for any files under 100MB and combine them into larger ones. You can also choose a file size threshold other than the default 100MB:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE &amp;lt;table&amp;gt; EXECUTE optimize(file_size_threshold =&amp;gt; '10MB')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If your Iceberg table becomes very large and the optimize command above is taking too long to run, you can just optimize the files that have arrived recently:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE &amp;lt;table&amp;gt; EXECUTE optimize where "$file_modified_time" &amp;gt; current_date - interval '1' day;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will look for files that have arrived since yesterday and optimize them. On a very active table where lots of changes are taking place, this will greatly reduce the amount of time the optimize command takes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Network events are streamed in 1 minute intervals. Small files are dropped into an S3 bucket using the Iceberg API and the data is available immediately using standard SQL. Based on the volume of data and the files created, the optimize command can be run at given intervals to consolidate these smaller files into larger ones. This will greatly improve the performance of subsequent queries against this table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l7FGYtOJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k9cymhek313gapkhjdt0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l7FGYtOJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k9cymhek313gapkhjdt0.png" alt="Example of using the optimize command in Iceberg" width="676" height="1078"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cleaning up snapshots
&lt;/h3&gt;

&lt;p&gt;From time to time, older snapshots of tables should be cleaned up. These older snapshots contain previous states of the table which are no longer needed.&lt;/p&gt;

&lt;p&gt;There are two operations that clean up old snapshots and data. One is “expire_snapshots” and the other is “remove_orphan_files”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://trino.io/docs/current/connector/iceberg.html#expire-snapshots"&gt;expire_snapshots&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
This function removes snapshots older than the value provided. For example, the command below removes snapshots that are more than 7 days old:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE &amp;lt;table&amp;gt; EXECUTE expire_snapshots(retention_threshold =&amp;gt; '7d')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hWXwflcL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yhc79l587upinucyi7xh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hWXwflcL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yhc79l587upinucyi7xh.png" alt="Diagram depicting deletion of snapshots after 7 days" width="709" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://trino.io/docs/current/connector/iceberg.html#remove-orphan-files"&gt;remove_orphan_files&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
This function removes files left on storage when a query fails to complete for any of a variety of reasons. This doesn’t happen too often, but it’s a good idea to include it when you run snapshot cleanups. A similar alter table statement is used, as shown in this example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE test_table EXECUTE remove_orphan_files(retention_threshold =&amp;gt; '7d')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;As you can see, Iceberg + Trino brings some very exciting features along with tremendous performance advantages to complete your modern data lake.&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>trino</category>
      <category>partitioning</category>
      <category>performance</category>
    </item>
    <item>
      <title>Introduction to Apache Iceberg in Trino</title>
      <dc:creator>Tom Nats</dc:creator>
      <pubDate>Wed, 26 Jul 2023 13:00:00 +0000</pubDate>
      <link>https://dev.to/starburstdata/introduction-to-apache-iceberg-in-trino-48g0</link>
      <guid>https://dev.to/starburstdata/introduction-to-apache-iceberg-in-trino-48g0</guid>
      <description>&lt;p&gt;&lt;em&gt;I originally &lt;a href="https://www.starburst.io/blog/introduction-to-apache-iceberg-in-trino/"&gt;posted this on Starburst's blog&lt;/a&gt;, as part of a series I've been publishing on Iceberg.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;Apache Iceberg is an open source table format that brings database functionality to object storage such as S3, Azure’s ADLS, Google Cloud Storage and MinIO. This allows an organization to take advantage of low-cost, high-performance cloud storage while providing data warehouse features and experience to their end users, without being locked into a single vendor.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Apache Iceberg?
&lt;/h3&gt;

&lt;p&gt;Apache Iceberg is a table format, originally created by Netflix, that provides database type functionality on top of object stores such as Amazon S3. Iceberg allows organizations to finally build true data lakehouses in an open architecture, avoiding vendor and technology lock-in.&lt;/p&gt;

&lt;p&gt;The excitement around Iceberg began last year and has greatly increased in 2022. Most of the customers and prospects I speak with on a weekly basis are either considering migrating their existing Hive tables to it or have already started. They are excited that a truly open source table format has been created, with many engines, both open source and proprietary, jumping on board.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advantages of Apache Iceberg
&lt;/h3&gt;

&lt;p&gt;One of the best things about Iceberg is its broad adoption by many different engines. In the diagram below, you can see that many different technologies can work on the same set of data, as long as they use the open source Iceberg API. The amount of work each engine has invested is a great indicator of the popularity and usefulness of this exciting technology.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z8b-TEEp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qsa0kczz269325eiec7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z8b-TEEp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qsa0kczz269325eiec7w.png" alt="Diagram showing how Iceberg works with many query engines and many object stores" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With more and more technologies jumping on board, Iceberg isn’t a passing fad. It has been growing in popularity not only because of how useful it is, but also because it’s a truly open source table format: many companies have contributed to and helped improve the specification, making it a true community-based effort.&lt;/p&gt;

&lt;p&gt;Here is a list of the many features Iceberg provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Choose your engine&lt;/strong&gt;: As you can see from the diagram above, there are many engines that support Iceberg. This offers the ultimate flexibility to own your own data and choose the engine that fits your use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid Data Lock-in&lt;/strong&gt;: The data Iceberg and these engines work on is YOUR data in YOUR account, which avoids data lock-in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid Vendor Lock-out&lt;/strong&gt;: Iceberg metadata is always available to all engines. So you can guarantee consistency, even with multiple writers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DML (modifying table data)&lt;/strong&gt;: Modifying data in Hadoop was a huge challenge. With Iceberg, data can easily be modified to adhere to use cases and compliance such as GDPR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema evolution&lt;/strong&gt;: Much like a database, Iceberg supports full schema evolution including columns and even partitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: Since Iceberg stores a table’s state in a snapshot, the engine simply reads the metadata in that snapshot and then retrieves only the needed data from storage, saving valuable time and reducing cloud object store retrieval costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database feel&lt;/strong&gt;: Partitioning can be performed on any column, and end users query Iceberg tables just like they would a database table.&lt;/li&gt;
&lt;/ul&gt;
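
&lt;p&gt;As a quick illustration of the schema evolution and “database feel” points above, Trino’s Iceberg connector supports familiar DDL (the table and column names here are purely illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE orders_iceberg ADD COLUMN discount_pct double;
ALTER TABLE orders_iceberg RENAME COLUMN discount_pct TO discount;
ALTER TABLE orders_iceberg DROP COLUMN discount;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;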

&lt;h2&gt;
  
  
  Iceberg Architecture
&lt;/h2&gt;

&lt;p&gt;Iceberg is a layer of metadata over your object storage. It provides a transaction log per table, very similar to a traditional database. This log keeps track of the current state of the table, including any modifications. It also keeps a current “snapshot” of the files that belong to the table, along with statistics about them, in order to greatly reduce the amount of data that needs to be read during queries, improving performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xCBDJXu9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b9ccooorons5fynfn2v8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xCBDJXu9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b9ccooorons5fynfn2v8.png" alt="Diagram showing how metadata is layered over object storage with Iceberg" width="800" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Snapshots
&lt;/h4&gt;

&lt;p&gt;Every time a modification to an Iceberg table is performed (insert, update, delete, etc.), a new snapshot of the table is created. When an Iceberg client (let’s say Trino) wants to query a table, the latest snapshot is read and the files that “belong” to that snapshot are read. This enables a very powerful feature called time travel, because the table at any given point contains a set of snapshots over time that can be queried with the proper syntax.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--caBY5M6k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7j152tdc8inkl2eid5j5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--caBY5M6k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7j152tdc8inkl2eid5j5.png" alt="Example of how snapshots of taken at different points in time" width="800" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under the covers, Iceberg uses a set of JSON and Avro files to keep track of this metadata. A Hive-compatible metastore is used to “point” to the latest metadata file that has the current state of the table. All engines that want to interact with the table first get the latest “pointer” from the metastore and then start interacting with Iceberg metadata files from there.&lt;/p&gt;

&lt;p&gt;Here is a very basic diagram of the different files that are created during a CTAS (create table as select):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hNsx5Ioj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1flpxvmoz93qovipll6q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hNsx5Ioj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1flpxvmoz93qovipll6q.png" alt="Diagram of the different files that are created during a CTAS" width="434" height="698"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metadata File Pointer (fp1)&lt;/strong&gt; – This is an entry in a Hive-compatible metastore (AWS Glue, for example) that points to the current metadata file. This is the starting point for any query against an Iceberg table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metadata File (mf1)&lt;/strong&gt; – A JSON file that contains the latest version of a table. Any change made to a table creates a new metadata file. The contents of this file are essentially a list of manifest list files plus some high-level metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manifest List (ml1)&lt;/strong&gt; – A list of the manifest files that make up a snapshot. It also includes metadata such as partition bounds, used to skip files that do not need to be read for a query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manifest File (mf1)&lt;/strong&gt; – Lists a set of data files along with metadata about them. This is the final step for a query: the engine uses these files to determine which data files actually need to be read, saving valuable querying time.&lt;/p&gt;

&lt;p&gt;Here is a sample table named &lt;code&gt;customer_iceberg&lt;/code&gt; that was created on S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;customer_iceberg-a0ae01bc83cb44c5ad068dc3289aa1b9/
  data/
    20221005_142356_18493_dnvqc-43a7f422-d402-41d8-aab3-38d88f9a8810.orc
    20221005_142356_18493_dnvqc-548f81e0-b9c3-4015-99a7-d0f19416e39c.orc
  metadata/
    00000-8364ea6c-5e89-4b17-a4ea-4187725b8de6.metadata.json
    54d59fe-8368-4f5e-810d-4331dd3ee243-m0.avro
    snap-2223082798683567304-1-88c32199-6151-4fc7-97d9-ed7d9172d268.avro
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Table directory&lt;/strong&gt; – This is the name of the table with a unique uuid in order to support table renames.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data directory&lt;/strong&gt; – This holds the ORC, Parquet, or Avro data files and may contain subdirectories depending on whether the table is partitioned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metadata directory&lt;/strong&gt; – This directory holds the metadata and manifest files covered above.&lt;/p&gt;
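
&lt;p&gt;If you want to poke at this metadata yourself, Trino exposes it through hidden metadata tables alongside each Iceberg table. For example, to list the snapshots of the sample table above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT committed_at, snapshot_id, operation
FROM "customer_iceberg$snapshots";
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;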

&lt;p&gt;Again, this might be too nitty-gritty for the average user, but the point is that a tremendous amount of thought and work has been put into Iceberg to ensure it can handle many different types of analytical queries along with real-time ingestion. It was built to fill the gap between low-cost cloud object stores and demanding processing engines such as Trino and Spark.&lt;/p&gt;

&lt;h4&gt;
  
  
  Partitioning
&lt;/h4&gt;

&lt;p&gt;Using partitions in Iceberg feels just like using them in a database. Most data you ingest into your data lake has a timestamp, and partitioning by that column is very easy:&lt;/p&gt;

&lt;p&gt;Example – partition by month of a timestamp column and by region:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table orders_iceberg
with (partitioning=ARRAY['month(create_date),region'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Querying using a standard where clause against the partitioned column will result in partition pruning and much higher performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from orders_iceberg
WHERE CAST(create_date AS date) BETWEEN date '1993-06-01' AND date '1993-11-30';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
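
&lt;p&gt;You can also inspect how the data is spread across partitions with the hidden &lt;code&gt;$partitions&lt;/code&gt; metadata table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM "orders_iceberg$partitions";
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;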



&lt;h3&gt;
  
  
  Trino's Iceberg Support
&lt;/h3&gt;

&lt;p&gt;Trino has full support for Iceberg, with the supported features listed below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create Table&lt;/li&gt;
&lt;li&gt;Modify Table (update/delete/merge)&lt;/li&gt;
&lt;li&gt;Add/Drop/Modify table column&lt;/li&gt;
&lt;li&gt;Rename table&lt;/li&gt;
&lt;li&gt;Rollback to previous snapshot&lt;/li&gt;
&lt;li&gt;View support (includes AWS Glue)&lt;/li&gt;
&lt;li&gt;Time travel&lt;/li&gt;
&lt;li&gt;Maintenance (Optimize/Expire Snapshots)&lt;/li&gt;
&lt;li&gt;Alter table partition&lt;/li&gt;
&lt;li&gt;Metadata queries&lt;/li&gt;
&lt;/ul&gt;
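
&lt;p&gt;To give a flavor of the maintenance entries above, in Trino these are plain SQL statements (the table name and retention threshold here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE orders_iceberg EXECUTE optimize;
ALTER TABLE orders_iceberg EXECUTE expire_snapshots(retention_threshold =&gt; '7d');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;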

&lt;p&gt;Using Iceberg in Trino is very easy. There is a dedicated connector page located &lt;a href="https://trino.io/docs/current/connector/iceberg.html"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;If you're new to Trino, &lt;a href="https://www.starburst.io/platform/starburst-galaxy/"&gt;Starburst Galaxy's&lt;/a&gt; free tier is the easiest and fastest way to test out the power of Trino and Iceberg. &lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>trino</category>
      <category>sql</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Iceberg, Delta Lake, and Hudi, oh my!</title>
      <dc:creator>Matt Fuller</dc:creator>
      <pubDate>Thu, 22 Jun 2023 16:01:07 +0000</pubDate>
      <link>https://dev.to/starburstdata/well-well-well-how-the-open-tables-have-turned-58og</link>
      <guid>https://dev.to/starburstdata/well-well-well-how-the-open-tables-have-turned-58og</guid>
      <description>&lt;p&gt;The rising popularity of the data lakehouse has led many to try to compare the merits of the open table formats underpinning this architecture: Apache Iceberg, Delta Lake, and Apache Hudi. If you look between the lines, the conversation is mostly driven by hype, making it hard to parse reality from marketing jargon. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xd_WKmR5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yfn6cougos205mk8uyqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xd_WKmR5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yfn6cougos205mk8uyqo.png" alt='Office meme of "how the turn tables" quote' width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article isn’t going to solve that problem. Instead, the goal is to introduce you to a new way of thinking about table formats – as a use case-level choice rather than an organization-level decision. &lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing a table format
&lt;/h2&gt;

&lt;p&gt;When deciding between table formats, it’s important to understand the similarities and differences that may impact performance and scalability. &lt;/p&gt;

&lt;p&gt;For example, Iceberg is currently the only table format with partition evolution support. This allows the partitioning scheme of a table to be changed without rewriting the table, and it enables queries to be optimized using any of the partition schemes.&lt;/p&gt;

&lt;p&gt;On the other hand, Iceberg’s streaming support lags behind Delta Lake and Hudi. So the question of which table format to pick becomes: which is more important to your business, partitioning or streaming?&lt;/p&gt;

&lt;p&gt;Now, any seasoned data engineer knows that it’s not that simple. You don’t just have a single type of data in your systems or a single way you’re looking to interact with that data. Instead, you’re dealing with streaming pipelines, batch jobs, ad hoc queries, and more – all at the same time. And you don’t get to control what is added to that mix in the future.&lt;/p&gt;

&lt;p&gt;All of these factors make the binary decision – partitioning or streaming, Iceberg or Delta Lake – almost impossible to get right at the organization level. But most vendors require you to do just that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starburst’s approach
&lt;/h2&gt;

&lt;p&gt;With Starburst, everything is built with openness in mind. We designed Starburst Galaxy to be interoperable with nearly any data environment, including first-class support for all modern open table formats.  &lt;/p&gt;

&lt;p&gt;This means that you can use the table format that is right for each of your workloads and change it when new needs emerge. You don’t need to worry about limited support for external tables or being locked into an old table format when new ones come along (and they will).&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;We wanted to make it as easy as possible to write to and read from different table formats, so we built Great Lakes connectivity – an under-the-hood process that abstracts away the details of using different table formats and file types. &lt;/p&gt;

&lt;p&gt;This connectivity is built into Starburst Galaxy, and is available to all users that are working with the following data sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon S3&lt;/li&gt;
&lt;li&gt;Azure Data Lake Storage&lt;/li&gt;
&lt;li&gt;Google Cloud Storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To create a table with one of these formats, you simply provide a “type” in the table DDL. Here is a simple example of creating an Iceberg table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE customer(
name varchar,
address varchar,
WITH (type='iceberg');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it! An Iceberg table has been created.&lt;/p&gt;
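
&lt;p&gt;The same DDL with a different &lt;code&gt;type&lt;/code&gt; value creates a Delta Lake table instead, so the format choice stays per table (the table name here is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE customer_delta (
    name varchar,
    address varchar
)
WITH (type = 'delta');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;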

&lt;p&gt;To read a table using Great Lakes connectivity, you simply issue a SQL select query against it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM customer; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again… that’s it! End users shouldn’t need to worry about file types or table formats; they just want to query their data.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>iceberg</category>
      <category>deltalake</category>
    </item>
    <item>
      <title>Inside Starburst's hackathon: Into the great wide open(AI)</title>
      <dc:creator>Matt Fuller</dc:creator>
      <pubDate>Wed, 21 Jun 2023 15:50:12 +0000</pubDate>
      <link>https://dev.to/starburstdata/inside-starbursts-hackathon-into-the-great-wide-openai-3nl8</link>
      <guid>https://dev.to/starburstdata/inside-starbursts-hackathon-into-the-great-wide-openai-3nl8</guid>
      <description>&lt;p&gt;Generative AI has been taking the tech world by storm lately. Indie developers to large enterprises are experimenting with its impact not only in day-to-day jobs but also in driving new feature innovations. &lt;/p&gt;

&lt;p&gt;Starburst is no different. We’re always looking for ways to improve our technology and make the lives of data engineers easier. The current buzz around OpenAI lined up perfectly with our yearly Hack-a-Trino, and our team came up with some pretty cool concepts. &lt;/p&gt;

&lt;p&gt;Check out the top three projects below and let us know which one is your favorite by voting on r/dataengineering. Who knows? You might just see the winning concept become a reality.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Project A: Automatic Data Classification (LLM-based tagging)&lt;/li&gt;
&lt;li&gt;Project B: Trino AI Functions&lt;/li&gt;
&lt;li&gt;Project C: No Code Querying with ChatGPT&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Project A: Automatic Data Classification
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Built by Alex Breshears, Cody Zwiefelhofer, David Shea, Elvis Le&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://player.vimeo.com/video/838350523" width="710" height="399"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt; Automatic Data Classification is a feature that analyzes and predicts the content of registered data sources based on a subset of the data itself. This project uses OpenAI’s LLM APIs to take samples of datasets and propose potential tags to data admins. &lt;/p&gt;

&lt;p&gt;Tags in Starburst Galaxy allow users to associate attributes with one or more catalogs, schemas, tables, views, or columns. Tags can be combined with attribute-based access control policies to ensure that each role has the appropriate access rights to perform actions on entities in the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why did we build it?&lt;/strong&gt; This model automatically identifies data structures within the data. As such, it takes less time for data admins to learn what kind of data has been ingested and appropriately mark that data – increasing productivity and the accuracy of tags.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural Considerations:&lt;/strong&gt; A critical requirement for this feature is that the data itself must be analyzed - not just the metadata. There are many reasons for this, but the most obvious is that columns are rarely named for what they actually contain. &lt;/p&gt;

&lt;h2&gt;
  
  
  Project B: Trino AI Functions
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Built by Karol Sobczak, Mainak Ghosh&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://player.vimeo.com/video/838356293" width="710" height="399"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt; Trino AI Functions (aka LLMs in the context of complex aggregations) uses ChatGPT and Hugging Face to translate natural language text into complex aggregation logic that can be executed directly on the Trino engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why did we build it?&lt;/strong&gt; SQL is heavily used for data analysis and at the same time the democratization of ML means that customers now want to do more with Trino queries. Trino AI Functions will help customers write expressive queries that can do language translation, fraud detection, sentiment analysis and other NLP tasks. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural Considerations:&lt;/strong&gt; One powerful feature of LLMs is that you can ask models to provide you a structured answer from unstructured data like JSON documents. &lt;/p&gt;

&lt;h2&gt;
  
  
  Project C: No Code Querying with ChatGPT
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Built by Lukas Grubwieser and Rob Anderson&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://player.vimeo.com/video/838352967" width="710" height="399"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is it?&lt;/strong&gt; No code querying with ChatGPT is a feature that would let Galaxy users write natural language queries against data sources connected to Starburst Galaxy. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why did we build it?&lt;/strong&gt; This would allow business analysts to ask questions of Galaxy without having to write complex SQL or understand the underlying data architecture. It brings Starburst one step closer to our vision of data democratization. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural Considerations:&lt;/strong&gt; This project requires a three-part architecture: a frontend, OpenAI engine (with backend server), and Starburst. The frontend takes the question and OpenAI translates it to SQL which is then processed by Starburst. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Honorary Mention:&lt;/strong&gt; Accelerating Cyber Analytics with Aderas, Starburst, and ChatGPT&lt;/p&gt;

&lt;p&gt;While this next effort wasn’t part of our internal hackathon, we thought it was too cool not to share. Our partner Aderas built a POC of an insider threat investigation model using Starburst and ChatGPT for cyber security analytics. &lt;/p&gt;

&lt;p&gt;Check out the demo:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/gnLeAsnkWU4?start=1620"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Starburst’s Learnings on Gen AI
&lt;/h2&gt;

&lt;p&gt;While we built a lot of cool things during the hackathon, we also learned a lot. We documented a couple of our team’s key learnings below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s really shown us how quickly we can generate new features and ideas that were traditionally much harder for a business like ours to innovate on.&lt;/li&gt;
&lt;li&gt;LLMs definitely won’t solve everything, but they’re a good starting point. &lt;/li&gt;
&lt;li&gt;It’s been fun to iterate on, but we’re also largely waiting for the models to become more accurate at query generation (since we think correctness matters!). That said, it’s exceptionally cool to see what the newer models are capable of in terms of generating syntactically correct ANSI SQL.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>analytics</category>
      <category>development</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
