<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Isaac Obezo</title>
    <description>The latest articles on DEV Community by Isaac Obezo (@isaacobezo).</description>
    <link>https://dev.to/isaacobezo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1123651%2F694cd34d-4d53-4a59-8beb-8f5dc660a8c2.jpeg</url>
      <title>DEV Community: Isaac Obezo</title>
      <link>https://dev.to/isaacobezo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/isaacobezo"/>
    <language>en</language>
    <item>
      <title>What data engineers can learn from lean manufacturing</title>
      <dc:creator>Isaac Obezo</dc:creator>
      <pubDate>Tue, 12 Sep 2023 14:15:11 +0000</pubDate>
      <link>https://dev.to/starburstdata/what-data-engineers-can-learn-from-lean-manufacturing-1ab9</link>
      <guid>https://dev.to/starburstdata/what-data-engineers-can-learn-from-lean-manufacturing-1ab9</guid>
      <description>&lt;p&gt;If you read my &lt;a href="https://dev.to/starburstdata/how-starbursts-data-engineering-team-builds-resilient-telemetry-data-pipelines-3m47"&gt;previous blog post&lt;/a&gt;, you’ll know that I’ve had a long career in data engineering – 20 years, maybe more. The vast majority of work that I have done can be summed up as  moving data from source A to target B, in order to ready it for analytics. I know that this might be a vast oversimplification, but the root of everything I have done in the past can be reduced to that simple process. &lt;/p&gt;

&lt;p&gt;I find myself thinking about this more often because working at Starburst has changed how I think about my work and my role. Starburst offers a shift – a shift left to a leaner model of delivering data for analytics, bringing the compute to the data instead of the data to the compute.&lt;/p&gt;

&lt;p&gt;The lean model is well known in the field of manufacturing, and I find it an interesting parallel with Starburst’s approach to data engineering. The lean principle treats as waste anything that doesn’t add value the customer is willing to pay for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lean Manufacturing vs. Lean Data Engineering
&lt;/h2&gt;

&lt;p&gt;Before lean manufacturing, manufacturers would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build a warehouse – hoping that it is the right size and shape to allow them to scale their production&lt;/li&gt;
&lt;li&gt;Move raw materials into the warehouse – requiring lots of money up front in order to build up a reasonable amount of stock&lt;/li&gt;
&lt;li&gt;Build the “product” from the raw materials based on forecasting – fulfilling quotas and sales projections instead of actual orders&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this is done BEFORE the first order. (We did understand what the customer wanted, ahead of time, correct?)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.jois.eu/files/DekierV_5_N1.pdf" rel="noopener noreferrer"&gt;Lean manufacturing&lt;/a&gt; changed the world. The new process was based on demand to minimize waste, meaning manufacturers would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find a scalable warehouse solution – one that’s small but can grow with you&lt;/li&gt;
&lt;li&gt;“Pull” raw materials based on demand – keeping upfront costs down instead of stockpiling&lt;/li&gt;
&lt;li&gt;Build products only when an order is placed – responding to demand instead of predicting demand&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This new approach made manufacturers quicker and more efficient. They could now change their products and their supply based on shifting consumer preferences or fluctuating demand.&lt;/p&gt;

&lt;p&gt;The world of data engineering is very similar to traditional manufacturing. Data engineers typically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buy a costly data warehousing solution&lt;/li&gt;
&lt;li&gt;Pipeline as much data as possible into the warehouse, based on assumptions (good and bad) about the data and how it will be used&lt;/li&gt;
&lt;li&gt;Serve the warehouse data to end users – and hope it was what they needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You’ll notice that in traditional data engineering, much like traditional manufacturing, customers are at the end of the process – after we transform those raw data materials into a product or service.&lt;/p&gt;

&lt;p&gt;Starburst Galaxy takes the lean approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connect to data sources&lt;/li&gt;
&lt;li&gt;Land data ONLY when you need to transform or process that data&lt;/li&gt;
&lt;li&gt;Serve up prepared data to end users via data products&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Putting Concept to Practice
&lt;/h2&gt;

&lt;p&gt;Let’s take a moment to think about this with an example. Let’s revisit how to set up &lt;a href="https://www.starburst.io/blog/2022-11-30-dbt0-introduction-html/2022-11-30-dbt1-trino-setup-html/" rel="noopener noreferrer"&gt;Trino for dbt&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the demo provided by Michiel De Smet and Przemek Denkiewicz, they build a 360 view of the &lt;a href="https://github.com/dbt-labs/jaffle_shop" rel="noopener noreferrer"&gt;Jaffle shop&lt;/a&gt; customer using dbt and Starburst Galaxy. If we had managed this in a more traditional manner, we as data engineers would have had to perform the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Land Your Relational Data&lt;/strong&gt;: We would need to move the relational data into the warehouse, making sure that we land it in the shape analytics requires. Sometimes we will need to cast fields as a timestamp, boolean, or integer, because we have to understand the data as we move it to the warehouse. Sometimes we miss the mark, but more often (as data engineers) we are exactly right – it is that “sometimes” that scares me a little.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Land Your Semi-Structured Data&lt;/strong&gt;: Well, we landed the data, and it was relational, which is such a blessing when moving data to the warehouse. But now we have to manage JSON, the enemy of structured data. We have this click data, and we now need to manage it. In this specific case the JSON is not too unwieldy, but that is not true of all semi-structured data. When data engineers get to semi-structured data, we have to make a lot of decisions about it. In a normal world, we share the grief of these decisions with a group of peers, so at least we have that going for us. But we still have to make a decision, and we still have to rationalize how the data relates to the customer 360 story.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quality Check&lt;/strong&gt;: We missed something, some assumption is wrong (you will hear a lot of this from me), and now we have to rebuild the models that we assumed were right, from that terrible JSON data and rectify the assumptions we (as a community) have made.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The above steps were my life for 20 years. Repeat and repeat. &lt;/p&gt;

&lt;p&gt;However, with Starburst Galaxy, there’s a new path. Instead of warehousing the data, fixing the data, warehousing the new data, and then fixing the data again, Michiel De Smet and Przemek Denkiewicz show us how to connect directly to the original source of information, pull what we need to understand, and deal with misconceptions at that point of understanding. This allows us to respond to changes faster and connect that data to the business quickly and effectively. Responding to problems is no longer a story of reconfiguring the warehouse, but of adjusting a connector.&lt;/p&gt;
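&lt;p&gt;To make this concrete, here is a minimal sketch – with hypothetical catalog, schema, and column names, not taken from the demo – of what pulling directly from the source looks like in Trino SQL through a Galaxy catalog. The cast happens at query time, so a wrong assumption becomes a one-line fix instead of a reload:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Hypothetical names: "jaffle_postgres" is a Galaxy catalog pointing at
-- the source database; the staged view lives in the lake catalog.
CREATE OR REPLACE VIEW lake.jaffle_stage.stg_orders AS
SELECT
  CAST(id AS BIGINT)       AS order_id,
  CAST(order_date AS DATE) AS order_date,
  status
FROM jaffle_postgres.public.orders;
&lt;/code&gt;&lt;/pre&gt;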

&lt;p&gt;Because we are pulling data on demand instead of pushing it into a warehouse, we are able to respond to the changing topology of requests from our data customers. As a data engineer, I am focused on building the right analytic models instead of maintaining pipelines. I do enjoy maintaining a positive relationship with pipelines, but these relationships can become complicated quickly. And focusing on pipelines, and on warehousing data, takes time and energy away from providing those curated models.&lt;/p&gt;

&lt;h2&gt;
  
  
  So should you adopt the lean approach to data engineering?
&lt;/h2&gt;

&lt;p&gt;On my team here at Starburst, I’ve seen that adopting the lean approach to data engineering can offer a range of significant benefits. Embracing the lean principles has led to streamlined workflows, reduced costs, quicker response times, and enhanced overall data quality.&lt;/p&gt;

&lt;p&gt;It has required me to change how I think about data engineering and how we deliver value to our customers – away from warehousing data and toward pulling data on demand.&lt;/p&gt;

</description>
      <category>lean</category>
      <category>starburstdata</category>
    </item>
    <item>
      <title>How Starburst’s data engineering team builds resilient telemetry data pipelines</title>
      <dc:creator>Isaac Obezo</dc:creator>
      <pubDate>Thu, 24 Aug 2023 14:17:13 +0000</pubDate>
      <link>https://dev.to/starburstdata/how-starbursts-data-engineering-team-builds-resilient-telemetry-data-pipelines-3m47</link>
      <guid>https://dev.to/starburstdata/how-starbursts-data-engineering-team-builds-resilient-telemetry-data-pipelines-3m47</guid>
      <description>&lt;p&gt;In a previous life, I had the privilege of being part of a team that brought a data warehousing solution to my company. It’s a tale as old as time. Data was stashed and stored everywhere. We needed to build a bridge between where that data was and push it to where it was not, so we adopted a &lt;a href="https://www.starburst.io/learn/data-fundamentals/data-warehouse/" rel="noopener noreferrer"&gt;data warehouse&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;And it worked well – until it didn’t. I’ll go into more details on my cloud data warehouse learnings in a future series, but for my first Starburst post, I wanted to start simple – how we are using Starburst at Starburst.&lt;/p&gt;

&lt;p&gt;As the &lt;a href="https://www.starburst.io/learn/data-fundamentals/data-engineering/" rel="noopener noreferrer"&gt;data engineering&lt;/a&gt; team, our goal is to provide a data analytics platform that enables self-service, provides clarity and insights, is trustworthy, and has quality and testing built into the data for the end users (or systems). That’s a fancy way of saying we are obsessed with being data-driven and giving more people access to &lt;a href="https://www.starburst.io/learn/data-fundamentals/data-quality/" rel="noopener noreferrer"&gt;quality data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As with any good data engineering platform, our first step was to build out a data curation lifecycle. For this initiative, we chose to use &lt;a href="https://www.starburst.io/blog/build-and-run-scalable-transformation-pipelines-using-dbt-cloud-starburst/" rel="noopener noreferrer"&gt;Starburst Galaxy paired with dbt&lt;/a&gt;. The rest of this blog will talk about our experience with Starburst Galaxy and dbt compared to the traditional warehousing model. &lt;/p&gt;

&lt;h2&gt;
  
  
  Designing the curation lifecycle
&lt;/h2&gt;

&lt;p&gt;When architecting the data curation lifecycle, my team wanted to organize the data and the curation layers by Topic – the category or bucket of information that the data models represent. In this blog, Telemetry is our Topic, which means the data curation lifecycle stages for Telemetry become:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Telemetry_Source – where data is landed&lt;/li&gt;
&lt;li&gt;Telemetry_Stage – where data is structured&lt;/li&gt;
&lt;li&gt;Telemetry_Curation – where data is served&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes this is referred to as the medallion architecture, where you have bronze (raw data), silver (cleaned data), and gold (curated data) tables. At Starburst, we like to use land, structure, and consume to communicate the same paradigm, which pairs perfectly with our data lifecycle stages: source, stage, and data product (AKA curation).&lt;/p&gt;

&lt;p&gt;Now that we have a methodology for categorizing data, the next step is building out the curations. But we want to create an efficient and scalable process that allows us to iterate quickly, deploy, document, and test, as well as provide the best possible data to the curation layer.&lt;/p&gt;

&lt;p&gt;Ideally, we want to be able to deploy from our repositories on Git with a CI/CD (continuous integration, continuous deployment) strategy, and to provide lineage, test results, data dictionaries, and catalogs as part of this process. Our team goal was to build as much self-service as possible into the curation layer, from a single source of truth (Git).&lt;/p&gt;

&lt;h2&gt;
  
  
  The traditional warehousing approach
&lt;/h2&gt;

&lt;p&gt;In my previous life, I would’ve built these pipelines by pulling the data from a source of information and then landing that data (either raw or after some in-stream transformations) into a large warehouse. Pretty straightforward. Then, I would’ve implemented a service or orchestration layer, to help manage those connections. Only at that point would I be able to start asking questions of data that would normally span domains of information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7tfowpaj16g7thrrl4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7tfowpaj16g7thrrl4e.png" alt=" " width="764" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem with this approach is that warehousing pipelines are fragile and prone to human error. There are a couple of common breakpoints and inefficiencies I ran into over and over again:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incorrect data structure assumptions:&lt;/strong&gt; As the data engineer, I would need to make determinations about the structure of the data. Oftentimes, I chose to land this data as blobs or varchar and figure out the rest downstream. Most of the time this worked and the assumptions I made were right, but other times not, leading to a poor-quality pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time spent on pipeline maintenance:&lt;/strong&gt; &lt;a href="https://www.starburst.io/learn/data-fundamentals/data-pipeline/" rel="noopener noreferrer"&gt;Data pipelines&lt;/a&gt; require maintaining, updating, validating, and managing connectors. If anything breaks, you have to take the time to understand what the pipeline does, what transformations were performed, and what downstream systems it feeds. I often referred to this pipeline work as “data archeology”. The pipeline needed to be deciphered, the origins discovered, the transformations vetted and understood. For me, this was hard enough to do with my own pipelines, let alone deconstructing another data engineer’s work. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short-sighted naming conventions:&lt;/strong&gt; In a traditional warehouse, the reality is that data has to live within the same schema, which means table and view names take on extra importance. Is a table always prefaced with “tbl_” or a view with “vx_”? We are constrained by our own ability to name things that are informative and forward compatible. This is a lot to manage, even without tracking lineage, writing documentation, and creating data dictionaries. The organization of data to build our analytic models can be a quagmire all on its own.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t meant to be a list of complaints against warehousing pipelines. I really thought these problems were just part of the job. Not only do we work with data, we work with the entire ecosystem (and all its good or bad) as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starburst Galaxy and dbt
&lt;/h2&gt;

&lt;p&gt;After being introduced to Starburst, I realized that a lot of the day-to-day pipeline pains I had classified as inevitable could be fixed because of how Starburst operates. &lt;/p&gt;

&lt;p&gt;Starburst Galaxy, Starburst’s fully managed data lake analytics platform, acts as a non-destructive abstraction layer on top of your data sources. We can develop a strategy, hate it, and develop a new organizational strategy without a lengthy migration process. While optimized for data lake analytics, Galaxy works across any data source, from data warehouses to operational databases to storage solutions as simple as Google Sheets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6knv0png5bo4bg2923r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6knv0png5bo4bg2923r.png" alt=" " width="773" height="627"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our deployment of Starburst Galaxy, we arranged our telemetry data in our new data catalog, “telemetry-catalog”, which is connected to our AWS S3 data lake containing all the curation stages, either as S3 objects or external tables.&lt;/p&gt;

&lt;p&gt;In order to move our lake data through the curation stages, we embraced dbt. With dbt, we are able to build the models, build the dependencies, create tests and document our models. Said differently, we are building our telemetry pipelines with dbt and using Galaxy to manage our connections. This means that instead of switching between many systems, we can manage both source and destination data through a single system – Galaxy.&lt;/p&gt;

&lt;p&gt;Specifically, we use Galaxy’s federation capabilities (aka cross-schema joins) to create staged views – or materialized if necessary – from source information, meaning we are able to build the curation layer without creating any pipelines.&lt;/p&gt;
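&lt;p&gt;As a rough illustration – the schema and column names here are invented for the sketch – a federated staged view can join lake data against an operational database in a single statement, with no pipeline in between:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Cross-catalog join: telemetry events in the lake, accounts in a
-- hypothetical operational Postgres, both reachable through Galaxy.
CREATE OR REPLACE VIEW "telemetry-catalog".telemetry_stage.events_enriched AS
SELECT
  e.event_id,
  e.event_ts,
  a.account_name
FROM "telemetry-catalog".telemetry_source.events AS e
JOIN crm_postgres.public.accounts AS a
  ON e.account_id = a.account_id;
&lt;/code&gt;&lt;/pre&gt;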

&lt;p&gt;With dbt, we are able to connect to Galaxy and run all of our operations through Galaxy, defining them in our dbt_project.yml. Managing the different objects, transformations, and curation models through dbt and Galaxy means that we can also build out a complete lineage of each model, with comments and tests, and track the dependencies that each of our curation models is built from.&lt;/p&gt;
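&lt;p&gt;In practice, a curation model in this kind of setup is just a SQL file in the dbt project; here is a minimal, hypothetical example (the model and column names are illustrative, not our actual models):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- models/telemetry_curation/curated_usage.sql
-- Materialization can be set inline like this or in dbt_project.yml;
-- Galaxy executes the SQL that dbt compiles.
{{ config(materialized='table') }}

SELECT
  account_id,
  COUNT(*)      AS event_count,
  MAX(event_ts) AS last_seen_at
FROM {{ ref('stg_events') }}
GROUP BY account_id
&lt;/code&gt;&lt;/pre&gt;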

&lt;p&gt;Here’s an example of what this looks like for our telemetry data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5j0js1wwq0zkfnhsp8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5j0js1wwq0zkfnhsp8j.png" alt=" " width="800" height="590"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you might have figured out, I’m a big fan of dbt and Galaxy. After building our data models with dbt and Galaxy, we’ve seen the following benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Increased data quality:&lt;/strong&gt; Testing is easy to define and build into the process with dbt. Tests run as part of our CI/CD pipelines, with results reported, and the artifacts dbt generates give data engineers a copy of each test as complete SQL code. Quality is a feature of good curations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain-oriented approach:&lt;/strong&gt; We are able to work within domains (or topics) instead of spending time trying to decipher the tools and processes. This allows us to bring a higher level of quality to the data, be more efficient with our workflows, and always be improving self-service with team members that are domain experts. We’ve found Galaxy’s data products and comment capabilities especially useful here. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More time spent building vs. maintaining:&lt;/strong&gt; Galaxy handles the integration for the models, so my team can spend more time building new models. With Galaxy, if the catalog exists and is assigned to the right permissions and clusters, the model is integrated within the data ecosystem. While dbt defines the model, Galaxy serves up the model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More agility and autonomy:&lt;/strong&gt; Ultimately, data engineers want to move quickly, without being bogged down by many thousands of dependencies. Galaxy and dbt, with a good and clear strategy, allow the DE team to work as autonomously and asynchronously as possible. We are a global company, after all.&lt;/li&gt;
&lt;/ul&gt;
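&lt;p&gt;On the testing point above: alongside dbt’s built-in generic tests, singular tests are plain SQL files that fail when they return rows. A hypothetical example, against an illustrative staged model:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- tests/assert_no_future_events.sql
-- The test fails if any rows come back: no event should be
-- timestamped in the future.
SELECT event_id, event_ts
FROM {{ ref('stg_events') }}
WHERE event_ts &amp;gt; current_timestamp
&lt;/code&gt;&lt;/pre&gt;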

&lt;p&gt;All of the above makes it easier for my team to scale. Scalability is something I have struggled with in previous work, because data engineers can become specialists quickly. As mentioned previously, pipelines can become black boxes, and we then spend our time understanding where and what needs to be done. This type of workload does not scale – it is impossible to scale this work … and I like to take vacations.&lt;/p&gt;

&lt;p&gt;Galaxy manages the connections, but dbt really delivers a method to build a workflow that we can scale. We are able to treat many of our curation projects as a normal development lifecycle, and leverage the rigor from that workflow to build scalable solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concluding thoughts
&lt;/h2&gt;

&lt;p&gt;I am an imperfect engineer, and often fail at tasks. I like to tell my team that “I really try and do nothing that my future self will want to kill me for”. &lt;/p&gt;

&lt;p&gt;The reality is that, like my previous data warehouse implementation, not every strategy will work. Instead, my goal is to create an environment where analytical models are constantly being improved upon and the work that drives those models can change when change is (inevitably) necessary. &lt;/p&gt;

&lt;p&gt;Starburst Galaxy allows us to create, explore, understand and build pipelines, but more importantly it allows us to change and not destroy when we realize our approaches are not quite working. &lt;/p&gt;

&lt;p&gt;With dbt and Galaxy, my role has evolved from maintaining many different systems, configurations, and SQL into managing a single dbt project and spending more time building new projects. I often find myself having eureka moments with Starburst Galaxy, seeing how breaking some of our old paradigms allows us, the data engineers, to build and grow a data-driven culture within Starburst and everywhere else.&lt;/p&gt;

</description>
      <category>starburstdata</category>
      <category>dbt</category>
      <category>dataengineering</category>
      <category>starburstgalaxy</category>
    </item>
  </channel>
</rss>
