The evolution of Data Engineering and the role of ELT tools

Data engineering has progressed rapidly in the past three decades. The warp-speed changes in the field have created a significant knowledge gap for existing data engineers, people interested in moving into a data engineering career, data scientists, machine learning engineers, BI and analytics teams, software and infrastructure teams, as well as executives who want to better understand how data engineering fits into their companies.

In the data engineering space, a good deal of ceremony occurs around data movement and processing in order to effectively support downstream use cases such as data science (AI/ML), business intelligence, and operational analytics in production. It therefore comes as no surprise that data movement and processing practices and tools are at the forefront of the data engineering evolution.

This article begins by explaining the long-established data movement and processing pattern known as ETL and the shift to the newer pattern known as ELT. It then covers the ELT process in detail and its benefits. Subsequent sections highlight the top ELT tools, describe the Airbyte approach to ELT, and feature key data engineering predictions that might redefine the way we work with, process, and harness data throughout 2025 and beyond.

The evolution of Data Engineering

The Birth of Data Engineering

Data engineering as a practice has existed in some form since companies started doing things with data, such as predictive analysis, descriptive analytics, and reports. Before it came into sharp focus as a distinct field alongside the rise of data science in the 2010s, the practice had been branded in a whole host of different ways, including database administration, data analysis, business intelligence engineering, database development, and more.

The birth of Data Engineering can arguably be traced back to data warehousing, originating as early as the 1970s, with the business data warehouse taking shape in the 1980s, and Bill Inmon officially coining the term "data warehouse" in 1989. The advent of database technology in this period saw enterprises employ online transactional processing (OLTP) systems, which offered efficient methods for storing, querying, and updating transactional and operational data, typically managed by relational database management systems (RDBMS). OLTP systems were designed for application-oriented data collection and maintaining the most current state of the enterprise, optimised for multiple, concurrent, and fast reads and writes, ensuring ACID properties (atomicity, consistency, isolation, durability).

With the ability to manage data logistics, the next logical step for enterprises was to leverage this data for insights and profitability. This led to the emergence of online analytical processing (OLAP) systems around the mid-1990s, which became the cornerstone of business intelligence and decision support. OLAP systems utilise a data warehouse (DW) or enterprise data warehouse (EDW) specifically designed for analytical processes, maintaining historical and cumulative data and performing heavy operations like user-defined functions, aggregates, and complex joins for business analysis.

Emergence of ETL

To move data from OLTP systems (the current state) to OLAP systems (the historical data), ETL (Extract, Transform, Load) processes emerged. In its original form, still relevant today, ETL involves identifying and extracting relevant data from various sources, transforming it for cleansing and customisation, and finally loading it into a data warehouse. This often involves a Data Staging Area (DSA) where transformations take place before loading into fact and dimension tables. Early ETL processes faced challenges such as schema mapping, data cleansing and quality, complex transformations, and a lack of standardisation.
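
To make the classic ETL ordering concrete, here is a minimal sketch in Python. It is not taken from any particular tool: sqlite3 stands in for both the OLTP source and the warehouse, and the table names and cleansing rules are assumptions made purely for illustration.

```python
# Minimal ETL sketch: data is cleansed and shaped *before* it reaches the warehouse.
import sqlite3

source = sqlite3.connect(":memory:")     # stand-in for the OLTP system
warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse

source.execute("CREATE TABLE orders (order_id INTEGER, amount TEXT, currency TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "19.99", "usd"), (2, None, "EUR")])
warehouse.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL, currency TEXT)")

# Extract: pull the relevant rows out of the operational system.
rows = source.execute("SELECT order_id, amount, currency FROM orders").fetchall()

# Transform: cleanse and standardise in the pipeline (or a staging area),
# outside the warehouse -- the defining characteristic of ETL.
cleaned = [
    (order_id, round(float(amount), 2), currency.upper())
    for order_id, amount, currency in rows
    if amount is not None  # simple data quality rule: drop incomplete records
]

# Load: only the finished, analysis-ready rows land in the warehouse fact table.
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleaned)
warehouse.commit()
```

The ordering is the point: by the time data reaches the warehouse it has already been cleaned, which made sense when the warehouse itself was the slow, constrained resource.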

The Shift to ELT

A significant shift in data engineering occurred with the advent of the modern data stack, catalysed by the release of Amazon Redshift, a cloud-native massively parallel processing (MPP) / OLAP database, in October 2012. While ETL served as the primary method for data processing for decades, the evolution of cloud technologies led to the rise of Extract, Load, Transform (ELT) as a modern alternative. Traditionally, data was transformed before loading into the data warehouse because the warehouse was often too slow and constrained to handle heavy processing itself. Business intelligence (BI) tools also performed local data processing to circumvent warehouse bottlenecks, and data processing was centrally governed to avoid overwhelming the warehouse.

The cloud, a significant 21st-century innovation, revolutionised how data is extracted, loaded, and transformed. The cloud flips the on-premises model by offering rented hardware and managed services, allowing for dynamic scaling of resources. This scalability and the pay-as-you-go model of cloud data warehouses have made them accessible even to smaller companies.

In the ELT data warehouse architecture, data is moved more or less directly from production systems into a staging area within the data warehouse in its raw form. Transformations are then handled directly within the data warehouse, leveraging the massive computational power of cloud data warehouses and processing tools. This data is processed in batches, and the transformed output is written into tables and views for analytics. ELT is also popular today in streaming arrangements, where events are streamed and subsequently transformed within the data warehouse.

Understanding the ELT Process

The ELT process comprises three main stages: Extract, Load, and Transform.

  • Extract: This involves retrieving data from various source systems, which can be pull-based or push-based and may require reading metadata and schema changes. Data is extracted from sources like relational databases, CRM systems, cloud applications, or APIs – essentially the data intended for eventual analytics use. Accessing these diverse sources can be simplified by managed data connector platforms and frameworks like Airbyte, reducing the need for custom development. These tools automate pipeline creation and management, extracting data and loading it into data warehouses via user interfaces.
  • Load: Once extracted, data is loaded into a target data platform (data warehouse, data lake). Unlike ETL, ELT loads raw data immediately after extraction, making data available for analysis much faster. This step is efficient as it requires no upfront transformation. This process is often referred to as data ingestion – the movement of data from a source to a destination. Data integration, in contrast, combines data from disparate sources into a new dataset. Ingestion processes can be batch, micro-batch, or real-time.
  • Transform: In the final step, the raw data loaded into the data warehouse is transformed for analytical use. This often involves light transformations like casting data types, standardising time zones, and renaming fields, as well as heavy transformations that incorporate business logic, create materialisations, and join data. Data quality checks (QA) are also crucial during this stage. A minimal end-to-end sketch of these three steps follows this list.
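
For contrast with the ETL sketch earlier, here is a minimal ELT sketch under the same assumptions (sqlite3 standing in for a cloud warehouse, invented table names): the raw records are loaded untouched into a staging table, and the transformation then runs as SQL inside the warehouse.

```python
# Minimal ELT sketch: load raw data first, then transform it inside the warehouse.
# sqlite3 stands in for a cloud data warehouse (its built-in JSON functions mimic
# handling semi-structured payloads); table names are illustrative only.
import json
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Extract: pull raw records from a source (an API response, in this sketch).
raw_records = [
    {"order_id": 1, "amount": "19.99", "currency": "usd"},
    {"order_id": 2, "amount": None, "currency": "EUR"},
]

# Load: land the data as-is in a raw/staging table, with no upfront transformation.
warehouse.execute("CREATE TABLE raw_orders (payload TEXT)")
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?)",
    [(json.dumps(r),) for r in raw_records],
)

# Transform: run SQL inside the warehouse to produce an analytics-ready table.
warehouse.executescript("""
    CREATE TABLE orders_clean AS
    SELECT
        json_extract(payload, '$.order_id')              AS order_id,
        CAST(json_extract(payload, '$.amount') AS REAL)  AS amount,
        UPPER(json_extract(payload, '$.currency'))       AS currency
    FROM raw_orders
    WHERE json_extract(payload, '$.amount') IS NOT NULL;  -- keep only complete records
""")
warehouse.commit()
```

Because the warehouse now holds both the raw payloads and the derived table, the transformation can be revised and re-run at any time without going back to the source.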

Benefits of ELT

ELT offers several advantages, particularly in cloud-based environments where scalability, flexibility, and performance are paramount. Key benefits include:

Scalability: ELT leverages the vast computational power of modern cloud data warehouses, enabling effortless scaling of data pipelines to handle growing data volumes without performance bottlenecks.

Faster data availability: By loading raw data immediately, ELT makes data available for analysis much quicker than ETL, which is crucial for organisations needing near-real-time insights.

Cost efficiency: ELT reduces the need for expensive on-premise ETL tools and infrastructure by offloading processing to the cloud and utilising pay-as-you-go resources.

Flexibility: ELT allows for more flexibility in data transformation, as analysts can apply transformations iteratively to adapt to changing business requirements since raw data is readily available in the warehouse.

Simplified pipelines: The ELT process simplifies data pipelines by eliminating the need for upfront data transformation before loading, reducing complexity and improving overall pipeline management.

Adoption of software development best practices: Performing transformations last in the pipeline allows for code-based and version-controlled transformations, enabling features like easy recreation of historical transformations, code-based tests, CI/CD workflows, and documentation of data models like typical software code.
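
As a small, hypothetical illustration of that last point, the snippet below keeps a warehouse transformation as plain, version-controllable code and pairs it with an automated test that could run in a CI workflow. The model SQL, table names, and the assertion are invented for the example; in practice a tool like dbt fills this role with .sql models and schema tests.

```python
# Sketch: a transformation defined as code, with a test that CI can run on every change.
import sqlite3

ORDERS_CLEAN_SQL = """
    CREATE TABLE orders_clean AS
    SELECT order_id, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE amount IS NOT NULL
"""

def build_model(conn):
    """Materialise the orders_clean model inside the warehouse."""
    conn.execute(ORDERS_CLEAN_SQL)

def test_orders_clean_has_no_null_amounts():
    # CI would point this at a disposable schema; an in-memory database is used here.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", [(1, "10.0"), (2, None)])
    build_model(conn)
    nulls = conn.execute(
        "SELECT COUNT(*) FROM orders_clean WHERE amount IS NULL"
    ).fetchone()[0]
    assert nulls == 0, "orders_clean must not contain NULL amounts"
```

Because both the model and its test live in version control, a historical transformation can be recreated from any commit and every change is validated before it reaches production.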

Top ELT Tools

The modern data stack, which facilitates the ELT workflow, comprises various tools that have become reasonably consistent over time. These can be broadly categorised as:

Ingestion: Tools like Airbyte, Fivetran, and Stitch simplify the extraction and loading of data.

Warehousing/Lakehouse Platforms: Cloud data warehouse and lakehouse platforms such as BigQuery, Databricks, Redshift, and Snowflake serve as the primary storage and transformation environment.

Transformation: dbt (data build tool) has emerged as a popular tool specifically for the transformation step within the data warehouse.

BI: Tools like Looker, Mode, Periscope, Chartio, Metabase, and Redash are used for data visualisation and analysis.

Workflow Orchestration Tools: While not strictly ELT tools, orchestrators like Apache Airflow are essential for scheduling and managing the entire ELT pipeline.
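
As a sketch of what that orchestration looks like, the DAG below (written against Airflow 2.x, as an assumption) schedules a daily run in which the extract-and-load step must succeed before the in-warehouse transformation starts. The task bodies are placeholders; a real pipeline would typically trigger an ingestion tool such as Airbyte and a dbt job at those points.

```python
# Sketch of an Airflow DAG orchestrating an ELT pipeline: EL first, then T.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_and_load():
    # Placeholder: trigger the ingestion tool (e.g. an Airbyte connection sync).
    print("raw data loaded into the warehouse staging area")

def transform():
    # Placeholder: run the in-warehouse SQL models (e.g. a dbt job).
    print("staging tables transformed into analytics tables")

with DAG(
    dag_id="daily_elt_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    el_task = PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)
    t_task = PythonOperator(task_id="transform", python_callable=transform)

    el_task >> t_task  # the transform only runs after the load succeeds
```

The `>>` dependency encodes the ELT ordering at the pipeline level: the orchestrator does not move the data itself, it just ensures each tool runs at the right time and in the right order.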

As stated above, it is evident that some tools focus primarily on data integration (the EL part of ELT), while others, like dbt, focus on transformation (the T part). Some tools can also perform both ETL and ELT. Cloud vendors also offer proprietary services for storage and databases, often bundled to work well together within their ecosystems. Examples include AWS Glue with Redshift, Databricks Workflows, Microsoft Fabric Data Factory, and BigQuery's Data Transfer Service with its Dataform integration.

Airbyte and the ELT Workflow

Airbyte is an open-source data integration platform designed to consolidate data from various sources into data warehouses, lakes, and databases. It plays the “EL” role in the ELT workflow and is available in both self-managed and cloud versions. Airbyte simplifies self-serve data extraction from numerous API (550+), database, and file sources, offering predictable data loading into more than 25 destinations while managing typing and deduplication.

Airbyte enables users to build connectors using a no-code builder for HTTP APIs or a low-code CDK for REST APIs, significantly reducing development effort. Its unified platform ensures reliability across all data synchronisations, allowing control over schema propagation and flexible sync frequencies.

Airbyte also provides transformation capabilities as a critical part of the ELT process, allowing users to convert raw data into a more usable format after it has been loaded. This includes basic normalisation to convert JSON blobs into structured tables. Users can also implement custom transformations using SQL or integrate with dbt Cloud for more complex transformations.

As highlighted above, it is clear that Airbyte strongly favours ELT over ETL.

The Future of Data Engineering and ELT

While nobody can predict the future, we do have a good perspective on the past, the present, and current trends. Below are observations of ongoing developments, along with some wild speculation about the future.

  1. As more organisations shift towards cloud-based infrastructure, the modern data stack is and will continue to be the default choice of data architecture, and ELT will continue to play a crucial role in data integration processes.
  2. ELT tools will continue to mature, extending their coverage to more use cases to become more reliable foundational technologies, sparking the next wave of innovation in the modern data stack.
  3. The ELT workflow and the specific tools are changing and evolving rapidly, but the core aim will remain the same: to reduce complexity and increase modularisation. Plug-and-play modular tools with easy-to-understand pricing and implementation are the way of the future.
  4. Batch transformations are overwhelmingly popular, but given the growing popularity of stream-processing solutions and the general increase in the amount of streaming data, the popularity of streaming transformations is expected to continue growing, perhaps entirely replacing batch processing in certain domains soon.

Conclusion

In this blog post, we've explored the early days of data engineering, in which the Extract, Transform, Load (ETL) data processing framework was popular, and the adoption of the ELT framework, driven mainly by cloud technology. We also learnt what the ELT framework consists of in detail, the benefits of ELT, and the top ELT tools. Lastly, we covered how Airbyte fits into the ELT process and offered predictions for the future of data engineering.

