Michael Mwai
Extract Transform Load vs Extract Load Transform (ETL vs ELT)

A data pipeline is the set of tools, steps, and processes that automate the movement of harvested data to a destination system. The destination system can serve either analysis or storage.
A pipeline can be broken down into 3 broad processes, namely:

  1. Extraction

  2. Transformation

  3. Loading

Extraction

This is always the initial stage of any data pipeline. It is at this stage that data is gathered from the various places it is found; it can come from more than one source. There are numerous sources of data, and these include, but are not limited to:

  1. Files - Files in different formats like CSVs, Excel sheets, and PDFs can contain raw data that can be extracted.

  2. Web scraping - Data can be sourced by scraping one or more websites that contain the relevant data.

  3. Databases - Organisations store data in databases, and this can serve as raw data to be extracted, as the sketch below shows.
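As an illustration, here is a minimal extraction sketch in Python using pandas. The file name, connection string, and table name are hypothetical placeholders, not references to a real system:

```python
import pandas as pd
from sqlalchemy import create_engine

# Files: read raw records from a CSV (the file name is a placeholder)
file_df = pd.read_csv("sales_2024.csv")

# Databases: pull a table through SQLAlchemy (the connection string
# and table name are placeholders)
engine = create_engine("postgresql://user:password@localhost:5432/shop")
orders_df = pd.read_sql("SELECT * FROM orders", engine)

# Keep the raw frames together; later stages decide how to combine them
raw_sources = {"sales_file": file_df, "orders_table": orders_df}
```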

Transformation

Raw data from the extraction stage is always considered messy. Messy in this context can mean many different things depending on how the data is intended to look: it may be in an unusable format (e.g. numerical data written out in words), it may have values in units the user does not prefer (e.g. imperial units instead of metric units), or it may be full of errors (like spelling mistakes), among other possibilities. It is at this stage that the data is manipulated and transformed into the state the user desires.
There are numerous processes that can be performed in this stage (a short sketch follows the list):

  1. Data wrangling - This is the process of converting data from its raw form into a tidy format that has structure. It involves actions like converting country names to their abbreviations.

  2. Data cleaning - This is the process of removing data that does not meet preset, user-defined rules for the task, to ensure the data is accurate, consistent, and relevant. An example would be removing records of people aged 60 and below when the scope of an analysis is the 'elderly'.

  3. Data transformation - This is the process of converting raw data from one format and structure to another to make sure it is easy to use, integrate, or analyze.
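Here is a minimal sketch of the three processes using pandas, on a small made-up dataset (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical raw data illustrating the three processes above
raw_df = pd.DataFrame({
    "country": ["Kenya", "United States", "Kenya"],
    "age": [72, 45, 65],
    "height_in": [65.0, 70.0, 68.0],  # imperial units
})

# Data wrangling: convert country names to abbreviations
raw_df["country"] = raw_df["country"].map({"Kenya": "KE", "United States": "US"})

# Data cleaning: keep only records in scope (people aged over 60)
clean_df = raw_df[raw_df["age"] > 60]

# Data transformation: convert imperial units to metric
clean_df = clean_df.assign(height_cm=clean_df["height_in"] * 2.54)
clean_df = clean_df.drop(columns=["height_in"])

print(clean_df)
```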

Loading

This is the stage where data, whether it has undergone transformation or is still raw, is loaded into a destination system. The system can either be for storage, as with data warehouses and databases, or for analysis, as with Power BI.
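A minimal loading sketch, assuming a local SQLite file stands in for the destination warehouse (pandas' `to_sql` works the same way against other SQLAlchemy-supported databases):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical transformed data from the previous stage
clean_df = pd.DataFrame({"country": ["KE", "US"], "age": [72, 65]})

# Destination: a local SQLite database standing in for a warehouse
engine = create_engine("sqlite:///warehouse.db")

# Load the data into a destination table
clean_df.to_sql("customers", engine, if_exists="replace", index=False)
```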

It is the order of the transformation and loading stages after extraction that determines whether a pipeline is ETL or ELT. In ETL, transformation comes before loading; in ELT, loading precedes transformation.

ETL: Extract → Transform → Load

ELT: Extract → Load → Transform
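The difference in ordering can be sketched with three toy functions. Here `extract`, `transform`, and `load` are stand-ins for the stages described above, with plain Python lists playing the role of the destination system:

```python
def extract():
    # Raw records as they arrive from a source
    return [{"name": "JANE", "age": "72"}]

def transform(records):
    # e.g. normalise casing and cast types
    return [{"name": r["name"].title(), "age": int(r["age"])} for r in records]

def load(records, destination):
    destination.extend(records)

# ETL: transform in an external processing layer, then load the result
etl_warehouse = []
load(transform(extract()), etl_warehouse)

# ELT: load raw data first, then transform inside the destination
elt_warehouse = []
load(extract(), elt_warehouse)
elt_transformed = transform(elt_warehouse)
```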

ETL

The ETL methodology has been around for many years and was long the standard way of operating a data pipeline. This approach was driven largely by the historically high price of storage. With storage a scarce resource, organizations could not afford to keep data they considered junk: only useful data could be stored, so data had to be wrangled, cleaned, and transformed before it was stored. Tools that use this approach are substantial in number and have been refined over the years into reliable, robust products; examples include Talend and Informatica.

ELT

With the boom of the internet in the 2000s, the web became an avenue for providing services, and the amount of data handled by organizations grew exponentially. Data became even more important for some of these organizations because it could be used to derive insights that potentially increase revenue. Data quickly became the new gold, and organizations needed more of it. Luckily, by around 2010, storage had become so inexpensive that the advantage of discarding what was considered junk (but could be useful in future) was no longer compelling. With storage justifiably cheap, the only remaining problem was scaling resources to handle the large amounts of data pouring in. We were in luck again, because the rise of the cloud solved this challenge: with cloud services, you could scale operations and resources like storage easily, flexibly, and efficiently. Consequently, organizations no longer needed to transform data beforehand, hence the adoption of ELT.

So, when and in what situations does ETL make more sense than ELT, and vice versa? To get there, we must understand the functional differences between ETL and ELT. The differences are as follows:

| ETL | ELT |
| --- | --- |
| Raw data is lost after transformation | Raw data is always preserved for re-transformation |
| Reduces storage cost because raw data is not saved | Higher storage costs because raw data takes more space in storage |
| Older methodology, and therefore has many robust tools | Newer methodology, and therefore needs a powerful destination system that can handle transformation, and such systems are still few in the market |
| Slower iteration because re-runs require full re-extraction | Allows iterating on transforms without re-ingestion |
| Stronger compliance by filtering sensitive data early | Sensitive data reaches the warehouse unfiltered |
| Schema changes break pipelines easily | Schema changes don't affect the data pipeline |
| Does not scale with cloud warehouse compute: external processing layers hit capacity ceilings as data volumes grow | Scales easily with cloud warehouse compute |

Use cases for ETL and ELT

| ETL | ELT |
| --- | --- |
| When working with regulated, sensitive data (e.g. financial information, data covered by HIPAA): sensitive fields must be masked or dropped before they enter storage | When you need to preserve raw data for auditing, debugging, or future re-modelling |
| When the cost of storing raw data outweighs the benefit of preserving it | When you want analysts and data teams to iterate on business logic without re-ingesting data |
| When your destination is a legacy relational database with limited compute (MySQL, PostgreSQL, on-premise data warehouses) | When you are on a modern cloud data warehouse (BigQuery, Snowflake, Redshift, Databricks) with scalable compute |
| When data volume is modest and transformations are stable, well understood, and unlikely to change often | When data volumes are large and growing, and you favour ingestion speed over transformation speed |

The choice between ETL and ELT does not have a straightforward answer. It is a function of organizational needs, cost, the SOPs (standard operating procedures) of an organisation or institution, the regulations governing the data handled, the destination of the data, and many other factors. The merits and demerits of each approach have to be weighed, and the one with the most benefits to the user or organization is picked. Organizations have also adapted by switching between ELT and ETL when conditions necessitate it, which keeps their operations flexible and compliant.
It is worth noting that the modern approach is ELT: the industry has shifted decisively toward it for most modern data workloads, since they are implemented on the cloud. ETL is, however, going to be around for a long time, because it has its unique use cases, and there is room for both paradigms to coexist.
