A data pipeline is the set of processes by which raw data is extracted from its source, transformed into a format useful to the consumer, and loaded into a system where it can be analyzed or stored in its useful form. In the extraction stage, data is pulled from various sources such as databases, files, or websites (via scraping). Transformation turns the data into a usable shape through steps like data cleaning, data wrangling, and merging data from different sources. Loading moves the data into a destination system where it can be stored or further processed.
These processes are what form the ETL and ELT process in data pipelines.
Extract → Transform → Load
Extract → Load → Transform
So what is the difference between ETL and ELT? The difference is simply which step, loading or transformation, comes immediately after extraction.
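To make the ordering concrete, here is a minimal sketch in plain Python. The `extract`, `transform`, and `load` functions and the in-memory "destination" are illustrative stand-ins, not any particular tool's API:

```python
# Toy stages: extract pulls raw records, transform cleans them,
# load writes them to a destination (here, an in-memory list).

def extract():
    # Raw records exactly as they arrive from the source system
    return [{"name": " Alice ", "amount": "10"}, {"name": "Bob", "amount": "5"}]

def transform(records):
    # Clean whitespace and cast amounts to integers
    return [{"name": r["name"].strip(), "amount": int(r["amount"])} for r in records]

def load(records, destination):
    destination.extend(records)
    return destination

# ETL: data is transformed *before* it reaches the destination
etl_dest = load(transform(extract()), [])

# ELT: raw data lands first; the transform runs inside the destination later
elt_dest = load(extract(), [])
elt_dest = transform(elt_dest)

print(etl_dest == elt_dest)  # True: same result, different ordering
```

Both orderings produce the same cleaned records; what differs is where the raw data lives in the meantime, which drives every trade-off below.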
This raises the question: when, and in what situation, does ETL make more sense than ELT, and vice versa? To get there, we must first understand the functional differences between ETL and ELT. The differences are as follows:
| ETL | ELT |
|---|---|
| Raw data is lost after transformation | Raw data is always preserved for re-transformation |
| Reduces storage cost because raw data is not saved | Higher storage costs because raw data takes more space in storage |
| Older methodology and therefore has a lot of robust tools | New methodology and therefore needs powerful destination systems that can handle transformation - which are few in the market |
| Slower iteration because re-runs require full re-extraction | Allows iterating on transforms without re-ingestion |
| Stronger compliance by filtering sensitive data early | Sensitive data reaches the warehouse unfiltered |
| Schema changes break pipelines easily | Schema changes are less likely to break the pipeline, since raw data is loaded as-is |
| Does not scale with cloud warehouse compute; external processing layers hit capacity ceilings as data volumes grow | Scales easily with cloud warehouse compute |
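The re-transformation row deserves a concrete illustration. In the sketch below, Python's built-in `sqlite3` stands in for a cloud warehouse (the table and column names are made up): the raw table is loaded once and preserved, and "transformation" is just a SQL view that can be rebuilt at any time without touching the source system.

```python
import sqlite3

# sqlite3 plays the role of the warehouse; raw_orders is the
# preserved raw layer that ELT loads once and never re-extracts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("alice", "10"), ("bob", "bad"), ("alice", "5")],
)

def rebuild_clean_orders():
    # Re-runs the transform over the preserved raw table.
    # Changing the business logic here needs no re-ingestion.
    conn.execute("DROP VIEW IF EXISTS clean_orders")
    conn.execute("""
        CREATE VIEW clean_orders AS
        SELECT customer, CAST(amount AS INTEGER) AS amount
        FROM raw_orders
        WHERE amount GLOB '[0-9]*'
    """)

rebuild_clean_orders()
total = conn.execute("SELECT SUM(amount) FROM clean_orders").fetchone()[0]
print(total)  # 15: the bad row is filtered out, the rest summed
```

If the cleaning rule turns out to be wrong, you edit the view and rebuild; the raw rows (including the rejected one) are still there.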
Use cases for ETL and ELT
| ETL | ELT |
|---|---|
| When working with regulated, sensitive data, e.g. financial information or HIPAA-covered records. Sensitive fields must be masked or dropped before the data enters storage | When you need to preserve raw data for auditing, debugging, or future re-modelling |
| When the cost of storing raw data outweighs the benefit of preserving it | When you want analysts and data teams to iterate on business logic without re-ingesting data |
| When your destination is a legacy relational database with limited compute (MySQL, PostgreSQL, on-premise data warehouses) | When you are on a modern cloud data warehouse (BigQuery, Snowflake, Redshift, Databricks) with scalable compute |
| Data volume is modest and transformations are stable, well-understood, and unlikely to change often | When data volumes are large and growing, and you favour ingestion speed over transformation speed |
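The compliance use case in the first row can be sketched as a pre-load sanitization step. The field names (`ssn`, `email`) and the hashing choice are hypothetical; the point is that sensitive values are dropped or masked before the record ever reaches the warehouse, which is the ETL pattern.

```python
import hashlib

# Hypothetical policy: drop these fields outright, mask these ones.
DROP_FIELDS = {"ssn"}
MASK_FIELDS = {"email"}

def mask(value):
    # One-way hash so the masked value can still serve as a join key
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def sanitize(record):
    clean = {}
    for key, value in record.items():
        if key in DROP_FIELDS:
            continue  # sensitive field never leaves the transform layer
        clean[key] = mask(value) if key in MASK_FIELDS else value
    return clean

record = {"name": "Alice", "ssn": "123-45-6789", "email": "alice@example.com"}
safe = sanitize(record)
print(safe)  # ssn removed, email replaced by a hash
```

In an ELT pipeline the same record would land in the warehouse unfiltered, which is exactly why regulated data tends to favour ETL.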
The ETL methodology has been around for a while and was long the standard way of operating a data pipeline. Tools that use this approach are substantial in number and have been refined over the years into reliable, robust products. However, with the adoption of cloud-based architectures, ELT has gained traction because of the ease of scaling in the cloud.
The choice between ETL and ELT is not straightforward. It is a function of user needs, cost, the standard operating procedures (SOPs) of an organisation or institution, the regulations governing the data, the destination of the data, and many other factors. The merits and demerits of each approach have to be weighed, and the one with the most benefits to the user or organisation picked.