Raphael Gutierrez

Posted on Mar 25, 2022

Considerations when performing ETL

#dataengineering #etl

Figure: ETL Process (c) Project SPARTA

Extract-transform-load (ETL) is an essential part of the data value chain and has been the go-to data pipeline procedure for data warehousing, thanks to advantages such as ease of development and a schema that matches what the business needs.

ETL combines data from multiple sources into a single, consistent data store that is loaded into a data warehouse or other target system.
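To make the three stages concrete, here is a minimal sketch in Python. It is only an illustration: the two in-memory sources, the `customer_billing` table, and the in-memory SQLite database standing in for a warehouse are all assumptions, not something from the course.

```python
import sqlite3

# Hypothetical source data: two systems that describe customers differently.
crm_rows = [{"id": 1, "full_name": "Ana Cruz"}, {"id": 2, "full_name": "Ben Reyes"}]
billing_rows = [{"customer_id": 1, "amount": "120.50"}, {"customer_id": 2, "amount": "80"}]


def extract():
    """Return the raw rows from each (hypothetical) source."""
    return crm_rows, billing_rows


def transform(crm, billing):
    """Join both sources on the customer id and normalize the amount to a float."""
    names = {row["id"]: row["full_name"] for row in crm}
    return [
        (row["customer_id"], names[row["customer_id"]], float(row["amount"]))
        for row in billing
    ]


def load(rows, conn):
    """Write the consistent rows into a single target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_billing "
        "(customer_id INTEGER, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO customer_billing VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(transform(*extract()), conn)
result = conn.execute(
    "SELECT * FROM customer_billing ORDER BY customer_id"
).fetchall()
```

Keeping each stage in its own function makes it easier to swap a source or a target later without touching the other stages.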

To deliver business requirements accurately through a successful ETL, there are a few considerations to keep in mind.

What-When-How

Organizations deal with huge amounts of data every day, and finding out exactly which data is needed is vital when performing extraction and transformation.

Which data is important and will finally end up in the data warehouse?

It's important to know what is needed now and what will be needed in the future according to the business requirements. This saves memory (by keeping noisy data out) and improves computing performance.
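One way to act on this is to keep only the fields the requirement asks for as early as possible. The sketch below assumes a CSV extract; the field names and sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical list of fields the business requirement actually needs.
REQUIRED_FIELDS = ["order_id", "order_date", "total"]

# Hypothetical raw extract with extra columns that nobody asked for.
raw_csv = io.StringIO(
    "order_id,order_date,total,debug_flag,internal_note\n"
    "1001,2022-03-01,250.00,0,ok\n"
    "1002,2022-03-02,99.90,1,retry\n"
)


def keep_required(reader, fields):
    """Yield rows that contain only the required fields."""
    for row in reader:
        yield {name: row[name] for name in fields}


slim_rows = list(keep_required(csv.DictReader(raw_csv), REQUIRED_FIELDS))
```

Because `keep_required` is a generator, a large file can be trimmed row by row instead of being held in memory all at once.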

Knowing when to extract is also a key factor. ETL should not affect the production environment. Avoid extracting during peak hours, as it may slow down the production system, hurt the user experience, and, at worst, cause data loss. Batch extraction after office hours is more advisable.
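A simple guard can stop a batch job from starting during peak hours. This is a rough sketch: the 8 PM to 5 AM window and the function names are assumptions, and the real window depends on the organization.

```python
from datetime import datetime, time

# Assumed off-peak window: 8 PM to 5 AM. Adjust to the organization's real hours.
OFF_PEAK_START = time(20, 0)
OFF_PEAK_END = time(5, 0)


def is_off_peak(moment: datetime) -> bool:
    """Return True when `moment` falls inside the overnight off-peak window."""
    current = moment.time()
    # The window wraps past midnight, so it is the union of two ranges.
    return current >= OFF_PEAK_START or current < OFF_PEAK_END


def run_extraction_if_allowed(moment: datetime) -> str:
    """Skip the batch extraction unless we are outside peak hours."""
    if not is_off_peak(moment):
        return "skipped"
    # A real job would start the batch extraction here.
    return "started"
```

In practice the schedule usually lives in a scheduler such as cron, and a check like this serves as a second safety net inside the job itself.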

Lastly, an important rule in ETL is to not tap the production database directly. Create an indirect route to the source instead (e.g., extracting from dump files created by an IT batch run). Identifying how to extract also involves consultation, including a review of the organization's data governance and privacy policies.

Data Dictionary & Mapping

It is not uncommon to work with a vast amount of data just to answer a single business requirement. However, problems may arise if attributes are inconsistent and the data is not well prepared.

A data dictionary is a centralized repository of metadata, and metadata is data about data, as defined by Kelly Bourne in his book "Application Administrators Handbook." It lists data elements with detailed descriptions of their formats and relationships.

Keeping a data dictionary that lists the fields and definitions of a schema helps you not get lost when working with potentially hundreds of fields that have conflicting, ambiguous, and sometimes analogous names. When working with relational databases with multiple related tables, mapping out and documenting the attributes helps avoid confusion.
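A data dictionary can also be used directly by the pipeline to map source fields onto documented names. The sketch below is one possible shape, not a standard format: the fields, aliases, and the `map_record` helper are hypothetical.

```python
# Hypothetical data dictionary: one entry per canonical field.
DATA_DICTIONARY = {
    "customer_id": {
        "type": int,
        "description": "Unique customer key",
        "aliases": ["cust_id", "CustomerNo"],
    },
    "signup_date": {
        "type": str,
        "description": "ISO date of registration",
        "aliases": ["created", "reg_date"],
    },
}

# Reverse lookup: source column name -> canonical field name.
ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, entry in DATA_DICTIONARY.items()
    for alias in entry["aliases"]
}


def map_record(record):
    """Rename source fields to canonical names and cast to the documented type."""
    mapped = {}
    for source_name, value in record.items():
        canonical = ALIAS_TO_CANONICAL.get(source_name, source_name)
        if canonical not in DATA_DICTIONARY:
            # Undocumented fields are rejected instead of silently passed on.
            raise KeyError(f"undocumented field: {source_name}")
        mapped[canonical] = DATA_DICTIONARY[canonical]["type"](value)
    return mapped
```

For example, `map_record({"cust_id": "42", "reg_date": "2022-03-25"})` returns a record keyed by `customer_id` and `signup_date`, while a field missing from the dictionary raises an error so it gets documented first.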

Business Rules

According to Michael Eisner in his article on ProcessMaker, business rules are directives that define activities and help provide guidelines to organizations. They bring efficiency, consistency, predictability, and many other benefits.

Knowing the business rules, especially in the transformation phase, tells you how each value should be interpreted and handled within the limits set in the database. These rules impose some form of constraint on a specific aspect of the database, such as the elements within a field specification for a particular field or the characteristics of a given relationship.
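During transformation, business rules can be written as small checks that every row must pass. The two rules below (a positive amount and a known status) are invented for illustration; real rules come from the business and the database constraints.

```python
# Hypothetical business rules, each a named check on a single row.
RULES = {
    "amount must be positive": lambda row: row["amount"] > 0,
    "status must be a known value": lambda row: row["status"]
    in {"paid", "pending", "refunded"},
}


def apply_rules(rows):
    """Split rows into accepted rows and (row, broken rule names) rejects."""
    accepted, rejected = [], []
    for row in rows:
        broken = [name for name, check in RULES.items() if not check(row)]
        if broken:
            rejected.append((row, broken))
        else:
            accepted.append(row)
    return accepted, rejected


sample = [
    {"amount": 50.0, "status": "paid"},
    {"amount": -5.0, "status": "paid"},
    {"amount": 10.0, "status": "unknown"},
]
accepted, rejected = apply_rules(sample)
```

Recording which rule a rejected row broke makes it easier to go back to the data owner with a specific question instead of a vague "the load failed".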

These are some of my notes from Week 1 of Project SPARTA's SP701 "SQL for Data Engineering" course.
