The Data Trinity

#dataengineering #etl #elt #datapipelines

A trinity is a reference of three(3). Three items working in perfect harmony/orchestration, seemingly as a single unit, to achieve a defined goal. In the modern world of data engineering, this is a holy grail that many are in search of.
In this case, we are considering the scope from source to business insight using data. The three most important components. These are:

Extraction and loading tools
Transform and storage tools
Visualization and Intelligence tools

We will be going through and discussing the options in each of the trinity components. What to consider and what to weigh.

Extraction and Loading Tools

This component of the trinity is mainly involved in taking data from source(ingestion) and "transporting" it to destination. A cool feature with most tools is scheduling ingestion. You can choose to ingest your data from source to destination daily, after x minutes or y hours. This allows you to focus on other more important tasks.

Within the data engineering context, Extraction refers to the process of obtaining raw data at source. The source can be any software artifact that holds or produces data. These include APIs, databases,software systems such as CRMs, transactional systems et al.

Loading - Dumping data into a destination artifact. The data can be raw data, if one is using the ELT approach or processed data if one is using the traditional ETL approach. The most common destination artifact is a data warehouse.

Considerations

When dealing with this component, the main considerations include data privacy, engineering resource management, costs and robustness of the tool used. Let us consider a practical example with real life tools.

If your company is price sensitive, you might opt for a free open source tool such as Airbyte but absorb the engineering costs of setup and maintenance. If your company wishes to focus their engineering resources on other tasks other than extraction and loading, they can use a managed service like Fivetran where they pay for the service.

The trade offs are purely situation dependent. Maybe you want an open source free tool, maybe you want a fully managed service, maybe you are bootstrapping and do not have funds, maybe you are a big company and do not mind paying.

Airbyte and Fivetran are some of the most common tools in this space within data engineering.

Transform and Storage Tools

This component of the trinity is mainly involved in processing your data according to your business needs and also storing the data.
The transform component is where business logic is domiciled. This is where you perform data transformations to generate business value from data. They could be as simple as filters and as complex as joins, rolling column computation, pivoting, etc. Most transformations will be done in SQL. This is because of its ability to perform data processing and also compatibility with the storage systems. Most of the storage systems will be designed to be SQL compatible and hence, SQL as the transformation language is the most common. A common tool in this space is Data Build Tool DBT A powerful tool that supercharges SQL by introducing other programming paradigms such as jinja templating, source definitions, hooks, variables and sanity checks/tests among others. The alternative would be stored procedures and SQL scripts but you would be losing out on all the great features of DBT.
I have had a go at it in my personal github which you can check over here. DBT Bon Voyage
Fivetran also offers an SQL based transformation interface where one can run their SQL scripts.

The storage component is simply where the data lives after extraction and loading. The most common artifact is a data warehouse or data lake depending on conceptual design.
The most common services in this category are AWS Redshift, Google cloud Big query and Snowflake.

Visualization and Intelligence Tools

This component of the trinity is mainly involved in synthesis and displaying of business value. This is where the business intelligence analyst, data analyst, data scientist and business executives operate. They use the extracted loaded and transformed data to answer business questions and produce actionable insights. The software artifacts in this level often have visualization, query and reporting tools.
Common tools in this level are business intelligence software such as Looker, PowerBi and Tableau among others.
The best and most robust features in any of these tools are on a paid tier and hence, costs are definitely a consideration. Whether you would want it on your private server or on a public cloud that the vendor can setup for you, is also a choice depending on your data policy.

Wrapping Up

By now you are aware of the data trinity in terms of tools. A well oiled machine that is commonly referred to as the modern data stack.

Airbyte/Fivetran --> Snowflake/Redshift/BigQuery [With DBT or custom SQL scripts running on top] --> Looker/PowerBI/Tableau

A great company that offers data consulting services through out the whole stack is SILVER CREEK INSIGHTS They have direct partnerships with Looker and Fivetran and hence, your entire data stack needs can and will be solved under the Silvercreek umbrella.
If you would like to interact further, you can find my personal site or LinkedIn

DEV Community

The Data Trinity

Extraction and Loading Tools

Considerations

Transform and Storage Tools

Visualization and Intelligence Tools

Wrapping Up

Top comments (0)

Read next

Boost Your Application's Performance with Redux RTK Query 🚀

LeetCode Challenge 13: Roman to Integer - JavaScript Solution 🚀

DevOps for Edge Computing

Tracardi Open-Source and Commercial Version