Data engineering has just been named the most in-demand profession of 2022. The process of data engineering can be understood as extracting, transforming, and loading data (ETL). Data is extracted from the source, transformed, and loaded into a table in a data warehouse. This process is automated and repeated on a schedule so that clean, up-to-date, and reliable data ultimately flows into a dashboard and into other data products (e.g. a recommendation engine).
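To make the three steps concrete, here is a minimal sketch in Python. It assumes an in-memory SQLite database as a stand-in for the data warehouse. The `RAW_ORDERS` records, the `orders` table, and all function names are made up for illustration.

```python
import sqlite3

# Hypothetical raw records, standing in for an export from a source system.
RAW_ORDERS = [
    {"order_id": 1, "customer": " Alice ", "amount": "19.90"},
    {"order_id": 2, "customer": "bob", "amount": "5.00"},
    {"order_id": 3, "customer": "", "amount": "12.50"},  # missing customer name
]


def extract():
    """Extract: return the raw rows from the (here in-memory) source."""
    return list(RAW_ORDERS)


def transform(rows):
    """Transform: drop rows without a customer, tidy names, cast amounts."""
    cleaned = []
    for row in rows:
        name = row["customer"].strip()
        if not name:
            continue  # skip records that fail the basic quality check
        cleaned.append((row["order_id"], name.title(), float(row["amount"])))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    # INSERT OR REPLACE keeps repeated runs from duplicating rows.
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Run extract, transform, and load, then return the table contents."""
    load(conn, transform(extract()))
    query = "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
    return conn.execute(query).fetchall()
```

Calling `run_pipeline(sqlite3.connect(":memory:"))` runs the whole flow once. Because the load step replaces rows by primary key, running it again on the same connection leaves the table unchanged, which is what you want from a job that repeats on a schedule.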
In recent years, dbt in particular has enjoyed growing popularity. dbt is a software framework that sits in the middle of the ETL process: it is the transformation layer that runs after the data has been loaded from its original source. dbt combines SQL with software engineering principles. In plain language, this means that SQL can now be used to develop data models, new views, and tables. dbt builds on the following software engineering practices:
- DRY ("don't repeat yourself") coding
- Dependency management: through its lineage capabilities, dbt handles complex data warehouse structures well and lets you build out DAGs (directed acyclic graphs)
- Version control with Git
- Advanced data tests, including data freshness checks
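The dependency point is easier to see with a small example. The sketch below is not dbt's own code. It only illustrates the idea of a DAG of models using Python's standard `graphlib` module, and the model names (`stg_orders`, `orders_enriched`, and so on) are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each key depends on the models in its set.
MODEL_DEPENDENCIES = {
    "stg_orders": set(),
    "stg_customers": set(),
    "orders_enriched": {"stg_orders", "stg_customers"},
    "revenue_dashboard": {"orders_enriched"},
}


def build_order(dependencies):
    """Return one valid build order: every model comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def downstream_of(model, dependencies):
    """Return all models that directly or indirectly depend on `model`."""
    affected = set()
    frontier = [model]
    while frontier:
        current = frontier.pop()
        for name, parents in dependencies.items():
            if current in parents and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected
```

`build_order(MODEL_DEPENDENCIES)` always places `revenue_dashboard` last, because everything else feeds into it. `downstream_of("stg_orders", MODEL_DEPENDENCIES)` answers the typical lineage question: which models are affected if this one changes or breaks?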
dbt has managed to create an entire job class with the title Analytics Engineer. I think this captures both the importance of dbt and the audience that mostly uses it, namely engineers. A new generation of open source tools has been built on top of dbt. Here are the hottest and highest-ranked projects on GitHub:
Lightdash ( https://github.com/lightdash/lightdash )
Lightdash builds on your dbt models and makes it possible to define and easily visualize additional metrics via a visual interface. The front end helps you understand and extend the underlying SQL queries. Lightdash also visualizes business metrics and makes them shareable with the data team. It is also possible to integrate all data into another visualization tool.
re_data ( https://github.com/re-data/re-data )
re_data is an abstraction layer that helps users monitor dbt projects and their underlying data. For example, you get an alert when a test fails or a data anomaly occurs in a dbt project, along with the underlying metric that is affected. In addition, the lineage graph is displayed intuitively. re_data is not the only framework focusing on the observability of lengthy dbt pipelines; two others worth checking out are open-metadata and Elementary.
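To give a feeling for what a "data anomaly" means here, below is a minimal sketch of a z-score check on a table metric such as the daily row count. This is not re_data's actual implementation. The function names and the threshold of 3 standard deviations are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def z_score(history, value):
    """Return how many standard deviations `value` lies from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any different value is infinitely unusual.
        return 0.0 if value == history[0] else float("inf")
    return (value - mean(history)) / spread


def is_anomaly(history, value, threshold=3.0):
    """Flag `value` if it deviates from `history` by more than `threshold` sigmas."""
    return abs(z_score(history, value)) > threshold
```

With a row count history of `[1000, 1020, 980, 1010, 990]`, a new value of 1005 passes quietly, while a sudden drop to 400 is flagged, which is the kind of event you would want an alert for.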
Evidence ( https://github.com/evidence-dev/evidence )
Evidence is another tool for lightweight BI reporting. With Evidence you can build simple reports in "Medium style" using SQL queries and Markdown. It is reminiscent of Jupyter Notebooks, except that it is based on SQL instead of Python. You can also run SQL queries from the reports you create. I haven't used the tool myself yet, but it seems ideal for quickly prototyping metrics in a report.
Kuwala ( https://github.com/kuwala-io/kuwala )
Kuwala is a data workspace that consolidates the modern data stack and makes it usable for both BI analysts and engineers. Even though dbt was originally targeted at BI analysts, it is mainly used by engineers. This shifts a large amount of pipeline engineering effort to the IT department. With Kuwala, a BI analyst can intuitively build advanced data workflows using a drag-and-drop interface on top of the modern data stack without coding. Consequently, the BI analyst can work more iteratively and maintain the complete workflow from source to the metrics in a dashboard. Under the hood, dbt models are generated so that a more experienced engineer can customize the pipelines at any time. In addition, engineers can easily convert dbt models into reusable "drag and drop" components.
Fal-AI ( https://github.com/fal-ai/fal )
Fal lets you run Python scripts directly from a dbt project. For example, you can load dbt models directly into the Python context, which makes it possible to apply data science libraries like scikit-learn and Prophet to them. This especially improves the data science capabilities within a data pipeline. What I particularly like about fal is that it extends dbt from an interesting angle.
Of course, there are more than just 5 interesting projects among the 16K repos on GitHub that use dbt. So what is your hottest repo for dbt?