DEV Community

Cover image for What is Data Profiling?
Preeti Hemant
Preeti Hemant

Posted on

2

What is Data Profiling?

Imagine this scenario...

You have started a new job as a Data Engineer. It's an exciting time, new data, new tools, new people! While you are familiarizing yourself, your manager who is also very excited to have you look at their backlog, sends a request across.

"One of our stakeholders has been waiting on this for a long time - could you look into this job that's resulting in a very large table in the data warehouse. The stakeholder would love an aggregated table with the data rolled up to a week."

What is one of the first things you do?

Data Profiling!

It's a systematic analysis of the data in a table to understand the structure and relationships - leading to conclusions on its usability.

The tools to extract this information are measures in descriptive statistics - min, max, mode, frequency, sum, count (You get the picture).
The next set of parameters to understand are data types, nulls, uniqueness which answer questions on missing data, duplicates and so on.

You could run this analysis on individual columns or do a cross-columns analysis.

At the end of this exercise, you should be able to make a few decisions
1) Is this data able to support the analysis/reporting use case?
2) Do the data quality issues originate from the source or from the ETL jobs?
3) Are there inconsistencies in the structure of the table?

So, this was a short introduction to the Data Profiling technique. For further reading, here are some links.

  1. http://www.kimballgroup.com/wp-content/uploads/2012/05/DT59SurprisingValue.pdf
  2. https://www.researchgate.net/publication/262221918_Data_Profiling_Revisited

Qodo Takeover

Introducing Qodo Gen 1.0: Transform Your Workflow with Agentic AI

Rather than just generating snippets, our agents understand your entire project context, can make decisions, use tools, and carry out tasks autonomously.

Read full post

Top comments (0)

Billboard image

The Next Generation Developer Platform

Coherence is the first Platform-as-a-Service you can control. Unlike "black-box" platforms that are opinionated about the infra you can deploy, Coherence is powered by CNC, the open-source IaC framework, which offers limitless customization.

Learn more