Paulet Wairagu

The 5 Most Common Data Quality Issues (and How Analysts Can Fix Them)

Data analysts spend more time cleaning data than analyzing it: in most real-world projects, 60–80% of the work goes into preparing data before any meaningful insight comes out.
Poor data quality leads to incorrect conclusions, broken dashboards, and bad decisions, which is why understanding common issues and knowing how to fix them is a core skill for every analyst.

Here are the five most common data quality problems and practical steps to solve each one.

1. Missing or Null Values

Missing data can distort metrics, create gaps in reports, or lead to inaccurate ML models.

Why it happens:
• Manual data entry errors
• Incomplete integrations
• System migration issues

How to fix it:

  • Identify missingness patterns using COUNT(*) - COUNT(column) in SQL or df.isna().sum() in Python (see the sketch after this list).
  • Drop rows only when the missing fields are irrelevant to the analysis.
  • Impute using averages, medians, or domain logic.
  • Use Power Query’s “Replace Errors” or “Fill Down” functions for structured fixes.
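To make the Python side concrete, here is a minimal pandas sketch, assuming a hypothetical customers DataFrame with `age` and `country` columns:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [34, np.nan, 29, np.nan, 41],
    "country": ["Kenya", "Kenya", None, "Uganda", "Kenya"],
})

# 1. Profile missingness per column
print(df.isna().sum())

# 2. Drop rows only where the gap makes the record unusable
df = df.dropna(subset=["customer_id"])

# 3. Impute: median for numerics, mode (or a domain default) for categoricals
df["age"] = df["age"].fillna(df["age"].median())
df["country"] = df["country"].fillna(df["country"].mode()[0])

print(df)
```

Imputing with the median rather than the mean keeps a single extreme value from skewing the fill.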

2. Inconsistent Formatting

You’ve probably seen this: “Kenya”, “kenya”, “K E N Y A”, or mismatched date formats in the same column.

Why it happens:
• Different data sources
• Manual inputs
• Lack of data validation rules

How to fix it:

  • Apply standard casing (upper/lower/title).
  • Convert all dates to a unified ISO format (YYYY-MM-DD).
  • Use Excel Power Query’s “Transform → Format” options.
  • In SQL, standardize with functions like UPPER(), TRIM(), or TO_DATE() (a pandas version is sketched below).
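Here is the same cleanup as a short pandas sketch, using hypothetical country and date values like the ones above; note that format="mixed" in pd.to_datetime requires pandas 2.0 or later:

```python
import pandas as pd

# Hypothetical column with mixed casing, stray spaces, and mixed date formats
df = pd.DataFrame({
    "country": ["Kenya", "kenya", " K E N Y A ", "KENYA"],
    "signup": ["2024-01-05", "05/01/2024", "Jan 5, 2024", "2024/01/05"],
})

# Standardize casing; collapse the stray inner spaces from the toy example
# (fine for single-word values; multi-word names need gentler handling)
df["country"] = (
    df["country"]
    .str.strip()
    .str.replace(r"\s+", "", regex=True)
    .str.title()
)

# Parse the mixed date strings, then emit a unified ISO format (YYYY-MM-DD)
df["signup"] = (
    pd.to_datetime(df["signup"], format="mixed", dayfirst=True)
      .dt.strftime("%Y-%m-%d")
)

print(df)
```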

3. Duplicate Records

Duplicates inflate counts, break KPIs, and cause incorrect aggregations.

Why it happens:
• Multiple data entry points
• Poor primary key definition
• System sync issues

How to fix it:

  • Identify duplicates using ROW_NUMBER() window functions in SQL.
  • Use Power Query’s “Remove Duplicates”.
  • Implement unique IDs early in the pipeline.
  • In Python, use df.drop_duplicates() (see the sketch after this list).
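Both styles of deduplication fit in a few lines of pandas. This sketch uses a hypothetical orders table: drop_duplicates() handles exact copies, and the sort-then-keep-last pattern mirrors the SQL ROW_NUMBER() approach of keeping the most recent row per key:

```python
import pandas as pd

# Hypothetical orders table: order 101 is loaded twice verbatim,
# order 103 appears twice with conflicting values
df = pd.DataFrame({
    "order_id":  [101, 101, 102, 103, 103],
    "amount":    [50,  50,  75,  20,  25],
    "loaded_at": pd.to_datetime([
        "2024-03-01", "2024-03-01", "2024-03-01", "2024-03-01", "2024-03-03",
    ]),
})

# Exact duplicates across all columns
print(df.duplicated().sum())   # 1

# Simple case: drop exact copies
deduped = df.drop_duplicates()

# ROW_NUMBER()-style dedup: keep the most recent row per order_id
latest = (
    df.sort_values("loaded_at")
      .drop_duplicates(subset="order_id", keep="last")
)
print(latest)
```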

4. Outliers and Incorrect Values

Some values are valid extreme cases; others are simply errors (like a customer aged 600).

Why it happens:
• Typographical errors
• Faulty sensors or scraping issues
• Incorrect units (meters vs. feet)

How to fix it:

  • Visualize distributions using box plots or histograms.
  • Apply domain thresholds or rule-based logic.
  • Use interquartile ranges or z-scores for statistical outlier detection (see the sketch after this list).
  • Create automated validations in Power BI or SQL.
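Here is a small sketch of the IQR method next to a plain domain rule, on a hypothetical ages series; the 1.5 × IQR fence is the conventional default, not a universal threshold:

```python
import pandas as pd

# Hypothetical ages, including an obvious entry error (600)
ages = pd.Series([23, 31, 27, 45, 38, 600, 29, 52, 41, 36])

# IQR fences: flag anything beyond 1.5 * IQR from the quartiles
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(ages[(ages < low) | (ages > high)])   # flags 600

# Domain rule: human ages must fall in a plausible range;
# invalid values become NaN so they can be reviewed or imputed
valid = ages.where(ages.between(0, 120))
print(valid)
```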

5. Mixed Granularity

Data at different levels of detail combined into one column or table, e.g., weekly and monthly figures in the same dataset.

Why it happens:
• Data integration from multiple systems
• Poorly designed source tables

How to fix it:

  • Split datasets by granularity before analysis.
  • Create dimensional tables for dates, products, etc.
  • Aggregate or disaggregate consistently before joining (see the sketch after this list).
  • Use a proper star schema when possible.
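As an illustration of aligning grains before a join, here is a minimal pandas sketch with hypothetical daily sales and monthly targets:

```python
import pandas as pd

# Hypothetical: daily sales and monthly targets live at different grains
daily = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-10", "2024-01-20", "2024-02-05", "2024-02-25"]),
    "sales": [100, 150, 90, 160],
})
monthly_targets = pd.DataFrame({
    "month": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "target": [240, 260],
})

# Roll daily sales up to month start so both tables share one grain
monthly_sales = (
    daily.set_index("date")["sales"]
         .resample("MS")            # "MS" = month-start frequency
         .sum()
         .rename_axis("month")
         .reset_index()
)

# Now the join is grain-safe: one row per month on each side
report = monthly_sales.merge(monthly_targets, on="month")
print(report)
```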
