Nilofer 🚀

Posted on Jul 3

NEO Data Quality Auditor AI: Automated Data Quality Auditing, Bias Detection, and Lineage Tracking

#machinelearning #opensource #mlops #dataquality

60% of businesses cite poor data quality as the primary reason for AI failures. Dirty data leads to misleading insights, wasted resources, and failed ML models. Most teams lack easy-to-use tooling that surfaces what is wrong and what to do about it.

NEO Data Quality Auditor AI addresses that directly. It is an automated data quality auditing tool that detects inconsistencies, bias, missing values, and format issues in any CSV dataset, with a real-time monitoring dashboard, AI-powered cleaning suggestions, and data lineage tracking. Built autonomously by NEO.

Visual Reports

The repo ships with four SVG infographics: a data quality score gauge, a bias detection overview, a gender and age bias analysis, and an ethnicity bias breakdown.

What It Does

9 Quality Checks: Missing values, duplicates, data types, out-of-range, format violations, outliers, cardinality, cross-column consistency, distribution skew.

AI Cleaning Suggestions: A rule-based engine maps every detected issue to actionable steps with severity, effort level, and category. Each suggestion is filterable and expandable in the dashboard.

Bias Detection: Demographic parity ratio, disparate impact, and group fairness metrics across sensitive attributes. Results surface per-group statistics with representation rates and standard errors.

Data Lineage: An event-sourced audit trail tracks every analysis step with timestamps and column-level changes, exportable as JSON.

Real-Time Dashboard: An interactive Streamlit UI with a quality score gauge (0-100), drill-downs per check, and exportable reports.
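
To make the rule-based idea behind the cleaning suggestions concrete, here is a minimal sketch of mapping a detected issue to a suggestion with severity, effort level, and category. The RULES table, the suggest function, and the shape of the issue dictionaries are all hypothetical; the actual ai_suggestions.py may be organized quite differently.

```python
# Hypothetical rule table: issue type -> suggestion template.
# These entries are illustrative and are not copied from the repo.
RULES = {
    "missing_values": {
        "severity": "high",
        "effort": "medium",
        "category": "completeness",
        "steps": ["Inspect why values are missing", "Impute or drop affected rows"],
    },
    "duplicates": {
        "severity": "medium",
        "effort": "low",
        "category": "uniqueness",
        "steps": ["Confirm the rows are true duplicates", "Drop duplicates, keeping the first"],
    },
}


def suggest(issues):
    """Return one suggestion per detected issue that has a matching rule."""
    suggestions = []
    for issue in issues:
        rule = RULES.get(issue["type"])
        if rule is None:
            continue  # no rule is defined for this issue type
        suggestions.append({"issue": issue["type"], "column": issue.get("column"), **rule})
    return suggestions


detected = [{"type": "missing_values", "column": "age"}, {"type": "unknown_check"}]
result = suggest(detected)
for s in result:
    print(f"[{s['severity']}] {s['issue']} on {s['column']}: effort {s['effort']}")
```

Keeping the rules in a plain lookup table means a new issue type only needs a new entry, with no change to the matching logic.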

The 9 Quality Checks

The 9 checks cover the most common classes of data quality problems.

Missing Values: Null/NaN counts and ratios per column with critical vs warning thresholds.
Duplicate Rows: Exact duplicate records with sample indices.
Data Type Inconsistencies: Mixed types within a column, for example strings in numeric columns.
Out-of-Range Values: Values outside expected min/max bounds.
Format/Pattern Violations: Invalid emails, phone numbers, and dates using regex patterns.
Outlier Detection: IQR method and Z-score analysis for numerical anomalies.
Uniqueness/Cardinality: High vs low cardinality detection.
Cross-Column Consistency: Logical contradictions between related columns.
Skewness/Kurtosis: Distribution shape analysis for numeric columns.
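
Two of the checks above, missing values and IQR-based outliers, can be sketched in a few lines. This is a dependency-free illustration using only the standard library; the function names and sample values are assumptions, and the repo's own checks operate on pandas DataFrames and may use different thresholds.

```python
import math
import statistics


def missing_ratio(values):
    """Fraction of entries that are None or NaN."""
    if not values:
        return 0.0
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the conventional choice."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up sample columns for illustration only.
ages = [10, 12, 11, 13, 12, 11, 95]
incomes = [52000.0, None, 61000.0, float("nan")]

print("Outliers:", iqr_outliers(ages))
print("Missing ratio:", missing_ratio(incomes))
```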

Bias Detection Metrics
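
As a rough illustration of the metrics named earlier, the sketch below computes per-group positive rates, a demographic parity ratio, and disparate impact on a tiny made-up dataset. The function names, the column names, and the choice of which group counts as privileged are assumptions; bias_detector.py may define these metrics differently.

```python
from collections import defaultdict


def positive_rates(records, group_key, outcome_key):
    """Share of positive outcomes (outcome == 1) within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in records:
        totals[row[group_key]] += 1
        if row[outcome_key] == 1:
            positives[row[group_key]] += 1
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means parity."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0


def disparate_impact(rates, unprivileged, privileged):
    """Unprivileged rate over privileged rate (assumes the privileged rate is > 0).

    A value below 0.8 fails the commonly used four-fifths rule.
    """
    return rates[unprivileged] / rates[privileged]


# Made-up records: 2 of 4 "M" rows and 1 of 4 "F" rows have a positive outcome.
records = (
    [{"gender": "M", "approved": 1}] * 2 + [{"gender": "M", "approved": 0}] * 2
    + [{"gender": "F", "approved": 1}] * 1 + [{"gender": "F", "approved": 0}] * 3
)

rates = positive_rates(records, "gender", "approved")
print("Rates:", rates)
print("Demographic parity ratio:", demographic_parity_ratio(rates))
print("Disparate impact (F vs M):", disparate_impact(rates, "F", "M"))
```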

Data Lineage

The lineage tracker logs every analysis step:

Load events: Original row count, column list, source filename
Quality check events: Per-check results with affected columns
Bias check events: Per-attribute bias metrics
Export: Download the full audit trail as JSON
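
The event types above suggest an append-only log. The following is a minimal sketch of that pattern, not the repo's lineage_tracker.py API: the LineageLog class, its method names, the event fields, and the sample values are all hypothetical.

```python
import json
from datetime import datetime, timezone


class LineageLog:
    """Hypothetical append-only event log for analysis steps."""

    def __init__(self):
        self._events = []

    def record(self, event_type, **details):
        """Append an event with a UTC timestamp; existing events are never edited."""
        self._events.append({
            "event_type": event_type,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "details": details,
        })

    def to_json(self):
        """Serialize the full trail, oldest event first."""
        return json.dumps(self._events, indent=2)


log = LineageLog()
# Sample values are invented for illustration.
log.record("load", source="demo_data.csv", rows=1000, columns=["age", "gender"])
log.record("quality_check", check="missing_values", affected_columns=["age"])

trail = json.loads(log.to_json())
print(len(trail), "events; first event type:", trail[0]["event_type"])
```

Because events are only ever appended, the exported JSON reads as a chronological record of what was done to the dataset.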

Installation

# Clone or navigate to the project
cd NEO-Data-Quality-Auditor-AI

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

Usage

Start the Dashboard

source venv/bin/activate
streamlit run app.py

This opens a web browser with the interactive dashboard.

Walkthrough

Upload a CSV: Use the sidebar to upload any CSV file, or click "Use Demo Data" to load a pre-built dataset with intentional quality issues (missing values, outliers, format violations, demographic bias).

Overview Tab: See the overall data quality score (0-100), pass/fail breakdown per check, and summary metrics.

Quality Checks Tab: Drill into each of the 9 checks with interactive visualizations (bar charts for missing values, box plots for outliers, histograms for distributions).

Bias Report Tab: Review demographic parity ratios and disparate impact metrics across sensitive attributes like gender, ethnicity, and age.

Lineage Tab: View the full audit trail of every analysis step with timestamps and event type breakdown.

AI Suggestions Tab: Get actionable cleaning recommendations filtered by severity and effort level, with step-by-step instructions.

Command-Line Usage

The modules are also importable programmatically:

import pandas as pd
from auditor.quality_checks import run_all_quality_checks, calculate_quality_score
from auditor.bias_detector import run_all_bias_checks
from auditor.ai_suggestions import generate_suggestions

# Load data
df = pd.read_csv("your_data.csv")

# Run all 9 quality checks
results = run_all_quality_checks(df)
score = calculate_quality_score(results)
print(f"Quality Score: {score}")

# Run bias detection
bias_results = run_all_bias_checks(df)
print(f"Bias checks run: {len(bias_results)}")

# Get cleaning suggestions
suggestions = generate_suggestions(results, bias_results)
for s in suggestions:
    print(f"[{s['severity']}] {s['issue']} — Effort: {s['effort']}")

Project Structure

NEO-Data-Quality-Auditor-AI/
├── app.py                      # Streamlit dashboard (main entry point)
├── config.py                   # Thresholds and configuration
├── requirements.txt            # Python dependencies
├── docs/                       # SVG infographics
│   ├── quality_score.svg       # Data quality gauge (43/100)
│   ├── bias_overview.svg       # Bias detection overview
│   ├── gender_age_bias.svg     # Gender & age bias analysis
│   └── ethnicity_bias.svg      # Ethnicity bias details
├── auditor/
│   ├── __init__.py             # Package exports
│   ├── quality_checks.py       # 9 quality check functions
│   ├── bias_detector.py        # Demographic parity, disparate impact, fairness
│   ├── lineage_tracker.py      # Event-sourced data lineage tracker
│   ├── ai_suggestions.py       # Rule-based cleaning suggestion engine
│   └── demo_data.py            # Demo data generator with intentional issues
├── data/
│   └── demo_data.csv           # Generated demo dataset
└── README.md

Running Tests

source venv/bin/activate
python -c "
from auditor.quality_checks import run_all_quality_checks, calculate_quality_score
from auditor.bias_detector import run_all_bias_checks
from auditor.ai_suggestions import generate_suggestions
import pandas as pd

# Generate and test with demo data
from auditor.demo_data import generate_demo_data
generate_demo_data()
df = pd.read_csv('data/demo_data.csv')

# Run full audit
quality = run_all_quality_checks(df)
score = calculate_quality_score(quality)
bias = run_all_bias_checks(df)
suggestions = generate_suggestions(quality, bias)

print(f'Quality Score: {score}')
print(f'Bias checks: {len(bias)}')
print(f'Suggestions: {len(suggestions)}')
"

Dashboard Screenshots

Launch the dashboard with streamlit run app.py to see it live.

Overview: Quality score gauge (0-100), check pass/fail bar chart, summary metrics
Quality Checks: Per-check drill-down with interactive Plotly visualizations
Bias Report: Demographic parity ratio charts, group fairness breakdown
Lineage: Event timeline with event type distribution pie chart
AI Suggestions: Expandable suggestion cards with severity filters and effort labels

Configuration

Edit config.py to tune:

QUALITY_THRESHOLDS: Missing value thresholds, outlier Z-score cutoff, email/phone regex patterns
BIAS_PARAMS: Demographic parity range, min group size, sensitive column keywords
DASHBOARD_CONFIG: App title, colors, max file size
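
For orientation, a config.py along these lines might look like the sketch below. Only the three top-level names come from the project; every key, value, and regex shown here is an illustrative assumption rather than the repo's actual defaults, and missing_severity is a hypothetical helper showing how such thresholds could be consumed.

```python
import re

# All keys and values below are assumptions for illustration.
QUALITY_THRESHOLDS = {
    "missing_warning": 0.05,   # flag a column when 5% or more of its values are missing
    "missing_critical": 0.20,  # escalate at 20% or more
    "outlier_zscore": 3.0,
    "email_pattern": r"^[\w.+-]+@[\w-]+\.[\w.-]+$",
}

BIAS_PARAMS = {
    "parity_range": (0.8, 1.25),
    "min_group_size": 30,
    "sensitive_keywords": ["gender", "ethnicity", "age"],
}

DASHBOARD_CONFIG = {
    "title": "Data Quality Auditor",
    "max_file_size_mb": 200,
}


def missing_severity(ratio, thresholds=QUALITY_THRESHOLDS):
    """Classify a missing-value ratio against the configured thresholds."""
    if ratio >= thresholds["missing_critical"]:
        return "critical"
    if ratio >= thresholds["missing_warning"]:
        return "warning"
    return "ok"


print(missing_severity(0.30))
print(bool(re.match(QUALITY_THRESHOLDS["email_pattern"], "user@example.com")))
```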

How I Built This Using NEO

This project was built using NEO, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.

The requirement was a data quality auditing platform that could detect inconsistencies, bias, and format issues in any CSV dataset and surface actionable cleaning recommendations through an interactive dashboard. NEO planned and produced the files in this repository: the Streamlit dashboard in app.py, the threshold configuration in config.py, five auditor modules covering quality checks, bias detection, lineage tracking, AI suggestions, and demo data generation, four SVG infographics under docs/, and the demo dataset in data/demo_data.csv.

The result is a fully working auditing platform that takes any CSV file in and returns a quality score across 9 checks, bias metrics across sensitive attributes, a complete event-sourced lineage trail, and prioritized cleaning suggestions, all from a single streamlit run app.py command.

How You Can Use This With NEO

Audit any CSV dataset before feeding it into an ML model.
Upload any CSV file to the dashboard, and the platform runs all 9 quality checks and returns a quality score (0-100) with a pass/fail breakdown per check and interactive drill-downs. No configuration is required before the first run.

Detect demographic bias before training.
The bias_detector.py module computes demographic parity ratio, disparate impact, and group fairness metrics across sensitive attributes like gender, ethnicity, and age. Running this before training surfaces skewed representation that would otherwise propagate silently into model predictions.

Integrate quality checks programmatically into a data pipeline.
run_all_quality_checks(), run_all_bias_checks(), and generate_suggestions() are all importable directly, as shown in the command-line usage section. Any pipeline that can run Python can call these functions and surface quality issues before data reaches the next stage.

Export the full lineage trail as JSON for compliance or auditing.
The event-sourced lineage tracker logs every analysis step with timestamps and column-level changes. The full audit trail is downloadable as JSON from the Lineage tab, making it usable as a compliance artifact without any additional tooling.
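
One way to wire the quality score into a pipeline is a simple gate that stops the run when the score is too low. This is a sketch under assumptions: enforce_quality_gate, DataQualityError, and the default minimum of 70 are hypothetical, and it assumes calculate_quality_score() returns a number on the 0-100 scale described above.

```python
class DataQualityError(Exception):
    """Raised when a dataset falls below the required quality score."""


def enforce_quality_gate(score, minimum=70.0):
    """Return the score if it meets the minimum, otherwise raise."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be within 0-100, got {score}")
    if score < minimum:
        raise DataQualityError(
            f"quality score {score} is below the required {minimum}"
        )
    return score


# In a real pipeline the score would come from
# calculate_quality_score(run_all_quality_checks(df)); a literal is used here.
try:
    enforce_quality_gate(43)
except DataQualityError as exc:
    print(f"Blocked: {exc}")
```

Raising an exception rather than returning a flag lets most schedulers and CI systems mark the step as failed without extra glue code.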

Final Notes

NEO Data Quality Auditor AI is a single-command auditing platform that takes any CSV file in and surfaces quality issues, bias metrics, cleaning recommendations, and a full lineage trail, all from streamlit run app.py.

The code is at https://github.com/dakshjain-1616/Data-Quality-Auditor-AI
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code
