DEV Community

Bob Oner
Bob Oner

Posted on

Build a Python Data Quality ETL Starter for Messy CSV, Excel, JSON, and API-Style Data

Small teams often receive data before it is ready for reporting.

It may come from a CSV export, an Excel file, a JSON payload, or an API-style response. The structure is usually close enough to be useful, but not clean enough to trust directly.

Common problems include:

  • inconsistent column names
  • missing values
  • duplicate rows
  • invalid emails
  • bad dates
  • numeric fields stored as messy text
  • nested JSON that needs to become a table
  • repeated manual cleanup before every report
  • no quality report for handoff

This article walks through a small open-source Python project I built to handle that kind of problem:

Data Quality ETL Starter

It is not a big data platform. It is not an Airflow or dbt project. It is not a production data warehouse. It is a small, reproducible starter workflow for data validation, cleaning, export, and reporting.

What this project builds

The project takes messy input data and runs it through a repeatable workflow:

messy CSV / Excel / JSON / mock API data
        ↓
read and flatten
        ↓
normalize columns
        ↓
validate expected schema rules
        ↓
clean duplicate rows and text values
        ↓
export cleaned CSV + SQLite
        ↓
generate data quality report
Enter fullscreen mode Exit fullscreen mode

The current version supports:

  • CSV input
  • Excel input
  • nested JSON input
  • mock API-style JSON input
  • column name normalization
  • Pydantic-based workflow and schema models
  • missing value, duplicate row, email, date, and number validation
  • cleaned CSV output
  • SQLite output by default
  • optional PostgreSQL export
  • Markdown and JSON data quality reports
  • CLI execution
  • pytest tests
  • Docker-based execution

The main goal is not to build a complex platform. The goal is to show how a small data workflow can be structured, tested, documented, and adapted.

Why this matters for small-team data workflows

Many small-team data problems are not "big data" problems.

They are repeatability problems.

For example:

  • a sales team exports a customer list every week
  • an operations team receives Excel files from different sources
  • a founder wants a simple CSV-to-report workflow
  • an analyst needs JSON payloads flattened into reporting tables
  • a freelancer needs to hand off a data cleanup script that another person can run

A one-off script can solve one file once.

A small workflow is more useful because it makes the steps explicit:

  1. What data did we receive?
  2. What rules did we expect?
  3. What changed during cleaning?
  4. What output files were created?
  5. What warnings should the next person review?

That is the reason this project writes both cleaned data and a quality report.

Project structure

The repository keeps the workflow small and modular:

data-quality-etl-starter/
├── data/
│   ├── input/
│   ├── expected/
│   └── output/
├── docs/
├── screenshots/
├── src/dq_etl_starter/
├── tests/
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── README.md
Enter fullscreen mode Exit fullscreen mode

The most important source modules are:

src/dq_etl_starter/
├── readers.py       # read CSV, Excel, and JSON files
├── mock_api.py      # simulate API-style JSON without network calls
├── normalize.py     # normalize columns and flatten JSON
├── validate.py      # validate expected columns and simple schema rules
├── clean.py         # trim text values and drop duplicates
├── exporters.py     # export cleaned data to CSV, SQLite, or PostgreSQL
├── report.py        # generate Markdown and JSON reports
├── models.py        # Pydantic models for workflow contracts
└── cli.py           # command-line entry point
Enter fullscreen mode Exit fullscreen mode

This separation is intentional. Each step can be tested, replaced, or extended without turning the project into one long script.

Setup

Clone the repository:

git clone https://github.com/OnerGit/data-quality-etl-starter.git
cd data-quality-etl-starter
Enter fullscreen mode Exit fullscreen mode

Create a virtual environment:

python -m venv .venv
Enter fullscreen mode Exit fullscreen mode

Activate it on macOS or Linux:

source .venv/bin/activate
Enter fullscreen mode Exit fullscreen mode

Activate it on Windows PowerShell:

.venv\Scripts\activate
Enter fullscreen mode Exit fullscreen mode

Install dependencies and the local package:

pip install -r requirements.txt
pip install -e .
Enter fullscreen mode Exit fullscreen mode

The editable install step is useful because the source code uses a src/ layout.

Input data examples

The project includes sample inputs under data/input/.

The examples are designed to represent common data workflow formats:

data/input/messy_customers.csv
data/input/messy_orders.xlsx
data/input/nested_customers.json
data/input/mock_api_orders.json
Enter fullscreen mode Exit fullscreen mode

Here is an example of the kind of messy source data the workflow is designed to handle:

Raw messy data

The mock API file does not call a real external API. It simulates an API-style JSON response so the workflow remains reproducible and does not require API keys.

That is useful for a starter project because anyone can run it locally without creating accounts, setting secrets, or depending on an external service.

Run the CSV workflow

The CSV example reads a messy customer file, validates it against an expected schema, cleans it, exports the result, and generates reports.

python -m dq_etl_starter.cli run \
  --input data/input/messy_customers.csv \
  --input-type csv \
  --schema data/expected/customer_schema.json \
  --output-dir data/output/csv \
  --db-target sqlite \
  --table-name cleaned_customers
Enter fullscreen mode Exit fullscreen mode

Expected outputs:

data/output/csv/cleaned_customers.csv
data/output/csv/etl_output.sqlite
data/output/csv/quality_report.md
data/output/csv/quality_report.json
Enter fullscreen mode Exit fullscreen mode

CLI workflow run

The important point is that the workflow does not only create a cleaned file. It also records what happened.

Run the Excel workflow

Excel exports are common in small business and operations workflows.

Run the sample Excel input:

python -m dq_etl_starter.cli run \
  --input data/input/messy_orders.xlsx \
  --input-type excel \
  --schema data/expected/order_schema.json \
  --output-dir data/output/excel \
  --db-target sqlite \
  --table-name cleaned_orders
Enter fullscreen mode Exit fullscreen mode

The output pattern is the same:

data/output/excel/cleaned_orders.csv
data/output/excel/etl_output.sqlite
data/output/excel/quality_report.md
data/output/excel/quality_report.json
Enter fullscreen mode Exit fullscreen mode

Keeping the CSV and Excel output folders separate makes it easier to compare runs without overwriting previous reports.

Run nested JSON data

JSON is useful for APIs and application exports, but nested JSON is not always reporting-ready.

This project supports a --records-path option so the workflow can extract a list of records from a nested payload.

python -m dq_etl_starter.cli run \
  --input data/input/nested_customers.json \
  --input-type json \
  --records-path data.customers \
  --schema data/expected/customer_schema.json \
  --output-dir data/output/json \
  --db-target sqlite \
  --table-name cleaned_customers_json
Enter fullscreen mode Exit fullscreen mode

Nested JSON flattened output

This step demonstrates a practical pattern: convert nested data into a tabular structure before validation, cleaning, and reporting.

Run mock API-style data

The mock API workflow uses a local JSON file that looks like an API response.

python -m dq_etl_starter.cli run \
  --input data/input/mock_api_orders.json \
  --input-type mock-api \
  --records-path data.orders \
  --schema data/expected/order_schema.json \
  --output-dir data/output/mock_api \
  --db-target sqlite \
  --table-name cleaned_api_orders
Enter fullscreen mode Exit fullscreen mode

This is intentionally not a real API integration in the current version.

For a client project, this layer could later be replaced with a real API reader that handles authentication, pagination, retries, and rate limits. For a starter project, using a local mock payload keeps the workflow easy to run and test.

How validation works

The project uses schema files under data/expected/ to define what the workflow expects.

For example, a customer schema can describe expected columns and simple column rules. The workflow then checks the raw data before cleaning.

Validation can detect issues such as:

  • missing expected columns
  • unexpected columns
  • missing values
  • duplicate rows
  • invalid email formats
  • invalid date values
  • invalid number values

The validation report uses source-oriented row references for CSV-style inputs. For example, if the source CSV has one header line and five data rows, a warning on Row 6 points to the fifth data record in the source file.

Pydantic is used for the workflow and reporting contracts. The row-level cleaning and validation remain DataFrame-based because CSV, Excel, JSON, and API-style datasets often have changing columns.

The validation step does not try to solve every business rule. That is a deliberate choice.

In real client work, every dataset has different rules. A small starter should make validation easy to extend rather than hardcoding too many assumptions.

How cleaning works

After validation, the workflow applies simple cleaning steps:

  • trim text values
  • normalize selected text fields
  • drop duplicate rows
  • keep the cleaned data in a DataFrame
  • export the cleaned result

The project keeps cleaning intentionally conservative.

It does not silently invent missing values. It does not guess complex business logic. It focuses on cleaning steps that are easy to explain and review.

That matters for handoff. When someone else receives the output, they should be able to understand what the script changed and what still needs human review.

Output files

Each run can create four main outputs:

cleaned data as CSV
cleaned data in SQLite
quality report as Markdown
quality report as JSON
Enter fullscreen mode Exit fullscreen mode

For example:

data/output/csv/cleaned_customers.csv
data/output/csv/etl_output.sqlite
data/output/csv/quality_report.md
data/output/csv/quality_report.json
Enter fullscreen mode Exit fullscreen mode

The Markdown report is useful for humans. The JSON report is useful if another tool needs to consume the result later.

A typical report includes:

  • raw row count
  • cleaned row count
  • column list
  • missing values by column
  • duplicate row count
  • missing expected columns
  • unexpected columns
  • validation issues
  • output file paths

Data quality report

The cleaned output can then be reviewed directly or used in a later reporting workflow.

Cleaned output

This turns the workflow from "a script produced a file" into "a repeatable run produced data and a reviewable report."

Inspect SQLite output

SQLite is the default database target because it is local, portable, and easy to inspect.

After running the CSV workflow, you can open the SQLite file:

sqlite3 data/output/csv/etl_output.sqlite
Enter fullscreen mode Exit fullscreen mode

Then inspect the tables:

.tables
SELECT * FROM cleaned_customers LIMIT 5;
Enter fullscreen mode Exit fullscreen mode

This is useful when the cleaned data should later feed a dashboard, internal tool, or reporting process.

Optional PostgreSQL export

The project also includes optional PostgreSQL export support.

This is useful when cleaned data needs to be loaded into a shared database instead of a local SQLite file. SQLite remains the default local target.

Start PostgreSQL with Docker Compose:

docker compose up -d postgres
Enter fullscreen mode Exit fullscreen mode

Run the workflow with PostgreSQL as the target:

DATABASE_URL=postgresql+psycopg://dq_user:dq_password@localhost:5432/dq_demo \
python -m dq_etl_starter.cli run \
  --input data/input/messy_customers.csv \
  --input-type csv \
  --schema data/expected/customer_schema.json \
  --output-dir data/output/postgres \
  --db-target postgres \
  --table-name cleaned_customers
Enter fullscreen mode Exit fullscreen mode

For Windows PowerShell, set the environment variable separately:

$env:DATABASE_URL="postgresql+psycopg://dq_user:dq_password@localhost:5432/dq_demo"

python -m dq_etl_starter.cli run `
  --input data/input/messy_customers.csv `
  --input-type csv `
  --schema data/expected/customer_schema.json `
  --output-dir data/output/postgres `
  --db-target postgres `
  --table-name cleaned_customers
Enter fullscreen mode Exit fullscreen mode

I would still treat PostgreSQL as an optional extension for this starter. SQLite is enough for the default local workflow.

Run tests

Run the test suite with:

pytest
Enter fullscreen mode Exit fullscreen mode

Passing tests

The tests cover the workflow pieces that matter most for a small ETL starter:

  • reading input files
  • normalizing data
  • validating data
  • cleaning rows
  • exporting data
  • generating reports
  • running the CLI

Tests are important here because data cleanup scripts are easy to break when formats change.

A small test suite makes the workflow safer to modify.

Run with Docker

The project can also run inside Docker.

Build the image:

docker build -t data-quality-etl-starter .
Enter fullscreen mode Exit fullscreen mode

Run the container:

docker run --rm -v "${PWD}/data/output:/app/data/output" data-quality-etl-starter
Enter fullscreen mode Exit fullscreen mode

On Windows PowerShell, the same command format can be used:

docker run --rm -v "${PWD}/data/output:/app/data/output" data-quality-etl-starter
Enter fullscreen mode Exit fullscreen mode

Docker run

Docker is useful for handoff because it reduces local environment differences. A reviewer or client can run the workflow without manually recreating the same Python environment.

What I intentionally kept out of v0.1

This project is intentionally small.

The current version does not include:

  • a FastAPI service
  • user authentication
  • scheduled jobs
  • a web dashboard
  • Airflow orchestration
  • dbt models
  • cloud deployment
  • real external API authentication
  • complex business-specific validation rules

FastAPI would be a natural future layer, but it is not part of the v0.1 core.

The CLI workflow remains the source of truth. A later API layer could expose endpoints for upload, validation, and report retrieval, but that should come after the core workflow is stable.

Keeping the first version small makes the project easier to run, review, test, and adapt.

How this maps to freelance client work

This kind of starter maps well to small Python data workflow tasks.

Examples include:

  • cleaning CSV exports before reporting
  • converting Excel files into normalized CSV output
  • flattening JSON into tables
  • validating required columns before import
  • producing simple data quality reports
  • loading cleaned data into SQLite or PostgreSQL
  • preparing data for dashboards or internal tools
  • turning a manual weekly cleanup process into a repeatable command

For freelance work, the value is not only the code.

The value is the handoff:

  • clear commands
  • sample inputs
  • predictable outputs
  • validation warnings
  • reports
  • tests
  • Docker support
  • documented limitations

That is what makes a small automation project easier for another person to trust.

Next steps

The next improvements I would consider are:

  • add more schema rule types
  • improve report formatting
  • add richer error messages
  • add logging
  • add run IDs for report history
  • add a real API reader example
  • add a FastAPI wrapper around the CLI workflow
  • add more PostgreSQL examples
  • add CI for automated testing

The important constraint is to keep the project practical. It should remain small enough for a developer, analyst, or client to understand without needing a full data platform.

GitHub repository

The full project is available here:

https://github.com/OnerGit/data-quality-etl-starter

If you work with messy CSV, Excel, JSON, or API-style data, this kind of starter can be a useful base for building repeatable data cleaning and reporting workflows.

Top comments (0)