Bob Oner

Posted on Jun 3

Build a Python Data Quality ETL Starter for Messy CSV, Excel, JSON, and API-Style Data

#python #etl #dataengineering #pandas

Small teams often receive data before it is ready for reporting.

It may come from a CSV export, an Excel file, a JSON payload, or an API-style response. The structure is usually close enough to be useful, but not clean enough to trust directly.

Common problems include:

inconsistent column names
missing values
duplicate rows
invalid emails
bad dates
numeric fields stored as messy text
nested JSON that needs to become a table
repeated manual cleanup before every report
no quality report for handoff

This article walks through a small open-source Python project I built to handle that kind of problem:

Data Quality ETL Starter

It is not a big data platform. It is not an Airflow or dbt project. It is not a production data warehouse. It is a small, reproducible starter workflow for data validation, cleaning, export, and reporting.

What this project builds

The project takes messy input data and runs it through a repeatable workflow:

messy CSV / Excel / JSON / mock API data
        ↓
read and flatten
        ↓
normalize columns
        ↓
validate expected schema rules
        ↓
clean duplicate rows and text values
        ↓
export cleaned CSV + SQLite
        ↓
generate data quality report

The current version supports:

CSV input
Excel input
nested JSON input
mock API-style JSON input
column name normalization
Pydantic-based workflow and schema models
missing value, duplicate row, email, date, and number validation
cleaned CSV output
SQLite output by default
optional PostgreSQL export
Markdown and JSON data quality reports
CLI execution
pytest tests
Docker-based execution

The main goal is not to build a complex platform. The goal is to show how a small data workflow can be structured, tested, documented, and adapted.

Why this matters for small-team data workflows

Many small-team data problems are not "big data" problems.

They are repeatability problems.

For example:

a sales team exports a customer list every week
an operations team receives Excel files from different sources
a founder wants a simple CSV-to-report workflow
an analyst needs JSON payloads flattened into reporting tables
a freelancer needs to hand off a data cleanup script that another person can run

A one-off script can solve one file once.

A small workflow is more useful because it makes the steps explicit:

What data did we receive?
What rules did we expect?
What changed during cleaning?
What output files were created?
What warnings should the next person review?

That is the reason this project writes both cleaned data and a quality report.

Project structure

The repository keeps the workflow small and modular:

data-quality-etl-starter/
├── data/
│   ├── input/
│   ├── expected/
│   └── output/
├── docs/
├── screenshots/
├── src/dq_etl_starter/
├── tests/
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── README.md

The most important source modules are:

src/dq_etl_starter/
├── readers.py       # read CSV, Excel, and JSON files
├── mock_api.py      # simulate API-style JSON without network calls
├── normalize.py     # normalize columns and flatten JSON
├── validate.py      # validate expected columns and simple schema rules
├── clean.py         # trim text values and drop duplicates
├── exporters.py     # export cleaned data to CSV, SQLite, or PostgreSQL
├── report.py        # generate Markdown and JSON reports
├── models.py        # Pydantic models for workflow contracts
└── cli.py           # command-line entry point

This separation is intentional. Each step can be tested, replaced, or extended without turning the project into one long script.

Setup

Clone the repository:

git clone https://github.com/OnerGit/data-quality-etl-starter.git
cd data-quality-etl-starter

Create a virtual environment:

python -m venv .venv

Activate it on macOS or Linux:

source .venv/bin/activate

Activate it on Windows PowerShell:

.venv\Scripts\activate

Install dependencies and the local package:

pip install -r requirements.txt
pip install -e .

The editable install step is useful because the source code uses a src/ layout.

Input data examples

The project includes sample inputs under data/input/.

The examples are designed to represent common data workflow formats:

data/input/messy_customers.csv
data/input/messy_orders.xlsx
data/input/nested_customers.json
data/input/mock_api_orders.json

Here is an example of the kind of messy source data the workflow is designed to handle:

The mock API file does not call a real external API. It simulates an API-style JSON response so the workflow remains reproducible and does not require API keys.

That is useful for a starter project because anyone can run it locally without creating accounts, setting secrets, or depending on an external service.

Run the CSV workflow

The CSV example reads a messy customer file, validates it against an expected schema, cleans it, exports the result, and generates reports.

python -m dq_etl_starter.cli run \
  --input data/input/messy_customers.csv \
  --input-type csv \
  --schema data/expected/customer_schema.json \
  --output-dir data/output/csv \
  --db-target sqlite \
  --table-name cleaned_customers

Expected outputs:

data/output/csv/cleaned_customers.csv
data/output/csv/etl_output.sqlite
data/output/csv/quality_report.md
data/output/csv/quality_report.json

The important point is that the workflow does not only create a cleaned file. It also records what happened.

Run the Excel workflow

Excel exports are common in small business and operations workflows.

Run the sample Excel input:

python -m dq_etl_starter.cli run \
  --input data/input/messy_orders.xlsx \
  --input-type excel \
  --schema data/expected/order_schema.json \
  --output-dir data/output/excel \
  --db-target sqlite \
  --table-name cleaned_orders

The output pattern is the same:

data/output/excel/cleaned_orders.csv
data/output/excel/etl_output.sqlite
data/output/excel/quality_report.md
data/output/excel/quality_report.json

Keeping the CSV and Excel output folders separate makes it easier to compare runs without overwriting previous reports.

Run nested JSON data

JSON is useful for APIs and application exports, but nested JSON is not always reporting-ready.

This project supports a --records-path option so the workflow can extract a list of records from a nested payload.

python -m dq_etl_starter.cli run \
  --input data/input/nested_customers.json \
  --input-type json \
  --records-path data.customers \
  --schema data/expected/customer_schema.json \
  --output-dir data/output/json \
  --db-target sqlite \
  --table-name cleaned_customers_json

This step demonstrates a practical pattern: convert nested data into a tabular structure before validation, cleaning, and reporting.

Run mock API-style data

The mock API workflow uses a local JSON file that looks like an API response.

python -m dq_etl_starter.cli run \
  --input data/input/mock_api_orders.json \
  --input-type mock-api \
  --records-path data.orders \
  --schema data/expected/order_schema.json \
  --output-dir data/output/mock_api \
  --db-target sqlite \
  --table-name cleaned_api_orders

This is intentionally not a real API integration in the current version.

For a client project, this layer could later be replaced with a real API reader that handles authentication, pagination, retries, and rate limits. For a starter project, using a local mock payload keeps the workflow easy to run and test.

How validation works

The project uses schema files under data/expected/ to define what the workflow expects.

For example, a customer schema can describe expected columns and simple column rules. The workflow then checks the raw data before cleaning.

Validation can detect issues such as:

missing expected columns
unexpected columns
missing values
duplicate rows
invalid email formats
invalid date values
invalid number values

The validation report uses source-oriented row references for CSV-style inputs. For example, if the source CSV has one header line and five data rows, a warning on Row 6 points to the fifth data record in the source file.

Pydantic is used for the workflow and reporting contracts. The row-level cleaning and validation remain DataFrame-based because CSV, Excel, JSON, and API-style datasets often have changing columns.

The validation step does not try to solve every business rule. That is a deliberate choice.

In real client work, every dataset has different rules. A small starter should make validation easy to extend rather than hardcoding too many assumptions.

How cleaning works

After validation, the workflow applies simple cleaning steps:

trim text values
normalize selected text fields
drop duplicate rows
keep the cleaned data in a DataFrame
export the cleaned result

The project keeps cleaning intentionally conservative.

It does not silently invent missing values. It does not guess complex business logic. It focuses on cleaning steps that are easy to explain and review.

That matters for handoff. When someone else receives the output, they should be able to understand what the script changed and what still needs human review.

Output files

Each run can create four main outputs:

cleaned data as CSV
cleaned data in SQLite
quality report as Markdown
quality report as JSON

For example:

data/output/csv/cleaned_customers.csv
data/output/csv/etl_output.sqlite
data/output/csv/quality_report.md
data/output/csv/quality_report.json

The Markdown report is useful for humans. The JSON report is useful if another tool needs to consume the result later.

A typical report includes:

raw row count
cleaned row count
column list
missing values by column
duplicate row count
missing expected columns
unexpected columns
validation issues
output file paths

The cleaned output can then be reviewed directly or used in a later reporting workflow.

This turns the workflow from "a script produced a file" into "a repeatable run produced data and a reviewable report."

Inspect SQLite output

SQLite is the default database target because it is local, portable, and easy to inspect.

After running the CSV workflow, you can open the SQLite file:

sqlite3 data/output/csv/etl_output.sqlite

Then inspect the tables:

.tables
SELECT * FROM cleaned_customers LIMIT 5;

This is useful when the cleaned data should later feed a dashboard, internal tool, or reporting process.

Optional PostgreSQL export

The project also includes optional PostgreSQL export support.

This is useful when cleaned data needs to be loaded into a shared database instead of a local SQLite file. SQLite remains the default local target.

Start PostgreSQL with Docker Compose:

docker compose up -d postgres

Run the workflow with PostgreSQL as the target:

DATABASE_URL=postgresql+psycopg://dq_user:dq_password@localhost:5432/dq_demo \
python -m dq_etl_starter.cli run \
  --input data/input/messy_customers.csv \
  --input-type csv \
  --schema data/expected/customer_schema.json \
  --output-dir data/output/postgres \
  --db-target postgres \
  --table-name cleaned_customers

For Windows PowerShell, set the environment variable separately:

$env:DATABASE_URL="postgresql+psycopg://dq_user:dq_password@localhost:5432/dq_demo"

python -m dq_etl_starter.cli run `
  --input data/input/messy_customers.csv `
  --input-type csv `
  --schema data/expected/customer_schema.json `
  --output-dir data/output/postgres `
  --db-target postgres `
  --table-name cleaned_customers

I would still treat PostgreSQL as an optional extension for this starter. SQLite is enough for the default local workflow.

Run tests

Run the test suite with:

pytest

The tests cover the workflow pieces that matter most for a small ETL starter:

reading input files
normalizing data
validating data
cleaning rows
exporting data
generating reports
running the CLI

Tests are important here because data cleanup scripts are easy to break when formats change.

A small test suite makes the workflow safer to modify.

Run with Docker

The project can also run inside Docker.

Build the image:

docker build -t data-quality-etl-starter .

Run the container:

docker run --rm -v "${PWD}/data/output:/app/data/output" data-quality-etl-starter

On Windows PowerShell, the same command format can be used:

docker run --rm -v "${PWD}/data/output:/app/data/output" data-quality-etl-starter

Docker is useful for handoff because it reduces local environment differences. A reviewer or client can run the workflow without manually recreating the same Python environment.

What I intentionally kept out of v0.1

This project is intentionally small.

The current version does not include:

a FastAPI service
user authentication
scheduled jobs
a web dashboard
Airflow orchestration
dbt models
cloud deployment
real external API authentication
complex business-specific validation rules

FastAPI would be a natural future layer, but it is not part of the v0.1 core.

The CLI workflow remains the source of truth. A later API layer could expose endpoints for upload, validation, and report retrieval, but that should come after the core workflow is stable.

Keeping the first version small makes the project easier to run, review, test, and adapt.

How this maps to freelance client work

This kind of starter maps well to small Python data workflow tasks.

Examples include:

cleaning CSV exports before reporting
converting Excel files into normalized CSV output
flattening JSON into tables
validating required columns before import
producing simple data quality reports
loading cleaned data into SQLite or PostgreSQL
preparing data for dashboards or internal tools
turning a manual weekly cleanup process into a repeatable command

For freelance work, the value is not only the code.

The value is the handoff:

clear commands
sample inputs
predictable outputs
validation warnings
reports
tests
Docker support
documented limitations

That is what makes a small automation project easier for another person to trust.

Next steps

The next improvements I would consider are:

add more schema rule types
improve report formatting
add richer error messages
add logging
add run IDs for report history
add a real API reader example
add a FastAPI wrapper around the CLI workflow
add more PostgreSQL examples
add CI for automated testing

The important constraint is to keep the project practical. It should remain small enough for a developer, analyst, or client to understand without needing a full data platform.

GitHub repository

The full project is available here:

https://github.com/OnerGit/data-quality-etl-starter

If you work with messy CSV, Excel, JSON, or API-style data, this kind of starter can be a useful base for building repeatable data cleaning and reporting workflows.

DEV Community