Small teams often receive data before it is ready for reporting.
It may come from a CSV export, an Excel file, a JSON payload, or an API-style response. The structure is usually close enough to be useful, but not clean enough to trust directly.
Common problems include:
- inconsistent column names
- missing values
- duplicate rows
- invalid emails
- bad dates
- numeric fields stored as messy text
- nested JSON that needs to become a table
- repeated manual cleanup before every report
- no quality report for handoff
This article walks through a small open-source Python project I built to handle that kind of problem:
It is not a big data platform. It is not an Airflow or dbt project. It is not a production data warehouse. It is a small, reproducible starter workflow for data validation, cleaning, export, and reporting.
What this project builds
The project takes messy input data and runs it through a repeatable workflow:
messy CSV / Excel / JSON / mock API data
↓
read and flatten
↓
normalize columns
↓
validate expected schema rules
↓
clean duplicate rows and text values
↓
export cleaned CSV + SQLite
↓
generate data quality report
The current version supports:
- CSV input
- Excel input
- nested JSON input
- mock API-style JSON input
- column name normalization
- Pydantic-based workflow and schema models
- missing value, duplicate row, email, date, and number validation
- cleaned CSV output
- SQLite output by default
- optional PostgreSQL export
- Markdown and JSON data quality reports
- CLI execution
- pytest tests
- Docker-based execution
The main goal is not to build a complex platform. The goal is to show how a small data workflow can be structured, tested, documented, and adapted.
Why this matters for small-team data workflows
Many small-team data problems are not "big data" problems.
They are repeatability problems.
For example:
- a sales team exports a customer list every week
- an operations team receives Excel files from different sources
- a founder wants a simple CSV-to-report workflow
- an analyst needs JSON payloads flattened into reporting tables
- a freelancer needs to hand off a data cleanup script that another person can run
A one-off script can solve one file once.
A small workflow is more useful because it makes the steps explicit:
- What data did we receive?
- What rules did we expect?
- What changed during cleaning?
- What output files were created?
- What warnings should the next person review?
That is the reason this project writes both cleaned data and a quality report.
Project structure
The repository keeps the workflow small and modular:
data-quality-etl-starter/
├── data/
│ ├── input/
│ ├── expected/
│ └── output/
├── docs/
├── screenshots/
├── src/dq_etl_starter/
├── tests/
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── README.md
The most important source modules are:
src/dq_etl_starter/
├── readers.py # read CSV, Excel, and JSON files
├── mock_api.py # simulate API-style JSON without network calls
├── normalize.py # normalize columns and flatten JSON
├── validate.py # validate expected columns and simple schema rules
├── clean.py # trim text values and drop duplicates
├── exporters.py # export cleaned data to CSV, SQLite, or PostgreSQL
├── report.py # generate Markdown and JSON reports
├── models.py # Pydantic models for workflow contracts
└── cli.py # command-line entry point
This separation is intentional. Each step can be tested, replaced, or extended without turning the project into one long script.
Setup
Clone the repository:
git clone https://github.com/OnerGit/data-quality-etl-starter.git
cd data-quality-etl-starter
Create a virtual environment:
python -m venv .venv
Activate it on macOS or Linux:
source .venv/bin/activate
Activate it on Windows PowerShell:
.venv\Scripts\activate
Install dependencies and the local package:
pip install -r requirements.txt
pip install -e .
The editable install step is useful because the source code uses a src/ layout.
Input data examples
The project includes sample inputs under data/input/.
The examples are designed to represent common data workflow formats:
data/input/messy_customers.csv
data/input/messy_orders.xlsx
data/input/nested_customers.json
data/input/mock_api_orders.json
Here is an example of the kind of messy source data the workflow is designed to handle:
The mock API file does not call a real external API. It simulates an API-style JSON response so the workflow remains reproducible and does not require API keys.
That is useful for a starter project because anyone can run it locally without creating accounts, setting secrets, or depending on an external service.
Run the CSV workflow
The CSV example reads a messy customer file, validates it against an expected schema, cleans it, exports the result, and generates reports.
python -m dq_etl_starter.cli run \
--input data/input/messy_customers.csv \
--input-type csv \
--schema data/expected/customer_schema.json \
--output-dir data/output/csv \
--db-target sqlite \
--table-name cleaned_customers
Expected outputs:
data/output/csv/cleaned_customers.csv
data/output/csv/etl_output.sqlite
data/output/csv/quality_report.md
data/output/csv/quality_report.json
The important point is that the workflow does not only create a cleaned file. It also records what happened.
Run the Excel workflow
Excel exports are common in small business and operations workflows.
Run the sample Excel input:
python -m dq_etl_starter.cli run \
--input data/input/messy_orders.xlsx \
--input-type excel \
--schema data/expected/order_schema.json \
--output-dir data/output/excel \
--db-target sqlite \
--table-name cleaned_orders
The output pattern is the same:
data/output/excel/cleaned_orders.csv
data/output/excel/etl_output.sqlite
data/output/excel/quality_report.md
data/output/excel/quality_report.json
Keeping the CSV and Excel output folders separate makes it easier to compare runs without overwriting previous reports.
Run nested JSON data
JSON is useful for APIs and application exports, but nested JSON is not always reporting-ready.
This project supports a --records-path option so the workflow can extract a list of records from a nested payload.
python -m dq_etl_starter.cli run \
--input data/input/nested_customers.json \
--input-type json \
--records-path data.customers \
--schema data/expected/customer_schema.json \
--output-dir data/output/json \
--db-target sqlite \
--table-name cleaned_customers_json
This step demonstrates a practical pattern: convert nested data into a tabular structure before validation, cleaning, and reporting.
Run mock API-style data
The mock API workflow uses a local JSON file that looks like an API response.
python -m dq_etl_starter.cli run \
--input data/input/mock_api_orders.json \
--input-type mock-api \
--records-path data.orders \
--schema data/expected/order_schema.json \
--output-dir data/output/mock_api \
--db-target sqlite \
--table-name cleaned_api_orders
This is intentionally not a real API integration in the current version.
For a client project, this layer could later be replaced with a real API reader that handles authentication, pagination, retries, and rate limits. For a starter project, using a local mock payload keeps the workflow easy to run and test.
How validation works
The project uses schema files under data/expected/ to define what the workflow expects.
For example, a customer schema can describe expected columns and simple column rules. The workflow then checks the raw data before cleaning.
Validation can detect issues such as:
- missing expected columns
- unexpected columns
- missing values
- duplicate rows
- invalid email formats
- invalid date values
- invalid number values
The validation report uses source-oriented row references for CSV-style inputs. For example, if the source CSV has one header line and five data rows, a warning on Row 6 points to the fifth data record in the source file.
Pydantic is used for the workflow and reporting contracts. The row-level cleaning and validation remain DataFrame-based because CSV, Excel, JSON, and API-style datasets often have changing columns.
The validation step does not try to solve every business rule. That is a deliberate choice.
In real client work, every dataset has different rules. A small starter should make validation easy to extend rather than hardcoding too many assumptions.
How cleaning works
After validation, the workflow applies simple cleaning steps:
- trim text values
- normalize selected text fields
- drop duplicate rows
- keep the cleaned data in a DataFrame
- export the cleaned result
The project keeps cleaning intentionally conservative.
It does not silently invent missing values. It does not guess complex business logic. It focuses on cleaning steps that are easy to explain and review.
That matters for handoff. When someone else receives the output, they should be able to understand what the script changed and what still needs human review.
Output files
Each run can create four main outputs:
cleaned data as CSV
cleaned data in SQLite
quality report as Markdown
quality report as JSON
For example:
data/output/csv/cleaned_customers.csv
data/output/csv/etl_output.sqlite
data/output/csv/quality_report.md
data/output/csv/quality_report.json
The Markdown report is useful for humans. The JSON report is useful if another tool needs to consume the result later.
A typical report includes:
- raw row count
- cleaned row count
- column list
- missing values by column
- duplicate row count
- missing expected columns
- unexpected columns
- validation issues
- output file paths
The cleaned output can then be reviewed directly or used in a later reporting workflow.
This turns the workflow from "a script produced a file" into "a repeatable run produced data and a reviewable report."
Inspect SQLite output
SQLite is the default database target because it is local, portable, and easy to inspect.
After running the CSV workflow, you can open the SQLite file:
sqlite3 data/output/csv/etl_output.sqlite
Then inspect the tables:
.tables
SELECT * FROM cleaned_customers LIMIT 5;
This is useful when the cleaned data should later feed a dashboard, internal tool, or reporting process.
Optional PostgreSQL export
The project also includes optional PostgreSQL export support.
This is useful when cleaned data needs to be loaded into a shared database instead of a local SQLite file. SQLite remains the default local target.
Start PostgreSQL with Docker Compose:
docker compose up -d postgres
Run the workflow with PostgreSQL as the target:
DATABASE_URL=postgresql+psycopg://dq_user:dq_password@localhost:5432/dq_demo \
python -m dq_etl_starter.cli run \
--input data/input/messy_customers.csv \
--input-type csv \
--schema data/expected/customer_schema.json \
--output-dir data/output/postgres \
--db-target postgres \
--table-name cleaned_customers
For Windows PowerShell, set the environment variable separately:
$env:DATABASE_URL="postgresql+psycopg://dq_user:dq_password@localhost:5432/dq_demo"
python -m dq_etl_starter.cli run `
--input data/input/messy_customers.csv `
--input-type csv `
--schema data/expected/customer_schema.json `
--output-dir data/output/postgres `
--db-target postgres `
--table-name cleaned_customers
I would still treat PostgreSQL as an optional extension for this starter. SQLite is enough for the default local workflow.
Run tests
Run the test suite with:
pytest
The tests cover the workflow pieces that matter most for a small ETL starter:
- reading input files
- normalizing data
- validating data
- cleaning rows
- exporting data
- generating reports
- running the CLI
Tests are important here because data cleanup scripts are easy to break when formats change.
A small test suite makes the workflow safer to modify.
Run with Docker
The project can also run inside Docker.
Build the image:
docker build -t data-quality-etl-starter .
Run the container:
docker run --rm -v "${PWD}/data/output:/app/data/output" data-quality-etl-starter
On Windows PowerShell, the same command format can be used:
docker run --rm -v "${PWD}/data/output:/app/data/output" data-quality-etl-starter
Docker is useful for handoff because it reduces local environment differences. A reviewer or client can run the workflow without manually recreating the same Python environment.
What I intentionally kept out of v0.1
This project is intentionally small.
The current version does not include:
- a FastAPI service
- user authentication
- scheduled jobs
- a web dashboard
- Airflow orchestration
- dbt models
- cloud deployment
- real external API authentication
- complex business-specific validation rules
FastAPI would be a natural future layer, but it is not part of the v0.1 core.
The CLI workflow remains the source of truth. A later API layer could expose endpoints for upload, validation, and report retrieval, but that should come after the core workflow is stable.
Keeping the first version small makes the project easier to run, review, test, and adapt.
How this maps to freelance client work
This kind of starter maps well to small Python data workflow tasks.
Examples include:
- cleaning CSV exports before reporting
- converting Excel files into normalized CSV output
- flattening JSON into tables
- validating required columns before import
- producing simple data quality reports
- loading cleaned data into SQLite or PostgreSQL
- preparing data for dashboards or internal tools
- turning a manual weekly cleanup process into a repeatable command
For freelance work, the value is not only the code.
The value is the handoff:
- clear commands
- sample inputs
- predictable outputs
- validation warnings
- reports
- tests
- Docker support
- documented limitations
That is what makes a small automation project easier for another person to trust.
Next steps
The next improvements I would consider are:
- add more schema rule types
- improve report formatting
- add richer error messages
- add logging
- add run IDs for report history
- add a real API reader example
- add a FastAPI wrapper around the CLI workflow
- add more PostgreSQL examples
- add CI for automated testing
The important constraint is to keep the project practical. It should remain small enough for a developer, analyst, or client to understand without needing a full data platform.
GitHub repository
The full project is available here:
https://github.com/OnerGit/data-quality-etl-starter
If you work with messy CSV, Excel, JSON, or API-style data, this kind of starter can be a useful base for building repeatable data cleaning and reporting workflows.







Top comments (0)