Bob Oner

Posted on Jun 9

From Data Quality Checks to Analytics-Ready Parquet with Python

#python #etl #dataengineering #docker

In the first article, I walked through a small Python data quality ETL starter that reads messy CSV, Excel, JSON, and API-style data, validates it, cleans it, exports it, and generates quality reports.

Build a Python Data Quality ETL Starter for Messy CSV, Excel, JSON, and API-Style Data

This follow-up focuses on the v0.4.0 update of the same project:

Data Quality ETL Starter on GitHub

The new goal is still intentionally small. This is not a data warehouse, a BI platform, an Airflow project, a dbt project, or an AI data application.

The goal is to extend the starter workflow from small sample files to a more realistic local analytics demo:

generated messy order data
        ↓
existing validation and cleaning logic
        ↓
cleaned CSV
        ↓
cleaned Parquet
        ↓
DuckDB query demo
        ↓
summary CSV tables
        ↓
benchmark_report.md

This makes the project more useful as a portfolio asset because it demonstrates not only data cleaning, but also the next handoff step: producing files that are easier to analyze locally.

Why add an analytics-ready export path?

Many small-team data workflows do not end with a cleaned CSV file.

A cleaned CSV is useful, but a reporting workflow often needs one more layer:

a file format suitable for repeated analysis;
simple summary tables;
SQL queries that can be reviewed and reused;
a lightweight benchmark report;
a way to test the workflow with more than a tiny demo file.

That is the reason v0.4.0 adds an optional analytics-ready path.

The project still keeps the original CLI workflow as the source of truth. The analytics demo is an additional path, not a replacement for the existing CSV, Excel, JSON, mock API, SQLite, PostgreSQL, or FastAPI validation workflows.

What v0.4.0 adds

The v0.4.0 update adds three main pieces.

First, it adds a synthetic order data generator:

scripts/generate_sample_data.py

Second, it adds an analytics demo runner:

scripts/run_analytics_demo.py

Third, it adds analytics helpers under the package source:

src/dq_etl_starter/analytics.py

Together, these files show how to generate repeatable synthetic data, run it through the existing cleaning and validation logic, and produce analytics-ready outputs.

The project version is now 0.4.0.

Project structure after the update

The relevant structure now looks like this:

data-quality-etl-starter/
├── data/
│   ├── input/
│   ├── expected/
│   └── output/
├── docs/
│   ├── api.md
│   ├── analytics.md
│   └── postgres.md
├── screenshots/
├── scripts/
│   ├── generate_sample_data.py
│   └── run_analytics_demo.py
├── src/dq_etl_starter/
│   ├── analytics.py
│   ├── api.py
│   ├── clean.py
│   ├── cli.py
│   ├── exporters.py
│   ├── readers.py
│   ├── services.py
│   └── validate.py
└── tests/

This is still a small project, but the workflow now has a clearer progression:

local data quality workflow;
optional PostgreSQL export;
optional FastAPI validation wrapper;
optional analytics-ready export demo.

That progression matters because it maps to how small client projects often grow.

A client may first need a repeatable CSV cleanup script. Later, they may ask for database export, an API endpoint, or files that can feed reporting tools.

Install the project locally

Clone the repository:

git clone https://github.com/OnerGit/data-quality-etl-starter.git
cd data-quality-etl-starter

Create a virtual environment:

python -m venv .venv

Activate it on macOS or Linux:

source .venv/bin/activate

Activate it on Windows PowerShell:

.venv\Scripts\activate

Install dependencies and the local package:

pip install -r requirements.txt
pip install -e .

The editable install step is useful because the repository uses a src/ layout.

For the v0.4 analytics demo, the important optional dependencies are:

pyarrow for Parquet output;
duckdb for local SQL queries over Parquet files;
faker for generated demo data.

Generate synthetic order data

The generator creates deterministic synthetic customer/order-style data.

It does not download real customer data. It does not use a private dataset. It is designed only for testing and demonstration.

Generate 1,000 rows:

python scripts/generate_sample_data.py \
  --rows 1000 \
  --output data/generated/orders_1k.csv \
  --seed 42

Generate 10,000 rows:

python scripts/generate_sample_data.py \
  --rows 10000 \
  --output data/generated/orders_10k.csv \
  --seed 42

Generate 100,000 rows:

python scripts/generate_sample_data.py \
  --rows 100000 \
  --output data/generated/orders_100k.csv \
  --seed 42

Windows PowerShell example:

python scripts/generate_sample_data.py `
  --rows 100000 `
  --output data/generated/orders_100k.csv `
  --seed 42

Here is the 100,000-row generation step:

The fixed seed makes the output reproducible. That is useful for documentation, tests, demos, and future comparisons.

What kind of issues are introduced?

The generated data intentionally includes common data quality issues.

Examples include:

missing email values;
invalid email values;
missing country values;
inconsistent country casing;
duplicate rows;
invalid dates;
negative quantities;
zero prices.

This is important because a data quality demo should not only process clean data.

If the generated data is too perfect, the validation and cleaning workflow does not prove much. The point is to create enough realistic messiness to exercise the existing workflow.

Run the analytics-ready export demo

After generating the input file, run the analytics demo:

python scripts/run_analytics_demo.py \
  --input data/generated/orders_100k.csv \
  --schema data/expected/generated_order_schema.json \
  --output-dir data/output/analytics

Windows PowerShell:

python scripts/run_analytics_demo.py `
  --input data/generated/orders_100k.csv `
  --schema data/expected/generated_order_schema.json `
  --output-dir data/output/analytics

The script runs the generated input through the existing validation and cleaning logic, then writes analytics-ready outputs.

Expected output files:

data/output/analytics/cleaned_orders.csv
data/output/analytics/cleaned_orders.parquet
data/output/analytics/customer_summary.csv
data/output/analytics/revenue_by_country.csv
data/output/analytics/orders_by_month.csv
data/output/analytics/source_system_summary.csv
data/output/analytics/analytics_queries.sql
data/output/analytics/benchmark_report.md

Here is the analytics output and DuckDB query demo:

The output directory is intentionally excluded from Git. Large generated files and local analytics outputs should not be committed to the repository.

Why export Parquet?

CSV is easy to inspect and share. It is a good default output for small workflows.

Parquet is useful when the same cleaned dataset will be queried repeatedly or used by analytics tools. It preserves column types better than CSV and is commonly used in data workflows.

In this project, Parquet is not used to make the project sound bigger than it is. It is used as a practical local export format after the cleaning step.

The key handoff idea is:

cleaned CSV for readability
cleaned Parquet for local analytics
summary CSV files for reporting
SQL file for repeatable queries
benchmark report for documentation

That combination is still small, but it is more useful than a single cleaned CSV file.

Query Parquet locally with DuckDB

The demo writes reusable SQL to:

data/output/analytics/analytics_queries.sql

An example query looks like this:

SELECT
    country,
    ROUND(SUM(revenue), 2) AS total_revenue,
    COUNT(*) AS order_count
FROM read_parquet('data/output/analytics/cleaned_orders.parquet')
GROUP BY country
ORDER BY total_revenue DESC
LIMIT 10;

This query uses DuckDB to read the Parquet file directly.

That is a useful pattern for small local workflows because it avoids setting up a separate database service just to inspect analytics-ready output.

For a portfolio project, it also shows a clear bridge between Python data cleaning and SQL-based analysis.

Summary tables

The analytics demo produces several summary CSV files:

customer_summary.csv
revenue_by_country.csv
orders_by_month.csv
source_system_summary.csv

These are intentionally simple.

They are not a BI dashboard. They are not a reporting platform. They are small output tables that show how the cleaned data can be prepared for the next step.

For example:

revenue_by_country.csv can support a country-level revenue view;
orders_by_month.csv can support a monthly trend view;
source_system_summary.csv can help compare different input sources;
customer_summary.csv can support customer-level reporting.

In client work, this kind of output is often enough for a first automation milestone. The client can open the CSV files, load them into a spreadsheet, connect them to a BI tool, or use them as the input for the next version.

Benchmark report

The analytics demo also writes a Markdown benchmark report:

data/output/analytics/benchmark_report.md

Here is the report screenshot:

The report records information such as:

input file path;
raw row count;
cleaned row count;
analytics-ready row count;
validation issue count;
runtime seconds;
output file paths;
DuckDB preview query result.

The report is not meant to be a formal performance benchmark.

It is a lightweight run record. Its purpose is to make each run easier to review, compare, and hand off.

How validation still fits in

The v0.4 analytics path does not bypass the existing validation workflow.

The script still loads a schema file, reads the input CSV, validates expected fields, cleans the data, and then prepares the analytics output.

That matters because analytics output should not be generated from uninspected raw input.

The basic flow is:

read generated CSV
        ↓
load expected schema
        ↓
validate schema rules
        ↓
clean duplicate and text values
        ↓
prepare analytics columns
        ↓
write CSV, Parquet, summary tables, SQL, and benchmark report

This keeps the analytics demo connected to the original purpose of the project: data quality before reporting.

What is intentionally still out of scope?

The v0.4.0 update does not add:

user login;
frontend application;
async task queue;
cloud deployment;
BI dashboard;
Metabase;
data warehouse implementation;
Airflow;
dbt;
Spark;
SQLModel metadata layer;
AI, LLM, RAG, or agent features.

Those tools can be useful in the right context, but they would make this starter project heavier than necessary.

The design goal remains:

small
runnable
testable
documented
screenshot-ready
easy to inspect

That is more useful for this stage than adding a large platform stack.

Run the tests

Run the full test suite:

python -m compileall -q src/dq_etl_starter
python -m compileall -q scripts
pytest

Run only the v0.4 analytics-related tests:

pytest tests/test_generate_sample_data.py
pytest tests/test_analytics.py
pytest tests/test_exporters_parquet.py

The optional PostgreSQL integration tests should remain optional and should be skipped unless DATABASE_URL is configured.

This is another reason to keep the workflow modular. The generator, analytics helpers, Parquet exporter, and original ETL workflow can be checked independently.

Run with Docker

Build and run the default Docker workflow:

docker build -t data-quality-etl-starter:0.4.0 -t data-quality-etl-starter:latest .
docker run --rm data-quality-etl-starter:0.4.0

The default Docker run remains a simple reproducible CLI workflow.

This is intentional. The Docker path should not become complicated just because the project gained an optional analytics demo.

How this maps to freelance client work

This update maps well to common small data workflow requests.

Examples include:

generating test data before working with private client data;
validating messy order or customer exports;
cleaning data before monthly reporting;
producing Parquet files for local analytics;
writing repeatable DuckDB SQL queries;
producing summary CSV files for spreadsheet or BI handoff;
documenting each run with a benchmark or quality report.

For a client, this kind of workflow is valuable because it is practical and reviewable.

It answers questions like:

What did the script read?
What did it clean?
What outputs did it produce?
Can the same workflow run again next week?
Can the output be inspected without a custom application?
Can the workflow be extended without rewriting everything?

That is the level of reliability many small automation projects need before they become larger systems.

What I would improve next

The next version could move one step closer to reporting workflows without turning the project into a full BI platform.

Possible next steps:

write PostgreSQL reporting tables or views;
add more summary table examples;
add a lightweight local BI demo;
improve benchmark report formatting;
add better schema profiling;
add more realistic validation rules;
add CI for automated test runs.

The important constraint is to keep the project focused.

The current project is a small Python data quality ETL starter. The v0.4.0 update makes it easier to demonstrate analytics-ready output, but it is still not trying to become a full data platform.

Repository

GitHub repository:

https://github.com/OnerGit/data-quality-etl-starter