Bob Oner

Posted on Jun 16

Running a Real Retail Dataset Through a Python Data Quality Workflow

#python #etl #database #dataengineering

In the previous article, I extended a small Python data quality ETL starter with AI-ready data preparation.

The important constraint was that the workflow did not call an LLM API, generate embeddings, or train a model. It prepared structured data assets such as schema profiles, data dictionaries, validation summaries, feature-ready CSV files, and manifest files.

Previous article: Preparing AI-Ready Data Without Calling an LLM API

This follow-up focuses on the v0.7.0 update of the same project:

Data Quality ETL Starter on GitHub

The new goal is to move beyond synthetic demo data and show that the same data quality workflow can process a public retail/e-commerce-style dataset locally.

This is still not a big data platform, a production retail analytics system, a benchmark leaderboard, or a public dataset redistribution repository.

The goal is narrower and more practical:

manually downloaded public retail dataset
        ↓
prepare_real_dataset_demo.py
        ↓
normalized retail transaction CSV
        ↓
existing CLI validation and cleaning workflow
        ↓
quality reports + SQLite export
        ↓
run_real_dataset_benchmark.py
        ↓
benchmark report + summary CSV outputs

That is a useful next step for a portfolio project because it shows the workflow can handle a more realistic dataset while still keeping data handling, scope, and reproducibility clear.

Why add a real dataset benchmark?

Earlier versions of this project used small sample files and generated synthetic order data.

That is useful for testing and documentation, but it leaves one practical question:

Can the workflow handle a public dataset that was not designed specifically for this repository?

v0.7.0 adds an optional real dataset benchmark path to answer that question.

The workflow now demonstrates how to:

take a public retail transaction dataset;
keep the raw dataset local-only;
map external source columns into a project-friendly schema;
derive practical fields such as revenue and cancellation flags;
reuse the existing CLI validation and cleaning workflow;
generate Markdown and JSON quality reports;
export cleaned data to SQLite;
produce benchmark evidence and summary CSV files.

The key design choice is that the existing CLI remains the source of truth.

The real dataset path does not become a separate pipeline. It prepares the source data, then passes it through the same validation and cleaning workflow used by the rest of the project.

Dataset used in v0.7.0

The default v0.7.0 dataset is the UCI Online Retail dataset.

Official source:

UCI Machine Learning Repository: Online Retail

Citation:

Chen, D. (2015). Online Retail [Dataset]. UCI Machine Learning Repository.
https://doi.org/10.24432/C5BW33

License note:

Creative Commons Attribution 4.0 International (CC BY 4.0)

The dataset is useful for this project because it is retail/e-commerce adjacent and transaction-shaped. It includes fields that map naturally into an invoice/order-style workflow:

InvoiceNo
StockCode
Description
Quantity
InvoiceDate
UnitPrice
CustomerID
Country

The project maps those source columns into normalized snake_case columns and adds derived fields.
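
A minimal sketch of what such a mapping could look like, using the normalized names listed under Step 2. The dictionary name and the rename_columns helper are hypothetical and only illustrate the idea; the project's own real_dataset.py may be organized differently.

```python
# Hypothetical mapping from UCI source columns to normalized snake_case names.
# The names on the right match the normalized columns listed in Step 2.
SOURCE_TO_NORMALIZED = {
    "InvoiceNo": "invoice_no",
    "StockCode": "stock_code",
    "Description": "description",
    "Quantity": "quantity",
    "InvoiceDate": "invoice_date",
    "UnitPrice": "unit_price",
    "CustomerID": "customer_id",
    "Country": "country",
}


def rename_columns(row: dict) -> dict:
    """Return a copy of one source row keyed by the normalized names.

    Raises KeyError if a source column is absent from the row.
    """
    return {new: row[old] for old, new in SOURCE_TO_NORMALIZED.items()}
```

Keeping the mapping in one dictionary makes it easy to review next to the schema file, because both describe the same set of columns.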

What is kept out of Git

This part is important.

The repository does not redistribute the full raw UCI dataset. It also does not commit full normalized or cleaned real dataset outputs.

These paths are local-only:

data/external/
data/raw/public/
data/output/real_dataset/

The repository keeps:

source code;
schema files;
tests;
documentation;
screenshots;
small sample inputs;
instructions for running the workflow locally.

It does not keep:

full downloaded raw datasets;
full normalized real dataset outputs;
full cleaned real dataset outputs;
local SQLite files generated from real datasets;
private customer data;
client data;
API credentials;
tokens or secrets.

This keeps the repository lightweight and avoids turning it into a dataset mirror.

What v0.7.0 adds

The most relevant new files are:

scripts/prepare_real_dataset_demo.py
scripts/run_real_dataset_benchmark.py
src/dq_etl_starter/real_dataset.py
docs/data_sources.md
docs/real_dataset_benchmark.md
docs/limitations.md
data/expected/online_retail_schema.json

The real dataset helper module handles the project-specific mapping and summary logic.

The two scripts provide a simple local workflow:

convert the manually downloaded dataset into a normalized CSV;
generate local benchmark evidence and summary outputs after the CLI quality workflow runs.

Project structure after the update

The project now has a clearer path from messy input files to public-dataset benchmark evidence:

data-quality-etl-starter/
├── data/
│   ├── expected/
│   │   └── online_retail_schema.json
│   └── output/
├── docs/
│   ├── data_sources.md
│   ├── limitations.md
│   └── real_dataset_benchmark.md
├── screenshots/
├── scripts/
│   ├── prepare_real_dataset_demo.py
│   └── run_real_dataset_benchmark.py
├── src/dq_etl_starter/
│   ├── real_dataset.py
│   ├── cli.py
│   ├── clean.py
│   ├── report.py
│   └── validate.py
└── tests/
    ├── test_real_dataset.py
    └── test_real_dataset_benchmark.py

The real dataset path is optional. The default small sample workflows remain unchanged.

Install the project locally

Clone the repository:

git clone https://github.com/OnerGit/data-quality-etl-starter.git
cd data-quality-etl-starter

Create a virtual environment:

python -m venv .venv

Activate it on macOS or Linux:

source .venv/bin/activate

Activate it on Windows PowerShell:

.venv\Scripts\Activate.ps1

Install dependencies and the local package:

pip install -r requirements.txt
pip install -e .

The editable install step matters because the project uses a src/ layout: the package under src/ is not importable from the repository root until it is installed.

Step 1: Download the public dataset manually

Download the UCI Online Retail dataset from the official UCI Machine Learning Repository page.

Place the file here:

data/external/online_retail.xlsx

The project does not download the dataset automatically.

That is intentional.

For a public portfolio repository, I prefer to keep the data acquisition step explicit. It makes the source, license, citation, and local-only handling policy easier to review.

Step 2: Prepare the normalized dataset

Run the preparation script.

macOS / Linux:

python scripts/prepare_real_dataset_demo.py \
  --raw-input data/external/online_retail.xlsx \
  --output data/output/real_dataset/online_retail_normalized.csv

Windows PowerShell:

python scripts/prepare_real_dataset_demo.py `
  --raw-input data/external/online_retail.xlsx `
  --output data/output/real_dataset/online_retail_normalized.csv

This step reads the local source file, validates expected source columns, maps UCI columns into project-friendly names, derives additional fields, and writes a normalized CSV.

The normalized output columns are:

invoice_no
stock_code
description
quantity
invoice_date
unit_price
customer_id
country
revenue
is_cancellation
source_dataset

The derived fields are simple but useful:

revenue is derived from quantity and unit price;
is_cancellation marks cancellation-style rows;
source_dataset records dataset lineage.

This preparation layer is deliberately small. It does not try to perform all business logic. It only converts the external dataset into a shape that the existing project workflow can validate and clean.
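
The three derived fields could be computed roughly as follows. This is a sketch under stated assumptions, not the project's actual code: the add_derived_fields helper is hypothetical, the cancellation rule relies on the UCI dataset description (invoice numbers that start with "C" indicate cancellations), and the default source_dataset value simply reuses the dataset name passed to the benchmark script in Step 4.

```python
from decimal import Decimal


def add_derived_fields(row: dict, source_dataset: str = "uci_online_retail") -> dict:
    """Return a copy of a normalized row with the three derived fields added."""
    quantity = int(row["quantity"])
    # Decimal avoids binary floating-point noise in currency arithmetic.
    unit_price = Decimal(str(row["unit_price"]))
    enriched = dict(row)
    # Revenue is quantity times unit price, so returned items yield negative revenue.
    enriched["revenue"] = quantity * unit_price
    # Assumption: a leading "C" in the invoice number marks a cancellation.
    enriched["is_cancellation"] = str(row["invoice_no"]).upper().startswith("C")
    # Lineage label; the default value here is an assumption.
    enriched["source_dataset"] = source_dataset
    return enriched
```

Returning a copy rather than mutating the input keeps the normalized row and the enriched row separately inspectable, which helps when a derived value looks wrong.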

Step 3: Run the existing CLI workflow

After preparation, the normalized CSV is passed into the existing CLI workflow.

macOS / Linux:

python -m dq_etl_starter.cli run \
  --input data/output/real_dataset/online_retail_normalized.csv \
  --input-type csv \
  --schema data/expected/online_retail_schema.json \
  --output-dir data/output/real_dataset/run \
  --db-target sqlite \
  --table-name cleaned_online_retail

Windows PowerShell:

python -m dq_etl_starter.cli run `
  --input data/output/real_dataset/online_retail_normalized.csv `
  --input-type csv `
  --schema data/expected/online_retail_schema.json `
  --output-dir data/output/real_dataset/run `
  --db-target sqlite `
  --table-name cleaned_online_retail

Expected local outputs:

data/output/real_dataset/run/cleaned_online_retail.csv
data/output/real_dataset/run/etl_output.sqlite
data/output/real_dataset/run/quality_report.md
data/output/real_dataset/run/quality_report.json

This is the most important design point in v0.7.0.

The real dataset path reuses the existing validation and cleaning workflow. It does not create a special one-off script that bypasses the project architecture.
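
One way to sanity-check the SQLite export after the run is to count the rows in the exported table with Python's built-in sqlite3 module. The count_rows helper below is a hypothetical convenience for illustration, not part of the project; the default table name is the one passed to --table-name above.

```python
import sqlite3


def count_rows(conn: sqlite3.Connection, table_name: str = "cleaned_online_retail") -> int:
    """Return the number of rows in the given table."""
    # Table names cannot be bound as SQL parameters, so accept only plain identifiers.
    if not table_name.isidentifier():
        raise ValueError(f"unexpected table name: {table_name!r}")
    (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table_name}"').fetchone()
    return count
```

For example, calling count_rows(sqlite3.connect("data/output/real_dataset/run/etl_output.sqlite")) after Step 3 should return a number that can be compared against the cleaned row count in the quality report.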

Schema for the normalized retail dataset

The schema file is:

data/expected/online_retail_schema.json

It defines the expected normalized columns and validation rules for fields such as invoice number, stock code, quantity, invoice date, unit price, customer ID, country, revenue, cancellation flag, and source dataset.

The schema is not intended to certify the dataset as business-ready.

It is a practical contract for this starter workflow:

external retail columns
        ↓
normalized project columns
        ↓
expected schema rules
        ↓
quality report

That is a useful handoff pattern because the next person can inspect both the mapping and the validation report.
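
To make the idea of a schema contract concrete, here is a small sketch of rule-based row checking. The rule format, the SCHEMA_RULES name, and the check_row helper are all hypothetical; the real online_retail_schema.json may use a different structure and richer rules.

```python
# Hypothetical rule format, for illustration only.
SCHEMA_RULES = {
    "invoice_no": {"required": True},
    "quantity": {"required": True, "type": int},
    "unit_price": {"required": True, "type": float},
    "customer_id": {"required": False},
}


def check_row(row: dict, rules: dict = SCHEMA_RULES) -> list:
    """Return human-readable issues for one row; an empty list means it passed."""
    issues = []
    for column, rule in rules.items():
        value = row.get(column)
        if value is None or str(value).strip() == "":
            # Blank values are only a problem for required columns.
            if rule.get("required"):
                issues.append(f"{column}: missing required value")
            continue
        caster = rule.get("type")
        if caster is not None:
            try:
                caster(value)
            except (TypeError, ValueError):
                issues.append(f"{column}: cannot parse {value!r} as {caster.__name__}")
    return issues
```

Collecting issues as a list, instead of raising on the first failure, matches the reporting style described here: the workflow records what it found and leaves the decision to the reviewer.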

Quality report

The CLI workflow writes a Markdown report and a JSON report.

For the real dataset workflow, the Markdown report is written to:

data/output/real_dataset/run/quality_report.md

The report is useful because it records what the workflow found rather than only producing a cleaned file.

Typical report sections include:

raw row count;
cleaned row count;
missing values by column;
duplicate row count;
expected column checks;
validation issue summaries;
output file paths.

For a repeatable data workflow, this is important. A cleaned output file alone is not enough. The workflow should also explain what was detected and what still needs review.
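
A few of the report counters listed above can be computed with the standard library alone. The profile_rows function and its output keys are hypothetical names chosen for this sketch; the project's report module may compute and label these differently.

```python
from collections import Counter


def profile_rows(rows: list) -> dict:
    """Compute a few report-style counters for a list of dict rows."""
    missing = Counter()
    seen = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or str(value).strip() == "":
                missing[column] += 1
        # A row counts as a duplicate when every column matches an earlier row.
        seen[tuple(sorted(row.items()))] += 1
    duplicate_rows = sum(count - 1 for count in seen.values())
    return {
        "raw_row_count": len(rows),
        "missing_values_by_column": dict(missing),
        "duplicate_row_count": duplicate_rows,
    }
```

The returned dictionary can be written out with json.dump, which is one simple way to keep a machine-readable report next to the Markdown one.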

Step 4: Generate the real dataset benchmark report

After the CLI workflow finishes, generate a local benchmark report and summary outputs.

macOS / Linux:

python scripts/run_real_dataset_benchmark.py \
  --normalized-input data/output/real_dataset/online_retail_normalized.csv \
  --quality-report data/output/real_dataset/run/quality_report.json \
  --output-dir data/output/real_dataset \
  --dataset-name uci_online_retail

Windows PowerShell:

python scripts/run_real_dataset_benchmark.py `
  --normalized-input data/output/real_dataset/online_retail_normalized.csv `
  --quality-report data/output/real_dataset/run/quality_report.json `
  --output-dir data/output/real_dataset `
  --dataset-name uci_online_retail

Expected local outputs:

data/output/real_dataset/benchmark_report.md
data/output/real_dataset/summary/revenue_by_country.csv
data/output/real_dataset/summary/revenue_by_month.csv
data/output/real_dataset/summary/cancellation_summary.csv
data/output/real_dataset/summary/missing_customer_summary.csv

The benchmark report is not a universal performance claim.

It is local evidence for this machine, this dependency environment, and this dataset preparation flow.

That distinction matters. Runtime can change depending on CPU, disk speed, Python version, package versions, source file format, operating system, and local machine conditions.
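
One way to keep a timing honest is to store it together with the environment it was measured in. The timed_run helper below is a hypothetical sketch of that idea using only the standard library; it is not the project's benchmark code.

```python
import platform
import time


def timed_run(step, *args, **kwargs) -> dict:
    """Run one workflow step and return its result with local timing context."""
    start = time.perf_counter()
    result = step(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return {
        "result": result,
        "elapsed_seconds": round(elapsed, 4),
        # Environment details that make the timing interpretable later.
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }
```

Recording the Python version and platform string alongside the elapsed time means a later reader can tell whether two runs are comparable at all.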

Summary outputs

The benchmark script also writes lightweight summary CSV files.

The summary outputs are intentionally simple:

revenue_by_country.csv
revenue_by_month.csv
cancellation_summary.csv
missing_customer_summary.csv

They are not a full BI model.

They are small reporting-ready outputs that show how a cleaned retail transaction dataset can be summarized after validation.

For example:

revenue_by_country.csv supports country-level revenue inspection;
revenue_by_month.csv supports monthly trend inspection;
cancellation_summary.csv records counts of cancellation rows and of rows with non-positive values;
missing_customer_summary.csv helps inspect where customer IDs are missing.

This is often enough for a first local reporting workflow.
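
The country and month summaries amount to a grouped sum over the revenue column. The sketch below shows that aggregation with the standard library; the function names are hypothetical, and it assumes invoice_date is stored as an ISO-style string such as "2010-12-01 08:26:00", which may not match the project's actual output format.

```python
from collections import defaultdict
from decimal import Decimal


def revenue_by(rows: list, key_func) -> dict:
    """Sum revenue per group, where key_func maps a row to its group key."""
    totals = defaultdict(Decimal)  # Decimal() starts each group at zero
    for row in rows:
        totals[key_func(row)] += Decimal(str(row["revenue"]))
    return dict(sorted(totals.items()))


def by_country(row: dict) -> str:
    """Group key for a country-level summary."""
    return row["country"]


def by_month(row: dict) -> str:
    """Group key for a monthly summary, e.g. "2010-12".

    Assumes an ISO-style date string; adjust if the column holds another format.
    """
    return str(row["invoice_date"])[:7]
```

Passing the grouping as a function keeps one aggregation routine for both summaries, so a new breakdown only needs a new key function.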

The next version could load these into PostgreSQL, query them in DuckDB, or feed a local dashboard, but v0.7.0 intentionally keeps the real dataset path focused.

What the benchmark report records

The benchmark report is designed to answer practical questions:

Which dataset was used?
Where did the normalized input come from?
How many rows were normalized?
How many rows reached the CLI quality workflow?
How many duplicate rows were detected?
How many cancellation rows were identified?
How many customer IDs or descriptions were missing?
Were invoice dates, quantities, and prices validated?
What files were produced?
What limitations apply?

That makes the run easier to review later.

It also makes the project stronger as a portfolio asset because the workflow is not only described in prose. It leaves behind files, screenshots, reports, and commands that can be inspected.

Why not automatically download the dataset?

The project could theoretically download the dataset automatically.

For this version, I chose not to do that.

Manual download keeps the workflow clearer:

the user sees the official source page;
the dataset citation remains visible;
the license note is explicit;
the repository does not redistribute the raw dataset;
the workflow does not depend on hidden network access;
local-only data handling is easier to explain.

For a small portfolio repository, this is a reasonable trade-off.

The project demonstrates how to process the dataset, not how to become a dataset distribution tool.

Tests

Compile-check the package and scripts, then run the full test suite:

python -m compileall -q src/dq_etl_starter
python -m compileall -q scripts
pytest

Run the v0.7-related tests:

pytest tests/test_real_dataset.py
pytest tests/test_real_dataset_benchmark.py

The tests focus on the reusable code paths rather than requiring the full external dataset to be committed.

That is another useful pattern for public repositories: test the transformation logic with small fixtures, and keep the large external dataset local-only.
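
As an illustration of that pattern, the sketch below tests a source-column check against a tiny in-memory fixture. Everything in it is hypothetical: the helper name, the test names, and the single made-up fixture row, which is not taken from the real dataset. The project's own tests may look different.

```python
import csv
import io

EXPECTED_SOURCE_COLUMNS = [
    "InvoiceNo", "StockCode", "Description", "Quantity",
    "InvoiceDate", "UnitPrice", "CustomerID", "Country",
]

# A made-up one-row fixture standing in for the full external file.
FIXTURE_CSV = (
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "TEST-1,SKU-1,SAMPLE ITEM,6,2010-12-01 08:26:00,2.55,10001,United Kingdom\n"
)


def missing_source_columns(header: list) -> list:
    """Return the expected source columns that are absent from a header row."""
    return [name for name in EXPECTED_SOURCE_COLUMNS if name not in header]


def test_fixture_has_all_source_columns():
    reader = csv.DictReader(io.StringIO(FIXTURE_CSV))
    assert missing_source_columns(reader.fieldnames) == []


def test_missing_columns_are_reported():
    header = ["InvoiceNo", "Quantity", "Country"]
    assert missing_source_columns(header) == [
        "StockCode", "Description", "InvoiceDate", "UnitPrice", "CustomerID",
    ]
```

Because the fixture lives in the test file, pytest can run these checks on any machine without the downloaded dataset being present.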

What is intentionally out of scope?

The v0.7.0 real dataset benchmark does not add:

automatic dataset download;
raw dataset redistribution;
production scheduling;
Airflow orchestration;
dbt modeling;
Snowflake, Databricks, or PySpark;
production-scale retail analytics;
a complete BI dashboard;
a benchmark leaderboard;
machine learning training;
LLM calls;
RAG or AI agent features.

This project is still a small Python data workflow starter.

The v0.7.0 update demonstrates a specific point: the workflow can be applied to a public retail transaction dataset locally, while keeping the data handling policy, validation steps, outputs, and limitations clear.

What I would improve next

Possible next improvements include:

adding a Makefile for repeated demo commands;
adding a smaller public fixture for faster walkthroughs;
adding optional DuckDB queries for the real dataset summaries;
adding optional PostgreSQL reporting tables for the real dataset path;
adding a short CI workflow for core tests;
improving benchmark report formatting;
adding more detailed data source mapping documentation;
adding a second public dataset only if it does not make the project too broad.

The main constraint remains the same:

Keep the project small, reproducible, inspectable, and easy to adapt.

Repository

GitHub repository:

https://github.com/OnerGit/data-quality-etl-starter

Previous article: Preparing AI-Ready Data Without Calling an LLM API

This v0.7.0 update is a practical next step: from synthetic and generated demos to a local public retail dataset benchmark that reuses the same validation, cleaning, reporting, and handoff workflow.

DEV Community
