Bob Oner

Posted on Jun 11

From Clean Data to BI-Ready Reporting Tables with Python, PostgreSQL, and Metabase

#python #etl #postgres #docker

In the previous article, I extended a small Python data quality ETL starter from validation and cleaning into analytics-ready local outputs with Parquet, DuckDB, summary CSV files, and a benchmark report.

From Data Quality Checks to Analytics-Ready Parquet with Python

This follow-up focuses on the v0.5.0 update of the same project:

Data Quality ETL Starter on GitHub

The new goal is still intentionally modest.

This is not a production BI platform. It is not a data warehouse. It is not a cloud deployment. It is not an embedded analytics product.

The goal is to show one practical next step after cleaning and analytics-ready export:

generated messy order data
        ↓
existing validation and cleaning workflow
        ↓
analytics-ready order rows
        ↓
PostgreSQL reporting tables
        ↓
lightweight SQL views
        ↓
optional Metabase local dashboard demo

That is a common handoff point in small data workflow projects.

A client or small team may not need a full data platform yet. They may simply need cleaned data loaded into a local reporting database, a few reusable SQL views, and a basic dashboard tool connected to those views.

Why add a BI-ready demo?

The earlier versions of this project focused on data quality and local analytics.

The workflow could already:

read messy CSV, Excel, JSON, and mock API-style data;
validate expected columns and schema rules;
clean duplicate rows and text values;
export cleaned CSV output;
export to SQLite and optional PostgreSQL;
expose validation through a thin FastAPI wrapper;
generate larger synthetic order data;
export analytics-ready CSV and Parquet files;
query Parquet locally with DuckDB;
produce summary CSV tables and a benchmark report.

Those steps are useful, but many reporting workflows eventually ask another question:

Can this cleaned data feed a reporting database or dashboard?

v0.5.0 adds a small answer to that question.

It loads cleaned and analytics-ready data into PostgreSQL, creates reporting tables and SQL views, and provides instructions for exploring the result in Metabase.

The point is not to make the project bigger for its own sake. The point is to demonstrate a realistic bridge from data cleaning to lightweight reporting.

What v0.5.0 adds

The v0.5.0 update adds an optional BI-ready path.

The most relevant new pieces are:

scripts/run_bi_demo.py
src/dq_etl_starter/bi.py
docs/bi.md
docs/metabase.md

The BI demo creates PostgreSQL reporting tables:

cleaned_orders
customer_summary
revenue_by_country
orders_by_month
source_system_summary

It also creates PostgreSQL reporting views:

vw_revenue_by_country
vw_orders_by_month
vw_source_system_quality
vw_monthly_revenue_trend

And it writes local output files:

data/output/bi/bi_summary_report.md
data/output/bi/reporting_queries.sql
data/output/bi/metabase_dashboard_notes.md

The data/output/bi/ directory is intentionally ignored by Git. It contains local run artifacts, not source files.

Project structure after the update

The project remains small, but the structure now shows a clearer path from input data to reporting preparation:

data-quality-etl-starter/
├── data/
│   ├── input/
│   ├── expected/
│   └── output/
├── docs/
│   ├── analytics.md
│   ├── api.md
│   ├── bi.md
│   ├── metabase.md
│   └── postgres.md
├── screenshots/
├── scripts/
│   ├── generate_sample_data.py
│   ├── run_analytics_demo.py
│   └── run_bi_demo.py
├── src/dq_etl_starter/
│   ├── analytics.py
│   ├── bi.py
│   ├── clean.py
│   ├── exporters.py
│   ├── services.py
│   └── validate.py
└── tests/

The important design choice is that the BI-ready path does not replace the original workflow.

It builds on it.

The project still starts with data validation and cleaning. The reporting layer comes later.

Install the project locally

Clone the repository:

git clone https://github.com/OnerGit/data-quality-etl-starter.git
cd data-quality-etl-starter

Create a virtual environment:

python -m venv .venv

Activate it on macOS or Linux:

source .venv/bin/activate

Activate it on Windows PowerShell:

.venv\Scripts\activate

Install dependencies and the local package:

pip install -r requirements.txt
pip install -e .

The editable install step is useful because the project uses a src/ layout.

Step 1: Generate synthetic order data

The BI demo uses generated synthetic order data.

It does not use real customer data. It does not download private data. It is designed to be reproducible and safe for a public portfolio project.

Generate 100,000 rows:

python scripts/generate_sample_data.py \
  --rows 100000 \
  --output data/generated/orders_100k.csv \
  --seed 42

Windows PowerShell:

python scripts/generate_sample_data.py `
  --rows 100000 `
  --output data/generated/orders_100k.csv `
  --seed 42

This synthetic dataset intentionally includes data quality issues, such as missing values, invalid emails, duplicate rows, invalid dates, negative quantities, and inconsistent country values.

That makes the demo more useful than a perfectly clean sample file.

Step 2: Start PostgreSQL

Start the local PostgreSQL service with Docker Compose:

docker compose up -d postgres

Check that the container is running:

docker compose ps

The demo uses PostgreSQL as the reporting database because it is a practical next step after local CSV, SQLite, and Parquet outputs.

SQLite is still useful for the default local workflow. PostgreSQL is useful when reporting tables should be available through a database connection.

Step 3: Run the BI demo

Run the BI demo:

python scripts/run_bi_demo.py \
  --input data/generated/orders_100k.csv \
  --schema data/expected/generated_order_schema.json \
  --output-dir data/output/bi \
  --db-url postgresql+psycopg://dq_user:dq_password@localhost:5432/dq_demo

Windows PowerShell:

python scripts/run_bi_demo.py `
  --input data/generated/orders_100k.csv `
  --schema data/expected/generated_order_schema.json `
  --output-dir data/output/bi `
  --db-url postgresql+psycopg://dq_user:dq_password@localhost:5432/dq_demo

The demo runs the generated input through the existing validation and cleaning logic, prepares analytics-ready rows, loads reporting tables into PostgreSQL, creates SQL views, and writes local documentation files.

Expected local files:

data/output/bi/bi_summary_report.md
data/output/bi/reporting_queries.sql
data/output/bi/metabase_dashboard_notes.md

Expected PostgreSQL reporting tables:

cleaned_orders
customer_summary
revenue_by_country
orders_by_month
source_system_summary

Expected PostgreSQL reporting views:

vw_revenue_by_country
vw_orders_by_month
vw_source_system_quality
vw_monthly_revenue_trend

Step 4: Inspect reporting tables and views

After running the BI demo, inspect the database.

List reporting tables:

docker exec -it dq_etl_postgres psql -U dq_user -d dq_demo -c "\dt"

List reporting views:

docker exec -it dq_etl_postgres psql -U dq_user -d dq_demo -c "\dv"

Preview the revenue-by-country view:

docker exec -it dq_etl_postgres psql -U dq_user -d dq_demo -c "SELECT * FROM vw_revenue_by_country LIMIT 10;"

Preview the monthly revenue trend:

docker exec -it dq_etl_postgres psql -U dq_user -d dq_demo -c "SELECT * FROM vw_monthly_revenue_trend ORDER BY order_month LIMIT 12;"

Here is the reporting table and view output:

This is the core of the v0.5.0 update.

The cleaned data is no longer only a local file. It is now available through a reporting database with reusable SQL views.

Step 5: Start optional Metabase

Metabase is optional.

The core workflow does not require it. The PostgreSQL reporting tables and SQL views are the important part.

Metabase is included only as a local dashboard exploration layer.

Start Metabase:

docker compose up -d metabase

Open:

http://localhost:3000

When Metabase runs through Docker Compose, connect it to PostgreSQL with these values:

Database type: PostgreSQL
Host: postgres
Port: 5432
Database name: dq_demo
Username: dq_user
Password: dq_password

Use postgres as the host because Metabase is running inside the Docker Compose network.

Use localhost only when connecting from your host machine, such as with psql or a desktop database client.

Here is the Metabase connection step:

Step 6: Create a simple dashboard

The project does not ship a production dashboard.

Instead, it provides suggested dashboard cards based on the reporting views.

Useful starting points include:

vw_revenue_by_country
vw_orders_by_month
vw_monthly_revenue_trend
vw_source_system_quality

Suggested cards:

revenue by country;
orders by month;
monthly revenue trend;
orders by source system;
average order value by country.

Here is a simple dashboard built from the reporting views:

The dashboard is intentionally basic.

Its job is not to impress with design. Its job is to prove that cleaned and reporting-ready data can be loaded into PostgreSQL and explored by a dashboard tool.

BI summary report

The demo also writes a Markdown report:

data/output/bi/bi_summary_report.md

The report records:

raw row count;
cleaned row count;
analytics-ready row count;
reporting tables created;
reporting views created;
local output files;
recommended dashboard cards;
scope notes.

Here is the BI summary report:

This file is useful for handoff.

Instead of only saying "the script ran", the project leaves behind a simple written record of what was created.

Why use SQL views?

The reporting views are small, but they are important.

They separate raw or cleaned tables from reporting-facing queries.

For example, a view such as vw_revenue_by_country gives a dashboard tool a stable object to query. If the underlying table logic changes later, the dashboard can still point to the view.

This is a common reporting pattern:

cleaned table
        ↓
summary table
        ↓
reporting view
        ↓
dashboard card

For a small project, SQL views provide a good balance between simplicity and structure.

They are easier to review than a hidden dashboard-only query, and they are lighter than introducing a full modeling framework too early.

What this proves

This v0.5.0 update demonstrates several practical capabilities:

generating safe synthetic data;
running data validation before reporting;
cleaning and preparing analytics-ready rows;
loading reporting tables into PostgreSQL;
creating reusable SQL views;
connecting a local BI tool to the reporting database;
documenting output files and scope;
keeping generated artifacts out of Git.

This is especially relevant for small data workflow projects.

Many clients do not need a complex platform at the beginning. They need a reliable workflow that turns messy files into something a reporting tool can use.

What is intentionally still out of scope?

This demo intentionally does not include:

production Metabase deployment;
cloud hosting;
user authentication;
embedded analytics;
multi-tenant dashboarding;
data warehouse modeling;
Airflow orchestration;
dbt project structure;
Spark processing;
scheduled production jobs;
real customer data;
AI, LLM, RAG, or agent features.

Those can be valid tools in other contexts.

For this project, adding them too early would make the starter harder to run and harder to review.

The current goal is narrower:

small local workflow
validated and cleaned data
PostgreSQL reporting tables
simple SQL views
optional dashboard exploration
clear documentation

Run the tests

Run the full test suite:

python -m compileall -q src/dq_etl_starter
python -m compileall -q scripts
pytest

Run v0.5 BI-ready tests:

pytest tests/test_bi.py
pytest tests/test_bi_reporting_sql.py

PostgreSQL integration tests should remain optional and should be skipped unless DATABASE_URL is configured.

This is important because the project should stay easy to test even when external services are not running.

Run the default Docker workflow

The default Docker run remains simple:

docker build -t data-quality-etl-starter:0.5.0 -t data-quality-etl-starter:latest .
docker run --rm data-quality-etl-starter:0.5.0

The BI demo uses Docker Compose services for PostgreSQL and optional Metabase, but the default image still supports a reproducible CLI workflow.

That separation keeps the project easier to understand.

How this maps to freelance client work

This update maps to a realistic client request:

"We have messy order/customer exports. Can you clean them, load them into a database, and prepare a few reporting tables or dashboard-ready views?"

A first version does not always need a full data warehouse.

A practical milestone might be:

generate or receive input data;
validate the schema;
clean duplicates and bad values;
load reporting tables into PostgreSQL;
create SQL views for common metrics;
connect a dashboard tool;
document how to rerun the workflow.

That is exactly the kind of path this v0.5.0 demo is designed to show.

For freelance work, the value is in the handoff:

clear commands;
local reproducibility;
synthetic data for safe demos;
database tables that can be inspected;
SQL views that can be reviewed;
screenshots that show the workflow;
a summary report that documents the run;
scope notes that prevent overclaiming.

What I would improve next

Possible next improvements include:

adding a few more reporting views;
adding richer schema profiling;
adding better BI summary formatting;
adding a Makefile for repeated demo commands;
adding GitHub Actions for test runs;
adding a small real-public-dataset validation note;
adding optional scheduled local runs;
adding clearer logging for each workflow stage.

The main constraint remains the same:

Do not turn the starter into a heavy platform too early.

It should stay small, reproducible, and easy to adapt.

Repository

GitHub repository:

https://github.com/OnerGit/data-quality-etl-starter