DEV Community: Bob Oner

From Text-Based PDFs to Clean Data: A Practical Workflow Case Study

Bob Oner — Mon, 22 Jun 2026 10:39:12 +0000

Many PDF extraction projects fail for reasons that are not obvious at the beginning.

The problem is not always that no text can be extracted.

The harder problems usually appear later:

layout variation
→ table structure
→ field validation
→ partial failures
→ output review
→ client expectations

A script that works on one PDF can still fail on the next layout. A table can look clean visually but extract into fragmented cells. A field can be present on the page but missing from the expected output schema. A workflow can produce many rows but still require review before it is useful.

That is why I built this public case-study repository:

https://github.com/OnerGit/pdf-to-clean-data-workflow

This article explains how I positioned it as a bounded PDF-to-clean-data workflow, not as a universal PDF parser.

Project positioning

This repository is a public case study.

It is not a full open-source parser.

The full implementation is private. The public repository contains sanitized previews, screenshots, documentation, validation summaries, known limitations, and public/private boundary notes.

That boundary is important because PDF extraction work can involve client documents, private layouts, internal debugging details, and reusable parser logic.

The public repo is designed to show the workflow approach:

PDF input
→ extraction
→ normalization
→ validation
→ CSV / Excel-style / SQLite-style outputs
→ review report
→ client handoff

The public repo does not include parser source code, production templates, raw PDFs, client data, tests, scripts, dependencies, Excel files, SQLite databases, or full batch outputs.

The goal is to demonstrate workflow discipline, not to publish a complete commercial tool.

v0.1: single-layout proof

The first version focused on a single-layout proof.

The private implementation used synthetic PDF examples, including an invoice-style PDF and a tabular report PDF.

The workflow demonstrated:

extracted invoice-style fields;
line item extraction;
table extraction;
validation checks;
preview CSV / JSON / Markdown outputs;
a clean-data handoff pattern.

In this version, the important question was simple:

Can a known text-based PDF layout be converted into reviewable structured outputs?

The answer was yes, within a narrow and controlled layout.

But that is not the same as claiming support for all invoices, all reports, or all PDFs.

That difference matters.

A single-layout proof shows that the workflow can work under controlled conditions. It does not prove that the parser can handle every client document.
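A validation check of the kind listed above could look like the following minimal sketch. It is not the private implementation: the field names, the `EXPECTED_FIELDS` schema, and the `validate_fields` helper are hypothetical. It only illustrates the idea that a field missing from the expected output schema is reported instead of silently dropped.

```python
from datetime import datetime

# Hypothetical expected schema for one known invoice-style layout.
EXPECTED_FIELDS = {
    "invoice_number": str,
    "invoice_date": "date",   # ISO date string, e.g. 2026-01-31
    "total_amount": float,
}


def validate_fields(extracted: dict) -> list[str]:
    """Return human-readable issues; an empty list means the record passed."""
    issues = []
    for name, kind in EXPECTED_FIELDS.items():
        value = extracted.get(name)
        if value in (None, ""):
            issues.append(f"missing field: {name}")
            continue
        if kind == "date":
            try:
                datetime.strptime(value, "%Y-%m-%d")
            except (TypeError, ValueError):
                issues.append(f"invalid date: {name}={value!r}")
        elif kind is float:
            try:
                float(value)
            except (TypeError, ValueError):
                issues.append(f"not numeric: {name}={value!r}")
    # Fields outside the schema are reported, not silently kept.
    for name in sorted(extracted.keys() - EXPECTED_FIELDS.keys()):
        issues.append(f"unexpected field: {name}")
    return issues
```

The function returns issues instead of raising, so a batch run can collect every problem for the review report before anyone decides what to fix.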

v0.2: batch, layout detection, and debug workflow

The second version moved from a single document to a batch-oriented workflow.

This added multiple known layouts, batch processing, layout detection, a multi-page table preview, debug screenshots, and an extraction quality score.

Batch processing matters because real PDF work rarely arrives as one perfect sample.

A client may provide:

several invoices with slightly different headers;
reports with continuation pages;
multi-page tables;
mixed layouts in the same folder;
documents that look similar but extract differently.

The v0.2 workflow makes those differences visible instead of hiding them.
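One simple way to detect a known layout is to look for phrases that each layout is expected to contain. The sketch below assumes that approach; the private detection logic is not published, and the layout names, phrases, and `detect_layout` helper are hypothetical.

```python
# Hypothetical layout signatures: phrases expected in each known layout's text.
LAYOUT_SIGNATURES = {
    "invoice_v1": ["invoice number", "bill to", "total due"],
    "tabular_report_v1": ["report period", "page", "subtotal"],
}


def detect_layout(page_text: str, min_hits: int = 2) -> str:
    """Return the best-matching known layout, or "unknown" for manual review."""
    text = page_text.lower()
    hits = {
        name: sum(phrase in text for phrase in phrases)
        for name, phrases in LAYOUT_SIGNATURES.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] >= min_hits else "unknown"
```

Returning "unknown" instead of the closest guess keeps unreviewed layouts out of the batch until someone has looked at them.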

A debug screenshot is not just a visual detail. It is part of the review process.

For table extraction, it helps answer questions like:

Which page was processed?
Which table region was detected?
Does the extracted area match the expected table?
Where might row or column splitting occur?

This is also where an extraction quality score becomes useful.

The score is not a universal accuracy guarantee. It is a workflow signal that helps decide which outputs need review.

For a client handoff, that kind of signal is more useful than pretending every PDF is equally reliable.
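How the private score is computed is not published. As a sketch, assume the score is simply the share of validation checks that passed, and that anything below a threshold goes to manual review. The `DocumentResult` class, the threshold, and the file names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class DocumentResult:
    name: str
    checks_passed: int
    checks_total: int


def quality_score(result: DocumentResult) -> float:
    """Share of validation checks that passed, between 0.0 and 1.0."""
    if result.checks_total == 0:
        return 0.0  # nothing was checked, so nothing can be trusted yet
    return result.checks_passed / result.checks_total


def needs_review(result: DocumentResult, threshold: float = 0.9) -> bool:
    """Route a document to manual review when its score is below the threshold."""
    return quality_score(result) < threshold


batch = [
    DocumentResult("invoice_a.pdf", 10, 10),
    DocumentResult("report_b.pdf", 7, 10),
]
review_queue = [r.name for r in batch if needs_review(r)]
```

Here `review_queue` contains only `report_b.pdf`. The score decides what gets looked at first; it does not certify that the other document is correct.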

v0.3: public data validation

The third version added public data validation against selected official public PDFs.

This was important because synthetic PDFs, while useful, are controlled examples. Public PDFs introduce more realistic layout behavior without using private client data.

The v0.3 validation used selected official public, text-based PDFs, including:

a BLS Consumer Price Index release;
a U.S. Treasury Monthly Treasury Statement.

The public repo does not redistribute the raw PDFs. It publishes sanitized summaries, table previews, screenshots, and known limitation notes.

The results are intentionally not oversold.

For the BLS CPI sample, the status is partial_success.

Tables 1–7 produced page-level table candidates, but Table A did not produce a reliable statistical data body. That is recorded as a known limitation rather than converted into a passing claim.

For the U.S. Treasury Monthly Treasury Statement sample, the status is success.

Ruled-table extraction completed, but some long tables may split into separate header-only and body candidates. The workflow does not claim a generic automatic merge, because merging the wrong financial tables would be worse than requiring a review step.

This is the kind of result I want a PDF extraction case study to show.

Not every test is a perfect success.

Some public samples expose real limitations.

That makes the project more credible, not less.
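The Treasury limitation above, where a long table may split into a header-only candidate and a body candidate, can be flagged for review without merging anything. The sketch below assumes a crude heuristic (a candidate with no digits in any cell is header-only); the candidate structure and both helpers are hypothetical.

```python
def is_header_only(rows: list[list[str]]) -> bool:
    """Treat a candidate as header-only when no cell contains a digit."""
    return bool(rows) and not any(
        any(ch.isdigit() for ch in cell) for row in rows for cell in row
    )


def flag_split_candidates(candidates: list[dict]) -> list[tuple[int, int]]:
    """Return index pairs (header, body) that a reviewer should inspect.

    A pair is flagged when a header-only candidate is directly followed by a
    body candidate with the same column count. Nothing is merged automatically.
    """
    flagged = []
    for i in range(len(candidates) - 1):
        head, body = candidates[i]["rows"], candidates[i + 1]["rows"]
        if (
            is_header_only(head)
            and body
            and not is_header_only(body)
            and len(head[0]) == len(body[0])
        ):
            flagged.append((i, i + 1))
    return flagged
```

The output is a list of pairs for a person to confirm, which matches the position that a wrong merge of financial tables is worse than a review step.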

What is included in the public repo

The public repository includes:

README.md;
NOTICE.md;
documentation under docs/;
sanitized previews under sample_outputs/;
screenshots under screenshots/;
validation summaries;
known limitations and failure cases;
a sample review workflow.

The public previews include CSV, JSON, and Markdown artifacts such as extracted fields, invoice headers, line items, table previews, batch summaries, expected-vs-actual checks, public validation summaries, and known failure cases.

These files are intentionally small and sanitized.

They show output shape and review logic without exposing private implementation details or full source documents.

What this repo does not claim

This repository does not claim to be:

a universal PDF parser;
a production-ready PDF extraction product;
an OCR system;
an AI document extraction system;
an LLM extraction workflow;
a production invoice parser;
a production bank statement parser;
a production tender parser;
a fully open-source implementation.

The current scope is narrower:

text-based PDFs
+ reviewed layouts
+ sample validation
+ public-safe previews
+ quality reporting
+ known limitations

The workflow does not support scanned PDFs, handwriting extraction, arbitrary document understanding, or production accuracy across unreviewed layouts.

Known layouts and selected public validation samples do not establish universal accuracy.

That is why sample review is required before full-batch processing.

Conclusion

This project is a bounded PDF-to-clean-data case study.

It shows how text-based PDF extraction can be framed as a workflow:

sample review
→ extraction
→ normalization
→ validation
→ quality report
→ public-safe preview
→ client handoff

The public repo does not publish the full parser implementation or claim universal PDF support.

It does show how I think about PDF-to-data work: start with representative samples, validate the output, document partial successes, keep known limitations visible, and only then move toward larger batch processing.

For PDF extraction work, the professional question is not only "Can we extract something?"

The better question is:

Can we produce clean, reviewable, validated outputs
with clear scope and known limitations?

That is what this case study is designed to demonstrate.

From Mock API Workflow to Delivery-Ready Asset: Extending a Shopify-style Reporting Case Study

Bob Oner — Wed, 17 Jun 2026 12:15:39 +0000

In the previous article, I introduced a public Shopify-style API reporting workflow case study.

That article focused on one question:

Can I show the workflow shape publicly
without exposing private implementation code,
credentials,
store domains,
or client data?

Previous article: Designing a Shopify-style API Reporting Workflow as a Public Case Study

Public case-study repository:

https://github.com/OnerGit/shopify-api-reporting-workflow

General runnable data workflow project:

https://github.com/OnerGit/data-quality-etl-starter

The first public version of the Shopify-style case study showed the workflow shape:

mock REST-style API data
→ GraphQL-shaped mock responses
→ pagination
→ field mapping
→ normalized reporting tables
→ CSV / Excel / SQLite / Markdown outputs

That was useful, but it was not enough.

Once the mock workflow works, the next question becomes more practical:

What would this need before it could become
a reusable client delivery asset?

That is what the next private milestones explored.

Title options considered

Before writing the follow-up, I considered these titles:

From Mock API Workflow to Delivery-Ready Asset: Extending a Shopify-style Reporting Case Study
What Comes After a Public API Reporting Case Study?
Turning a Shopify-style API Reporting Demo into a Safer Client Delivery Workflow
Beyond Mock Data: Validation, Connector Boundaries, and Delivery Planning for API Reporting Workflows
From Mock Shopify-style Data to Safer Client Reporting Delivery

I chose the first title because it describes the direction clearly: this is not only about showing a demo, but about thinking through what makes a workflow safer to adapt for client work.

What changed after v0.2

The v0.1 and v0.2 case-study work addressed the first layer of the problem.

They showed that Shopify-style order, customer, product, and line-item data can be shaped into reporting-friendly outputs.

The public repo showed sanitized evidence for:

mock REST-style workflow behavior;
GraphQL-shaped mock pagination;
output previews;
Excel-style workbook structure;
SQLite-style reporting tables;
Markdown report preview;
public/private boundary notes.

After that, the project moved into a different kind of work.

The later private milestones were not about adding a dashboard, a SaaS layer, or a public connector.

They were about the delivery layer:

mock workflow evidence
→ validation boundary
→ private connector template
→ manual live validation gate
→ redaction rules
→ retry and backoff planning
→ client handoff templates
→ optional extension planning

That layer is less visually exciting than a dashboard screenshot.

But for real API reporting work, it matters more.

A useful API reporting workflow is not only about extracting data. It also needs boundaries around credentials, validation evidence, failure handling, output review, and client-specific adaptation.

v0.3: development-store validation notes and evidence boundary

The v0.3 milestone focused on development-store validation notes and sanitized evidence handling.

This was not about publishing live validation evidence.

The public repo does not contain raw validation output. It does not contain store domains, tokens, raw API responses, customer records, or private implementation details.

Instead, the public documentation describes how validation evidence should be handled safely.

That distinction is important.

When a workflow moves from fake fixtures toward real API validation, the evidence itself can become sensitive.

A screenshot can accidentally reveal:

a development store domain;
an access scope;
a private path;
a request or response shape that should not be public;
customer names, emails, addresses, or order details;
internal implementation assumptions.

So the v0.3 work treated validation as a boundary problem, not just a checklist item.

The public-safe position is:

The public repo can describe validation readiness.
The private workflow can perform validation.
Raw validation evidence should remain private unless it is manually sanitized.

That is more conservative than posting everything.

It is also closer to how client work should be handled.

The point is not to show every internal detail. The point is to show that validation was considered responsibly.
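A first-pass redaction step for text evidence might look like the sketch below. The patterns are illustrative assumptions, not the private redaction rules, and automated redaction does not replace the manual sanitization described above.

```python
import re

# Illustrative patterns only; a real project would review these per client.
REDACTION_PATTERNS = [
    # Email addresses first, so a store domain inside an email is covered too.
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[REDACTED_EMAIL]"),
    (re.compile(r"[A-Za-z0-9-]+\.myshopify\.com"), "[REDACTED_STORE_DOMAIN]"),
    # Assumed key=value or key: value style for credentials in logs.
    (re.compile(r"(?i)\b(access_token|token|secret)(\s*[:=]\s*)\S+"), r"\1\2[REDACTED_TOKEN]"),
]


def redact(text: str) -> str:
    """Apply every redaction pattern in order and return the sanitized text."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Screenshots need a different approach, since text patterns cannot see pixels. That is one reason the evidence boundary stays manual.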

v0.4: private connector template

The v0.4 milestone moved the private implementation toward a connector-template layer.

This is still not published in the public repo.

The public repository remains a case study. It does not contain runnable connector code, real query templates, OAuth instructions, .env examples, token examples, raw API responses, or production setup steps.

The private connector-template work focuses on the implementation concerns that appear when a workflow moves closer to real API delivery:

configuration boundary
→ manual validation gate
→ scope checks
→ redaction support
→ retry and backoff behavior
→ sanitized output summaries

This matters because real API work is not defined by a single successful request.

A reporting workflow has to answer less exciting but more important questions:

What happens if a request fails?
What happens if the token does not have the required scope?
How should pagination state be handled?
What should be logged?
What should never be logged?
What output can be shown publicly?
What should remain private to the client?
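The first question, what happens if a request fails, is often answered with retries and exponential backoff. The following is a generic sketch of that pattern, not the private connector template; `TransientError` and `call_with_backoff` are hypothetical names.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def call_with_backoff(request, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call `request()` and retry transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and let the caller record a failed run
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Errors that are not transient, such as a missing access scope, are deliberately not retried.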

For Shopify-style GraphQL reporting workflows, there are several areas that need careful design:

access scopes;
API versions;
cursor pagination;
retry behavior;
customer data privacy;
store-specific product, variant, discount, tax, refund, fulfillment, and channel fields.

Shopify's own documentation states that the REST Admin API is now a legacy API and new public apps should use the GraphQL Admin API. That does not mean this case study is a real Shopify app. It means the private workflow design needs to be aware of the GraphQL direction and cursor-style pagination.

The v0.4 work is best understood as a private delivery template concept.

It is not a public connector.

It is not a claim that every Shopify store can use the same implementation without adaptation.

It is a safer foundation for adapting the workflow when a real client has specific data access, reporting definitions, and output requirements.

v0.5: delivery workflow and optional extensions

The v0.5 milestone focused on client delivery workflow planning.

It did not add Google Sheets as a public integration.

It did not add PostgreSQL as a production connector.

It did not add a scheduler, cloud deployment, dashboard, or webhook service.

Instead, v0.5 clarified how a reporting workflow could be delivered and reviewed.

A small API reporting project usually needs more than a script.

It needs a handoff process:

intake
→ data access boundary
→ report definition
→ field mapping review
→ sample output review
→ acceptance checks
→ handoff notes
→ maintenance decision

This is where many small automation projects become fragile.

The code may run once, but nobody has agreed on:

who owns the data access;
which fields are required;
how refunds and discounts should be counted;
whether taxes and shipping are included;
whether the output should be CSV, Excel, SQLite, PostgreSQL, or Google Sheets;
how often the workflow should run;
who reviews failed runs;
what data should be excluded for privacy;
what changes require a new mapping review.

The v0.5 public summaries frame Google Sheets, PostgreSQL, scheduling, dashboards, and cloud delivery as optional delivery decisions.

That is the right level of abstraction for this repository.

A client may need a Google Sheets output because the team already works there.

Another client may prefer PostgreSQL because a dashboard or internal tool will query the data.

Another may only need a weekly Excel workbook.

The workflow should not assume the heaviest option first.

The delivery path should be chosen after the reporting cadence, review process, and data ownership are clear.

What I learned

The main lesson from v0.3 through v0.5 is simple:

A serious portfolio project is not just a code demo.

It should show judgment.

That judgment includes:

what to publish;
what to keep private;
how to validate safely;
how to handle credentials;
how to avoid overclaiming;
how to document limitations;
how to turn a workflow into something a client can review and accept.

The engineering work is not only in the extractor or exporter.

It is also in the boundary between demo, validation, and delivery.

For a public portfolio repository, that boundary matters.

If I publish too little, the project does not prove anything.

If I publish too much, I risk exposing implementation details that should remain private or client-specific.

The current structure is a compromise:

public repo
→ explains the workflow shape and output expectations

private implementation
→ contains runnable delivery logic and client-adaptable code

client project
→ requires store-specific scopes, fields, privacy review, and metric definitions

That is the kind of structure I want this case study to demonstrate.

What this project is still not

Even after v0.3, v0.4, and v0.5, this project is still not:

a Shopify app;
a public production connector;
a SaaS dashboard;
a full data warehouse;
a Google Sheets integration;
a PostgreSQL production connector;
a public runnable package;
a source of real Shopify data;
a low-code workflow;
an AI agent workflow.

It also does not claim that one workflow is ready for every Shopify store.

A real Shopify reporting project would still need store-specific review:

required objects
→ access scopes
→ field mapping
→ customer privacy rules
→ metric definitions
→ output format
→ review process
→ maintenance plan

That is not a weakness.

It is the reality of API reporting work.

The public case study should make that reality visible instead of hiding it behind a simplified demo.

Designing a Shopify-style API Reporting Workflow as a Public Case Study

Bob Oner — Tue, 16 Jun 2026 06:22:43 +0000

In my previous article, I walked through a general Python data quality workflow using a public retail dataset.

That project focused on a reusable pattern:

raw data
→ schema mapping
→ validation
→ cleaning
→ SQLite export
→ quality report
→ benchmark evidence

This article is about a different kind of repository.

Instead of adding another version to the general ETL starter, I built a public Shopify-style e-commerce API reporting case study.

Previous article: Running a Real Retail Dataset Through a Python Data Quality Workflow

The runnable open-source project behind that article is:

https://github.com/OnerGit/data-quality-etl-starter

That repository proves the general workflow capability: messy CSV, Excel, JSON, API-style data, validation, cleaning, exports, reports, analytics-ready outputs, BI-ready outputs, AI-ready preparation, and public dataset benchmark evidence.

The repository discussed in this article is:

https://github.com/OnerGit/shopify-api-reporting-workflow

This new repository is not another version of data-quality-etl-starter.

It is a public portfolio case study for a Shopify-style e-commerce API reporting workflow.

For a fully runnable open-source data workflow project, see data-quality-etl-starter. The runnable implementation of the Shopify-style workflow is maintained privately as a reusable commercial delivery asset.

Why build a Shopify-style reporting case study?

A general data workflow project is useful, but real client work is usually vertical: tied to one platform and one concrete reporting need.

A small e-commerce team does not usually ask for "a data quality ETL starter."

They ask for something more specific:

Can you export Shopify orders every week?
Can you clean product and customer data?
Can you generate an Excel sales report?
Can you turn API data into CSV files?
Can you create product-level or customer-level summaries?
Can you prepare a local reporting database?
Can you help us move from REST-style exports toward GraphQL-style API data?

Those requests are narrower than a full data platform.

They are also more concrete than a generic portfolio demo.

That is why I built shopify-api-reporting-workflow as a vertical case study. It applies the same workflow thinking from my general data quality project to a more realistic e-commerce reporting scenario.

The core workflow idea is:

Shopify-style API data
→ pagination
→ field mapping
→ normalized reporting tables
→ validation
→ CSV / Excel / SQLite exports
→ Markdown report

The project is intentionally scoped around reporting workflow design, not around building a full Shopify app.

Public repo vs private runnable implementation

The most important boundary in this repository is the public/private split.

The public repository includes:

README.md
NOTICE.md
docs/
sample_outputs/
screenshots/

It shows:

the case-study problem;
the reporting workflow shape;
public-safe sample output previews;
screenshots from the private runnable workflow;
implementation boundary notes;
limitations;
REST-to-GraphQL migration notes.

It does not include:

source code;
tests;
scripts;
dependency files;
Docker files;
complete mock data;
complete field mappings;
GraphQL query templates;
production connector code;
credentials;
tokens;
store domains;
client data.

That is intentional.

The runnable implementation exists locally and privately as a reusable commercial delivery asset. The public repository is designed to explain the workflow, output expectations, design boundaries, and implementation judgment without exposing reusable private code or client-sensitive material.

This is different from data-quality-etl-starter.

data-quality-etl-starter is a runnable open-source project.

shopify-api-reporting-workflow is a public case-study repository.

Both are useful, but they serve different purposes.

v0.1: mock REST-style API reporting workflow

The v0.1 private implementation models a mock REST-style e-commerce API reporting workflow.

It uses fake local fixtures only. It does not call the real Shopify API.

The workflow shape is:

mock REST-style API fixtures
→ paginated orders extraction
→ products / customers extraction
→ field mapping
→ order/customer flattening
→ line item expansion
→ validation
→ CSV / Excel / SQLite export
→ summary tables
→ Markdown report
→ sanitized public outputs

The goal of v0.1 was to prove the reporting workflow shape.

In e-commerce reporting, orders are often nested. A single order may contain customer information, shipping fields, fulfillment fields, tax fields, discounts, and line items.

That data is not always easy to use directly in a spreadsheet.

A practical reporting workflow usually needs to split the data into tables such as:

orders
order_line_items
customers
products
sales_summary_by_month
sales_summary_by_product
customer_order_summary

That is the main idea behind v0.1.

The workflow demonstrates how paginated API-style order data can be normalized into reporting-friendly outputs.
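As a rough illustration of that normalization, the sketch below splits nested order records into flat `orders` and `order_line_items` rows. The input field names and the `normalize_orders` helper are assumptions made for the example, not the private field mapping.

```python
def normalize_orders(raw_orders: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split nested order records into flat `orders` and `order_line_items` rows."""
    orders, line_items = [], []
    for order in raw_orders:
        customer = order.get("customer") or {}
        orders.append({
            "order_id": order["id"],
            "created_at": order.get("created_at"),
            "customer_id": customer.get("id"),
            "total_price": float(order.get("total_price", 0)),
        })
        for position, item in enumerate(order.get("line_items", []), start=1):
            line_items.append({
                "order_id": order["id"],   # foreign key back to orders
                "line_number": position,
                "sku": item.get("sku"),
                "quantity": int(item.get("quantity", 0)),
                "unit_price": float(item.get("price", 0)),
            })
    return orders, line_items
```

Each line item keeps its `order_id`, so the two tables can be joined again in a spreadsheet or a SQLite query.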

The public repository includes sanitized preview files such as:

report_preview.md
orders_cleaned_preview.csv
sales_summary_by_month_preview.csv
sales_summary_by_product_preview.csv
customer_order_summary_preview.csv

Those previews are intentionally small. They show output shape, not full production coverage.

The private workflow also has test evidence. The public screenshot is included only to show that the private implementation was checked locally; it does not expose the implementation itself.

Why CSV, Excel, and SQLite outputs?

For many small e-commerce reporting requests, the first deliverable is not a data warehouse.

It is usually something more practical:

CSV files for import or review
Excel workbook for store operators
SQLite-style local database for lightweight handoff
Markdown report for validation notes

That is why v0.1 focuses on export formats that are easy to inspect.

A store operator may want an Excel workbook.

A technical client may want CSV files.

A developer or analyst may want a local SQLite database.

The workflow also produces a Markdown report preview with extraction, validation, output, and limitation notes.

The important point is not the file format itself. The important point is the handoff:

What data was extracted?
What was normalized?
What warnings were found?
What summaries were generated?
What files were produced?
What assumptions need to be checked?

That is the kind of reporting workflow clients can review before moving into heavier BI infrastructure.
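The CSV and SQLite outputs can both be produced with the Python standard library. The sketch below is a minimal example under assumptions: the `sales_summary_by_month` table name comes from the list earlier in this article, but its columns and the sample rows are invented for illustration.

```python
import csv
import io
import sqlite3


def export_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text; the header comes from the first row's keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def export_sqlite(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Write rows to a `sales_summary_by_month` table with assumed columns."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_summary_by_month "
        "(month TEXT PRIMARY KEY, order_count INTEGER, total_sales REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO sales_summary_by_month "
        "VALUES (:month, :order_count, :total_sales)",
        rows,
    )
    conn.commit()


# Invented sample rows, used only to show the output shape.
summary = [
    {"month": "2026-01", "order_count": 2, "total_sales": 150.0},
    {"month": "2026-02", "order_count": 1, "total_sales": 40.0},
]
csv_text = export_csv(summary)
conn = sqlite3.connect(":memory:")
export_sqlite(conn, summary)
```

Both exports read the same list of rows, so the CSV and the database cannot drift apart in content.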

v0.2: GraphQL-shaped mock workflow and cursor pagination

The v0.2 update adds a GraphQL-shaped mock workflow.

This matters because Shopify's REST Admin API is now a legacy API, and new public apps should be designed around the GraphQL Admin API.

The v0.2 workflow still does not call Shopify.

It uses local fake fixtures shaped like GraphQL responses.

The mock input structure includes:

edges
node
cursor
pageInfo

That makes the case study more realistic than a simple REST-style mock export.

The v0.2 workflow simulates cursor-style pagination and keeps the same reporting output concept:

fake GraphQL-shaped order data
→ cursor-style pagination simulation
→ GraphQL-style field path mapping
→ normalized reporting tables
→ validation notes
→ sanitized GraphQL report preview
→ REST-to-GraphQL migration summary
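The cursor-style pagination step can be sketched against local fake fixtures as follows. The fixture contents and helper names are hypothetical, and the `hasNextPage` and `endCursor` fields inside `pageInfo` follow the common GraphQL connection convention, which is assumed here for the mock data.

```python
# Fake pages keyed by the cursor used to request them (None = first page).
FAKE_PAGES = {
    None: {
        "edges": [
            {"cursor": "c1", "node": {"id": "order-1"}},
            {"cursor": "c2", "node": {"id": "order-2"}},
        ],
        "pageInfo": {"hasNextPage": True, "endCursor": "c2"},
    },
    "c2": {
        "edges": [{"cursor": "c3", "node": {"id": "order-3"}}],
        "pageInfo": {"hasNextPage": False, "endCursor": "c3"},
    },
}


def fetch_page(after):
    """Stand-in for a request; it only reads the local fake fixtures."""
    return FAKE_PAGES[after]


def iter_nodes(fetch=fetch_page):
    """Yield every node by following endCursor until hasNextPage is false."""
    cursor = None
    while True:
        page = fetch(cursor)
        for edge in page["edges"]:
            yield edge["node"]
        if not page["pageInfo"]["hasNextPage"]:
            break
        cursor = page["pageInfo"]["endCursor"]
```

Because `fetch` is a parameter, the same loop could later be pointed at a real client without changing the pagination logic. No network call is made anywhere in this sketch.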

The public repository includes:

report_preview_graphql.md
docs/graphql_workflow_summary.md
docs/rest_to_graphql_mock_migration_summary.md

Again, this is not a production GraphQL Admin API client.

It does not include real GraphQL queries, OAuth, tokens, real store domains, access scopes, or production connector code.

The purpose of v0.2 is to show that the reporting workflow design is aware of the GraphQL direction and cursor-style pagination pattern.

Why model GraphQL-shaped responses?

A REST-style mock workflow is easy to understand, but it is not enough for a Shopify-aware reporting case study.

A real Shopify implementation would need to handle the current Admin API direction, approved access scopes, secure credentials, cursor pagination, rate limits, retries, and store-specific field mapping.

The public repository does not try to solve all of that.

Instead, it models the shape of the problem:

GraphQL connection response
→ cursor pagination state
→ nested node extraction
→ field path mapping
→ normalized reporting tables

That is useful because reporting work depends heavily on the shape of the source data.

If the source response shape changes, the mapping layer changes.

If pagination changes, the extraction layer changes.

If the reporting definitions change, the summary layer changes.

The case study makes those boundaries visible without publishing a production connector.
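The mapping layer can be pictured as a table of output columns and dotted paths into each node. The sketch below assumes that design; the paths, column names, and helpers are hypothetical and do not reproduce the private field mappings.

```python
# Hypothetical mapping: output column -> dotted path inside a GraphQL-shaped node.
FIELD_PATHS = {
    "order_id": "id",
    "customer_email": "customer.email",
    "total": "totalPrice.amount",
}


def get_path(node: dict, path: str):
    """Follow a dotted path; return None if any step is missing."""
    current = node
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def map_node(node: dict) -> dict:
    """Build one flat reporting row from a nested node."""
    return {column: get_path(node, path) for column, path in FIELD_PATHS.items()}
```

If the source response shape changes, only `FIELD_PATHS` needs to change, which is the boundary the paragraph above describes.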

What the public repo shows

The public repository shows the case study through documentation, sample outputs, and screenshots.

The public material demonstrates:

a v0.1 REST-style mock workflow run;
a v0.2 GraphQL-shaped mock workflow run;
private test evidence;
public-safe sample outputs;
Excel-style workbook preview;
Markdown report preview;
SQLite-style table preview;
implementation boundary notes;
limitations;
workflow mapping to real client scenarios.

The public sample outputs include:

report_preview.md
report_preview_graphql.md
orders_cleaned_preview.csv
sales_summary_by_month_preview.csv
sales_summary_by_product_preview.csv
customer_order_summary_preview.csv

The screenshots were taken from the private runnable workflow and from the sanitized public preview files.

They are included to show workflow behavior and output shape, not to expose the implementation.

This distinction is important.

A screenshot can show that a workflow exists and what it produces. It should not expose credentials, tokens, real store domains, client data, private paths, complete mock fixtures, or source code from the private implementation.

What is intentionally out of scope

This repository is intentionally not:

a Shopify app;
a production Shopify connector;
a public runnable implementation;
a complete GraphQL Admin API client;
an OAuth implementation;
a live-store integration;
a webhook service;
a full data warehouse;
a BI dashboard;
a SaaS product;
a low-code or n8n workflow;
an AI agent workflow.

It does not include:

real Shopify tokens;
store domains;
client data;
raw API responses;
complete field mappings;
production GraphQL query templates;
production connector code;
complete mock datasets;
private implementation paths.

A real Shopify reporting project would need to confirm many things before implementation:

required Shopify objects;
access scopes;
authentication approach;
pagination behavior;
rate limits and retries;
reporting metric definitions;
output file requirements;
customer data privacy requirements;
store-specific product, variant, discount, refund, tax, shipping, fulfillment, and channel fields.

The public repository does not hide those requirements. It documents the boundary.

How this maps to real client work

This type of workflow maps to practical e-commerce reporting requests.

Examples include:

Shopify order export to CSV
API to Excel reporting workbook
product and customer cleanup
order line item expansion
sales summary by month
sales summary by product
customer order summary
API-to-database workflow
GraphQL migration-aware reporting workflow

The value is not just extracting data.

The value is structuring the workflow so another person can understand it:

extract
→ map
→ validate
→ normalize
→ summarize
→ export
→ report

For a small reporting automation project, this can be a useful first stage before investing in a larger dashboard, warehouse, or SaaS tool.

The workflow also creates a safer technical discussion.

Instead of jumping straight into implementation, the project encourages questions like:

Which Shopify objects matter?
Which fields should be included?
How should refunds and discounts be counted?
Should reporting use gross sales or net sales?
Are taxes and shipping included?
Which output format is easiest to review?
Should the workflow produce a local database?
What data should be excluded for privacy reasons?

Those questions are part of the engineering work.
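One way to make those answers explicit is to write the metric definition down as data before computing anything. The sketch below assumes amounts stored as integer cents; `SalesDefinition`, its defaults, and the order fields are hypothetical and would need to be agreed with each client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesDefinition:
    """An explicit, reviewable answer to the metric questions above."""
    subtract_discounts: bool = True
    subtract_refunds: bool = True
    include_tax: bool = False
    include_shipping: bool = False


def order_sales(order: dict, definition: SalesDefinition) -> int:
    """Compute one order's sales in cents under the given definition."""
    amount = order["gross"]
    if definition.subtract_discounts:
        amount -= order.get("discounts", 0)
    if definition.subtract_refunds:
        amount -= order.get("refunds", 0)
    if definition.include_tax:
        amount += order.get("tax", 0)
    if definition.include_shipping:
        amount += order.get("shipping", 0)
    return amount
```

The same order produces different totals under different definitions, so the definition belongs in the handoff notes next to the numbers.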

What I would validate next

The next step would not be to publish the private implementation.

Instead, I would validate the case study against more realistic reporting requirements.

Areas to validate next include:

order-level and line-item-level metric definitions;
refund and cancellation handling;
product variant mapping;
discount and tax treatment;
shipping and fulfillment status fields;
customer privacy handling;
incremental sync assumptions;
GraphQL rate-limit and retry strategy;
client-specific Excel workbook layout;
whether the final handoff should be CSV, Excel, SQLite, PostgreSQL, or BI-ready tables.

I would also keep the public/private boundary in place.

The public repo should remain a case-study asset.

The private implementation should remain reusable, adaptable, and safe for commercial delivery.

Closing summary

data-quality-etl-starter shows the general data workflow pattern.

shopify-api-reporting-workflow applies the same thinking to a vertical e-commerce reporting scenario.

The first project proves the reusable data quality workflow.

The second project shows how that workflow thinking can be narrowed into a Shopify-style API reporting case study with public-safe documentation, sanitized output previews, REST-style workflow evidence, and GraphQL-shaped pagination awareness.

It is not a Shopify app.

It is not a production connector.

It is not a public runnable package.

It is a transparent portfolio case study for a practical reporting workflow: API-shaped e-commerce data into validated, normalized, reporting-ready outputs.

Running a Real Retail Dataset Through a Python Data Quality Workflow

Bob Oner — Tue, 16 Jun 2026 02:54:17 +0000

In the previous article, I extended a small Python data quality ETL starter with AI-ready data preparation.

The important constraint was that the workflow did not call an LLM API, generate embeddings, or train a model. It prepared structured data assets such as schema profiles, data dictionaries, validation summaries, feature-ready CSV files, and manifest files.

Preparing AI-Ready Data Without Calling an LLM API

This follow-up focuses on the v0.7.0 update of the same project:

Data Quality ETL Starter on GitHub

The new goal is to move beyond synthetic demo data and show that the same data quality workflow can process a public retail/e-commerce-style dataset locally.

This is still not a big data platform, a production retail analytics system, a benchmark leaderboard, or a public dataset redistribution repository.

The goal is narrower and more practical:

manually downloaded public retail dataset
        ↓
prepare_real_dataset_demo.py
        ↓
normalized retail transaction CSV
        ↓
existing CLI validation and cleaning workflow
        ↓
quality reports + SQLite export
        ↓
run_real_dataset_benchmark.py
        ↓
benchmark report + summary CSV outputs

That is a useful next step for a portfolio project because it shows the workflow can handle a more realistic dataset while still keeping data handling, scope, and reproducibility clear.

Why add a real dataset benchmark?

Earlier versions of this project used small sample files and generated synthetic order data.

That is useful for testing and documentation, but it leaves one practical question:

Can the workflow handle a public dataset that was not designed specifically for this repository?

v0.7.0 adds an optional real dataset benchmark path to answer that question.

The workflow now demonstrates how to:

take a public retail transaction dataset;
keep the raw dataset local-only;
map external source columns into a project-friendly schema;
derive practical fields such as revenue and cancellation flags;
reuse the existing CLI validation and cleaning workflow;
generate Markdown and JSON quality reports;
export cleaned data to SQLite;
produce benchmark evidence and summary CSV files.

The key design choice is that the existing CLI remains the source of truth.

The real dataset path does not become a separate pipeline. It prepares the source data, then passes it through the same validation and cleaning workflow used by the rest of the project.

Dataset used in v0.7.0

The default v0.7.0 dataset is the UCI Online Retail dataset.

Official source:

UCI Machine Learning Repository: Online Retail

Citation:

Chen, D. (2015). Online Retail [Dataset]. UCI Machine Learning Repository.
https://doi.org/10.24432/C5BW33

License note:

Creative Commons Attribution 4.0 International (CC BY 4.0)

The dataset is useful for this project because it is retail/e-commerce adjacent and transaction-shaped. It includes fields that map naturally into an invoice/order-style workflow:

InvoiceNo
StockCode
Description
Quantity
InvoiceDate
UnitPrice
CustomerID
Country

The project maps those source columns into normalized snake_case columns and adds derived fields.

What is kept out of Git

This part is important.

The repository does not redistribute the full raw UCI dataset. It also does not commit full normalized or cleaned real dataset outputs.

These paths are local-only:

data/external/
data/raw/public/
data/output/real_dataset/

The repository keeps:

source code;
schema files;
tests;
documentation;
screenshots;
small sample inputs;
instructions for running the workflow locally.

It does not keep:

full downloaded raw datasets;
full normalized real dataset outputs;
full cleaned real dataset outputs;
local SQLite files generated from real datasets;
private customer data;
client data;
API credentials;
tokens or secrets.

This keeps the repository lightweight and avoids turning it into a dataset mirror.

What v0.7.0 adds

The most relevant new files are:

scripts/prepare_real_dataset_demo.py
scripts/run_real_dataset_benchmark.py
src/dq_etl_starter/real_dataset.py
docs/data_sources.md
docs/real_dataset_benchmark.md
docs/limitations.md
data/expected/online_retail_schema.json

The real dataset helper module handles the project-specific mapping and summary logic.

The two scripts provide a simple local workflow:

prepare the manually downloaded dataset into a normalized CSV;
generate local benchmark evidence and summary outputs after the CLI quality workflow runs.

Project structure after the update

The project now has a clearer path from messy input files to public-dataset benchmark evidence:

data-quality-etl-starter/
├── data/
│   ├── expected/
│   │   └── online_retail_schema.json
│   └── output/
├── docs/
│   ├── data_sources.md
│   ├── limitations.md
│   └── real_dataset_benchmark.md
├── screenshots/
├── scripts/
│   ├── prepare_real_dataset_demo.py
│   └── run_real_dataset_benchmark.py
├── src/dq_etl_starter/
│   ├── real_dataset.py
│   ├── cli.py
│   ├── clean.py
│   ├── report.py
│   └── validate.py
└── tests/
    ├── test_real_dataset.py
    └── test_real_dataset_benchmark.py

The real dataset path is optional. The default small sample workflows remain unchanged.

Install the project locally

Clone the repository:

git clone https://github.com/OnerGit/data-quality-etl-starter.git
cd data-quality-etl-starter

Create a virtual environment:

python -m venv .venv

Activate it on macOS or Linux:

source .venv/bin/activate

Activate it on Windows PowerShell:

.venv\Scripts\activate

Install dependencies and the local package:

pip install -r requirements.txt
pip install -e .

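The editable install matters because the project uses a src/ layout: without it, the dq_etl_starter package is not on the import path for commands such as python -m dq_etl_starter.cli.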

Step 1: Download the public dataset manually

Download the UCI Online Retail dataset from the official UCI Machine Learning Repository page.

Place the file here:

data/external/online_retail.xlsx

The project does not download the dataset automatically.

That is intentional.

For a public portfolio repository, I prefer to keep the data acquisition step explicit. It makes the source, license, citation, and local-only handling policy easier to review.

Step 2: Prepare the normalized dataset

Run the preparation script.

macOS / Linux:

python scripts/prepare_real_dataset_demo.py \
  --raw-input data/external/online_retail.xlsx \
  --output data/output/real_dataset/online_retail_normalized.csv

Windows PowerShell:

python scripts/prepare_real_dataset_demo.py `
  --raw-input data/external/online_retail.xlsx `
  --output data/output/real_dataset/online_retail_normalized.csv

This step reads the local source file, validates expected source columns, maps UCI columns into project-friendly names, derives additional fields, and writes a normalized CSV.

The normalized output columns are:

invoice_no
stock_code
description
quantity
invoice_date
unit_price
customer_id
country
revenue
is_cancellation
source_dataset

The derived fields are simple but useful:

revenue is derived from quantity and unit price;
is_cancellation marks cancellation-style rows;
source_dataset records dataset lineage.

This preparation layer is deliberately small. It does not try to perform all business logic. It only converts the external dataset into a shape that the existing project workflow can validate and clean.
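The mapping and derived fields described above can be sketched in a few lines. This is a minimal illustration, not the repository's real_dataset.py: the function name normalize_row is hypothetical, revenue is assumed to be quantity multiplied by unit price, and the cancellation rule assumes the UCI convention that cancelled invoices have an InvoiceNo starting with "C".

```python
# Hypothetical sketch of the column mapping; not the project's actual code.
COLUMN_MAP = {
    "InvoiceNo": "invoice_no",
    "StockCode": "stock_code",
    "Description": "description",
    "Quantity": "quantity",
    "InvoiceDate": "invoice_date",
    "UnitPrice": "unit_price",
    "CustomerID": "customer_id",
    "Country": "country",
}


def normalize_row(raw: dict, source_dataset: str = "uci_online_retail") -> dict:
    """Rename source columns and add the three derived fields."""
    missing = [col for col in COLUMN_MAP if col not in raw]
    if missing:
        raise KeyError(f"missing source columns: {missing}")

    row = {new: raw[old] for old, new in COLUMN_MAP.items()}
    quantity = int(row["quantity"])
    unit_price = float(row["unit_price"])

    # Assumption: revenue is quantity multiplied by unit price.
    row["revenue"] = round(quantity * unit_price, 2)
    # Assumption: UCI marks cancellations with an invoice number starting with "C".
    row["is_cancellation"] = str(row["invoice_no"]).upper().startswith("C")
    # Lineage label; the default value here is illustrative.
    row["source_dataset"] = source_dataset
    return row
```

Keeping the mapping in one dictionary means a reviewer can compare source and normalized names at a glance, and a missing source column fails early instead of producing a half-empty row.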

Step 3: Run the existing CLI workflow

After preparation, the normalized CSV is passed into the existing CLI workflow.

macOS / Linux:

python -m dq_etl_starter.cli run \
  --input data/output/real_dataset/online_retail_normalized.csv \
  --input-type csv \
  --schema data/expected/online_retail_schema.json \
  --output-dir data/output/real_dataset/run \
  --db-target sqlite \
  --table-name cleaned_online_retail

Windows PowerShell:

python -m dq_etl_starter.cli run `
  --input data/output/real_dataset/online_retail_normalized.csv `
  --input-type csv `
  --schema data/expected/online_retail_schema.json `
  --output-dir data/output/real_dataset/run `
  --db-target sqlite `
  --table-name cleaned_online_retail

Expected local outputs:

data/output/real_dataset/run/cleaned_online_retail.csv
data/output/real_dataset/run/etl_output.sqlite
data/output/real_dataset/run/quality_report.md
data/output/real_dataset/run/quality_report.json

This is the most important design point in v0.7.0.

The real dataset path reuses the existing validation and cleaning workflow. It does not create a special one-off script that bypasses the project architecture.

Schema for the normalized retail dataset

The schema file is:

data/expected/online_retail_schema.json

It defines the expected normalized columns and validation rules for fields such as invoice number, stock code, quantity, invoice date, unit price, customer ID, country, revenue, cancellation flag, and source dataset.

The schema is not intended to certify the dataset as business-ready.

It is a practical contract for this starter workflow:

external retail columns
        ↓
normalized project columns
        ↓
expected schema rules
        ↓
quality report

That is a useful handoff pattern because the next person can inspect both the mapping and the validation report.
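To make the "schema as contract" idea concrete, here is a small sketch of row-level validation. The schema structure, the issue code format, and the function name validate_row are all assumptions for illustration; the real online_retail_schema.json and the project's validator may look different.

```python
from datetime import datetime

# Hypothetical schema shape; the real online_retail_schema.json may differ.
SCHEMA = {
    "invoice_no": {"type": "string", "required": True},
    "quantity": {"type": "integer", "required": True},
    "unit_price": {"type": "number", "required": True},
    "invoice_date": {"type": "datetime", "required": True},
    "customer_id": {"type": "string", "required": False},
}

# Each check raises ValueError or TypeError when the value does not fit.
CHECKS = {
    "string": str,
    "integer": int,
    "number": float,
    "datetime": datetime.fromisoformat,
}


def validate_row(row: dict, schema: dict = SCHEMA) -> list[str]:
    """Return a list of issue codes for one normalized row."""
    issues = []
    for column, rule in schema.items():
        value = row.get(column)
        if value is None or value == "":
            if rule["required"]:
                issues.append(f"missing_required:{column}")
            continue
        try:
            CHECKS[rule["type"]](value)
        except (TypeError, ValueError):
            issues.append(f"invalid_{rule['type']}:{column}")
    return issues
```

Returning issue codes instead of raising on the first problem fits the reporting goal: the run can finish and the quality report can count every issue type.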

Quality report

The CLI workflow writes a Markdown report and a JSON report.

For the real dataset workflow, the Markdown report is written to:

data/output/real_dataset/run/quality_report.md

The report is useful because it records what the workflow found rather than only producing a cleaned file.

Typical report sections include:

raw row count;
cleaned row count;
missing values by column;
duplicate row count;
expected column checks;
validation issue summaries;
output file paths.

For a repeatable data workflow, this is important. A cleaned output file alone is not enough. The workflow should also explain what was detected and what still needs review.
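A few of the report metrics listed above can be computed with the standard library alone. The sketch below is an assumption about how such counters might be built, not the project's report.py; the function name and the output keys are hypothetical, and it treats a duplicate as a row whose values across all expected columns were already seen.

```python
from collections import Counter


def build_quality_summary(rows: list[dict], expected_columns: list[str]) -> dict:
    """Compute a few report metrics from rows held as dictionaries."""
    missing_by_column = Counter()
    for row in rows:
        for column in expected_columns:
            if row.get(column) in (None, ""):
                missing_by_column[column] += 1

    # A duplicate is any row whose full set of values was already seen.
    seen = set()
    duplicate_rows = 0
    for row in rows:
        key = tuple(row.get(column) for column in expected_columns)
        if key in seen:
            duplicate_rows += 1
        else:
            seen.add(key)

    return {
        "raw_row_count": len(rows),
        "row_count_after_dedup": len(seen),
        "duplicate_row_count": duplicate_rows,
        "missing_values_by_column": dict(missing_by_column),
    }
```

The returned dictionary can be serialized to JSON as-is, which is one way to keep the Markdown and JSON reports in sync: render both from the same structure.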

Step 4: Generate the real dataset benchmark report

After the CLI workflow finishes, generate a local benchmark report and summary outputs.

macOS / Linux:

python scripts/run_real_dataset_benchmark.py \
  --normalized-input data/output/real_dataset/online_retail_normalized.csv \
  --quality-report data/output/real_dataset/run/quality_report.json \
  --output-dir data/output/real_dataset \
  --dataset-name uci_online_retail

Windows PowerShell:

python scripts/run_real_dataset_benchmark.py `
  --normalized-input data/output/real_dataset/online_retail_normalized.csv `
  --quality-report data/output/real_dataset/run/quality_report.json `
  --output-dir data/output/real_dataset `
  --dataset-name uci_online_retail

Expected local outputs:

data/output/real_dataset/benchmark_report.md
data/output/real_dataset/summary/revenue_by_country.csv
data/output/real_dataset/summary/revenue_by_month.csv
data/output/real_dataset/summary/cancellation_summary.csv
data/output/real_dataset/summary/missing_customer_summary.csv

The benchmark report is not a universal performance claim.

It is local evidence for this machine, this dependency environment, and this dataset preparation flow.

That distinction matters. Runtime can change depending on CPU, disk speed, Python version, package versions, source file format, operating system, and local machine conditions.
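One way to keep a benchmark honest about its context is to store the environment next to the timing. The helper below is a hypothetical sketch (timed_run is not a function from the repository) that records elapsed time together with the Python version and platform.

```python
import platform
import time


def timed_run(step, *args, **kwargs) -> dict:
    """Run one workflow step and record runtime plus local environment."""
    started = time.perf_counter()
    result = step(*args, **kwargs)
    elapsed = time.perf_counter() - started
    return {
        "result": result,
        "elapsed_seconds": round(elapsed, 4),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
```

A report built from records like this makes it harder to quote a runtime without the machine it was measured on.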

Summary outputs

The benchmark script also writes lightweight summary CSV files.

The summary outputs are intentionally simple:

revenue_by_country.csv
revenue_by_month.csv
cancellation_summary.csv
missing_customer_summary.csv

They are not a full BI model.

They are small reporting-ready outputs that show how a cleaned retail transaction dataset can be summarized after validation.

For example:

revenue_by_country.csv supports country-level revenue inspection;
revenue_by_month.csv supports monthly trend inspection;
cancellation_summary.csv records counts of cancellation rows and of rows with non-positive values;
missing_customer_summary.csv helps inspect where customer IDs are missing.

This is often enough for a first local reporting workflow.

The next version could load these into PostgreSQL, query them in DuckDB, or feed a local dashboard, but v0.7.0 intentionally keeps the real dataset path focused.
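The revenue summaries described above amount to simple group-by aggregations. The following sketch shows the idea with plain dictionaries; the function names are hypothetical, and it assumes invoice_date is an ISO-style string so that the first seven characters give the year and month.

```python
from collections import defaultdict


def revenue_by(rows: list[dict], key_func) -> dict:
    """Sum revenue per group; key_func maps a row to its group label."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_func(row)] += float(row["revenue"])
    return {group: round(total, 2) for group, total in sorted(totals.items())}


def by_country(row: dict) -> str:
    return row["country"]


def by_month(row: dict) -> str:
    # Assumes invoice_date is an ISO-style string such as "2011-03-15 10:20:00".
    return row["invoice_date"][:7]
```

Note that this sketch sums every row it is given, including negative revenue from cancellation rows. Whether cancellations should be netted out or reported separately is a reporting decision to make explicit in the benchmark report.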

What the benchmark report records

The benchmark report is designed to answer practical questions:

Which dataset was used?
Where did the normalized input come from?
How many rows were normalized?
How many rows reached the CLI quality workflow?
How many duplicate rows were detected?
How many cancellation rows were identified?
How many customer IDs or descriptions were missing?
Were invoice dates, quantities, and prices validated?
What files were produced?
What limitations apply?

That makes the run easier to review later.

It also makes the project stronger as a portfolio asset because the workflow is not only described in prose. It leaves behind files, screenshots, reports, and commands that can be inspected.

Why not automatically download the dataset?

The project could theoretically download the dataset automatically.

For this version, I chose not to do that.

Manual download keeps the workflow clearer:

the user sees the official source page;
the dataset citation remains visible;
the license note is explicit;
the repository does not redistribute the raw dataset;
the workflow does not depend on hidden network access;
local-only data handling is easier to explain.

For a small portfolio repository, this is a reasonable trade-off.

The project demonstrates how to process the dataset, not how to become a dataset distribution tool.

Tests

Byte-compile the package and scripts as a quick syntax check, then run the full test suite:

python -m compileall -q src/dq_etl_starter
python -m compileall -q scripts
pytest

Run the v0.7-related tests:

pytest tests/test_real_dataset.py
pytest tests/test_real_dataset_benchmark.py

The tests focus on the reusable code paths rather than requiring the full external dataset to be committed.

That is another useful pattern for public repositories: test the transformation logic with small fixtures, and keep the large external dataset local-only.
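The small-fixture pattern can look like the sketch below. It is not taken from tests/test_real_dataset.py: the fixture values are made up, and derive_fields is a hypothetical stand-in for the transformation under test.

```python
import csv
import io

# Tiny inline fixture standing in for the full dataset (values are made up).
FIXTURE_CSV = """InvoiceNo,Quantity,UnitPrice
536365,6,2.55
C536379,-1,27.50
"""


def derive_fields(raw: dict) -> dict:
    """Hypothetical transformation under test."""
    quantity = int(raw["Quantity"])
    unit_price = float(raw["UnitPrice"])
    return {
        "invoice_no": raw["InvoiceNo"],
        "revenue": round(quantity * unit_price, 2),
        "is_cancellation": raw["InvoiceNo"].startswith("C"),
    }


def test_derive_fields_on_small_fixture():
    rows = [derive_fields(r) for r in csv.DictReader(io.StringIO(FIXTURE_CSV))]
    assert [r["revenue"] for r in rows] == [15.3, -27.5]
    assert [r["is_cancellation"] for r in rows] == [False, True]
```

Because the fixture lives in the test file, the test runs without any file under data/external/ and without network access.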

What is intentionally out of scope?

The v0.7.0 real dataset benchmark does not add:

automatic dataset download;
raw dataset redistribution;
production scheduling;
Airflow orchestration;
dbt modeling;
Snowflake, Databricks, or PySpark;
production-scale retail analytics;
a complete BI dashboard;
a benchmark leaderboard;
machine learning training;
LLM calls;
RAG or AI agent features.

This project is still a small Python data workflow starter.

The v0.7.0 update proves a specific point: the workflow can be applied to a public retail transaction dataset locally, while keeping the data handling policy, validation steps, outputs, and limitations clear.

What I would improve next

Possible next improvements include:

adding a Makefile for repeated demo commands;
adding a smaller public fixture for faster walkthroughs;
adding optional DuckDB queries for the real dataset summaries;
adding optional PostgreSQL reporting tables for the real dataset path;
adding a short CI workflow for core tests;
improving benchmark report formatting;
adding more detailed data source mapping documentation;
adding a second public dataset only if it does not make the project too broad.

The main constraint remains the same:

Keep the project small, reproducible, inspectable, and easy to adapt.

Repository

GitHub repository:

https://github.com/OnerGit/data-quality-etl-starter

Previous article: Preparing AI-Ready Data Without Calling an LLM API

This v0.7.0 update is a practical next step: from synthetic and generated demos to a local public retail dataset benchmark that reuses the same validation, cleaning, reporting, and handoff workflow.

Preparing AI-Ready Data Without Calling an LLM API

Bob Oner — Fri, 12 Jun 2026 15:18:41 +0000

In the previous article, I extended a small Python data quality ETL starter from cleaned data into BI-ready reporting tables with PostgreSQL, SQL views, and an optional Metabase dashboard.

From Clean Data to BI-Ready Reporting Tables with Python, PostgreSQL, and Metabase

This follow-up focuses on the v0.6.0 update of the same project:

Data Quality ETL Starter on GitHub

The v0.6.0 update adds an optional AI-ready data preparation demo.

That phrase can easily become vague, so I want to define it clearly.

In this project, "AI-ready" does not mean:

calling an LLM API;
generating embeddings;
creating a vector database;
building a RAG chatbot;
training a machine learning model;
adding an AI agent;
automatically cleaning data with an LLM.

Instead, AI-ready means something more practical and earlier in the workflow:

cleaned
validated
documented
machine-readable
safe to inspect before downstream BI, ML, or AI use

The goal is not to build an AI application. The goal is to prepare data artifacts that another workflow could review and use later.

Why this step matters

Many teams want to "use AI on their data" before they have a reliable data preparation layer.

That usually creates a gap between the ambition and what the data can support.

Before a dataset is useful for BI, ML, LLM, RAG, or any other AI-related workflow, a few basic questions still need to be answered:

What columns exist?
What does each column mean?
Which fields are identifiers or contact fields?
Which values are missing?
Which columns are numeric, categorical, datetime, or text-like?
What validation issues were found?
What data was removed or transformed?
What files were generated?
Did this process call any external AI service?

The v0.6.0 demo answers these questions by producing several small, reviewable output files.

This is especially useful for small-team workflows. A client or operator may not need a full ML platform. They may first need a clean handoff package that explains the dataset and makes downstream use safer.

What v0.6.0 adds

The v0.6.0 update adds a new optional workflow:

generated messy order data
        ↓
existing validation and cleaning workflow
        ↓
cleaned orders dataset
        ↓
schema profile JSON
        ↓
data dictionary JSON
        ↓
validation summary JSON
        ↓
feature-ready CSV
        ↓
embedding-ready text field extract
        ↓
AI-ready manifest + Markdown summary report

The main files added for this path are:

scripts/run_ai_ready_demo.py
src/dq_etl_starter/ai_ready.py
docs/ai_ready.md
tests/test_ai_ready.py
tests/test_ai_ready_outputs.py

The expected local output directory is:

data/output/ai_ready/

And the generated files are:

data/output/ai_ready/
├── ai_ready_summary_report.md
├── ai_ready_manifest.json
├── data_dictionary.json
├── schema_profile.json
├── validation_summary.json
├── feature_ready_orders.csv
└── embedding_ready_text_fields.csv

These are local artifacts. They should not be committed to the repository.

Project path so far

This project has grown in small steps:

v0.1.0  local data quality ETL baseline
v0.2.0  optional PostgreSQL export
v0.3.0  optional FastAPI validation wrapper
v0.4.0  analytics-ready Parquet + DuckDB demo
v0.5.0  BI-ready PostgreSQL + Metabase demo
v0.6.0  AI-ready data preparation demo

That sequence is intentional.

The project does not jump directly from messy CSV files to an AI application. It first builds the data workflow foundations:

reading input data;
validating schemas;
cleaning rows;
exporting data;
generating reports;
preparing analytics outputs;
loading reporting tables;
documenting data for downstream use.

The v0.6.0 update continues that path.

Install the project locally

Clone the repository:

git clone https://github.com/OnerGit/data-quality-etl-starter.git
cd data-quality-etl-starter

Create a virtual environment:

python -m venv .venv

Activate it on macOS or Linux:

source .venv/bin/activate

Activate it on Windows PowerShell:

.venv\Scripts\activate

Install dependencies and the local package:

pip install -r requirements.txt
pip install -e .

The editable install matters because the project uses a src/ layout: without it, the dq_etl_starter package under src/ is not on the default import path.

Step 1: Generate synthetic input data

The AI-ready demo starts from generated synthetic order data.

It does not use real customer data. It does not download external datasets. It does not require API keys.

Generate 100,000 rows:

python scripts/generate_sample_data.py \
  --rows 100000 \
  --output data/generated/orders_100k.csv \
  --seed 42

Windows PowerShell:

python scripts/generate_sample_data.py `
  --rows 100000 `
  --output data/generated/orders_100k.csv `
  --seed 42

The fixed seed keeps the demo reproducible.

The generated data intentionally includes common data quality issues such as missing values, invalid email values, duplicate rows, invalid dates, negative quantities, zero prices, and inconsistent country values.

That makes the downstream preparation step more meaningful than running the workflow on a perfectly clean sample file.
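The combination of a fixed seed and deliberately injected issues can be sketched as follows. This is not the code of scripts/generate_sample_data.py; the column names, issue rates, and the function name generate_orders are assumptions chosen for illustration.

```python
import random


def generate_orders(rows: int, seed: int = 42) -> list[dict]:
    """Generate synthetic orders with a few injected data quality issues."""
    rng = random.Random(seed)  # a local generator keeps runs reproducible
    # Inconsistent country spellings are included on purpose.
    countries = ["US", "usa", "United States", "DE", "Germany"]
    orders = []
    for i in range(rows):
        order = {
            "order_id": f"ORD-{i:06d}",
            "email": f"user{i}@example.com",
            "quantity": rng.randint(1, 5),
            "unit_price": round(rng.uniform(1, 100), 2),
            "country": rng.choice(countries),
        }
        # Roughly 5% of rows get each issue type (illustrative rates).
        roll = rng.random()
        if roll < 0.05:
            order["email"] = "not-an-email"
        elif roll < 0.10:
            order["quantity"] = -order["quantity"]
        elif roll < 0.15:
            order["unit_price"] = 0.0
        elif roll < 0.20:
            order["country"] = ""
        orders.append(order)
    return orders
```

Using random.Random(seed) instead of the module-level functions keeps the generator independent of any other code that touches the global random state.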

Step 2: Run the AI-ready preparation demo

Run the v0.6.0 demo:

python scripts/run_ai_ready_demo.py \
  --input data/generated/orders_100k.csv \
  --schema data/expected/generated_order_schema.json \
  --output-dir data/output/ai_ready \
  --dataset-name cleaned_orders

Windows PowerShell:

python scripts/run_ai_ready_demo.py `
  --input data/generated/orders_100k.csv `
  --schema data/expected/generated_order_schema.json `
  --output-dir data/output/ai_ready `
  --dataset-name cleaned_orders

The script prints a completion message and lists the generated outputs.

A successful run should create:

schema_profile.json
data_dictionary.json
validation_summary.json
feature_ready_orders.csv
embedding_ready_text_fields.csv
ai_ready_manifest.json
ai_ready_summary_report.md

This workflow uses the existing project pieces first:

read the generated CSV;
load the expected schema;
validate the input;
clean the DataFrame;
prepare order data for downstream use.

Then the new AI-ready layer creates metadata, summaries, and handoff files.

Output 1: Schema profile

The first output is:

data/output/ai_ready/schema_profile.json

This file is a machine-readable profile of the prepared dataset.

It includes information such as:

dataset name;
row count;
column count;
column names;
inferred types;
pandas dtypes;
null counts;
null ratios;
unique counts;
unique ratios;
example values;
recommended column roles.

A simplified example looks like this:

{
  "dataset_name": "cleaned_orders",
  "row_count": 100000,
  "column_count": 12,
  "columns": [
    {
      "name": "order_id",
      "dtype": "string",
      "recommended_role": "identifier",
      "null_count": 0,
      "unique_count": 100000
    }
  ]
}

The exact values depend on the generated input and cleaning result.

This file is useful because downstream users can inspect structure before deciding how to use the dataset.

For example, a BI user may check date and numeric fields. An ML user may check identifiers and contact fields. An AI/RAG workflow may check which text fields exist before deciding whether embeddings are appropriate.
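A profile like the one above can be assembled from simple per-column counters. The sketch below is a hedged illustration rather than the project's ai_ready.py: the role rules, the fallback role name feature_candidate, and both function names are hypothetical.

```python
# Assumption: these role rules are illustrative; the project's rules may differ.
ROLE_BY_COLUMN = {
    "order_id": "identifier",
    "customer_id": "identifier",
    "email": "contact",
}


def profile_column(name: str, values: list) -> dict:
    """Profile one column: nulls, uniqueness, examples, and a suggested role."""
    non_null = [v for v in values if v not in (None, "")]
    total = len(values)
    null_count = total - len(non_null)
    return {
        "name": name,
        "null_count": null_count,
        "null_ratio": round(null_count / total, 4) if total else 0.0,
        "unique_count": len(set(non_null)),
        "example_values": non_null[:3],
        "recommended_role": ROLE_BY_COLUMN.get(name, "feature_candidate"),
    }


def profile_dataset(dataset_name: str, rows: list[dict]) -> dict:
    """Build a dataset-level profile from rows held as dictionaries."""
    columns = list(rows[0]) if rows else []
    return {
        "dataset_name": dataset_name,
        "row_count": len(rows),
        "column_count": len(columns),
        "columns": [profile_column(c, [r.get(c) for r in rows]) for c in columns],
    }
```

Keeping the role rules in a plain mapping makes them easy to review and to override per dataset.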

Output 2: Data dictionary

The second output is:

data/output/ai_ready/data_dictionary.json

This file explains what each column means.

It includes:

column name;
human-readable description;
type;
recommended role;
nullable flag;
example values;
usage notes.

For example, identifier fields are marked differently from numeric features or text-like fields.

This matters because a cleaned table is still not self-explanatory.

A field such as customer_id may be technically clean, but it should usually not be treated as a numeric feature. A field such as email may be useful for validation examples, but it should be reviewed carefully before any downstream AI or ML use.

The data dictionary makes those notes explicit.

Output 3: Validation summary

The third output is:

data/output/ai_ready/validation_summary.json

This file gives a compact machine-readable summary of the validation and cleaning stage.

It includes:

source file;
schema file;
row count before cleaning;
row count after preparation;
rows removed during preparation;
duplicate rows removed;
columns with missing values;
validation issue count;
validation issue codes;
AI-readiness notes.

This output is useful for auditability.

When a dataset is handed off to another workflow, the receiver should not only get a CSV file. They should also get a summary of what happened before the file was produced.

Output 4: Feature-ready CSV

The fourth output is:

data/output/ai_ready/feature_ready_orders.csv

This is a simple tabular output for downstream feature exploration.

By default, the workflow removes identifier and contact fields such as:

order_id
customer_id
email

It also transforms order_date into simple time-based fields such as:

order_year
order_month

This file does not train a model. It does not decide which features are correct for a business use case.

It only creates a cleaner starting point for later review.

That distinction is important. Feature-ready does not mean model-ready for every use case. It means the output is more suitable for feature exploration than the original raw file.
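The two transformations described above, dropping identifier and contact fields and expanding order_date, can be sketched like this. The function name to_feature_row is hypothetical, and the sketch assumes that order_date is a valid ISO date string after cleaning and that the original order_date column is replaced by the derived fields.

```python
from datetime import datetime

# Columns named in the article as removed by default.
DROP_COLUMNS = {"order_id", "customer_id", "email"}


def to_feature_row(row: dict) -> dict:
    """Drop identifier/contact fields and expand order_date into year and month."""
    features = {
        key: value
        for key, value in row.items()
        if key not in DROP_COLUMNS and key != "order_date"
    }
    # Assumes order_date is already a valid ISO date string after cleaning.
    order_date = datetime.fromisoformat(row["order_date"])
    features["order_year"] = order_date.year
    features["order_month"] = order_date.month
    return features
```

For example, a row with order_date "2026-03-14" yields order_year 2026 and order_month 3, with order_id, customer_id, and email removed.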

Output 5: Embedding-ready text field extract

The fifth output is:

data/output/ai_ready/embedding_ready_text_fields.csv

This file extracts text-like fields into a compact structure:

record_id,text,source_columns

The project does not generate embeddings.

It only prepares text fields so a downstream workflow can decide later whether embeddings are appropriate.

Contact fields such as email are excluded by default.

That is a deliberate design choice. It keeps the project focused on data preparation and avoids pretending that every text field should automatically go into a vector database.
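The record_id,text,source_columns structure can be produced with a small helper. This is a sketch under assumptions: the text column names (product_name, notes), the separators, and the function name to_text_record are hypothetical; only the three output fields and the default exclusion of email come from the description above.

```python
CONTACT_FIELDS = {"email"}  # excluded by default, as described above


def to_text_record(row: dict, text_columns: list[str], id_column: str = "order_id") -> dict:
    """Build one record_id,text,source_columns record from text-like fields."""
    # Keep only non-contact columns that hold non-empty text.
    used = [
        column
        for column in text_columns
        if column not in CONTACT_FIELDS and str(row.get(column) or "").strip()
    ]
    return {
        "record_id": row[id_column],
        "text": " | ".join(f"{column}: {str(row[column]).strip()}" for column in used),
        "source_columns": ";".join(used),
    }
```

Recording source_columns alongside the text lets a later reviewer see exactly which fields would be embedded before any embedding model is involved.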

Output 6: AI-ready manifest

The sixth output is:

data/output/ai_ready/ai_ready_manifest.json

This is the most important scope-control file in the v0.6.0 update.

It explicitly records that the workflow did not call AI services:

{
  "llm_api_called": false,
  "embedding_generated": false,
  "model_trained": false
}

This may look simple, but it is useful for a public technical project.

The AI label can easily create confusion. A manifest prevents overclaiming by documenting what the workflow did and did not do.

The manifest also lists intended downstream uses, such as:

BI handoff;
ML feature exploration;
LLM/RAG preparation outside this project;
data quality review.

And it lists out-of-scope items such as:

LLM API calls;
embedding generation;
model training;
RAG chatbot;
AI agent;
vector database.
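A manifest along these lines can be built as a plain dictionary and serialized with the standard library. Only the three boolean flags are taken from the example above; the other keys and the function name build_manifest are assumptions for illustration.

```python
import json


def build_manifest(dataset_name: str, output_files: list[str]) -> str:
    """Return the manifest as a JSON string; the caller decides where to write it."""
    manifest = {
        "dataset_name": dataset_name,
        # Scope flags: this workflow never calls AI services.
        "llm_api_called": False,
        "embedding_generated": False,
        "model_trained": False,
        "output_files": sorted(output_files),
        "intended_downstream_uses": [
            "BI handoff",
            "ML feature exploration",
            "LLM/RAG preparation outside this project",
            "data quality review",
        ],
        "out_of_scope": [
            "LLM API calls",
            "embedding generation",
            "model training",
            "RAG chatbot",
            "AI agent",
            "vector database",
        ],
    }
    return json.dumps(manifest, indent=2)
```

Hard-coding the flags as False is deliberate in this sketch: if a later version ever did call a model, the manifest code would have to change in the same commit, which keeps the claim reviewable.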

Output 7: AI-ready summary report

The final output is:

data/output/ai_ready/ai_ready_summary_report.md

This is a human-readable Markdown report.

It includes:

dataset name;
prepared row count;
generated output files;
scope note;
recommended downstream use;
out-of-scope items.

The summary report is meant for handoff.

A technical reviewer can open the JSON files. A less technical stakeholder can start with the Markdown report and understand the purpose of the run.

Why no LLM API call?

This project intentionally stops before the expensive or model-specific part of an AI workflow.

There are several reasons.

First, AI APIs introduce cost and credential management. A small data workflow starter should run without paid API keys.

Second, embedding and modeling decisions depend on the use case. A dataset prepared for sales forecasting is different from a dataset prepared for semantic search.

Third, calling an LLM does not remove the need for validation, profiling, documentation, and governance. Those steps are still required.

Fourth, this project is meant to demonstrate a Python data workflow skill set: data cleaning, validation, transformation, reporting, testing, and handoff.

For this version, preparing better data is more important than adding an AI wrapper.

How this maps to client work

A realistic client request may sound like this:

We want to use our order/customer data for reporting, analytics, or maybe AI later. Can you clean it and prepare a documented dataset first?

A practical first milestone could be:

inspect input files;
validate required fields;
clean duplicates and bad values;
remove obvious identifier or contact fields from feature exploration outputs;
generate a schema profile;
create a data dictionary;
write a validation summary;
prepare a feature-ready CSV;
prepare a text-field extract for later review;
document what was and was not done.

That is exactly the kind of handoff this v0.6.0 demo is designed to show.

It is not a full AI system. It is a data preparation layer that makes later AI-related work more responsible and easier to review.

Tests

Run the AI-ready tests:

pytest tests/test_ai_ready.py
pytest tests/test_ai_ready_outputs.py

Byte-compile the package and scripts as a quick syntax check, then run the full test suite:

python -m compileall -q src/dq_etl_starter
python -m compileall -q scripts
pytest

The default tests do not require PostgreSQL, Metabase, external datasets, or any LLM API key.

That keeps the workflow easy to verify locally.

Local artifact policy

The generated AI-ready output files are local artifacts.

Do not commit:

data/generated/
data/output/ai_ready/
data/output/analytics/
data/output/bi/
*.parquet
*.duckdb
metabase.db/
metabase-data/
postgres_data/

The repository should keep source code, tests, documentation, schemas, lightweight sample files, and screenshots.

This matters for public portfolio projects. The repository should be easy to clone and review without carrying large generated outputs.

What is intentionally out of scope?

The v0.6.0 demo does not include:

OpenAI, Claude, Gemini, or other paid AI APIs;
local LLM integration;
embedding generation;
vector databases;
RAG chatbots;
AI agents;
automatic SQL generation;
automatic data cleaning by LLM;
model training;
AutoML;
feature stores;
MLflow or MLOps tooling;
cloud deployment;
custom frontend code.

These tools can be useful in the right project.

They are simply not the goal of this starter.

The goal is to prepare documented, machine-readable data that downstream workflows can inspect and decide how to use.

What I would improve next

Possible next improvements include:

add more configurable column role rules;
add stronger data dictionary templates;
generate a small HTML summary report;
add richer schema drift checks;
add more realistic public dataset validation notes;
add a Makefile for common demo commands;
add CI for test execution;
add clearer examples for BI, ML, and RAG handoff paths.

The main constraint remains the same:

Keep the project small, runnable, testable, documented, and honest about scope.

That is more useful than adding an AI feature that hides the underlying data preparation work.

Repository

GitHub repository:

https://github.com/OnerGit/data-quality-etl-starter

Previous article: From Clean Data to BI-Ready Reporting Tables with Python, PostgreSQL, and Metabase

This v0.6.0 update is a practical next step: preparing clean, validated, documented, machine-readable data for downstream BI, ML, or AI workflows without pretending that data preparation alone is a complete AI application.

From Clean Data to BI-Ready Reporting Tables with Python, PostgreSQL, and Metabase

Bob Oner — Thu, 11 Jun 2026 08:13:25 +0000

In the previous article, I extended a small Python data quality ETL starter from validation and cleaning into analytics-ready local outputs with Parquet, DuckDB, summary CSV files, and a benchmark report.

From Data Quality Checks to Analytics-Ready Parquet with Python

This follow-up focuses on the v0.5.0 update of the same project:

Data Quality ETL Starter on GitHub

The new goal is still intentionally modest.

This is not a production BI platform. It is not a data warehouse. It is not a cloud deployment. It is not an embedded analytics product.

The goal is to show one practical next step after cleaning and analytics-ready export:

generated messy order data
        ↓
existing validation and cleaning workflow
        ↓
analytics-ready order rows
        ↓
PostgreSQL reporting tables
        ↓
lightweight SQL views
        ↓
optional Metabase local dashboard demo

That is a common handoff point in small data workflow projects.

A client or small team may not need a full data platform yet. They may simply need cleaned data loaded into a local reporting database, a few reusable SQL views, and a basic dashboard tool connected to those views.

Why add a BI-ready demo?

The earlier versions of this project focused on data quality and local analytics.

The workflow could already:

read messy CSV, Excel, JSON, and mock API-style data;
validate expected columns and schema rules;
clean duplicate rows and text values;
export cleaned CSV output;
export to SQLite and optional PostgreSQL;
expose validation through a thin FastAPI wrapper;
generate larger synthetic order data;
export analytics-ready CSV and Parquet files;
query Parquet locally with DuckDB;
produce summary CSV tables and a benchmark report.

Those steps are useful, but many reporting workflows eventually ask another question:

Can this cleaned data feed a reporting database or dashboard?

v0.5.0 adds a small answer to that question.

It loads cleaned and analytics-ready data into PostgreSQL, creates reporting tables and SQL views, and provides instructions for exploring the result in Metabase.

The point is not to make the project bigger for its own sake. The point is to demonstrate a realistic bridge from data cleaning to lightweight reporting.

What v0.5.0 adds

The v0.5.0 update adds an optional BI-ready path.

The most relevant new pieces are:

scripts/run_bi_demo.py
src/dq_etl_starter/bi.py
docs/bi.md
docs/metabase.md

The BI demo creates PostgreSQL reporting tables:

cleaned_orders
customer_summary
revenue_by_country
orders_by_month
source_system_summary

It also creates PostgreSQL reporting views:

vw_revenue_by_country
vw_orders_by_month
vw_source_system_quality
vw_monthly_revenue_trend

And it writes local output files:

data/output/bi/bi_summary_report.md
data/output/bi/reporting_queries.sql
data/output/bi/metabase_dashboard_notes.md

The data/output/bi/ directory is intentionally ignored by Git. It contains local run artifacts, not source files.

Project structure after the update

The project remains small, but the structure now shows a clearer path from input data to reporting preparation:

data-quality-etl-starter/
├── data/
│   ├── input/
│   ├── expected/
│   └── output/
├── docs/
│   ├── analytics.md
│   ├── api.md
│   ├── bi.md
│   ├── metabase.md
│   └── postgres.md
├── screenshots/
├── scripts/
│   ├── generate_sample_data.py
│   ├── run_analytics_demo.py
│   └── run_bi_demo.py
├── src/dq_etl_starter/
│   ├── analytics.py
│   ├── bi.py
│   ├── clean.py
│   ├── exporters.py
│   ├── services.py
│   └── validate.py
└── tests/

The important design choice is that the BI-ready path does not replace the original workflow.

It builds on it.

The project still starts with data validation and cleaning. The reporting layer comes later.

Install the project locally

Clone the repository:

git clone https://github.com/OnerGit/data-quality-etl-starter.git
cd data-quality-etl-starter

Create a virtual environment:

python -m venv .venv

Activate it on macOS or Linux:

source .venv/bin/activate

Activate it on Windows PowerShell:

.venv\Scripts\activate

Install dependencies and the local package:

pip install -r requirements.txt
pip install -e .

The editable install step matters because the project uses a src/ layout: until the package is installed, Python will not find dq_etl_starter on its default import path.

Step 1: Generate synthetic order data

The BI demo uses generated synthetic order data.

It does not use real customer data. It does not download private data. It is designed to be reproducible and safe for a public portfolio project.

Generate 100,000 rows:

python scripts/generate_sample_data.py \
  --rows 100000 \
  --output data/generated/orders_100k.csv \
  --seed 42

Windows PowerShell:

python scripts/generate_sample_data.py `
  --rows 100000 `
  --output data/generated/orders_100k.csv `
  --seed 42

This synthetic dataset intentionally includes data quality issues, such as missing values, invalid emails, duplicate rows, invalid dates, negative quantities, and inconsistent country values.

That makes the demo more useful than a perfectly clean sample file.

Step 2: Start PostgreSQL

Start the local PostgreSQL service with Docker Compose:

docker compose up -d postgres

Check that the container is running:

docker compose ps

The demo uses PostgreSQL as the reporting database because it is a practical next step after local CSV, SQLite, and Parquet outputs.

SQLite is still useful for the default local workflow. PostgreSQL is useful when reporting tables should be available through a database connection.

Step 3: Run the BI demo

Run the BI demo:

python scripts/run_bi_demo.py \
  --input data/generated/orders_100k.csv \
  --schema data/expected/generated_order_schema.json \
  --output-dir data/output/bi \
  --db-url postgresql+psycopg://dq_user:dq_password@localhost:5432/dq_demo

Windows PowerShell:

python scripts/run_bi_demo.py `
  --input data/generated/orders_100k.csv `
  --schema data/expected/generated_order_schema.json `
  --output-dir data/output/bi `
  --db-url postgresql+psycopg://dq_user:dq_password@localhost:5432/dq_demo

The demo runs the generated input through the existing validation and cleaning logic, prepares analytics-ready rows, loads reporting tables into PostgreSQL, creates SQL views, and writes local documentation files.

Expected local files:

data/output/bi/bi_summary_report.md
data/output/bi/reporting_queries.sql
data/output/bi/metabase_dashboard_notes.md

Expected PostgreSQL reporting tables:

cleaned_orders
customer_summary
revenue_by_country
orders_by_month
source_system_summary

Expected PostgreSQL reporting views:

vw_revenue_by_country
vw_orders_by_month
vw_source_system_quality
vw_monthly_revenue_trend

Step 4: Inspect reporting tables and views

After running the BI demo, inspect the database.

List reporting tables:

docker exec -it dq_etl_postgres psql -U dq_user -d dq_demo -c "\dt"

List reporting views:

docker exec -it dq_etl_postgres psql -U dq_user -d dq_demo -c "\dv"

Preview the revenue-by-country view:

docker exec -it dq_etl_postgres psql -U dq_user -d dq_demo -c "SELECT * FROM vw_revenue_by_country LIMIT 10;"

Preview the monthly revenue trend:

docker exec -it dq_etl_postgres psql -U dq_user -d dq_demo -c "SELECT * FROM vw_monthly_revenue_trend ORDER BY order_month LIMIT 12;"

[Screenshot: reporting table and view output.]

The reporting tables and views are the core of the v0.5.0 update.

The cleaned data is no longer only a local file. It is now available through a reporting database with reusable SQL views.

Step 5: Start optional Metabase

Metabase is optional.

The core workflow does not require it. The PostgreSQL reporting tables and SQL views are the important part.

Metabase is included only as a local dashboard exploration layer.

Start Metabase:

docker compose up -d metabase

Open:

http://localhost:3000

When Metabase runs through Docker Compose, connect it to PostgreSQL with these values:

Database type: PostgreSQL
Host: postgres
Port: 5432
Database name: dq_demo
Username: dq_user
Password: dq_password

Use postgres as the host because Metabase is running inside the Docker Compose network.

Use localhost only when connecting from your host machine, such as with psql or a desktop database client.
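The host rule can be written down as a tiny helper. This is a minimal sketch, not code from the repository: the function name build_db_url and the inside_compose flag are hypothetical. The credentials, port, database name, and URL scheme are the ones used in the commands above. Metabase itself takes these values as separate form fields, so the URL form is only relevant for scripts.

```python
def build_db_url(inside_compose: bool) -> str:
    """Return a PostgreSQL URL for the demo database.

    Hypothetical helper: `inside_compose` says whether the caller runs
    as a service inside the Docker Compose network.
    """
    # Inside the Compose network the service name works as a hostname.
    # From the host machine the published port is reached via localhost.
    host = "postgres" if inside_compose else "localhost"
    return f"postgresql+psycopg://dq_user:dq_password@{host}:5432/dq_demo"


print(build_db_url(inside_compose=False))
print(build_db_url(inside_compose=True))
```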

[Screenshot: Metabase connection step.]

Step 6: Create a simple dashboard

The project does not ship a production dashboard.

Instead, it provides suggested dashboard cards based on the reporting views.

Useful starting points include:

vw_revenue_by_country
vw_orders_by_month
vw_monthly_revenue_trend
vw_source_system_quality

Suggested cards:

revenue by country;
orders by month;
monthly revenue trend;
orders by source system;
average order value by country.

[Screenshot: a simple dashboard built from the reporting views.]

The dashboard is intentionally basic.

Its job is not to impress with design. Its job is to prove that cleaned and reporting-ready data can be loaded into PostgreSQL and explored by a dashboard tool.

BI summary report

The demo also writes a Markdown report:

data/output/bi/bi_summary_report.md

The report records:

raw row count;
cleaned row count;
analytics-ready row count;
reporting tables created;
reporting views created;
local output files;
recommended dashboard cards;
scope notes.

[Screenshot: BI summary report.]

This file is useful for handoff.

Instead of only saying "the script ran", the project leaves behind a simple written record of what was created.

Why use SQL views?

The reporting views are small, but they are important.

They separate raw or cleaned tables from reporting-facing queries.

For example, a view such as vw_revenue_by_country gives a dashboard tool a stable object to query. If the underlying table logic changes later, the dashboard can still point to the view.

This is a common reporting pattern:

cleaned table
        ↓
summary table
        ↓
reporting view
        ↓
dashboard card

For a small project, SQL views provide a good balance between simplicity and structure.

They are easier to review than a hidden dashboard-only query, and they are lighter than introducing a full modeling framework too early.
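The stable-object idea can be shown in a few lines. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a database service. The sample rows, the column names, and the view definition are assumptions for illustration, not the repository's actual SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cleaned_orders (country TEXT, revenue REAL);
    INSERT INTO cleaned_orders VALUES
        ('DE', 100.0), ('DE', 50.0), ('US', 70.0);

    -- Reporting-facing object: dashboards query the view, not the table.
    CREATE VIEW vw_revenue_by_country AS
    SELECT country,
           ROUND(SUM(revenue), 2) AS total_revenue,
           COUNT(*) AS order_count
    FROM cleaned_orders
    GROUP BY country;
    """
)

# A dashboard card would issue a query like this against the view.
rows = conn.execute(
    "SELECT country, total_revenue, order_count "
    "FROM vw_revenue_by_country ORDER BY total_revenue DESC"
).fetchall()
print(rows)
```

If the summary logic later moves into a different table, only the view definition has to change. The query behind the dashboard card stays the same.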

What this proves

This v0.5.0 update demonstrates several practical capabilities:

generating safe synthetic data;
running data validation before reporting;
cleaning and preparing analytics-ready rows;
loading reporting tables into PostgreSQL;
creating reusable SQL views;
connecting a local BI tool to the reporting database;
documenting output files and scope;
keeping generated artifacts out of Git.

This is especially relevant for small data workflow projects.

Many clients do not need a complex platform at the beginning. They need a reliable workflow that turns messy files into something a reporting tool can use.

What is intentionally still out of scope?

This demo intentionally does not include:

production Metabase deployment;
cloud hosting;
user authentication;
embedded analytics;
multi-tenant dashboarding;
data warehouse modeling;
Airflow orchestration;
dbt project structure;
Spark processing;
scheduled production jobs;
real customer data;
AI, LLM, RAG, or agent features.

Those can be valid tools in other contexts.

For this project, adding them too early would make the starter harder to run and harder to review.

The current goal is narrower:

small local workflow
validated and cleaned data
PostgreSQL reporting tables
simple SQL views
optional dashboard exploration
clear documentation

Run the tests

Check that the sources compile, then run the full test suite:

python -m compileall -q src/dq_etl_starter
python -m compileall -q scripts
pytest

Run only the v0.5 BI-related tests:

pytest tests/test_bi.py
pytest tests/test_bi_reporting_sql.py

PostgreSQL integration tests should remain optional and should be skipped unless DATABASE_URL is configured.

This is important because the project should stay easy to test even when external services are not running.
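One way to express that skip rule is sketched below. It is an assumption about the pattern, not the repository's test code: the class and test names are illustrative, and it uses the standard library's unittest so it runs anywhere. pytest collects unittest.TestCase classes, and pytest.mark.skipif is the pytest-native equivalent of the decorator used here.

```python
import os
import unittest

# DATABASE_URL is the variable named above; everything else is illustrative.
DATABASE_URL = os.environ.get("DATABASE_URL")


@unittest.skipUnless(DATABASE_URL, "DATABASE_URL is not configured")
class PostgresIntegrationTest(unittest.TestCase):
    def test_url_points_at_postgres(self):
        # A real integration test would connect and run a query here.
        self.assertTrue(DATABASE_URL.startswith("postgresql"))


# Run the test case in-process and report how many tests were skipped.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PostgresIntegrationTest)
result = unittest.TestResult()
suite.run(result)
print(f"skipped={len(result.skipped)} failures={len(result.failures)}")
```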

Run the default Docker workflow

The default Docker run remains simple:

docker build -t data-quality-etl-starter:0.5.0 -t data-quality-etl-starter:latest .
docker run --rm data-quality-etl-starter:0.5.0

The BI demo uses Docker Compose services for PostgreSQL and optional Metabase, but the default image still supports a reproducible CLI workflow.

That separation keeps the project easier to understand.

How this maps to freelance client work

This update maps to a realistic client request:

"We have messy order/customer exports. Can you clean them, load them into a database, and prepare a few reporting tables or dashboard-ready views?"

A first version does not always need a full data warehouse.

A practical milestone might be:

generate or receive input data;
validate the schema;
clean duplicates and bad values;
load reporting tables into PostgreSQL;
create SQL views for common metrics;
connect a dashboard tool;
document how to rerun the workflow.

That is exactly the kind of path this v0.5.0 demo is designed to show.

For freelance work, the value is in the handoff:

clear commands;
local reproducibility;
synthetic data for safe demos;
database tables that can be inspected;
SQL views that can be reviewed;
screenshots that show the workflow;
a summary report that documents the run;
scope notes that prevent overclaiming.

What I would improve next

Possible next improvements include:

adding a few more reporting views;
adding richer schema profiling;
adding better BI summary formatting;
adding a Makefile for repeated demo commands;
adding GitHub Actions for test runs;
adding a short validation note based on a real public dataset;
adding optional scheduled local runs;
adding clearer logging for each workflow stage.

The main constraint remains the same:

Do not turn the starter into a heavy platform too early.

It should stay small, reproducible, and easy to adapt.

Repository

GitHub repository:

https://github.com/OnerGit/data-quality-etl-starter

From Data Quality Checks to Analytics-Ready Parquet with Python

This v0.5.0 update is a practical next step: from cleaned analytics-ready data to PostgreSQL reporting tables, SQL views, and an optional local dashboard demo.

From Data Quality Checks to Analytics-Ready Parquet with Python

Bob Oner — Tue, 09 Jun 2026 04:47:54 +0000

In the first article, I walked through a small Python data quality ETL starter that reads messy CSV, Excel, JSON, and API-style data, validates it, cleans it, exports it, and generates quality reports.

Build a Python Data Quality ETL Starter for Messy CSV, Excel, JSON, and API-Style Data

This follow-up focuses on the v0.4.0 update of the same project:

Data Quality ETL Starter on GitHub

The new goal is still intentionally small. This is not a data warehouse, a BI platform, an Airflow project, a dbt project, or an AI data application.

The goal is to extend the starter workflow from small sample files to a more realistic local analytics demo:

generated messy order data
        ↓
existing validation and cleaning logic
        ↓
cleaned CSV
        ↓
cleaned Parquet
        ↓
DuckDB query demo
        ↓
summary CSV tables
        ↓
benchmark_report.md

This makes the project more useful as a portfolio asset because it demonstrates not only data cleaning, but also the next handoff step: producing files that are easier to analyze locally.

Why add an analytics-ready export path?

Many small-team data workflows do not end with a cleaned CSV file.

A cleaned CSV is useful, but a reporting workflow often needs one more layer:

a file format suitable for repeated analysis;
simple summary tables;
SQL queries that can be reviewed and reused;
a lightweight benchmark report;
a way to test the workflow with more than a tiny demo file.

That is the reason v0.4.0 adds an optional analytics-ready path.

The project still keeps the original CLI workflow as the source of truth. The analytics demo is an additional path, not a replacement for the existing CSV, Excel, JSON, mock API, SQLite, PostgreSQL, or FastAPI validation workflows.

What v0.4.0 adds

The v0.4.0 update adds three main pieces.

First, it adds a synthetic order data generator:

scripts/generate_sample_data.py

Second, it adds an analytics demo runner:

scripts/run_analytics_demo.py

Third, it adds analytics helpers under the package source:

src/dq_etl_starter/analytics.py

Together, these files show how to generate repeatable synthetic data, run it through the existing cleaning and validation logic, and produce analytics-ready outputs.

The project version is now 0.4.0.

Project structure after the update

The relevant structure now looks like this:

data-quality-etl-starter/
├── data/
│   ├── input/
│   ├── expected/
│   └── output/
├── docs/
│   ├── api.md
│   ├── analytics.md
│   └── postgres.md
├── screenshots/
├── scripts/
│   ├── generate_sample_data.py
│   └── run_analytics_demo.py
├── src/dq_etl_starter/
│   ├── analytics.py
│   ├── api.py
│   ├── clean.py
│   ├── cli.py
│   ├── exporters.py
│   ├── readers.py
│   ├── services.py
│   └── validate.py
└── tests/

This is still a small project, but the workflow now has a clearer progression:

local data quality workflow;
optional PostgreSQL export;
optional FastAPI validation wrapper;
optional analytics-ready export demo.

That progression matters because it maps to how small client projects often grow.

A client may first need a repeatable CSV cleanup script. Later, they may ask for database export, an API endpoint, or files that can feed reporting tools.

Install the project locally

Clone the repository:

git clone https://github.com/OnerGit/data-quality-etl-starter.git
cd data-quality-etl-starter

Create a virtual environment:

python -m venv .venv

Activate it on macOS or Linux:

source .venv/bin/activate

Activate it on Windows PowerShell:

.venv\Scripts\activate

Install dependencies and the local package:

pip install -r requirements.txt
pip install -e .

The editable install step is useful because the repository uses a src/ layout.

For the v0.4 analytics demo, the important optional dependencies are:

pyarrow for Parquet output;
duckdb for local SQL queries over Parquet files;
faker for generated demo data.

Generate synthetic order data

The generator creates deterministic synthetic customer/order-style data.

It does not download real customer data. It does not use a private dataset. It is designed only for testing and demonstration.

Generate 1,000 rows:

python scripts/generate_sample_data.py \
  --rows 1000 \
  --output data/generated/orders_1k.csv \
  --seed 42

Generate 10,000 rows:

python scripts/generate_sample_data.py \
  --rows 10000 \
  --output data/generated/orders_10k.csv \
  --seed 42

Generate 100,000 rows:

python scripts/generate_sample_data.py \
  --rows 100000 \
  --output data/generated/orders_100k.csv \
  --seed 42

Windows PowerShell example:

python scripts/generate_sample_data.py `
  --rows 100000 `
  --output data/generated/orders_100k.csv `
  --seed 42

[Screenshot: the 100,000-row generation step.]

The fixed seed makes the output reproducible. That is useful for documentation, tests, demos, and future comparisons.

What kind of issues are introduced?

The generated data intentionally includes common data quality issues.

Examples include:

missing email values;
invalid email values;
missing country values;
inconsistent country casing;
duplicate rows;
invalid dates;
negative quantities;
zero prices.

This is important because a data quality demo should not only process clean data.

If the generated data is too perfect, the validation and cleaning workflow does not prove much. The point is to create enough realistic messiness to exercise the existing workflow.
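A seeded generator with injected issues can be sketched as follows. This is not the repository's generate_sample_data.py: it uses only the standard library's random module instead of faker, and the column names, the pick helper, and the 5% error rate are assumptions chosen for illustration.

```python
import csv
import io
import random


def pick(rng, good, bad_values, bad_rate=0.05):
    """Return `good` most of the time and a deliberately bad value otherwise."""
    return rng.choice(bad_values) if rng.random() < bad_rate else good


def generate_orders(rows: int, seed: int) -> list[dict]:
    """Generate `rows` messy orders plus occasional exact duplicates."""
    rng = random.Random(seed)  # local RNG, so the same seed gives the same rows
    orders = []
    for i in range(1, rows + 1):
        order = {
            "order_id": i,
            "email": pick(rng, f"user{i}@example.com", ["", "not-an-email"]),
            "country": pick(rng, "Germany", ["", "germany", "GERMANY"]),
            "order_date": pick(rng, "2026-01-15", ["2026-13-40"]),
            "quantity": pick(rng, rng.randint(1, 5), [-1]),
            "unit_price": pick(rng, round(rng.uniform(5, 100), 2), [0.0]),
        }
        orders.append(order)
        if rng.random() < 0.05:  # occasionally emit an exact duplicate row
            orders.append(dict(order))
    return orders


def to_csv(orders: list[dict]) -> str:
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(orders[0].keys()))
    writer.writeheader()
    writer.writerows(orders)
    return buffer.getvalue()


first = to_csv(generate_orders(rows=1000, seed=42))
second = to_csv(generate_orders(rows=1000, seed=42))
print(first == second)  # identical output for identical seeds
```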

Run the analytics-ready export demo

After generating the input file, run the analytics demo:

python scripts/run_analytics_demo.py \
  --input data/generated/orders_100k.csv \
  --schema data/expected/generated_order_schema.json \
  --output-dir data/output/analytics

Windows PowerShell:

python scripts/run_analytics_demo.py `
  --input data/generated/orders_100k.csv `
  --schema data/expected/generated_order_schema.json `
  --output-dir data/output/analytics

The script runs the generated input through the existing validation and cleaning logic, then writes analytics-ready outputs.

Expected output files:

data/output/analytics/cleaned_orders.csv
data/output/analytics/cleaned_orders.parquet
data/output/analytics/customer_summary.csv
data/output/analytics/revenue_by_country.csv
data/output/analytics/orders_by_month.csv
data/output/analytics/source_system_summary.csv
data/output/analytics/analytics_queries.sql
data/output/analytics/benchmark_report.md

[Screenshot: analytics output and DuckDB query demo.]

The output directory is intentionally excluded from Git. Large generated files and local analytics outputs should not be committed to the repository.

Why export Parquet?

CSV is easy to inspect and share. It is a good default output for small workflows.

Parquet is useful when the same cleaned dataset will be queried repeatedly or used by analytics tools. It stores column types in the file itself, whereas CSV stores every value as text, and it is widely supported in data workflows.

In this project, Parquet is not used to make the project sound bigger than it is. It is used as a practical local export format after the cleaning step.

The key handoff idea is:

cleaned CSV for readability
cleaned Parquet for local analytics
summary CSV files for reporting
SQL file for repeatable queries
benchmark report for documentation

That combination is still small, but it is more useful than a single cleaned CSV file.
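The point about column types is easy to check for the CSV side with the standard library alone. The sketch below is illustrative and uses made-up column names. It writes one typed row to CSV and reads it back. It does not write Parquet, which in this project requires pyarrow.

```python
import csv
import io
from datetime import date

rows = [{"order_id": 1, "quantity": 3, "revenue": 29.97, "order_date": date(2026, 1, 15)}]

# Write the typed row to CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)

# Read it back: the CSV reader has no type information to restore.
buffer.seek(0)
restored = list(csv.DictReader(buffer))

original_types = {k: type(v).__name__ for k, v in rows[0].items()}
restored_types = {k: type(v).__name__ for k, v in restored[0].items()}
print(original_types)
print(restored_types)
```

Every value comes back as a string, so each consumer of the CSV has to parse numbers and dates again. A Parquet file carries its schema with the data, which is why it is the more convenient format for repeated queries.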

Query Parquet locally with DuckDB

The demo writes reusable SQL to:

data/output/analytics/analytics_queries.sql

An example query looks like this:

SELECT
    country,
    ROUND(SUM(revenue), 2) AS total_revenue,
    COUNT(*) AS order_count
FROM read_parquet('data/output/analytics/cleaned_orders.parquet')
GROUP BY country
ORDER BY total_revenue DESC
LIMIT 10;

This query uses DuckDB to read the Parquet file directly.

That is a useful pattern for small local workflows because it avoids setting up a separate database service just to inspect analytics-ready output.

For a portfolio project, it also shows a clear bridge between Python data cleaning and SQL-based analysis.

Summary tables

The analytics demo produces several summary CSV files:

customer_summary.csv
revenue_by_country.csv
orders_by_month.csv
source_system_summary.csv

These are intentionally simple.

They are not a BI dashboard. They are not a reporting platform. They are small output tables that show how the cleaned data can be prepared for the next step.

For example:

revenue_by_country.csv can support a country-level revenue view;
orders_by_month.csv can support a monthly trend view;
source_system_summary.csv can help compare different input sources;
customer_summary.csv can support customer-level reporting.

In client work, this kind of output is often enough for a first automation milestone. The client can open the CSV files, load them into a spreadsheet, connect them to a BI tool, or use them as the input for the next version.
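A summary table of this kind is a plain group-and-aggregate step. The sketch below shows one possible shape for revenue_by_country using only the standard library. The function and the sample rows are hypothetical. The country and revenue column names follow the DuckDB query shown earlier.

```python
from collections import defaultdict


def revenue_by_country(rows: list[dict]) -> list[dict]:
    """Aggregate cleaned order rows into one summary row per country."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row["country"]] += row["revenue"]
        counts[row["country"]] += 1
    summary = [
        {"country": c, "total_revenue": round(totals[c], 2), "order_count": counts[c]}
        for c in totals
    ]
    # Highest revenue first, matching the ORDER BY in the SQL example.
    return sorted(summary, key=lambda r: r["total_revenue"], reverse=True)


cleaned = [
    {"country": "DE", "revenue": 100.0},
    {"country": "US", "revenue": 70.0},
    {"country": "DE", "revenue": 50.5},
]
print(revenue_by_country(cleaned))
```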

Benchmark report

The analytics demo also writes a Markdown benchmark report:

data/output/analytics/benchmark_report.md

[Screenshot: benchmark report.]

The report records information such as:

input file path;
raw row count;
cleaned row count;
analytics-ready row count;
validation issue count;
runtime seconds;
output file paths;
DuckDB preview query result.

The report is not meant to be a formal performance benchmark.

It is a lightweight run record. Its purpose is to make each run easier to review, compare, and hand off.
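A run record like this needs little more than a timer and a few counters. The sketch below is a simplified assumption about the shape of such a report, not the repository's implementation. The build_run_report function is hypothetical, and the read and clean steps are replaced by stand-ins.

```python
import time


def build_run_report(input_path: str, raw_rows: int, cleaned_rows: int,
                     issue_count: int, runtime_seconds: float) -> str:
    """Render a small Markdown run record from counters collected during a run."""
    lines = [
        "# Benchmark report",
        "",
        f"- Input file: {input_path}",
        f"- Raw row count: {raw_rows}",
        f"- Cleaned row count: {cleaned_rows}",
        f"- Validation issue count: {issue_count}",
        f"- Runtime seconds: {runtime_seconds:.2f}",
    ]
    return "\n".join(lines) + "\n"


start = time.perf_counter()
raw = list(range(1000))               # stand-in for reading the input file
cleaned = [r for r in raw if r % 10]  # stand-in for cleaning: drops 100 rows
elapsed = time.perf_counter() - start

report = build_run_report(
    "data/generated/orders_1k.csv",
    raw_rows=len(raw),
    cleaned_rows=len(cleaned),
    issue_count=len(raw) - len(cleaned),
    runtime_seconds=elapsed,
)
print(report)
```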

How validation still fits in

The v0.4 analytics path does not bypass the existing validation workflow.

The script still loads a schema file, reads the input CSV, validates expected fields, cleans the data, and then prepares the analytics output.

That matters because analytics output should not be generated from uninspected raw input.

The basic flow is:

read generated CSV
        ↓
load expected schema
        ↓
validate schema rules
        ↓
clean duplicate and text values
        ↓
prepare analytics columns
        ↓
write CSV, Parquet, summary tables, SQL, and benchmark report

This keeps the analytics demo connected to the original purpose of the project: data quality before reporting.
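The clean and prepare steps of that flow can be sketched in plain Python. This is an illustration under assumptions, not the project's clean.py or analytics.py. The rule that drops rows with non-positive quantity or price is hypothetical. The revenue and order_month column names match the queries shown in this series, and unit_price and quantity are assumed input columns.

```python
def clean_rows(rows: list[dict]) -> list[dict]:
    """Trim text values and drop exact duplicate rows, keeping first occurrences."""
    seen = set()
    cleaned = []
    for row in rows:
        trimmed = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        key = tuple(sorted(trimmed.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            cleaned.append(trimmed)
    return cleaned


def add_analytics_columns(rows: list[dict]) -> list[dict]:
    """Derive revenue and order_month, skipping rows that cannot support them."""
    ready = []
    for row in rows:
        if row["quantity"] <= 0 or row["unit_price"] <= 0:
            continue  # hypothetical rule: not usable for revenue reporting
        ready.append({
            **row,
            "revenue": round(row["quantity"] * row["unit_price"], 2),
            "order_month": row["order_date"][:7],  # "YYYY-MM-DD" -> "YYYY-MM"
        })
    return ready


raw = [
    {"country": " DE ", "quantity": 2, "unit_price": 10.0, "order_date": "2026-01-15"},
    {"country": "DE", "quantity": 2, "unit_price": 10.0, "order_date": "2026-01-15"},
    {"country": "US", "quantity": -1, "unit_price": 5.0, "order_date": "2026-02-01"},
]
cleaned = clean_rows(raw)
ready = add_analytics_columns(cleaned)
print(len(raw), len(cleaned), len(ready))
```

The three counts correspond to the raw, cleaned, and analytics-ready row counts that the benchmark report records.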

What is intentionally still out of scope?

The v0.4.0 update does not add:

user login;
frontend application;
async task queue;
cloud deployment;
BI dashboard;
Metabase;
data warehouse implementation;
Airflow;
dbt;
Spark;
SQLModel metadata layer;
AI, LLM, RAG, or agent features.

Those tools can be useful in the right context, but they would make this starter project heavier than necessary.

The design goal remains:

small
runnable
testable
documented
screenshot-ready
easy to inspect

That is more useful for this stage than adding a large platform stack.

Run the tests

Check that the sources compile, then run the full test suite:

python -m compileall -q src/dq_etl_starter
python -m compileall -q scripts
pytest

Run only the v0.4 analytics-related tests:

pytest tests/test_generate_sample_data.py
pytest tests/test_analytics.py
pytest tests/test_exporters_parquet.py

The optional PostgreSQL integration tests should remain optional and should be skipped unless DATABASE_URL is configured.

This is another reason to keep the workflow modular. The generator, analytics helpers, Parquet exporter, and original ETL workflow can be checked independently.

Run with Docker

Build and run the default Docker workflow:

docker build -t data-quality-etl-starter:0.4.0 -t data-quality-etl-starter:latest .
docker run --rm data-quality-etl-starter:0.4.0

The default Docker run remains a simple reproducible CLI workflow.

This is intentional. The Docker path should not become complicated just because the project gained an optional analytics demo.

How this maps to freelance client work

This update maps well to common small data workflow requests.

Examples include:

generating test data before working with private client data;
validating messy order or customer exports;
cleaning data before monthly reporting;
producing Parquet files for local analytics;
writing repeatable DuckDB SQL queries;
producing summary CSV files for spreadsheet or BI handoff;
documenting each run with a benchmark or quality report.

For a client, this kind of workflow is valuable because it is practical and reviewable.

It answers questions like:

What did the script read?
What did it clean?
What outputs did it produce?
Can the same workflow run again next week?
Can the output be inspected without a custom application?
Can the workflow be extended without rewriting everything?

That is the level of reliability many small automation projects need before they become larger systems.

What I would improve next

The next version could move one step closer to reporting workflows without turning the project into a full BI platform.

Possible next steps:

write PostgreSQL reporting tables or views;
add more summary table examples;
add a lightweight local BI demo;
improve benchmark report formatting;
add better schema profiling;
add more realistic validation rules;
add CI for automated test runs.

The important constraint is to keep the project focused.

The current project is a small Python data quality ETL starter. The v0.4.0 update makes it easier to demonstrate analytics-ready output, but it is still not trying to become a full data platform.

Repository

GitHub repository:

https://github.com/OnerGit/data-quality-etl-starter

Build a Python Data Quality ETL Starter for Messy CSV, Excel, JSON, and API-Style Data

This v0.4.0 update is a practical next step: from cleaning and validation to analytics-ready local outputs that can be inspected, queried, and handed off.

Build a Python Data Quality ETL Starter for Messy CSV, Excel, JSON, and API-Style Data

Bob Oner — Wed, 03 Jun 2026 07:43:59 +0000

Small teams often receive data before it is ready for reporting.

It may come from a CSV export, an Excel file, a JSON payload, or an API-style response. The structure is usually close enough to be useful, but not clean enough to trust directly.

Common problems include:

inconsistent column names
missing values
duplicate rows
invalid emails
bad dates
numeric fields stored as messy text
nested JSON that needs to become a table
repeated manual cleanup before every report
no quality report for handoff

This article walks through a small open-source Python project I built to handle that kind of problem:

Data Quality ETL Starter

It is not a big data platform. It is not an Airflow or dbt project. It is not a production data warehouse. It is a small, reproducible starter workflow for data validation, cleaning, export, and reporting.

What this project builds

The project takes messy input data and runs it through a repeatable workflow:

messy CSV / Excel / JSON / mock API data
        ↓
read and flatten
        ↓
normalize columns
        ↓
validate expected schema rules
        ↓
clean duplicate rows and text values
        ↓
export cleaned CSV + SQLite
        ↓
generate data quality report

The current version supports:

CSV input
Excel input
nested JSON input
mock API-style JSON input
column name normalization
Pydantic-based workflow and schema models
missing value, duplicate row, email, date, and number validation
cleaned CSV output
SQLite output by default
optional PostgreSQL export
Markdown and JSON data quality reports
CLI execution
pytest tests
Docker-based execution

The main goal is not to build a complex platform. The goal is to show how a small data workflow can be structured, tested, documented, and adapted.

Why this matters for small-team data workflows

Many small-team data problems are not "big data" problems.

They are repeatability problems.

For example:

a sales team exports a customer list every week
an operations team receives Excel files from different sources
a founder wants a simple CSV-to-report workflow
an analyst needs JSON payloads flattened into reporting tables
a freelancer needs to hand off a data cleanup script that another person can run

A one-off script can solve one file once.

A small workflow is more useful because it makes the steps explicit:

What data did we receive?
What rules did we expect?
What changed during cleaning?
What output files were created?
What warnings should the next person review?

That is the reason this project writes both cleaned data and a quality report.

Project structure

The repository keeps the workflow small and modular:

data-quality-etl-starter/
├── data/
│   ├── input/
│   ├── expected/
│   └── output/
├── docs/
├── screenshots/
├── src/dq_etl_starter/
├── tests/
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── README.md

The most important source modules are:

src/dq_etl_starter/
├── readers.py       # read CSV, Excel, and JSON files
├── mock_api.py      # simulate API-style JSON without network calls
├── normalize.py     # normalize columns and flatten JSON
├── validate.py      # validate expected columns and simple schema rules
├── clean.py         # trim text values and drop duplicates
├── exporters.py     # export cleaned data to CSV, SQLite, or PostgreSQL
├── report.py        # generate Markdown and JSON reports
├── models.py        # Pydantic models for workflow contracts
└── cli.py           # command-line entry point

This separation is intentional. Each step can be tested, replaced, or extended without turning the project into one long script.
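As an example of a step that can be tested on its own, column normalization might look like the sketch below. It is an assumption about what such a step could do, not the contents of normalize.py, and both function names are hypothetical.

```python
import re


def normalize_column_name(name: str) -> str:
    """Lowercase a header and collapse non-alphanumeric runs into underscores."""
    snake = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return snake.strip("_").lower()


def normalize_columns(row: dict) -> dict:
    """Apply the header rule to every key of one record."""
    return {normalize_column_name(k): v for k, v in row.items()}


print(normalize_columns({" Customer Name ": "Ada", "E-Mail": "ada@example.com"}))
```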

Setup

Clone the repository:

git clone https://github.com/OnerGit/data-quality-etl-starter.git
cd data-quality-etl-starter

Create a virtual environment:

python -m venv .venv

Activate it on macOS or Linux:

source .venv/bin/activate

Activate it on Windows PowerShell:

.venv\Scripts\activate

Install dependencies and the local package:

pip install -r requirements.txt
pip install -e .

The editable install step matters because the source code uses a src/ layout: until the package is installed, Python will not find dq_etl_starter on its default import path.

Input data examples

The project includes sample inputs under data/input/.

The examples are designed to represent common data workflow formats:

data/input/messy_customers.csv
data/input/messy_orders.xlsx
data/input/nested_customers.json
data/input/mock_api_orders.json

[Image: an example of the kind of messy source data the workflow is designed to handle.]

The mock API file does not call a real external API. It simulates an API-style JSON response so the workflow remains reproducible and does not require API keys.

That is useful for a starter project because anyone can run it locally without creating accounts, setting secrets, or depending on an external service.

Run the CSV workflow

The CSV example reads a messy customer file, validates it against an expected schema, cleans it, exports the result, and generates reports.

python -m dq_etl_starter.cli run \
  --input data/input/messy_customers.csv \
  --input-type csv \
  --schema data/expected/customer_schema.json \
  --output-dir data/output/csv \
  --db-target sqlite \
  --table-name cleaned_customers

Expected outputs:

data/output/csv/cleaned_customers.csv
data/output/csv/etl_output.sqlite
data/output/csv/quality_report.md
data/output/csv/quality_report.json

The important point is that the workflow does not only create a cleaned file. It also records what happened.

Run the Excel workflow

Excel exports are common in small business and operations workflows.

Run the sample Excel input:

python -m dq_etl_starter.cli run \
  --input data/input/messy_orders.xlsx \
  --input-type excel \
  --schema data/expected/order_schema.json \
  --output-dir data/output/excel \
  --db-target sqlite \
  --table-name cleaned_orders

The output pattern is the same:

data/output/excel/cleaned_orders.csv
data/output/excel/etl_output.sqlite
data/output/excel/quality_report.md
data/output/excel/quality_report.json

Keeping the CSV and Excel output folders separate makes it easier to compare runs without overwriting previous reports.

Run nested JSON data

JSON is useful for APIs and application exports, but nested JSON is not always reporting-ready.

This project supports a --records-path option so the workflow can extract a list of records from a nested payload.

python -m dq_etl_starter.cli run \
  --input data/input/nested_customers.json \
  --input-type json \
  --records-path data.customers \
  --schema data/expected/customer_schema.json \
  --output-dir data/output/json \
  --db-target sqlite \
  --table-name cleaned_customers_json

This step demonstrates a practical pattern: convert nested data into a tabular structure before validation, cleaning, and reporting.
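
As a rough illustration of that pattern, here is a minimal standard-library sketch of how a dotted records path such as data.customers could be resolved and the resulting records flattened. The names extract_records and flatten, and the sample payload, are hypothetical; the project's own implementation may work differently.

```python
import json


def extract_records(payload, records_path):
    """Walk a dotted path (e.g. "data.customers") and return the list found there."""
    node = payload
    for key in records_path.split("."):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"records path segment not found: {key!r}")
        node = node[key]
    if not isinstance(node, list):
        raise TypeError("records path must point to a list of records")
    return node


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical payload shaped like a nested API-style export.
raw = json.loads("""
{"data": {"customers": [
    {"id": 1, "name": " Ada ", "contact": {"email": "ada@example.com"}},
    {"id": 2, "name": "Bob", "contact": {"email": null}}
]}}
""")

# Each flattened dict can now be treated as one table row.
rows = [flatten(r) for r in extract_records(raw, "data.customers")]
```

Failing loudly when a path segment is missing keeps a wrong --records-path value from silently producing an empty table.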

Run mock API-style data

The mock API workflow uses a local JSON file that looks like an API response.

python -m dq_etl_starter.cli run \
  --input data/input/mock_api_orders.json \
  --input-type mock-api \
  --records-path data.orders \
  --schema data/expected/order_schema.json \
  --output-dir data/output/mock_api \
  --db-target sqlite \
  --table-name cleaned_api_orders

This is intentionally not a real API integration in the current version.

For a client project, this layer could later be replaced with a real API reader that handles authentication, pagination, retries, and rate limits. For a starter project, using a local mock payload keeps the workflow easy to run and test.

How validation works

The project uses schema files under data/expected/ to define what the workflow expects.

For example, a customer schema can describe expected columns and simple column rules. The workflow then checks the raw data before cleaning.

Validation can detect issues such as:

missing expected columns
unexpected columns
missing values
duplicate rows
invalid email formats
invalid date values
invalid number values

The validation report uses source-oriented row references for CSV-style inputs. For example, if the source CSV has one header line and five data rows, a warning on Row 6 points to the fifth data record in the source file.
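
To make the row-reference convention concrete, here is a small standard-library sketch of a few of the checks listed above. It is not the project's DataFrame-based code: the validate_rows function, the issue message wording, the simplified email pattern, and the sample data are all assumptions for illustration. It also assumes one physical line per record, so a quoted multi-line field would shift the numbering.

```python
import csv
import io
import re

# Deliberately loose pattern: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_rows(csv_text, expected_columns):
    """Return a list of issue strings using source-oriented row numbers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    actual = reader.fieldnames or []
    issues = []

    for col in expected_columns:
        if col not in actual:
            issues.append(f"missing expected column: {col}")
    for col in actual:
        if col not in expected_columns:
            issues.append(f"unexpected column: {col}")

    seen = set()
    # The header is source line 1, so the first data record is Row 2.
    for row_number, row in enumerate(reader, start=2):
        key = tuple(row.get(col) for col in actual)
        if key in seen:
            issues.append(f"Row {row_number}: duplicate row")
        seen.add(key)
        for col in actual:
            if not (row.get(col) or "").strip():
                issues.append(f"Row {row_number}: missing value in {col}")
        email = (row.get("email") or "").strip()
        if email and not EMAIL_RE.match(email):
            issues.append(f"Row {row_number}: invalid email format")
    return issues


sample = (
    "id,name,email\n"
    "1,Ada,ada@example.com\n"
    "2,Bob,not-an-email\n"
    "2,Bob,not-an-email\n"
    "3,,c@example.com\n"
)
issues = validate_rows(sample, ["id", "name", "email", "signup_date"])
```

The checks only report issues; nothing is modified at this stage, which keeps validation separate from cleaning.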

Pydantic is used for the workflow and reporting contracts. The row-level cleaning and validation remain DataFrame-based because CSV, Excel, JSON, and API-style datasets often have changing columns.

The validation step does not try to solve every business rule. That is a deliberate choice.

In real client work, every dataset has different rules. A small starter should make validation easy to extend rather than hardcoding too many assumptions.

How cleaning works

After validation, the workflow applies simple cleaning steps:

trim text values
normalize selected text fields
drop duplicate rows
keep the cleaned data in a DataFrame
export the cleaned result

The project keeps cleaning intentionally conservative.

It does not silently invent missing values. It does not guess complex business logic. It focuses on cleaning steps that are easy to explain and review.

That matters for handoff. When someone else receives the output, they should be able to understand what the script changed and what still needs human review.
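
A conservative cleaning pass of this kind can be sketched in a few lines. This version works on plain dictionaries rather than a DataFrame, and the clean_rows name and sample rows are hypothetical. One assumption to note: duplicates are detected after trimming, so rows that differ only in surrounding whitespace count as the same row.

```python
def clean_rows(rows):
    """Trim string values and drop exact duplicate rows, keeping first occurrences."""
    cleaned = []
    seen = set()
    for row in rows:
        trimmed = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in row.items()
        }
        # Sort by column name so key order does not affect duplicate detection.
        fingerprint = tuple(sorted(trimmed.items(), key=lambda kv: kv[0]))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(trimmed)
    return cleaned


raw_rows = [
    {"id": "1", "name": " Ada ", "email": "ada@example.com"},
    {"id": "1", "name": "Ada", "email": "ada@example.com "},
    {"id": "2", "name": "Bob", "email": ""},
]
cleaned = clean_rows(raw_rows)
```

The empty email in the last row is left empty on purpose: filling it in would be a guess, and the validation report is the place to flag it for review.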

Output files

Each run can create four main outputs:

cleaned data as CSV
cleaned data in SQLite
quality report as Markdown
quality report as JSON

For example:

data/output/csv/cleaned_customers.csv
data/output/csv/etl_output.sqlite
data/output/csv/quality_report.md
data/output/csv/quality_report.json

The Markdown report is useful for humans. The JSON report is useful if another tool needs to consume the result later.

A typical report includes:

raw row count
cleaned row count
column list
missing values by column
duplicate row count
missing expected columns
unexpected columns
validation issues
output file paths

The cleaned output can then be reviewed directly or used in a later reporting workflow.

This turns the workflow from "a script produced a file" into "a repeatable run produced data and a reviewable report."

Inspect SQLite output

SQLite is the default database target because it is local, portable, and easy to inspect.

After running the CSV workflow, you can open the SQLite file:

sqlite3 data/output/csv/etl_output.sqlite

Then inspect the tables:

.tables
SELECT * FROM cleaned_customers LIMIT 5;

This is useful when the cleaned data should later feed a dashboard, internal tool, or reporting process.
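
The same inspection can be done from Python with the standard sqlite3 module, which is convenient when a later script needs to read the cleaned table. The preview_table helper below is a hypothetical sketch, and it uses an in-memory database with made-up rows so it runs without the project's output file.

```python
import sqlite3


def preview_table(conn, table_name, limit=5):
    """Return (column_names, rows) for the first `limit` rows of a table."""
    # Table names cannot be bound as parameters, so check the name against
    # the database's own table list before interpolating it.
    tables = {
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    if table_name not in tables:
        raise ValueError(f"unknown table: {table_name}")
    cursor = conn.execute(f'SELECT * FROM "{table_name}" LIMIT ?', (limit,))
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


# In-memory stand-in with invented data. For a real run, connect to
# data/output/csv/etl_output.sqlite instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cleaned_customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO cleaned_customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Bob")],
)
columns, rows = preview_table(conn, "cleaned_customers")
```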

Optional PostgreSQL export

The project also includes optional PostgreSQL export support.

This is useful when cleaned data needs to be loaded into a shared database instead of a local SQLite file. SQLite remains the default local target.

Start PostgreSQL with Docker Compose:

docker compose up -d postgres

Run the workflow with PostgreSQL as the target:

DATABASE_URL=postgresql+psycopg://dq_user:dq_password@localhost:5432/dq_demo \
python -m dq_etl_starter.cli run \
  --input data/input/messy_customers.csv \
  --input-type csv \
  --schema data/expected/customer_schema.json \
  --output-dir data/output/postgres \
  --db-target postgres \
  --table-name cleaned_customers

For Windows PowerShell, set the environment variable separately:

$env:DATABASE_URL="postgresql+psycopg://dq_user:dq_password@localhost:5432/dq_demo"

python -m dq_etl_starter.cli run `
  --input data/input/messy_customers.csv `
  --input-type csv `
  --schema data/expected/customer_schema.json `
  --output-dir data/output/postgres `
  --db-target postgres `
  --table-name cleaned_customers

I would still treat PostgreSQL as an optional extension for this starter. SQLite is enough for the default local workflow.

Run tests

Run the test suite with:

pytest

The tests cover the workflow pieces that matter most for a small ETL starter:

reading input files
normalizing data
validating data
cleaning rows
exporting data
generating reports
running the CLI

Tests are important here because data cleanup scripts are easy to break when formats change.

A small test suite makes the workflow safer to modify.

Run with Docker

The project can also run inside Docker.

Build the image:

docker build -t data-quality-etl-starter .

Run the container:

docker run --rm -v "${PWD}/data/output:/app/data/output" data-quality-etl-starter

The same command works unchanged in Windows PowerShell, where ${PWD} also resolves to the current directory.

Docker is useful for handoff because it reduces local environment differences. A reviewer or client can run the workflow without manually recreating the same Python environment.

What I intentionally kept out of v0.1

This project is intentionally small.

The current version does not include:

a FastAPI service
user authentication
scheduled jobs
a web dashboard
Airflow orchestration
dbt models
cloud deployment
real external API authentication
complex business-specific validation rules

FastAPI would be a natural future layer, but it is not part of the v0.1 core.

The CLI workflow remains the source of truth. A later API layer could expose endpoints for upload, validation, and report retrieval, but that should come after the core workflow is stable.

Keeping the first version small makes the project easier to run, review, test, and adapt.

How this maps to freelance client work

This kind of starter maps well to small Python data workflow tasks.

Examples include:

cleaning CSV exports before reporting
converting Excel files into normalized CSV output
flattening JSON into tables
validating required columns before import
producing simple data quality reports
loading cleaned data into SQLite or PostgreSQL
preparing data for dashboards or internal tools
turning a manual weekly cleanup process into a repeatable command

For freelance work, the value is not only the code.

The value is the handoff:

clear commands
sample inputs
predictable outputs
validation warnings
reports
tests
Docker support
documented limitations

That is what makes a small automation project easier for another person to trust.

Next steps

The next improvements I would consider are:

add more schema rule types
improve report formatting
add richer error messages
add logging
add run IDs for report history
add a real API reader example
add a FastAPI wrapper around the CLI workflow
add more PostgreSQL examples
add CI for automated testing

The important constraint is to keep the project practical. It should remain small enough for a developer, analyst, or client to understand without needing a full data platform.

GitHub repository

The full project is available here:

https://github.com/OnerGit/data-quality-etl-starter

If you work with messy CSV, Excel, JSON, or API-style data, this kind of starter can be a useful base for building repeatable data cleaning and reporting workflows.

Build a CSV Data Quality API with FastAPI, Pandas, Pytest, and Docker

Bob Oner — Fri, 29 May 2026 05:55:19 +0000

CSV files are still everywhere.

They appear in internal operations, analytics workflows, data exports, business reports, and small automation pipelines. Even when a team already uses databases or modern data platforms, CSV is often the format used to move data between people, tools, and systems.

The problem is that CSV files are easy to create but not always safe to trust.

Before a CSV file enters a pipeline, it is useful to answer a few basic questions:

How many rows and columns does it have?
Are there missing values?
Are there duplicate rows?
Are any columns completely empty?
Do the columns match what the next system expects?
Can another script or service consume the result in a predictable format?

In this article, I will walk through a small project that turns those checks into a reusable API:

FastAPI CSV Quality API

GitHub repo:

https://github.com/OnerGit/fastapi-csv-quality-api

The goal is not to build a full data quality platform. The goal is to show a practical engineering path from a local Python workflow to a small backend service that is documented, testable, and containerized.

What we will build

The API accepts a CSV file upload and returns a structured JSON quality report.

The report includes:

row count
column count
column names
missing values by column
missing value ratio by column
duplicate row count
duplicate row ratio
empty columns
column name issues
optional expected-column validation
warnings

The project also includes:

structured error responses
pytest tests
sample CSV files
Swagger UI
Dockerfile
Docker Compose support

Tech stack

This project uses:

FastAPI for the web API
Pandas for CSV analysis
Pydantic for response models
pytest for automated tests
Uvicorn as the ASGI server
Docker and Docker Compose for containerized execution

The project structure is intentionally small:

fastapi-csv-quality-api/
├── README.md
├── LICENSE
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
├── app/
│   ├── __init__.py
│   ├── main.py
│   ├── models.py
│   ├── analyzer.py
│   └── errors.py
├── tests/
│   ├── __init__.py
│   ├── test_health.py
│   ├── test_analyze.py
│   └── fixtures/
├── sample_data/
├── screenshots/
├── docs/
└── article_assets/

The separation is simple:

main.py exposes the API routes.
analyzer.py contains the CSV analysis logic.
models.py defines typed response structures.
errors.py keeps error response helpers separate.
tests/ verifies the expected behavior.

Step 1: Run the API locally

Clone the repository:

git clone https://github.com/OnerGit/fastapi-csv-quality-api.git
cd fastapi-csv-quality-api

Create a virtual environment:

python -m venv .venv

Activate it.

On macOS or Linux:

source .venv/bin/activate

On Windows PowerShell:

.\.venv\Scripts\Activate.ps1

Install dependencies:

python -m pip install --upgrade pip
pip install -r requirements.txt

Run the API:

uvicorn app.main:app --reload

Then open:

http://127.0.0.1:8000/docs

You should see the FastAPI Swagger UI.

Step 2: Add a health check endpoint

A small service should have a simple health check endpoint. It gives us a quick way to verify that the API is running.

Example request:

curl http://127.0.0.1:8000/health

Example response:

{
  "status": "ok",
  "service": "fastapi-csv-quality-api",
  "version": "0.1.0"
}

This endpoint is also useful in tests, Docker checks, and future deployment environments.

Step 3: Build the CSV upload endpoint

The main endpoint is:

POST /analyze

It accepts a CSV file as multipart form data.

Example request:

curl -X POST "http://127.0.0.1:8000/analyze" \
  -F "file=@sample_data/good_sample.csv"

On Windows PowerShell, use curl.exe instead of curl:

curl.exe -X POST "http://127.0.0.1:8000/analyze" `
  -F "file=@sample_data/good_sample.csv"

This is important because Windows PowerShell 5.1 defines curl as an alias for Invoke-WebRequest, which does not accept curl's flags. Calling curl.exe explicitly runs the real curl executable.

Step 4: Implement basic CSV quality checks

Once the uploaded file is accepted, the analyzer reads it and computes a set of practical metrics.

For a small MVP, the most useful checks are often simple:

row_count
column_count
column_names
missing_values_by_column
missing_value_ratio_by_column
duplicate_row_count
duplicate_row_ratio
empty_columns

These checks are enough to catch many common CSV problems:

unexpected empty fields
duplicated records
fully empty columns
files with the wrong shape
files that look valid but are not useful downstream

A simplified example response looks like this:

{
  "filename": "bad_sample.csv",
  "row_count": 6,
  "column_count": 6,
  "column_names": [
    "id",
    "name",
    "email",
    "age",
    "signup_date",
    "notes"
  ],
  "missing_values_by_column": {
    "id": 0,
    "name": 1,
    "email": 2,
    "age": 1,
    "signup_date": 2,
    "notes": 6
  },
  "missing_value_ratio_by_column": {
    "id": 0.0,
    "name": 0.1667,
    "email": 0.3333,
    "age": 0.1667,
    "signup_date": 0.3333,
    "notes": 1.0
  },
  "duplicate_row_count": 1,
  "duplicate_row_ratio": 0.1667,
  "empty_columns": [
    "notes"
  ],
  "warnings": [
    "The CSV file contains 12 missing value(s).",
    "The CSV file contains 1 duplicate row(s).",
    "The CSV file contains empty column(s): notes."
  ]
}
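
To show how metrics of this shape can be derived, here is a standard-library sketch. The project itself uses Pandas for the analysis, so this is an illustration of the calculations rather than the actual analyzer.py. The analyze_csv name and sample data are hypothetical, and two choices are assumptions: whitespace-only cells count as missing, and every repeat of an already-seen row counts as one duplicate.

```python
import csv
import io


def analyze_csv(csv_text):
    """Compute basic quality metrics for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = list(reader.fieldnames or [])
    rows = [
        tuple((row.get(col) or "").strip() for col in columns)
        for row in reader
    ]
    row_count = len(rows)

    # Count empty cells per column.
    missing = {
        col: sum(1 for row in rows if row[i] == "")
        for i, col in enumerate(columns)
    }

    def ratio(count):
        return round(count / row_count, 4) if row_count else 0.0

    # Rows beyond the first occurrence of each distinct row are duplicates.
    duplicate_count = row_count - len(set(rows))

    return {
        "row_count": row_count,
        "column_count": len(columns),
        "column_names": columns,
        "missing_values_by_column": missing,
        "missing_value_ratio_by_column": {c: ratio(n) for c, n in missing.items()},
        "duplicate_row_count": duplicate_count,
        "duplicate_row_ratio": ratio(duplicate_count),
        "empty_columns": [
            c for c, n in missing.items() if row_count and n == row_count
        ],
    }


sample = "id,name,notes\n1,Ada,\n2,,\n1,Ada,\n"
report = analyze_csv(sample)
```

Ratios are rounded to four decimal places to match the example response, and an empty file yields ratios of 0.0 rather than a division error.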

The key design choice is that the API returns structured JSON instead of plain text.

That makes the result easier to consume from:

another script
a data pipeline
a dashboard
a workflow automation tool
a monitoring job

Step 5: Add expected-column validation

In many workflows, the next system expects a fixed set of columns.

For example:

id,name,email,age,signup_date

The API supports optional expected-column validation through a form field:

curl -X POST "http://127.0.0.1:8000/analyze" \
  -F "file=@sample_data/good_sample.csv" \
  -F "expected_columns=id,name,email,age,signup_date"

This allows the service to compare the uploaded CSV headers against the expected headers.

The response can then tell you whether the file matches the expected shape, and which columns are missing or unexpected.

This is a small feature, but it changes the API from a generic CSV inspector into something more useful for real workflows.
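
The comparison itself is simple. A minimal sketch, assuming the expected_columns form field arrives as a single comma-separated string; the compare_columns function and the result keys are hypothetical names, not the API's documented response fields.

```python
def compare_columns(actual_columns, expected_columns_field):
    """Compare CSV headers with a comma-separated expected_columns value."""
    expected = [
        name.strip()
        for name in expected_columns_field.split(",")
        if name.strip()
    ]
    missing = [name for name in expected if name not in actual_columns]
    unexpected = [name for name in actual_columns if name not in expected]
    return {
        "expected_columns": expected,
        "missing_columns": missing,
        "unexpected_columns": unexpected,
        # Column order is ignored here; only the set of names is compared.
        "matches_expected": not missing and not unexpected,
    }


result = compare_columns(
    ["id", "name", "email", "notes"],
    "id,name,email,age,signup_date",
)
```

Whether column order should also be enforced depends on the downstream system, so a stricter check may be appropriate in some workflows.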

Step 6: Return structured errors

CSV upload workflows can fail in many ways.

Some examples:

the user uploads a non-CSV file
the file is empty
the file cannot be parsed
the encoding is unsupported
the file is too large

Instead of returning inconsistent error messages, this project returns structured errors.

Example:

{
  "error": {
    "code": "invalid_file_type",
    "message": "Only .csv files are supported.",
    "details": {
      "filename": "not_csv.txt"
    }
  }
}

This format is useful because clients can check the code field programmatically.

For example, a frontend can show a friendly message for invalid_file_type, while a pipeline can log the error and stop processing.
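
A short sketch of both sides of that contract: a helper that builds the error body in the shape shown above, and a client that branches on the code field. The function names and the client-facing messages are hypothetical; only the invalid_file_type code and the payload shape come from the example.

```python
def error_payload(code, message, **details):
    """Build an error body in the {"error": {...}} shape."""
    return {"error": {"code": code, "message": message, "details": details}}


def check_filename(filename):
    """Return None for .csv names, otherwise a structured error body."""
    if not filename.lower().endswith(".csv"):
        return error_payload(
            "invalid_file_type",
            "Only .csv files are supported.",
            filename=filename,
        )
    return None


def client_message(body):
    """Branch on the machine-readable code rather than the message text."""
    code = body["error"]["code"]
    if code == "invalid_file_type":
        return "Please upload a .csv file."
    return "Upload failed. Please try again."


err = check_filename("not_csv.txt")
```

Because clients depend on the code value, the human-readable message can be reworded later without breaking them.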

Step 7: Add tests with pytest

A small API becomes much more useful when its behavior is protected by tests.

This project includes tests for:

/health returning 200
normal CSV analysis
missing value detection
duplicate row detection
non-CSV error handling
expected-column validation

Run the tests:

pytest

For a small demo project, tests are not just a formality. They make the project easier to refactor and safer to extend.

Step 8: Containerize the API with Docker

After the API works locally, the next step is to package it.

Build the Docker image:

docker build -t fastapi-csv-quality-api .

Run the container:

docker run --rm -p 8000:8000 fastapi-csv-quality-api

Then open:

http://127.0.0.1:8000/docs

The container listens on 0.0.0.0:8000, while your local machine accesses it through 127.0.0.1:8000 after port mapping.

Step 9: Use Docker Compose

Docker Compose is included for a simpler local workflow.

Start the service:

docker compose up --build

Stop the service:

docker compose down

This is useful when you want a repeatable local runtime without manually typing the full docker run command.

Why this project is intentionally small

This project is an MVP.

It intentionally does not include:

authentication
database storage
frontend UI
background jobs
large-file streaming
Kubernetes deployment
production cloud infrastructure

That is deliberate.

The purpose is to demonstrate a complete but lightweight backend workflow:

CSV upload → validation → analysis → structured response → tests → Docker packaging

This makes the project easier to read, test, and extend.

Possible next improvements

There are many ways to extend this project:

configurable file size limits
date format checks
numeric column checks
JSON schema export
GitHub Actions CI workflow
cloud deployment tutorial
larger file handling
dashboard integration

A natural next step would be to deploy the containerized service to a small cloud VM or Kubernetes platform. But that should be treated as a separate tutorial rather than added to the first MVP.

Conclusion

This project shows how to turn a simple local CSV inspection workflow into a reusable API.

The main engineering ideas are:

keep the first version small
return structured JSON instead of text
separate API routing from analysis logic
define response models clearly
test the behavior with pytest
package the service with Docker

Even though the example is small, the pattern is useful:

local script → API service → tested component → containerized tool

That pattern can be reused for many internal developer tools, data workflow utilities, and automation services.

You can find the full project here:

https://github.com/OnerGit/fastapi-csv-quality-api

AI-Assisted Development Is Not Autopilot

Bob Oner — Fri, 29 May 2026 04:48:48 +0000

AI can make coding faster. It can also make messy code faster.

That is the part of AI-assisted development that does not get discussed enough. A model can generate a route handler, a browser userscript, a README section, or a test idea in seconds. But speed is not the same as engineering progress. If the output is not scoped, tested, reviewed, and documented, the project may become harder to maintain even though it was faster to start.

My working rule is simple:

AI can draft, but engineering must decide.

I have been using this rule while building two small developer tools:

FastAPI CSV Quality API, a minimal FastAPI service that accepts CSV uploads and returns a structured data quality report.
ChatGPT Long Conversation Helper, a privacy-first Tampermonkey userscript for collapsing and navigating long ChatGPT conversations locally in the browser.

These are intentionally small projects. That is the point. Small projects are useful for learning where AI helps, where it needs limits, and how to turn generated drafts into reviewable engineering work.

This article is not a prompt guide. It is also not a benchmark, a productivity claim, or a recipe for replacing code review. It is a practical reflection on how I keep AI-assisted development useful without giving up scope control, testing, documentation, and human review.

AI can draft, but engineering must decide

I do not treat AI-generated code as finished code. I treat it as a draft that needs to pass through normal engineering gates.

For small tools, those gates do not need to be heavy. They can be simple:

Is the scope clear?
Is the interface small?
Is the behavior testable?
Are errors handled consistently?
Is the privacy boundary explicit?
Can another developer reproduce the project from the README?
Can I explain what the code does without relying on the original prompt?

That last question matters. If I cannot explain the code after reading it, I do not own the implementation yet.

In my workflow, AI is helpful during exploration. I may ask it to compare approaches, list edge cases, suggest a project structure, draft a README, or propose test cases. But I do not let it decide the final shape of the project. That decision belongs to the developer, because the developer is responsible for the behavior that gets published.

This is the difference between using AI as a drafting tool and treating AI as an autopilot.

Start with a small interface, not a large prompt

The biggest mistake I see in AI-assisted coding is starting with a large prompt that asks for an entire application.

That often produces code, but not necessarily a design.

For the CSV quality API, the useful boundary was not:

Build a complete data quality platform.

That would have been too broad. The useful boundary was much smaller:

A user uploads a CSV file. The API returns a structured JSON report.

That boundary made the project reviewable. It forced the implementation to answer concrete questions:

What is the endpoint?
What does the response model contain?
What happens with an empty file?
What happens with a non-CSV file?
What does a duplicate row count mean?
How should missing values be represented?
Which errors should be structured?

AI could help draft the FastAPI route and suggest pandas checks, but the important engineering work was defining the contract. Once the response shape was clear, the code had something to obey.

The browser userscript had a different kind of interface. It was not an HTTP API. It was a local UI boundary inside the browser.

The useful boundary was:

Add local collapse and expand controls to long conversation messages without sending, uploading, exporting, or scraping conversation content.

That boundary was just as important as an API contract. It prevented the project from drifting into a more sensitive tool. It also made implementation choices easier. The script could use DOM selectors, CSS, MutationObserver, and localStorage, but it should not use external requests, analytics, backend sync, or API calls.

In both projects, the small interface came before the implementation. That gave AI a box to work inside.

Tests turn AI output into reviewable code

AI-generated code becomes safer when it is forced to satisfy tests.

For the FastAPI CSV Quality API, automated tests were the main review tool. The tests were not only checking whether the app started. They were checking behavior that mattered to the API contract:

health endpoint behavior
valid CSV upload behavior
missing value reporting
duplicate row reporting
expected column validation
invalid file handling
empty upload handling

This matters because an API can look correct while silently changing its response shape. A field can be renamed. A ratio can be calculated differently. An error response can become inconsistent. Without tests, those changes are easy to miss.

A simplified test might look like this:

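# `client` and `csv_file` are assumed to be pytest fixtures defined elsewhere,
# for example a FastAPI TestClient and an open sample CSV file.
def test_analyze_csv_returns_quality_report(client, csv_file):
    # Upload the CSV as multipart form data, as a real client would.
    response = client.post(
        "/analyze",
        files={"file": ("sample.csv", csv_file, "text/csv")},
    )

    assert response.status_code == 200
    data = response.json()
    # Guard the response contract: these keys must stay present.
    assert "row_count" in data
    assert "missing_values_by_column" in data
    assert "duplicate_row_count" in data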

The exact test is less important than the habit. The test says: this is the behavior I expect, and future changes must respect it.

The userscript needed a different testing strategy. Browser UI behavior is harder to protect with a quick pytest suite, especially when the page is dynamic and not controlled by the project. So I used a manual test checklist instead.

That checklist covered installation, single-message collapse and expand, global controls, dynamic messages, refresh behavior, localStorage state, and privacy checks. It also included cases that are easy to forget: code blocks, long lines, Markdown tables, streaming replies, and messages added after the initial page load.

This is still testing. It is just the right level of testing for the project.

The point is not that every small tool needs a full CI pipeline. The point is that AI output needs a review mechanism. For an API, that mechanism may be automated tests. For a browser userscript, it may start with a disciplined manual checklist.

Logs are part of the review surface

Logs are often treated as an afterthought in small tools. In AI-assisted development, I think they matter even more, because I did not write every line myself.

When a project uses generated code, logs help answer a basic question:

What is the code actually doing at runtime?

For the API project, logs are useful when checking upload handling, parsing failures, and unexpected errors. For a small FastAPI service, I do not need a complex observability stack. But I do need error paths that are visible and understandable.

For the userscript, console warnings are useful when selectors fail or expected message containers are not found. This is especially important because DOM-based tools depend on a page structure that may change. If the script silently stops working, debugging becomes frustrating. A small, clear warning is better than silent failure.

Logs should not leak sensitive content. That is especially important for a tool that runs on conversation pages. Logging message text would violate the project’s own privacy boundary. Logging a generic warning such as “message container not found” is enough.
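
The same rule applies on the API side: log metadata about a failure, not the uploaded content. A minimal sketch using Python's standard logging module follows; the logger name, the log_parse_failure helper, and the chosen fields are assumptions, not the project's actual logging code.

```python
import logging

logger = logging.getLogger("csv_quality_api")


def log_parse_failure(filename, size_bytes, error):
    """Log metadata about a failed upload, never the file contents."""
    # Only the exception type is logged: the exception message may quote
    # raw cell values from the uploaded file.
    logger.warning(
        "csv parse failed: filename=%s size_bytes=%d error_type=%s",
        filename,
        size_bytes,
        type(error).__name__,
    )
```

If file names themselves could be sensitive in a given deployment, the same approach extends to logging only the size and error type.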

Good logs do not make the project bigger. They make it easier to review.

Privacy and permission boundaries must be explicit

AI is good at suggesting features. That is also why the developer needs to say no.

For the long conversation helper, it would be easy to add more features:

export conversations
summarize previous replies
sync state across devices
send content to an API
add search over all messages
collect usage analytics

Some of those features may be useful in other products, but they do not belong in this MVP.

The project is intentionally local. It modifies the browser view. It stores local UI state. It does not transmit conversation content. It does not call the ChatGPT API. It does not automate message sending. It does not export conversations.

That is not only a privacy statement. It is an engineering constraint.

Once the privacy boundary is explicit, every new feature can be reviewed against it:

Does this feature require external requests?
Does it store message text?
Does it touch cookies, tokens, or account data?
Does it turn a local UI helper into a data extraction tool?

If the answer is yes, it is outside the scope.

This is where AI needs the most supervision. A model may suggest a technically possible feature without understanding the product boundary. The developer must decide whether the feature should exist at all.

Ask AI for alternatives, not final decisions

I get better results when I ask AI for options rather than final answers.

For example, in the API project, AI can suggest several response model shapes. But I still need to choose the one that is easiest to understand, test, and document.

In the userscript project, AI can suggest several selector strategies. But selector choice requires judgment. Deep class-name chains may work today and break tomorrow. Shallow role-based selectors may be more stable, but they still need manual testing. There is no perfect answer, only a trade-off that should be documented.

The same applies to error handling, README structure, release notes, and limitations. AI can produce a draft quickly. The developer decides what is accurate.

A useful AI prompt is not:

Build the final solution.

A better prompt is:

Give me three implementation options, their risks, and how I should test each one.

That kind of prompt keeps the developer in control. It turns AI into a source of options to review, not an autopilot.

Documentation is part of development

For small projects, documentation is often postponed until the code is done. I try to do the opposite.

A README is not only a marketing page. It is a reproducibility contract. It should tell a reader what the project does, what it does not do, how to run it, how to test it, and where the limitations are.

For the CSV API, documentation needed to explain the endpoint, the response fields, sample data, test commands, Docker usage, and screenshots. Without that, the project would be much harder to evaluate from the outside.

For the userscript, documentation needed to explain installation, privacy, manual testing, troubleshooting, limitations, and release scope. That documentation is part of the engineering work because the tool runs in a sensitive context: a user’s browser session.

AI is useful for documentation drafts. It can help organize sections and turn rough notes into readable text. But documentation still needs technical review.

If the README says the project does not send external requests, the code must support that statement. If the limitations section says selectors may break when the page changes, the implementation should be structured so selectors are easy to update.

Documentation should not make the project sound bigger than it is. Good documentation reduces uncertainty. It does not inflate the project.

Stop before the project becomes too big

AI makes scope creep easier.

Once the first version works, it is tempting to ask for more: a dashboard, a Chrome extension, cloud sync, user accounts, analytics, background jobs, advanced configuration, and so on.

For portfolio projects, that can be dangerous. A small finished tool is often more convincing than a large unfinished platform.

The CSV API did not need to become a full data quality platform. It needed to show a clean API boundary, structured output, meaningful checks, tests, Docker packaging, and documentation.

The conversation helper did not need to become a full browser extension or AI workspace. It needed to solve one local navigation problem with a privacy-first boundary.

Stopping is an engineering skill.

A clear MVP makes the project easier to review. It also makes the writing stronger. Instead of explaining a large incomplete system, I can explain the trade-offs behind a small complete one.

My lightweight AI-assisted development loop

After building these two small tools, I now prefer a simple loop:

Define the smallest useful interface.
Ask AI for options and risks.
Write or review the first implementation.
Add tests or a manual test checklist.
Add logs where debugging would otherwise be unclear.
Document usage, limitations, and non-goals.
Review the code against the original boundary.
Stop before the project becomes unnecessarily large.

This is not a heavy process. It is a lightweight review loop for small tools.

The order matters. I do not want to start with a large prompt and then search for a structure afterward. I want the structure first, then use AI inside that structure.

For the CSV API, that structure was the upload endpoint and response contract. For the userscript, it was the local-only browser interaction model and the privacy boundary. In both cases, the AI-assisted parts were useful because the project already had a small reviewable shape.

What I learned from two small tools

The two projects are different, but they taught me the same lesson:

AI-assisted development works best when the surrounding process is disciplined.

From the FastAPI CSV Quality API, I learned that AI is helpful for turning a rough script idea into an API draft. But the real value comes from defining the response contract and protecting it with tests.

From the ChatGPT Long Conversation Helper, I learned that AI is helpful for exploring DOM logic and browser APIs. But the real value comes from privacy boundaries, manual testing, selector judgment, and clear limitations.

In both cases, the workflow mattered more than the initial code generation.

Conclusion

AI-assisted development is not autopilot.

It is useful when it helps developers move faster through drafts, alternatives, edge cases, and documentation. It becomes risky when generated code bypasses scope control, testing, privacy review, and human judgment.

For me, the practical answer is not to avoid AI. It is to wrap AI inside an engineering workflow.

Small interfaces keep the project understandable. Tests protect behavior. Logs make runtime behavior visible. Documentation makes the project reproducible. Limitations prevent overclaiming. Human review keeps the final responsibility where it belongs.

AI can help write code.

Engineering decides what code is worth keeping.

Build a Privacy-First Tampermonkey Script for Long ChatGPT Conversations

Bob Oner — Thu, 28 May 2026 12:54:50 +0000

Long AI conversations are useful, but they become hard to scan.

If you use ChatGPT for technical planning, code review, writing drafts, debugging, or research, a single conversation can easily grow into dozens of turns. At that point, the problem is no longer generating more content. The problem is navigation.

You may want to jump back to an earlier question. You may want to hide a long assistant answer after you have already used it. You may want to keep only the most important parts visible while reviewing the whole thread.

I wanted a small tool for that specific problem: collapse and expand long ChatGPT questions and answers in the local browser view.

The result is ChatGPT Long Conversation Helper, a Tampermonkey userscript that adds per-message collapse controls, global collapse / expand controls, a three-line preview, and local UI state.

Companion repository:

https://github.com/OnerGit/ChatGPT-Long-Conversation-Helper

This is a third-party local userscript. It is not an official OpenAI or ChatGPT feature.

It only changes the local browser view. It does not upload, transmit, collect, export, or send conversation content. It does not call the ChatGPT API. It does not automate sending messages. It stores only local UI state in localStorage.

The problem: long conversations are hard to review

A long conversation is useful while you are building it. It becomes less useful when you need to review it later.

The page can contain long prompts, detailed answers, code blocks, checklists, and repeated planning notes. Scrolling through everything makes it harder to compare earlier decisions with later results.

The tool does not try to summarize the conversation. It keeps the content exactly where it is and adds a local way to hide or show each message.

What this userscript does

The first version focuses on one narrow workflow improvement: make long conversations easier to review.

The userscript adds:

a Collapse question / Expand question button for user messages;
a Collapse answer / Expand answer button for assistant messages;
a three-line preview when a message is collapsed;
a subtle fade mask near the preview boundary;
a floating global control panel;
Collapse all and Expand all buttons;
a compact LCH launcher after hiding the full panel;
local collapsed / expanded state with localStorage.

It deliberately does not provide export, scraping, summarization, automation, cloud sync, or API integration.

That scope matters. A browser UI helper should not silently become a data extraction tool.

Why I started with Tampermonkey

This project could eventually become a browser extension, but I did not start there.

A Tampermonkey userscript was a better MVP boundary for three reasons.

First, it is quick to test. I can paste a single .user.js file into Tampermonkey, open ChatGPT, and validate the DOM behavior immediately.

Second, it avoids extension packaging too early. A Chrome or Edge extension would require more decisions around permissions, manifest configuration, distribution, review, and long-term maintenance.

Third, the real uncertainty was not packaging. The real uncertainty was whether the DOM-based interaction would feel useful and stable enough.

So the first goal was simple: validate the interaction model locally before turning it into a heavier browser extension.

Setting the privacy boundary

Before writing the DOM code, I defined what the tool must not do.

The script should not:

upload conversation content;
transmit conversation content;
collect conversation content;
export conversations;
call the ChatGPT API;
automate sending messages;
read cookies;
read account tokens;
read payment information;
collect telemetry;
use analytics;
load remote scripts.

The only persisted data should be local UI state: whether a message is collapsed and whether the global panel is hidden.

That boundary influenced the implementation. The script uses browser APIs such as:

querySelectorAll
MutationObserver
localStorage
classList
addEventListener

It does not need fetch, XMLHttpRequest, WebSocket, sendBeacon, document.cookie, or external dependencies.
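
One way to keep that boundary reviewable is a coarse text scan of the script source before each release. The sketch below is a hypothetical review helper, not part of the published userscript: the name findForbiddenApis and the patterns are assumptions, and a text match can also hit comments or strings, so it complements manual review rather than replacing it.

```javascript
// Hypothetical review helper (not part of the userscript itself).
// Flags API names that the privacy boundary rules out.
const FORBIDDEN_APIS = [
  { name: 'fetch', pattern: /\bfetch\s*\(/ },
  { name: 'XMLHttpRequest', pattern: /\bXMLHttpRequest\b/ },
  { name: 'WebSocket', pattern: /\bWebSocket\b/ },
  { name: 'sendBeacon', pattern: /\bsendBeacon\b/ },
  { name: 'document.cookie', pattern: /\bdocument\.cookie\b/ }
];

function findForbiddenApis(sourceText) {
  // Return the names of every forbidden API mentioned in the source text.
  return FORBIDDEN_APIS
    .filter(({ pattern }) => pattern.test(sourceText))
    .map(({ name }) => name);
}
```

An empty result for the .user.js file is a quick signal that no obvious network or cookie API has crept in.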

Userscript metadata

A userscript starts with metadata. This block tells Tampermonkey where the script should run and which special permissions it needs.

For this project, the metadata is intentionally small:

// ==UserScript==
// @name         ChatGPT Long Conversation Helper
// @namespace    chatgpt-long-conversation-helper
// @version      0.1.1
// @description  A privacy-first local UI helper that collapses and expands long ChatGPT questions and answers.
// @author       OnerGit
// @match        https://chatgpt.com/*
// @grant        none
// @run-at       document-idle
// @license      MIT
// ==/UserScript==

The important lines are:

// @match        https://chatgpt.com/*
// @grant        none
// @run-at       document-idle

@match limits the script to ChatGPT pages.

@grant none keeps the script in a simple mode without requesting special Tampermonkey APIs.

@run-at document-idle waits until the page is mostly loaded before running. This is useful for UI scripts because many target elements may not exist at the earliest loading stage.

This does not guarantee all conversation messages are already present. ChatGPT is a dynamic web app, so the script still needs a MutationObserver.
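
The metadata block can also be checked mechanically during review. The following is a minimal sketch, assuming the header uses the usual `// @key value` line format shown above; parseUserscriptMetadata and hasNoSpecialGrants are hypothetical helper names, not Tampermonkey APIs.

```javascript
// Hypothetical helpers for reviewing a userscript header as plain text.
function parseUserscriptMetadata(sourceText) {
  const metadata = {};
  let insideBlock = false;

  for (const rawLine of sourceText.split('\n')) {
    const line = rawLine.trim();
    if (line === '// ==UserScript==') {
      insideBlock = true;
      continue;
    }
    if (line === '// ==/UserScript==') {
      break; // Ignore anything after the closing marker.
    }
    if (!insideBlock) {
      continue;
    }
    const match = line.match(/^\/\/\s*@(\S+)\s+(.+)$/);
    if (match) {
      const key = match[1];
      // Keys such as @match may repeat, so collect values in arrays.
      if (!metadata[key]) {
        metadata[key] = [];
      }
      metadata[key].push(match[2].trim());
    }
  }
  return metadata;
}

function hasNoSpecialGrants(metadata) {
  // True only when every declared @grant is exactly "none".
  const grants = metadata.grant || [];
  return grants.length > 0 && grants.every((grant) => grant === 'none');
}
```

Running both functions against the .user.js file before publishing gives a quick confirmation that the header still matches the documented boundary.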

Finding message nodes in a dynamic page

The script needs to find user questions and assistant answers.

A tempting approach would be to copy a long selector chain from DevTools. For example, you might inspect a message and copy a selector that includes many nested class names.

That is usually fragile.

Modern web apps often change generated class names, wrapper elements, or layout structure. A selector that is too deep may break after a small UI update.

Instead, this script prefers shallow role-based selectors:

const CONFIG = {
  roleSelectors: [
    '[data-message-author-role="user"]',
    '[data-message-author-role="assistant"]'
  ],
  ignoredAncestors: 'form, textarea, input, nav, aside, header, footer, [role="dialog"]',
  processedAttr: 'data-clch-processed'
};

This is still a DOM dependency, and it can break if ChatGPT changes its page structure. But it is more maintainable than relying on a long chain of layout classes.

The script also avoids processing input boxes, dialogs, headers, footers, sidebars, and other non-conversation areas.

A simplified message finder looks like this:

function getConversationMessageNodes() {
  const found = new Set();

  CONFIG.roleSelectors.forEach((selector) => {
    document.querySelectorAll(selector).forEach((node) => {
      if (isLikelyConversationMessage(node)) {
        found.add(node);
      }
    });
  });

  return Array.from(found);
}

The Set prevents duplicates if selectors overlap.
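
The finder relies on isLikelyConversationMessage, which is not shown above. A minimal sketch of how such a filter could work, assuming it combines the role attribute with the ignored-ancestor list from CONFIG (the published helper may differ):

```javascript
// Assumed shape of the filter; the real implementation may differ.
const IGNORED_ANCESTORS =
  'form, textarea, input, nav, aside, header, footer, [role="dialog"]';

function getRole(node) {
  // The role comes from the same attribute the selectors target.
  return node.getAttribute('data-message-author-role');
}

function isLikelyConversationMessage(node) {
  if (!node || typeof node.closest !== 'function') {
    return false;
  }
  // Element.closest returns the nearest matching ancestor (or the node itself).
  if (node.closest(IGNORED_ANCESTORS)) {
    return false;
  }
  const role = getRole(node);
  return role === 'user' || role === 'assistant';
}
```

Because the check only uses closest and getAttribute, it can be exercised with small stand-in objects as well as with real DOM elements.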

Avoiding duplicate processing

A dynamic page can be scanned many times.

If the script adds a toolbar to a message every time it scans, each message quickly accumulates duplicate buttons. The solution is to mark processed nodes.

function processMessage(messageNode) {
  if (!messageNode || messageNode.getAttribute(CONFIG.processedAttr) === 'true') {
    return;
  }

  const role = getRole(messageNode);

  if (role !== 'user' && role !== 'assistant') {
    return;
  }

  messageNode.classList.add('clch-message');
  messageNode.setAttribute(CONFIG.processedAttr, 'true');

  addMessageToolbar(messageNode);
  restoreState(messageNode);
}

This makes scanning idempotent. Running scanMessages() multiple times should not keep adding more buttons to the same message.

That is important when using MutationObserver, because DOM changes may trigger scans repeatedly.

Adding collapse controls

For each message, the script inserts a small toolbar before the message node.

The toolbar contains one button:

function addMessageToolbar(messageNode) {
  const role = getRole(messageNode) || 'message';

  const toolbar = document.createElement('div');
  toolbar.className = 'clch-toolbar';

  const button = document.createElement('button');
  button.type = 'button';
  button.className = 'clch-toggle-button';
  button.textContent = getToggleLabel(role, false);
  button.setAttribute('aria-expanded', 'true');

  button.addEventListener('click', () => {
    const currentlyCollapsed =
      messageNode.getAttribute('data-clch-collapsed') === 'true';

    setCollapsed(messageNode, !currentlyCollapsed, true);
  });

  toolbar.appendChild(button);
  messageNode.parentNode.insertBefore(toolbar, messageNode);
}

The button does not move or rewrite the message content. It only toggles a collapsed class on the existing message node.

That design choice matters. Moving or wrapping message nodes can introduce layout risk with Markdown tables, code blocks, and wide answer containers. This version avoids re-parenting ChatGPT message DOM nodes and applies the collapsed state directly to the message node.
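
The toggle itself can stay small. The sketch below assumes the labels from the feature list above; applyCollapsedState is a hypothetical stand-in for the script's own setCollapsed, it omits persistence, and the "message" fallback label is an assumption.

```javascript
// Hypothetical sketch of the label and toggle logic (persistence omitted).
const COLLAPSED_CLASS = 'clch-collapsed-message';

function getToggleLabel(role, collapsed) {
  const noun =
    role === 'user' ? 'question' : role === 'assistant' ? 'answer' : 'message';
  return `${collapsed ? 'Expand' : 'Collapse'} ${noun}`;
}

function applyCollapsedState(messageNode, button, role, collapsed) {
  // Only classes and attributes change; the message content is untouched.
  messageNode.classList.toggle(COLLAPSED_CLASS, collapsed);
  messageNode.setAttribute('data-clch-collapsed', String(collapsed));

  // Keep the visible label and the accessibility state in sync.
  button.textContent = getToggleLabel(role, collapsed);
  button.setAttribute('aria-expanded', String(!collapsed));
}
```

Keeping label generation in a pure function makes the wording easy to check without opening a browser.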

Styling the collapsed state

The collapsed state is mostly CSS.

The script applies a class such as:

clch-collapsed-message

Then CSS limits the visible height:

.clch-collapsed-message {
  max-height: calc(3 * 1.55em);
  overflow: hidden !important;
  position: relative !important;
}

A fade mask makes the preview feel less abrupt:

.clch-collapsed-message::after {
  content: "";
  position: absolute;
  left: 0;
  right: 0;
  bottom: 0;
  height: 1.9em;
  pointer-events: none;
  background: linear-gradient(
    to bottom,
    rgba(255, 255, 255, 0),
    var(--clch-fade-bg, #ffffff)
  );
}

This is intentionally simple. The script does not try to summarize the message. It does not parse the text. It does not store the content. It only changes how much of the existing message is visible.

Watching new messages with MutationObserver

ChatGPT conversations are dynamic. New user messages and assistant replies appear after the initial page load.

A one-time scan is not enough.

The script uses MutationObserver to watch for newly inserted content:

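// Shared timer handle so repeated mutations reset a single pending scan.
// scheduleScan and CONFIG.observerThrottleMs are assumed to be defined
// elsewhere in the full script.
let observerTimer = null;

function startObserver() {
  const target = document.querySelector('main') || document.body;

  const observer = new MutationObserver(() => {
    // Cancel the pending scan and schedule a fresh one.
    window.clearTimeout(observerTimer);
    observerTimer = window.setTimeout(scheduleScan, CONFIG.observerThrottleMs);
  });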

  observer.observe(target, {
    childList: true,
    subtree: true
  });
}

The observer does not process every mutation immediately. It schedules a scan with a small delay.

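That delay matters because dynamic apps may produce several DOM changes during a single interaction. Because each new mutation cancels the pending timer and starts a new one, the pattern is a debounce: the scan runs once after the burst of changes settles, instead of once per mutation.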

The scan function can then process any new message nodes that do not already have the data-clch-processed marker.
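
The same debounce can be written as a small reusable helper. This is a sketch under stated assumptions: createDebouncedScan is a hypothetical name, and the injectable timers parameter exists only so the behavior can be checked without waiting on real timeouts.

```javascript
// Hypothetical debounce helper; the timers argument defaults to the global
// setTimeout/clearTimeout and can be replaced in tests.
function createDebouncedScan(scanFn, delayMs, timers = globalThis) {
  let timerId = null;

  return function requestScan() {
    // A new request cancels any scan that is still waiting.
    if (timerId !== null) {
      timers.clearTimeout(timerId);
    }
    timerId = timers.setTimeout(() => {
      timerId = null;
      scanFn();
    }, delayMs);
  };
}
```

In the browser, the returned function could be passed directly as the MutationObserver callback, since it ignores its arguments.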

Saving local UI state

If you collapse several messages and refresh the page, it is useful for the local view to remember that state.

The script uses localStorage for this.

A simplified storage key looks like this:

clch:v0.1.1:/c/example-conversation:assistant:4:collapsed = 1

The key includes:

script namespace and version;
current URL path;
message role;
message index;
collapsed state.

The value is only a UI flag.

It does not store message text.
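
A key of that shape can be assembled from its parts. The sketch below is illustrative: buildStorageKey, encodeCollapsed, and decodeCollapsed are hypothetical names, and using '0' for the expanded state is an assumption (the example above only shows 1 for collapsed).

```javascript
// Hypothetical key builder matching the key shape described above.
const STORAGE_PREFIX = 'clch:v0.1.1';

function buildStorageKey(pathname, role, index) {
  // Only structural facts go into the key: path, role, and position.
  return [STORAGE_PREFIX, pathname, role, index, 'collapsed'].join(':');
}

function encodeCollapsed(collapsed) {
  // The stored value is a flag, never message text.
  return collapsed ? '1' : '0';
}

function decodeCollapsed(value) {
  // Missing or unexpected values are treated as expanded.
  return value === '1';
}
```

In the browser, the pathname argument would typically come from window.location.pathname.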

The storage helpers are wrapped in try/catch because browser storage can fail or be disabled:

function safeGetStorage(key) {
  try {
    return window.localStorage.getItem(key);
  } catch (error) {
    console.warn('[CLCH] Failed to read localStorage.', error);
    return null;
  }
}

function safeSetStorage(key, value) {
  try {
    window.localStorage.setItem(key, value);
  } catch (error) {
    console.warn('[CLCH] Failed to write localStorage.', error);
  }
}

This state recovery is best-effort. Because it is index-based, it may not restore perfectly if the conversation order changes or if the page DOM changes.

That limitation is acceptable for an MVP because the script is a local UI helper, not a data management system.

Global controls and the LCH launcher

Individual controls help when reviewing one message. Global controls help when a conversation is already long.

The floating panel provides:

Collapse all
Expand all
Hide controls

If the panel itself becomes visual noise, it can be hidden into a small LCH launcher.

This is a small UI detail, but it matters for a browser helper. A tool that reduces visual noise should not create too much noise of its own.

Manual testing

For a small userscript, manual testing is still important.

The test plan I used focuses on behavior rather than unit tests:

Install Tampermonkey.
Paste and enable the userscript.
Open a ChatGPT conversation at https://chatgpt.com/.
Confirm the floating control panel appears.
Confirm long user questions get Collapse question.
Confirm assistant answers get Collapse answer.
Collapse and expand individual messages.
Use Collapse all and Expand all.
Hide the panel and reopen it with LCH.
Send a new message and confirm dynamic content receives controls.
Refresh the page and check best-effort state recovery.
Test messages containing Markdown tables, code blocks, lists, and long lines.
Confirm no message content disappears after expanding.
Check that localStorage contains only UI state keys.
Confirm there are no script-triggered external requests.

The privacy test is part of the functional test. For this project, “it works” is not enough. It also needs to stay within the local-only boundary.
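
The localStorage check in that list can be partly automated. This is a minimal sketch, assuming every key written by the script starts with clch: and every value is a '0' or '1' flag; auditClchStorage is a hypothetical helper, and keys written by the page itself are skipped rather than judged.

```javascript
// Hypothetical audit helper. Accepts any Storage-like object
// (length, key(i), getItem) so it works with window.localStorage.
function auditClchStorage(storage, prefix = 'clch:') {
  const unexpected = [];

  for (let i = 0; i < storage.length; i += 1) {
    const key = storage.key(i);
    if (key === null || !key.startsWith(prefix)) {
      continue; // Not written by this script.
    }
    const value = storage.getItem(key);
    // Anything other than a simple flag deserves a closer look.
    if (value !== '0' && value !== '1') {
      unexpected.push({ key, value });
    }
  }
  return unexpected;
}
```

Pasting the function into the DevTools console and calling auditClchStorage(window.localStorage) should return an empty array if the script stores only UI flags.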

Known limitations

This is a best-effort UI enhancement.

The main limitation is DOM dependency. The script depends on the visible ChatGPT web page structure. If ChatGPT changes its DOM, selectors may need to be updated.

Other limitations:

streaming replies may not always receive controls immediately;
local state recovery may be imperfect after page changes;
message indexing can shift if the conversation structure changes;
the script has only been tested manually, not against an official ChatGPT extension API;
it is not an official feature;
it is not affiliated with OpenAI.

I state these limitations openly because they are part of the engineering reality of a DOM-based userscript.

What I would improve next

I would keep the next version small.

Useful improvements include:

configurable preview line count;
optional keyboard shortcuts;
more robust selector fallback;
better settings UI;
improved dark-mode visual tuning;
clearer reset controls for local UI state.

A Chrome or Edge extension could also be considered later, but only after the userscript behavior stabilizes.

Moving from userscript to extension would require a new review of permissions, storage behavior, privacy documentation, packaging, and distribution. It should not be treated as a simple file conversion.

Conclusion

Small local tools can improve AI workflows, but the boundary matters.

For this project, the useful feature is not automation. It is navigation. The script does not send messages, call APIs, scrape conversations, or export data. It only changes the local browser view so long conversations are easier to scan.

That made Tampermonkey a good starting point. It allowed the core interaction to be tested quickly while keeping the project small enough to review.

The broader lesson is simple: when building AI workflow tools, productivity should not come at the cost of unclear data behavior. A small browser tool can still be useful if it has a narrow scope, a clear privacy boundary, and honest limitations.

GitHub repository:

https://github.com/OnerGit/ChatGPT-Long-Conversation-Helper

This is a third-party local userscript, not an official OpenAI or ChatGPT feature.