<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bob Oner</title>
    <description>The latest articles on DEV Community by Bob Oner (@bob_oner).</description>
    <link>https://dev.to/bob_oner</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3956541%2F4d9860d9-0ec2-4a1d-a1ff-19523dd45e3e.png</url>
      <title>DEV Community: Bob Oner</title>
      <link>https://dev.to/bob_oner</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bob_oner"/>
    <language>en</language>
    <item>
      <title>From Mock API Workflow to Delivery-Ready Asset: Extending a Shopify-style Reporting Case Study</title>
      <dc:creator>Bob Oner</dc:creator>
      <pubDate>Wed, 17 Jun 2026 12:15:39 +0000</pubDate>
      <link>https://dev.to/bob_oner/from-mock-api-workflow-to-delivery-ready-asset-extending-a-shopify-style-reporting-case-study-558f</link>
      <guid>https://dev.to/bob_oner/from-mock-api-workflow-to-delivery-ready-asset-extending-a-shopify-style-reporting-case-study-558f</guid>
      <description>&lt;p&gt;In the previous article, I introduced a public Shopify-style API reporting workflow case study.&lt;/p&gt;

&lt;p&gt;That article focused on one question:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Can I show the workflow shape publicly
without exposing private implementation code,
credentials,
store domains,
or client data?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Previous article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/bob_oner/designing-a-shopify-style-api-reporting-workflow-as-a-public-case-study-3f9f"&gt;Designing a Shopify-style API Reporting Workflow as a Public Case Study&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Public case-study repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/shopify-api-reporting-workflow" rel="noopener noreferrer"&gt;https://github.com/OnerGit/shopify-api-reporting-workflow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;General runnable data workflow project:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/data-quality-etl-starter" rel="noopener noreferrer"&gt;https://github.com/OnerGit/data-quality-etl-starter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first public version of the Shopify-style case study showed the workflow shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mock REST-style API data
→ GraphQL-shaped mock responses
→ pagination
→ field mapping
→ normalized reporting tables
→ CSV / Excel / SQLite / Markdown outputs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That was useful, but it was not enough.&lt;/p&gt;

&lt;p&gt;Once the mock workflow works, the next question becomes more practical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What would this need before it could become
a reusable client delivery asset?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is what the next private milestones explored.&lt;/p&gt;

&lt;h2&gt;
  
  
  Title options considered
&lt;/h2&gt;

&lt;p&gt;Before writing the follow-up, I considered these titles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;From Mock API Workflow to Delivery-Ready Asset: Extending a Shopify-style Reporting Case Study&lt;/li&gt;
&lt;li&gt;What Comes After a Public API Reporting Case Study?&lt;/li&gt;
&lt;li&gt;Turning a Shopify-style API Reporting Demo into a Safer Client Delivery Workflow&lt;/li&gt;
&lt;li&gt;Beyond Mock Data: Validation, Connector Boundaries, and Delivery Planning for API Reporting Workflows&lt;/li&gt;
&lt;li&gt;From Mock Shopify-style Data to Safer Client Reporting Delivery&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I chose the first title because it describes the direction clearly: this is not only about showing a demo, but about thinking through what makes a workflow safer to adapt for client work.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed after v0.2
&lt;/h2&gt;

&lt;p&gt;The v0.1 and v0.2 case-study work addressed the first layer of the problem.&lt;/p&gt;

&lt;p&gt;They showed that Shopify-style order, customer, product, and line-item data can be shaped into reporting-friendly outputs.&lt;/p&gt;

&lt;p&gt;The public repo showed sanitized evidence for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mock REST-style workflow behavior;&lt;/li&gt;
&lt;li&gt;GraphQL-shaped mock pagination;&lt;/li&gt;
&lt;li&gt;output previews;&lt;/li&gt;
&lt;li&gt;Excel-style workbook structure;&lt;/li&gt;
&lt;li&gt;SQLite-style reporting tables;&lt;/li&gt;
&lt;li&gt;Markdown report preview;&lt;/li&gt;
&lt;li&gt;public/private boundary notes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After that, the project moved into a different kind of work.&lt;/p&gt;

&lt;p&gt;The later private milestones were not about adding a dashboard, a SaaS layer, or a public connector.&lt;/p&gt;

&lt;p&gt;They were about the delivery layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mock workflow evidence
→ validation boundary
→ private connector template
→ manual live validation gate
→ redaction rules
→ retry and backoff planning
→ client handoff templates
→ optional extension planning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That layer is less visually exciting than a dashboard screenshot.&lt;/p&gt;

&lt;p&gt;But for real API reporting work, it matters more.&lt;/p&gt;

&lt;p&gt;A useful API reporting workflow is not only about extracting data. It also needs boundaries around credentials, validation evidence, failure handling, output review, and client-specific adaptation.&lt;/p&gt;

&lt;h2&gt;
  
  
  v0.3: development-store validation notes and evidence boundary
&lt;/h2&gt;

&lt;p&gt;The v0.3 milestone focused on development-store validation notes and sanitized evidence handling.&lt;/p&gt;

&lt;p&gt;This was not about publishing live validation evidence.&lt;/p&gt;

&lt;p&gt;The public repo does not contain raw validation output. It does not contain store domains, tokens, raw API responses, customer records, or private implementation details.&lt;/p&gt;

&lt;p&gt;Instead, the public documentation describes how validation evidence should be handled safely.&lt;/p&gt;

&lt;p&gt;That distinction is important.&lt;/p&gt;

&lt;p&gt;When a workflow moves from fake fixtures toward real API validation, the evidence itself can become sensitive.&lt;/p&gt;

&lt;p&gt;A screenshot can accidentally reveal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a development store domain;&lt;/li&gt;
&lt;li&gt;an access scope;&lt;/li&gt;
&lt;li&gt;a private path;&lt;/li&gt;
&lt;li&gt;a request or response shape that should not be public;&lt;/li&gt;
&lt;li&gt;customer names, emails, addresses, or order details;&lt;/li&gt;
&lt;li&gt;internal implementation assumptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the v0.3 work treated validation as a boundary problem, not just a checklist item.&lt;/p&gt;

&lt;p&gt;The public-safe position is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The public repo can describe validation readiness.
The private workflow can perform validation.
Raw validation evidence should remain private unless it is manually sanitized.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is more conservative than posting everything.&lt;/p&gt;

&lt;p&gt;It is also closer to how client work should be handled.&lt;/p&gt;

&lt;p&gt;The point is not to show every internal detail. The point is to show that validation was considered responsibly.&lt;/p&gt;
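
&lt;p&gt;To make the "manually sanitized" step more concrete, here is a minimal Python sketch of a redaction pass over validation notes. It is not the private implementation. The function name &lt;code&gt;redact&lt;/code&gt;, the patterns, and the placeholder labels are all assumptions made for this example, and the token pattern only guesses at a common prefix shape. A pass like this can support a manual review, not replace it.&lt;/p&gt;

```python
import re

# Hypothetical patterns for illustration only; a real project would
# maintain its own reviewed list of sensitive shapes.
REDACTION_PATTERNS = [
    # Store domains such as example-store.myshopify.com
    (re.compile(r"\b[a-z0-9][a-z0-9-]*\.myshopify\.com\b", re.IGNORECASE),
     "[REDACTED_STORE_DOMAIN]"),
    # Email addresses
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[REDACTED_EMAIL]"),
    # Token-like strings with a five-character prefix (assumed shape)
    (re.compile(r"\bshp[a-z]{2}_[A-Za-z0-9]{8,}\b"), "[REDACTED_TOKEN]"),
]


def redact(text: str) -> str:
    """Replace sensitive-looking substrings with fixed placeholders."""
    for pattern, placeholder in REDACTION_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    note = ("Validated orders on demo-shop.myshopify.com "
            "as owner@example.com with shpat_abcdef123456")
    print(redact(note))
```

&lt;p&gt;Fixed placeholders keep the sanitized note readable while making it obvious where something was removed.&lt;/p&gt;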

&lt;h2&gt;
  
  
  v0.4: private connector template
&lt;/h2&gt;

&lt;p&gt;The v0.4 milestone moved the private implementation toward a connector-template layer.&lt;/p&gt;

&lt;p&gt;This is still not published in the public repo.&lt;/p&gt;

&lt;p&gt;The public repository remains a case study. It does not contain runnable connector code, real query templates, OAuth instructions, &lt;code&gt;.env&lt;/code&gt; examples, token examples, raw API responses, or production setup steps.&lt;/p&gt;

&lt;p&gt;The private connector-template work focuses on the implementation concerns that appear when a workflow moves closer to real API delivery:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;configuration boundary
→ manual validation gate
→ scope checks
→ redaction support
→ retry and backoff behavior
→ sanitized output summaries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters because real API work is not defined by a single successful request.&lt;/p&gt;

&lt;p&gt;A reporting workflow has to answer less exciting but more important questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What happens if a request fails?&lt;/li&gt;
&lt;li&gt;What happens if the token does not have the required scope?&lt;/li&gt;
&lt;li&gt;How should pagination state be handled?&lt;/li&gt;
&lt;li&gt;What should be logged?&lt;/li&gt;
&lt;li&gt;What should never be logged?&lt;/li&gt;
&lt;li&gt;What output can be shown publicly?&lt;/li&gt;
&lt;li&gt;What should remain private to the client?&lt;/li&gt;
&lt;/ul&gt;
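
&lt;p&gt;The first question, what happens if a request fails, is often handled with retries and exponential backoff. The sketch below is a generic illustration, not the private connector code. &lt;code&gt;TransientError&lt;/code&gt;, &lt;code&gt;call_with_backoff&lt;/code&gt;, and the default delays are hypothetical names and values chosen for this example.&lt;/p&gt;

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the caller.
            # Delay doubles each attempt: base, 2 * base, 4 * base, ...
            delay = base_delay * (2 ** (attempt - 1))
            # Small random jitter so parallel runs do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
    # Only reached when the loop never ran.
    raise ValueError("max_attempts must be at least 1")
```

&lt;p&gt;Passing &lt;code&gt;sleep&lt;/code&gt; as a parameter is a design choice for testability: a test can record the delays instead of waiting for them.&lt;/p&gt;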

&lt;p&gt;For Shopify-style GraphQL reporting workflows, there are several areas that need careful design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;access scopes;&lt;/li&gt;
&lt;li&gt;API versions;&lt;/li&gt;
&lt;li&gt;cursor pagination;&lt;/li&gt;
&lt;li&gt;retry behavior;&lt;/li&gt;
&lt;li&gt;customer data privacy;&lt;/li&gt;
&lt;li&gt;store-specific product, variant, discount, tax, refund, fulfillment, and channel fields.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Shopify's own documentation states that the REST Admin API is now a legacy API and that new public apps should use the GraphQL Admin API. That does not mean this case study is a real Shopify app. It means the private workflow design needs to be aware of the GraphQL direction and cursor-style pagination.&lt;/p&gt;

&lt;p&gt;The v0.4 work is best understood as a private delivery template concept.&lt;/p&gt;

&lt;p&gt;It is not a public connector.&lt;/p&gt;

&lt;p&gt;It is not a claim that every Shopify store can use the same implementation without adaptation.&lt;/p&gt;

&lt;p&gt;It is a safer foundation for adapting the workflow when a real client has specific data access, reporting definitions, and output requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  v0.5: delivery workflow and optional extensions
&lt;/h2&gt;

&lt;p&gt;The v0.5 milestone focused on client delivery workflow planning.&lt;/p&gt;

&lt;p&gt;It did not add Google Sheets as a public integration.&lt;/p&gt;

&lt;p&gt;It did not add PostgreSQL as a production connector.&lt;/p&gt;

&lt;p&gt;It did not add a scheduler, cloud deployment, dashboard, or webhook service.&lt;/p&gt;

&lt;p&gt;Instead, v0.5 clarified how a reporting workflow could be delivered and reviewed.&lt;/p&gt;

&lt;p&gt;A small API reporting project usually needs more than a script.&lt;/p&gt;

&lt;p&gt;It needs a handoff process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;intake
→ data access boundary
→ report definition
→ field mapping review
→ sample output review
→ acceptance checks
→ handoff notes
→ maintenance decision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where many small automation projects become fragile.&lt;/p&gt;

&lt;p&gt;The code may run once, but nobody has agreed on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;who owns the data access;&lt;/li&gt;
&lt;li&gt;which fields are required;&lt;/li&gt;
&lt;li&gt;how refunds and discounts should be counted;&lt;/li&gt;
&lt;li&gt;whether taxes and shipping are included;&lt;/li&gt;
&lt;li&gt;whether the output should be CSV, Excel, SQLite, PostgreSQL, or Google Sheets;&lt;/li&gt;
&lt;li&gt;how often the workflow should run;&lt;/li&gt;
&lt;li&gt;who reviews failed runs;&lt;/li&gt;
&lt;li&gt;what data should be excluded for privacy;&lt;/li&gt;
&lt;li&gt;what changes require a new mapping review.&lt;/li&gt;
&lt;/ul&gt;
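
&lt;p&gt;Several of these open questions, such as how refunds and discounts are counted and whether taxes and shipping are included, can be written down as an explicit definition that a client can review. The following Python sketch shows one possible shape. The class name, the flags, and the row field names are hypothetical and are not taken from the private workflow.&lt;/p&gt;

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class NetSalesDefinition:
    """Hypothetical, client-agreed rules for one reporting metric."""
    subtract_discounts: bool = True
    subtract_refunds: bool = True
    include_taxes: bool = False
    include_shipping: bool = False


def net_sales(order: dict, rules: NetSalesDefinition) -> Decimal:
    """Compute net sales for one flattened order row under the agreed rules."""
    total = Decimal(order["gross_sales"])
    if rules.subtract_discounts:
        total -= Decimal(order["discounts"])
    if rules.subtract_refunds:
        total -= Decimal(order["refunds"])
    if rules.include_taxes:
        total += Decimal(order["taxes"])
    if rules.include_shipping:
        total += Decimal(order["shipping"])
    return total


if __name__ == "__main__":
    # Invented sample row; amounts are strings to avoid float rounding.
    order = {"gross_sales": "100.00", "discounts": "10.00",
             "refunds": "5.00", "taxes": "8.00", "shipping": "4.00"}
    print(net_sales(order, NetSalesDefinition()))
    print(net_sales(order, NetSalesDefinition(include_taxes=True,
                                              include_shipping=True)))
```

&lt;p&gt;With the sample row above, the two calls print 85.00 and 97.00, which shows how the same order produces different totals depending on the agreed definition.&lt;/p&gt;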

&lt;p&gt;The v0.5 public summaries frame Google Sheets, PostgreSQL, scheduling, dashboards, and cloud delivery as optional delivery decisions.&lt;/p&gt;

&lt;p&gt;That is the right level of abstraction for this repository.&lt;/p&gt;

&lt;p&gt;A client may need a Google Sheets output because the team already works there.&lt;/p&gt;

&lt;p&gt;Another client may prefer PostgreSQL because a dashboard or internal tool will query the data.&lt;/p&gt;

&lt;p&gt;Another may only need a weekly Excel workbook.&lt;/p&gt;

&lt;p&gt;The workflow should not assume the heaviest option first.&lt;/p&gt;

&lt;p&gt;The delivery path should be chosen after the reporting cadence, review process, and data ownership are clear.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;p&gt;The main lesson from v0.3 through v0.5 is simple:&lt;/p&gt;

&lt;p&gt;A serious portfolio project is not just a code demo.&lt;/p&gt;

&lt;p&gt;It should show judgment.&lt;/p&gt;

&lt;p&gt;That judgment includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what to publish;&lt;/li&gt;
&lt;li&gt;what to keep private;&lt;/li&gt;
&lt;li&gt;how to validate safely;&lt;/li&gt;
&lt;li&gt;how to handle credentials;&lt;/li&gt;
&lt;li&gt;how to avoid overclaiming;&lt;/li&gt;
&lt;li&gt;how to document limitations;&lt;/li&gt;
&lt;li&gt;how to turn a workflow into something a client can review and accept.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engineering work is not only in the extractor or exporter.&lt;/p&gt;

&lt;p&gt;It is also in the boundary between demo, validation, and delivery.&lt;/p&gt;

&lt;p&gt;For a public portfolio repository, that boundary matters.&lt;/p&gt;

&lt;p&gt;If I publish too little, the project does not prove anything.&lt;/p&gt;

&lt;p&gt;If I publish too much, I risk exposing implementation details that should remain private or client-specific.&lt;/p&gt;

&lt;p&gt;The current structure is a compromise:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public repo
→ explains the workflow shape and output expectations

private implementation
→ contains runnable delivery logic and client-adaptable code

client project
→ requires store-specific scopes, fields, privacy review, and metric definitions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the kind of structure I want this case study to demonstrate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this project is still not
&lt;/h2&gt;

&lt;p&gt;Even after v0.3, v0.4, and v0.5, this project is still not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a Shopify app;&lt;/li&gt;
&lt;li&gt;a public production connector;&lt;/li&gt;
&lt;li&gt;a SaaS dashboard;&lt;/li&gt;
&lt;li&gt;a full data warehouse;&lt;/li&gt;
&lt;li&gt;a Google Sheets integration;&lt;/li&gt;
&lt;li&gt;a PostgreSQL production connector;&lt;/li&gt;
&lt;li&gt;a public runnable package;&lt;/li&gt;
&lt;li&gt;a source of real Shopify data;&lt;/li&gt;
&lt;li&gt;a low-code workflow;&lt;/li&gt;
&lt;li&gt;an AI agent workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also does not claim that one workflow is ready for every Shopify store.&lt;/p&gt;

&lt;p&gt;A real Shopify reporting project would still need store-specific review:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;required objects
→ access scopes
→ field mapping
→ customer privacy rules
→ metric definitions
→ output format
→ review process
→ maintenance plan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is not a weakness.&lt;/p&gt;

&lt;p&gt;It is the reality of API reporting work.&lt;/p&gt;

&lt;p&gt;The public case study should make that reality visible instead of hiding it behind a simplified demo.&lt;/p&gt;

</description>
      <category>python</category>
      <category>shopify</category>
      <category>api</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Designing a Shopify-style API Reporting Workflow as a Public Case Study</title>
      <dc:creator>Bob Oner</dc:creator>
      <pubDate>Tue, 16 Jun 2026 06:22:43 +0000</pubDate>
      <link>https://dev.to/bob_oner/designing-a-shopify-style-api-reporting-workflow-as-a-public-case-study-3f9f</link>
      <guid>https://dev.to/bob_oner/designing-a-shopify-style-api-reporting-workflow-as-a-public-case-study-3f9f</guid>
      <description>&lt;p&gt;In my previous article, I walked through a general Python data quality workflow using a public retail dataset.&lt;/p&gt;

&lt;p&gt;That project focused on a reusable pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;raw data
→ schema mapping
→ validation
→ cleaning
→ SQLite export
→ quality report
→ benchmark evidence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This article is about a different kind of repository.&lt;/p&gt;

&lt;p&gt;Instead of adding another version to the general ETL starter, I built a public Shopify-style e-commerce API reporting case study.&lt;/p&gt;

&lt;p&gt;Previous article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/bob_oner/running-a-real-retail-dataset-through-a-python-data-quality-workflow-490b"&gt;Running a Real Retail Dataset Through a Python Data Quality Workflow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The runnable open-source project behind that article is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/data-quality-etl-starter" rel="noopener noreferrer"&gt;https://github.com/OnerGit/data-quality-etl-starter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That repository demonstrates the general workflow capability: it takes messy CSV, Excel, JSON, and API-style data through validation, cleaning, exports, and reports, and it produces analytics-ready outputs, BI-ready outputs, AI-ready preparation, and public dataset benchmark evidence.&lt;/p&gt;

&lt;p&gt;The case-study repository for this article is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/shopify-api-reporting-workflow" rel="noopener noreferrer"&gt;https://github.com/OnerGit/shopify-api-reporting-workflow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This new repository is not another version of &lt;code&gt;data-quality-etl-starter&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It is a public portfolio case study for a Shopify-style e-commerce API reporting workflow.&lt;/p&gt;

&lt;p&gt;For a fully runnable open-source data workflow project, see &lt;code&gt;data-quality-etl-starter&lt;/code&gt;. The runnable implementation of this case study is maintained privately as a reusable commercial delivery asset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf5pg6ywnn2cw4hh9bs0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf5pg6ywnn2cw4hh9bs0.png" alt="Shopify-style API reporting case study overview" width="800" height="1175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why build a Shopify-style reporting case study?
&lt;/h2&gt;

&lt;p&gt;A general data workflow project is useful, but real client work is usually vertical.&lt;/p&gt;

&lt;p&gt;A small e-commerce team does not usually ask for "a data quality ETL starter."&lt;/p&gt;

&lt;p&gt;They ask for something more specific:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Can you export Shopify orders every week?
Can you clean product and customer data?
Can you generate an Excel sales report?
Can you turn API data into CSV files?
Can you create product-level or customer-level summaries?
Can you prepare a local reporting database?
Can you help us move from REST-style exports toward GraphQL-style API data?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those requests are narrower than a full data platform.&lt;/p&gt;

&lt;p&gt;They are also more concrete than a generic portfolio demo.&lt;/p&gt;

&lt;p&gt;That is why I built &lt;code&gt;shopify-api-reporting-workflow&lt;/code&gt; as a vertical case study. It applies the same workflow thinking from my general data quality project to a more realistic e-commerce reporting scenario.&lt;/p&gt;

&lt;p&gt;The core workflow idea is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Shopify-style API data
→ pagination
→ field mapping
→ normalized reporting tables
→ validation
→ CSV / Excel / SQLite exports
→ Markdown report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project is intentionally scoped around reporting workflow design, not around building a full Shopify app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Public repo vs private runnable implementation
&lt;/h2&gt;

&lt;p&gt;The most important boundary in this repository is the public/private split.&lt;/p&gt;

&lt;p&gt;The public repository includes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;README.md
NOTICE.md
docs/
sample_outputs/
screenshots/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the case-study problem;&lt;/li&gt;
&lt;li&gt;the reporting workflow shape;&lt;/li&gt;
&lt;li&gt;public-safe sample output previews;&lt;/li&gt;
&lt;li&gt;screenshots from the private runnable workflow;&lt;/li&gt;
&lt;li&gt;implementation boundary notes;&lt;/li&gt;
&lt;li&gt;limitations;&lt;/li&gt;
&lt;li&gt;REST-to-GraphQL migration notes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does not include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;source code;&lt;/li&gt;
&lt;li&gt;tests;&lt;/li&gt;
&lt;li&gt;scripts;&lt;/li&gt;
&lt;li&gt;dependency files;&lt;/li&gt;
&lt;li&gt;Docker files;&lt;/li&gt;
&lt;li&gt;complete mock data;&lt;/li&gt;
&lt;li&gt;complete field mappings;&lt;/li&gt;
&lt;li&gt;GraphQL query templates;&lt;/li&gt;
&lt;li&gt;production connector code;&lt;/li&gt;
&lt;li&gt;credentials;&lt;/li&gt;
&lt;li&gt;tokens;&lt;/li&gt;
&lt;li&gt;store domains;&lt;/li&gt;
&lt;li&gt;client data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is intentional.&lt;/p&gt;

&lt;p&gt;The runnable implementation exists locally and privately as a reusable commercial delivery asset. The public repository is designed to explain the workflow, output expectations, design boundaries, and implementation judgment without exposing reusable private code or client-sensitive material.&lt;/p&gt;

&lt;p&gt;This is different from &lt;code&gt;data-quality-etl-starter&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;data-quality-etl-starter&lt;/code&gt; is a runnable open-source project.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;shopify-api-reporting-workflow&lt;/code&gt; is a public case-study repository.&lt;/p&gt;

&lt;p&gt;Both are useful, but they serve different purposes.&lt;/p&gt;

&lt;h2&gt;
  
  
  v0.1: mock REST-style API reporting workflow
&lt;/h2&gt;

&lt;p&gt;The v0.1 private implementation models a mock REST-style e-commerce API reporting workflow.&lt;/p&gt;

&lt;p&gt;It uses fake local fixtures only. It does not call the real Shopify API.&lt;/p&gt;

&lt;p&gt;The workflow shape is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mock REST-style API fixtures
→ paginated orders extraction
→ products / customers extraction
→ field mapping
→ order/customer flattening
→ line item expansion
→ validation
→ CSV / Excel / SQLite export
→ summary tables
→ Markdown report
→ sanitized public outputs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmmpavqv3gw9f7aemiax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmmpavqv3gw9f7aemiax.png" alt="v0.1 mock REST-style workflow run" width="800" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal of v0.1 was to prove the reporting workflow shape.&lt;/p&gt;

&lt;p&gt;In e-commerce reporting, orders are often nested. A single order may contain customer information, shipping fields, fulfillment fields, tax fields, discounts, and line items.&lt;/p&gt;

&lt;p&gt;That data is not always easy to use directly in a spreadsheet.&lt;/p&gt;

&lt;p&gt;A practical reporting workflow usually needs to split the data into tables such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;orders
order_line_items
customers
products
sales_summary_by_month
sales_summary_by_product
customer_order_summary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the main idea behind v0.1.&lt;/p&gt;

&lt;p&gt;The workflow demonstrates how paginated API-style order data can be normalized into reporting-friendly outputs.&lt;/p&gt;
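
&lt;p&gt;As a rough illustration of order flattening and line item expansion, the Python sketch below splits one nested order into an &lt;code&gt;orders&lt;/code&gt; row and a list of &lt;code&gt;order_line_items&lt;/code&gt; rows. It is a simplified example, not the private implementation, and the input field names (&lt;code&gt;id&lt;/code&gt;, &lt;code&gt;customer&lt;/code&gt;, &lt;code&gt;line_items&lt;/code&gt;, and so on) are assumptions about a fake fixture.&lt;/p&gt;

```python
from typing import Any


def flatten_order(
    order: dict[str, Any],
) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Split one nested order into an orders row and its line-item rows."""
    customer = order.get("customer") or {}
    order_row = {
        "order_id": order["id"],
        "created_at": order.get("created_at"),
        "customer_id": customer.get("id"),
        "total_price": order.get("total_price"),
    }
    line_rows = [
        {
            "order_id": order["id"],  # Foreign key back to the orders table.
            "line_number": index,
            "product_id": item.get("product_id"),
            "quantity": item.get("quantity", 0),
            "unit_price": item.get("price"),
        }
        for index, item in enumerate(order.get("line_items", []), start=1)
    ]
    return order_row, line_rows


if __name__ == "__main__":
    # Invented fixture data; no real store or customer is involved.
    fake_order = {
        "id": 1001,
        "created_at": "2026-01-05",
        "customer": {"id": 77},
        "total_price": "35.00",
        "line_items": [
            {"product_id": 501, "quantity": 2, "price": "10.00"},
            {"product_id": 502, "quantity": 1, "price": "15.00"},
        ],
    }
    row, lines = flatten_order(fake_order)
    print(row)
    print(lines)
```

&lt;p&gt;Each line-item row carries the &lt;code&gt;order_id&lt;/code&gt; so the two tables can be joined again in a spreadsheet or a SQLite query.&lt;/p&gt;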

&lt;p&gt;The public repository includes sanitized preview files such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;report_preview.md
orders_cleaned_preview.csv
sales_summary_by_month_preview.csv
sales_summary_by_product_preview.csv
customer_order_summary_preview.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08mmjkuqa6ht8ovpc0q6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08mmjkuqa6ht8ovpc0q6.png" alt="Public-safe sample output previews" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Those previews are intentionally small. They show output shape, not full production coverage.&lt;/p&gt;

&lt;p&gt;The private workflow also has test evidence. The public screenshot is included only to show that the private implementation was checked locally; it does not expose the implementation itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb9jvi7dkrzyt73j2ubv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feb9jvi7dkrzyt73j2ubv.png" alt="Private workflow test evidence" width="799" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why CSV, Excel, and SQLite outputs?
&lt;/h2&gt;

&lt;p&gt;For many small e-commerce reporting requests, the first deliverable is not a data warehouse.&lt;/p&gt;

&lt;p&gt;It is usually something more practical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CSV files for import or review
Excel workbook for store operators
SQLite-style local database for lightweight handoff
Markdown report for validation notes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is why v0.1 focuses on export formats that are easy to inspect.&lt;/p&gt;

&lt;p&gt;A store operator may want an Excel workbook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygv501k4rk7tbwgxd7ps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygv501k4rk7tbwgxd7ps.png" alt="Excel workbook preview" width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A technical client may want CSV files.&lt;/p&gt;

&lt;p&gt;A developer or analyst may want a local SQLite database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ldj92vepwlalsiipuhp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ldj92vepwlalsiipuhp.png" alt="SQLite-style reporting tables preview" width="800" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The workflow also produces a Markdown report preview with extraction, validation, output, and limitation notes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xfwll64kr1f7fnqha6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xfwll64kr1f7fnqha6e.png" alt="Sanitized Markdown report preview" width="800" height="1037"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The important point is not the file format itself. The important point is the handoff:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What data was extracted?
What was normalized?
What warnings were found?
What summaries were generated?
What files were produced?
What assumptions need to be checked?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the kind of reporting workflow clients can review before moving into heavier BI infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  v0.2: GraphQL-shaped mock workflow and cursor pagination
&lt;/h2&gt;

&lt;p&gt;The v0.2 update adds a GraphQL-shaped mock workflow.&lt;/p&gt;

&lt;p&gt;This matters because Shopify's REST Admin API is now a legacy API, and new public apps should be designed around the GraphQL Admin API.&lt;/p&gt;

&lt;p&gt;The v0.2 workflow still does not call Shopify.&lt;/p&gt;

&lt;p&gt;It uses local fake fixtures shaped like GraphQL responses.&lt;/p&gt;

&lt;p&gt;The mock input structure includes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;edges
node
cursor
pageInfo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That makes the case study more realistic than a simple REST-style mock export.&lt;/p&gt;

&lt;p&gt;The v0.2 workflow simulates cursor-style pagination and keeps the same reporting output concept:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fake GraphQL-shaped order data
→ cursor-style pagination simulation
→ GraphQL-style field path mapping
→ normalized reporting tables
→ validation notes
→ sanitized GraphQL report preview
→ REST-to-GraphQL migration summary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvmn95gz29f8hd2gz6mc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvmn95gz29f8hd2gz6mc.png" alt="v0.2 GraphQL-shaped mock workflow run" width="800" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The public repository includes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;report_preview_graphql.md
docs/graphql_workflow_summary.md
docs/rest_to_graphql_mock_migration_summary.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, this is not a production GraphQL Admin API client.&lt;/p&gt;

&lt;p&gt;It does not include real GraphQL queries, OAuth, tokens, real store domains, access scopes, or production connector code.&lt;/p&gt;

&lt;p&gt;The purpose of v0.2 is to show that the reporting workflow design accounts for Shopify's GraphQL direction and for the cursor-style pagination pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why model GraphQL-shaped responses?
&lt;/h2&gt;

&lt;p&gt;A REST-style mock workflow is easy to understand, but it is not enough for a Shopify-aware reporting case study.&lt;/p&gt;

&lt;p&gt;A real Shopify implementation would need to target the current GraphQL Admin API and handle approved access scopes, secure credentials, cursor pagination, rate limits, retries, and store-specific field mapping.&lt;/p&gt;

&lt;p&gt;The public repository does not try to solve all of that.&lt;/p&gt;

&lt;p&gt;Instead, it models the shape of the problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GraphQL connection response
→ cursor pagination state
→ nested node extraction
→ field path mapping
→ normalized reporting tables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is useful because reporting work depends heavily on the shape of the source data.&lt;/p&gt;

&lt;p&gt;If the source response shape changes, the mapping layer changes.&lt;/p&gt;

&lt;p&gt;If pagination changes, the extraction layer changes.&lt;/p&gt;

&lt;p&gt;If the reporting definitions change, the summary layer changes.&lt;/p&gt;

&lt;p&gt;The case study makes those boundaries visible without publishing a production connector.&lt;/p&gt;
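
&lt;p&gt;To make the mapping boundary concrete, here is a minimal sketch of a field path mapping layer. The column names, the dotted paths, and the fake node are illustrative assumptions only. They are not the private field mappings, and a real store would need its own confirmed paths:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical mapping: report column name to dotted path inside a node.
FIELD_PATHS = {
    "order_id": "id",
    "order_name": "name",
    "total_amount": "totalPriceSet.shopMoney.amount",
    "currency": "totalPriceSet.shopMoney.currencyCode",
    "customer_id": "customer.id",
}


def get_path(node, dotted_path, default=None):
    """Walk a nested dict along a dotted path; return default if a step is missing."""
    current = node
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def map_node(node, field_paths=FIELD_PATHS):
    """Flatten one nested node into a single reporting row."""
    return {column: get_path(node, path) for column, path in field_paths.items()}


# Invented fake node; the customer is deliberately null to show the default.
fake_node = {
    "id": "order-1",
    "name": "#1001",
    "totalPriceSet": {"shopMoney": {"amount": "25.00", "currencyCode": "USD"}},
    "customer": None,
}

row = map_node(fake_node)
print(row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the source response shape changes, only &lt;code&gt;FIELD_PATHS&lt;/code&gt; needs to change in this sketch. The pagination and summary code would stay untouched.&lt;/p&gt;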

&lt;h2&gt;
  
  
  What the public repo shows
&lt;/h2&gt;

&lt;p&gt;The public repository shows the case study through documentation, sample outputs, and screenshots.&lt;/p&gt;

&lt;p&gt;The public material demonstrates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a v0.1 REST-style mock workflow run;&lt;/li&gt;
&lt;li&gt;a v0.2 GraphQL-shaped mock workflow run;&lt;/li&gt;
&lt;li&gt;evidence from private test runs;&lt;/li&gt;
&lt;li&gt;public-safe sample outputs;&lt;/li&gt;
&lt;li&gt;Excel-style workbook preview;&lt;/li&gt;
&lt;li&gt;Markdown report preview;&lt;/li&gt;
&lt;li&gt;SQLite-style table preview;&lt;/li&gt;
&lt;li&gt;implementation boundary notes;&lt;/li&gt;
&lt;li&gt;limitations;&lt;/li&gt;
&lt;li&gt;workflow mapping to real client scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The public sample outputs include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;report_preview.md
report_preview_graphql.md
orders_cleaned_preview.csv
sales_summary_by_month_preview.csv
sales_summary_by_product_preview.csv
customer_order_summary_preview.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The screenshots are evidence from the private runnable workflow and sanitized public preview files.&lt;/p&gt;

&lt;p&gt;They are included to show workflow behavior and output shape, not to expose the implementation.&lt;/p&gt;

&lt;p&gt;This distinction is important.&lt;/p&gt;

&lt;p&gt;A screenshot can show that a workflow exists and what it produces. It should not expose credentials, tokens, real store domains, client data, private paths, complete mock fixtures, or source code from the private implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is intentionally out of scope
&lt;/h2&gt;

&lt;p&gt;This repository is intentionally not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a Shopify app;&lt;/li&gt;
&lt;li&gt;a production Shopify connector;&lt;/li&gt;
&lt;li&gt;a public runnable implementation;&lt;/li&gt;
&lt;li&gt;a complete GraphQL Admin API client;&lt;/li&gt;
&lt;li&gt;an OAuth implementation;&lt;/li&gt;
&lt;li&gt;a live-store integration;&lt;/li&gt;
&lt;li&gt;a webhook service;&lt;/li&gt;
&lt;li&gt;a full data warehouse;&lt;/li&gt;
&lt;li&gt;a BI dashboard;&lt;/li&gt;
&lt;li&gt;a SaaS product;&lt;/li&gt;
&lt;li&gt;a low-code or n8n workflow;&lt;/li&gt;
&lt;li&gt;an AI agent workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does not include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;real Shopify tokens;&lt;/li&gt;
&lt;li&gt;store domains;&lt;/li&gt;
&lt;li&gt;client data;&lt;/li&gt;
&lt;li&gt;raw API responses;&lt;/li&gt;
&lt;li&gt;complete field mappings;&lt;/li&gt;
&lt;li&gt;production GraphQL query templates;&lt;/li&gt;
&lt;li&gt;production connector code;&lt;/li&gt;
&lt;li&gt;complete mock datasets;&lt;/li&gt;
&lt;li&gt;private implementation paths.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A real Shopify reporting project would need to confirm many things before implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;required Shopify objects;&lt;/li&gt;
&lt;li&gt;access scopes;&lt;/li&gt;
&lt;li&gt;authentication approach;&lt;/li&gt;
&lt;li&gt;pagination behavior;&lt;/li&gt;
&lt;li&gt;rate limits and retries;&lt;/li&gt;
&lt;li&gt;reporting metric definitions;&lt;/li&gt;
&lt;li&gt;output file requirements;&lt;/li&gt;
&lt;li&gt;customer data privacy requirements;&lt;/li&gt;
&lt;li&gt;store-specific product, variant, discount, refund, tax, shipping, fulfillment, and channel fields.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The public repository does not hide those requirements. It documents the boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  How this maps to real client work
&lt;/h2&gt;

&lt;p&gt;This type of workflow maps to practical e-commerce reporting requests.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Shopify order export to CSV
API to Excel reporting workbook
product and customer cleanup
order line item expansion
sales summary by month
sales summary by product
customer order summary
API-to-database workflow
GraphQL migration-aware reporting workflow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The value is not just extracting data.&lt;/p&gt;

&lt;p&gt;The value is structuring the workflow so another person can understand it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;extract
→ map
→ validate
→ normalize
→ summarize
→ export
→ report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a small reporting automation project, this can be a useful first stage before investing in a larger dashboard, warehouse, or SaaS tool.&lt;/p&gt;

&lt;p&gt;The workflow also gives the technical discussion a safer starting point.&lt;/p&gt;

&lt;p&gt;Instead of jumping straight into implementation, the project encourages questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which Shopify objects matter?&lt;/li&gt;
&lt;li&gt;Which fields should be included?&lt;/li&gt;
&lt;li&gt;How should refunds and discounts be counted?&lt;/li&gt;
&lt;li&gt;Should reporting use gross sales or net sales?&lt;/li&gt;
&lt;li&gt;Are taxes and shipping included?&lt;/li&gt;
&lt;li&gt;Which output format is easiest to review?&lt;/li&gt;
&lt;li&gt;Should the workflow produce a local database?&lt;/li&gt;
&lt;li&gt;What data should be excluded for privacy reasons?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those questions are part of the engineering work.&lt;/p&gt;
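
&lt;p&gt;The gross versus net question is a good example of why the definitions need to be agreed first. The sketch below uses one possible set of definitions and invented amounts. Both are assumptions for illustration, and a client may count refunds, taxes, or shipping differently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from decimal import Decimal

# Hypothetical order amounts, invented for illustration.
order = {
    "line_items_total": Decimal("120.00"),
    "discounts": Decimal("20.00"),
    "refunds": Decimal("15.00"),
    "tax": Decimal("8.50"),
    "shipping": Decimal("5.00"),
}


def gross_sales(o):
    """Assumed definition: line item value before discounts and refunds."""
    return o["line_items_total"]


def net_sales(o):
    """Assumed definition: gross sales minus discounts and refunds."""
    return o["line_items_total"] - o["discounts"] - o["refunds"]


def total_sales(o):
    """Assumed definition: net sales plus tax and shipping."""
    return net_sales(o) + o["tax"] + o["shipping"]


print("gross:", gross_sales(order))
print("net:  ", net_sales(order))
print("total:", total_sales(order))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these numbers the same order reports as 120.00, 85.00, or 98.50 depending on which definition the summary uses.&lt;/p&gt;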

&lt;h2&gt;
  
  
  What I would validate next
&lt;/h2&gt;

&lt;p&gt;The next step would not be to publish the private implementation.&lt;/p&gt;

&lt;p&gt;Instead, I would validate the case study against more realistic reporting requirements.&lt;/p&gt;

&lt;p&gt;Areas to validate next include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;order-level and line-item-level metric definitions;&lt;/li&gt;
&lt;li&gt;refund and cancellation handling;&lt;/li&gt;
&lt;li&gt;product variant mapping;&lt;/li&gt;
&lt;li&gt;discount and tax treatment;&lt;/li&gt;
&lt;li&gt;shipping and fulfillment status fields;&lt;/li&gt;
&lt;li&gt;customer privacy handling;&lt;/li&gt;
&lt;li&gt;incremental sync assumptions;&lt;/li&gt;
&lt;li&gt;GraphQL rate-limit and retry strategy;&lt;/li&gt;
&lt;li&gt;client-specific Excel workbook layout;&lt;/li&gt;
&lt;li&gt;whether the final handoff should be CSV, Excel, SQLite, PostgreSQL, or BI-ready tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I would also keep the public/private boundary in place.&lt;/p&gt;

&lt;p&gt;The public repo should remain a case-study asset.&lt;/p&gt;

&lt;p&gt;The private implementation should remain reusable, adaptable, and safe for commercial delivery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing summary
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;data-quality-etl-starter&lt;/code&gt; shows the general data workflow pattern.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;shopify-api-reporting-workflow&lt;/code&gt; applies the same thinking to a vertical e-commerce reporting scenario.&lt;/p&gt;

&lt;p&gt;The first project proves the reusable data quality workflow.&lt;/p&gt;

&lt;p&gt;The second project shows how that workflow thinking can be narrowed into a Shopify-style API reporting case study with public-safe documentation, sanitized output previews, REST-style workflow evidence, and GraphQL-shaped pagination awareness.&lt;/p&gt;

&lt;p&gt;It is not a Shopify app.&lt;/p&gt;

&lt;p&gt;It is not a production connector.&lt;/p&gt;

&lt;p&gt;It is not a public runnable package.&lt;/p&gt;

&lt;p&gt;It is a transparent portfolio case study for a practical reporting workflow: API-shaped e-commerce data into validated, normalized, reporting-ready outputs.&lt;/p&gt;

</description>
      <category>python</category>
      <category>shopify</category>
      <category>api</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Running a Real Retail Dataset Through a Python Data Quality Workflow</title>
      <dc:creator>Bob Oner</dc:creator>
      <pubDate>Tue, 16 Jun 2026 02:54:17 +0000</pubDate>
      <link>https://dev.to/bob_oner/running-a-real-retail-dataset-through-a-python-data-quality-workflow-490b</link>
      <guid>https://dev.to/bob_oner/running-a-real-retail-dataset-through-a-python-data-quality-workflow-490b</guid>
      <description>&lt;p&gt;In the previous article, I extended a small Python data quality ETL starter with AI-ready data preparation.&lt;/p&gt;

&lt;p&gt;The important constraint was that the workflow did not call an LLM API, generate embeddings, or train a model. It prepared structured data assets such as schema profiles, data dictionaries, validation summaries, feature-ready CSV files, and manifest files.&lt;/p&gt;

&lt;p&gt;Previous article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/bob_oner/preparing-ai-ready-data-without-calling-an-llm-api-5daf"&gt;Preparing AI-Ready Data Without Calling an LLM API&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This follow-up focuses on the v0.7.0 update of the same project:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/data-quality-etl-starter" rel="noopener noreferrer"&gt;Data Quality ETL Starter on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The new goal is to move beyond synthetic demo data and show that the same data quality workflow can process a public retail/e-commerce-style dataset locally.&lt;/p&gt;

&lt;p&gt;This is still not a big data platform, a production retail analytics system, a benchmark leaderboard, or a public dataset redistribution repository.&lt;/p&gt;

&lt;p&gt;The goal is narrower and more practical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;manually downloaded public retail dataset
        ↓
prepare_real_dataset_demo.py
        ↓
normalized retail transaction CSV
        ↓
existing CLI validation and cleaning workflow
        ↓
quality reports + SQLite export
        ↓
run_real_dataset_benchmark.py
        ↓
benchmark report + summary CSV outputs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is a useful next step for a portfolio project because it shows the workflow can handle a more realistic dataset while still keeping data handling, scope, and reproducibility clear.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why add a real dataset benchmark?
&lt;/h2&gt;

&lt;p&gt;Earlier versions of this project used small sample files and generated synthetic order data.&lt;/p&gt;

&lt;p&gt;That is useful for testing and documentation, but it leaves one practical question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can the workflow handle a public dataset that was not designed specifically for this repository?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;v0.7.0 adds an optional real dataset benchmark path to answer that question.&lt;/p&gt;

&lt;p&gt;The workflow now demonstrates how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;take a public retail transaction dataset;&lt;/li&gt;
&lt;li&gt;keep the raw dataset local-only;&lt;/li&gt;
&lt;li&gt;map external source columns into a project-friendly schema;&lt;/li&gt;
&lt;li&gt;derive practical fields such as revenue and cancellation flags;&lt;/li&gt;
&lt;li&gt;reuse the existing CLI validation and cleaning workflow;&lt;/li&gt;
&lt;li&gt;generate Markdown and JSON quality reports;&lt;/li&gt;
&lt;li&gt;export cleaned data to SQLite;&lt;/li&gt;
&lt;li&gt;produce benchmark evidence and summary CSV files.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key design choice is that the existing CLI remains the source of truth.&lt;/p&gt;

&lt;p&gt;The real dataset path does not become a separate pipeline. It prepares the source data, then passes it through the same validation and cleaning workflow used by the rest of the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dataset used in v0.7.0
&lt;/h2&gt;

&lt;p&gt;The default v0.7.0 dataset is the UCI Online Retail dataset.&lt;/p&gt;

&lt;p&gt;Official source:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://archive.ics.uci.edu/dataset/352/online%2Bretail" rel="noopener noreferrer"&gt;UCI Machine Learning Repository: Online Retail&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Citation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chen, D. (2015). Online Retail [Dataset]. UCI Machine Learning Repository.
https://doi.org/10.24432/C5BW33
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;License note:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Creative Commons Attribution 4.0 International (CC BY 4.0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dataset is useful for this project because it is retail/e-commerce-adjacent and transaction-shaped. It includes fields that map naturally into an invoice/order-style workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;InvoiceNo
StockCode
Description
Quantity
InvoiceDate
UnitPrice
CustomerID
Country
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project maps those source columns into normalized &lt;code&gt;snake_case&lt;/code&gt; columns and adds derived fields.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is kept out of Git
&lt;/h2&gt;

&lt;p&gt;This part is important.&lt;/p&gt;

&lt;p&gt;The repository does not redistribute the full raw UCI dataset. It also does not commit full normalized or cleaned real dataset outputs.&lt;/p&gt;

&lt;p&gt;These paths are local-only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/external/
data/raw/public/
data/output/real_dataset/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The repository keeps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;source code;&lt;/li&gt;
&lt;li&gt;schema files;&lt;/li&gt;
&lt;li&gt;tests;&lt;/li&gt;
&lt;li&gt;documentation;&lt;/li&gt;
&lt;li&gt;screenshots;&lt;/li&gt;
&lt;li&gt;small sample inputs;&lt;/li&gt;
&lt;li&gt;instructions for running the workflow locally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does not keep:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;full downloaded raw datasets;&lt;/li&gt;
&lt;li&gt;full normalized real dataset outputs;&lt;/li&gt;
&lt;li&gt;full cleaned real dataset outputs;&lt;/li&gt;
&lt;li&gt;local SQLite files generated from real datasets;&lt;/li&gt;
&lt;li&gt;private customer data;&lt;/li&gt;
&lt;li&gt;client data;&lt;/li&gt;
&lt;li&gt;API credentials;&lt;/li&gt;
&lt;li&gt;tokens or secrets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps the repository lightweight and avoids turning it into a dataset mirror.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzqv8xdawan9qw018n6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzqv8xdawan9qw018n6h.png" alt="Real dataset source and redistribution note" width="800" height="1157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What v0.7.0 adds
&lt;/h2&gt;

&lt;p&gt;The most relevant new files are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scripts/prepare_real_dataset_demo.py
scripts/run_real_dataset_benchmark.py
src/dq_etl_starter/real_dataset.py
docs/data_sources.md
docs/real_dataset_benchmark.md
docs/limitations.md
data/expected/online_retail_schema.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The real dataset helper module handles the project-specific mapping and summary logic.&lt;/p&gt;

&lt;p&gt;The two scripts provide a simple local workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;prepare the manually downloaded dataset into a normalized CSV;&lt;/li&gt;
&lt;li&gt;generate local benchmark evidence and summary outputs after the CLI quality workflow runs.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Project structure after the update
&lt;/h2&gt;

&lt;p&gt;The project now has a clearer path from messy input files to public-dataset benchmark evidence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data-quality-etl-starter/
├── data/
│   ├── expected/
│   │   └── online_retail_schema.json
│   └── output/
├── docs/
│   ├── data_sources.md
│   ├── limitations.md
│   └── real_dataset_benchmark.md
├── screenshots/
├── scripts/
│   ├── prepare_real_dataset_demo.py
│   └── run_real_dataset_benchmark.py
├── src/dq_etl_starter/
│   ├── real_dataset.py
│   ├── cli.py
│   ├── clean.py
│   ├── report.py
│   └── validate.py
└── tests/
    ├── test_real_dataset.py
    └── test_real_dataset_benchmark.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The real dataset path is optional. The default small sample workflows remain unchanged.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install the project locally
&lt;/h2&gt;

&lt;p&gt;Clone the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/OnerGit/data-quality-etl-starter.git
&lt;span class="nb"&gt;cd &lt;/span&gt;data-quality-etl-starter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate it on macOS or Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate it on Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;venv&lt;/span&gt;&lt;span class="n"&gt;\Scripts\activate&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install dependencies and the local package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



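&lt;p&gt;The editable install step matters because the project uses a &lt;code&gt;src/&lt;/code&gt; layout: the package is not importable from the repository root until it is installed, unless the Python path is adjusted by hand.&lt;/p&gt;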

&lt;h2&gt;
  
  
  Step 1: Download the public dataset manually
&lt;/h2&gt;

&lt;p&gt;Download the UCI Online Retail dataset from the official UCI Machine Learning Repository page.&lt;/p&gt;

&lt;p&gt;Place the file here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/external/online_retail.xlsx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project does not automatically download the dataset by default.&lt;/p&gt;

&lt;p&gt;That is intentional.&lt;/p&gt;

&lt;p&gt;For a public portfolio repository, I prefer to keep the data acquisition step explicit. It makes the source, license, citation, and local-only handling policy easier to review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Prepare the normalized dataset
&lt;/h2&gt;

&lt;p&gt;Run the preparation script.&lt;/p&gt;

&lt;p&gt;macOS / Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/prepare_real_dataset_demo.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--raw-input&lt;/span&gt; data/external/online_retail.xlsx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; data/output/real_dataset/online_retail_normalized.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;scripts/prepare_real_dataset_demo.py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--raw-input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/external/online_retail.xlsx&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/output/real_dataset/online_retail_normalized.csv&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step reads the local source file, validates expected source columns, maps UCI columns into project-friendly names, derives additional fields, and writes a normalized CSV.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhac0tzy4s5fjhqyzg19h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhac0tzy4s5fjhqyzg19h.png" alt="Real dataset preparation run" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The normalized output columns are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;invoice_no
stock_code
description
quantity
invoice_date
unit_price
customer_id
country
revenue
is_cancellation
source_dataset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The derived fields are simple but useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;revenue&lt;/code&gt; is derived from quantity and unit price;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;is_cancellation&lt;/code&gt; marks cancellation-style rows;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;source_dataset&lt;/code&gt; records dataset lineage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This preparation layer is deliberately small. It does not try to perform all business logic. It only converts the external dataset into a shape that the existing project workflow can validate and clean.&lt;/p&gt;
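
&lt;p&gt;A minimal sketch of that preparation step for a single row could look like the following. It is not the code in &lt;code&gt;real_dataset.py&lt;/code&gt;. The &lt;code&gt;normalize_row&lt;/code&gt; helper is hypothetical, the sample row is invented rather than copied from the dataset, and the cancellation rule is an assumption based on the UCI description, which notes that invoice numbers starting with the letter "c" indicate a cancellation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from decimal import Decimal

# Source column name to normalized snake_case name.
COLUMN_MAP = {
    "InvoiceNo": "invoice_no",
    "StockCode": "stock_code",
    "Description": "description",
    "Quantity": "quantity",
    "InvoiceDate": "invoice_date",
    "UnitPrice": "unit_price",
    "CustomerID": "customer_id",
    "Country": "country",
}


def normalize_row(source_row, source_dataset="uci_online_retail"):
    """Rename source columns and add the three derived fields."""
    missing = [name for name in COLUMN_MAP if name not in source_row]
    if missing:
        raise ValueError("missing source columns: " + ", ".join(missing))

    row = {target: source_row[source] for source, target in COLUMN_MAP.items()}
    quantity = int(row["quantity"])
    unit_price = Decimal(str(row["unit_price"]))
    row["revenue"] = quantity * unit_price
    # Assumed rule: an invoice number starting with "C" marks a cancellation.
    row["is_cancellation"] = str(row["invoice_no"]).upper().startswith("C")
    row["source_dataset"] = source_dataset
    return row


# Invented sample row, shaped like the source columns.
sample = {
    "InvoiceNo": "C000001",
    "StockCode": "SKU-1",
    "Description": "SAMPLE ITEM",
    "Quantity": -2,
    "InvoiceDate": "2011-01-01 10:00:00",
    "UnitPrice": "4.25",
    "CustomerID": "10001",
    "Country": "United Kingdom",
}

normalized = normalize_row(sample)
print(normalized["revenue"], normalized["is_cancellation"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Checking for missing source columns before renaming means a changed upstream file fails early with a readable message instead of producing a partly filled CSV.&lt;/p&gt;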

&lt;h2&gt;
  
  
  Step 3: Run the existing CLI workflow
&lt;/h2&gt;

&lt;p&gt;After preparation, the normalized CSV is passed into the existing CLI workflow.&lt;/p&gt;

&lt;p&gt;macOS / Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; dq_etl_starter.cli run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; data/output/real_dataset/online_retail_normalized.csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input-type&lt;/span&gt; csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--schema&lt;/span&gt; data/expected/online_retail_schema.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; data/output/real_dataset/run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-target&lt;/span&gt; sqlite &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--table-name&lt;/span&gt; cleaned_online_retail
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-m&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;dq_etl_starter.cli&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/output/real_dataset/online_retail_normalized.csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--input-type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--schema&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/expected/online_retail_schema.json&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--output-dir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/output/real_dataset/run&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--db-target&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;sqlite&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--table-name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;cleaned_online_retail&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected local outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/real_dataset/run/cleaned_online_retail.csv
data/output/real_dataset/run/etl_output.sqlite
data/output/real_dataset/run/quality_report.md
data/output/real_dataset/run/quality_report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foeueblimhk6v90rmpxtq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foeueblimhk6v90rmpxtq.png" alt="Existing CLI workflow running against normalized real dataset" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the most important design point in v0.7.0.&lt;/p&gt;

&lt;p&gt;The real dataset path reuses the existing validation and cleaning workflow. It does not create a special one-off script that bypasses the project architecture.&lt;/p&gt;
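
&lt;p&gt;One quick way to sanity-check the SQLite export is to count rows in the target table with Python's standard &lt;code&gt;sqlite3&lt;/code&gt; module. The sketch below is an assumption about how such a check could look, not part of the project. To stay self-contained it builds a tiny in-memory table with invented rows instead of opening &lt;code&gt;etl_output.sqlite&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3


def table_row_count(connection, table_name):
    """Count rows in a table after a basic check on the table name."""
    if not table_name.isidentifier():
        raise ValueError("unexpected table name: " + table_name)
    query = 'SELECT COUNT(*) FROM "' + table_name + '"'
    return connection.execute(query).fetchone()[0]


# In-memory stand-in for the exported database, with invented rows.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE cleaned_online_retail (invoice_no TEXT, revenue REAL)"
)
connection.executemany(
    "INSERT INTO cleaned_online_retail VALUES (?, ?)",
    [("A1", 15.3), ("A2", 20.0)],
)

row_count = table_row_count(connection, "cleaned_online_retail")
print(row_count)
connection.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Against a real run, the same helper could be pointed at a connection opened on the exported file, and the count compared with the cleaned row count in the quality report.&lt;/p&gt;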

&lt;h2&gt;
  
  
  Schema for the normalized retail dataset
&lt;/h2&gt;

&lt;p&gt;The schema file is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/expected/online_retail_schema.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It defines the expected normalized columns and validation rules for fields such as invoice number, stock code, quantity, invoice date, unit price, customer ID, country, revenue, cancellation flag, and source dataset.&lt;/p&gt;

&lt;p&gt;The schema is not intended to certify the dataset as business-ready.&lt;/p&gt;

&lt;p&gt;It is a practical contract for this starter workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;external retail columns
        ↓
normalized project columns
        ↓
expected schema rules
        ↓
quality report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is a useful handoff pattern because the next person can inspect both the mapping and the validation report.&lt;/p&gt;
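
&lt;p&gt;The idea of a schema as a contract can be sketched in a few lines. The schema layout below is hypothetical, and the real &lt;code&gt;online_retail_schema.json&lt;/code&gt; may be structured differently. The &lt;code&gt;validate_row&lt;/code&gt; helper is likewise an illustration, not the project's validator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical schema layout; the project's JSON file may differ.
SCHEMA = {
    "columns": {
        "invoice_no": {"type": "string", "required": True},
        "quantity": {"type": "integer", "required": True},
        "unit_price": {"type": "number", "required": True},
        "customer_id": {"type": "string", "required": False},
    }
}


def _matches(value, type_name):
    """Check a value against one of the three type names used above."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    if type_name == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return False


def validate_row(row, schema=SCHEMA):
    """Return a list of human-readable issues for one normalized row."""
    issues = []
    for name, rule in schema["columns"].items():
        value = row.get(name)
        if value is None or value == "":
            if rule["required"]:
                issues.append(name + ": required value is missing")
            continue
        if not _matches(value, rule["type"]):
            issues.append(name + ": expected " + rule["type"])
    return issues


good = {"invoice_no": "A1", "quantity": 6, "unit_price": 2.5, "customer_id": ""}
bad = {"invoice_no": "A2", "quantity": "six", "unit_price": None}

print(validate_row(good))
print(validate_row(bad))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returning issues as a list, rather than raising on the first problem, is what lets a quality report summarize everything that was found in one pass.&lt;/p&gt;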

&lt;h2&gt;
  
  
  Quality report
&lt;/h2&gt;

&lt;p&gt;The CLI workflow writes a Markdown report and a JSON report.&lt;/p&gt;

&lt;p&gt;For the real dataset workflow, the Markdown report is written to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/real_dataset/run/quality_report.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd16k5yukz427rdq5bnm1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd16k5yukz427rdq5bnm1.png" alt="Real dataset quality report" width="800" height="1476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The report is useful because it records what the workflow found rather than only producing a cleaned file.&lt;/p&gt;

&lt;p&gt;Typical report sections include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;raw row count;&lt;/li&gt;
&lt;li&gt;cleaned row count;&lt;/li&gt;
&lt;li&gt;missing values by column;&lt;/li&gt;
&lt;li&gt;duplicate row count;&lt;/li&gt;
&lt;li&gt;expected column checks;&lt;/li&gt;
&lt;li&gt;validation issue summaries;&lt;/li&gt;
&lt;li&gt;output file paths.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a repeatable data workflow, this is important. A cleaned output file alone is not enough. The workflow should also explain what was detected and what still needs review.&lt;/p&gt;
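
&lt;p&gt;A few of those report sections can be computed with the standard library alone. The sketch below is a simplified assumption about what such a summary might contain. The key names, the &lt;code&gt;build_quality_summary&lt;/code&gt; helper, and the sample rows are invented and do not reflect the project's actual JSON report format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from collections import Counter


def build_quality_summary(raw_rows, cleaned_rows, expected_columns):
    """Collect row counts, missing values per column, and duplicate rows."""
    missing_by_column = {
        column: sum(1 for row in raw_rows if row.get(column) in (None, ""))
        for column in expected_columns
    }
    # A row counts as a duplicate when all expected columns repeat exactly.
    row_keys = Counter(
        tuple(row.get(column) for column in expected_columns) for row in raw_rows
    )
    duplicate_rows = sum(count - 1 for count in row_keys.values())
    return {
        "raw_row_count": len(raw_rows),
        "cleaned_row_count": len(cleaned_rows),
        "missing_values_by_column": missing_by_column,
        "duplicate_row_count": duplicate_rows,
    }


# Invented rows: one exact duplicate and one missing customer_id.
raw = [
    {"invoice_no": "A1", "customer_id": "10001"},
    {"invoice_no": "A1", "customer_id": "10001"},
    {"invoice_no": "A2", "customer_id": ""},
]
cleaned = [raw[0], raw[2]]

summary = build_quality_summary(raw, cleaned, ["invoice_no", "customer_id"])
print(json.dumps(summary, indent=2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;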

&lt;h2&gt;
  
  
  Step 4: Generate the real dataset benchmark report
&lt;/h2&gt;

&lt;p&gt;After the CLI workflow finishes, generate a local benchmark report and summary outputs.&lt;/p&gt;

&lt;p&gt;macOS / Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/run_real_dataset_benchmark.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--normalized-input&lt;/span&gt; data/output/real_dataset/online_retail_normalized.csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quality-report&lt;/span&gt; data/output/real_dataset/run/quality_report.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; data/output/real_dataset &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dataset-name&lt;/span&gt; uci_online_retail
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;scripts/run_real_dataset_benchmark.py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--normalized-input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/output/real_dataset/online_retail_normalized.csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--quality-report&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/output/real_dataset/run/quality_report.json&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--output-dir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/output/real_dataset&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--dataset-name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;uci_online_retail&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected local outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/real_dataset/benchmark_report.md
data/output/real_dataset/summary/revenue_by_country.csv
data/output/real_dataset/summary/revenue_by_month.csv
data/output/real_dataset/summary/cancellation_summary.csv
data/output/real_dataset/summary/missing_customer_summary.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2h630ddfl36lauxhz38w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2h630ddfl36lauxhz38w.png" alt="Real dataset benchmark report" width="800" height="1167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The benchmark report is not a universal performance claim.&lt;/p&gt;

&lt;p&gt;It is local evidence for this machine, this dependency environment, and this dataset preparation flow.&lt;/p&gt;

&lt;p&gt;That distinction matters. Runtime can change depending on CPU, disk speed, Python version, package versions, source file format, operating system, and local machine conditions.&lt;/p&gt;
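
&lt;p&gt;One way to keep a timing honest is to store it next to the environment it was measured in. The sketch below is an assumption about how that could look; &lt;code&gt;timed_run&lt;/code&gt; and the record keys are hypothetical names, and the workload is a stand-in rather than the real benchmark script.&lt;/p&gt;

```python
import platform
import time


def timed_run(task):
    """Run task() and return its result plus local timing and environment details."""
    started = time.perf_counter()
    result = task()
    elapsed = time.perf_counter() - started
    record = {
        "elapsed_seconds": round(elapsed, 4),
        "python_version": platform.python_version(),
        "operating_system": platform.system(),
        "machine": platform.machine(),
    }
    return result, record


# Stand-in workload; a real run would call the dataset preparation step instead.
total, run_record = timed_run(lambda: sum(range(100_000)))
print(total)
print(run_record)
```

&lt;p&gt;With the environment fields in the same record, a reader can see at a glance that the number applies to one machine and one Python version.&lt;/p&gt;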

&lt;h2&gt;
  
  
  Summary outputs
&lt;/h2&gt;

&lt;p&gt;The benchmark script also writes lightweight summary CSV files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9q4b6ssuorcgv8kyb9g4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9q4b6ssuorcgv8kyb9g4.png" alt="Real dataset summary outputs" width="799" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The summary outputs are intentionally simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;revenue_by_country.csv
revenue_by_month.csv
cancellation_summary.csv
missing_customer_summary.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;They are not a full BI model.&lt;/p&gt;

&lt;p&gt;They are small reporting-ready outputs that show how a cleaned retail transaction dataset can be summarized after validation.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;revenue_by_country.csv&lt;/code&gt; supports country-level revenue inspection;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;revenue_by_month.csv&lt;/code&gt; supports monthly trend inspection;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cancellation_summary.csv&lt;/code&gt; records counts of cancellation rows and rows with non-positive values;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;missing_customer_summary.csv&lt;/code&gt; helps inspect where customer IDs are missing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is often enough for a first local reporting workflow.&lt;/p&gt;
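
&lt;p&gt;As an illustration of how small these summaries can be, the following sketch aggregates revenue by country and by month. The column names (&lt;code&gt;country&lt;/code&gt;, &lt;code&gt;invoice_date&lt;/code&gt;, &lt;code&gt;quantity&lt;/code&gt;, &lt;code&gt;unit_price&lt;/code&gt;) and the helper &lt;code&gt;summarize_revenue&lt;/code&gt; are assumptions for this example, not the benchmark script's real code.&lt;/p&gt;

```python
from collections import defaultdict
from decimal import Decimal


def summarize_revenue(rows):
    """Aggregate revenue by country and by month (illustrative column names)."""
    by_country = defaultdict(Decimal)
    by_month = defaultdict(Decimal)
    for row in rows:
        # Decimal avoids float rounding noise when summing currency values.
        revenue = Decimal(row["quantity"]) * Decimal(row["unit_price"])
        by_country[row["country"]] += revenue
        # ISO dates start with YYYY-MM, so the first seven characters identify the month.
        by_month[row["invoice_date"][:7]] += revenue
    return dict(by_country), dict(by_month)


rows = [
    {"country": "United Kingdom", "invoice_date": "2010-12-01", "quantity": "6", "unit_price": "2.55"},
    {"country": "United Kingdom", "invoice_date": "2011-01-04", "quantity": "2", "unit_price": "3.39"},
    {"country": "France", "invoice_date": "2010-12-01", "quantity": "24", "unit_price": "0.85"},
]
by_country, by_month = summarize_revenue(rows)
for country, revenue in sorted(by_country.items()):
    print(country, revenue)
for month, revenue in sorted(by_month.items()):
    print(month, revenue)
```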

&lt;p&gt;The next version could load these into PostgreSQL, query them in DuckDB, or feed a local dashboard, but v0.7.0 intentionally keeps the real dataset path focused.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the benchmark report records
&lt;/h2&gt;

&lt;p&gt;The benchmark report is designed to answer practical questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which dataset was used?&lt;/li&gt;
&lt;li&gt;Where did the normalized input come from?&lt;/li&gt;
&lt;li&gt;How many rows were normalized?&lt;/li&gt;
&lt;li&gt;How many rows reached the CLI quality workflow?&lt;/li&gt;
&lt;li&gt;How many duplicate rows were detected?&lt;/li&gt;
&lt;li&gt;How many cancellation rows were identified?&lt;/li&gt;
&lt;li&gt;How many customer IDs or descriptions were missing?&lt;/li&gt;
&lt;li&gt;Were invoice dates, quantities, and prices validated?&lt;/li&gt;
&lt;li&gt;What files were produced?&lt;/li&gt;
&lt;li&gt;What limitations apply?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That makes the run easier to review later.&lt;/p&gt;

&lt;p&gt;It also makes the project stronger as a portfolio asset because the workflow is not only described in prose. It leaves behind files, screenshots, reports, and commands that can be inspected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not automatically download the dataset?
&lt;/h2&gt;

&lt;p&gt;In principle, the project could download the dataset automatically.&lt;/p&gt;

&lt;p&gt;For this version, I chose not to do that.&lt;/p&gt;

&lt;p&gt;Manual download keeps the workflow clearer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the user sees the official source page;&lt;/li&gt;
&lt;li&gt;the dataset citation remains visible;&lt;/li&gt;
&lt;li&gt;the license note is explicit;&lt;/li&gt;
&lt;li&gt;the repository does not redistribute the raw dataset;&lt;/li&gt;
&lt;li&gt;the workflow does not depend on hidden network access;&lt;/li&gt;
&lt;li&gt;local-only data handling is easier to explain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a small portfolio repository, this is a reasonable trade-off.&lt;/p&gt;

&lt;p&gt;The project demonstrates how to process the dataset, not how to become a dataset distribution tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests
&lt;/h2&gt;

&lt;p&gt;Byte-compile the sources as a quick syntax check, then run the full test suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; compileall &lt;span class="nt"&gt;-q&lt;/span&gt; src/dq_etl_starter
python &lt;span class="nt"&gt;-m&lt;/span&gt; compileall &lt;span class="nt"&gt;-q&lt;/span&gt; scripts
pytest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the v0.7-related tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest tests/test_real_dataset.py
pytest tests/test_real_dataset_benchmark.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tests focus on the reusable code paths rather than requiring the full external dataset to be committed.&lt;/p&gt;

&lt;p&gt;That is another useful pattern for public repositories: test the transformation logic with small fixtures, and keep the large external dataset local-only.&lt;/p&gt;
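
&lt;p&gt;A fixture-based test can be only a few lines long. The sketch below is hypothetical: &lt;code&gt;flag_cancellations&lt;/code&gt; and its rule (invoice numbers starting with "C" mark cancellations) are assumptions for illustration and do not mirror the repository's test files. The function is written in pytest style but also runs as a plain script.&lt;/p&gt;

```python
import csv
import io

# Tiny inline fixture; the full external dataset stays on the local machine.
FIXTURE_CSV = """InvoiceNo,Quantity,UnitPrice
536365,6,2.55
C536379,-1,27.50
"""


def flag_cancellations(rows):
    """Add an is_cancellation flag per row (assumed rule for this sketch)."""
    return [dict(row, is_cancellation=row["InvoiceNo"].startswith("C")) for row in rows]


def test_flag_cancellations_on_small_fixture():
    rows = list(csv.DictReader(io.StringIO(FIXTURE_CSV)))
    flagged = flag_cancellations(rows)
    assert [row["is_cancellation"] for row in flagged] == [False, True]


# pytest would discover the test by name; calling it directly also works.
test_flag_cancellations_on_small_fixture()
```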

&lt;h2&gt;
  
  
  What is intentionally out of scope?
&lt;/h2&gt;

&lt;p&gt;The v0.7.0 real dataset benchmark does not add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;automatic dataset download;&lt;/li&gt;
&lt;li&gt;raw dataset redistribution;&lt;/li&gt;
&lt;li&gt;production scheduling;&lt;/li&gt;
&lt;li&gt;Airflow orchestration;&lt;/li&gt;
&lt;li&gt;dbt modeling;&lt;/li&gt;
&lt;li&gt;Snowflake, Databricks, or PySpark;&lt;/li&gt;
&lt;li&gt;production-scale retail analytics;&lt;/li&gt;
&lt;li&gt;a complete BI dashboard;&lt;/li&gt;
&lt;li&gt;a benchmark leaderboard;&lt;/li&gt;
&lt;li&gt;machine learning training;&lt;/li&gt;
&lt;li&gt;LLM calls;&lt;/li&gt;
&lt;li&gt;RAG or AI agent features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project is still a small Python data workflow starter.&lt;/p&gt;

&lt;p&gt;The v0.7.0 update demonstrates a specific point: the workflow can be applied to a public retail transaction dataset locally, while keeping the data handling policy, validation steps, outputs, and limitations clear.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would improve next
&lt;/h2&gt;

&lt;p&gt;Possible next improvements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;adding a Makefile for repeated demo commands;&lt;/li&gt;
&lt;li&gt;adding a smaller public fixture for faster walkthroughs;&lt;/li&gt;
&lt;li&gt;adding optional DuckDB queries for the real dataset summaries;&lt;/li&gt;
&lt;li&gt;adding optional PostgreSQL reporting tables for the real dataset path;&lt;/li&gt;
&lt;li&gt;adding a short CI workflow for core tests;&lt;/li&gt;
&lt;li&gt;improving benchmark report formatting;&lt;/li&gt;
&lt;li&gt;adding more detailed data source mapping documentation;&lt;/li&gt;
&lt;li&gt;adding a second public dataset only if it does not make the project too broad.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main constraint remains the same:&lt;/p&gt;

&lt;p&gt;Keep the project small, reproducible, inspectable, and easy to adapt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;GitHub repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/data-quality-etl-starter" rel="noopener noreferrer"&gt;https://github.com/OnerGit/data-quality-etl-starter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Previous article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/bob_oner/preparing-ai-ready-data-without-calling-an-llm-api-5daf"&gt;Preparing AI-Ready Data Without Calling an LLM API&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This v0.7.0 update is a practical next step: from synthetic and generated demos to a local public retail dataset benchmark that reuses the same validation, cleaning, reporting, and handoff workflow.&lt;/p&gt;

</description>
      <category>python</category>
      <category>etl</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Preparing AI-Ready Data Without Calling an LLM API</title>
      <dc:creator>Bob Oner</dc:creator>
      <pubDate>Fri, 12 Jun 2026 15:18:41 +0000</pubDate>
      <link>https://dev.to/bob_oner/preparing-ai-ready-data-without-calling-an-llm-api-5daf</link>
      <guid>https://dev.to/bob_oner/preparing-ai-ready-data-without-calling-an-llm-api-5daf</guid>
      <description>&lt;p&gt;In the previous article, I extended a small Python data quality ETL starter from cleaned data into BI-ready reporting tables with PostgreSQL, SQL views, and an optional Metabase dashboard.&lt;/p&gt;

&lt;p&gt;Previous article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/bob_oner/from-clean-data-to-bi-ready-reporting-tables-with-python-postgresql-and-metabase-348p"&gt;From Clean Data to BI-Ready Reporting Tables with Python, PostgreSQL, and Metabase&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This follow-up focuses on the v0.6.0 update of the same project:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/data-quality-etl-starter" rel="noopener noreferrer"&gt;Data Quality ETL Starter on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The v0.6.0 update adds an optional &lt;strong&gt;AI-ready data preparation&lt;/strong&gt; demo.&lt;/p&gt;

&lt;p&gt;That phrase can easily become vague, so I want to define it clearly.&lt;/p&gt;

&lt;p&gt;In this project, "AI-ready" does &lt;strong&gt;not&lt;/strong&gt; mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;calling an LLM API;&lt;/li&gt;
&lt;li&gt;generating embeddings;&lt;/li&gt;
&lt;li&gt;creating a vector database;&lt;/li&gt;
&lt;li&gt;building a RAG chatbot;&lt;/li&gt;
&lt;li&gt;training a machine learning model;&lt;/li&gt;
&lt;li&gt;adding an AI agent;&lt;/li&gt;
&lt;li&gt;automatically cleaning data with an LLM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, AI-ready means something more practical and earlier in the workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cleaned
validated
documented
machine-readable
safe to inspect before downstream BI, ML, or AI use
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The goal is not to build an AI application. The goal is to prepare data artifacts that another workflow could review and use later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this step matters
&lt;/h2&gt;

&lt;p&gt;Many teams want to "use AI on their data" before they have a reliable data preparation layer.&lt;/p&gt;

&lt;p&gt;That usually creates a gap.&lt;/p&gt;

&lt;p&gt;Before a dataset is useful for BI, ML, LLM, RAG, or any other AI-related workflow, a few basic questions still need to be answered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What columns exist?&lt;/li&gt;
&lt;li&gt;What does each column mean?&lt;/li&gt;
&lt;li&gt;Which fields are identifiers or contact fields?&lt;/li&gt;
&lt;li&gt;Which values are missing?&lt;/li&gt;
&lt;li&gt;Which columns are numeric, categorical, datetime, or text-like?&lt;/li&gt;
&lt;li&gt;What validation issues were found?&lt;/li&gt;
&lt;li&gt;What data was removed or transformed?&lt;/li&gt;
&lt;li&gt;What files were generated?&lt;/li&gt;
&lt;li&gt;Did this process call any external AI service?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The v0.6.0 demo answers these questions by producing several small, reviewable output files.&lt;/p&gt;

&lt;p&gt;This is especially useful for small-team workflows. A client or operator may not need a full ML platform. They may first need a clean handoff package that explains the dataset and makes downstream use safer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What v0.6.0 adds
&lt;/h2&gt;

&lt;p&gt;The v0.6.0 update adds a new optional workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;generated messy order data
        ↓
existing validation and cleaning workflow
        ↓
cleaned orders dataset
        ↓
schema profile JSON
        ↓
data dictionary JSON
        ↓
validation summary JSON
        ↓
feature-ready CSV
        ↓
embedding-ready text field extract
        ↓
AI-ready manifest + Markdown summary report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main files added for this path are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scripts/run_ai_ready_demo.py
src/dq_etl_starter/ai_ready.py
docs/ai_ready.md
tests/test_ai_ready.py
tests/test_ai_ready_outputs.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expected local output directory is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/ai_ready/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the generated files are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/ai_ready/
├── ai_ready_summary_report.md
├── ai_ready_manifest.json
├── data_dictionary.json
├── schema_profile.json
├── validation_summary.json
├── feature_ready_orders.csv
└── embedding_ready_text_fields.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are local artifacts. They should not be committed to the repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project path so far
&lt;/h2&gt;

&lt;p&gt;This project has grown in small steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;v0.1.0  local data quality ETL baseline
v0.2.0  optional PostgreSQL export
v0.3.0  optional FastAPI validation wrapper
v0.4.0  analytics-ready Parquet + DuckDB demo
v0.5.0  BI-ready PostgreSQL + Metabase demo
v0.6.0  AI-ready data preparation demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That sequence is intentional.&lt;/p&gt;

&lt;p&gt;The project does not jump directly from messy CSV files to an AI application. It first builds the data workflow foundations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reading input data;&lt;/li&gt;
&lt;li&gt;validating schemas;&lt;/li&gt;
&lt;li&gt;cleaning rows;&lt;/li&gt;
&lt;li&gt;exporting data;&lt;/li&gt;
&lt;li&gt;generating reports;&lt;/li&gt;
&lt;li&gt;preparing analytics outputs;&lt;/li&gt;
&lt;li&gt;loading reporting tables;&lt;/li&gt;
&lt;li&gt;documenting data for downstream use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The v0.6.0 update continues that path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install the project locally
&lt;/h2&gt;

&lt;p&gt;Clone the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/OnerGit/data-quality-etl-starter.git
&lt;span class="nb"&gt;cd &lt;/span&gt;data-quality-etl-starter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate it on macOS or Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate it on Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;venv&lt;/span&gt;&lt;span class="n"&gt;\Scripts\activate&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install dependencies and the local package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



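&lt;p&gt;The editable install step is useful because the project uses a &lt;code&gt;src/&lt;/code&gt; layout: installing the package makes &lt;code&gt;dq_etl_starter&lt;/code&gt; importable from the scripts and tests.&lt;/p&gt;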

&lt;h2&gt;
  
  
  Step 1: Generate synthetic input data
&lt;/h2&gt;

&lt;p&gt;The AI-ready demo starts from generated synthetic order data.&lt;/p&gt;

&lt;p&gt;It does not use real customer data. It does not download external datasets. It does not require API keys.&lt;/p&gt;

&lt;p&gt;Generate 100,000 rows on macOS or Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/generate_sample_data.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rows&lt;/span&gt; 100000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; data/generated/orders_100k.csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--seed&lt;/span&gt; 42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;scripts/generate_sample_data.py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--rows&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;100000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/generated/orders_100k.csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--seed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;42&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fixed seed keeps the demo reproducible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm16dpqkw3d0imla69o09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm16dpqkw3d0imla69o09.png" alt="Generated 100k synthetic order data" width="800" height="135"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The generated data intentionally includes common data quality issues such as missing values, invalid email values, duplicate rows, invalid dates, negative quantities, zero prices, and inconsistent country values.&lt;/p&gt;

&lt;p&gt;That makes the downstream preparation step more meaningful than running the workflow on a perfectly clean sample file.&lt;/p&gt;
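
&lt;p&gt;The idea behind a fixed seed can be shown with a much smaller generator. This is a sketch under stated assumptions: &lt;code&gt;generate_orders&lt;/code&gt;, the column names, and the injected issues are made up for illustration and do not reproduce &lt;code&gt;scripts/generate_sample_data.py&lt;/code&gt;.&lt;/p&gt;

```python
import random


def generate_orders(row_count, seed):
    """Generate synthetic order rows with a few deliberate quality issues (illustrative only)."""
    rng = random.Random(seed)  # A private generator leaves the global random state untouched.
    countries = ["US", "usa", "United States", "DE", ""]
    rows = []
    for index in range(row_count):
        rows.append({
            "order_id": f"ORD-{index:06d}",
            "quantity": rng.choice([1, 2, 3, -1]),        # -1 simulates a negative quantity
            "unit_price": rng.choice([9.99, 19.5, 0.0]),  # 0.0 simulates a zero price
            "country": rng.choice(countries),             # mixed spellings and blanks
        })
    return rows


first = generate_orders(5, seed=42)
second = generate_orders(5, seed=42)
print(first == second)  # Same seed, same rows.
```

&lt;p&gt;Because the generator is seeded per call, two runs with the same seed produce identical rows, which is what makes screenshots and report numbers comparable between runs.&lt;/p&gt;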

&lt;h2&gt;
  
  
  Step 2: Run the AI-ready preparation demo
&lt;/h2&gt;

&lt;p&gt;Run the v0.6.0 demo on macOS or Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/run_ai_ready_demo.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; data/generated/orders_100k.csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--schema&lt;/span&gt; data/expected/generated_order_schema.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; data/output/ai_ready &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dataset-name&lt;/span&gt; cleaned_orders
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;scripts/run_ai_ready_demo.py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/generated/orders_100k.csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--schema&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/expected/generated_order_schema.json&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--output-dir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/output/ai_ready&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--dataset-name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;cleaned_orders&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script prints a completion message and lists the generated outputs.&lt;/p&gt;

&lt;p&gt;A successful run should create:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;schema_profile.json
data_dictionary.json
validation_summary.json
feature_ready_orders.csv
embedding_ready_text_fields.csv
ai_ready_manifest.json
ai_ready_summary_report.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This workflow uses the existing project pieces first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read the generated CSV;&lt;/li&gt;
&lt;li&gt;load the expected schema;&lt;/li&gt;
&lt;li&gt;validate the input;&lt;/li&gt;
&lt;li&gt;clean the DataFrame;&lt;/li&gt;
&lt;li&gt;prepare order data for downstream use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then the new AI-ready layer creates metadata, summaries, and handoff files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Output 1: Schema profile
&lt;/h2&gt;

&lt;p&gt;The first output is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/ai_ready/schema_profile.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file is a machine-readable profile of the prepared dataset.&lt;/p&gt;

&lt;p&gt;It includes information such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dataset name;&lt;/li&gt;
&lt;li&gt;row count;&lt;/li&gt;
&lt;li&gt;column count;&lt;/li&gt;
&lt;li&gt;column names;&lt;/li&gt;
&lt;li&gt;inferred types;&lt;/li&gt;
&lt;li&gt;pandas dtypes;&lt;/li&gt;
&lt;li&gt;null counts;&lt;/li&gt;
&lt;li&gt;null ratios;&lt;/li&gt;
&lt;li&gt;unique counts;&lt;/li&gt;
&lt;li&gt;unique ratios;&lt;/li&gt;
&lt;li&gt;example values;&lt;/li&gt;
&lt;li&gt;recommended column roles.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simplified example looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dataset_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cleaned_orders"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"row_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"column_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"columns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"order_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"dtype"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"recommended_role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"identifier"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"null_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"unique_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exact values depend on the generated input and cleaning result.&lt;/p&gt;

&lt;p&gt;This file is useful because downstream users can inspect structure before deciding how to use the dataset.&lt;/p&gt;

&lt;p&gt;For example, a BI user may check date and numeric fields. An ML user may check which fields are identifiers or contact fields so they can be excluded from features. An AI/RAG workflow may check which text fields exist before deciding whether embeddings are appropriate.&lt;/p&gt;
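
&lt;p&gt;A per-column profile of this kind can be computed in a few lines. The sketch below uses only the standard library rather than pandas, and &lt;code&gt;profile_column&lt;/code&gt; is a hypothetical helper; the real profile in this project contains more fields, such as inferred types and pandas dtypes.&lt;/p&gt;

```python
import json


def profile_column(name, values):
    """Profile one column from a list of values (hypothetical helper for this sketch)."""
    non_null = [value for value in values if value not in (None, "")]
    row_count = len(values)
    null_count = row_count - len(non_null)
    unique_count = len(set(non_null))
    return {
        "name": name,
        "null_count": null_count,
        "null_ratio": null_count / row_count if row_count else 0.0,
        "unique_count": unique_count,
        "unique_ratio": unique_count / row_count if row_count else 0.0,
        "example_values": sorted(set(map(str, non_null)))[:3],
    }


# Illustrative column data; a real run would read these from the cleaned dataset.
columns = {
    "order_id": ["ORD-1", "ORD-2", "ORD-3", "ORD-4"],
    "country": ["US", "DE", "", "US"],
}
profile = {
    "dataset_name": "cleaned_orders",
    "row_count": 4,
    "column_count": len(columns),
    "columns": [profile_column(name, values) for name, values in columns.items()],
}
print(json.dumps(profile, indent=2))
```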

&lt;h2&gt;
  
  
  Output 2: Data dictionary
&lt;/h2&gt;

&lt;p&gt;The second output is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/ai_ready/data_dictionary.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file explains what each column means.&lt;/p&gt;

&lt;p&gt;It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;column name;&lt;/li&gt;
&lt;li&gt;human-readable description;&lt;/li&gt;
&lt;li&gt;type;&lt;/li&gt;
&lt;li&gt;recommended role;&lt;/li&gt;
&lt;li&gt;nullable flag;&lt;/li&gt;
&lt;li&gt;example values;&lt;/li&gt;
&lt;li&gt;usage notes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, identifier fields are marked differently from numeric features or text-like fields.&lt;/p&gt;

&lt;p&gt;This matters because a cleaned table is still not self-explanatory.&lt;/p&gt;

&lt;p&gt;A field such as &lt;code&gt;customer_id&lt;/code&gt; may be technically clean, but it should usually not be treated as a numeric feature. A field such as &lt;code&gt;email&lt;/code&gt; may be useful for validation examples, but it should be reviewed carefully before any downstream AI or ML use.&lt;/p&gt;

&lt;p&gt;The data dictionary makes those notes explicit.&lt;/p&gt;
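
&lt;p&gt;The point about &lt;code&gt;customer_id&lt;/code&gt; can be expressed as a simple rule: decide the role by column name before looking at the type. The role names, rule sets, and helper functions in this sketch are assumptions for illustration; the project's own dictionary may use different labels and logic.&lt;/p&gt;

```python
import json

# Hypothetical role rules for this sketch; the project may classify columns differently.
IDENTIFIER_COLUMNS = {"order_id", "customer_id"}
CONTACT_COLUMNS = {"email"}


def recommend_role(column_name, inferred_type):
    """Pick a role by column name first, so numeric-looking IDs are not treated as features."""
    if column_name in IDENTIFIER_COLUMNS:
        return "identifier"
    if column_name in CONTACT_COLUMNS:
        return "contact"
    if inferred_type == "numeric":
        return "numeric_feature"
    return "needs_review"


def dictionary_entry(column_name, inferred_type, description):
    """Build one data dictionary entry with an explicit usage note."""
    role = recommend_role(column_name, inferred_type)
    sensitive = role in {"identifier", "contact"}
    return {
        "name": column_name,
        "description": description,
        "type": inferred_type,
        "recommended_role": role,
        "usage_notes": "Review before any ML or AI use." if sensitive else "",
    }


data_dictionary = [
    dictionary_entry("customer_id", "numeric", "Customer identifier"),
    dictionary_entry("email", "text", "Customer contact email"),
    dictionary_entry("quantity", "numeric", "Units ordered"),
]
print(json.dumps(data_dictionary, indent=2))
```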

&lt;h2&gt;
  
  
  Output 3: Validation summary
&lt;/h2&gt;

&lt;p&gt;The third output is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/ai_ready/validation_summary.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file gives a compact machine-readable summary of the validation and cleaning stage.&lt;/p&gt;

&lt;p&gt;It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;source file;&lt;/li&gt;
&lt;li&gt;schema file;&lt;/li&gt;
&lt;li&gt;row count before cleaning;&lt;/li&gt;
&lt;li&gt;row count after preparation;&lt;/li&gt;
&lt;li&gt;rows removed during preparation;&lt;/li&gt;
&lt;li&gt;duplicate rows removed;&lt;/li&gt;
&lt;li&gt;columns with missing values;&lt;/li&gt;
&lt;li&gt;validation issue count;&lt;/li&gt;
&lt;li&gt;validation issue codes;&lt;/li&gt;
&lt;li&gt;AI-readiness notes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This output is useful for auditability.&lt;/p&gt;

&lt;p&gt;When a dataset is handed off to another workflow, the receiver should not only get a CSV file. They should also get a summary of what happened before the file was produced.&lt;/p&gt;

&lt;h2&gt;
  
  
  Output 4: Feature-ready CSV
&lt;/h2&gt;

&lt;p&gt;The fourth output is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/ai_ready/feature_ready_orders.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a simple tabular output for downstream feature exploration.&lt;/p&gt;

&lt;p&gt;By default, the workflow removes identifier and contact fields such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;order_id
customer_id
email
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It also transforms &lt;code&gt;order_date&lt;/code&gt; into simple time-based fields such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;order_year
order_month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Producing this file does not train a model, and it does not decide which features are right for a business use case.&lt;/p&gt;

&lt;p&gt;It only creates a cleaner starting point for later review.&lt;/p&gt;

&lt;p&gt;That distinction is important. Feature-ready does not mean model-ready for every use case. It means the output is more suitable for feature exploration than the original raw file.&lt;/p&gt;
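
&lt;p&gt;The two transformations described above (dropping identifier and contact fields, then deriving &lt;code&gt;order_year&lt;/code&gt; and &lt;code&gt;order_month&lt;/code&gt;) could look roughly like this. It is a minimal sketch: the &lt;code&gt;to_feature_row&lt;/code&gt; helper is hypothetical, and it assumes that &lt;code&gt;order_date&lt;/code&gt; is an ISO-formatted string after cleaning and is replaced by the derived fields.&lt;/p&gt;

```python
from datetime import date

# Columns dropped by default, taken from the list in this section.
EXCLUDED_COLUMNS = {"order_id", "customer_id", "email"}


def to_feature_row(row):
    """Drop identifier/contact fields and expand order_date into year and month."""
    features = {k: v for k, v in row.items() if k not in EXCLUDED_COLUMNS}
    order_date = features.pop("order_date", None)
    if order_date:
        parsed = date.fromisoformat(order_date)  # assumes ISO dates after cleaning
        features["order_year"] = parsed.year
        features["order_month"] = parsed.month
    return features


row = {
    "order_id": "A1",
    "customer_id": 1001,
    "email": "a@example.com",
    "order_date": "2026-03-14",
    "country": "DE",
    "quantity": 2,
}
feature_row = to_feature_row(row)
print(feature_row)
```

&lt;p&gt;For the sample row, the result keeps only &lt;code&gt;country&lt;/code&gt; and &lt;code&gt;quantity&lt;/code&gt; and adds &lt;code&gt;order_year&lt;/code&gt; 2026 and &lt;code&gt;order_month&lt;/code&gt; 3.&lt;/p&gt;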

&lt;h2&gt;
  
  
  Output 5: Embedding-ready text field extract
&lt;/h2&gt;

&lt;p&gt;The fifth output is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/ai_ready/embedding_ready_text_fields.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file extracts text-like fields into a compact structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;record_id,text,source_columns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project does &lt;strong&gt;not&lt;/strong&gt; generate embeddings.&lt;/p&gt;

&lt;p&gt;It only prepares text fields so a downstream workflow can decide later whether embeddings are appropriate.&lt;/p&gt;

&lt;p&gt;Contact fields such as &lt;code&gt;email&lt;/code&gt; are excluded by default.&lt;/p&gt;

&lt;p&gt;That is a deliberate design choice. It keeps the project focused on data preparation and avoids pretending that every text field should automatically go into a vector database.&lt;/p&gt;
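
&lt;p&gt;To make the &lt;code&gt;record_id,text,source_columns&lt;/code&gt; structure more concrete, here is a small sketch. The text column names (&lt;code&gt;product_name&lt;/code&gt;, &lt;code&gt;order_note&lt;/code&gt;), the separators, and the &lt;code&gt;extract_text_rows&lt;/code&gt; helper are assumptions chosen for illustration. No embedding is computed anywhere in it.&lt;/p&gt;

```python
import csv
import io

# Hypothetical text-like columns; contact fields such as email are deliberately absent.
TEXT_COLUMNS = ["product_name", "order_note"]


def extract_text_rows(rows, id_column="order_id"):
    """Yield record_id/text/source_columns dicts for records with non-empty text."""
    for row in rows:
        used = [c for c in TEXT_COLUMNS if (row.get(c) or "").strip()]
        if not used:
            continue  # nothing to extract for this record
        yield {
            "record_id": row[id_column],
            "text": " | ".join(row[c].strip() for c in used),
            "source_columns": ";".join(used),
        }


rows = [
    {"order_id": "A1", "email": "a@example.com",
     "product_name": "Desk lamp", "order_note": " gift wrap "},
    {"order_id": "A2", "email": "b@example.com",
     "product_name": "", "order_note": None},
]

extracted = list(extract_text_rows(rows))

# Write to an in-memory buffer so the example does not touch the file system.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["record_id", "text", "source_columns"])
writer.writeheader()
writer.writerows(extracted)
print(buffer.getvalue())
```

&lt;p&gt;Keeping &lt;code&gt;source_columns&lt;/code&gt; next to the text lets a later reviewer see exactly which fields would be embedded before deciding whether that is appropriate.&lt;/p&gt;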

&lt;h2&gt;
  
  
  Output 6: AI-ready manifest
&lt;/h2&gt;

&lt;p&gt;The sixth output is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/ai_ready/ai_ready_manifest.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the most important scope-control file in the v0.6.0 update.&lt;/p&gt;

&lt;p&gt;It explicitly records that the workflow did not call AI services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"llm_api_called"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"embedding_generated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model_trained"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This may look simple, but it is useful for a public technical project.&lt;/p&gt;

&lt;p&gt;An "AI-ready" label can easily create confusion about what a project actually does. A manifest prevents overclaiming by documenting what the workflow did and did not do.&lt;/p&gt;

&lt;p&gt;The manifest also lists intended downstream uses, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BI handoff;&lt;/li&gt;
&lt;li&gt;ML feature exploration;&lt;/li&gt;
&lt;li&gt;LLM/RAG preparation outside this project;&lt;/li&gt;
&lt;li&gt;data quality review.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it lists out-of-scope items such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM API calls;&lt;/li&gt;
&lt;li&gt;embedding generation;&lt;/li&gt;
&lt;li&gt;model training;&lt;/li&gt;
&lt;li&gt;RAG chatbot;&lt;/li&gt;
&lt;li&gt;AI agent;&lt;/li&gt;
&lt;li&gt;vector database.&lt;/li&gt;
&lt;/ul&gt;
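
&lt;p&gt;Putting the three flags and the two lists together, a manifest writer might look like the sketch below. Only the three boolean flags come from the snippet above. The other key names (&lt;code&gt;output_files&lt;/code&gt;, &lt;code&gt;intended_downstream_uses&lt;/code&gt;, &lt;code&gt;out_of_scope&lt;/code&gt;) and both helper functions are assumptions for illustration.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path


def build_manifest(output_files):
    """Return a manifest dict stating what the run did and did not do."""
    return {
        "output_files": sorted(output_files),
        # Scope flags from the article: no AI service was involved in the run.
        "llm_api_called": False,
        "embedding_generated": False,
        "model_trained": False,
        # Hypothetical key names for the two lists described in this section.
        "intended_downstream_uses": [
            "BI handoff",
            "ML feature exploration",
            "LLM/RAG preparation outside this project",
            "data quality review",
        ],
        "out_of_scope": [
            "LLM API calls",
            "embedding generation",
            "model training",
            "RAG chatbot",
            "AI agent",
            "vector database",
        ],
    }


def write_manifest(manifest, path):
    """Write the manifest as indented JSON, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")


# Write to a temporary folder so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "ai_ready" / "ai_ready_manifest.json"
    write_manifest(
        build_manifest(["validation_summary.json", "feature_ready_orders.csv"]), target
    )
    loaded = json.loads(target.read_text(encoding="utf-8"))

print(loaded["llm_api_called"], loaded["out_of_scope"])
```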

&lt;h2&gt;
  
  
  Output 7: AI-ready summary report
&lt;/h2&gt;

&lt;p&gt;The seventh and final output is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/ai_ready/ai_ready_summary_report.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a human-readable Markdown report.&lt;/p&gt;

&lt;p&gt;It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dataset name;&lt;/li&gt;
&lt;li&gt;prepared row count;&lt;/li&gt;
&lt;li&gt;generated output files;&lt;/li&gt;
&lt;li&gt;scope note;&lt;/li&gt;
&lt;li&gt;recommended downstream use;&lt;/li&gt;
&lt;li&gt;out-of-scope items.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The summary report is meant for handoff.&lt;/p&gt;

&lt;p&gt;A technical reviewer can open the JSON files. A less technical stakeholder can start with the Markdown report and understand the purpose of the run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why no LLM API call?
&lt;/h2&gt;

&lt;p&gt;This project intentionally stops before the expensive or model-specific part of an AI workflow.&lt;/p&gt;

&lt;p&gt;There are several reasons.&lt;/p&gt;

&lt;p&gt;First, AI APIs introduce cost and credential-management overhead. A small data workflow starter should run without paid API keys.&lt;/p&gt;

&lt;p&gt;Second, embedding and modeling decisions depend on the use case. A dataset prepared for sales forecasting is different from a dataset prepared for semantic search.&lt;/p&gt;

&lt;p&gt;Third, calling an LLM does not remove the need for validation, profiling, documentation, and governance. Those steps are still required.&lt;/p&gt;

&lt;p&gt;Fourth, this project is meant to demonstrate a Python data workflow skill set: data cleaning, validation, transformation, reporting, testing, and handoff.&lt;/p&gt;

&lt;p&gt;For this version, preparing better data is more important than adding an AI wrapper.&lt;/p&gt;

&lt;h2&gt;
  
  
  How this maps to client work
&lt;/h2&gt;

&lt;p&gt;A realistic client request may sound like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We want to use our order/customer data for reporting, analytics, or maybe AI later. Can you clean it and prepare a documented dataset first?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A practical first milestone could be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inspect input files;&lt;/li&gt;
&lt;li&gt;validate required fields;&lt;/li&gt;
&lt;li&gt;clean duplicates and bad values;&lt;/li&gt;
&lt;li&gt;remove obvious identifier or contact fields from feature exploration outputs;&lt;/li&gt;
&lt;li&gt;generate a schema profile;&lt;/li&gt;
&lt;li&gt;create a data dictionary;&lt;/li&gt;
&lt;li&gt;write a validation summary;&lt;/li&gt;
&lt;li&gt;prepare a feature-ready CSV;&lt;/li&gt;
&lt;li&gt;prepare a text-field extract for later review;&lt;/li&gt;
&lt;li&gt;document what was and was not done.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is exactly the kind of handoff this v0.6.0 demo is designed to show.&lt;/p&gt;

&lt;p&gt;It is not a full AI system. It is a data preparation layer that makes later AI-related work more responsible and easier to review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests
&lt;/h2&gt;

&lt;p&gt;Run the AI-ready tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest tests/test_ai_ready.py
pytest tests/test_ai_ready_outputs.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the compile checks and the full test suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; compileall &lt;span class="nt"&gt;-q&lt;/span&gt; src/dq_etl_starter
python &lt;span class="nt"&gt;-m&lt;/span&gt; compileall &lt;span class="nt"&gt;-q&lt;/span&gt; scripts
pytest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default tests do not require PostgreSQL, Metabase, external datasets, or any LLM API key.&lt;/p&gt;

&lt;p&gt;That keeps the workflow easy to verify locally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local artifact policy
&lt;/h2&gt;

&lt;p&gt;The generated AI-ready output files are local artifacts.&lt;/p&gt;

&lt;p&gt;Do not commit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/generated/
data/output/ai_ready/
data/output/analytics/
data/output/bi/
*.parquet
*.duckdb
metabase.db/
metabase-data/
postgres_data/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The repository should keep source code, tests, documentation, schemas, lightweight sample files, and screenshots.&lt;/p&gt;

&lt;p&gt;This matters for public portfolio projects. The repository should be easy to clone and review without carrying large generated outputs.&lt;/p&gt;
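
&lt;p&gt;One way to guard this policy is a small check that flags paths matching the ignore list before they are staged. The sketch below is a simplified approximation, not full &lt;code&gt;.gitignore&lt;/code&gt; semantics: it treats the directory entries as prefixes relative to the repository root, and the &lt;code&gt;is_local_artifact&lt;/code&gt; helper is hypothetical.&lt;/p&gt;

```python
from fnmatch import fnmatch

# Directory prefixes and file globs copied from the ignore list above.
IGNORED_DIRS = [
    "data/generated/",
    "data/output/ai_ready/",
    "data/output/analytics/",
    "data/output/bi/",
    "metabase.db/",
    "metabase-data/",
    "postgres_data/",
]
IGNORED_GLOBS = ["*.parquet", "*.duckdb"]


def is_local_artifact(path):
    """Return True if a repo-relative path looks like a generated local artifact."""
    normalized = path.replace("\\", "/")
    if any(normalized.startswith(prefix) for prefix in IGNORED_DIRS):
        return True
    filename = normalized.rsplit("/", 1)[-1]
    return any(fnmatch(filename, pattern) for pattern in IGNORED_GLOBS)


candidates = [
    "data/output/bi/bi_summary_report.md",
    "exports/orders.parquet",
    "src/dq_etl_starter/bi.py",
]
flagged = [p for p in candidates if is_local_artifact(p)]
print(flagged)
```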

&lt;h2&gt;
  
  
  What is intentionally out of scope?
&lt;/h2&gt;

&lt;p&gt;The v0.6.0 demo does not include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI, Claude, Gemini, or other paid AI APIs;&lt;/li&gt;
&lt;li&gt;local LLM integration;&lt;/li&gt;
&lt;li&gt;embedding generation;&lt;/li&gt;
&lt;li&gt;vector databases;&lt;/li&gt;
&lt;li&gt;RAG chatbots;&lt;/li&gt;
&lt;li&gt;AI agents;&lt;/li&gt;
&lt;li&gt;automatic SQL generation;&lt;/li&gt;
&lt;li&gt;automatic LLM-driven data cleaning;&lt;/li&gt;
&lt;li&gt;model training;&lt;/li&gt;
&lt;li&gt;AutoML;&lt;/li&gt;
&lt;li&gt;feature stores;&lt;/li&gt;
&lt;li&gt;MLflow or MLOps tooling;&lt;/li&gt;
&lt;li&gt;cloud deployment;&lt;/li&gt;
&lt;li&gt;custom frontend code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tools can be useful in the right project.&lt;/p&gt;

&lt;p&gt;They are simply not the goal of this starter.&lt;/p&gt;

&lt;p&gt;The goal is to prepare documented, machine-readable data that downstream workflows can inspect and decide how to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would improve next
&lt;/h2&gt;

&lt;p&gt;Possible next improvements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add more configurable column role rules;&lt;/li&gt;
&lt;li&gt;add stronger data dictionary templates;&lt;/li&gt;
&lt;li&gt;generate a small HTML summary report;&lt;/li&gt;
&lt;li&gt;add richer schema drift checks;&lt;/li&gt;
&lt;li&gt;add more realistic public dataset validation notes;&lt;/li&gt;
&lt;li&gt;add a Makefile for common demo commands;&lt;/li&gt;
&lt;li&gt;add CI for test execution;&lt;/li&gt;
&lt;li&gt;add clearer examples for BI, ML, and RAG handoff paths.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main constraint remains the same:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;keep the project small, runnable, testable, documented, and honest about scope
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is more useful than adding an AI feature that hides the underlying data preparation work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;GitHub repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/data-quality-etl-starter" rel="noopener noreferrer"&gt;https://github.com/OnerGit/data-quality-etl-starter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Previous article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/bob_oner/from-clean-data-to-bi-ready-reporting-tables-with-python-postgresql-and-metabase-348p"&gt;From Clean Data to BI-Ready Reporting Tables with Python, PostgreSQL, and Metabase&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This v0.6.0 update is a practical next step: preparing clean, validated, documented, machine-readable data for downstream BI, ML, or AI workflows without pretending that data preparation alone is a complete AI application.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>postgres</category>
      <category>docker</category>
    </item>
    <item>
      <title>From Clean Data to BI-Ready Reporting Tables with Python, PostgreSQL, and Metabase</title>
      <dc:creator>Bob Oner</dc:creator>
      <pubDate>Thu, 11 Jun 2026 08:13:25 +0000</pubDate>
      <link>https://dev.to/bob_oner/from-clean-data-to-bi-ready-reporting-tables-with-python-postgresql-and-metabase-348p</link>
      <guid>https://dev.to/bob_oner/from-clean-data-to-bi-ready-reporting-tables-with-python-postgresql-and-metabase-348p</guid>
      <description>&lt;p&gt;In the previous article, I extended a small Python data quality ETL starter from validation and cleaning into analytics-ready local outputs with Parquet, DuckDB, summary CSV files, and a benchmark report.&lt;/p&gt;

&lt;p&gt;Previous article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/bob_oner/from-data-quality-checks-to-analytics-ready-parquet-with-python-39bd"&gt;From Data Quality Checks to Analytics-Ready Parquet with Python&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This follow-up focuses on the v0.5.0 update of the same project:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/data-quality-etl-starter" rel="noopener noreferrer"&gt;Data Quality ETL Starter on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The new goal is still intentionally modest.&lt;/p&gt;

&lt;p&gt;This is not a production BI platform. It is not a data warehouse. It is not a cloud deployment. It is not an embedded analytics product.&lt;/p&gt;

&lt;p&gt;The goal is to show one practical next step after cleaning and analytics-ready export:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;generated messy order data
        ↓
existing validation and cleaning workflow
        ↓
analytics-ready order rows
        ↓
PostgreSQL reporting tables
        ↓
lightweight SQL views
        ↓
optional Metabase local dashboard demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is a common handoff point in small data workflow projects.&lt;/p&gt;

&lt;p&gt;A client or small team may not need a full data platform yet. They may simply need cleaned data loaded into a local reporting database, a few reusable SQL views, and a basic dashboard tool connected to those views.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why add a BI-ready demo?
&lt;/h2&gt;

&lt;p&gt;The earlier versions of this project focused on data quality and local analytics.&lt;/p&gt;

&lt;p&gt;The workflow could already:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read messy CSV, Excel, JSON, and mock API-style data;&lt;/li&gt;
&lt;li&gt;validate expected columns and schema rules;&lt;/li&gt;
&lt;li&gt;clean duplicate rows and text values;&lt;/li&gt;
&lt;li&gt;export cleaned CSV output;&lt;/li&gt;
&lt;li&gt;export to SQLite and optional PostgreSQL;&lt;/li&gt;
&lt;li&gt;expose validation through a thin FastAPI wrapper;&lt;/li&gt;
&lt;li&gt;generate larger synthetic order data;&lt;/li&gt;
&lt;li&gt;export analytics-ready CSV and Parquet files;&lt;/li&gt;
&lt;li&gt;query Parquet locally with DuckDB;&lt;/li&gt;
&lt;li&gt;produce summary CSV tables and a benchmark report.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those steps are useful, but many reporting workflows eventually ask another question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can this cleaned data feed a reporting database or dashboard?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;v0.5.0 adds a small answer to that question.&lt;/p&gt;

&lt;p&gt;It loads cleaned and analytics-ready data into PostgreSQL, creates reporting tables and SQL views, and provides instructions for exploring the result in Metabase.&lt;/p&gt;

&lt;p&gt;The point is not to make the project bigger for its own sake. The point is to demonstrate a realistic bridge from data cleaning to lightweight reporting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What v0.5.0 adds
&lt;/h2&gt;

&lt;p&gt;The v0.5.0 update adds an optional BI-ready path.&lt;/p&gt;

&lt;p&gt;The most relevant new pieces are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scripts/run_bi_demo.py
src/dq_etl_starter/bi.py
docs/bi.md
docs/metabase.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The BI demo creates PostgreSQL reporting tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cleaned_orders
customer_summary
revenue_by_country
orders_by_month
source_system_summary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It also creates PostgreSQL reporting views:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vw_revenue_by_country
vw_orders_by_month
vw_source_system_quality
vw_monthly_revenue_trend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And it writes local output files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/bi/bi_summary_report.md
data/output/bi/reporting_queries.sql
data/output/bi/metabase_dashboard_notes.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;data/output/bi/&lt;/code&gt; directory is intentionally ignored by Git. It contains local run artifacts, not source files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project structure after the update
&lt;/h2&gt;

&lt;p&gt;The project remains small, but the structure now shows a clearer path from input data to reporting preparation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data-quality-etl-starter/
├── data/
│   ├── input/
│   ├── expected/
│   └── output/
├── docs/
│   ├── analytics.md
│   ├── api.md
│   ├── bi.md
│   ├── metabase.md
│   └── postgres.md
├── screenshots/
├── scripts/
│   ├── generate_sample_data.py
│   ├── run_analytics_demo.py
│   └── run_bi_demo.py
├── src/dq_etl_starter/
│   ├── analytics.py
│   ├── bi.py
│   ├── clean.py
│   ├── exporters.py
│   ├── services.py
│   └── validate.py
└── tests/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important design choice is that the BI-ready path does not replace the original workflow.&lt;/p&gt;

&lt;p&gt;It builds on it.&lt;/p&gt;

&lt;p&gt;The project still starts with data validation and cleaning. The reporting layer comes later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install the project locally
&lt;/h2&gt;

&lt;p&gt;Clone the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/OnerGit/data-quality-etl-starter.git
&lt;span class="nb"&gt;cd &lt;/span&gt;data-quality-etl-starter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate it on macOS or Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate it on Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;venv&lt;/span&gt;&lt;span class="n"&gt;\Scripts\activate&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install dependencies and the local package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



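&lt;p&gt;The editable install step matters because the project uses a &lt;code&gt;src/&lt;/code&gt; layout: it makes the &lt;code&gt;dq_etl_starter&lt;/code&gt; package importable from scripts and tests without path workarounds.&lt;/p&gt;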

&lt;h2&gt;
  
  
  Step 1: Generate synthetic order data
&lt;/h2&gt;

&lt;p&gt;The BI demo uses generated synthetic order data.&lt;/p&gt;

&lt;p&gt;It does not use real customer data. It does not download private data. It is designed to be reproducible and safe for a public portfolio project.&lt;/p&gt;

&lt;p&gt;Generate 100,000 rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/generate_sample_data.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rows&lt;/span&gt; 100000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; data/generated/orders_100k.csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--seed&lt;/span&gt; 42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;scripts/generate_sample_data.py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--rows&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;100000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/generated/orders_100k.csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--seed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;42&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This synthetic dataset intentionally includes data quality issues, such as missing values, invalid emails, duplicate rows, invalid dates, negative quantities, and inconsistent country values.&lt;/p&gt;

&lt;p&gt;That makes the demo more useful than a perfectly clean sample file.&lt;/p&gt;
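
&lt;p&gt;To show the general idea of seeded, deliberately messy data, here is a minimal sketch. It is not the code in &lt;code&gt;scripts/generate_sample_data.py&lt;/code&gt;: the column names, the issue types, the &lt;code&gt;issue_rate&lt;/code&gt; parameter, and the &lt;code&gt;generate_orders&lt;/code&gt; function are all assumptions made for this example.&lt;/p&gt;

```python
import random

# Hypothetical issue types, mirroring the problems listed above.
ISSUES = [
    "missing_email",
    "invalid_email",
    "invalid_date",
    "negative_quantity",
    "messy_country",
    "duplicate",
]


def generate_orders(rows, seed=42, issue_rate=0.1):
    """Return synthetic order rows; roughly issue_rate of them carry one defect."""
    rng = random.Random(seed)  # local generator: same seed, same rows
    out = []
    for i in range(rows):
        row = {
            "order_id": f"ORD-{i:06d}",
            "email": f"customer{i}@example.com",
            "order_date": f"2026-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}",
            "quantity": rng.randint(1, 5),
            "country": rng.choice(["US", "DE", "FR"]),
        }
        inject = rng.choices([True, False], weights=[issue_rate, 1 - issue_rate])[0]
        if inject:
            issue = rng.choice(ISSUES)
            if issue == "missing_email":
                row["email"] = ""
            elif issue == "invalid_email":
                row["email"] = "not-an-email"
            elif issue == "invalid_date":
                row["order_date"] = "2026-13-45"
            elif issue == "negative_quantity":
                row["quantity"] = -row["quantity"]
            elif issue == "messy_country":
                row["country"] = f" {row['country'].lower()} "
            elif issue == "duplicate" and out:
                row = dict(out[-1])  # exact copy of the previous row
        out.append(row)
    return out


orders = generate_orders(1000, seed=42)
print(len(orders), orders[0])
```

&lt;p&gt;Using a local &lt;code&gt;random.Random(seed)&lt;/code&gt; instance rather than the module-level functions keeps the output reproducible even if other code also draws random numbers.&lt;/p&gt;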

&lt;h2&gt;
  
  
  Step 2: Start PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Start the local PostgreSQL service with Docker Compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check that the container is running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The demo uses PostgreSQL as the reporting database because it is a practical next step after local CSV, SQLite, and Parquet outputs.&lt;/p&gt;

&lt;p&gt;SQLite is still useful for the default local workflow. PostgreSQL is useful when reporting tables should be available through a database connection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Run the BI demo
&lt;/h2&gt;

&lt;p&gt;Run the BI demo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/run_bi_demo.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; data/generated/orders_100k.csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--schema&lt;/span&gt; data/expected/generated_order_schema.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; data/output/bi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-url&lt;/span&gt; postgresql+psycopg://dq_user:dq_password@localhost:5432/dq_demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;scripts/run_bi_demo.py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/generated/orders_100k.csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--schema&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/expected/generated_order_schema.json&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--output-dir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/output/bi&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--db-url&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;postgresql&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nx"&gt;psycopg://dq_user:dq_password&lt;/span&gt;&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="nx"&gt;localhost:5432/dq_demo&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The demo runs the generated input through the existing validation and cleaning logic, prepares analytics-ready rows, loads reporting tables into PostgreSQL, creates SQL views, and writes local documentation files.&lt;/p&gt;

&lt;p&gt;Expected local files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/bi/bi_summary_report.md
data/output/bi/reporting_queries.sql
data/output/bi/metabase_dashboard_notes.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected PostgreSQL reporting tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cleaned_orders
customer_summary
revenue_by_country
orders_by_month
source_system_summary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected PostgreSQL reporting views:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vw_revenue_by_country
vw_orders_by_month
vw_source_system_quality
vw_monthly_revenue_trend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Inspect reporting tables and views
&lt;/h2&gt;

&lt;p&gt;After running the BI demo, inspect the database.&lt;/p&gt;

&lt;p&gt;List reporting tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; dq_etl_postgres psql &lt;span class="nt"&gt;-U&lt;/span&gt; dq_user &lt;span class="nt"&gt;-d&lt;/span&gt; dq_demo &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="s2"&gt;t"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;List reporting views:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; dq_etl_postgres psql &lt;span class="nt"&gt;-U&lt;/span&gt; dq_user &lt;span class="nt"&gt;-d&lt;/span&gt; dq_demo &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="s2"&gt;v"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Preview the revenue-by-country view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; dq_etl_postgres psql &lt;span class="nt"&gt;-U&lt;/span&gt; dq_user &lt;span class="nt"&gt;-d&lt;/span&gt; dq_demo &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"SELECT * FROM vw_revenue_by_country LIMIT 10;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Preview the monthly revenue trend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; dq_etl_postgres psql &lt;span class="nt"&gt;-U&lt;/span&gt; dq_user &lt;span class="nt"&gt;-d&lt;/span&gt; dq_demo &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"SELECT * FROM vw_monthly_revenue_trend ORDER BY order_month LIMIT 12;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the output for the reporting tables and views:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxcg9ly48q9eubosznyj0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxcg9ly48q9eubosznyj0.png" alt="BI reporting tables and views" width="800" height="565"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the core of the v0.5.0 update.&lt;/p&gt;

&lt;p&gt;The cleaned data is no longer only a local file. It is now available through a reporting database with reusable SQL views.&lt;/p&gt;
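
&lt;p&gt;The table-plus-view pattern can be tried without a running database server. The sketch below swaps in an in-memory SQLite database as a stand-in for PostgreSQL, purely so that it runs anywhere. The column names and the view definition are assumptions and may differ from what &lt;code&gt;bi.py&lt;/code&gt; actually creates.&lt;/p&gt;

```python
import sqlite3


def build_reporting_demo(orders):
    """Load order rows into a table and expose an aggregate view on top of it."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for PostgreSQL here
    conn.execute(
        "CREATE TABLE cleaned_orders ("
        "order_id TEXT, country TEXT, order_month TEXT, revenue REAL)"
    )
    conn.executemany("INSERT INTO cleaned_orders VALUES (?, ?, ?, ?)", orders)
    # The view is a saved query: dashboards read it instead of repeating the SQL.
    conn.execute(
        """
        CREATE VIEW vw_revenue_by_country AS
        SELECT country,
               COUNT(*) AS order_count,
               ROUND(SUM(revenue), 2) AS total_revenue
        FROM cleaned_orders
        GROUP BY country
        """
    )
    return conn


conn = build_reporting_demo(
    [
        ("A1", "DE", "2026-01", 120.0),
        ("A2", "DE", "2026-02", 80.0),
        ("A3", "US", "2026-01", 50.5),
    ]
)
result = conn.execute(
    "SELECT * FROM vw_revenue_by_country ORDER BY total_revenue DESC"
).fetchall()
print(result)
conn.close()
```

&lt;p&gt;For the three sample rows this returns two groups: DE with 2 orders and 200.0 revenue, then US with 1 order and 50.5 revenue.&lt;/p&gt;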

&lt;h2&gt;
  
  
  Step 5: Start optional Metabase
&lt;/h2&gt;

&lt;p&gt;Metabase is optional.&lt;/p&gt;

&lt;p&gt;The core workflow does not require it. The PostgreSQL reporting tables and SQL views are the important part.&lt;/p&gt;

&lt;p&gt;Metabase is included only as a local dashboard exploration layer.&lt;/p&gt;

&lt;p&gt;Start Metabase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; metabase
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Metabase runs through Docker Compose, connect it to PostgreSQL with these values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Database type: PostgreSQL
Host: postgres
Port: 5432
Database name: dq_demo
Username: dq_user
Password: dq_password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;postgres&lt;/code&gt; as the host because Metabase is running inside the Docker Compose network, where containers reach each other by service name.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;localhost&lt;/code&gt; only when connecting from your host machine, such as with &lt;code&gt;psql&lt;/code&gt; or a desktop database client.&lt;/p&gt;
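
&lt;p&gt;The same host rule applies to any script that builds a connection URL. As a small sketch, the hypothetical &lt;code&gt;build_db_url&lt;/code&gt; helper below picks the host based on where the code runs. The defaults reuse the demo credentials shown earlier, and the helper itself is not part of the project.&lt;/p&gt;

```python
def build_db_url(
    inside_compose,
    user="dq_user",
    password="dq_password",
    database="dq_demo",
    port=5432,
):
    """Build a SQLAlchemy-style PostgreSQL URL for the demo database."""
    # Inside the Compose network the service name resolves; on the host it does not.
    host = "postgres" if inside_compose else "localhost"
    return f"postgresql+psycopg://{user}:{password}@{host}:{port}/{database}"


host_url = build_db_url(inside_compose=False)
container_url = build_db_url(inside_compose=True)
print(host_url)
print(container_url)
```

&lt;p&gt;Called with &lt;code&gt;inside_compose=False&lt;/code&gt;, it produces the same URL that was passed to &lt;code&gt;--db-url&lt;/code&gt; in Step 3.&lt;/p&gt;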

&lt;p&gt;Here is the Metabase connection step:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5azc2b7dk60k41nazgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5azc2b7dk60k41nazgf.png" alt="Metabase PostgreSQL connection" width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Create a simple dashboard
&lt;/h2&gt;

&lt;p&gt;The project does not ship a production dashboard.&lt;/p&gt;

&lt;p&gt;Instead, it provides suggested dashboard cards based on the reporting views.&lt;/p&gt;

&lt;p&gt;Useful starting points include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vw_revenue_by_country
vw_orders_by_month
vw_monthly_revenue_trend
vw_source_system_quality
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Suggested cards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;revenue by country;&lt;/li&gt;
&lt;li&gt;orders by month;&lt;/li&gt;
&lt;li&gt;monthly revenue trend;&lt;/li&gt;
&lt;li&gt;orders by source system;&lt;/li&gt;
&lt;li&gt;average order value by country.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a simple dashboard built from the reporting views:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxc02czg77gr6rj99a5r7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxc02czg77gr6rj99a5r7.png" alt="Basic Metabase dashboard" width="800" height="688"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dashboard is intentionally basic.&lt;/p&gt;

&lt;p&gt;Its job is not to impress with design. Its job is to prove that cleaned and reporting-ready data can be loaded into PostgreSQL and explored by a dashboard tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  BI summary report
&lt;/h2&gt;

&lt;p&gt;The demo also writes a Markdown report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/bi/bi_summary_report.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The report records:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;raw row count;&lt;/li&gt;
&lt;li&gt;cleaned row count;&lt;/li&gt;
&lt;li&gt;analytics-ready row count;&lt;/li&gt;
&lt;li&gt;reporting tables created;&lt;/li&gt;
&lt;li&gt;reporting views created;&lt;/li&gt;
&lt;li&gt;local output files;&lt;/li&gt;
&lt;li&gt;recommended dashboard cards;&lt;/li&gt;
&lt;li&gt;scope notes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is the BI summary report:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1s8s83romawbqw689sfk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1s8s83romawbqw689sfk.png" alt="BI summary report" width="800" height="1303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This file is useful for handoff.&lt;/p&gt;

&lt;p&gt;Instead of only saying "the script ran", the project leaves behind a simple written record of what was created.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use SQL views?
&lt;/h2&gt;

&lt;p&gt;The reporting views are small, but they are important.&lt;/p&gt;

&lt;p&gt;They separate raw or cleaned tables from reporting-facing queries.&lt;/p&gt;

&lt;p&gt;For example, a view such as &lt;code&gt;vw_revenue_by_country&lt;/code&gt; gives a dashboard tool a stable object to query. If the underlying table logic changes later, the dashboard can still point to the view, as long as the view keeps returning the same columns.&lt;/p&gt;

&lt;p&gt;This is a common reporting pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cleaned table
        ↓
summary table
        ↓
reporting view
        ↓
dashboard card
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a small project, SQL views provide a good balance between simplicity and structure.&lt;/p&gt;

&lt;p&gt;They are easier to review than a hidden dashboard-only query, and they are lighter than introducing a full modeling framework too early.&lt;/p&gt;
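
&lt;p&gt;Here is a minimal sketch of that pattern. It uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs without any service. The table name &lt;code&gt;cleaned_orders&lt;/code&gt;, its columns, and the sample rows are assumptions for illustration, not the project's actual schema.&lt;/p&gt;

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cleaned_orders (country TEXT, revenue REAL);
    INSERT INTO cleaned_orders VALUES
        ('DE', 100.0), ('DE', 50.5), ('US', 75.0);

    -- The dashboard queries this view, never the table directly.
    CREATE VIEW vw_revenue_by_country AS
    SELECT country,
           ROUND(SUM(revenue), 2) AS total_revenue,
           COUNT(*) AS order_count
    FROM cleaned_orders
    GROUP BY country;
    """
)

rows = conn.execute(
    "SELECT country, total_revenue, order_count "
    "FROM vw_revenue_by_country ORDER BY total_revenue DESC"
).fetchall()
print(rows)
```

&lt;p&gt;If the table behind the view is later renamed or rebuilt, only the view definition has to change. The query a dashboard card would send stays the same.&lt;/p&gt;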

&lt;h2&gt;
  
  
  What this proves
&lt;/h2&gt;

&lt;p&gt;This v0.5.0 update demonstrates several practical capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generating safe synthetic data;&lt;/li&gt;
&lt;li&gt;running data validation before reporting;&lt;/li&gt;
&lt;li&gt;cleaning and preparing analytics-ready rows;&lt;/li&gt;
&lt;li&gt;loading reporting tables into PostgreSQL;&lt;/li&gt;
&lt;li&gt;creating reusable SQL views;&lt;/li&gt;
&lt;li&gt;connecting a local BI tool to the reporting database;&lt;/li&gt;
&lt;li&gt;documenting output files and scope;&lt;/li&gt;
&lt;li&gt;keeping generated artifacts out of Git.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is especially relevant for small data workflow projects.&lt;/p&gt;

&lt;p&gt;Many clients do not need a complex platform at the beginning. They need a reliable workflow that turns messy files into something a reporting tool can use.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is intentionally still out of scope?
&lt;/h2&gt;

&lt;p&gt;This demo intentionally does not include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;production Metabase deployment;&lt;/li&gt;
&lt;li&gt;cloud hosting;&lt;/li&gt;
&lt;li&gt;user authentication;&lt;/li&gt;
&lt;li&gt;embedded analytics;&lt;/li&gt;
&lt;li&gt;multi-tenant dashboarding;&lt;/li&gt;
&lt;li&gt;data warehouse modeling;&lt;/li&gt;
&lt;li&gt;Airflow orchestration;&lt;/li&gt;
&lt;li&gt;dbt project structure;&lt;/li&gt;
&lt;li&gt;Spark processing;&lt;/li&gt;
&lt;li&gt;scheduled production jobs;&lt;/li&gt;
&lt;li&gt;real customer data;&lt;/li&gt;
&lt;li&gt;AI, LLM, RAG, or agent features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those can be valid tools in other contexts.&lt;/p&gt;

&lt;p&gt;For this project, adding them too early would make the starter harder to run and harder to review.&lt;/p&gt;

&lt;p&gt;The current goal is narrower:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;small local workflow
validated and cleaned data
PostgreSQL reporting tables
simple SQL views
optional dashboard exploration
clear documentation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Run the tests
&lt;/h2&gt;

&lt;p&gt;Compile-check the package and scripts, then run the full test suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; compileall &lt;span class="nt"&gt;-q&lt;/span&gt; src/dq_etl_starter
python &lt;span class="nt"&gt;-m&lt;/span&gt; compileall &lt;span class="nt"&gt;-q&lt;/span&gt; scripts
pytest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run only the v0.5 BI tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest tests/test_bi.py
pytest tests/test_bi_reporting_sql.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL integration tests should remain optional and should be skipped unless &lt;code&gt;DATABASE_URL&lt;/code&gt; is configured.&lt;/p&gt;

&lt;p&gt;This is important because the project should stay easy to test even when external services are not running.&lt;/p&gt;
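
&lt;p&gt;One way to express that rule is a skip condition on the integration tests. The sketch below uses the standard library's &lt;code&gt;unittest&lt;/code&gt;, whose test classes pytest also collects. The helper &lt;code&gt;postgres_configured&lt;/code&gt; and the test class are hypothetical and are not taken from the repository. In pytest's own style, the same condition could be passed to &lt;code&gt;pytest.mark.skipif&lt;/code&gt;.&lt;/p&gt;

```python
import os
import unittest


def postgres_configured(env=None):
    """Return True when a non-empty DATABASE_URL is available."""
    env = os.environ if env is None else env
    return bool(env.get("DATABASE_URL", "").strip())


@unittest.skipUnless(postgres_configured(), "DATABASE_URL is not configured")
class PostgresIntegrationTest(unittest.TestCase):
    def test_database_url_looks_like_postgres(self):
        # Placeholder check; a real test would connect and query a view.
        self.assertTrue(os.environ["DATABASE_URL"].startswith("postgres"))
```

&lt;p&gt;With this shape, a plain test run on a machine without PostgreSQL reports the integration tests as skipped instead of failed.&lt;/p&gt;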

&lt;h2&gt;
  
  
  Run the default Docker workflow
&lt;/h2&gt;

&lt;p&gt;The default Docker run remains simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; data-quality-etl-starter:0.5.0 &lt;span class="nt"&gt;-t&lt;/span&gt; data-quality-etl-starter:latest &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; data-quality-etl-starter:0.5.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The BI demo uses Docker Compose services for PostgreSQL and optional Metabase, but the default image still supports a reproducible CLI workflow.&lt;/p&gt;

&lt;p&gt;That separation keeps the project easier to understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  How this maps to freelance client work
&lt;/h2&gt;

&lt;p&gt;This update maps to a realistic client request:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We have messy order/customer exports. Can you clean them, load them into a database, and prepare a few reporting tables or dashboard-ready views?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A first version does not always need a full data warehouse.&lt;/p&gt;

&lt;p&gt;A practical milestone might be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generate or receive input data;&lt;/li&gt;
&lt;li&gt;validate the schema;&lt;/li&gt;
&lt;li&gt;clean duplicates and bad values;&lt;/li&gt;
&lt;li&gt;load reporting tables into PostgreSQL;&lt;/li&gt;
&lt;li&gt;create SQL views for common metrics;&lt;/li&gt;
&lt;li&gt;connect a dashboard tool;&lt;/li&gt;
&lt;li&gt;document how to rerun the workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is exactly the kind of path this v0.5.0 demo is designed to show.&lt;/p&gt;

&lt;p&gt;For freelance work, the value is in the handoff:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clear commands;&lt;/li&gt;
&lt;li&gt;local reproducibility;&lt;/li&gt;
&lt;li&gt;synthetic data for safe demos;&lt;/li&gt;
&lt;li&gt;database tables that can be inspected;&lt;/li&gt;
&lt;li&gt;SQL views that can be reviewed;&lt;/li&gt;
&lt;li&gt;screenshots that show the workflow;&lt;/li&gt;
&lt;li&gt;a summary report that documents the run;&lt;/li&gt;
&lt;li&gt;scope notes that prevent overclaiming.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I would improve next
&lt;/h2&gt;

&lt;p&gt;Possible next improvements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;adding a few more reporting views;&lt;/li&gt;
&lt;li&gt;adding richer schema profiling;&lt;/li&gt;
&lt;li&gt;adding better BI summary formatting;&lt;/li&gt;
&lt;li&gt;adding a Makefile for repeated demo commands;&lt;/li&gt;
&lt;li&gt;adding GitHub Actions for test runs;&lt;/li&gt;
&lt;li&gt;adding a short validation note based on a real public dataset;&lt;/li&gt;
&lt;li&gt;adding optional scheduled local runs;&lt;/li&gt;
&lt;li&gt;adding clearer logging for each workflow stage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main constraint remains the same:&lt;/p&gt;

&lt;p&gt;Do not turn the starter into a heavy platform too early.&lt;/p&gt;

&lt;p&gt;It should stay small, reproducible, and easy to adapt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;GitHub repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/data-quality-etl-starter" rel="noopener noreferrer"&gt;https://github.com/OnerGit/data-quality-etl-starter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Previous article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/bob_oner/from-data-quality-checks-to-analytics-ready-parquet-with-python-39bd"&gt;From Data Quality Checks to Analytics-Ready Parquet with Python&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This v0.5.0 update is a practical next step: from cleaned analytics-ready data to PostgreSQL reporting tables, SQL views, and an optional local dashboard demo.&lt;/p&gt;

</description>
      <category>python</category>
      <category>etl</category>
      <category>postgres</category>
      <category>docker</category>
    </item>
    <item>
      <title>From Data Quality Checks to Analytics-Ready Parquet with Python</title>
      <dc:creator>Bob Oner</dc:creator>
      <pubDate>Tue, 09 Jun 2026 04:47:54 +0000</pubDate>
      <link>https://dev.to/bob_oner/from-data-quality-checks-to-analytics-ready-parquet-with-python-39bd</link>
      <guid>https://dev.to/bob_oner/from-data-quality-checks-to-analytics-ready-parquet-with-python-39bd</guid>
      <description>&lt;p&gt;In the first article, I walked through a small Python data quality ETL starter that reads messy CSV, Excel, JSON, and API-style data, validates it, cleans it, exports it, and generates quality reports.&lt;/p&gt;

&lt;p&gt;Previous article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/bob_oner/build-a-python-data-quality-etl-starter-for-messy-csv-excel-json-and-api-style-data-46b"&gt;Build a Python Data Quality ETL Starter for Messy CSV, Excel, JSON, and API-Style Data&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This follow-up focuses on the v0.4.0 update of the same project:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/data-quality-etl-starter" rel="noopener noreferrer"&gt;Data Quality ETL Starter on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The new goal is still intentionally small. This is not a data warehouse, a BI platform, an Airflow project, a dbt project, or an AI data application.&lt;/p&gt;

&lt;p&gt;The goal is to extend the starter workflow from small sample files to a more realistic local analytics demo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;generated messy order data
        ↓
existing validation and cleaning logic
        ↓
cleaned CSV
        ↓
cleaned Parquet
        ↓
DuckDB query demo
        ↓
summary CSV tables
        ↓
benchmark_report.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes the project more useful as a portfolio asset because it demonstrates not only data cleaning, but also the next handoff step: producing files that are easier to analyze locally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why add an analytics-ready export path?
&lt;/h2&gt;

&lt;p&gt;Many small-team data workflows do not end with a cleaned CSV file.&lt;/p&gt;

&lt;p&gt;A cleaned CSV is useful, but a reporting workflow often needs one more layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a file format suitable for repeated analysis;&lt;/li&gt;
&lt;li&gt;simple summary tables;&lt;/li&gt;
&lt;li&gt;SQL queries that can be reviewed and reused;&lt;/li&gt;
&lt;li&gt;a lightweight benchmark report;&lt;/li&gt;
&lt;li&gt;a way to test the workflow with more than a tiny demo file.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the reason v0.4.0 adds an optional analytics-ready path.&lt;/p&gt;

&lt;p&gt;The project still keeps the original CLI workflow as the source of truth. The analytics demo is an additional path, not a replacement for the existing CSV, Excel, JSON, mock API, SQLite, PostgreSQL, or FastAPI validation workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  What v0.4.0 adds
&lt;/h2&gt;

&lt;p&gt;The v0.4.0 update adds three main pieces.&lt;/p&gt;

&lt;p&gt;First, it adds a synthetic order data generator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scripts/generate_sample_data.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Second, it adds an analytics demo runner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scripts/run_analytics_demo.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Third, it adds analytics helpers under the package source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/dq_etl_starter/analytics.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Together, these files show how to generate repeatable synthetic data, run it through the existing cleaning and validation logic, and produce analytics-ready outputs.&lt;/p&gt;

&lt;p&gt;The project version is now &lt;code&gt;0.4.0&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project structure after the update
&lt;/h2&gt;

&lt;p&gt;The relevant structure now looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data-quality-etl-starter/
├── data/
│   ├── input/
│   ├── expected/
│   └── output/
├── docs/
│   ├── api.md
│   ├── analytics.md
│   └── postgres.md
├── screenshots/
├── scripts/
│   ├── generate_sample_data.py
│   └── run_analytics_demo.py
├── src/dq_etl_starter/
│   ├── analytics.py
│   ├── api.py
│   ├── clean.py
│   ├── cli.py
│   ├── exporters.py
│   ├── readers.py
│   ├── services.py
│   └── validate.py
└── tests/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is still a small project, but the workflow now has a clearer progression:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;local data quality workflow;&lt;/li&gt;
&lt;li&gt;optional PostgreSQL export;&lt;/li&gt;
&lt;li&gt;optional FastAPI validation wrapper;&lt;/li&gt;
&lt;li&gt;optional analytics-ready export demo.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That progression matters because it maps to how small client projects often grow.&lt;/p&gt;

&lt;p&gt;A client may first need a repeatable CSV cleanup script. Later, they may ask for database export, an API endpoint, or files that can feed reporting tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install the project locally
&lt;/h2&gt;

&lt;p&gt;Clone the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/OnerGit/data-quality-etl-starter.git
&lt;span class="nb"&gt;cd &lt;/span&gt;data-quality-etl-starter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate it on macOS or Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate it on Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;venv&lt;/span&gt;&lt;span class="n"&gt;\Scripts\activate&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install dependencies and the local package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The editable install step is useful because the repository uses a &lt;code&gt;src/&lt;/code&gt; layout, so the package is not importable from the repository root until it is installed.&lt;/p&gt;

&lt;p&gt;For the v0.4 analytics demo, the important optional dependencies are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pyarrow&lt;/code&gt; for Parquet output;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;duckdb&lt;/code&gt; for local SQL queries over Parquet files;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;faker&lt;/code&gt; for generated demo data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Generate synthetic order data
&lt;/h2&gt;

&lt;p&gt;The generator creates deterministic synthetic customer/order-style data.&lt;/p&gt;

&lt;p&gt;It does not download real customer data. It does not use a private dataset. It is designed only for testing and demonstration.&lt;/p&gt;

&lt;p&gt;Generate 1,000 rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/generate_sample_data.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rows&lt;/span&gt; 1000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; data/generated/orders_1k.csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--seed&lt;/span&gt; 42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate 10,000 rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/generate_sample_data.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rows&lt;/span&gt; 10000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; data/generated/orders_10k.csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--seed&lt;/span&gt; 42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generate 100,000 rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/generate_sample_data.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rows&lt;/span&gt; 100000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; data/generated/orders_100k.csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--seed&lt;/span&gt; 42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Windows PowerShell example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;scripts/generate_sample_data.py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--rows&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;100000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--output&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/generated/orders_100k.csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--seed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;42&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the 100,000-row generation step:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm16dpqkw3d0imla69o09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm16dpqkw3d0imla69o09.png" alt="Generated 100k synthetic data" width="800" height="135"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fixed seed makes the output reproducible. That is useful for documentation, tests, demos, and future comparisons.&lt;/p&gt;
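
&lt;p&gt;The idea behind the seed can be shown in a few lines. This sketch uses the standard library's &lt;code&gt;random&lt;/code&gt; module instead of &lt;code&gt;faker&lt;/code&gt;, and the function name, column names, and country list are assumptions, not the actual contents of &lt;code&gt;scripts/generate_sample_data.py&lt;/code&gt;.&lt;/p&gt;

```python
import random

COUNTRIES = ["US", "DE", "FR", "JP"]  # hypothetical sample values


def generate_orders(rows, seed):
    """Return synthetic order dicts; the same seed gives the same rows."""
    rng = random.Random(seed)  # private generator, no global state
    orders = []
    for order_id in range(1, rows + 1):
        orders.append(
            {
                "order_id": order_id,
                "country": rng.choice(COUNTRIES),
                "quantity": rng.randint(1, 5),
                "unit_price": round(rng.uniform(5, 200), 2),
            }
        )
    return orders


first = generate_orders(rows=1000, seed=42)
second = generate_orders(rows=1000, seed=42)
print(first == second)  # prints True
```

&lt;p&gt;Using a private &lt;code&gt;random.Random&lt;/code&gt; instance keeps the result independent of any other code that touches the global random state.&lt;/p&gt;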

&lt;h2&gt;
  
  
  What kind of issues are introduced?
&lt;/h2&gt;

&lt;p&gt;The generated data intentionally includes common data quality issues.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;missing email values;&lt;/li&gt;
&lt;li&gt;invalid email values;&lt;/li&gt;
&lt;li&gt;missing country values;&lt;/li&gt;
&lt;li&gt;inconsistent country casing;&lt;/li&gt;
&lt;li&gt;duplicate rows;&lt;/li&gt;
&lt;li&gt;invalid dates;&lt;/li&gt;
&lt;li&gt;negative quantities;&lt;/li&gt;
&lt;li&gt;zero prices.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is important because a data quality demo should not only process clean data.&lt;/p&gt;

&lt;p&gt;If the generated data is too perfect, the validation and cleaning workflow does not prove much. The point is to create enough realistic messiness to exercise the existing workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run the analytics-ready export demo
&lt;/h2&gt;

&lt;p&gt;After generating the input file, run the analytics demo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/run_analytics_demo.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; data/generated/orders_100k.csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--schema&lt;/span&gt; data/expected/generated_order_schema.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; data/output/analytics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;scripts/run_analytics_demo.py&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/generated/orders_100k.csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--schema&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/expected/generated_order_schema.json&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--output-dir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/output/analytics&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script runs the generated input through the existing validation and cleaning logic, then writes analytics-ready outputs.&lt;/p&gt;

&lt;p&gt;Expected output files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/analytics/cleaned_orders.csv
data/output/analytics/cleaned_orders.parquet
data/output/analytics/customer_summary.csv
data/output/analytics/revenue_by_country.csv
data/output/analytics/orders_by_month.csv
data/output/analytics/source_system_summary.csv
data/output/analytics/analytics_queries.sql
data/output/analytics/benchmark_report.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the analytics output and DuckDB query demo:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F21vpp0iq2gdze7pgo8e8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F21vpp0iq2gdze7pgo8e8.png" alt="Analytics outputs and DuckDB query" width="800" height="637"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output directory is intentionally excluded from Git. Large generated files and local analytics outputs should not be committed to the repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why export Parquet?
&lt;/h2&gt;

&lt;p&gt;CSV is easy to inspect and share. It is a good default output for small workflows.&lt;/p&gt;

&lt;p&gt;Parquet is useful when the same cleaned dataset will be queried repeatedly or used by analytics tools. It stores each column's type in the file, whereas CSV stores every value as text, and it is a columnar format that most analytics tools can read.&lt;/p&gt;
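
&lt;p&gt;The CSV side of that comparison is easy to see with the standard library alone. In this small sketch the column names and values are made up for illustration. A typed row goes into a CSV file and comes back as strings.&lt;/p&gt;

```python
import csv
import io

row = {"order_id": 17, "revenue": 19.99, "is_repeat": True}

# Write one typed row to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: every value is now a plain string.
buffer.seek(0)
restored = next(csv.DictReader(buffer))
print(restored)
```

&lt;p&gt;Every consumer of the CSV has to guess or re-declare the types. A Parquet file carries its schema with the data, so a reader such as DuckDB gets integers, floats, and booleans back without extra parsing rules.&lt;/p&gt;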

&lt;p&gt;In this project, Parquet is not used to make the project sound bigger than it is. It is used as a practical local export format after the cleaning step.&lt;/p&gt;

&lt;p&gt;The key handoff idea is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cleaned CSV for readability
cleaned Parquet for local analytics
summary CSV files for reporting
SQL file for repeatable queries
benchmark report for documentation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That combination is still small, but it is more useful than a single cleaned CSV file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Query Parquet locally with DuckDB
&lt;/h2&gt;

&lt;p&gt;The demo writes reusable SQL to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/analytics/analytics_queries.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An example query looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'data/output/analytics/cleaned_orders.parquet'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query uses DuckDB to read the Parquet file directly.&lt;/p&gt;

&lt;p&gt;That is a useful pattern for small local workflows because it avoids setting up a separate database service just to inspect analytics-ready output.&lt;/p&gt;

&lt;p&gt;For a portfolio project, it also shows a clear bridge between Python data cleaning and SQL-based analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary tables
&lt;/h2&gt;

&lt;p&gt;The analytics demo produces several summary CSV files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;customer_summary.csv
revenue_by_country.csv
orders_by_month.csv
source_system_summary.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are intentionally simple.&lt;/p&gt;

&lt;p&gt;They are not a BI dashboard. They are not a reporting platform. They are small output tables that show how the cleaned data can be prepared for the next step.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;revenue_by_country.csv&lt;/code&gt; can support a country-level revenue view;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;orders_by_month.csv&lt;/code&gt; can support a monthly trend view;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;source_system_summary.csv&lt;/code&gt; can help compare different input sources;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;customer_summary.csv&lt;/code&gt; can support customer-level reporting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In client work, this kind of output is often enough for a first automation milestone. The client can open the CSV files, load them into a spreadsheet, connect them to a BI tool, or use them as the input for the next version.&lt;/p&gt;
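
&lt;p&gt;As a rough sketch of how small such a summary table can be, the following builds an orders-by-month table with the standard library. The column name &lt;code&gt;order_date&lt;/code&gt;, the output headers, and the sample rows are assumptions, not the project's actual schema.&lt;/p&gt;

```python
import csv
import io
from collections import Counter
from datetime import date

# Hypothetical cleaned rows; the real column names may differ.
cleaned_orders = [
    {"order_id": 1, "order_date": date(2026, 1, 5)},
    {"order_id": 2, "order_date": date(2026, 1, 19)},
    {"order_id": 3, "order_date": date(2026, 2, 2)},
]


def orders_by_month(rows):
    """Count orders per calendar month, sorted by month."""
    counts = Counter(row["order_date"].strftime("%Y-%m") for row in rows)
    return sorted(counts.items())


monthly = orders_by_month(cleaned_orders)

# Render the summary as CSV text (in memory here, a file in practice).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["month", "order_count"])
writer.writerows(monthly)
print(buffer.getvalue())
```

&lt;p&gt;A table like this opens directly in a spreadsheet, which is often all a first milestone needs.&lt;/p&gt;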

&lt;h2&gt;
  
  
  Benchmark report
&lt;/h2&gt;

&lt;p&gt;The analytics demo also writes a Markdown benchmark report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/analytics/benchmark_report.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the report screenshot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhiqkgb05syf1xdhrqi5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhiqkgb05syf1xdhrqi5n.png" alt="Benchmark report" width="800" height="1334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The report records information such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;input file path;&lt;/li&gt;
&lt;li&gt;raw row count;&lt;/li&gt;
&lt;li&gt;cleaned row count;&lt;/li&gt;
&lt;li&gt;analytics-ready row count;&lt;/li&gt;
&lt;li&gt;validation issue count;&lt;/li&gt;
&lt;li&gt;runtime seconds;&lt;/li&gt;
&lt;li&gt;output file paths;&lt;/li&gt;
&lt;li&gt;DuckDB preview query result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The report is not meant to be a formal performance benchmark.&lt;/p&gt;

&lt;p&gt;It is a lightweight run record. Its purpose is to make each run easier to review, compare, and hand off.&lt;/p&gt;
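
&lt;p&gt;A run record like this needs very little code. The following is a minimal sketch, not the project's report writer: the function name &lt;code&gt;render_run_record&lt;/code&gt;, the Markdown layout, and the stand-in cleaning step are all assumptions made for illustration.&lt;/p&gt;

```python
import time


def render_run_record(metrics, outputs):
    """Build a small Markdown run record from plain values."""
    lines = ["# Run record", ""]
    for label, value in metrics.items():
        lines.append(f"- {label}: {value}")
    lines += ["", "## Output files", ""]
    lines += [f"- {path}" for path in outputs]
    return "\n".join(lines) + "\n"


started = time.perf_counter()
raw_rows = [{"id": "1"}, {"id": "1"}, {"id": "2"}]
# Order-preserving de-duplication stands in for the real cleaning step.
cleaned_rows = list({tuple(row.items()): row for row in raw_rows}.values())
runtime = time.perf_counter() - started

report = render_run_record(
    {
        "raw row count": len(raw_rows),
        "cleaned row count": len(cleaned_rows),
        "runtime seconds": f"{runtime:.3f}",
    },
    ["data/output/analytics/benchmark_report.md"],
)
print(report)
```

&lt;p&gt;Because the record is plain Markdown, two runs can be compared with an ordinary text diff.&lt;/p&gt;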

&lt;h2&gt;
  
  
  How validation still fits in
&lt;/h2&gt;

&lt;p&gt;The v0.4 analytics path does not bypass the existing validation workflow.&lt;/p&gt;

&lt;p&gt;The script still loads a schema file, reads the input CSV, validates expected fields, cleans the data, and then prepares the analytics output.&lt;/p&gt;

&lt;p&gt;That matters because analytics output should not be generated from uninspected raw input.&lt;/p&gt;

&lt;p&gt;The basic flow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;read generated CSV
        ↓
load expected schema
        ↓
validate schema rules
        ↓
clean duplicate rows and text values
        ↓
prepare analytics columns
        ↓
write CSV, Parquet, summary tables, SQL, and benchmark report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps the analytics demo connected to the original purpose of the project: data quality before reporting.&lt;/p&gt;
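
&lt;p&gt;The ordering in that flow can be sketched in a few lines. This is an illustration under stated assumptions, not the script itself: the column names &lt;code&gt;order_id&lt;/code&gt; and &lt;code&gt;order_date&lt;/code&gt;, the derived &lt;code&gt;order_month&lt;/code&gt; column, and the choice to stop when the schema check fails are all hypothetical, and the cleaning step is left out for brevity.&lt;/p&gt;

```python
from datetime import date


def check_schema(rows, expected_columns):
    """Collect schema issues without modifying the rows."""
    issues = []
    for number, row in enumerate(rows, start=1):
        missing = [name for name in expected_columns if name not in row]
        if missing:
            issues.append(f"record {number}: missing {', '.join(missing)}")
    return issues


def prepare_analytics_columns(rows):
    """Add a derived order_month column (YYYY-MM) to each row."""
    prepared = []
    for row in rows:
        month = date.fromisoformat(row["order_date"]).strftime("%Y-%m")
        prepared.append({**row, "order_month": month})
    return prepared


def run_flow(rows, expected_columns):
    """Validate first; derive analytics columns only from checked rows."""
    issues = check_schema(rows, expected_columns)
    if issues:
        # Assumption: this sketch stops here; the real workflow may only report issues.
        return [], issues
    return prepare_analytics_columns(rows), issues


rows = [{"order_id": "A1", "order_date": "2026-05-17"}]
prepared, issues = run_flow(rows, ["order_id", "order_date"])
blocked, blocked_issues = run_flow([{"order_id": "A2"}], ["order_id", "order_date"])
print(prepared, blocked_issues)
```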

&lt;h2&gt;
  
  
  What is intentionally still out of scope?
&lt;/h2&gt;

&lt;p&gt;The v0.4.0 update does not add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;user login;&lt;/li&gt;
&lt;li&gt;frontend application;&lt;/li&gt;
&lt;li&gt;async task queue;&lt;/li&gt;
&lt;li&gt;cloud deployment;&lt;/li&gt;
&lt;li&gt;BI dashboard;&lt;/li&gt;
&lt;li&gt;Metabase;&lt;/li&gt;
&lt;li&gt;data warehouse implementation;&lt;/li&gt;
&lt;li&gt;Airflow;&lt;/li&gt;
&lt;li&gt;dbt;&lt;/li&gt;
&lt;li&gt;Spark;&lt;/li&gt;
&lt;li&gt;SQLModel metadata layer;&lt;/li&gt;
&lt;li&gt;AI, LLM, RAG, or agent features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those tools can be useful in the right context, but they would make this starter project heavier than necessary.&lt;/p&gt;

&lt;p&gt;The design goal remains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;small
runnable
testable
documented
screenshot-ready
easy to inspect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is more useful for this stage than adding a large platform stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run the tests
&lt;/h2&gt;

&lt;p&gt;Byte-compile the package and scripts as a quick syntax check, then run the full test suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; compileall &lt;span class="nt"&gt;-q&lt;/span&gt; src/dq_etl_starter
python &lt;span class="nt"&gt;-m&lt;/span&gt; compileall &lt;span class="nt"&gt;-q&lt;/span&gt; scripts
pytest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run only the v0.4 analytics-related tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest tests/test_generate_sample_data.py
pytest tests/test_analytics.py
pytest tests/test_exporters_parquet.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The PostgreSQL integration tests should stay optional: they should be skipped unless &lt;code&gt;DATABASE_URL&lt;/code&gt; is configured.&lt;/p&gt;

&lt;p&gt;This is another reason to keep the workflow modular. The generator, analytics helpers, Parquet exporter, and original ETL workflow can be checked independently.&lt;/p&gt;
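
&lt;p&gt;One way to keep database tests optional is to gate them on the environment variable. The sketch below uses the standard library's &lt;code&gt;unittest.skipUnless&lt;/code&gt;, which pytest also collects; in a pure pytest test file the same idea is usually written with &lt;code&gt;pytest.mark.skipif&lt;/code&gt;. The helper and test class names are hypothetical and are not taken from the repository.&lt;/p&gt;

```python
import os
import unittest


def database_url_configured(environ=None):
    """Return True when DATABASE_URL is set to a non-empty value."""
    env = os.environ if environ is None else environ
    return bool(env.get("DATABASE_URL", "").strip())


@unittest.skipUnless(database_url_configured(), "DATABASE_URL is not configured")
class PostgresExportIntegrationTest(unittest.TestCase):
    """Hypothetical integration test; skipped on machines without a database."""

    def test_database_url_uses_postgres_scheme(self):
        self.assertTrue(os.environ["DATABASE_URL"].startswith("postgresql"))
```

&lt;p&gt;With this pattern, a plain &lt;code&gt;pytest&lt;/code&gt; run on a fresh clone reports the database tests as skipped instead of failing.&lt;/p&gt;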

&lt;h2&gt;
  
  
  Run with Docker
&lt;/h2&gt;

&lt;p&gt;Build and run the default Docker workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; data-quality-etl-starter:0.4.0 &lt;span class="nt"&gt;-t&lt;/span&gt; data-quality-etl-starter:latest &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; data-quality-etl-starter:0.4.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default Docker run remains a simple reproducible CLI workflow.&lt;/p&gt;

&lt;p&gt;This is intentional. The Docker path should not become complicated just because the project gained an optional analytics demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  How this maps to freelance client work
&lt;/h2&gt;

&lt;p&gt;This update maps well to common small data workflow requests.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generating test data before working with private client data;&lt;/li&gt;
&lt;li&gt;validating messy order or customer exports;&lt;/li&gt;
&lt;li&gt;cleaning data before monthly reporting;&lt;/li&gt;
&lt;li&gt;producing Parquet files for local analytics;&lt;/li&gt;
&lt;li&gt;writing repeatable DuckDB SQL queries;&lt;/li&gt;
&lt;li&gt;producing summary CSV files for spreadsheet or BI handoff;&lt;/li&gt;
&lt;li&gt;documenting each run with a benchmark or quality report.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a client, this kind of workflow is valuable because it is practical and reviewable.&lt;/p&gt;

&lt;p&gt;It answers questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What did the script read?&lt;/li&gt;
&lt;li&gt;What did it clean?&lt;/li&gt;
&lt;li&gt;What outputs did it produce?&lt;/li&gt;
&lt;li&gt;Can the same workflow run again next week?&lt;/li&gt;
&lt;li&gt;Can the output be inspected without a custom application?&lt;/li&gt;
&lt;li&gt;Can the workflow be extended without rewriting everything?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the level of reliability many small automation projects need before they become larger systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would improve next
&lt;/h2&gt;

&lt;p&gt;The next version could move one step closer to reporting workflows without turning the project into a full BI platform.&lt;/p&gt;

&lt;p&gt;Possible next steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;write PostgreSQL reporting tables or views;&lt;/li&gt;
&lt;li&gt;add more summary table examples;&lt;/li&gt;
&lt;li&gt;add a lightweight local BI demo;&lt;/li&gt;
&lt;li&gt;improve benchmark report formatting;&lt;/li&gt;
&lt;li&gt;add better schema profiling;&lt;/li&gt;
&lt;li&gt;add more realistic validation rules;&lt;/li&gt;
&lt;li&gt;add CI for automated test runs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important constraint is to keep the project focused.&lt;/p&gt;

&lt;p&gt;The current project is a small Python data quality ETL starter. The v0.4.0 update makes it easier to demonstrate analytics-ready output, but it is still not trying to become a full data platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;GitHub repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/data-quality-etl-starter" rel="noopener noreferrer"&gt;https://github.com/OnerGit/data-quality-etl-starter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Previous article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/bob_oner/build-a-python-data-quality-etl-starter-for-messy-csv-excel-json-and-api-style-data-46b"&gt;Build a Python Data Quality ETL Starter for Messy CSV, Excel, JSON, and API-Style Data&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This v0.4.0 update is a practical next step: from cleaning and validation to analytics-ready local outputs that can be inspected, queried, and handed off.&lt;/p&gt;

</description>
      <category>python</category>
      <category>etl</category>
      <category>dataengineering</category>
      <category>docker</category>
    </item>
    <item>
      <title>Build a Python Data Quality ETL Starter for Messy CSV, Excel, JSON, and API-Style Data</title>
      <dc:creator>Bob Oner</dc:creator>
      <pubDate>Wed, 03 Jun 2026 07:43:59 +0000</pubDate>
      <link>https://dev.to/bob_oner/build-a-python-data-quality-etl-starter-for-messy-csv-excel-json-and-api-style-data-46b</link>
      <guid>https://dev.to/bob_oner/build-a-python-data-quality-etl-starter-for-messy-csv-excel-json-and-api-style-data-46b</guid>
      <description>&lt;p&gt;Small teams often receive data before it is ready for reporting.&lt;/p&gt;

&lt;p&gt;It may come from a CSV export, an Excel file, a JSON payload, or an API-style response. The structure is usually close enough to be useful, but not clean enough to trust directly.&lt;/p&gt;

&lt;p&gt;Common problems include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inconsistent column names&lt;/li&gt;
&lt;li&gt;missing values&lt;/li&gt;
&lt;li&gt;duplicate rows&lt;/li&gt;
&lt;li&gt;invalid emails&lt;/li&gt;
&lt;li&gt;bad dates&lt;/li&gt;
&lt;li&gt;numeric fields stored as messy text&lt;/li&gt;
&lt;li&gt;nested JSON that needs to become a table&lt;/li&gt;
&lt;li&gt;repeated manual cleanup before every report&lt;/li&gt;
&lt;li&gt;no quality report for handoff&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article walks through a small open-source Python project I built to handle that kind of problem:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/data-quality-etl-starter" rel="noopener noreferrer"&gt;Data Quality ETL Starter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is not a big data platform. It is not an Airflow or dbt project. It is not a production data warehouse. It is a small, reproducible starter workflow for data validation, cleaning, export, and reporting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this project builds
&lt;/h2&gt;

&lt;p&gt;The project takes messy input data and runs it through a repeatable workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;messy CSV / Excel / JSON / mock API data
        ↓
read and flatten
        ↓
normalize columns
        ↓
validate expected schema rules
        ↓
clean duplicate rows and text values
        ↓
export cleaned CSV + SQLite
        ↓
generate data quality report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The current version supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSV input&lt;/li&gt;
&lt;li&gt;Excel input&lt;/li&gt;
&lt;li&gt;nested JSON input&lt;/li&gt;
&lt;li&gt;mock API-style JSON input&lt;/li&gt;
&lt;li&gt;column name normalization&lt;/li&gt;
&lt;li&gt;Pydantic-based workflow and schema models&lt;/li&gt;
&lt;li&gt;missing value, duplicate row, email, date, and number validation&lt;/li&gt;
&lt;li&gt;cleaned CSV output&lt;/li&gt;
&lt;li&gt;SQLite output by default&lt;/li&gt;
&lt;li&gt;optional PostgreSQL export&lt;/li&gt;
&lt;li&gt;Markdown and JSON data quality reports&lt;/li&gt;
&lt;li&gt;CLI execution&lt;/li&gt;
&lt;li&gt;pytest tests&lt;/li&gt;
&lt;li&gt;Docker-based execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main goal is not to build a complex platform. The goal is to show how a small data workflow can be structured, tested, documented, and adapted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters for small-team data workflows
&lt;/h2&gt;

&lt;p&gt;Many small-team data problems are not "big data" problems.&lt;/p&gt;

&lt;p&gt;They are repeatability problems.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a sales team exports a customer list every week&lt;/li&gt;
&lt;li&gt;an operations team receives Excel files from different sources&lt;/li&gt;
&lt;li&gt;a founder wants a simple CSV-to-report workflow&lt;/li&gt;
&lt;li&gt;an analyst needs JSON payloads flattened into reporting tables&lt;/li&gt;
&lt;li&gt;a freelancer needs to hand off a data cleanup script that another person can run&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A one-off script can solve one file once.&lt;/p&gt;

&lt;p&gt;A small workflow is more useful because it makes the steps explicit:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What data did we receive?&lt;/li&gt;
&lt;li&gt;What rules did we expect?&lt;/li&gt;
&lt;li&gt;What changed during cleaning?&lt;/li&gt;
&lt;li&gt;What output files were created?&lt;/li&gt;
&lt;li&gt;What warnings should the next person review?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is the reason this project writes both cleaned data and a quality report.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project structure
&lt;/h2&gt;

&lt;p&gt;The repository keeps the workflow small and modular:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data-quality-etl-starter/
├── data/
│   ├── input/
│   ├── expected/
│   └── output/
├── docs/
├── screenshots/
├── src/dq_etl_starter/
├── tests/
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most important source modules are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/dq_etl_starter/
├── readers.py       # read CSV, Excel, and JSON files
├── mock_api.py      # simulate API-style JSON without network calls
├── normalize.py     # normalize columns and flatten JSON
├── validate.py      # validate expected columns and simple schema rules
├── clean.py         # trim text values and drop duplicates
├── exporters.py     # export cleaned data to CSV, SQLite, or PostgreSQL
├── report.py        # generate Markdown and JSON reports
├── models.py        # Pydantic models for workflow contracts
└── cli.py           # command-line entry point
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation is intentional. Each step can be tested, replaced, or extended without turning the project into one long script.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;Clone the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/OnerGit/data-quality-etl-starter.git
&lt;span class="nb"&gt;cd &lt;/span&gt;data-quality-etl-starter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate it on macOS or Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate it on Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;venv&lt;/span&gt;&lt;span class="n"&gt;\Scripts\activate&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install dependencies and the local package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



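&lt;p&gt;The editable install matters because the source code uses a &lt;code&gt;src/&lt;/code&gt; layout: by default, the &lt;code&gt;dq_etl_starter&lt;/code&gt; package is not importable from the repository root until it is installed.&lt;/p&gt;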

&lt;h2&gt;
  
  
  Input data examples
&lt;/h2&gt;

&lt;p&gt;The project includes sample inputs under &lt;code&gt;data/input/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The examples are designed to represent common data workflow formats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/input/messy_customers.csv
data/input/messy_orders.xlsx
data/input/nested_customers.json
data/input/mock_api_orders.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is an example of the kind of messy source data the workflow is designed to handle:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2xd3ghqlrkuhbm680qm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2xd3ghqlrkuhbm680qm.png" alt="Raw messy data" width="798" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The mock API file does not call a real external API. It simulates an API-style JSON response so the workflow remains reproducible and does not require API keys.&lt;/p&gt;

&lt;p&gt;That is useful for a starter project because anyone can run it locally without creating accounts, setting secrets, or depending on an external service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run the CSV workflow
&lt;/h2&gt;

&lt;p&gt;The CSV example reads a messy customer file, validates it against an expected schema, cleans it, exports the result, and generates reports.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; dq_etl_starter.cli run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; data/input/messy_customers.csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input-type&lt;/span&gt; csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--schema&lt;/span&gt; data/expected/customer_schema.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; data/output/csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-target&lt;/span&gt; sqlite &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--table-name&lt;/span&gt; cleaned_customers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/csv/cleaned_customers.csv
data/output/csv/etl_output.sqlite
data/output/csv/quality_report.md
data/output/csv/quality_report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvt5pncsyu00pkdokshpk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvt5pncsyu00pkdokshpk.png" alt="CLI workflow run" width="800" height="187"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The important point is that the workflow does not only create a cleaned file. It also records what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run the Excel workflow
&lt;/h2&gt;

&lt;p&gt;Excel exports are common in small business and operations workflows.&lt;/p&gt;

&lt;p&gt;Run the sample Excel input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; dq_etl_starter.cli run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; data/input/messy_orders.xlsx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input-type&lt;/span&gt; excel &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--schema&lt;/span&gt; data/expected/order_schema.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; data/output/excel &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-target&lt;/span&gt; sqlite &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--table-name&lt;/span&gt; cleaned_orders
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output pattern is the same:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/excel/cleaned_orders.csv
data/output/excel/etl_output.sqlite
data/output/excel/quality_report.md
data/output/excel/quality_report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keeping the CSV and Excel output folders separate makes it easier to compare runs without overwriting previous reports.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run nested JSON data
&lt;/h2&gt;

&lt;p&gt;JSON is useful for APIs and application exports, but nested JSON is not always reporting-ready.&lt;/p&gt;

&lt;p&gt;This project supports a &lt;code&gt;--records-path&lt;/code&gt; option so the workflow can extract a list of records from a nested payload.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; dq_etl_starter.cli run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; data/input/nested_customers.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input-type&lt;/span&gt; json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--records-path&lt;/span&gt; data.customers &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--schema&lt;/span&gt; data/expected/customer_schema.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; data/output/json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-target&lt;/span&gt; sqlite &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--table-name&lt;/span&gt; cleaned_customers_json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq27ivha41a5dz0z5sn6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuq27ivha41a5dz0z5sn6.png" alt="Nested JSON flattened output" width="800" height="1548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This step demonstrates a practical pattern: convert nested data into a tabular structure before validation, cleaning, and reporting.&lt;/p&gt;
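
&lt;p&gt;As a rough idea of what a records path involves, here is a minimal sketch that walks a dotted path such as &lt;code&gt;data.customers&lt;/code&gt; and flattens each nested record into one level. The function names, the underscore-joined column naming, and the sample payload are assumptions for illustration; the project's own reader may behave differently.&lt;/p&gt;

```python
def extract_records(payload, records_path):
    """Walk a dotted path such as "data.customers" and return the list found there."""
    node = payload
    for key in records_path.split("."):
        node = node[key]
    if not isinstance(node, list):
        raise TypeError(f"{records_path!r} does not point to a list of records")
    return node


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with underscores."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


payload = {
    "data": {
        "customers": [
            {"id": 1, "name": "Ada", "address": {"city": "Berlin", "country": "DE"}},
        ]
    }
}
table = [flatten(record) for record in extract_records(payload, "data.customers")]
print(table)
```

&lt;p&gt;Once every record is a flat dictionary, the same validation and cleaning steps used for CSV input can be applied unchanged.&lt;/p&gt;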

&lt;h2&gt;
  
  
  Run mock API-style data
&lt;/h2&gt;

&lt;p&gt;The mock API workflow uses a local JSON file that looks like an API response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; dq_etl_starter.cli run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; data/input/mock_api_orders.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input-type&lt;/span&gt; mock-api &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--records-path&lt;/span&gt; data.orders &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--schema&lt;/span&gt; data/expected/order_schema.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; data/output/mock_api &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-target&lt;/span&gt; sqlite &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--table-name&lt;/span&gt; cleaned_api_orders
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is intentionally not a real API integration in the current version.&lt;/p&gt;

&lt;p&gt;For a client project, this layer could later be replaced with a real API reader that handles authentication, pagination, retries, and rate limits. For a starter project, using a local mock payload keeps the workflow easy to run and test.&lt;/p&gt;

&lt;h2&gt;
  
  
  How validation works
&lt;/h2&gt;

&lt;p&gt;The project uses schema files under &lt;code&gt;data/expected/&lt;/code&gt; to define what the workflow expects.&lt;/p&gt;

&lt;p&gt;For example, a customer schema can describe expected columns and simple column rules. The workflow then checks the raw data before cleaning.&lt;/p&gt;

&lt;p&gt;Validation can detect issues such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;missing expected columns&lt;/li&gt;
&lt;li&gt;unexpected columns&lt;/li&gt;
&lt;li&gt;missing values&lt;/li&gt;
&lt;li&gt;duplicate rows&lt;/li&gt;
&lt;li&gt;invalid email formats&lt;/li&gt;
&lt;li&gt;invalid date values&lt;/li&gt;
&lt;li&gt;invalid number values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The validation report uses source-oriented row references for CSV-style inputs. For example, if the source CSV has one header line and five data rows, a warning on &lt;code&gt;Row 6&lt;/code&gt; points to the fifth data record in the source file.&lt;/p&gt;
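
&lt;p&gt;The row arithmetic is easy to get wrong by one, so here is a small sketch of it. It is not the project's validator: the function name, the deliberately simple email pattern, and the sample data are assumptions, and it assumes no quoted field spans multiple lines.&lt;/p&gt;

```python
import csv
import io
import re

# Deliberately simple pattern: something@something.tld, no whitespace.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def find_invalid_emails(csv_text, column="email"):
    """Return warnings that reference rows as they appear in the source file."""
    warnings = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so the first data record is source row 2.
    for source_row, record in enumerate(reader, start=2):
        value = (record.get(column) or "").strip()
        if not EMAIL_PATTERN.match(value):
            warnings.append(f"Row {source_row}: invalid {column} {value!r}")
    return warnings


csv_text = (
    "name,email\n"
    "Ada,ada@example.com\n"
    "Bob,bob@example.com\n"
    "Cy,cy@example.com\n"
    "Di,di@example.com\n"
    "Eve,not-an-email\n"
)
warnings = find_invalid_emails(csv_text)
print(warnings)
```

&lt;p&gt;The fifth data record is reported as &lt;code&gt;Row 6&lt;/code&gt;, which is the line a reviewer sees when they open the source file in a spreadsheet or text editor.&lt;/p&gt;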

&lt;p&gt;Pydantic is used for the workflow and reporting contracts. The row-level cleaning and validation remain DataFrame-based because CSV, Excel, JSON, and API-style datasets often have changing columns.&lt;/p&gt;

&lt;p&gt;The validation step does not try to solve every business rule. That is a deliberate choice.&lt;/p&gt;

&lt;p&gt;In real client work, every dataset has different rules. A small starter should make validation easy to extend rather than hardcoding too many assumptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  How cleaning works
&lt;/h2&gt;

&lt;p&gt;After validation, the workflow applies simple cleaning steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;trim text values&lt;/li&gt;
&lt;li&gt;normalize selected text fields&lt;/li&gt;
&lt;li&gt;drop duplicate rows&lt;/li&gt;
&lt;li&gt;keep the cleaned data in a DataFrame&lt;/li&gt;
&lt;li&gt;export the cleaned result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project keeps cleaning intentionally conservative.&lt;/p&gt;

&lt;p&gt;It does not silently invent missing values. It does not guess complex business logic. It focuses on cleaning steps that are easy to explain and review.&lt;/p&gt;

&lt;p&gt;That matters for handoff. When someone else receives the output, they should be able to understand what the script changed and what still needs human review.&lt;/p&gt;
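
&lt;p&gt;Conservative cleaning of this kind can be sketched with plain dictionaries. The project itself works on DataFrames, so treat this as an illustration only; the function name and the shape of the change summary are assumptions.&lt;/p&gt;

```python
def clean_rows(rows):
    """Trim text values and drop exact duplicates; return rows plus a change summary."""
    cleaned = []
    seen = set()
    trimmed_values = 0
    for row in rows:
        trimmed = {}
        for key, value in row.items():
            if isinstance(value, str) and value != value.strip():
                trimmed_values += 1
                value = value.strip()
            # Missing values stay missing; nothing is guessed or filled in.
            trimmed[key] = value
        fingerprint = tuple(trimmed.items())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(trimmed)
    summary = {
        "trimmed_values": trimmed_values,
        "duplicate_rows_dropped": len(rows) - len(cleaned),
    }
    return cleaned, summary


rows = [
    {"name": " Ada ", "email": "ada@example.com"},
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Bob", "email": ""},
]
cleaned, summary = clean_rows(rows)
print(cleaned, summary)
```

&lt;p&gt;Returning a summary next to the cleaned rows is what lets the report state what changed, instead of leaving the reviewer to diff the files.&lt;/p&gt;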

&lt;h2&gt;
  
  
  Output files
&lt;/h2&gt;

&lt;p&gt;Each run can create four main outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cleaned data as CSV
cleaned data in SQLite
quality report as Markdown
quality report as JSON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/output/csv/cleaned_customers.csv
data/output/csv/etl_output.sqlite
data/output/csv/quality_report.md
data/output/csv/quality_report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Markdown report is useful for humans. The JSON report is useful if another tool needs to consume the result later.&lt;/p&gt;

&lt;p&gt;A typical report includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;raw row count&lt;/li&gt;
&lt;li&gt;cleaned row count&lt;/li&gt;
&lt;li&gt;column list&lt;/li&gt;
&lt;li&gt;missing values by column&lt;/li&gt;
&lt;li&gt;duplicate row count&lt;/li&gt;
&lt;li&gt;missing expected columns&lt;/li&gt;
&lt;li&gt;unexpected columns&lt;/li&gt;
&lt;li&gt;validation issues&lt;/li&gt;
&lt;li&gt;output file paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6c9v96s4fkflrlrfoic2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6c9v96s4fkflrlrfoic2.png" alt="Data quality report" width="800" height="1261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cleaned output can then be reviewed directly or used in a later reporting workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcl81deca50nij4ikkrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcl81deca50nij4ikkrh.png" alt="Cleaned output" width="796" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This turns the workflow from "a script produced a file" into "a repeatable run produced data and a reviewable report."&lt;/p&gt;

&lt;h2&gt;
  
  
  Inspect SQLite output
&lt;/h2&gt;

&lt;p&gt;SQLite is the default database target because it is local, portable, and easy to inspect.&lt;/p&gt;

&lt;p&gt;After running the CSV workflow, you can open the SQLite file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sqlite3 data/output/csv/etl_output.sqlite
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then inspect the tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tables&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cleaned_customers&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is useful when the cleaned data should later feed a dashboard, internal tool, or reporting process.&lt;/p&gt;
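
&lt;p&gt;The same inspection can be scripted with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. In this sketch an in-memory database stands in for the real output file so the example runs anywhere; the helper name &lt;code&gt;preview_tables&lt;/code&gt; and the sample row are assumptions for illustration.&lt;/p&gt;

```python
import sqlite3


def preview_tables(connection, limit=5):
    """Return {table_name: first rows} for every table in the database."""
    names = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    previews = {}
    for name in names:
        # Names come from sqlite_master; quote them as identifiers before use.
        quoted = '"' + name.replace('"', '""') + '"'
        previews[name] = connection.execute(
            f"SELECT * FROM {quoted} LIMIT ?", (limit,)
        ).fetchall()
    return previews


# An in-memory database stands in for data/output/csv/etl_output.sqlite.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE cleaned_customers (name TEXT, email TEXT)")
connection.execute("INSERT INTO cleaned_customers VALUES ('Ada', 'ada@example.com')")
previews = preview_tables(connection)
connection.close()
print(previews)
```

&lt;p&gt;To inspect a real run, pass the path of the generated SQLite file to &lt;code&gt;sqlite3.connect&lt;/code&gt; instead of &lt;code&gt;":memory:"&lt;/code&gt;.&lt;/p&gt;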

&lt;h2&gt;
  
  
  Optional PostgreSQL export
&lt;/h2&gt;

&lt;p&gt;The project also includes optional PostgreSQL export support.&lt;/p&gt;

&lt;p&gt;This is useful when cleaned data needs to be loaded into a shared database instead of a local SQLite file. SQLite remains the default local target.&lt;/p&gt;

&lt;p&gt;Start PostgreSQL with Docker Compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the workflow with PostgreSQL as the target:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql+psycopg://dq_user:dq_password@localhost:5432/dq_demo &lt;span class="se"&gt;\&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; dq_etl_starter.cli run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input&lt;/span&gt; data/input/messy_customers.csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input-type&lt;/span&gt; csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--schema&lt;/span&gt; data/expected/customer_schema.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; data/output/postgres &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-target&lt;/span&gt; postgres &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--table-name&lt;/span&gt; cleaned_customers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Windows PowerShell, set the environment variable separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"postgresql+psycopg://dq_user:dq_password@localhost:5432/dq_demo"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-m&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;dq_etl_starter.cli&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/input/messy_customers.csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--input-type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;csv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--schema&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/expected/customer_schema.json&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--output-dir&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data/output/postgres&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--db-target&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;postgres&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--table-name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;cleaned_customers&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I would still treat PostgreSQL as an optional extension for this starter. SQLite is enough for the default local workflow.&lt;/p&gt;
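
&lt;p&gt;A minimal sketch of how a SQLite-by-default target with an opt-in PostgreSQL target could be resolved. The function name &lt;code&gt;resolve_database_url&lt;/code&gt;, the &lt;code&gt;sqlite&lt;/code&gt; target value, and the SQLite file path are assumptions made for illustration and are not taken from the project. Only the &lt;code&gt;DATABASE_URL&lt;/code&gt; variable and the &lt;code&gt;postgres&lt;/code&gt; target come from the commands above.&lt;/p&gt;

```python
import os

# Hypothetical default; the real project may keep its SQLite file elsewhere.
DEFAULT_SQLITE_URL = "sqlite:///data/output/cleaned.db"


def resolve_database_url(db_target: str, env=None):
    """Return the connection URL for the requested target.

    SQLite needs no configuration. PostgreSQL is opt-in and must be
    configured through the DATABASE_URL environment variable.
    """
    env = os.environ if env is None else env
    if db_target == "sqlite":
        return DEFAULT_SQLITE_URL
    if db_target == "postgres":
        url = env.get("DATABASE_URL", "")
        if not url.startswith("postgresql"):
            raise ValueError(
                "--db-target postgres requires DATABASE_URL to be a postgresql URL"
            )
        return url
    raise ValueError(f"Unsupported db target: {db_target!r}")
```

&lt;p&gt;Failing early when &lt;code&gt;DATABASE_URL&lt;/code&gt; is missing gives a clearer message than a connection error in the middle of a run.&lt;/p&gt;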

&lt;h2&gt;
  
  
  Run tests
&lt;/h2&gt;

&lt;p&gt;Run the test suite with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1vflmf0la1fv17ztdgx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1vflmf0la1fv17ztdgx.png" alt="Passing tests" width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tests cover the workflow pieces that matter most for a small ETL starter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reading input files&lt;/li&gt;
&lt;li&gt;normalizing data&lt;/li&gt;
&lt;li&gt;validating data&lt;/li&gt;
&lt;li&gt;cleaning rows&lt;/li&gt;
&lt;li&gt;exporting data&lt;/li&gt;
&lt;li&gt;generating reports&lt;/li&gt;
&lt;li&gt;running the CLI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tests matter here because data cleanup scripts break easily when an input format changes: a renamed column or a new date format can quietly change the output.&lt;/p&gt;

&lt;p&gt;A small test suite makes the workflow safer to modify.&lt;/p&gt;
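
&lt;p&gt;A test in a suite like this can be very small. The sketch below assumes a hypothetical &lt;code&gt;normalize_email&lt;/code&gt; helper. It is not the project's code and only shows the shape of a normalization test that pytest would collect through the &lt;code&gt;test_&lt;/code&gt; prefix.&lt;/p&gt;

```python
def normalize_email(value):
    """Hypothetical normalizer: trim whitespace, lowercase, blank becomes None."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return cleaned or None


def test_normalize_email_trims_and_lowercases():
    assert normalize_email("  Ada@Example.COM ") == "ada@example.com"


def test_normalize_email_treats_blank_as_missing():
    # Whitespace-only and missing values should end up as the same marker.
    assert normalize_email("   ") is None
    assert normalize_email(None) is None
```

&lt;p&gt;Each test names one rule, so a failure points directly at the behavior that changed.&lt;/p&gt;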

&lt;h2&gt;
  
  
  Run with Docker
&lt;/h2&gt;

&lt;p&gt;The project can also run inside Docker.&lt;/p&gt;

&lt;p&gt;Build the image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; data-quality-etl-starter &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PWD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/data/output:/app/data/output"&lt;/span&gt; data-quality-etl-starter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same command works in Windows PowerShell, where &lt;code&gt;${PWD}&lt;/code&gt; also expands to the current directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;docker&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;--rm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-v&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;${PWD}&lt;/span&gt;&lt;span class="s2"&gt;/data/output:/app/data/output"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;data-quality-etl-starter&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnncdewc47y0kltu6x08h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnncdewc47y0kltu6x08h.png" alt="Docker run" width="797" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Docker is useful for handoff because it reduces local environment differences. A reviewer or client can run the workflow without manually recreating the same Python environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I intentionally kept out of v0.1
&lt;/h2&gt;

&lt;p&gt;This project is intentionally small.&lt;/p&gt;

&lt;p&gt;The current version does not include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a FastAPI service&lt;/li&gt;
&lt;li&gt;user authentication&lt;/li&gt;
&lt;li&gt;scheduled jobs&lt;/li&gt;
&lt;li&gt;a web dashboard&lt;/li&gt;
&lt;li&gt;Airflow orchestration&lt;/li&gt;
&lt;li&gt;dbt models&lt;/li&gt;
&lt;li&gt;cloud deployment&lt;/li&gt;
&lt;li&gt;real external API authentication&lt;/li&gt;
&lt;li&gt;complex business-specific validation rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FastAPI would be a natural future layer, but it is not part of the v0.1 core.&lt;/p&gt;

&lt;p&gt;The CLI workflow remains the source of truth. A later API layer could expose endpoints for upload, validation, and report retrieval, but that should come after the core workflow is stable.&lt;/p&gt;

&lt;p&gt;Keeping the first version small makes the project easier to run, review, test, and adapt.&lt;/p&gt;

&lt;h2&gt;
  
  
  How this maps to freelance client work
&lt;/h2&gt;

&lt;p&gt;This kind of starter maps well to small Python data workflow tasks.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cleaning CSV exports before reporting&lt;/li&gt;
&lt;li&gt;converting Excel files into normalized CSV output&lt;/li&gt;
&lt;li&gt;flattening JSON into tables&lt;/li&gt;
&lt;li&gt;validating required columns before import&lt;/li&gt;
&lt;li&gt;producing simple data quality reports&lt;/li&gt;
&lt;li&gt;loading cleaned data into SQLite or PostgreSQL&lt;/li&gt;
&lt;li&gt;preparing data for dashboards or internal tools&lt;/li&gt;
&lt;li&gt;turning a manual weekly cleanup process into a repeatable command&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For freelance work, the value is not only the code.&lt;/p&gt;

&lt;p&gt;The value is the handoff:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clear commands&lt;/li&gt;
&lt;li&gt;sample inputs&lt;/li&gt;
&lt;li&gt;predictable outputs&lt;/li&gt;
&lt;li&gt;validation warnings&lt;/li&gt;
&lt;li&gt;reports&lt;/li&gt;
&lt;li&gt;tests&lt;/li&gt;
&lt;li&gt;Docker support&lt;/li&gt;
&lt;li&gt;documented limitations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is what makes a small automation project easier for another person to trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;The next improvements I would consider are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add more schema rule types&lt;/li&gt;
&lt;li&gt;improve report formatting&lt;/li&gt;
&lt;li&gt;add richer error messages&lt;/li&gt;
&lt;li&gt;add logging&lt;/li&gt;
&lt;li&gt;add run IDs for report history&lt;/li&gt;
&lt;li&gt;add a real API reader example&lt;/li&gt;
&lt;li&gt;add a FastAPI wrapper around the CLI workflow&lt;/li&gt;
&lt;li&gt;add more PostgreSQL examples&lt;/li&gt;
&lt;li&gt;add CI for automated testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important constraint is to keep the project practical. It should remain small enough for a developer, analyst, or client to understand without needing a full data platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub repository
&lt;/h2&gt;

&lt;p&gt;The full project is available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/data-quality-etl-starter" rel="noopener noreferrer"&gt;https://github.com/OnerGit/data-quality-etl-starter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you work with messy CSV, Excel, JSON, or API-style data, this kind of starter can be a useful base for building repeatable data cleaning and reporting workflows.&lt;/p&gt;

</description>
      <category>python</category>
      <category>etl</category>
      <category>pandas</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Build a CSV Data Quality API with FastAPI, Pandas, Pytest, and Docker</title>
      <dc:creator>Bob Oner</dc:creator>
      <pubDate>Fri, 29 May 2026 05:55:19 +0000</pubDate>
      <link>https://dev.to/bob_oner/build-a-csv-data-quality-api-with-fastapi-pandas-pytest-and-docker-28ld</link>
      <guid>https://dev.to/bob_oner/build-a-csv-data-quality-api-with-fastapi-pandas-pytest-and-docker-28ld</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0z3vop4upezfgutsy6tm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0z3vop4upezfgutsy6tm.png" alt="Swagger UI" width="800" height="588"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CSV files are still everywhere.&lt;/p&gt;

&lt;p&gt;They appear in internal operations, analytics workflows, data exports, business reports, and small automation pipelines. Even when a team already uses databases or modern data platforms, CSV is often the format used to move data between people, tools, and systems.&lt;/p&gt;

&lt;p&gt;The problem is that CSV files are easy to create but not always safe to trust.&lt;/p&gt;

&lt;p&gt;Before a CSV file enters a pipeline, it is useful to answer a few basic questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many rows and columns does it have?&lt;/li&gt;
&lt;li&gt;Are there missing values?&lt;/li&gt;
&lt;li&gt;Are there duplicate rows?&lt;/li&gt;
&lt;li&gt;Are any columns completely empty?&lt;/li&gt;
&lt;li&gt;Do the columns match what the next system expects?&lt;/li&gt;
&lt;li&gt;Can another script or service consume the result in a predictable format?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, I will walk through a small project that turns those checks into a reusable API:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FastAPI CSV Quality API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GitHub repo:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/fastapi-csv-quality-api" rel="noopener noreferrer"&gt;https://github.com/OnerGit/fastapi-csv-quality-api&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal is not to build a full data quality platform. The goal is to show a practical engineering path from a local Python workflow to a small backend service that is documented, testable, and containerized.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we will build
&lt;/h2&gt;

&lt;p&gt;The API accepts a CSV file upload and returns a structured JSON quality report.&lt;/p&gt;

&lt;p&gt;The report includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;row count&lt;/li&gt;
&lt;li&gt;column count&lt;/li&gt;
&lt;li&gt;column names&lt;/li&gt;
&lt;li&gt;missing values by column&lt;/li&gt;
&lt;li&gt;missing value ratio by column&lt;/li&gt;
&lt;li&gt;duplicate row count&lt;/li&gt;
&lt;li&gt;duplicate row ratio&lt;/li&gt;
&lt;li&gt;empty columns&lt;/li&gt;
&lt;li&gt;column name issues&lt;/li&gt;
&lt;li&gt;optional expected-column validation&lt;/li&gt;
&lt;li&gt;warnings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project also includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;structured error responses&lt;/li&gt;
&lt;li&gt;pytest tests&lt;/li&gt;
&lt;li&gt;sample CSV files&lt;/li&gt;
&lt;li&gt;Swagger UI&lt;/li&gt;
&lt;li&gt;Dockerfile&lt;/li&gt;
&lt;li&gt;Docker Compose support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is an example of the quality report shown in Swagger UI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tjlf2doqf8ryc484woj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tjlf2doqf8ryc484woj.png" alt="CSV quality report" width="800" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech stack
&lt;/h2&gt;

&lt;p&gt;This project uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt; for the web API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pandas&lt;/strong&gt; for CSV analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pydantic&lt;/strong&gt; for response models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pytest&lt;/strong&gt; for automated tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uvicorn&lt;/strong&gt; as the ASGI server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; and &lt;strong&gt;Docker Compose&lt;/strong&gt; for containerized execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project structure is intentionally small:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fastapi-csv-quality-api/
├── README.md
├── LICENSE
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
├── app/
│   ├── __init__.py
│   ├── main.py
│   ├── models.py
│   ├── analyzer.py
│   └── errors.py
├── tests/
│   ├── __init__.py
│   ├── test_health.py
│   ├── test_analyze.py
│   └── fixtures/
├── sample_data/
├── screenshots/
├── docs/
└── article_assets/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The separation is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;main.py&lt;/code&gt; exposes the API routes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;analyzer.py&lt;/code&gt; contains the CSV analysis logic.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;models.py&lt;/code&gt; defines typed response structures.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;errors.py&lt;/code&gt; keeps error response helpers separate.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/&lt;/code&gt; verifies the expected behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Run the API locally
&lt;/h2&gt;

&lt;p&gt;Clone the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/OnerGit/fastapi-csv-quality-api.git
&lt;span class="nb"&gt;cd &lt;/span&gt;fastapi-csv-quality-api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate it.&lt;/p&gt;

&lt;p&gt;On macOS or Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;\.venv\Scripts\Activate.ps1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn app.main:app &lt;span class="nt"&gt;--reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://127.0.0.1:8000/docs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the FastAPI Swagger UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Add a health check endpoint
&lt;/h2&gt;

&lt;p&gt;A small service should have a simple health check endpoint. It gives us a quick way to verify that the API is running.&lt;/p&gt;

&lt;p&gt;Example request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://127.0.0.1:8000/health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ok"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fastapi-csv-quality-api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0.1.0"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This endpoint is also useful in tests, Docker checks, and future deployment environments.&lt;/p&gt;
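
&lt;p&gt;As a rough sketch of how a script or container check could use this endpoint, the helper below calls &lt;code&gt;/health&lt;/code&gt; with the standard library and inspects the &lt;code&gt;status&lt;/code&gt; field. The function names and the default base URL are assumptions for illustration. The project itself may check health differently.&lt;/p&gt;

```python
import json
import urllib.request


def is_healthy(body: str):
    """Return True when a /health response body reports status "ok"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_health(base_url: str = "http://127.0.0.1:8000", timeout: float = 2.0):
    """Call GET /health and report whether the service answered as healthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return is_healthy(response.read().decode("utf-8", errors="replace"))
    except OSError:
        # Refused connections, timeouts, and HTTP errors all count as unhealthy.
        return False
```

&lt;p&gt;Keeping the parsing in &lt;code&gt;is_healthy&lt;/code&gt; separate from the network call makes the check testable without a running server.&lt;/p&gt;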

&lt;h2&gt;
  
  
  Step 3: Build the CSV upload endpoint
&lt;/h2&gt;

&lt;p&gt;The main endpoint is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /analyze
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It accepts a CSV file as multipart form data.&lt;/p&gt;

&lt;p&gt;Example request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"http://127.0.0.1:8000/analyze"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"file=@sample_data/good_sample.csv"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows PowerShell, use &lt;code&gt;curl.exe&lt;/code&gt; instead of &lt;code&gt;curl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;curl.exe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-X&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://127.0.0.1:8000/analyze"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;-F&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"file=@sample_data/good_sample.csv"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



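&lt;p&gt;This matters because Windows PowerShell 5.1 defines &lt;code&gt;curl&lt;/code&gt; as an alias for &lt;code&gt;Invoke-WebRequest&lt;/code&gt;, which does not understand curl's option syntax. Calling &lt;code&gt;curl.exe&lt;/code&gt; bypasses the alias and runs the real curl executable.&lt;/p&gt;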

&lt;h2&gt;
  
  
  Step 4: Implement basic CSV quality checks
&lt;/h2&gt;

&lt;p&gt;Once the uploaded file is accepted, the analyzer reads it and computes a set of practical metrics.&lt;/p&gt;

&lt;p&gt;For a small MVP, the most useful checks are often simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;row_count
column_count
column_names
missing_values_by_column
missing_value_ratio_by_column
duplicate_row_count
duplicate_row_ratio
empty_columns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These checks are enough to catch many common CSV problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;unexpected empty fields&lt;/li&gt;
&lt;li&gt;duplicated records&lt;/li&gt;
&lt;li&gt;fully empty columns&lt;/li&gt;
&lt;li&gt;files with the wrong shape&lt;/li&gt;
&lt;li&gt;files that look valid but are not useful downstream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simplified example response looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bad_sample.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"row_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"column_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"column_names"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"signup_date"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"notes"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"missing_values_by_column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"signup_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"notes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"missing_value_ratio_by_column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.1667&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.3333&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.1667&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"signup_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.3333&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"notes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duplicate_row_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duplicate_row_ratio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.1667&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"empty_columns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"notes"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"warnings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"The CSV file contains 12 missing value(s)."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"The CSV file contains 1 duplicate row(s)."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"The CSV file contains empty column(s): notes."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design choice is that the API returns structured JSON instead of plain text.&lt;/p&gt;

&lt;p&gt;That makes the result easier to consume from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;another script&lt;/li&gt;
&lt;li&gt;a data pipeline&lt;/li&gt;
&lt;li&gt;a dashboard&lt;/li&gt;
&lt;li&gt;a workflow automation tool&lt;/li&gt;
&lt;li&gt;a monitoring job&lt;/li&gt;
&lt;/ul&gt;
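
&lt;p&gt;The checks in this step can be sketched with only the standard library. The project itself uses Pandas, so this is a stand-in rather than the actual analyzer. The function name &lt;code&gt;analyze_csv_text&lt;/code&gt;, the rule that whitespace-only cells count as missing, and the rounding to four decimals (inferred from the sample response) are assumptions.&lt;/p&gt;

```python
import csv
import io


def analyze_csv_text(text: str):
    """Compute basic quality metrics for CSV text without pandas."""
    # Skip completely blank lines, which csv.reader yields as empty lists.
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        raise ValueError("The CSV file is empty.")
    header, data = rows[0], rows[1:]
    # Pad short rows so every row has one cell per header column.
    data = [row + [""] * (len(header) - len(row)) for row in data]
    row_count = len(data)

    missing = {
        name: sum(1 for row in data if row[index].strip() == "")
        for index, name in enumerate(header)
    }
    # Every repeat of an identical row after its first occurrence is a duplicate.
    duplicate_count = row_count - len({tuple(row) for row in data})

    def ratio(count):
        return round(count / row_count, 4) if row_count else 0.0

    return {
        "row_count": row_count,
        "column_count": len(header),
        "column_names": header,
        "missing_values_by_column": missing,
        "missing_value_ratio_by_column": {k: ratio(v) for k, v in missing.items()},
        "duplicate_row_count": duplicate_count,
        "duplicate_row_ratio": ratio(duplicate_count),
        "empty_columns": [k for k, v in missing.items() if row_count and v == row_count],
    }
```

&lt;p&gt;One limitation of this sketch: the per-column dictionaries are keyed by header name, so duplicate header names would overwrite each other.&lt;/p&gt;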

&lt;h2&gt;
  
  
  Step 5: Add expected-column validation
&lt;/h2&gt;

&lt;p&gt;In many workflows, the next system expects a fixed set of columns.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;id,name,email,age,signup_date
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API supports optional expected-column validation through a form field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"http://127.0.0.1:8000/analyze"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"file=@sample_data/good_sample.csv"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"expected_columns=id,name,email,age,signup_date"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows the service to compare the uploaded CSV headers against the expected headers.&lt;/p&gt;

&lt;p&gt;The response can then tell you whether the file matches the expected shape, and which columns are missing or unexpected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forw2vc9aa4wswpozt1kk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forw2vc9aa4wswpozt1kk.png" alt="Expected columns validation" width="800" height="660"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a small feature, but it turns the API from a generic CSV inspector into a pre-import check: a file with missing or unexpected headers can be caught before the next system tries to load it.&lt;/p&gt;
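
&lt;p&gt;A minimal sketch of the header comparison, assuming the expected names arrive as one comma-separated string as in the &lt;code&gt;expected_columns&lt;/code&gt; form field. The function name and the result keys are hypothetical, and the comparison here is case-sensitive and ignores column order. The project's actual rules may differ.&lt;/p&gt;

```python
def validate_expected_columns(actual_columns, expected_columns: str):
    """Compare CSV headers with a comma-separated list of expected names."""
    expected = [name.strip() for name in expected_columns.split(",") if name.strip()]
    actual = [name.strip() for name in actual_columns]
    # Keep the original ordering so the report is stable and easy to read.
    missing = [name for name in expected if name not in actual]
    unexpected = [name for name in actual if name not in expected]
    return {
        "expected_columns": expected,
        "missing_columns": missing,
        "unexpected_columns": unexpected,
        "matches_expected": not missing and not unexpected,
    }
```

&lt;p&gt;For example, headers &lt;code&gt;id,name,notes&lt;/code&gt; checked against &lt;code&gt;id,name,email&lt;/code&gt; would report &lt;code&gt;email&lt;/code&gt; as missing and &lt;code&gt;notes&lt;/code&gt; as unexpected.&lt;/p&gt;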

&lt;h2&gt;
  
  
  Step 6: Return structured errors
&lt;/h2&gt;

&lt;p&gt;CSV upload workflows can fail in many ways.&lt;/p&gt;

&lt;p&gt;Some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the user uploads a non-CSV file&lt;/li&gt;
&lt;li&gt;the file is empty&lt;/li&gt;
&lt;li&gt;the file cannot be parsed&lt;/li&gt;
&lt;li&gt;the encoding is unsupported&lt;/li&gt;
&lt;li&gt;the file is too large&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of returning ad hoc error messages, this project returns every error in the same JSON structure.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"invalid_file_type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Only .csv files are supported."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"details"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"not_csv.txt"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This format is useful because clients can check the &lt;code&gt;code&lt;/code&gt; field programmatically.&lt;/p&gt;

&lt;p&gt;For example, a frontend can show a friendly message for &lt;code&gt;invalid_file_type&lt;/code&gt;, while a pipeline can log the error and stop processing.&lt;/p&gt;
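
&lt;p&gt;On the client side, that check might look like the sketch below. It assumes the error body shown above. The &lt;code&gt;FRIENDLY_MESSAGES&lt;/code&gt; mapping and the &lt;code&gt;friendly_message&lt;/code&gt; helper are hypothetical names for this example, not part of the project.&lt;/p&gt;

```python
import json

# Hypothetical mapping from error codes to user-facing text.
FRIENDLY_MESSAGES = {
    "invalid_file_type": "Please upload a file that ends in .csv.",
}


def friendly_message(response_body):
    """Pick a user-facing message based on the structured error code."""
    error = json.loads(response_body)["error"]
    # Fall back to the server's own message for codes we do not know.
    return FRIENDLY_MESSAGES.get(error["code"], error["message"])


body = json.dumps({
    "error": {
        "code": "invalid_file_type",
        "message": "Only .csv files are supported.",
        "details": {"filename": "not_csv.txt"},
    }
})
print(friendly_message(body))
```

&lt;p&gt;Because the client branches on &lt;code&gt;code&lt;/code&gt; rather than on the message text, the server can reword its messages without breaking callers.&lt;/p&gt;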

&lt;h2&gt;
  
  
  Step 7: Add tests with pytest
&lt;/h2&gt;

&lt;p&gt;A small API becomes much more useful when its behavior is protected by tests.&lt;/p&gt;

&lt;p&gt;This project includes tests for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/health&lt;/code&gt; returning &lt;code&gt;200&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;normal CSV analysis&lt;/li&gt;
&lt;li&gt;missing value detection&lt;/li&gt;
&lt;li&gt;duplicate row detection&lt;/li&gt;
&lt;li&gt;non-CSV error handling&lt;/li&gt;
&lt;li&gt;expected-column validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run the tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example test result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhu44sh7fro5n22uc6pfv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhu44sh7fro5n22uc6pfv.png" alt="Pytest passed" width="800" height="164"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For a small demo project, tests are not just a formality. They make the project easier to refactor and safer to extend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 8: Containerize the API with Docker
&lt;/h2&gt;

&lt;p&gt;After the API works locally, the next step is to package it.&lt;/p&gt;

&lt;p&gt;Build the Docker image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; fastapi-csv-quality-api &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 fastapi-csv-quality-api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://127.0.0.1:8000/docs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the container, the server listens on &lt;code&gt;0.0.0.0:8000&lt;/code&gt;. The &lt;code&gt;-p 8000:8000&lt;/code&gt; flag maps that port to the host, so your local machine reaches it at &lt;code&gt;127.0.0.1:8000&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzsxohv1c35pqv3geh3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzsxohv1c35pqv3geh3v.png" alt="Docker run" width="636" height="76"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 9: Use Docker Compose
&lt;/h2&gt;

&lt;p&gt;Docker Compose is included for a simpler local workflow.&lt;/p&gt;

&lt;p&gt;Start the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stop the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is useful when you want a repeatable local runtime without manually typing the full &lt;code&gt;docker run&lt;/code&gt; command.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this project is intentionally small
&lt;/h2&gt;

&lt;p&gt;This project is an MVP.&lt;/p&gt;

&lt;p&gt;It intentionally does not include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;authentication&lt;/li&gt;
&lt;li&gt;database storage&lt;/li&gt;
&lt;li&gt;frontend UI&lt;/li&gt;
&lt;li&gt;background jobs&lt;/li&gt;
&lt;li&gt;large-file streaming&lt;/li&gt;
&lt;li&gt;Kubernetes deployment&lt;/li&gt;
&lt;li&gt;production cloud infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is deliberate.&lt;/p&gt;

&lt;p&gt;The purpose is to demonstrate a complete but lightweight backend workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CSV upload → validation → analysis → structured response → tests → Docker packaging
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes the project easier to read, test, and extend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Possible next improvements
&lt;/h2&gt;

&lt;p&gt;There are many ways to extend this project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;configurable file size limits&lt;/li&gt;
&lt;li&gt;date format checks&lt;/li&gt;
&lt;li&gt;numeric column checks&lt;/li&gt;
&lt;li&gt;JSON schema export&lt;/li&gt;
&lt;li&gt;GitHub Actions CI workflow&lt;/li&gt;
&lt;li&gt;cloud deployment tutorial&lt;/li&gt;
&lt;li&gt;larger file handling&lt;/li&gt;
&lt;li&gt;dashboard integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A natural next step would be to deploy the containerized service to a small cloud VM or Kubernetes platform. But that should be treated as a separate tutorial rather than added to the first MVP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project shows how to turn a simple local CSV inspection workflow into a reusable API.&lt;/p&gt;

&lt;p&gt;The main engineering ideas are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keep the first version small&lt;/li&gt;
&lt;li&gt;return structured JSON instead of free-form text&lt;/li&gt;
&lt;li&gt;separate API routing from analysis logic&lt;/li&gt;
&lt;li&gt;define response models clearly&lt;/li&gt;
&lt;li&gt;test the behavior with pytest&lt;/li&gt;
&lt;li&gt;package the service with Docker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even though the example is small, the pattern is useful:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;local script → API service → tested component → containerized tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That pattern can be reused for many internal developer tools, data workflow utilities, and automation services.&lt;/p&gt;

&lt;p&gt;You can find the full project here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/OnerGit/fastapi-csv-quality-api" rel="noopener noreferrer"&gt;https://github.com/OnerGit/fastapi-csv-quality-api&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>fastapi</category>
      <category>docker</category>
      <category>testing</category>
    </item>
    <item>
      <title>AI-Assisted Development Is Not Autopilot</title>
      <dc:creator>Bob Oner</dc:creator>
      <pubDate>Fri, 29 May 2026 04:48:48 +0000</pubDate>
      <link>https://dev.to/bob_oner/ai-assisted-development-is-not-autopilot-15ie</link>
      <guid>https://dev.to/bob_oner/ai-assisted-development-is-not-autopilot-15ie</guid>
      <description>&lt;p&gt;AI can make coding faster. It can also make messy code faster.&lt;/p&gt;

&lt;p&gt;That is the part of AI-assisted development that does not get discussed enough. A model can generate a route handler, a browser userscript, a README section, or a test idea in seconds. But speed is not the same as engineering progress. If the output is not scoped, tested, reviewed, and documented, the project may become harder to maintain even though it was faster to start.&lt;/p&gt;

&lt;p&gt;My working rule is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI can draft, but engineering must decide.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I have been using this rule while building two small developer tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/OnerGit/fastapi-csv-quality-api" rel="noopener noreferrer"&gt;FastAPI CSV Quality API&lt;/a&gt;, a minimal FastAPI service that accepts CSV uploads and returns a structured data quality report.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/OnerGit/ChatGPT-Long-Conversation-Helper" rel="noopener noreferrer"&gt;ChatGPT Long Conversation Helper&lt;/a&gt;, a privacy-first Tampermonkey userscript for collapsing and navigating long ChatGPT conversations locally in the browser.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are intentionally small projects. That is the point. Small projects are useful for learning where AI helps, where it needs limits, and how to turn generated drafts into reviewable engineering work.&lt;/p&gt;

&lt;p&gt;This article is not a prompt guide. It is also not a benchmark, a productivity claim, or a recipe for replacing code review. It is a practical reflection on how I keep AI-assisted development useful without giving up scope control, testing, documentation, and human review.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI can draft, but engineering must decide
&lt;/h2&gt;

&lt;p&gt;I do not treat AI-generated code as finished code. I treat it as a draft that needs to pass through normal engineering gates.&lt;/p&gt;

&lt;p&gt;For small tools, those gates do not need to be heavy. They can be simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the scope clear?&lt;/li&gt;
&lt;li&gt;Is the interface small?&lt;/li&gt;
&lt;li&gt;Is the behavior testable?&lt;/li&gt;
&lt;li&gt;Are errors handled consistently?&lt;/li&gt;
&lt;li&gt;Is the privacy boundary explicit?&lt;/li&gt;
&lt;li&gt;Can another developer reproduce the project from the README?&lt;/li&gt;
&lt;li&gt;Can I explain what the code does without relying on the original prompt?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last question matters. If I cannot explain the code after reading it, I do not own the implementation yet.&lt;/p&gt;

&lt;p&gt;In my workflow, AI is helpful during exploration. I may ask it to compare approaches, list edge cases, suggest a project structure, draft a README, or propose test cases. But I do not let it decide the final shape of the project. That decision belongs to the developer, because the developer is responsible for the behavior that gets published.&lt;/p&gt;

&lt;p&gt;This is the difference between using AI as a drafting tool and treating AI as an autopilot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start with a small interface, not a large prompt
&lt;/h2&gt;

&lt;p&gt;The biggest mistake I see in AI-assisted coding is starting with a large prompt that asks for an entire application.&lt;/p&gt;

&lt;p&gt;That often produces code, but not necessarily a design.&lt;/p&gt;

&lt;p&gt;For the CSV quality API, the useful boundary was not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Build a complete data quality platform.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That would have been too broad. The useful boundary was much smaller:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A user uploads a CSV file. The API returns a structured JSON report.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That boundary made the project reviewable. It forced the implementation to answer concrete questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the endpoint?&lt;/li&gt;
&lt;li&gt;What does the response model contain?&lt;/li&gt;
&lt;li&gt;What happens with an empty file?&lt;/li&gt;
&lt;li&gt;What happens with a non-CSV file?&lt;/li&gt;
&lt;li&gt;What does a duplicate row count mean?&lt;/li&gt;
&lt;li&gt;How should missing values be represented?&lt;/li&gt;
&lt;li&gt;Which errors should be structured?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI could help draft the FastAPI route and suggest pandas checks, but the important engineering work was defining the contract. Once the response shape was clear, the code had something to obey.&lt;/p&gt;
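
&lt;p&gt;To make the idea of a contract concrete, here is a rough sketch of a report builder. It uses only the standard library &lt;code&gt;csv&lt;/code&gt; module instead of pandas, so it is a stand-in for the project's implementation. The field names match the ones checked in the test later in this article. The definitions of "missing" and "duplicate" are assumptions made for this example.&lt;/p&gt;

```python
import csv
import io
from collections import Counter


def analyze_csv(text):
    """Build a small quality report from CSV text (stdlib only)."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    rows = list(reader)

    # Assumption: a value counts as missing when it is absent or blank.
    missing = {
        col: sum(1 for row in rows if not (row.get(col) or "").strip())
        for col in columns
    }

    # Assumption: a duplicate is any repeat of an earlier, identical row.
    seen = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())

    return {
        "row_count": len(rows),
        "missing_values_by_column": missing,
        "duplicate_row_count": duplicates,
    }


sample = "id,email\n1,a@example.com\n2,\n1,a@example.com\n"
report = analyze_csv(sample)
print(report)
```

&lt;p&gt;Writing the two assumptions down is the point: questions such as "what does a duplicate row count mean?" have to be answered before the code can be reviewed against them.&lt;/p&gt;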

&lt;p&gt;The browser userscript had a different kind of interface. It was not an HTTP API. It was a local UI boundary inside the browser.&lt;/p&gt;

&lt;p&gt;The useful boundary was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Add local collapse and expand controls to long conversation messages without sending, uploading, exporting, or scraping conversation content.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That boundary was just as important as an API contract. It prevented the project from drifting into a more sensitive tool. It also made implementation choices easier. The script could use DOM selectors, CSS, MutationObserver, and localStorage, but it should not use external requests, analytics, backend sync, or API calls.&lt;/p&gt;

&lt;p&gt;In both projects, the small interface came before the implementation. That gave AI a box to work inside.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tests turn AI output into reviewable code
&lt;/h2&gt;

&lt;p&gt;AI-generated code becomes safer when it is forced to satisfy tests.&lt;/p&gt;

&lt;p&gt;For the FastAPI CSV Quality API, automated tests were the main review tool. The tests were not only checking whether the app started. They were checking behavior that mattered to the API contract:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;health endpoint behavior&lt;/li&gt;
&lt;li&gt;valid CSV upload behavior&lt;/li&gt;
&lt;li&gt;missing value reporting&lt;/li&gt;
&lt;li&gt;duplicate row reporting&lt;/li&gt;
&lt;li&gt;expected column validation&lt;/li&gt;
&lt;li&gt;invalid file handling&lt;/li&gt;
&lt;li&gt;empty upload handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters because an API can look correct while silently changing its response shape. A field can be renamed. A ratio can be calculated differently. An error response can become inconsistent. Without tests, those changes are easy to miss.&lt;/p&gt;

&lt;p&gt;A simplified test might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_analyze_csv_returns_quality_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;csv_file&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/analyze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;csv_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;row_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missing_values_by_column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duplicate_row_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exact test is less important than the habit. The test says: this is the behavior I expect, and future changes must respect it.&lt;/p&gt;

&lt;p&gt;The userscript needed a different testing strategy. Browser UI behavior is harder to protect with a quick pytest suite, especially when the page is dynamic and not controlled by the project. So I used a manual test checklist instead.&lt;/p&gt;

&lt;p&gt;That checklist covered installation, single-message collapse and expand, global controls, dynamic messages, refresh behavior, localStorage state, and privacy checks. It also included cases that are easy to forget: code blocks, long lines, Markdown tables, streaming replies, and messages added after the initial page load.&lt;/p&gt;

&lt;p&gt;This is still testing. It is just the right level of testing for the project.&lt;/p&gt;

&lt;p&gt;The point is not that every small tool needs a full CI pipeline. The point is that AI output needs a review mechanism. For an API, that mechanism may be automated tests. For a browser userscript, it may start with a disciplined manual checklist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logs are part of the review surface
&lt;/h2&gt;

&lt;p&gt;Logs are often treated as an afterthought in small tools. In AI-assisted development, I think they matter more than usual.&lt;/p&gt;

&lt;p&gt;When a project uses generated code, logs help answer a basic question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What is the code actually doing at runtime?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For the API project, logs are useful when checking upload handling, parsing failures, and unexpected errors. For a small FastAPI service, I do not need a complex observability stack. But I do need error paths that are visible and understandable.&lt;/p&gt;

&lt;p&gt;For the userscript, console warnings are useful when selectors fail or expected message containers are not found. This is especially important because DOM-based tools depend on a page structure that may change. If the script silently stops working, debugging becomes frustrating. A small, clear warning is better than silent failure.&lt;/p&gt;

&lt;p&gt;Logs should not leak sensitive content. That is especially important for a tool that runs on conversation pages. Logging message text would violate the project’s own privacy boundary. Logging a generic warning such as “message container not found” is enough.&lt;/p&gt;

&lt;p&gt;Good logs do not make the project bigger. They make it easier to review.&lt;/p&gt;
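
&lt;p&gt;For the API side, a content-free log line could be sketched like this. The logger name, the &lt;code&gt;log_parse_failure&lt;/code&gt; helper, and the log fields are hypothetical choices for this example. The idea is to record metadata about the upload and never the uploaded bytes.&lt;/p&gt;

```python
import logging

logger = logging.getLogger("csv_quality_api")  # hypothetical logger name


def log_parse_failure(filename, payload, reason):
    """Log metadata about a failed upload, never the uploaded content."""
    logger.warning(
        "csv_parse_failed filename=%s size_bytes=%d reason=%s",
        filename,
        len(payload),
        reason,
    )


logging.basicConfig(level=logging.WARNING)
log_parse_failure("orders.csv", b"secret,data\n", "bad_encoding")
```

&lt;p&gt;The payload is only passed in so its size can be measured. None of its content reaches the log message.&lt;/p&gt;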

&lt;h2&gt;
  
  
  Privacy and permission boundaries must be explicit
&lt;/h2&gt;

&lt;p&gt;AI is good at suggesting features. That is also why the developer needs to say no.&lt;/p&gt;

&lt;p&gt;For the long conversation helper, it would be easy to add more features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;export conversations&lt;/li&gt;
&lt;li&gt;summarize previous replies&lt;/li&gt;
&lt;li&gt;sync state across devices&lt;/li&gt;
&lt;li&gt;send content to an API&lt;/li&gt;
&lt;li&gt;add search over all messages&lt;/li&gt;
&lt;li&gt;collect usage analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of those features may be useful in other products, but they do not belong in this MVP.&lt;/p&gt;

&lt;p&gt;The project is intentionally local. It modifies the browser view. It stores local UI state. It does not transmit conversation content. It does not call the ChatGPT API. It does not automate message sending. It does not export conversations.&lt;/p&gt;

&lt;p&gt;That is not only a privacy statement. It is an engineering constraint.&lt;/p&gt;

&lt;p&gt;Once the privacy boundary is explicit, every new feature can be reviewed against it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does this feature require external requests?&lt;/li&gt;
&lt;li&gt;Does it store message text?&lt;/li&gt;
&lt;li&gt;Does it touch cookies, tokens, or account data?&lt;/li&gt;
&lt;li&gt;Does it turn a local UI helper into a data extraction tool?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer is yes, it is outside the scope.&lt;/p&gt;
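
&lt;p&gt;The first question can be partly backed by a crude automated check. The sketch below is a hypothetical review aid, not something the project ships. It scans userscript source text for the names of common browser and Tampermonkey APIs that can send data off the page. A plain text match like this can miss indirect calls and can flag comments, so it supports manual review rather than replacing it.&lt;/p&gt;

```python
# Names of browser and Tampermonkey APIs that can send data off the page.
NETWORK_APIS = (
    "fetch(",
    "XMLHttpRequest",
    "GM_xmlhttpRequest",
    "sendBeacon",
    "WebSocket",
)


def find_network_calls(script_source):
    """Return (line_number, api_name) pairs for every suspicious line."""
    hits = []
    for number, line in enumerate(script_source.splitlines(), start=1):
        for api in NETWORK_APIS:
            if api in line:
                hits.append((number, api))
    return hits


sample = "\n".join([
    "const saved = localStorage.getItem('collapsed');",
    "fetch('https://example.com/track');",
])
print(find_network_calls(sample))
```

&lt;p&gt;For the two-line sample, only the second line is flagged. The &lt;code&gt;localStorage&lt;/code&gt; read stays inside the local boundary.&lt;/p&gt;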

&lt;p&gt;This is where AI needs the most supervision. A model may suggest a technically possible feature without understanding the product boundary. The developer must decide whether the feature should exist at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ask AI for alternatives, not final decisions
&lt;/h2&gt;

&lt;p&gt;I get better results when I ask AI for options rather than final answers.&lt;/p&gt;

&lt;p&gt;For example, in the API project, AI can suggest several response model shapes. But I still need to choose the one that is easiest to understand, test, and document.&lt;/p&gt;

&lt;p&gt;In the userscript project, AI can suggest several selector strategies. But selector choice requires judgment. Deep class-name chains may work today and break tomorrow. Shallow role-based selectors may be more stable, but they still need manual testing. There is no perfect answer, only a trade-off that should be documented.&lt;/p&gt;

&lt;p&gt;The same applies to error handling, README structure, release notes, and limitations. AI can produce a draft quickly. The developer decides what is accurate.&lt;/p&gt;

&lt;p&gt;A useful AI prompt is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Build the final solution.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A better prompt is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Give me three implementation options, their risks, and how I should test each one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That kind of prompt keeps the developer in control. It turns AI into a reviewer, not an autopilot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Documentation is part of development
&lt;/h2&gt;

&lt;p&gt;For small projects, documentation is often postponed until the code is done. I try to do the opposite.&lt;/p&gt;

&lt;p&gt;A README is not only a marketing page. It is a reproducibility contract. It should tell a reader what the project does, what it does not do, how to run it, how to test it, and where the limitations are.&lt;/p&gt;

&lt;p&gt;For the CSV API, documentation needed to explain the endpoint, the response fields, sample data, test commands, Docker usage, and screenshots. Without that, the project would be much harder to evaluate from the outside.&lt;/p&gt;

&lt;p&gt;For the userscript, documentation needed to explain installation, privacy, manual testing, troubleshooting, limitations, and release scope. That documentation is part of the engineering work because the tool runs in a sensitive context: a user’s browser session.&lt;/p&gt;

&lt;p&gt;AI is useful for documentation drafts. It can help organize sections and turn rough notes into readable text. But documentation still needs technical review.&lt;/p&gt;

&lt;p&gt;If the README says the project does not send external requests, the code must support that statement. If the limitations section says selectors may break when the page changes, the implementation should be structured so selectors are easy to update.&lt;/p&gt;

&lt;p&gt;Documentation should not make the project sound bigger than it is. Good documentation reduces uncertainty. It does not inflate the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stop before the project becomes too big
&lt;/h2&gt;

&lt;p&gt;AI makes scope creep easier.&lt;/p&gt;

&lt;p&gt;Once the first version works, it is tempting to ask for more: a dashboard, a Chrome extension, cloud sync, user accounts, analytics, background jobs, advanced configuration, and so on.&lt;/p&gt;

&lt;p&gt;For portfolio projects, that can be dangerous. A small finished tool is often more convincing than a large unfinished platform.&lt;/p&gt;

&lt;p&gt;The CSV API did not need to become a full data quality platform. It needed to show a clean API boundary, structured output, meaningful checks, tests, Docker packaging, and documentation.&lt;/p&gt;

&lt;p&gt;The conversation helper did not need to become a full browser extension or AI workspace. It needed to solve one local navigation problem with a privacy-first boundary.&lt;/p&gt;

&lt;p&gt;Stopping is an engineering skill.&lt;/p&gt;

&lt;p&gt;A clear MVP makes the project easier to review. It also makes the writing stronger. Instead of explaining a large incomplete system, I can explain the trade-offs behind a small complete one.&lt;/p&gt;

&lt;h2&gt;
  
  
  My lightweight AI-assisted development loop
&lt;/h2&gt;

&lt;p&gt;After building these two small tools, I now prefer a simple loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define the smallest useful interface.&lt;/li&gt;
&lt;li&gt;Ask AI for options and risks.&lt;/li&gt;
&lt;li&gt;Write or review the first implementation.&lt;/li&gt;
&lt;li&gt;Add tests or a manual test checklist.&lt;/li&gt;
&lt;li&gt;Add logs where debugging would otherwise be unclear.&lt;/li&gt;
&lt;li&gt;Document usage, limitations, and non-goals.&lt;/li&gt;
&lt;li&gt;Review the code against the original boundary.&lt;/li&gt;
&lt;li&gt;Stop before the project becomes unnecessarily large.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is not a heavy process. It is a lightweight review loop for small tools.&lt;/p&gt;

&lt;p&gt;The order matters. I do not want to start with a large prompt and then search for a structure afterward. I want the structure first, then use AI inside that structure.&lt;/p&gt;

&lt;p&gt;For the CSV API, that structure was the upload endpoint and response contract. For the userscript, it was the local-only browser interaction model and the privacy boundary. In both cases, the AI-assisted parts were useful because the project already had a small reviewable shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned from two small tools
&lt;/h2&gt;

&lt;p&gt;The two projects are different, but they taught me the same lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI-assisted development works best when the surrounding process is disciplined.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From the FastAPI CSV Quality API, I learned that AI is helpful for turning a rough script idea into an API draft. But the real value comes from defining the response contract and protecting it with tests.&lt;/p&gt;

&lt;p&gt;From the ChatGPT Long Conversation Helper, I learned that AI is helpful for exploring DOM logic and browser APIs. But the real value comes from privacy boundaries, manual testing, selector judgment, and clear limitations.&lt;/p&gt;

&lt;p&gt;In both cases, the workflow mattered more than the initial code generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AI-assisted development is not autopilot.&lt;/p&gt;

&lt;p&gt;It is useful when it helps developers move faster through drafts, alternatives, edge cases, and documentation. It becomes risky when generated code bypasses scope control, testing, privacy review, and human judgment.&lt;/p&gt;

&lt;p&gt;For me, the practical answer is not to avoid AI. It is to wrap AI inside an engineering workflow.&lt;/p&gt;

&lt;p&gt;Small interfaces keep the project understandable. Tests protect behavior. Logs make runtime behavior visible. Documentation makes the project reproducible. Limitations prevent overclaiming. Human review keeps the final responsibility where it belongs.&lt;/p&gt;

&lt;p&gt;AI can help write code.&lt;/p&gt;

&lt;p&gt;Engineering decides what code is worth keeping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Related projects
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/OnerGit/fastapi-csv-quality-api" rel="noopener noreferrer"&gt;FastAPI CSV Quality API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/OnerGit/ChatGPT-Long-Conversation-Helper" rel="noopener noreferrer"&gt;ChatGPT Long Conversation Helper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/bob_oner/build-a-privacy-first-tampermonkey-script-for-long-chatgpt-conversations-2765"&gt;Build a Privacy-First Tampermonkey Script for Long ChatGPT Conversations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Build a Privacy-First Tampermonkey Script for Long ChatGPT Conversations</title>
      <dc:creator>Bob Oner</dc:creator>
      <pubDate>Thu, 28 May 2026 12:54:50 +0000</pubDate>
      <link>https://dev.to/bob_oner/build-a-privacy-first-tampermonkey-script-for-long-chatgpt-conversations-2765</link>
      <guid>https://dev.to/bob_oner/build-a-privacy-first-tampermonkey-script-for-long-chatgpt-conversations-2765</guid>
      <description>&lt;h1&gt;
  
  
  Build a Privacy-First Tampermonkey Script for Long ChatGPT Conversations
&lt;/h1&gt;

&lt;p&gt;Long AI conversations are useful, but they become hard to scan.&lt;/p&gt;

&lt;p&gt;If you use ChatGPT for technical planning, code review, writing drafts, debugging, or research, a single conversation can easily grow into dozens of turns. At that point, the problem is no longer generating more content. The problem is navigation.&lt;/p&gt;

&lt;p&gt;You may want to jump back to an earlier question. You may want to hide a long assistant answer after you have already used it. You may want to keep only the most important parts visible while reviewing the whole thread.&lt;/p&gt;

&lt;p&gt;I wanted a small tool for that specific problem: collapse and expand long ChatGPT questions and answers in the local browser view.&lt;/p&gt;

&lt;p&gt;The result is &lt;strong&gt;ChatGPT Long Conversation Helper&lt;/strong&gt;, a Tampermonkey userscript that adds per-message collapse controls, global collapse / expand controls, a three-line preview, and local UI state.&lt;/p&gt;

&lt;p&gt;Companion repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://github.com/OnerGit/ChatGPT-Long-Conversation-Helper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a third-party local userscript. It is not an official OpenAI or ChatGPT feature.&lt;/p&gt;

&lt;p&gt;It only changes the local browser view. It does not upload, transmit, collect, export, or send conversation content. It does not call the ChatGPT API. It does not automate sending messages. It stores only local UI state in &lt;code&gt;localStorage&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: long conversations are hard to review
&lt;/h2&gt;

&lt;p&gt;A long conversation is useful while you are building it. It becomes less useful when you need to review it later.&lt;/p&gt;

&lt;p&gt;The page can contain long prompts, detailed answers, code blocks, checklists, and repeated planning notes. Scrolling through everything makes it harder to compare earlier decisions with later results.&lt;/p&gt;

&lt;p&gt;The tool does not try to summarize the conversation. It keeps the content exactly where it is and adds a local way to hide or show each message.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbtlba7jjdlizjz9njfm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbtlba7jjdlizjz9njfm.png" alt="Before using the helper, long conversations can take a lot of vertical space" width="800" height="916"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What this userscript does
&lt;/h2&gt;

&lt;p&gt;The first version focuses on one narrow workflow improvement: make long conversations easier to review.&lt;/p&gt;

&lt;p&gt;The userscript adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a &lt;code&gt;Collapse question&lt;/code&gt; / &lt;code&gt;Expand question&lt;/code&gt; button for user messages;&lt;/li&gt;
&lt;li&gt;a &lt;code&gt;Collapse answer&lt;/code&gt; / &lt;code&gt;Expand answer&lt;/code&gt; button for assistant messages;&lt;/li&gt;
&lt;li&gt;a three-line preview when a message is collapsed;&lt;/li&gt;
&lt;li&gt;a subtle fade mask near the preview boundary;&lt;/li&gt;
&lt;li&gt;a floating global control panel;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Collapse all&lt;/code&gt; and &lt;code&gt;Expand all&lt;/code&gt; buttons;&lt;/li&gt;
&lt;li&gt;a compact &lt;code&gt;LCH&lt;/code&gt; launcher after hiding the full panel;&lt;/li&gt;
&lt;li&gt;local collapsed / expanded state with &lt;code&gt;localStorage&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It deliberately does not provide export, scraping, summarization, automation, cloud sync, or API integration.&lt;/p&gt;

&lt;p&gt;That scope matters. A browser UI helper should not silently become a data extraction tool.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj02qd6lgtuumao8igh79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj02qd6lgtuumao8igh79.png" alt="A single message-level collapse control is added above a conversation message" width="800" height="679"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I started with Tampermonkey
&lt;/h2&gt;

&lt;p&gt;This project could eventually become a browser extension, but I did not start there.&lt;/p&gt;

&lt;p&gt;A Tampermonkey userscript was a better MVP boundary for three reasons.&lt;/p&gt;

&lt;p&gt;First, it is quick to test. I can paste a single &lt;code&gt;.user.js&lt;/code&gt; file into Tampermonkey, open ChatGPT, and validate the DOM behavior immediately.&lt;/p&gt;

&lt;p&gt;Second, it avoids extension packaging too early. A Chrome or Edge extension would require more decisions around permissions, manifest configuration, distribution, review, and long-term maintenance.&lt;/p&gt;

&lt;p&gt;Third, the real uncertainty was not packaging. The real uncertainty was whether the DOM-based interaction would feel useful and stable enough.&lt;/p&gt;

&lt;p&gt;So the first goal was simple: validate the interaction model locally before turning it into a heavier browser extension.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting the privacy boundary
&lt;/h2&gt;

&lt;p&gt;Before writing the DOM code, I defined what the tool must not do.&lt;/p&gt;

&lt;p&gt;The script should not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;upload conversation content;&lt;/li&gt;
&lt;li&gt;transmit conversation content;&lt;/li&gt;
&lt;li&gt;collect conversation content;&lt;/li&gt;
&lt;li&gt;export conversations;&lt;/li&gt;
&lt;li&gt;call the ChatGPT API;&lt;/li&gt;
&lt;li&gt;automate sending messages;&lt;/li&gt;
&lt;li&gt;read cookies;&lt;/li&gt;
&lt;li&gt;read account tokens;&lt;/li&gt;
&lt;li&gt;read payment information;&lt;/li&gt;
&lt;li&gt;collect telemetry;&lt;/li&gt;
&lt;li&gt;use analytics;&lt;/li&gt;
&lt;li&gt;load remote scripts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only persisted data should be local UI state: whether a message is collapsed and whether the global panel is hidden.&lt;/p&gt;

&lt;p&gt;That boundary influenced the implementation. The script uses browser APIs such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;querySelectorAll
MutationObserver
localStorage
classList
addEventListener
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It does not need &lt;code&gt;fetch&lt;/code&gt;, &lt;code&gt;XMLHttpRequest&lt;/code&gt;, &lt;code&gt;WebSocket&lt;/code&gt;, &lt;code&gt;sendBeacon&lt;/code&gt;, &lt;code&gt;document.cookie&lt;/code&gt;, or external dependencies.&lt;/p&gt;
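
&lt;p&gt;One way to keep that boundary honest is a plain text check over the script source. The sketch below is not part of the userscript: &lt;code&gt;FORBIDDEN_APIS&lt;/code&gt; and &lt;code&gt;findForbiddenApis&lt;/code&gt; are hypothetical names, and a text match can also flag comments or strings, so a hit is a prompt to look rather than a verdict.&lt;/p&gt;

```javascript
// Hypothetical helper, not from the original script.
// The names come from the list of APIs the script says it does not need.
const FORBIDDEN_APIS = [
  'fetch',
  'XMLHttpRequest',
  'WebSocket',
  'sendBeacon',
  'document.cookie'
];

// Returns the forbidden names that appear as whole words in the source text.
function findForbiddenApis(sourceText) {
  return FORBIDDEN_APIS.filter((name) => {
    // Escape the dot so "document.cookie" is matched literally.
    const pattern = name.split('.').join('\\.');
    return new RegExp('\\b' + pattern + '\\b').test(sourceText);
  });
}
```

&lt;p&gt;Run against the text of the &lt;code&gt;.user.js&lt;/code&gt; file, an empty result is consistent with the boundary above. It does not prove it, because a name can be built dynamically.&lt;/p&gt;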

&lt;h2&gt;
  
  
  Userscript metadata
&lt;/h2&gt;

&lt;p&gt;A userscript starts with metadata. This block tells Tampermonkey where the script should run and which special permissions it needs.&lt;/p&gt;

&lt;p&gt;For this project, the metadata is intentionally small:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ==UserScript==&lt;/span&gt;
&lt;span class="c1"&gt;// @name         ChatGPT Long Conversation Helper&lt;/span&gt;
&lt;span class="c1"&gt;// @namespace    chatgpt-long-conversation-helper&lt;/span&gt;
&lt;span class="c1"&gt;// @version      0.1.1&lt;/span&gt;
&lt;span class="c1"&gt;// @description  A privacy-first local UI helper that collapses and expands long ChatGPT questions and answers.&lt;/span&gt;
&lt;span class="c1"&gt;// @author       OnerGit&lt;/span&gt;
&lt;span class="c1"&gt;// @match        https://chatgpt.com/*&lt;/span&gt;
&lt;span class="c1"&gt;// @grant        none&lt;/span&gt;
&lt;span class="c1"&gt;// @run-at       document-idle&lt;/span&gt;
&lt;span class="c1"&gt;// @license      MIT&lt;/span&gt;
&lt;span class="c1"&gt;// ==/UserScript==&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important lines are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// @match        https://chatgpt.com/*&lt;/span&gt;
&lt;span class="c1"&gt;// @grant        none&lt;/span&gt;
&lt;span class="c1"&gt;// @run-at       document-idle&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;@match&lt;/code&gt; limits the script to ChatGPT pages.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;@grant none&lt;/code&gt; declares that the script needs no privileged Tampermonkey APIs, so functions such as &lt;code&gt;GM_xmlhttpRequest&lt;/code&gt; are not available to it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;@run-at document-idle&lt;/code&gt; injects the script after the initial document has been parsed, that is, after &lt;code&gt;DOMContentLoaded&lt;/code&gt; has fired. This is useful for UI scripts because many target elements may not exist at the earliest loading stage.&lt;/p&gt;

&lt;p&gt;This does not guarantee all conversation messages are already present. ChatGPT is a dynamic web app, so the script still needs a &lt;code&gt;MutationObserver&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding message nodes in a dynamic page
&lt;/h2&gt;

&lt;p&gt;The script needs to find user questions and assistant answers.&lt;/p&gt;

&lt;p&gt;A tempting approach would be to copy a long selector chain from DevTools. For example, you might inspect a message and copy a selector that includes many nested class names.&lt;/p&gt;

&lt;p&gt;That is usually fragile.&lt;/p&gt;

&lt;p&gt;Modern web apps often change generated class names, wrapper elements, or layout structure. A selector that is too deep may break after a small UI update.&lt;/p&gt;

&lt;p&gt;Instead, this script prefers shallow attribute selectors keyed on the message author role:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;roleSelectors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-message-author-role="user"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-message-author-role="assistant"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;ignoredAncestors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;form, textarea, input, nav, aside, header, footer, [role="dialog"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;processedAttr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;data-clch-processed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is still a DOM dependency, and it can break if ChatGPT changes its page structure. But it is more maintainable than relying on a long chain of layout classes.&lt;/p&gt;

&lt;p&gt;The script also avoids processing input boxes, dialogs, headers, footers, sidebars, and other non-conversation areas.&lt;/p&gt;

&lt;p&gt;A simplified message finder looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getConversationMessageNodes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;found&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="nx"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;roleSelectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelectorAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;isLikelyConversationMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;found&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;found&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Set&lt;/code&gt; prevents duplicates if selectors overlap.&lt;/p&gt;
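
&lt;p&gt;The finder above calls &lt;code&gt;isLikelyConversationMessage&lt;/code&gt;, which is not shown. A minimal sketch, assuming the check only needs to reject nodes that sit inside one of the ignored ancestors, could rely on &lt;code&gt;Element.closest&lt;/code&gt;. The body below is an illustration under that assumption, not necessarily the script's actual check.&lt;/p&gt;

```javascript
// Same selector list as CONFIG.ignoredAncestors in the snippet above.
const IGNORED_ANCESTORS =
  'form, textarea, input, nav, aside, header, footer, [role="dialog"]';

// Assumed implementation: a node counts as a message unless it is,
// or sits inside, one of the ignored containers.
function isLikelyConversationMessage(node) {
  if (!node) {
    return false;
  }
  if (typeof node.closest !== 'function') {
    return false;
  }
  // closest() returns the node itself or its nearest matching ancestor, else null.
  return node.closest(IGNORED_ANCESTORS) === null;
}
```

&lt;p&gt;Because &lt;code&gt;closest&lt;/code&gt; walks up from the node, the check stays shallow: it does not depend on how many wrapper elements sit between the message and the page root.&lt;/p&gt;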

&lt;h2&gt;
  
  
  Avoiding duplicate processing
&lt;/h2&gt;

&lt;p&gt;A dynamic page can be scanned many times.&lt;/p&gt;

&lt;p&gt;If the script adds a toolbar to a message every time it scans, the UI will quickly become broken. The solution is to mark processed nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;processMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messageNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;messageNode&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;messageNode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;processedAttr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;true&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messageNode&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;messageNode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;classList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;clch-message&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;messageNode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;processedAttr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;true&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nf"&gt;addMessageToolbar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messageNode&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;restoreState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messageNode&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes scanning idempotent. Running &lt;code&gt;scanMessages()&lt;/code&gt; multiple times should not keep adding more buttons to the same message.&lt;/p&gt;

&lt;p&gt;That is important when using &lt;code&gt;MutationObserver&lt;/code&gt;, because DOM changes may trigger scans repeatedly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding collapse controls
&lt;/h2&gt;

&lt;p&gt;For each message, the script inserts a small toolbar before the message node.&lt;/p&gt;

&lt;p&gt;The toolbar contains one button:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;addMessageToolbar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messageNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messageNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;toolbar&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;div&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;toolbar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;className&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;clch-toolbar&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;button&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;className&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;clch-toggle-button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getToggleLabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aria-expanded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;true&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;click&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;currentlyCollapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
      &lt;span class="nx"&gt;messageNode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;data-clch-collapsed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;true&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nf"&gt;setCollapsed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messageNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;currentlyCollapsed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;toolbar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appendChild&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;messageNode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parentNode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insertBefore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;toolbar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messageNode&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The button does not move or rewrite the message content. It only toggles a collapsed class on the existing message node.&lt;/p&gt;

&lt;p&gt;That design choice matters. Moving or wrapping message nodes can introduce layout risk with Markdown tables, code blocks, and wide answer containers. This version avoids re-parenting ChatGPT message DOM nodes and applies the collapsed state directly to the message node.&lt;/p&gt;
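
&lt;p&gt;The toolbar code uses two helpers that are not shown, &lt;code&gt;getRole&lt;/code&gt; and &lt;code&gt;getToggleLabel&lt;/code&gt;. A rough sketch of both follows. It assumes the role is read from the &lt;code&gt;data-message-author-role&lt;/code&gt; attribute and that the labels match the button texts listed earlier; the generic &lt;code&gt;message&lt;/code&gt; fallback wording is an assumption.&lt;/p&gt;

```javascript
// Assumed helper: read the author role from the attribute the selectors target.
function getRole(messageNode) {
  return messageNode.getAttribute('data-message-author-role');
}

// Assumed helper: choose the button text for a role and a collapsed flag.
function getToggleLabel(role, collapsed) {
  // A collapsed message offers "Expand"; an expanded one offers "Collapse".
  const verb = collapsed ? 'Expand' : 'Collapse';
  // Unknown roles fall back to a generic noun (an assumption).
  const noun =
    role === 'user' ? 'question' : role === 'assistant' ? 'answer' : 'message';
  return verb + ' ' + noun;
}
```

&lt;p&gt;Keeping the label a pure function of role and state means the same call can refresh the button text after every toggle.&lt;/p&gt;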

&lt;h2&gt;
  
  
  Styling the collapsed state
&lt;/h2&gt;

&lt;p&gt;The collapsed state is mostly CSS.&lt;/p&gt;

&lt;p&gt;The script applies a class such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clch-collapsed-message
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then CSS limits the visible height to three lines, assuming a line height of roughly &lt;code&gt;1.55em&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;&lt;span class="nc"&gt;.clch-collapsed-message&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;max-height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;calc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt; &lt;span class="err"&gt;*&lt;/span&gt; &lt;span class="m"&gt;1.55em&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nl"&gt;overflow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;hidden&lt;/span&gt; &lt;span class="cp"&gt;!important&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;position&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;relative&lt;/span&gt; &lt;span class="cp"&gt;!important&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A fade mask makes the preview feel less abrupt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;&lt;span class="nc"&gt;.clch-collapsed-message&lt;/span&gt;&lt;span class="nd"&gt;::after&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;position&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;absolute&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;left&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;right&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;bottom&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.9em&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;pointer-events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;none&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;linear-gradient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="nb"&gt;bottom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rgba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;--clch-fade-bg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;#ffffff&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is intentionally simple. The script does not try to summarize the message. It does not parse the text. It does not store the content. It only changes how much of the existing message is visible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80lgvsssek9dklngn78v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80lgvsssek9dklngn78v.png" alt="Collapsed messages keep a short preview instead of disappearing completely" width="800" height="766"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Watching new messages with MutationObserver
&lt;/h2&gt;

&lt;p&gt;ChatGPT conversations are dynamic. New user messages and assistant replies appear after the initial page load.&lt;/p&gt;

&lt;p&gt;A one-time scan is not enough.&lt;/p&gt;

&lt;p&gt;The script uses &lt;code&gt;MutationObserver&lt;/code&gt; to watch for newly inserted content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
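&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Handle of the pending scan timer, shared across observer callbacks.&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;observerTimer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;startObserver&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;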
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;main&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;observer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MutationObserver&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clearTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;observerTimer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;observerTimer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scheduleScan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;observerThrottleMs&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;observer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;childList&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;subtree&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The observer does not process every mutation immediately. It schedules a scan with a small delay.&lt;/p&gt;

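&lt;p&gt;That delay matters because dynamic apps may produce many DOM changes during a single interaction. Because each mutation clears the pending timer and starts a new one, this is a debounce: the scan runs once, after mutations have been quiet for &lt;code&gt;CONFIG.observerThrottleMs&lt;/code&gt; milliseconds.&lt;/p&gt;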

&lt;p&gt;The scan function can then process any new message nodes that do not already have the &lt;code&gt;data-clch-processed&lt;/code&gt; marker.&lt;/p&gt;
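
&lt;p&gt;The timer handling can also be pulled out of the observer into a small reusable wrapper. The sketch below is one way to do that, not the script's own code: &lt;code&gt;createDebouncedScan&lt;/code&gt; and its optional &lt;code&gt;timers&lt;/code&gt; argument are hypothetical, and the argument exists only so the behavior can be checked without waiting on a real clock.&lt;/p&gt;

```javascript
// Hypothetical factory, not from the original script.
// Wraps a scan function so that a burst of calls collapses into one run.
function createDebouncedScan(scan, delayMs, timers) {
  // Fall back to the global timer functions when none are injected.
  const setTimer = timers ? timers.set : setTimeout;
  const clearTimer = timers ? timers.clear : clearTimeout;
  let pending;

  return function requestScan() {
    // Drop the previously scheduled run, if any, and start the wait again.
    clearTimer(pending);
    pending = setTimer(scan, delayMs);
  };
}
```

&lt;p&gt;In the browser, the returned function could be passed straight to the observer, for example &lt;code&gt;new MutationObserver(createDebouncedScan(scanMessages, 200))&lt;/code&gt;, where the 200 ms delay is an arbitrary illustrative value.&lt;/p&gt;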

&lt;h2&gt;
  
  
  Saving local UI state
&lt;/h2&gt;

&lt;p&gt;If you collapse several messages and refresh the page, it is useful for the local view to remember that state.&lt;/p&gt;

&lt;p&gt;The script uses &lt;code&gt;localStorage&lt;/code&gt; for this.&lt;/p&gt;

&lt;p&gt;A simplified storage key looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clch:v0.1.1:/c/example-conversation:assistant:4:collapsed = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;script namespace and version;&lt;/li&gt;
&lt;li&gt;current URL path;&lt;/li&gt;
&lt;li&gt;message role;&lt;/li&gt;
&lt;li&gt;message index;&lt;/li&gt;
&lt;li&gt;state name (&lt;code&gt;collapsed&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
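
&lt;p&gt;As a rough sketch, such a key could be assembled from its parts like this. The function name &lt;code&gt;buildStateKey&lt;/code&gt; and its parameter names are hypothetical and not taken from the actual script. Only the key layout follows the example above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Builds a storage key in the layout shown above:
// namespace:version:path:role:index:stateName
function buildStateKey(parts) {
  return [
    'clch',              // script namespace
    'v' + parts.version, // script version, e.g. "0.1.1"
    parts.path,          // current URL path
    parts.role,          // "user" or "assistant"
    String(parts.index), // position of the message on the page
    'collapsed'          // name of the stored UI state
  ].join(':');
}

const exampleKey = buildStateKey({
  version: '0.1.1',
  path: '/c/example-conversation',
  role: 'assistant',
  index: 4
});
// exampleKey === 'clch:v0.1.1:/c/example-conversation:assistant:4:collapsed'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a browser, &lt;code&gt;path&lt;/code&gt; would typically come from &lt;code&gt;window.location.pathname&lt;/code&gt;.&lt;/p&gt;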

&lt;p&gt;The value is only a UI flag.&lt;/p&gt;

&lt;p&gt;It does not store message text.&lt;/p&gt;

&lt;p&gt;The storage helpers are wrapped in &lt;code&gt;try/catch&lt;/code&gt; because &lt;code&gt;localStorage&lt;/code&gt; access can throw, for example when storage is disabled, blocked by privacy settings, or full:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;safeGetStorage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;localStorage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[CLCH] Failed to read localStorage.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;safeSetStorage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;localStorage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[CLCH] Failed to write localStorage.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This state recovery is best-effort. Because it is index-based, it may not restore perfectly if the conversation order changes or if the page DOM changes.&lt;/p&gt;
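
&lt;p&gt;The index dependency can be illustrated with a small sketch. The names &lt;code&gt;keyFor&lt;/code&gt; and &lt;code&gt;restoreCollapsed&lt;/code&gt; are hypothetical, and a plain &lt;code&gt;Map&lt;/code&gt; stands in for &lt;code&gt;localStorage&lt;/code&gt; so the example runs outside a browser:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical key layout, matching the simplified example key.
function keyFor(path, role, index) {
  return ['clch', 'v0.1.1', path, role, String(index), 'collapsed'].join(':');
}

// Returns the indexes of messages whose stored flag is '1'.
// `messages` is an array of { role } objects in page order.
function restoreCollapsed(store, path, messages) {
  const collapsed = [];
  messages.forEach(function (message, index) {
    if (store.get(keyFor(path, message.role, index)) === '1') {
      collapsed.push(index);
    }
  });
  return collapsed;
}

const store = new Map();
const path = '/c/example-conversation';
store.set(keyFor(path, 'assistant', 1), '1');

// Same order as when the flag was saved: message 1 is restored.
const before = restoreCollapsed(store, path, [
  { role: 'user' }, { role: 'assistant' }
]);

// An extra message at the top shifts every index: nothing matches.
const after = restoreCollapsed(store, path, [
  { role: 'user' }, { role: 'user' }, { role: 'assistant' }
]);
// before is [1], after is []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The second call finds no stored flag because the assistant message moved from index 1 to index 2, so its key no longer matches.&lt;/p&gt;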

&lt;p&gt;That limitation is acceptable for an MVP because the script is a local UI helper, not a data management system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Global controls and the LCH launcher
&lt;/h2&gt;

&lt;p&gt;Individual controls help when reviewing one message. Global controls help when a conversation is already long.&lt;/p&gt;

&lt;p&gt;The floating panel provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Collapse all&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Expand all&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Hide controls&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdjjp2k1ilu5tw2m65ij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdjjp2k1ilu5tw2m65ij.png" alt="The floating control panel provides global collapse and expand actions" width="399" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the panel itself becomes visual noise, it can be hidden into a small &lt;code&gt;LCH&lt;/code&gt; launcher.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxa4shuawxu3t8lvj2et.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxa4shuawxu3t8lvj2et.png" alt="The LCH launcher reopens the hidden global panel" width="241" height="187"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a small UI detail, but it matters for a browser helper. A tool that reduces visual noise should not create too much noise of its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Manual testing
&lt;/h2&gt;

&lt;p&gt;For a small userscript, manual testing is still important.&lt;/p&gt;

&lt;p&gt;The test plan I used focuses on behavior rather than unit tests:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Tampermonkey.&lt;/li&gt;
&lt;li&gt;Paste and enable the userscript.&lt;/li&gt;
&lt;li&gt;Open a ChatGPT conversation at &lt;code&gt;https://chatgpt.com/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Confirm the floating control panel appears.&lt;/li&gt;
&lt;li&gt;Confirm long user questions get &lt;code&gt;Collapse question&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Confirm assistant answers get &lt;code&gt;Collapse answer&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Collapse and expand individual messages.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;Collapse all&lt;/code&gt; and &lt;code&gt;Expand all&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Hide the panel and reopen it with &lt;code&gt;LCH&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Send a new message and confirm dynamic content receives controls.&lt;/li&gt;
&lt;li&gt;Refresh the page and check best-effort state recovery.&lt;/li&gt;
&lt;li&gt;Test messages containing Markdown tables, code blocks, lists, and long lines.&lt;/li&gt;
&lt;li&gt;Confirm no message content disappears after expanding.&lt;/li&gt;
&lt;li&gt;Check that &lt;code&gt;localStorage&lt;/code&gt; contains only UI state keys.&lt;/li&gt;
&lt;li&gt;Confirm there are no script-triggered external requests.&lt;/li&gt;
&lt;/ol&gt;
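
&lt;p&gt;Step 14 can be checked by hand in the browser's storage inspector, or with a small console helper along these lines. This is only a sketch: it assumes that every script key starts with &lt;code&gt;clch:&lt;/code&gt; (as in the example key) and that every value is &lt;code&gt;0&lt;/code&gt; or &lt;code&gt;1&lt;/code&gt;, and &lt;code&gt;findSuspiciousEntries&lt;/code&gt; is a hypothetical name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Lists script-owned entries whose value is not a plain '0' or '1' flag.
// `storage` is any object with the standard Storage interface
// (length, key(i), getItem(key)), such as window.localStorage.
function findSuspiciousEntries(storage) {
  const suspicious = [];
  const indexes = Array.from({ length: storage.length }, function (_, i) {
    return i;
  });
  for (const i of indexes) {
    const key = storage.key(i);
    if (!key.startsWith('clch:')) {
      continue; // not written by this script (assumed prefix)
    }
    const value = storage.getItem(key);
    if (value === '0' || value === '1') {
      continue; // plain UI flag, as expected
    }
    suspicious.push({ key: key, value: value });
  }
  return suspicious;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the browser console, &lt;code&gt;findSuspiciousEntries(window.localStorage)&lt;/code&gt; should return an empty array if only UI flags are stored.&lt;/p&gt;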

&lt;p&gt;The privacy test is part of the functional test. For this project, “it works” is not enough. It also needs to stay within the local-only boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Known limitations
&lt;/h2&gt;

&lt;p&gt;This is a best-effort UI enhancement.&lt;/p&gt;

&lt;p&gt;The main limitation is DOM dependency. The script depends on the visible ChatGPT web page structure. If ChatGPT changes its DOM, selectors may need to be updated.&lt;/p&gt;

&lt;p&gt;Other limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;streaming replies may not always receive controls immediately;&lt;/li&gt;
&lt;li&gt;local state recovery may be imperfect after page changes;&lt;/li&gt;
&lt;li&gt;message indexing can shift if the conversation structure changes;&lt;/li&gt;
&lt;li&gt;the script is tested manually against the live page, not against an official ChatGPT extension API;&lt;/li&gt;
&lt;li&gt;it is not an official feature;&lt;/li&gt;
&lt;li&gt;it is not affiliated with OpenAI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I list these limitations openly because they are part of the engineering reality of a DOM-based userscript.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would improve next
&lt;/h2&gt;

&lt;p&gt;I would keep the next version small.&lt;/p&gt;

&lt;p&gt;Useful improvements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;configurable preview line count;&lt;/li&gt;
&lt;li&gt;optional keyboard shortcuts;&lt;/li&gt;
&lt;li&gt;more robust selector fallback;&lt;/li&gt;
&lt;li&gt;better settings UI;&lt;/li&gt;
&lt;li&gt;improved dark-mode visual tuning;&lt;/li&gt;
&lt;li&gt;clearer reset controls for local UI state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A Chrome or Edge extension could also be considered later, but only after the userscript behavior stabilizes.&lt;/p&gt;

&lt;p&gt;Moving from userscript to extension would require a new review of permissions, storage behavior, privacy documentation, packaging, and distribution. It should not be treated as a simple file conversion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Small local tools can improve AI workflows, but the boundary matters.&lt;/p&gt;

&lt;p&gt;For this project, the useful feature is not automation. It is navigation. The script does not send messages, call APIs, scrape conversations, or export data. It only changes the local browser view so long conversations are easier to scan.&lt;/p&gt;

&lt;p&gt;That made Tampermonkey a good starting point. It allowed the core interaction to be tested quickly while keeping the project small enough to review.&lt;/p&gt;

&lt;p&gt;The broader lesson is simple: when building AI workflow tools, productivity should not come at the cost of unclear data behavior. A small browser tool can still be useful if it has a narrow scope, a clear privacy boundary, and honest limitations.&lt;/p&gt;

&lt;p&gt;GitHub repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://github.com/OnerGit/ChatGPT-Long-Conversation-Helper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a third-party local userscript, not an official OpenAI or ChatGPT feature.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
