Bob Oner

Posted on Jun 12

Preparing AI-Ready Data Without Calling an LLM API

#python #docker #postgres #ai

In the previous article, I extended a small Python data quality ETL starter from cleaned data into BI-ready reporting tables with PostgreSQL, SQL views, and an optional Metabase dashboard.

Previous article: From Clean Data to BI-Ready Reporting Tables with Python, PostgreSQL, and Metabase

This follow-up focuses on the v0.6.0 update of the same project:

Data Quality ETL Starter on GitHub

The v0.6.0 update adds an optional AI-ready data preparation demo.

That phrase can easily become vague, so I want to define it clearly.

In this project, "AI-ready" does not mean:

calling an LLM API;
generating embeddings;
creating a vector database;
building a RAG chatbot;
training a machine learning model;
adding an AI agent;
automatically cleaning data with an LLM.

Instead, AI-ready means something more practical and earlier in the workflow. The data is:

cleaned
validated
documented
machine-readable
safe to inspect before downstream BI, ML, or AI use

The goal is not to build an AI application. The goal is to prepare data artifacts that another workflow could review and use later.

Why this step matters

Many teams want to "use AI on their data" before they have a reliable data preparation layer.

That usually creates a gap between the raw data and the workflow that is supposed to consume it.

Before a dataset is useful for BI, ML, LLM, RAG, or any other AI-related workflow, a few basic questions still need to be answered:

What columns exist?
What does each column mean?
Which fields are identifiers or contact fields?
Which values are missing?
Which columns are numeric, categorical, datetime, or text-like?
What validation issues were found?
What data was removed or transformed?
What files were generated?
Did this process call any external AI service?

The v0.6.0 demo answers these questions by producing several small, reviewable output files.

This is especially useful for small-team workflows. A client or operator may not need a full ML platform. They may first need a clean handoff package that explains the dataset and makes downstream use safer.

What v0.6.0 adds

The v0.6.0 update adds a new optional workflow:

generated messy order data
        ↓
existing validation and cleaning workflow
        ↓
cleaned orders dataset
        ↓
schema profile JSON
        ↓
data dictionary JSON
        ↓
validation summary JSON
        ↓
feature-ready CSV
        ↓
embedding-ready text field extract
        ↓
AI-ready manifest + Markdown summary report

The main files added for this path are:

scripts/run_ai_ready_demo.py
src/dq_etl_starter/ai_ready.py
docs/ai_ready.md
tests/test_ai_ready.py
tests/test_ai_ready_outputs.py

The expected local output directory is:

data/output/ai_ready/

And the generated files are:

data/output/ai_ready/
├── ai_ready_summary_report.md
├── ai_ready_manifest.json
├── data_dictionary.json
├── schema_profile.json
├── validation_summary.json
├── feature_ready_orders.csv
└── embedding_ready_text_fields.csv

These are local artifacts. They should not be committed to the repository.

Project path so far

This project has grown in small steps:

v0.1.0  local data quality ETL baseline
v0.2.0  optional PostgreSQL export
v0.3.0  optional FastAPI validation wrapper
v0.4.0  analytics-ready Parquet + DuckDB demo
v0.5.0  BI-ready PostgreSQL + Metabase demo
v0.6.0  AI-ready data preparation demo

That sequence is intentional.

The project does not jump directly from messy CSV files to an AI application. It first builds the data workflow foundations:

reading input data;
validating schemas;
cleaning rows;
exporting data;
generating reports;
preparing analytics outputs;
loading reporting tables;
documenting data for downstream use.

The v0.6.0 update continues that path.

Install the project locally

Clone the repository:

git clone https://github.com/OnerGit/data-quality-etl-starter.git
cd data-quality-etl-starter

Create a virtual environment:

python -m venv .venv

Activate it on macOS or Linux:

source .venv/bin/activate

Activate it on Windows PowerShell:

.venv\Scripts\activate

Install dependencies and the local package:

pip install -r requirements.txt
pip install -e .

The editable install step is useful because the project uses a src/ layout: it makes the dq_etl_starter package importable from the scripts and tests.

Step 1: Generate synthetic input data

The AI-ready demo starts from generated synthetic order data.

It does not use real customer data. It does not download external datasets. It does not require API keys.

Generate 100,000 rows:

python scripts/generate_sample_data.py \
  --rows 100000 \
  --output data/generated/orders_100k.csv \
  --seed 42

Windows PowerShell:

python scripts/generate_sample_data.py `
  --rows 100000 `
  --output data/generated/orders_100k.csv `
  --seed 42

The fixed seed keeps the demo reproducible.

The generated data intentionally includes common data quality issues such as missing values, invalid email values, duplicate rows, invalid dates, negative quantities, zero prices, and inconsistent country values.

That makes the downstream preparation step more meaningful than running the workflow on a perfectly clean sample file.
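To make the idea of seeded, deliberately messy data concrete, here is a minimal standard-library sketch. It is not the project's scripts/generate_sample_data.py. The function name generate_orders, the unit_price column, the issue_rate parameter, and all sample values are assumptions made for illustration.

```python
import random
from datetime import date, timedelta

COUNTRIES = ["DE", "FR", "US"]
# Inconsistent spellings injected on purpose (illustrative values only).
MESSY_COUNTRIES = ["germany", "Fr", " us "]
ISSUES = ["email", "date", "quantity", "price", "country", "missing"]


def generate_orders(rows, seed, issue_rate=0.1):
    """Return `rows` order dicts, a share of them with deliberately bad values."""
    rng = random.Random(seed)  # local RNG, so the same seed gives the same data
    start = date(2024, 1, 1)
    orders = []
    for i in range(rows):
        order = {
            "order_id": f"ORD-{i:06d}",
            "customer_id": f"CUST-{rng.randint(1, 500):04d}",
            "email": f"user{i}@example.com",
            "order_date": (start + timedelta(days=rng.randint(0, 364))).isoformat(),
            "quantity": rng.randint(1, 5),
            "unit_price": round(rng.uniform(5, 200), 2),  # hypothetical column name
            "country": rng.choice(COUNTRIES),
        }
        if rng.random() < issue_rate:
            issue = rng.choice(ISSUES)
            if issue == "email":
                order["email"] = "not-an-email"
            elif issue == "date":
                order["order_date"] = "2024-13-45"  # invalid month and day
            elif issue == "quantity":
                order["quantity"] = -order["quantity"]
            elif issue == "price":
                order["unit_price"] = 0.0
            elif issue == "country":
                order["country"] = rng.choice(MESSY_COUNTRIES)
            else:
                order["customer_id"] = ""  # missing value
        orders.append(order)
        if rng.random() < issue_rate / 2:
            orders.append(dict(order))  # exact duplicate row
    return orders[:rows]  # duplicates can overshoot, so trim to the requested size
```

Calling generate_orders(1000, seed=42) twice returns identical lists, which is the property the fixed --seed flag gives the real demo.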

Step 2: Run the AI-ready preparation demo

Run the v0.6.0 demo:

python scripts/run_ai_ready_demo.py \
  --input data/generated/orders_100k.csv \
  --schema data/expected/generated_order_schema.json \
  --output-dir data/output/ai_ready \
  --dataset-name cleaned_orders

Windows PowerShell:

python scripts/run_ai_ready_demo.py `
  --input data/generated/orders_100k.csv `
  --schema data/expected/generated_order_schema.json `
  --output-dir data/output/ai_ready `
  --dataset-name cleaned_orders

The script prints a completion message and lists the generated outputs.

A successful run should create:

schema_profile.json
data_dictionary.json
validation_summary.json
feature_ready_orders.csv
embedding_ready_text_fields.csv
ai_ready_manifest.json
ai_ready_summary_report.md

This workflow uses the existing project pieces first:

read the generated CSV;
load the expected schema;
validate the input;
clean the DataFrame;
prepare order data for downstream use.

Then the new AI-ready layer creates metadata, summaries, and handoff files.

Output 1: Schema profile

The first output is:

data/output/ai_ready/schema_profile.json

This file is a machine-readable profile of the prepared dataset.

It includes information such as:

dataset name;
row count;
column count;
column names;
inferred types;
pandas dtypes;
null counts;
null ratios;
unique counts;
unique ratios;
example values;
recommended column roles.

A simplified example looks like this:

{
  "dataset_name": "cleaned_orders",
  "row_count": 100000,
  "column_count": 12,
  "columns": [
    {
      "name": "order_id",
      "dtype": "string",
      "recommended_role": "identifier",
      "null_count": 0,
      "unique_count": 100000
    }
  ]
}

The exact values depend on the generated input and cleaning result.

This file is useful because downstream users can inspect structure before deciding how to use the dataset.

For example, a BI user may check date and numeric fields. An ML user may check identifiers and contact fields. An AI/RAG workflow may check which text fields exist before deciding whether embeddings are appropriate.
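The per-column statistics in such a profile are simple to compute. The sketch below uses only the standard library and plain dicts, whereas the project itself works with pandas DataFrames. The function names, the two-value type inference, and the choice to treat empty strings as missing are assumptions, not the code in src/dq_etl_starter/ai_ready.py.

```python
def infer_type(values):
    """Very rough type inference: 'numeric' if every value is an int or float."""
    if values and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    ):
        return "numeric"
    return "string"


def profile_column(name, values):
    """Profile one column; None and empty strings count as missing."""
    non_null = [v for v in values if v is not None and v != ""]
    total = len(values)
    null_count = total - len(non_null)
    unique_count = len(set(non_null))
    return {
        "name": name,
        "inferred_type": infer_type(non_null),
        "null_count": null_count,
        "null_ratio": round(null_count / total, 4) if total else 0.0,
        "unique_count": unique_count,
        "unique_ratio": round(unique_count / total, 4) if total else 0.0,
        # First three distinct values, in order of appearance.
        "example_values": list(dict.fromkeys(non_null))[:3],
    }


def build_schema_profile(dataset_name, rows):
    """Build a profile dict for a list of row dicts that share the same keys."""
    columns = list(rows[0].keys()) if rows else []
    return {
        "dataset_name": dataset_name,
        "row_count": len(rows),
        "column_count": len(columns),
        "columns": [profile_column(c, [r.get(c) for r in rows]) for c in columns],
    }


sample_rows = [
    {"order_id": "A-1", "country": "DE", "quantity": 2},
    {"order_id": "A-2", "country": "", "quantity": 1},
    {"order_id": "A-3", "country": "DE", "quantity": 2},
]
profile = build_schema_profile("cleaned_orders", sample_rows)
```

Passing profile to json.dumps gives a document with the same top-level shape as the simplified example above.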

Output 2: Data dictionary

The second output is:

data/output/ai_ready/data_dictionary.json

This file explains what each column means.

It includes:

column name;
human-readable description;
type;
recommended role;
nullable flag;
example values;
usage notes.

For example, identifier fields are marked differently from numeric features or text-like fields.

This matters because a cleaned table is still not self-explanatory.

A field such as customer_id may be technically clean, but it should usually not be treated as a numeric feature. A field such as email may be useful for validation examples, but it should be reviewed carefully before any downstream AI or ML use.

The data dictionary makes those notes explicit.
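One way to make such notes explicit is to derive a recommended role from the column name first and the inferred type second. This is a hedged sketch: apart from "identifier", the role names, the rule order, and the usage-note wording are assumptions rather than the project's actual rules.

```python
# Name-based rules win over type-based rules, so a numeric-looking ID
# is still classified as an identifier. All names here are illustrative.
CONTACT_COLUMNS = {"email"}
IDENTIFIER_COLUMNS = {"order_id", "customer_id"}

TYPE_ROLES = {"numeric": "numeric_feature", "datetime": "datetime", "text": "text"}

USAGE_NOTES = {
    "identifier": "Use for joins and deduplication; do not treat as a numeric feature.",
    "contact": "Review carefully before any downstream AI or ML use.",
    "numeric_feature": "Candidate feature; check ranges and units first.",
    "datetime": "Consider deriving year or month fields.",
    "text": "Review before deciding whether embeddings are appropriate.",
    "categorical": "Check the set of distinct values before encoding.",
}


def recommend_role(name, inferred_type):
    """Return a recommended role for a column."""
    if name in CONTACT_COLUMNS:
        return "contact"
    if name in IDENTIFIER_COLUMNS:
        return "identifier"
    return TYPE_ROLES.get(inferred_type, "categorical")


def dictionary_entry(name, inferred_type, description, nullable, example_values):
    """Build one data dictionary entry as a JSON-serializable dict."""
    role = recommend_role(name, inferred_type)
    return {
        "name": name,
        "description": description,
        "type": inferred_type,
        "recommended_role": role,
        "nullable": nullable,
        "example_values": example_values,
        "usage_notes": USAGE_NOTES[role],
    }


entry = dictionary_entry(
    "customer_id",
    "numeric",
    "Identifier of the customer who placed the order.",
    nullable=False,
    example_values=[1001, 1002],
)
```

Here customer_id is stored as a number, yet the entry still carries the identifier role and the matching usage note.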

Output 3: Validation summary

The third output is:

data/output/ai_ready/validation_summary.json

This file gives a compact machine-readable summary of the validation and cleaning stage.

It includes:

source file;
schema file;
row count before cleaning;
row count after preparation;
rows removed during preparation;
duplicate rows removed;
columns with missing values;
validation issue count;
validation issue codes;
AI-readiness notes.

This output is useful for auditability.

When a dataset is handed off to another workflow, the receiver should not only get a CSV file. They should also get a summary of what happened before the file was produced.
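A summary like this can be assembled from a handful of counters collected during cleaning. The sketch below assumes each validation issue is a dict with a "code" key. The JSON key names and the issue codes are hypothetical, chosen to mirror the list above.

```python
from collections import Counter


def summarize_validation(source_file, schema_file, rows_before, rows_after,
                         duplicates_removed, issues, missing_by_column):
    """Condense validation and cleaning results into one JSON-serializable dict."""
    codes = Counter(issue["code"] for issue in issues)
    return {
        "source_file": source_file,
        "schema_file": schema_file,
        "row_count_before_cleaning": rows_before,
        "row_count_after_preparation": rows_after,
        "rows_removed": rows_before - rows_after,
        "duplicate_rows_removed": duplicates_removed,
        "columns_with_missing_values": sorted(
            column for column, count in missing_by_column.items() if count > 0
        ),
        "validation_issue_count": len(issues),
        "validation_issue_codes": dict(sorted(codes.items())),
    }


# Tiny illustrative input; the codes and counts are made up for the example.
example_issues = [
    {"code": "invalid_email", "row": 3},
    {"code": "negative_quantity", "row": 7},
    {"code": "invalid_email", "row": 9},
]
summary = summarize_validation(
    source_file="data/generated/orders_100k.csv",
    schema_file="data/expected/generated_order_schema.json",
    rows_before=10,
    rows_after=7,
    duplicates_removed=1,
    issues=example_issues,
    missing_by_column={"email": 0, "country": 2},
)
```

Keeping the before and after row counts side by side lets the receiver see at a glance how much data the preparation step dropped.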

Output 4: Feature-ready CSV

The fourth output is:

data/output/ai_ready/feature_ready_orders.csv

This is a simple tabular output for downstream feature exploration.

By default, the workflow removes identifier and contact fields such as:

order_id
customer_id
email

It also transforms order_date into simple time-based fields such as:

order_year
order_month

This file does not train a model. It does not decide which features are correct for a business use case.

It only creates a cleaner starting point for later review.

That distinction is important. Feature-ready does not mean model-ready for every use case. It means the output is more suitable for feature exploration than the original raw file.
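The transformation can be sketched in a few lines of standard-library Python. It assumes order_date is an ISO YYYY-MM-DD string after cleaning and drops the original date column once the derived fields exist. Both are assumptions, as is the unit_price column name.

```python
from datetime import date

# Identifier and contact fields removed by default, as listed above.
DROP_COLUMNS = {"order_id", "customer_id", "email"}


def to_feature_row(row):
    """Drop identifier/contact fields and derive year and month from order_date."""
    features = {
        key: value
        for key, value in row.items()
        if key not in DROP_COLUMNS and key != "order_date"
    }
    parsed = date.fromisoformat(row["order_date"])  # raises ValueError on bad dates
    features["order_year"] = parsed.year
    features["order_month"] = parsed.month
    return features


cleaned_row = {
    "order_id": "ORD-000001",
    "customer_id": "CUST-0042",
    "email": "user1@example.com",
    "order_date": "2024-03-15",
    "quantity": 2,
    "unit_price": 19.99,  # hypothetical column name
    "country": "DE",
}
feature_row = to_feature_row(cleaned_row)
```

Because date.fromisoformat raises on invalid input, this step only makes sense after the cleaning stage has removed or repaired bad dates.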

Output 5: Embedding-ready text field extract

The fifth output is:

data/output/ai_ready/embedding_ready_text_fields.csv

This file extracts text-like fields into a compact structure:

record_id,text,source_columns

The project does not generate embeddings.

It only prepares text fields so a downstream workflow can decide later whether embeddings are appropriate.

Contact fields such as email are excluded by default.

That is a deliberate design choice. It keeps the project focused on data preparation and avoids pretending that every text field should automatically go into a vector database.
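As a rough sketch of how such an extract could be built: the article does not name the text columns, so product_name and category are hypothetical, as are the use of order_id as record_id, the " | " joining format, and the ";" separator inside source_columns.

```python
import csv
import io

EXCLUDED_COLUMNS = {"email"}  # contact fields never enter the text extract


def build_text_record(row, text_columns, id_column="order_id"):
    """Combine the non-empty, non-excluded text columns of one row."""
    used = [
        col
        for col in text_columns
        if col not in EXCLUDED_COLUMNS and str(row.get(col) or "").strip()
    ]
    text = " | ".join(f"{col}: {str(row[col]).strip()}" for col in used)
    return {
        "record_id": row[id_column],
        "text": text,
        "source_columns": ";".join(used),
    }


def write_text_extract(rows, text_columns, handle):
    """Write record_id,text,source_columns rows, skipping rows with no text."""
    writer = csv.DictWriter(handle, fieldnames=["record_id", "text", "source_columns"])
    writer.writeheader()
    for row in rows:
        record = build_text_record(row, text_columns)
        if record["text"]:
            writer.writerow(record)


example_rows = [
    {"order_id": "ORD-1", "product_name": "Desk lamp", "category": "home",
     "email": "a@example.com"},
    {"order_id": "ORD-2", "product_name": "", "category": None,
     "email": "b@example.com"},
]
buffer = io.StringIO()
write_text_extract(example_rows, ["product_name", "category", "email"], buffer)
```

Even though email is passed in the column list, the exclusion set keeps it out of both the text and the source_columns field, and the second row is skipped because it has no usable text.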

Output 6: AI-ready manifest

The sixth output is:

data/output/ai_ready/ai_ready_manifest.json

This is the most important scope-control file in the v0.6.0 update.

It explicitly records that the workflow did not call AI services:

{
  "llm_api_called": false,
  "embedding_generated": false,
  "model_trained": false
}

This may look simple, but it is useful for a public technical project.

The AI label can easily create confusion. A manifest prevents overclaiming by documenting what the workflow did and did not do.

The manifest also lists intended downstream uses, such as:

BI handoff;
ML feature exploration;
LLM/RAG preparation outside this project;
data quality review.

And it lists out-of-scope items such as:

LLM API calls;
embeddings generation;
model training;
RAG chatbot;
AI agent;
vector database.
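A manifest of this kind is easy to generate and to check. In the sketch below the three boolean flags come from the example above, while the other key names and both helper functions are assumptions.

```python
import json

# Flag names taken from the manifest excerpt shown earlier.
SCOPE_FLAGS = ("llm_api_called", "embedding_generated", "model_trained")


def build_manifest(dataset_name, generated_files):
    """Build a manifest dict that records what the run did and did not do."""
    manifest = {
        "dataset_name": dataset_name,
        "generated_files": sorted(generated_files),
        "intended_downstream_uses": [
            "BI handoff",
            "ML feature exploration",
            "LLM/RAG preparation outside this project",
            "data quality review",
        ],
        "out_of_scope": [
            "LLM API calls",
            "embeddings generation",
            "model training",
            "RAG chatbot",
            "AI agent",
            "vector database",
        ],
    }
    manifest.update({flag: False for flag in SCOPE_FLAGS})
    return manifest


def scope_violations(manifest):
    """Return the scope flags that are missing or not exactly False."""
    return [flag for flag in SCOPE_FLAGS if manifest.get(flag) is not False]


manifest = build_manifest(
    "cleaned_orders", ["schema_profile.json", "data_dictionary.json"]
)
manifest_json = json.dumps(manifest, indent=2)
```

A downstream consumer could call scope_violations on the loaded manifest and refuse to proceed if the list is not empty.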

Output 7: AI-ready summary report

The final output is:

data/output/ai_ready/ai_ready_summary_report.md

This is a human-readable Markdown report.

It includes:

dataset name;
prepared row count;
generated output files;
scope note;
recommended downstream use;
out-of-scope items.

The summary report is meant for handoff.

A technical reviewer can open the JSON files. A less technical stakeholder can start with the Markdown report and understand the purpose of the run.
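Rendering the Markdown report from the manifest keeps the two files consistent. This is a minimal sketch with an assumed report layout. The manifest key names and the row count are illustrative, not the project's real output.

```python
def render_summary_report(manifest, prepared_row_count):
    """Render a short Markdown handoff report from a manifest dict."""
    flags = ("llm_api_called", "embedding_generated", "model_trained")
    lines = [
        f"# AI-ready summary: {manifest['dataset_name']}",
        "",
        f"Prepared rows: {prepared_row_count}",
        "",
        "## Scope note",
        "",
    ]
    lines += [f"- {flag}: {manifest[flag]}" for flag in flags]
    lines += ["", "## Generated files", ""]
    lines += [f"- {name}" for name in manifest["generated_files"]]
    lines += ["", "## Recommended downstream use", ""]
    lines += [f"- {use}" for use in manifest["intended_downstream_uses"]]
    lines += ["", "## Out of scope", ""]
    lines += [f"- {item}" for item in manifest["out_of_scope"]]
    return "\n".join(lines) + "\n"


# Illustrative manifest; a real run would load ai_ready_manifest.json instead.
example_manifest = {
    "dataset_name": "cleaned_orders",
    "generated_files": ["schema_profile.json", "data_dictionary.json"],
    "intended_downstream_uses": ["BI handoff", "data quality review"],
    "out_of_scope": ["LLM API calls", "model training"],
    "llm_api_called": False,
    "embedding_generated": False,
    "model_trained": False,
}
report = render_summary_report(example_manifest, prepared_row_count=3)
```

Because the scope flags are printed straight from the manifest, the human-readable report cannot claim something the machine-readable file does not.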

Why no LLM API call?

This project intentionally stops before the expensive or model-specific part of an AI workflow.

There are several reasons.

First, AI APIs introduce cost and credential management. A small data workflow starter should run without paid API keys.

Second, embedding and modeling decisions depend on the use case. A dataset prepared for sales forecasting is different from a dataset prepared for semantic search.

Third, calling an LLM does not remove the need for validation, profiling, documentation, and governance. Those steps are still required.

Fourth, this project is meant to demonstrate a Python data workflow skill set: data cleaning, validation, transformation, reporting, testing, and handoff.

For this version, preparing better data is more important than adding an AI wrapper.

How this maps to client work

A realistic client request may sound like this:

We want to use our order/customer data for reporting, analytics, or maybe AI later. Can you clean it and prepare a documented dataset first?

A practical first milestone could be:

inspect input files;
validate required fields;
clean duplicates and bad values;
remove obvious identifier or contact fields from feature exploration outputs;
generate a schema profile;
create a data dictionary;
write a validation summary;
prepare a feature-ready CSV;
prepare a text-field extract for later review;
document what was and was not done.

That is exactly the kind of handoff this v0.6.0 demo is designed to show.

It is not a full AI system. It is a data preparation layer that makes later AI-related work more responsible and easier to review.

Tests

Run the AI-ready tests:

pytest tests/test_ai_ready.py
pytest tests/test_ai_ready_outputs.py

Run a byte-compile check and the full test suite:

python -m compileall -q src/dq_etl_starter
python -m compileall -q scripts
pytest

The default tests do not require PostgreSQL, Metabase, external datasets, or any LLM API key.

That keeps the workflow easy to verify locally.
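For readers who want to check a finished run by hand, here is a small sketch of the kind of assertions an output test might make. It does not reproduce tests/test_ai_ready_outputs.py; the helper names are hypothetical, and the pytest function only exercises the helper against a fake directory.

```python
import json
from pathlib import Path

# File names taken from the output listing earlier in the article.
EXPECTED_OUTPUTS = [
    "schema_profile.json",
    "data_dictionary.json",
    "validation_summary.json",
    "feature_ready_orders.csv",
    "embedding_ready_text_fields.csv",
    "ai_ready_manifest.json",
    "ai_ready_summary_report.md",
]


def missing_outputs(output_dir):
    """Return the expected file names that are absent from output_dir."""
    output_dir = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (output_dir / name).is_file()]


def manifest_stays_in_scope(output_dir):
    """True if the manifest records no LLM call, embedding, or training."""
    path = Path(output_dir) / "ai_ready_manifest.json"
    manifest = json.loads(path.read_text(encoding="utf-8"))
    flags = ("llm_api_called", "embedding_generated", "model_trained")
    return not any(manifest[flag] for flag in flags)


def test_missing_outputs_reports_absent_files(tmp_path):
    # tmp_path is pytest's built-in temporary directory fixture.
    (tmp_path / "schema_profile.json").write_text("{}", encoding="utf-8")
    missing = missing_outputs(tmp_path)
    assert "schema_profile.json" not in missing
    assert "ai_ready_manifest.json" in missing
```

After a real run, missing_outputs("data/output/ai_ready") should return an empty list.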

Local artifact policy

The generated AI-ready output files are local artifacts.

Do not commit:

data/generated/
data/output/ai_ready/
data/output/analytics/
data/output/bi/
*.parquet
*.duckdb
metabase.db/
metabase-data/
postgres_data/

The repository should keep source code, tests, documentation, schemas, lightweight sample files, and screenshots.

This matters for public portfolio projects. The repository should be easy to clone and review without carrying large generated outputs.

What is intentionally out of scope?

The v0.6.0 demo does not include:

OpenAI, Claude, Gemini, or other paid AI APIs;
local LLM integration;
embeddings generation;
vector databases;
RAG chatbots;
AI agents;
automatic SQL generation;
automatic data cleaning by LLM;
model training;
AutoML;
feature stores;
MLflow or MLOps tooling;
cloud deployment;
custom frontend code.

These tools can be useful in the right project.

They are simply not the goal of this starter.

The goal is to prepare documented, machine-readable data that downstream workflows can inspect and decide how to use.

What I would improve next

Possible next improvements include:

add more configurable column role rules;
add stronger data dictionary templates;
generate a small HTML summary report;
add richer schema drift checks;
add more realistic public dataset validation notes;
add a Makefile for common demo commands;
add CI for test execution;
add clearer examples for BI, ML, and RAG handoff paths.

The main constraint remains the same:

keep the project small, runnable, testable, documented, and honest about scope

That is more useful than adding an AI feature that hides the underlying data preparation work.

Repository

GitHub repository:

https://github.com/OnerGit/data-quality-etl-starter

Previous article: From Clean Data to BI-Ready Reporting Tables with Python, PostgreSQL, and Metabase

This v0.6.0 update is a practical next step: preparing clean, validated, documented, machine-readable data for downstream BI, ML, or AI workflows without pretending that data preparation alone is a complete AI application.
