The problem nobody talks about
Every ML engineer knows the principle: garbage in, garbage out. But somehow, most teams still spend weeks manually cleaning text data before training — or worse, they skip the cleaning and wonder why their model underperforms.
I've been working with text datasets for years, and the pattern is always the same. You get data from a CRM, an ERP, scanned documents, web scraping, automated feeds. And the data looks mostly fine. Until you look closely.
Double spaces everywhere. Punctuation that's technically Unicode but renders wrong. Words repeated in sequence ("the the company"). Apostrophes that are sometimes the straight ASCII ' (U+0027), sometimes the curly ' (U+2019), sometimes the acute accent ´ (U+00B4). Capitalization that changes mid-sentence. Encoding artifacts from a database migration five years ago.
Each error seems harmless. Multiply them by a million records, and your model is learning noise as signal.
"Just write a regex"
Sure, for some things. Lowercase everything? text.lower(). Strip HTML tags? Easy regex. Remove double spaces? re.sub(r' +', ' ', text).
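Those one-liners are real, and worth keeping as a cheap first pass. A minimal sketch of that deterministic baseline (the function name is mine, not from any library):

```python
import re

def basic_clean(text: str) -> str:
    """Cheap, deterministic cleanup: the easy cases a regex can handle."""
    text = re.sub(r"<[^>]+>", "", text)   # strip simple HTML tags
    text = re.sub(r" +", " ", text)       # collapse runs of spaces
    return text.strip().lower()

print(basic_clean("  The  <b>Company</b>  Report "))  # → "the company report"
```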
But what about:
- OCR artifacts that vary from page to page ("rn" → "m" sometimes, not always)
- Free-text customer notes where every record is different
- Typos that aren't in any standard dictionary
- Mixed-language text where the rules change mid-field
- Encoding errors interleaved with valid special characters
These require judgment. A human can spot them instantly but can't process 100K records. A regex can process 100K records but can't make judgment calls.
This is exactly where LLMs excel.
What I built
PurifyFactory is a CLI pipeline that uses AI language models to clean text datasets at scale. The workflow is deliberately simple:
# 1. Split your JSONL dataset into optimal batches
./purifyfactory split --input messy_data.jsonl --config my_config.json
# 2. Queue the work
./purifyfactory orchestrate --config my_config.json
# 3. Process with AI (parallel workers, auto-recovery)
./purifyfactory process --config my_config.json
The output is a JSONL file with original and cleaned text side by side:
{
  "original_text": "The company's product was very very popular",
  "cleaned_text": "The company's product was very popular",
  "provider": "openai",
  "tokens": 45,
  "cost": 0.000010
}
How it actually works
You define the rules. Your cleaning logic lives in the system prompt — natural language instructions that the AI applies consistently to every record. "Remove duplicate words. Fix punctuation. Normalize apostrophes to standard Unicode. Correct obvious OCR errors."
The key insight: what takes a human 10 seconds per record and is impossible to scale, takes the LLM milliseconds and costs fractions of a cent.
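PurifyFactory's internals aren't shown here, but the idea is easy to illustrate: one system prompt carrying the rules, one user message carrying the record. A sketch against the OpenAI chat API (the helper names and rules text are mine):

```python
RULES = (
    "Remove duplicate words. Fix punctuation. Normalize apostrophes to "
    "standard Unicode. Correct obvious OCR errors. Return only the cleaned text."
)

def build_messages(rules: str, record_text: str) -> list[dict]:
    """The cleaning logic lives entirely in the system prompt."""
    return [
        {"role": "system", "content": rules},
        {"role": "user", "content": record_text},
    ]

def clean_record(client, text: str) -> str:
    """client is an OpenAI() instance from the `openai` package."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=build_messages(RULES, text),
        temperature=0,  # keep cleaning as deterministic as possible
    )
    return resp.choices[0].message.content.strip()

# Usage (requires `pip install openai` and OPENAI_API_KEY):
# from openai import OpenAI
# print(clean_record(OpenAI(), "The the company's product was very very popular"))
```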
Architecture:
- Split: Your dataset gets chunked into optimal batch sizes
- Orchestrate: Batches are queued for parallel processing
- Process: Multiple workers process batches in parallel. Failed batches can be recovered automatically when the supervisor daemon is running, or re-queued manually with a single command
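The split/queue/process pattern above is generic. A minimal sketch of it (not PurifyFactory's actual code) with in-process workers and per-batch retries:

```python
from concurrent.futures import ThreadPoolExecutor

def split(records, batch_size):
    """Split: chunk the dataset into fixed-size batches."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

def process_all(records, clean_fn, batch_size=2, workers=4, retries=2):
    """Orchestrate + process: run batches in parallel, retrying failures."""
    def run_batch(batch):
        for attempt in range(retries + 1):
            try:
                return [clean_fn(r) for r in batch]
            except Exception:
                if attempt == retries:
                    raise  # give up: this batch must be re-queued
        return []

    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # submission order is preserved when collecting results
        for future in [pool.submit(run_batch, b) for b in split(records, batch_size)]:
            results.extend(future.result())
    return results

print(process_all(["a  a", "b", "c", "d", "e"], lambda t: " ".join(t.split())))
```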
Multi-provider: Works with OpenAI, Anthropic Claude, Google Gemini, or local models via Ollama/vLLM. Switch providers by changing one line in the config. Automatic fallback if a provider fails.
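Fallback logic of this kind boils down to trying providers in order and swallowing per-provider failures. A minimal sketch (provider names here are just labels, not PurifyFactory's config keys):

```python
def clean_with_fallback(text, providers):
    """Try each (name, clean_fn) pair in order; fall back on failure."""
    last_error = None
    for name, clean_fn in providers:
        try:
            return name, clean_fn(text)
        except Exception as e:
            last_error = e  # provider down or rate-limited: try the next one
    raise RuntimeError("all providers failed") from last_error

def flaky(text):
    raise TimeoutError("primary provider unreachable")

provider_used, cleaned = clean_with_fallback(
    "the the company",
    [("openai", flaky), ("anthropic", lambda t: t.replace("the the", "the"))],
)
print(provider_used, cleaned)  # → anthropic the company
```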
On-premise: The binary runs entirely on your machine. Your data never touches any server except the API calls to your chosen provider. Essential for sensitive corporate datasets.
Cost tracking: Every record in the output includes token count and cost. The report command gives you total cost, average cost per record, and processing time.
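Because every output record carries its own cost and token fields, the report is trivial to reproduce yourself. A sketch that aggregates the output JSONL (field names taken from the sample record above):

```python
import json

def report(jsonl_lines):
    """Aggregate cost/token fields from the output JSONL into report totals."""
    records = [json.loads(line) for line in jsonl_lines]
    total_cost = sum(r["cost"] for r in records)
    return {
        "records": len(records),
        "total_tokens": sum(r["tokens"] for r in records),
        "total_cost": total_cost,
        "avg_cost_per_record": total_cost / len(records),
    }

lines = [
    '{"original_text": "a", "cleaned_text": "a", "provider": "openai", "tokens": 45, "cost": 0.00001}',
    '{"original_text": "b", "cleaned_text": "b", "provider": "openai", "tokens": 55, "cost": 0.000012}',
]
print(report(lines))
```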
Estimated costs
Costs vary with average text length. Reference estimates based on provider pricing:
| Dataset size | Provider | Estimated cost |
|---|---|---|
| 1,000 records | gpt-4o-mini | ~$0.05–0.15 |
| 10,000 records | gpt-4o-mini | ~$0.50–1.50 |
| 10,000 records | claude-haiku-4-5 | ~$0.40–1.20 |
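To see where numbers like these come from, here is the back-of-envelope arithmetic. It assumes gpt-4o-mini's published pricing of $0.15 per 1M input tokens and $0.60 per 1M output tokens (check current rates) and a guessed ~150 tokens in and out per record:

```python
def estimate_cost(n_records, in_tokens, out_tokens,
                  price_in_per_m=0.15, price_out_per_m=0.60):
    """Rough dataset cost: tokens per record x records x price per million."""
    cost_in = n_records * in_tokens / 1_000_000 * price_in_per_m
    cost_out = n_records * out_tokens / 1_000_000 * price_out_per_m
    return cost_in + cost_out

# 10,000 records at ~150 tokens in/out lands inside the $0.50-1.50 range above
print(round(estimate_cost(10_000, 150, 150), 2))
```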
The cost-per-quality tradeoff is remarkable. In my experience, a model trained on 5K ultra-clean records can outperform one trained on 10K messy records, and you're potentially saving days of post-training debugging for a few dollars in API costs.
Open beta
PurifyFactory is now in open beta (v9.1.6). Currently Linux x86_64 only — Windows and macOS are on the roadmap.
If you work with text datasets and want to try it, you can apply for the beta program here:
You'll need:
- Linux x86_64
- An API key from OpenAI, Anthropic, Google Gemini, or a local model setup
- A dataset you'd like to clean (1,000+ records recommended)
Free access, direct feedback channel with the dev team, and your input shapes the final product.
Built by Mentora Technologies.
What's your experience with text data quality in ML pipelines? I'd love to hear how others are handling this — especially at scale.