Ven Iyer
I stopped calling GPT-4 for the same classification task 10,000 times

I kept running into the same pattern building internal tools: calling an LLM API thousands of times with the same prompt template, just swapping in different text.

  • Classify this contract clause
  • Route this support ticket
  • Categorize this log line

Same task. Different input. Over and over.

Two problems kept coming up:

Data sensitivity. For teams handling contracts, patient records, or internal logs, sending that data to a third-party API isn't always an option.

Cost. At scale, you're paying per-token for what is essentially structured pattern matching.

So I built an open-source CLI that trains a small local text classifier from labeled examples. You give it ~50 input/output pairs, it trains a ~230KB model on your machine, and you run inference locally. No network calls, ever.

What it looks like

```
npm install -g expressible
```

```
expressible distill init clause-detector
cd clause-detector
```

Add some labeled examples:

```
expressible distill add --file ./labeled-clauses.json
```

The JSON is just an array of { "input": "...", "output": "..." } objects — 50 or so pairs.
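For example, a `labeled-clauses.json` might look like this (the inputs and labels below are illustrative — you define whatever categories fit your task):

```json
[
  { "input": "Either party may terminate this Agreement upon 30 days written notice.", "output": "termination-for-convenience" },
  { "input": "Vendor shall indemnify Customer against all third-party claims.", "output": "indemnification" },
  { "input": "Employee shall not engage in any competing business for 12 months.", "output": "non-compete" }
]
```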

Train:

```
expressible distill train
```

```
Training Complete
  Samples              87
  Validation accuracy  93.2%
  Time elapsed         2.8s

✓ Model saved to model/
```

Run:

```
expressible distill run "Either party may terminate this Agreement at any time
  for any reason by providing 90 days written notice"
```

```json
{
  "output": "termination-for-convenience",
  "confidence": 0.94
}
```

That ran locally. No API call. The text never left the machine.

How it works under the hood

Distill uses all-MiniLM-L6-v2, a sentence embedding model that runs locally. It converts text into 384-dimensional vectors that capture semantic meaning. A small two-layer neural network trained on your labeled examples learns to map those vectors to your categories.
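The classifier head itself is simple enough to sketch in a few lines. This is a conceptual illustration, not Distill's actual implementation — the hidden width, weights, and category names here are made up, and the 384-dim embedding is stubbed where a real pipeline would call all-MiniLM-L6-v2:

```javascript
// Sketch of a two-layer classifier head over sentence embeddings.
// NOT Distill's real code: hidden size and weights are placeholders.

const EMBED_DIM = 384; // all-MiniLM-L6-v2 output dimension
const HIDDEN = 64;     // assumed hidden-layer width
const CATEGORIES = ["termination-for-convenience", "indemnification", "non-compete"];

// Stub embedding — a real pipeline would embed the input text here.
const embedding = Array.from({ length: EMBED_DIM }, (_, i) => Math.sin(i));

const relu = (x) => Math.max(0, x);
const softmax = (xs) => {
  const m = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
};

// Deterministic toy values standing in for trained parameters.
const w1 = Array.from({ length: HIDDEN }, (_, r) =>
  Array.from({ length: EMBED_DIM }, (_, c) => Math.cos(r + c) * 0.01)
);
const w2 = Array.from({ length: CATEGORIES.length }, (_, r) =>
  Array.from({ length: HIDDEN }, (_, c) => Math.sin(r * 7 + c) * 0.1)
);

const matvec = (w, v) => w.map((row) => row.reduce((s, x, i) => s + x * v[i], 0));

const hidden = matvec(w1, embedding).map(relu); // layer 1 + ReLU
const probs = softmax(matvec(w2, hidden));      // layer 2 + softmax
const best = probs.indexOf(Math.max(...probs));

console.log({ output: CATEGORIES[best], confidence: probs[best] });
```

Training then just means fitting `w1` and `w2` to your ~50 labeled pairs — a tiny optimization problem, which is why it finishes in seconds on a CPU.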

The embedding model downloads once (~80MB) and is cached. Everything after that is fully offline.

No Python. No GPU. No Docker. Just Node.js 18+.

Where it works well and where it doesn't

This was important for me to document honestly.

Works well (80–95% accuracy): Topic and domain classification — tasks where the categories are about different things. Support tickets about billing vs shipping. News about sports vs politics. Contract clauses about indemnification vs non-compete.

Struggles (44–50% accuracy): Sentiment and tone. "This camera takes amazing photos" and "This camera takes terrible photos" produce nearly identical embedding vectors — same topic, same structure. The model can't tell them apart because the embeddings capture what text is about, not how it evaluates.

This is a fundamental limitation of the embedding approach, and I documented it openly in the benchmarks.
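You can see the mechanism with cosine similarity, the standard way to compare embedding vectors. The vectors below are toy stand-ins, not real MiniLM outputs: they share almost every component (the "topic") and differ in only one (the "polarity"), which roughly mirrors what happens with opposite-sentiment sentences about the same subject:

```javascript
// Cosine similarity between two vectors — toy 5-dim examples,
// not actual model embeddings.
const cosine = (a, b) => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
};

const amazing  = [0.8, 0.6, 0.5, 0.9,  0.1]; // "amazing photos"
const terrible = [0.8, 0.6, 0.5, 0.9, -0.1]; // "terrible photos"

console.log(cosine(amazing, terrible).toFixed(3)); // ≈ 0.99, despite opposite meaning
```

When two classes sit that close in embedding space, no classifier head on top can reliably separate them.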

Benchmarks

All benchmarks use 50 training examples — roughly 30 minutes of labeling work:

| Scenario | Accuracy | Data Source |
| --- | --- | --- |
| Support ticket routing (4 categories) | 95.0% | Synthetic |
| Content moderation (3 categories) | 90.0% | Synthetic |
| News categorization (5 categories) | 88.0% | Synthetic |
| 20 Newsgroups (5 categories) | 80.0% | Public dataset |
| AG News (4 categories) | 64.0% | Public dataset |

The public dataset results use real-world text from datasets with 120,000+ entries — we only used 50 training samples from each. AG News improves to 80% with 100 samples.

You can reproduce every number:

```
git clone https://github.com/expressibleai/expressible-cli.git
cd expressible-cli
npm install
npx tsx tests/harness/run.ts
```

The review-retrain loop

This is where it gets practical. After training, you can review predictions through a local web UI:

```
expressible distill review
```

Correct mistakes, approve good results, then retrain:

```
expressible distill retrain
# → Previous accuracy: 89% → New accuracy: 94%
```

Your accuracy improves iteratively as you catch edge cases the model gets wrong.

When to use this instead of an LLM

This is not a replacement for LLMs. It's specifically for the repetitive, pattern-based subset of LLM calls — the ones where the same prompt template processes different data every time.

If your task requires reasoning, creativity, or open-ended generation, use an LLM. If you're classifying thousands of inputs into the same N categories, a 230KB local model might be enough.

Everything stays local

  • Training data never leaves your filesystem
  • The embedding model runs locally
  • Trained models are files you own
  • Zero telemetry, zero analytics, zero phone-home
  • Export to standalone inference.js + model files for production

Links

Still early. I'd genuinely appreciate feedback on the approach — especially if you've tried similar embedding + classifier patterns in your own workflows, or if multi-label classification would be useful for your use case.
