Ven Iyer
I stopped calling GPT-4 for the same classification task 10,000 times

I kept running into the same pattern building internal tools: calling an LLM API thousands of times with the same prompt template, just swapping in different text.

  • Classify this contract clause
  • Route this support ticket
  • Categorize this log line

Same task. Different input. Over and over.

Two problems kept coming up:

Data sensitivity. For teams handling contracts, patient records, or internal logs, sending that data to a third-party API isn't always an option.

Cost. At scale, you're paying per-token for what is essentially structured pattern matching.

So I built an open-source CLI that trains a small local text classifier from labeled examples. You give it ~50 input/output pairs, it trains a ~230KB model on your machine, and you run inference locally. No network calls, ever.

What it looks like

```
npm install -g expressible
```

```
expressible distill init clause-detector
cd clause-detector
```

Add some labeled examples:

```
expressible distill add --file ./labeled-clauses.json
```

The JSON is just an array of { "input": "...", "output": "..." } objects — 50 or so pairs.
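For example, a `labeled-clauses.json` might look like this (the inputs and labels below are illustrative — you define whatever categories fit your task):

```json
[
  { "input": "Either party may terminate this Agreement upon 30 days written notice.", "output": "termination-for-convenience" },
  { "input": "Vendor shall indemnify Customer against all third-party claims.", "output": "indemnification" },
  { "input": "Employee shall not engage in any competing business for 12 months.", "output": "non-compete" }
]
```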

Train:

```
expressible distill train
```

```
Training Complete
  Samples              87
  Validation accuracy  93.2%
  Time elapsed         2.8s

✓ Model saved to model/
```

Run:

```
expressible distill run "Either party may terminate this Agreement at any time
  for any reason by providing 90 days written notice"
```

```json
{
  "output": "termination-for-convenience",
  "confidence": 0.94
}
```

That ran locally. No API call. The text never left the machine.

How it works under the hood

Distill uses all-MiniLM-L6-v2, a sentence embedding model that runs locally. It converts text into 384-dimensional vectors that capture semantic meaning. A small two-layer neural network trained on your labeled examples learns to map those vectors to your categories.
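The classifier head itself is simple enough to sketch in a few lines. This is a conceptual illustration, not Distill's actual implementation — the hidden width, weights, and category names here are made up, and the 384-dim embedding is stubbed where a real pipeline would call all-MiniLM-L6-v2:

```javascript
// Sketch of a two-layer classifier head over sentence embeddings.
// NOT Distill's real code: hidden size and weights are placeholders.

const EMBED_DIM = 384; // all-MiniLM-L6-v2 output dimension
const HIDDEN = 64;     // assumed hidden-layer width
const CATEGORIES = ["termination-for-convenience", "indemnification", "non-compete"];

// Stub embedding — a real pipeline would embed the input text here.
const embedding = Array.from({ length: EMBED_DIM }, (_, i) => Math.sin(i));

const relu = (x) => Math.max(0, x);
const softmax = (xs) => {
  const m = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
};

// Deterministic toy values standing in for trained parameters.
const w1 = Array.from({ length: HIDDEN }, (_, r) =>
  Array.from({ length: EMBED_DIM }, (_, c) => Math.cos(r + c) * 0.01)
);
const w2 = Array.from({ length: CATEGORIES.length }, (_, r) =>
  Array.from({ length: HIDDEN }, (_, c) => Math.sin(r * 7 + c) * 0.1)
);

const matvec = (w, v) => w.map((row) => row.reduce((s, x, i) => s + x * v[i], 0));

const hidden = matvec(w1, embedding).map(relu); // layer 1 + ReLU
const probs = softmax(matvec(w2, hidden));      // layer 2 + softmax
const best = probs.indexOf(Math.max(...probs));

console.log({ output: CATEGORIES[best], confidence: probs[best] });
```

Training then just means fitting `w1` and `w2` to your ~50 labeled pairs — a tiny optimization problem, which is why it finishes in seconds on a CPU.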

The embedding model downloads once (~80MB) and is cached. Everything after that is fully offline.

No Python. No GPU. No Docker. Just Node.js 18+.

Where it works well and where it doesn't

This was important for me to document honestly.

Works well (80–95% accuracy): Topic and domain classification — tasks where the categories are about different things. Support tickets about billing vs shipping. News about sports vs politics. Contract clauses about indemnification vs non-compete.

Struggles (44–50% accuracy): Sentiment and tone. "This camera takes amazing photos" and "This camera takes terrible photos" produce nearly identical embedding vectors — same topic, same structure. The model can't tell them apart because the embeddings capture what text is about, not how it evaluates.

This is a fundamental limitation of the embedding approach, and I documented it openly in the benchmarks.
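You can see the mechanism with cosine similarity, the standard way to compare embedding vectors. The vectors below are toy stand-ins, not real MiniLM outputs: they share almost every component (the "topic") and differ in only one (the "polarity"), which roughly mirrors what happens with opposite-sentiment sentences about the same subject:

```javascript
// Cosine similarity between two vectors — toy 5-dim examples,
// not actual model embeddings.
const cosine = (a, b) => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
};

const amazing  = [0.8, 0.6, 0.5, 0.9,  0.1]; // "amazing photos"
const terrible = [0.8, 0.6, 0.5, 0.9, -0.1]; // "terrible photos"

console.log(cosine(amazing, terrible).toFixed(3)); // ≈ 0.99, despite opposite meaning
```

When two classes sit that close in embedding space, no classifier head on top can reliably separate them.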

Benchmarks

All benchmarks use 50 training examples — roughly 30 minutes of labeling work:

| Scenario | Accuracy | Data Source |
| --- | --- | --- |
| Support ticket routing (4 categories) | 95.0% | Synthetic |
| Content moderation (3 categories) | 90.0% | Synthetic |
| News categorization (5 categories) | 88.0% | Synthetic |
| 20 Newsgroups (5 categories) | 80.0% | Public dataset |
| AG News (4 categories) | 64.0% | Public dataset |

The public dataset results use real-world text from datasets with 120,000+ entries — we only used 50 training samples from each. AG News improves to 80% with 100 samples.

You can reproduce every number:

```
git clone https://github.com/expressibleai/expressible-cli.git
cd expressible-cli
npm install
npx tsx tests/harness/run.ts
```

The review-retrain loop

This is where it gets practical. After training, you can review predictions through a local web UI:

```
expressible distill review
```

Correct mistakes, approve good results, then retrain:

```
expressible distill retrain
# → Previous accuracy: 89% → New accuracy: 94%
```

Your accuracy improves iteratively as you catch edge cases the model gets wrong.

When to use this instead of an LLM

This is not a replacement for LLMs. It's specifically for the repetitive, pattern-based subset of LLM calls — the ones where the same prompt template processes different data every time.

If your task requires reasoning, creativity, or open-ended generation, use an LLM. If you're classifying thousands of inputs into the same N categories, a 230KB local model might be enough.

Everything stays local

  • Training data never leaves your filesystem
  • The embedding model runs locally
  • Trained models are files you own
  • Zero telemetry, zero analytics, zero phone-home
  • Export to standalone inference.js + model files for production

Links

Still early. I'd genuinely appreciate feedback on the approach — especially if you've tried similar embedding + classifier patterns in your own workflows, or if multi-label classification would be useful for your use case.
