I kept running into the same pattern building internal tools: calling an LLM API thousands of times with the same prompt template, just swapping in different text.
- Classify this contract clause
- Route this support ticket
- Categorize this log line
Same task. Different input. Over and over.
Two problems kept coming up:

- **Data sensitivity.** For teams handling contracts, patient records, or internal logs, sending that data to a third-party API isn't always an option.
- **Cost.** At scale, you're paying per-token for what is essentially structured pattern matching.
So I built an open-source CLI that trains a small local text classifier from labeled examples. You give it ~50 input/output pairs, it trains a ~230KB model on your machine, and you run inference locally. No network calls, ever.
## What it looks like
```shell
npm install -g expressible
expressible distill init clause-detector
cd clause-detector
```
Add some labeled examples:
```shell
expressible distill add --file ./labeled-clauses.json
```
The JSON is just an array of `{ "input": "...", "output": "..." }` objects, 50 or so pairs.
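For example, two entries might look like this (the clause texts here are invented for illustration):

```json
[
  {
    "input": "Either party may terminate this Agreement at any time for any reason by providing 90 days written notice",
    "output": "termination-for-convenience"
  },
  {
    "input": "The Receiving Party shall not disclose any Confidential Information to third parties",
    "output": "confidentiality"
  }
]
```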
Train:
```shell
expressible distill train
```

```
Training Complete

  Samples                87
  Validation accuracy    93.2%
  Time elapsed           2.8s

✓ Model saved to model/
```
Run:
```shell
expressible distill run "Either party may terminate this Agreement at any time for any reason by providing 90 days written notice"
```

```json
{
  "output": "termination-for-convenience",
  "confidence": 0.94
}
```
That ran locally. No API call. The text never left the machine.
## How it works under the hood
Distill uses all-MiniLM-L6-v2, a sentence embedding model that runs locally. It converts text into 384-dimensional vectors that capture semantic meaning. A small two-layer neural network trained on your labeled examples learns to map those vectors to your categories.
The embedding model downloads once (~80MB) and is cached. Everything after that is fully offline.
No Python. No GPU. No Docker. Just Node.js 18+.
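The classifier stage can be sketched in plain JavaScript. This is a toy stand-in, not Distill's actual training code: 4-dimensional vectors instead of 384, hand-made "embeddings", deterministic init, and a bare SGD loop, but it shows the shape of the idea (small two-layer network, softmax over your categories):

```javascript
// Toy two-layer classifier mapping embedding vectors to category scores.
// Assumptions: real embeddings come from all-MiniLM-L6-v2 (384 dims);
// here everything is shrunk to 4 dims with invented vectors.
const DIM = 4, HIDDEN = 8, CLASSES = 2, LR = 0.2, EPOCHS = 500;

// Invented "embeddings": category 0 ≈ one topic, category 1 ≈ another.
const data = [
  { x: [0.9, 0.1, 0.0, 0.1], y: 0 },
  { x: [0.8, 0.2, 0.1, 0.0], y: 0 },
  { x: [0.1, 0.0, 0.9, 0.8], y: 1 },
  { x: [0.0, 0.2, 0.8, 0.9], y: 1 },
];

// Deterministic small init (real code would randomize).
let W1 = Array.from({ length: HIDDEN }, (_, i) =>
  Array.from({ length: DIM }, (_, j) => Math.sin(i * DIM + j) * 0.5));
let b1 = new Array(HIDDEN).fill(0);
let W2 = Array.from({ length: CLASSES }, (_, i) =>
  Array.from({ length: HIDDEN }, (_, j) => Math.cos(i * HIDDEN + j) * 0.5));
let b2 = new Array(CLASSES).fill(0);

const relu = v => v.map(x => Math.max(0, x));
const matvec = (W, x, b) => W.map((row, i) => row.reduce((s, w, j) => s + w * x[j], b[i]));
const softmax = v => {
  const m = Math.max(...v), e = v.map(x => Math.exp(x - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(x => x / s);
};

function forward(x) {
  const h = relu(matvec(W1, x, b1));         // hidden layer
  return { h, p: softmax(matvec(W2, h, b2)) }; // category probabilities
}

// Per-sample SGD with cross-entropy loss.
for (let epoch = 0; epoch < EPOCHS; epoch++) {
  for (const { x, y } of data) {
    const { h, p } = forward(x);
    const dOut = p.map((pi, i) => pi - (i === y ? 1 : 0)); // dL/dlogits
    const dH = new Array(HIDDEN).fill(0);
    for (let i = 0; i < CLASSES; i++) {
      for (let j = 0; j < HIDDEN; j++) {
        dH[j] += dOut[i] * W2[i][j];
        W2[i][j] -= LR * dOut[i] * h[j];
      }
      b2[i] -= LR * dOut[i];
    }
    for (let j = 0; j < HIDDEN; j++) {
      if (h[j] <= 0) continue; // ReLU gate
      for (let k = 0; k < DIM; k++) W1[j][k] -= LR * dH[j] * x[k];
      b1[j] -= LR * dH[j];
    }
  }
}

// Classify a new vector near the category-0 cluster.
const { p } = forward([0.85, 0.15, 0.05, 0.05]);
const label = p.indexOf(Math.max(...p));
console.log({ label, confidence: Number(p[label].toFixed(2)) });
```

The point is that the trainable part is tiny; all the heavy lifting of understanding language lives in the frozen embedding model, which is why training finishes in seconds on a CPU.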
## Where it works well and where it doesn't
This was important for me to document honestly.
**Works well (80–95% accuracy):** Topic and domain classification — tasks where the categories are about different things. Support tickets about billing vs shipping. News about sports vs politics. Contract clauses about indemnification vs non-compete.

**Struggles (44–50% accuracy):** Sentiment and tone. "This camera takes amazing photos" and "This camera takes terrible photos" produce nearly identical embedding vectors — same topic, same structure. The model can't tell them apart because the embeddings capture what text is about, not how it evaluates.
This is a fundamental limitation of the embedding approach, and I documented it openly in the benchmarks.
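To make the failure mode concrete, here is a toy illustration. The vectors below are invented stand-ins for real embeddings, chosen so the two camera sentences point in nearly the same direction while the unrelated topic points elsewhere, which is the geometry the classifier actually sees:

```javascript
// Cosine similarity between embedding vectors.
const cosine = (a, b) => {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = v => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
};

// Invented 4-dim stand-ins for real 384-dim MiniLM embeddings.
const amazingPhotos  = [0.82, 0.41, 0.31, 0.12]; // "takes amazing photos"
const terriblePhotos = [0.80, 0.43, 0.30, 0.14]; // "takes terrible photos"
const billingTicket  = [0.05, 0.10, 0.90, 0.75]; // unrelated topic

console.log(cosine(amazingPhotos, terriblePhotos).toFixed(3)); // → 0.999
console.log(cosine(amazingPhotos, billingTicket).toFixed(3));  // → 0.393
```

A classifier downstream of the embeddings can only separate what the embeddings separate: near-identical vectors (0.999) are effectively one class to it, while topic shifts (0.393) are easy.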
## Benchmarks
All benchmarks use 50 training examples — roughly 30 minutes of labeling work:
| Scenario | Accuracy | Data Source |
|---|---|---|
| Support ticket routing (4 categories) | 95.0% | Synthetic |
| Content moderation (3 categories) | 90.0% | Synthetic |
| News categorization (5 categories) | 88.0% | Synthetic |
| 20 Newsgroups (5 categories) | 80.0% | Public dataset |
| AG News (4 categories) | 64.0% | Public dataset |
The public-dataset rows use real-world text from datasets with 120,000+ entries each, of which only 50 were used for training. AG News improves to 80% with 100 samples.
You can reproduce every number:
```shell
git clone https://github.com/expressibleai/expressible-cli.git
cd expressible-cli
npm install
npx tsx tests/harness/run.ts
```
## The review-retrain loop
This is where it gets practical. After training, you can review predictions through a local web UI:
```shell
expressible distill review
```
Correct mistakes, approve good results, then retrain:
```shell
expressible distill retrain
# → Previous accuracy: 89% → New accuracy: 94%
```
Your accuracy improves iteratively as you catch edge cases the model gets wrong.
## When to use this instead of an LLM
This is not a replacement for LLMs. It's specifically for the repetitive, pattern-based subset of LLM calls — the ones where the same prompt template processes different data every time.
If your task requires reasoning, creativity, or open-ended generation, use an LLM. If you're classifying thousands of inputs into the same N categories, a 230KB local model might be enough.
## Everything stays local
- Training data never leaves your filesystem
- The embedding model runs locally
- Trained models are files you own
- Zero telemetry, zero analytics, zero phone-home
- Export to standalone `inference.js` + model files for production
## Links
- GitHub: github.com/expressibleai/expressible-cli
- Benchmarks: docs/benchmarks.md
- License: Apache 2.0
Still early. I'd genuinely appreciate feedback on the approach — especially if you've tried similar embedding + classifier patterns in your own workflows, or if multi-label classification would be useful for your use case.