I built a request classifier that decides which LLM tier a prompt needs before it's sent to a provider. The goal is cost optimization: route simple requests to cheap models, keep complex ones on premium.
Architecture
- Feature extraction: token count, estimated complexity, conversation depth, presence of code/math/reasoning markers, language detection
- Model: MLP trained on ~50K labeled samples (rule-based scorer as teacher), exported to ONNX for fast inference
- Inference: <2ms per classification, runs inline with the request
- Three output tiers: economy (e.g. Gemini Flash), standard (e.g. GPT-4o-mini), premium (e.g. GPT-4o/Claude Sonnet)
- Semantic cache: Qdrant-based layer that catches near-duplicate prompts (cosine similarity threshold 0.95)
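To make the routing step concrete, here's a minimal sketch of what the feature-extraction stage could look like. The marker regexes, field order, and `turn_depth` parameter are illustrative assumptions, not the actual Kestrel implementation:

```python
import re

# Hypothetical feature extraction; patterns and thresholds are illustrative.
CODE_MARKERS = re.compile(r"```|def |class |import |SELECT |function ")
MATH_MARKERS = re.compile(r"\\frac|\\sum|\bintegral\b|\bderivative\b|[=<>]\s*\d")
REASONING_MARKERS = re.compile(r"\bstep by step\b|\bprove\b|\bexplain why\b", re.I)

def extract_features(prompt: str, turn_depth: int = 1) -> list[float]:
    """Turn a raw prompt into the numeric vector fed to the MLP classifier."""
    tokens = prompt.split()  # crude estimate; the real pipeline would use a tokenizer
    return [
        float(len(tokens)),                            # token count
        float(turn_depth),                             # conversation depth
        float(bool(CODE_MARKERS.search(prompt))),      # code markers present?
        float(bool(MATH_MARKERS.search(prompt))),      # math markers present?
        float(bool(REASONING_MARKERS.search(prompt))), # reasoning markers present?
    ]

feats = extract_features("Prove step by step that the sum converges.", turn_depth=2)
```

The MLP then maps this vector to one of the three tiers; exporting it to ONNX is what keeps inference under 2ms.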
Training pipeline
The rule-based scorer acts as a teacher: it generates tier labels for the ~50K prompts, and the MLP is distilled from those labels. Retraining is triggered by outcome signals from downstream quality checks.
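The teacher step can be sketched like this. The scoring rules and feature names below are hypothetical, just to show how a rule-based scorer turns prompt features into training labels for the MLP:

```python
# Hypothetical teacher: hand rules assign a tier label the MLP learns to imitate.
TIERS = ["economy", "standard", "premium"]

def rule_based_teacher(features: dict) -> int:
    """Return a tier index (0=economy, 1=standard, 2=premium) from simple rules."""
    score = 0
    if features["tokens"] > 1000:       # long prompts tend to need stronger models
        score += 1
    if features["has_code"] or features["has_math"]:
        score += 1
    if features["needs_reasoning"]:
        score += 1
    return min(score, 2)

# One (features, label) training pair for the distillation set
sample = {"tokens": 1500, "has_code": False, "has_math": True, "needs_reasoning": True}
label = TIERS[rule_based_teacher(sample)]
```

The appeal of distillation here is that the MLP smooths over the teacher's hard thresholds while staying cheap enough to run inline with every request.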
Try it
The routing engine is open source: https://github.com/andber6/kestrel
Hosted version with billing/caching: https://usekestrel.io
Question for the community
Has anyone experimented with similar prompt classification approaches? The hardest part has been defining what makes a prompt "need" a premium model. I'm currently using hand-engineered features, but I'm wondering if anyone has had success with learned embeddings for this kind of routing decision.