
André Bergan

ML-based LLM request classifier for cost-optimized routing (~2ms inference)

I built a request classifier that decides which LLM tier a prompt needs before it's sent to a provider. The goal is cost optimization: route simple requests to cheap models, keep complex ones on premium.

Architecture

  • Feature extraction: token count, estimated complexity, conversation depth, presence of code/math/reasoning markers, language detection
  • Model: MLP trained on ~50K labeled samples (rule-based scorer as teacher), exported to ONNX for fast inference
  • Inference: <2ms per classification, runs inline with the request
  • Three output tiers: economy (e.g. Gemini Flash), standard (e.g. GPT-4o-mini), premium (e.g. GPT-4o/Claude Sonnet)
  • Semantic cache: Qdrant-based layer that catches near-duplicate prompts (cosine similarity threshold 0.95)
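To make the first and last bullets concrete, here is a minimal sketch of what feature extraction and the cache check could look like. It is not taken from the Kestrel codebase: the marker regexes, the `Features` fields, and the helper names are assumptions for illustration, language detection and the complexity estimate are left out, and a plain in-memory list stands in for Qdrant. Only the 0.95 threshold comes from the description above.

```python
import math
import re
from dataclasses import dataclass

CACHE_THRESHOLD = 0.95  # cosine similarity threshold quoted above

# Hypothetical marker patterns; the real feature set is not shown in the post.
CODE_RE = re.compile(r"\bdef \w+\(|\bclass \w+|\bimport \w+|#include")
MATH_RE = re.compile(r"\d\s*[-+*/^=]\s*\d|\bintegral\b|\bderivative\b", re.IGNORECASE)
REASONING_RE = re.compile(r"\bstep by step\b|\bprove\b|\bexplain why\b", re.IGNORECASE)


@dataclass
class Features:
    token_count: int
    conversation_depth: int
    has_code: bool
    has_math: bool
    has_reasoning: bool

    def to_vector(self) -> list[float]:
        """Flatten into the numeric vector a classifier would consume."""
        return [
            float(self.token_count),
            float(self.conversation_depth),
            float(self.has_code),
            float(self.has_math),
            float(self.has_reasoning),
        ]


def extract_features(messages: list[str]) -> Features:
    """Compute features for a conversation; the last message is the prompt."""
    prompt = messages[-1]
    return Features(
        token_count=len(prompt.split()),  # whitespace split, a crude token estimate
        conversation_depth=len(messages),
        has_code=bool(CODE_RE.search(prompt)),
        has_math=bool(MATH_RE.search(prompt)),
        has_reasoning=bool(REASONING_RE.search(prompt)),
    )


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cache_lookup(query_vec: list[float], cache: list[tuple[list[float], str]]):
    """Return the response of the most similar cached entry at or above the threshold, else None."""
    best_score, best_response = 0.0, None
    for vec, response in cache:
        score = cosine_similarity(query_vec, vec)
        if score >= CACHE_THRESHOLD and score > best_score:
            best_score, best_response = score, response
    return best_response
```

In a request path like the one described, the cache check would presumably run first on the prompt embedding, and only a miss would go on to `extract_features(...).to_vector()` and the classifier.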

Training pipeline

The rule-based scorer acts as the teacher: it labels the training prompts, and the MLP is trained on those labels so that it distills the scorer's behavior. Retraining is driven by outcome signals from downstream quality checks.
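As a rough illustration of the labeling half of that pipeline, the sketch below shows a rule-based teacher turning feature dictionaries into (vector, label) pairs for a student model to fit. The weights, cut-offs, and function names are all hypothetical, not Kestrel's actual scorer, and the student training and ONNX export are not shown.

```python
TIERS = ("economy", "standard", "premium")


def teacher_score(features: dict) -> float:
    """Hypothetical rule-based complexity score in [0, 1]; the weights are illustrative."""
    score = 0.0
    score += min(features["token_count"] / 2000, 1.0) * 0.4
    score += min(features["conversation_depth"] / 10, 1.0) * 0.1
    score += 0.2 if features["has_code"] else 0.0
    score += 0.15 if features["has_math"] else 0.0
    score += 0.15 if features["has_reasoning"] else 0.0
    return score


def teacher_label(features: dict) -> str:
    """Map the score to a tier; the cut-offs are assumptions."""
    score = teacher_score(features)
    if score < 0.2:
        return TIERS[0]
    if score < 0.5:
        return TIERS[1]
    return TIERS[2]


def build_training_set(samples: list[dict]) -> list[tuple[list[float], int]]:
    """Pair each feature vector with the teacher's tier index for the student to learn."""
    dataset = []
    for f in samples:
        x = [
            float(f["token_count"]),
            float(f["conversation_depth"]),
            float(f["has_code"]),
            float(f["has_math"]),
            float(f["has_reasoning"]),
        ]
        dataset.append((x, TIERS.index(teacher_label(f))))
    return dataset
```

A student trained on such pairs reproduces the teacher at first; the outcome signals mentioned above are what would let later retraining rounds correct labels the rules got wrong.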

Try it

The routing engine is open source: https://github.com/andber6/kestrel
Hosted version with billing/caching: https://usekestrel.io

Question for the community

Has anyone experimented with similar prompt classification approaches? The hardest part has been defining what makes a prompt "need" a premium model. I'm currently using hand-engineered features, but I'm wondering whether anyone has had success with learned embeddings for this kind of routing decision.
