Problem: LLM cost and output quality vary widely across models. Teams want cheaper defaults but must keep an auditable trail of which model was used for each applied edit.
Approach: We added deterministic request classification (classify_request), a routing policy cache populated from the backend, and an in-proxy rewrite that records every routing decision in the provenance payload.
Implementation notes
Classifier: classify_request lives in classifier.py. It evaluates tools/functions, prompt size, system keywords, and code fences, and returns one of three tiers: simple, standard, or complex.
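To make the classification idea concrete, here is a minimal sketch of a deterministic tier classifier. It is not the code in classifier.py: the thresholds, the keyword list, and the request body shape (a chat-style dict with "messages", "tools", and "functions" keys) are all assumptions chosen for illustration.

```python
from typing import Any

# Hypothetical thresholds and keywords; the real values are not given in this post.
SIMPLE_MAX_CHARS = 500
COMPLEX_MIN_CHARS = 4000
COMPLEX_KEYWORDS = ("architect", "refactor", "security review")
CODE_FENCE = "`" * 3  # a Markdown code fence marker


def classify_request(body: dict[str, Any]) -> str:
    """Return 'simple', 'standard', or 'complex' for a chat-style request body."""
    messages = body.get("messages", [])
    text = "\n".join(str(m.get("content", "")) for m in messages)
    system_text = " ".join(
        str(m.get("content", "")).lower()
        for m in messages
        if m.get("role") == "system"
    )

    has_tools = bool(body.get("tools") or body.get("functions"))
    has_code = CODE_FENCE in text
    has_keyword = any(k in system_text for k in COMPLEX_KEYWORDS)

    # Pure rules, no randomness or model calls: the same request always
    # lands in the same tier, which keeps the audit trail reproducible.
    if has_tools or has_keyword or len(text) >= COMPLEX_MIN_CHARS:
        return "complex"
    if has_code or len(text) > SIMPLE_MAX_CHARS:
        return "standard"
    return "simple"
```

Because the rules are plain comparisons, replaying a logged request through the classifier should reproduce the tier stored with the original decision.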
Policy cache: routing_cache.py fetches workspace-level RoutingPolicy from the backend and refreshes in the background; use get_policy(workspace_id, provider) to read a policy at request-time.
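The cache behavior described above could be sketched roughly as follows. Only the get_policy(workspace_id, provider) call comes from the post; the RoutingPolicy fields, the RoutingPolicyCache class, the fetch_all callable, and the 60-second TTL are hypothetical stand-ins for whatever routing_cache.py actually does.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RoutingPolicy:
    # Field names are assumptions for illustration.
    enabled: bool
    tier_to_model: dict = field(default_factory=dict)


class RoutingPolicyCache:
    """Serves policies from memory and refreshes them on a timer."""

    def __init__(self, fetch_all: Callable[[], dict], ttl_seconds: float = 60.0):
        # fetch_all stands in for the backend call; it returns a dict keyed
        # by (workspace_id, provider).
        self._fetch_all = fetch_all
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._policies: dict = {}
        self._stop = threading.Event()

    def refresh(self) -> None:
        fresh = self._fetch_all()
        with self._lock:
            self._policies = dict(fresh)  # swap in a new snapshot

    def start(self) -> None:
        self.refresh()
        threading.Thread(target=self._loop, daemon=True).start()

    def stop(self) -> None:
        self._stop.set()

    def _loop(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._ttl):
            try:
                self.refresh()
            except Exception:
                pass  # keep serving the last good snapshot

    def get_policy(self, workspace_id: str, provider: str) -> Optional[RoutingPolicy]:
        # Request-time reads never touch the backend.
        with self._lock:
            return self._policies.get((workspace_id, provider))
```

In this sketch a failed refresh leaves the previous snapshot in place, so a backend outage would degrade to stale policies rather than failed requests. Whether the real cache makes the same choice is not stated in the post.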
Proxy integration: the proxy (see proxy.py) calls the classifier, consults the cache, and rewrites the outbound model when a policy is enabled. It then attaches the decision to every pending edit so the backend stores it in provenance_records.routing_decision.
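The rewrite-and-record step might look something like the sketch below. The function names, the policy dict shape, and the fields inside the decision record are assumptions; only the idea of attaching a routing_decision to each pending edit comes from the post.

```python
from typing import Any, Optional


def apply_routing(
    body: dict[str, Any], tier: str, policy: Optional[dict[str, Any]]
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (outbound_body, decision) without mutating the inbound body."""
    requested = body.get("model")
    enabled = bool(policy and policy.get("enabled"))
    target = policy.get("tier_to_model", {}).get(tier) if enabled else None
    routed = target or requested

    # A decision is recorded even when nothing is rewritten, so the audit
    # trail also covers requests that passed through unchanged.
    decision = {
        "tier": tier,
        "requested_model": requested,
        "routed_model": routed,
        "rewritten": routed != requested,
        "policy_enabled": enabled,
    }
    return {**body, "model": routed}, decision


def attach_decision(
    pending_edits: list[dict[str, Any]], decision: dict[str, Any]
) -> list[dict[str, Any]]:
    """Copy the routing decision onto every pending edit's provenance payload."""
    return [{**edit, "routing_decision": decision} for edit in pending_edits]
```

Returning a new body instead of mutating the inbound one keeps the originally requested model available for logging and for the savings estimate.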
Savings estimate: pricing.py contains the static pricing table and estimate_savings() used for dashboard cards like “AI Cost Saved by Routing (30d)”.
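As an illustration of how a static pricing table can drive a savings number, consider the sketch below. The model names and prices are placeholders rather than real rates, and the estimate_savings signature shown here is an assumption, since the post does not give the actual one.

```python
# Placeholder prices in USD per million tokens; not real provider rates.
PRICING_PER_MTOK = {
    "large-model": {"input": 10.00, "output": 30.00},
    "small-model": {"input": 0.50, "output": 1.50},
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Price a single request from the static table."""
    price = PRICING_PER_MTOK[model]
    return (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000


def estimate_savings(
    requested_model: str, routed_model: str, input_tokens: int, output_tokens: int
) -> float:
    """Cost the requested model would have incurred minus the routed model's cost."""
    return cost_usd(requested_model, input_tokens, output_tokens) - cost_usd(
        routed_model, input_tokens, output_tokens
    )


# 1,000,000 input and 100,000 output tokens:
# large-model costs 10.00 + 3.00 = 13.00, small-model costs 0.50 + 0.15 = 0.65.
saved = estimate_savings("large-model", "small-model", 1_000_000, 100_000)
print(f"{saved:.2f}")  # 12.35
```

This sketch assumes both models would consume the same number of tokens for the request, which is an approximation: a different model can produce a longer or shorter response, so the result is an estimate, not a billing figure.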
Tests: Run pytest targets in test_routing.py and test_routing_integration.py to validate classification, mapping, and savings calculation.
Operational considerations
V1 intentionally avoids cross-provider rewrites and automatic fallbacks, which keeps timing and correlation simpler for audit logs.
Policies are served from the cache, so a policy edit can take up to the cache TTL to reach every proxy instance.
Watch for parity issues: a rewrite can change model behavior, including latency and output quality. Evaluate on shadow traffic first, or enable routing only for the low-risk simple tier.
Practical takeaway: Turn on routing for the simple tier first, measure savings via the dashboard card, and expand mappings once you have validated quality parity.