DEV Community

Praveen

How LineageLens routes LLM requests for cost savings — without losing provenance

Problem: LLM cost and behavior vary widely from model to model. Teams want cheaper defaults but must keep an auditable trail of which model was used for each applied edit.
Approach: We added deterministic request classification (classify_request), a backend-backed routing policy cache, and an in-proxy rewrite that records every routing decision into the provenance payload.
Implementation notes
Classifier: classify_request lives in classifier.py and evaluates tools/functions, prompt size, system keywords, and code fences to return simple|standard|complex.
Policy cache: routing_cache.py fetches the workspace-level RoutingPolicy from the backend and refreshes it in the background; call get_policy(workspace_id, provider) to read a policy at request time.
Proxy integration: the proxy calls the classifier, consults the cache, and rewrites the outbound model when a policy is enabled. It then attaches the decision to every pending edit so the backend stores it in provenance_records.routing_decision (see proxy.py).
Savings estimate: pricing.py contains the static pricing table and estimate_savings() used for dashboard cards like “AI Cost Saved by Routing (30d)”.
Tests: Run pytest targets in test_routing.py and test_routing_integration.py to validate classification, mapping, and savings calculation.
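The classify, look up, rewrite, and record flow described in these notes can be sketched roughly as follows. This is a minimal illustration, not the LineageLens source: only the names classify_request, RoutingPolicy, and routing_decision come from the post. The thresholds, the keyword list, the RoutingPolicy fields, the route_request helper, and the model names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical thresholds and keywords; the real classifier's rules are not shown in the post.
SIMPLE_MAX_CHARS = 500
COMPLEX_MIN_CHARS = 4000
COMPLEX_KEYWORDS = ("refactor", "architecture", "migrate")
FENCE = "`" * 3  # the Markdown code fence marker


def classify_request(body: dict) -> str:
    """Deterministically map a chat-style request body to simple|standard|complex."""
    messages = body.get("messages", [])
    text = "\n".join(str(m.get("content", "")) for m in messages)
    system = " ".join(
        str(m.get("content", "")) for m in messages if m.get("role") == "system"
    ).lower()

    # Tool or function calling is treated as complex in this sketch.
    if body.get("tools") or body.get("functions"):
        return "complex"
    if len(text) >= COMPLEX_MIN_CHARS or any(k in system for k in COMPLEX_KEYWORDS):
        return "complex"
    # Short prompts without code fences are the cheap tier.
    if len(text) <= SIMPLE_MAX_CHARS and FENCE not in text:
        return "simple"
    return "standard"


@dataclass
class RoutingPolicy:
    """Assumed policy shape: tier name -> cheaper model from the same provider."""
    enabled: bool = False
    tier_to_model: Dict[str, str] = field(default_factory=dict)


def route_request(body: dict, policy: Optional[RoutingPolicy]) -> Tuple[dict, dict]:
    """Return (outbound_body, routing_decision) without mutating the input body."""
    tier = classify_request(body)
    original = body.get("model")
    target = original
    if policy is not None and policy.enabled:
        # Tiers without a mapping keep the requested model.
        target = policy.tier_to_model.get(tier, original)
    decision = {
        "tier": tier,
        "original_model": original,
        "routed_model": target,
        "rewritten": target != original,
    }
    return {**body, "model": target}, decision
```

Note that the decision is produced even when nothing is rewritten. In this sketch that is deliberate: a record saying "classified as complex, left on the requested model" is as useful to an auditor as one that records a rewrite.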
Operational considerations
V1 intentionally avoids cross-provider rewrites and automatic fallbacks — that keeps timing and correlation simpler for audit logs.
Policies are cached, so an edit can take up to one cache TTL to reach every proxy instance.
Watch for parity issues: a rewrite can change model behavior (latency and output quality). Evaluate on shadow traffic, or enable routing only for the low-risk simple tier first.
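The propagation delay mentioned above follows directly from how a TTL cache behaves. The sketch below is a simplified stand-in, not routing_cache.py itself: it refreshes lazily on read rather than in the background, and the PolicyCache class, the fetch callable, and the injectable clock are assumptions introduced so the staleness window is easy to observe.

```python
import time
from typing import Any, Callable, Dict, Tuple

Key = Tuple[str, str]  # (workspace_id, provider)


class PolicyCache:
    """Lazy TTL cache; the fetch callable stands in for the backend request."""

    def __init__(
        self,
        fetch: Callable[[str, str], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Key, Tuple[float, Any]] = {}

    def get_policy(self, workspace_id: str, provider: str) -> Any:
        key = (workspace_id, provider)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            # Still within the TTL: this copy may be older than the backend's.
            return entry[1]
        policy = self._fetch(workspace_id, provider)
        self._entries[key] = (now, policy)
        return policy
```

With a 60-second TTL, a policy edited just after a fetch is not seen by that instance until the entry expires. Each proxy instance holds its own copy, so instances can briefly disagree, which is one more reason to record the decision per request rather than infer it from the policy later.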
Practical takeaway: Turn on routing for the simple tier first, measure savings via the dashboard card, and expand the mappings once you have validated quality parity.
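Measuring savings comes down to pricing each rewritten request twice, once at the model originally requested and once at the model actually used. The following is a hedged sketch of that arithmetic, not the contents of pricing.py: the prices are placeholders rather than real vendor rates, and the record fields and the request_cost helper are assumptions for illustration.

```python
from typing import Dict, Iterable, Tuple

# Placeholder prices in USD per 1M tokens as (input, output); not real vendor pricing.
PRICING: Dict[str, Tuple[float, float]] = {
    "large-model": (10.0, 30.0),
    "small-model": (1.0, 2.0),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD under the static table."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def estimate_savings(records: Iterable[dict]) -> float:
    """Sum (cost at original model - cost at routed model) over rewritten requests."""
    total = 0.0
    for r in records:
        if r["original_model"] == r["routed_model"]:
            continue  # not rewritten, so nothing was saved
        total += request_cost(r["original_model"], r["input_tokens"], r["output_tokens"])
        total -= request_cost(r["routed_model"], r["input_tokens"], r["output_tokens"])
    return total
```

This treats the token counts observed on the routed model as what the original model would have consumed. That is an approximation, since tokenizers and output lengths differ between models, so the result is best read as an estimate.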
