1. Introduction: The Dual Pain Points of Inference Cost and Performance in Customer Service
This is Part 7 of the series *8 Weeks from Zero to One: Full-Stack Engineering Practice for a Production-Grade LLM Customer Service System*. In the first six parts, we built out the system's core capabilities end to end. In enterprise-grade production deployments, however, runaway costs and performance instability are more operationally dangerous than missing features. Real production logs and load-test data from our e-commerce customer service system revealed the following:
- Over 70% of user queries are repetitive or semantically similar (e.g., "What is the return process?", "How do I return an item?", "What steps do I need to follow to return something?"). Calling the LLM indiscriminately for every request wastes significant resources.
- Before optimization, all requests were routed uniformly to the DeepSeek-R1:14B private deployment. Monthly inference costs (calculated across GPU compute, electricity, and operations) exceeded ¥70,000.
- During high-concurrency periods (e.g., the 618 and Double 11 shopping festivals), heavy LLM inference pushed average response latency to 1.5s, with GPU OOM errors and cascading service failures under peak load.
- Simple queries (e.g., "How do I turn on the smart bulb?") and complex queries (e.g., "There's a quality issue with the product in Order #123 — analyze the refund process and compensation options based on the after-sales policy") consumed identical model resources, making resource allocation highly inefficient.
Core question: How do we dramatically reduce inference costs and optimize response speed while improving high-concurrency throughput — without sacrificing answer quality?
Our approach: We rejected the "single optimization strategy" mindset and designed a three-layer full-pipeline optimization architecture: Dual-Layer Semantic Caching + Tiered Model Routing + Scene-Aware Prompt Compression. Caching eliminates over 70% of redundant inference calls; tiered routing ensures the right model handles the right query; Prompt compression further reduces per-request token consumption. The three layers work in concert to achieve production-grade cost and performance balance — not through any single technique in isolation.
2. Three-Layer Full-Pipeline Optimization Architecture
We embed optimization capabilities throughout the entire system pipeline — from user input to final output, every step is governed by cost and performance controls. The architecture fully inherits the technology stack from the previous six parts (Redis Cluster, Ollama, DeepSeek-R1 private deployment, vLLM reserved interface), requiring no refactoring of the core architecture:
┌──────────────────────────────────────────────────────────┐
│ User Input + User Identity Info │
└─────────────────────────┬────────────────────────────────┘
│
┌─────────────────────────▼────────────────────────────────┐
│ [Layer 1] Dual-Layer Semantic Cache │
│ (Intercept First — Zero Inference Cost) │
│ · Exact Match Cache: MD5/Hash direct lookup │
│ · Semantic Similarity Cache: Lightweight Embedding + │
│ Cosine Similarity │
│ · Keyword Fallback Validation: No cross-intent │
│ cache sharing │
└──────────┬──────────────────────────────┬───────────────┘
│ Cache Hit (75%) │ Cache Miss (25%)
▼ ▼
┌──────────────────────┐ ┌─────────────────────────────────┐
│ Return Cached Answer │ │ [Layer 2] Scene-Aware │
└──────────────────────┘ │ Prompt Compression │
│ History summarization + │
│ Structured query pre-fetch │
└───────────────┬─────────────────┘
│
▼
┌───────────────────────────────────┐
│ [Layer 3] Tiered Model Routing │
│ · Ollama small model: │
│ Simple FAQ / small talk │
│ · DeepSeek-R1: │
│ Complex reasoning │
│ · vLLM batch inference: │
│ High-concurrency fallback │
└───────────────┬───────────────────┘
│
▼
┌───────────────────────────────────┐
│ Async Cache Update + Full-Pipeline │
│ Monitoring & Logging │
└───────────────────────────────────┘
Diagram note: User input is first processed by the dual-layer semantic cache. On a cache hit, the answer is returned immediately at zero inference cost. On a miss, scene-aware Prompt compression is applied, followed by tiered model routing to the appropriate model. Results are then written back to the cache asynchronously while full-pipeline monitoring data is recorded.
3. Production-Grade Engineering Implementation of Core Modules
3.1 Dual-Layer Semantic Cache: Intercept First, Maximize Hit Rate
We rejected a "single cache strategy" and designed a dual-layer cache tailored to different query types. Keyword fallback validation, hot/cold storage separation, and intelligent invalidation mechanisms together ensure production-grade stability and a low false-match rate.
3.1.1 Cache Types and Design Rationale
| Cache Type | Target Scenario | Core Design | Strengths | Limitations |
|---|---|---|---|---|
| Exact Match Cache | Identical queries (e.g., "What is the return process?") | Applies configurable text preprocessing (whitespace removal, punctuation stripping, case normalization), computes a Hash key, and performs direct lookup in Redis Cluster | Extremely fast (<10ms), zero false matches | Low coverage — only handles fully identical queries (~15%) |
| Semantic Similarity Cache | Semantically equivalent but differently phrased queries (e.g., "How do I return?" vs. "What's the return procedure?") | Encodes queries using a lightweight Embedding model fine-tuned on e-commerce customer service data; computes cosine similarity against cached vectors using a configurable threshold; returns cached answer on hit | High coverage — handles 70%+ of similar queries | Minor false-match risk; requires threshold tuning and keyword fallback |
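The exact-match layer above can be sketched in a few lines. This is a minimal illustration, not the production code: the `business_scene` label, the default scene name, and the specific normalization steps are assumptions standing in for the configurable preprocessing pipeline described in the table.

```python
import hashlib

def exact_cache_key(query: str, scene: str = "aftersale") -> str:
    # Normalize: lowercase, drop whitespace and punctuation (configurable in production)
    normalized = "".join(ch for ch in query.lower() if ch.isalnum())
    # Hash the normalized text into a fixed-length Redis key suffix
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    return f"exact_cache:{scene}:{digest}"

# Differently spaced and punctuated forms of the same query map to one key
assert exact_cache_key("What is the return process?") == \
       exact_cache_key("what is the return   process")
```

Because the key is a fixed-length hash, lookup is a single Redis `GET`, which is what keeps this layer under 10ms.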
3.1.2 Production-Grade Core Mechanisms
- Storage Layer Architecture: Fully inherits the Redis Cluster deployment from Part 1, supporting 100,000+ QPS. Key naming follows a configurable convention:
  - Exact cache: `exact_cache:{business_scene}:{hash_value}`
  - Semantic cache: `semantic_cache:{business_scene}:{embedding_vector_hash}` (stores vector, answer, access count, creation timestamp, version)
- Hot/Cold Storage Separation:
  - Hot cache (Redis in-memory): high-frequency queries exceeding a configurable access threshold (~20% of entries), response <50ms
  - Cold cache (Redis persistence + local disk index): low-frequency queries below the threshold (~80% of entries), response <100ms
- Cache Update and Invalidation:
  - Update: after the LLM generates a new answer, it is written asynchronously through a delay queue; within a configurable time window, the same query triggers at most one cache update, preventing cache thrashing.
  - Invalidation:
    - Active invalidation: when business rules change (e.g., a return policy update), related cache entries are bulk-deleted by version number or keyword match.
    - Passive invalidation: LRU eviction clears cold cache entries not accessed within a configurable number of days; hot cache entries carry a configurable TTL.
    - False-match invalidation: when a user marks an answer as "unhelpful," the corresponding cache entry is immediately invalidated and flagged for manual review.
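The semantic layer plus the keyword fallback validation can be sketched as follows. The embedding model is abstracted away, and the vectors, threshold, and cache-entry layout are illustrative assumptions, not the production schema:

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_lookup(query_vec, query_keywords, cache, threshold=0.90):
    """Return the best cached answer above the similarity threshold,
    but only when intent keywords overlap (the keyword fallback check)."""
    best, best_sim = None, threshold
    for entry in cache:
        sim = cosine(query_vec, entry["vector"])
        if sim >= best_sim and query_keywords & entry["keywords"]:
            best, best_sim = entry, sim
    return best["answer"] if best else None

cache = [
    {"vector": [0.9, 0.1, 0.0], "keywords": {"return"}, "answer": "Go to Orders > Returns."},
    {"vector": [0.1, 0.9, 0.0], "keywords": {"shipping"}, "answer": "Ships in 48h."},
]
# Similar vector + matching intent keyword -> cache hit
assert semantic_lookup([0.88, 0.12, 0.0], {"return"}, cache) == "Go to Orders > Returns."
# Similar vector but different intent -> keyword fallback blocks the false match
assert semantic_lookup([0.88, 0.12, 0.0], {"shipping"}, cache) is None
```

The second assertion is the whole point of the fallback: vector similarity alone would happily serve a return-policy answer to a shipping question, which is exactly the cross-intent sharing the design forbids.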
3.2 Tiered Model Routing: The Right Model for the Right Query
Our core thesis is that cost reduction cannot rely on caching alone. Tiered model routing ensures rational resource allocation and delivers an additional ~20% reduction in inference cost beyond what caching achieves — while fully inheriting the technology stack from the previous six parts (Ollama MVP, DeepSeek-R1 private deployment, vLLM reserved interface).
3.2.1 Routing Rule Design
We designed clear tiered routing rules across three dimensions: query complexity, business priority, and concurrency level:
MODEL_ROUTING_RULES = """
You are the model routing component of an e-commerce intelligent customer service system.
Your responsibility is to select the most appropriate model for each query.
Core rules (in priority order):
1. [HIGHEST PRIORITY] High-concurrency periods (configurable QPS threshold):
- Non-complex queries → vLLM batch inference queue
- Complex queries → DeepSeek-R1 private deployment
2. [SECONDARY PRIORITY] Simple queries / small talk:
- Route to lightweight small model deployed via Ollama
- Simple query: FAQ-type, single-turn, no context, clear keywords
- Small talk: greetings, thanks, complaints unrelated to business
3. [DEFAULT PRIORITY] Complex queries → DeepSeek-R1 private deployment
- Complex query: multi-turn with context, mixed structured/unstructured,
requires reasoning or analysis
4. Output ONLY the model name. Do NOT output anything else:
ollama_small_model / deepseek_r1_private / vllm_batch_queue
"""
3.2.2 Production-Grade Core Mechanisms
- Model Pool Management:
  - Ollama small model pool: lightweight GPU servers supporting 5,000+ QPS, handling simple queries
  - DeepSeek-R1 private pool: high-performance GPU servers (A10G-class), supporting 200+ QPS for complex queries
  - vLLM batch inference pool: pre-wired vLLM adapter interface from Part 1; auto-starts during high-concurrency periods, supporting 1,000+ QPS batch throughput
- Routing Jitter Protection:
  - A secondary complexity check is applied before routing to prevent misclassification based on a single keyword
  - After a high-concurrency period ends, the system transitions smoothly back to normal routing mode to avoid service instability
- Graceful Degradation:
  - If one model pool becomes unavailable, traffic is automatically rerouted to a backup pool
  - If all model pools are unavailable, the system falls back to a predefined FAQ answer library
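The degradation chain can be sketched as a simple priority walk over pool clients. In this illustration each pool is just a callable, and `ConnectionError` stands in for whatever health-check failure the real client raises; the FAQ contents are made up:

```python
def call_with_degradation(query, pools, faq):
    # Try each pool (a callable) in priority order
    for pool in pools:
        try:
            return pool(query)
        except ConnectionError:
            continue  # pool unavailable: fall through to the next one
    # Every pool failed: answer from the predefined FAQ library
    return faq.get(query, "Service is busy, please try again shortly.")

def down(_query):
    raise ConnectionError("pool unavailable")

faq = {"return process": "See Orders > Returns for self-service returns."}
# Backup pool absorbs the failure of the primary
assert call_with_degradation("return process", [down, lambda q: "live answer"], faq) == "live answer"
# All pools down -> static FAQ fallback
assert call_with_degradation("return process", [down, down], faq) == "See Orders > Returns for self-service returns."
```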
3.3 Scene-Aware Prompt Compression: Reducing Per-Request Token Consumption
Generic compression is not enough. Prompt compression must be customized for the e-commerce customer service context to reduce per-request token consumption by 30%+ without losing semantic fidelity.
3.3.1 Core Compression Strategies
- Conversation History Summarization:
  - When a conversation exceeds a configurable number of turns, a lightweight small model automatically summarizes the history, retaining only core business information (order numbers, product names, prior questions and answers)
  - The summarization Prompt framework enforces retention of core business fields and explicitly prohibits preserving irrelevant details
- Structured Query Pre-fetch Compression:
  - For queries with structured-data intent (e.g., "logistics for Order #123"), Text2Cypher is called first to retrieve the structured data, which is then injected as context into the Prompt, eliminating redundant LLM inference over structured information
- Redundancy Filtering:
  - Automatically strips redundant whitespace, punctuation, and repeated instructions from the Prompt, retaining only core business rules and user input
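The history-summarization and redundancy-filtering steps can be sketched together. The `summarize` callable is a placeholder for the small-model summarization call, and the six-turn window is an illustrative default rather than the production setting:

```python
import re

def compress_history(history, max_turns=6,
                     summarize=lambda turns: "[summary] " + "; ".join(turns)):
    # Summarize everything older than the configurable turn window
    if len(history) > max_turns:
        head = summarize(history[:-max_turns])
        history = [head] + history[-max_turns:]
    # Redundancy filtering: collapse runs of spaces/tabs in the final prompt
    return re.sub(r"[ \t]+", " ", "\n".join(history)).strip()

compressed = compress_history([f"turn {i}:   text" for i in range(10)])
assert compressed.startswith("[summary]")   # old turns folded into a summary
assert "turn 9: text" in compressed          # recent turns kept verbatim
```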
3.4 Production-Grade Monitoring and Alerting
To ensure continuous visibility into optimization effectiveness and maintain production-grade stability, we designed a three-tier monitoring and alerting system that fully inherits the OpenTelemetry + Prometheus + Grafana stack from the previous six parts:
- Core Metric Monitoring:
  - Cache layer: total hit rate, exact match hit rate, semantic similarity hit rate, false-match rate, cache update/invalidation counts
  - Routing layer: per-pool call distribution, routing jitter count, degradation fallback count
  - Cost layer: average token consumption per request, average inference cost per request, monthly inference cost
  - Performance layer: average response latency, P50/P95/P99 latency, peak QPS capacity
- Visualization Dashboard:
  - Grafana real-time monitoring panels with filtering by time range, business scene, and model pool
- Threshold Alerting:
  - Alerts are automatically triggered when: total cache hit rate falls below a configurable threshold, false-match rate exceeds a configurable threshold, monthly inference cost exceeds budget, or P99 latency exceeds a configurable threshold
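The four alerting conditions can be written down as one plain evaluation function. Metric names, threshold keys, and the sample values below are illustrative; in production the inputs would come from the Prometheus metrics listed above:

```python
def check_alerts(metrics, thresholds):
    """Return the names of every alert whose configured threshold is breached."""
    fired = []
    if metrics["cache_hit_rate"] < thresholds["min_hit_rate"]:
        fired.append("cache_hit_rate_low")
    if metrics["false_match_rate"] > thresholds["max_false_match"]:
        fired.append("false_match_rate_high")
    if metrics["monthly_cost"] > thresholds["budget"]:
        fired.append("cost_over_budget")
    if metrics["p99_ms"] > thresholds["max_p99_ms"]:
        fired.append("p99_latency_high")
    return fired

metrics = {"cache_hit_rate": 0.68, "false_match_rate": 0.004,
           "monthly_cost": 25_000, "p99_ms": 1100}
thresholds = {"min_hit_rate": 0.70, "max_false_match": 0.01,
              "budget": 30_000, "max_p99_ms": 1500}
# Only the hit-rate threshold is breached in this sample
assert check_alerts(metrics, thresholds) == ["cache_hit_rate_low"]
```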
4. Production Pitfalls and Solutions
4.1 Semantic Cache Stampede
- Symptom: Bulk business rule updates before a major shopping festival triggered a full cache invalidation, causing all requests to hit the LLM simultaneously, resulting in GPU OOM errors and a cascading service failure.
- Root Cause: Full cache invalidation had no smooth transition; the instantaneous request spike exceeded the model pool's capacity.
- Solution:
  - Adopt a gradual invalidation strategy during business rule updates: invalidate a configurable percentage of cache entries per day rather than all at once
  - Pre-warm the Top 10,000 high-frequency query cache entries before any full invalidation
  - Automatically activate the vLLM batch inference pool during full invalidation windows to absorb the surge
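The gradual-invalidation step amounts to splitting the affected keys into daily batches. A minimal sketch, assuming a 20%-per-day default and a made-up key naming scheme matching the convention from section 3.1.2:

```python
def invalidation_batches(keys, percent_per_day=20):
    # Size each daily batch as a fixed percentage of the affected keys,
    # so the cache drains gradually instead of all at once
    batch = max(1, len(keys) * percent_per_day // 100)
    return [keys[i:i + batch] for i in range(0, len(keys), batch)]

keys = [f"semantic_cache:aftersale:{i}" for i in range(100)]
batches = invalidation_batches(keys)
assert len(batches) == 5                     # 20% per day -> spread over 5 days
assert sum(len(b) for b in batches) == 100   # every key eventually invalidated
```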
4.2 Tiered Model Routing Jitter
- Symptom: A user asked "How do I turn on the smart bulb?" (simple → Ollama) followed 10 seconds later by "There's a quality issue with the smart bulb — analyze the refund process based on the after-sales policy" (complex → DeepSeek). The model switch mid-conversation degraded the user experience.
- Root Cause: Routing decisions were based solely on the current query, without considering the user's conversation history.
- Solution:
  - Add a historical-conversation complexity assessment before routing: if the user recently asked a complex query, route the current query preferentially to DeepSeek-R1
  - Within a configurable time window, keep the same user's conversation on one routing model to prevent frequent switching
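The sticky-window behavior can be sketched as follows. The model names follow the routing rules above; the window length, the strength ordering, and the injectable clock (which makes the logic testable) are illustrative assumptions:

```python
import time

# Higher value = stronger model; a conversation may escalate but never downgrade
STRENGTH = {"ollama_small_model": 0, "deepseek_r1_private": 1}

class StickyRouter:
    """Within a configurable window, a user's conversation never drops to a
    weaker model than one it already used; pins expire after the window."""
    def __init__(self, window_s=300.0, clock=time.monotonic):
        self.window_s, self.clock = window_s, clock
        self._pins = {}  # user_id -> (model, last_seen)

    def route(self, user_id, proposed):
        now = self.clock()
        model, seen = self._pins.get(user_id, (proposed, now))
        if now - seen >= self.window_s:
            model = proposed                  # pin expired: start fresh
        elif STRENGTH[proposed] > STRENGTH[model]:
            model = proposed                  # escalate to the stronger model
        self._pins[user_id] = (model, now)
        return model

t = [0.0]
r = StickyRouter(window_s=300, clock=lambda: t[0])
assert r.route("u1", "ollama_small_model") == "ollama_small_model"
t[0] = 10
assert r.route("u1", "deepseek_r1_private") == "deepseek_r1_private"  # escalates
t[0] = 20
assert r.route("u1", "ollama_small_model") == "deepseek_r1_private"   # stays sticky
t[0] = 400
assert r.route("u1", "ollama_small_model") == "ollama_small_model"    # pin expired
```

Escalation is one-directional on purpose: switching up to DeepSeek mid-conversation is fine, but dropping back down is exactly the jitter that degraded the user experience in the incident above.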
4.3 Over-Compression Causing Semantic Loss
- Symptom: Aggressive conversation history summarization caused the model to forget that "the product in Order #123 was purchased during the 618 festival and is eligible for additional compensation," leading to answers that violated business rules.
- Root Cause: The generic summarization Prompt did not enforce retention of e-commerce-specific business fields (order numbers, purchase timestamps, promotional activities).
- Solution:
  - Customize the summarization Prompt framework for the e-commerce customer service context, explicitly requiring retention of order numbers, purchase timestamps, promotional activities, and product quality issues
  - Add a post-compression business-field validation check: if any required field is missing, re-compress or fall back to retaining the full conversation history
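The post-compression validation check can be sketched with regex probes. The two field patterns shown are illustrative stand-ins for the production field extractors, and this version falls back to the full history rather than re-compressing:

```python
import re

# Illustrative patterns for required business fields
REQUIRED_PATTERNS = {
    "order_number": r"Order\s*#\d+",
    "promotion": r"\b618\b|Double\s*11",
}

def validate_summary(summary: str, original: str) -> str:
    # If a field present in the original history vanished from the summary,
    # fall back to the full history rather than risk a rule-violating answer
    for pattern in REQUIRED_PATTERNS.values():
        if re.search(pattern, original, re.I) and not re.search(pattern, summary, re.I):
            return original
    return summary

original = "The product in Order #123 was purchased during the 618 festival."
# Summary lost the order number and promotion -> fall back to full history
assert validate_summary("User reports a quality issue.", original) == original
# Summary preserved both fields -> keep the compressed form
kept = validate_summary("Order #123, bought during 618, quality issue.", original)
assert kept.startswith("Order #123")
```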
5. End-to-End Effectiveness Validation
We sampled 10,000 user queries from real production logs (6,000 simple, 2,000 structured, 2,000 complex) and conducted a 7-day live production validation during the 618 shopping festival. Key quantitative results are as follows:
5.1 Core Metrics: Before vs. After Optimization
| Metric | Before (Pure DeepSeek-R1) | After (Three-Layer Optimization) | Change |
|---|---|---|---|
| Total Cache Hit Rate | N/A (no cache) | 75% | +75% |
| Cost Reduction Attributed to Tiered Routing | N/A (no routing) | ~20% | — |
| Avg. Token Consumption per Request | 1,200 tokens | 840 tokens | -30% |
| Avg. Inference Cost per Request | ¥0.014 | ¥0.0042 | -70% |
| Avg. Response Latency (ms) | 1,500 | 800 | -46.7% |
| P99 Latency (ms) | 3,500 | 1,200 | -65.7% |
| Peak Concurrency Capacity (QPS) | 500 | 1,500 | +200% |
| Monthly Inference Cost | ¥72,000 | ¥21,600 | -70% |
| False-Match Rate | N/A (no cache) | 0.9% | Acceptable |
5.2 Live Production Validation
- Cache hit rate: 7-day average held steady at 73%–77% with no significant fluctuation
- Cost: During the 618 festival (7 days), inference costs dropped from an expected ¥16,800 to ¥5,040 — in line with projections
- Latency: 99.9% of requests completed in under 1.5s; no timeouts or cascading failures at peak load (1,200 QPS)
- User satisfaction: Based on post-conversation ratings on a 5-point scale, satisfaction improved from 4.6/5 to 4.8/5, with zero complaints attributable to routing or caching behavior
6. Differentiation: Our Production-Grade Advantages
Compared to general-purpose open-source optimization solutions (e.g., LangChain Cache, native vLLM routing), our three-layer full-pipeline architecture delivers four key advantages in enterprise e-commerce customer service deployments:
| Dimension | General Open-Source Solutions | Our Three-Layer Architecture |
|---|---|---|
| Scene Adaptability | Generic use cases, no industry customization | Deep adaptation to e-commerce customer service: customized semantic cache, tiered routing, and Prompt compression |
| Full-Pipeline Coordination | Single optimization modules requiring manual integration | Dual-layer cache + tiered routing + Prompt compression working in concert for compounding cost reduction |
| Production Stability | Basic functionality only; monitoring, alerting, and fallback must be self-implemented | Complete production-grade monitoring, alerting, graceful degradation, jitter protection, and stampede prevention |
| Stack Integration | Requires custom integration with business systems | Fully inherits the technology stack from the previous six parts — no core architecture refactoring required |
Core value: Our solution is not a simple assembly of isolated optimization modules. It is a complete enterprise-grade optimization system ready for direct production deployment — genuinely solving the three critical requirements of deployability, stability, and meaningful cost reduction.
7. Deployment Boundaries and Series Continuity
7.1 Deployment Boundaries
This three-layer full-pipeline optimization architecture is deeply adapted to e-commerce customer service scenarios. Deployments in heavily regulated industries such as healthcare or finance will require adjustments to cache content policies, routing rules, and Prompt compression strategies to meet industry-specific compliance requirements. Full production deployment also requires customized integration with your business system's monitoring, alerting, and fallback infrastructure.
7.2 Series Continuity
- GitHub Repository: Link TBD
- Backward References: Builds on the MVP architecture, data pipeline, GraphRAG service layer, multi-agent workflow, safety guardrail system, and hybrid knowledge retrieval system from Parts 1–6, completing the production-grade cost and performance optimization layer
- Coming Up — Part 8: The series finale. A complete retrospective covering every architectural decision from MVP to production, a full post-mortem of pitfalls encountered, and a consolidated record of quantifiable outcomes — forming a complete end-to-end engineering practice reference. Stay tuned.