1. Introduction: The Dual Pain Points of Inference Cost and Performance in Customer Service
This is Part 7 of the series *8 Weeks from Zero to One: Full-Stack Engineering Practice for a Production-Grade LLM Customer Service System*. In the first six parts, we built out the system's core capabilities end to end. In enterprise-grade production deployments, however, runaway costs and performance instability are more operationally dangerous than missing features. Real production logs and load-test data from our e-commerce customer service system revealed the following:
- Over 70% of user queries are repetitive or semantically similar (e.g., "What is the return process?", "How do I return an item?", "What steps do I need to follow to return something?"). Calling the LLM indiscriminately for every request wastes significant resources.
- Before optimization, all requests were routed uniformly to the DeepSeek-R1:14B private deployment. Monthly inference costs (calculated across GPU compute, electricity, and operations) exceeded ¥70,000.
- During high-concurrency periods (e.g., the 618 and Double 11 shopping festivals), heavy LLM inference pushed average response latency to 1.5s, with GPU OOM errors and cascading service failures under peak load.
- Simple queries (e.g., "How do I turn on the smart bulb?") and complex queries (e.g., "There's a quality issue with the product in Order #123 — analyze the refund process and compensation options based on the after-sales policy") consumed identical model resources, making resource allocation highly inefficient.
Core question: How do we dramatically reduce inference costs and optimize response speed while improving high-concurrency throughput — without sacrificing answer quality?
Our approach: We rejected the "single optimization strategy" mindset and designed a three-layer full-pipeline optimization architecture: Dual-Layer Semantic Caching + Tiered Model Routing + Scene-Aware Prompt Compression. Caching eliminates over 70% of redundant inference calls; tiered routing ensures the right model handles the right query; Prompt compression further reduces per-request token consumption. The three layers work in concert to achieve production-grade cost and performance balance — not through any single technique in isolation.
2. Three-Layer Full-Pipeline Optimization Architecture
We embed optimization capabilities throughout the entire system pipeline — from user input to final output, every step is governed by cost and performance controls. The architecture fully inherits the technology stack from the previous six parts (Redis Cluster, Ollama, DeepSeek-R1 private deployment, vLLM reserved interface), requiring no refactoring of the core architecture:
┌──────────────────────────────────────────────────────────┐
│ User Input + User Identity Info │
└─────────────────────────┬────────────────────────────────┘
│
┌─────────────────────────▼────────────────────────────────┐
│ [Layer 1] Dual-Layer Semantic Cache │
│ (Intercept First — Zero Inference Cost) │
│ · Exact Match Cache: MD5/Hash direct lookup │
│ · Semantic Similarity Cache: Lightweight Embedding + │
│ Cosine Similarity │
│ · Keyword Fallback Validation: No cross-intent │
│ cache sharing │
└──────────┬──────────────────────────────┬───────────────┘
│ Cache Hit (75%) │ Cache Miss (25%)
▼ ▼
┌──────────────────────┐ ┌─────────────────────────────────┐
│ Return Cached Answer │ │ [Layer 2] Scene-Aware │
└──────────────────────┘ │ Prompt Compression │
│ History summarization + │
│ Structured query pre-fetch │
└───────────────┬─────────────────┘
│
▼
┌───────────────────────────────────┐
│ [Layer 3] Tiered Model Routing │
│ · Ollama small model: │
│ Simple FAQ / small talk │
│ · DeepSeek-R1: │
│ Complex reasoning │
│ · vLLM batch inference: │
│ High-concurrency fallback │
└───────────────┬───────────────────┘
│
▼
┌───────────────────────────────────┐
│ Async Cache Update + Full-Pipeline │
│ Monitoring & Logging │
└───────────────────────────────────┘
Diagram note: User input is first processed by the dual-layer semantic cache. On a cache hit, the answer is returned immediately at zero inference cost. On a miss, scene-aware Prompt compression is applied, followed by tiered model routing to the appropriate model. Results are then written back to the cache asynchronously while full-pipeline monitoring data is recorded.
3. Production-Grade Engineering Implementation of Core Modules
3.1 Dual-Layer Semantic Cache: Intercept First, Maximize Hit Rate
We rejected a "single cache strategy" and designed a dual-layer cache tailored to different query types. Keyword fallback validation, hot/cold storage separation, and intelligent invalidation mechanisms together ensure production-grade stability and a low false-match rate.
3.1.1 Cache Types and Design Rationale
| Cache Type | Target Scenario | Core Design | Strengths | Limitations |
|---|---|---|---|---|
| Exact Match Cache | Identical queries (e.g., "What is the return process?") | Applies configurable text preprocessing (whitespace removal, punctuation stripping, case normalization), computes a Hash key, and performs direct lookup in Redis Cluster | Extremely fast (<10ms), zero false matches | Low coverage — only handles fully identical queries (~15%) |
| Semantic Similarity Cache | Semantically equivalent but differently phrased queries (e.g., "How do I return?" vs. "What's the return procedure?") | Encodes queries using a lightweight Embedding model fine-tuned on e-commerce customer service data; computes cosine similarity against cached vectors using a configurable threshold; returns cached answer on hit | High coverage — handles 70%+ of similar queries | Minor false-match risk; requires threshold tuning and keyword fallback |
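The exact-match layer above can be sketched in a few lines. This is a minimal illustration, not the production code: the `business_scene` label, the default scene name, and the specific normalization steps are assumptions standing in for the configurable preprocessing pipeline described in the table.

```python
import hashlib

def exact_cache_key(query: str, scene: str = "aftersale") -> str:
    # Normalize: lowercase, drop whitespace and punctuation (configurable in production)
    normalized = "".join(ch for ch in query.lower() if ch.isalnum())
    # Hash the normalized text into a fixed-length Redis key suffix
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    return f"exact_cache:{scene}:{digest}"

# Differently spaced and punctuated forms of the same query map to one key
assert exact_cache_key("What is the return process?") == \
       exact_cache_key("what is the return   process")
```

Because the key is a fixed-length hash, lookup is a single Redis `GET`, which is what keeps this layer under 10ms.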
3.1.2 Production-Grade Core Mechanisms
- Storage Layer Architecture: Fully inherits the Redis Cluster deployment from Part 1, supporting 100,000+ QPS. Key naming follows a configurable convention:
  - Exact cache: `exact_cache:{business_scene}:{hash_value}`
  - Semantic cache: `semantic_cache:{business_scene}:{embedding_vector_hash}` (stores vector, answer, access count, creation timestamp, version)
- Hot/Cold Storage Separation:
  - Hot cache (Redis in-memory): high-frequency queries exceeding a configurable access threshold (~20% of entries), response <50ms
  - Cold cache (Redis persistence + local disk index): low-frequency queries below the threshold (~80% of entries), response <100ms
- Cache Update and Invalidation:
  - Update: after the LLM generates a new answer, it is written asynchronously through a delay queue; within a configurable time window, the same query triggers at most one cache update, preventing cache thrashing.
  - Invalidation:
    - Active invalidation: when business rules change (e.g., a return policy update), related cache entries are bulk-deleted by version number or keyword match.
    - Passive invalidation: LRU eviction clears cold cache entries not accessed within a configurable number of days; hot cache entries carry a configurable TTL.
    - False-match invalidation: when a user marks an answer as "unhelpful," the corresponding cache entry is immediately invalidated and flagged for manual review.
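The semantic layer plus the keyword fallback validation can be sketched as follows. The embedding model is abstracted away, and the vectors, threshold, and cache-entry layout are illustrative assumptions, not the production schema:

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_lookup(query_vec, query_keywords, cache, threshold=0.90):
    """Return the best cached answer above the similarity threshold,
    but only when intent keywords overlap (the keyword fallback check)."""
    best, best_sim = None, threshold
    for entry in cache:
        sim = cosine(query_vec, entry["vector"])
        if sim >= best_sim and query_keywords & entry["keywords"]:
            best, best_sim = entry, sim
    return best["answer"] if best else None

cache = [
    {"vector": [0.9, 0.1, 0.0], "keywords": {"return"}, "answer": "Go to Orders > Returns."},
    {"vector": [0.1, 0.9, 0.0], "keywords": {"shipping"}, "answer": "Ships in 48h."},
]
# Similar vector + matching intent keyword -> cache hit
assert semantic_lookup([0.88, 0.12, 0.0], {"return"}, cache) == "Go to Orders > Returns."
# Similar vector but different intent -> keyword fallback blocks the false match
assert semantic_lookup([0.88, 0.12, 0.0], {"shipping"}, cache) is None
```

The second assertion is the whole point of the fallback: vector similarity alone would happily serve a return-policy answer to a shipping question, which is exactly the cross-intent sharing the design forbids.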
3.2 Tiered Model Routing: The Right Model for the Right Query
Our core thesis is that cost reduction cannot rely on caching alone. Tiered model routing ensures rational resource allocation and delivers an additional ~20% reduction in inference cost beyond what caching achieves — while fully inheriting the technology stack from the previous six parts (Ollama MVP, DeepSeek-R1 private deployment, vLLM reserved interface).
3.2.1 Routing Rule Design
We designed clear tiered routing rules across three dimensions: query complexity, business priority, and concurrency level:
MODEL_ROUTING_RULES = """
You are the model routing component of an e-commerce intelligent customer service system.
Your responsibility is to select the most appropriate model for each query.
Core rules (in priority order):
1. [HIGHEST PRIORITY] High-concurrency periods (configurable QPS threshold):
- Non-complex queries → vLLM batch inference queue
- Complex queries → DeepSeek-R1 private deployment
2. [SECONDARY PRIORITY] Simple queries / small talk:
- Route to lightweight small model deployed via Ollama
- Simple query: FAQ-type, single-turn, no context, clear keywords
- Small talk: greetings, thanks, complaints unrelated to business
3. [DEFAULT PRIORITY] Complex queries → DeepSeek-R1 private deployment
- Complex query: multi-turn with context, mixed structured/unstructured,
requires reasoning or analysis
4. Output ONLY the model name. Do NOT output anything else:
ollama_small_model / deepseek_r1_private / vllm_batch_queue
"""
3.2.2 Production-Grade Core Mechanisms
- Model Pool Management:
  - Ollama small model pool: lightweight GPU servers supporting 5,000+ QPS, handling simple queries
  - DeepSeek-R1 private pool: high-performance GPU servers (A10G-class), supporting 200+ QPS for complex queries
  - vLLM batch inference pool: pre-wired vLLM adapter interface from Part 1; auto-starts during high-concurrency periods, supporting 1,000+ QPS batch throughput
- Routing Jitter Protection:
  - A secondary complexity check is applied before routing to prevent misclassification based on a single keyword
  - After a high-concurrency period ends, the system transitions smoothly back to normal routing mode to avoid service instability
- Graceful Degradation:
  - If one model pool becomes unavailable, traffic is automatically rerouted to a backup pool
  - If all model pools are unavailable, the system falls back to a predefined FAQ answer library
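The degradation chain can be sketched as a simple priority walk over pool clients. In this illustration each pool is just a callable, and `ConnectionError` stands in for whatever health-check failure the real client raises; the FAQ contents are made up:

```python
def call_with_degradation(query, pools, faq):
    # Try each pool (a callable) in priority order
    for pool in pools:
        try:
            return pool(query)
        except ConnectionError:
            continue  # pool unavailable: fall through to the next one
    # Every pool failed: answer from the predefined FAQ library
    return faq.get(query, "Service is busy, please try again shortly.")

def down(_query):
    raise ConnectionError("pool unavailable")

faq = {"return process": "See Orders > Returns for self-service returns."}
# Backup pool absorbs the failure of the primary
assert call_with_degradation("return process", [down, lambda q: "live answer"], faq) == "live answer"
# All pools down -> static FAQ fallback
assert call_with_degradation("return process", [down, down], faq) == "See Orders > Returns for self-service returns."
```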
3.3 Scene-Aware Prompt Compression: Reducing Per-Request Token Consumption
Generic compression is not enough. Prompt compression must be customized for the e-commerce customer service context to reduce per-request token consumption by 30%+ without losing semantic fidelity.
3.3.1 Core Compression Strategies
- Conversation History Summarization:
  - When a conversation exceeds a configurable number of turns, a lightweight small model automatically summarizes the history, retaining only core business information (order numbers, product names, prior questions and answers)
  - The summarization Prompt framework enforces retention of core business fields and explicitly prohibits preserving irrelevant details
- Structured Query Pre-fetch Compression:
  - For queries with structured-data intent (e.g., "logistics for Order #123"), Text2Cypher is called first to retrieve the structured data, which is then injected as context into the Prompt, eliminating redundant LLM inference over structured information
- Redundancy Filtering:
  - Automatically strips redundant whitespace, punctuation, and repeated instructions from the Prompt, retaining only core business rules and user input
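The history-summarization and redundancy-filtering steps can be sketched together. The `summarize` callable is a placeholder for the small-model summarization call, and the six-turn window is an illustrative default rather than the production setting:

```python
import re

def compress_history(history, max_turns=6,
                     summarize=lambda turns: "[summary] " + "; ".join(turns)):
    # Summarize everything older than the configurable turn window
    if len(history) > max_turns:
        head = summarize(history[:-max_turns])
        history = [head] + history[-max_turns:]
    # Redundancy filtering: collapse runs of spaces/tabs in the final prompt
    return re.sub(r"[ \t]+", " ", "\n".join(history)).strip()

compressed = compress_history([f"turn {i}:   text" for i in range(10)])
assert compressed.startswith("[summary]")   # old turns folded into a summary
assert "turn 9: text" in compressed          # recent turns kept verbatim
```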
3.4 Production-Grade Monitoring and Alerting
To ensure continuous visibility into optimization effectiveness and maintain production-grade stability, we designed a three-tier monitoring and alerting system that fully inherits the OpenTelemetry + Prometheus + Grafana stack from the previous six parts:
- Core Metric Monitoring:
  - Cache layer: total hit rate, exact match hit rate, semantic similarity hit rate, false-match rate, cache update/invalidation counts
  - Routing layer: per-pool call distribution, routing jitter count, degradation fallback count
  - Cost layer: average token consumption per request, average inference cost per request, monthly inference cost
  - Performance layer: average response latency, P50/P95/P99 latency, peak QPS capacity
- Visualization Dashboard:
  - Grafana real-time monitoring panels with filtering by time range, business scene, and model pool
- Threshold Alerting:
  - Alerts are automatically triggered when: total cache hit rate falls below a configurable threshold, false-match rate exceeds a configurable threshold, monthly inference cost exceeds budget, or P99 latency exceeds a configurable threshold
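The four alerting conditions can be written down as one plain evaluation function. Metric names, threshold keys, and the sample values below are illustrative; in production the inputs would come from the Prometheus metrics listed above:

```python
def check_alerts(metrics, thresholds):
    """Return the names of every alert whose configured threshold is breached."""
    fired = []
    if metrics["cache_hit_rate"] < thresholds["min_hit_rate"]:
        fired.append("cache_hit_rate_low")
    if metrics["false_match_rate"] > thresholds["max_false_match"]:
        fired.append("false_match_rate_high")
    if metrics["monthly_cost"] > thresholds["budget"]:
        fired.append("cost_over_budget")
    if metrics["p99_ms"] > thresholds["max_p99_ms"]:
        fired.append("p99_latency_high")
    return fired

metrics = {"cache_hit_rate": 0.68, "false_match_rate": 0.004,
           "monthly_cost": 25_000, "p99_ms": 1100}
thresholds = {"min_hit_rate": 0.70, "max_false_match": 0.01,
              "budget": 30_000, "max_p99_ms": 1500}
# Only the hit-rate threshold is breached in this sample
assert check_alerts(metrics, thresholds) == ["cache_hit_rate_low"]
```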
4. Production Pitfalls and Solutions
4.1 Semantic Cache Stampede
- Symptom: Bulk business rule updates before a major shopping festival triggered a full cache invalidation, causing all requests to hit the LLM simultaneously, resulting in GPU OOM errors and a cascading service failure.
- Root Cause: Full cache invalidation had no smooth transition; the instantaneous request spike exceeded the model pool's capacity.
- Solution:
  - Adopt a gradual invalidation strategy during business rule updates: invalidate a configurable percentage of cache entries per day rather than all at once
  - Pre-warm the Top 10,000 high-frequency query cache entries before any full invalidation
  - Automatically activate the vLLM batch inference pool during full invalidation windows to absorb the surge
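The gradual-invalidation step amounts to splitting the affected keys into daily batches. A minimal sketch, assuming a 20%-per-day default and a made-up key naming scheme matching the convention from section 3.1.2:

```python
def invalidation_batches(keys, percent_per_day=20):
    # Size each daily batch as a fixed percentage of the affected keys,
    # so the cache drains gradually instead of all at once
    batch = max(1, len(keys) * percent_per_day // 100)
    return [keys[i:i + batch] for i in range(0, len(keys), batch)]

keys = [f"semantic_cache:aftersale:{i}" for i in range(100)]
batches = invalidation_batches(keys)
assert len(batches) == 5                     # 20% per day -> spread over 5 days
assert sum(len(b) for b in batches) == 100   # every key eventually invalidated
```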
4.2 Tiered Model Routing Jitter
- Symptom: A user asked "How do I turn on the smart bulb?" (simple → Ollama) followed 10 seconds later by "There's a quality issue with the smart bulb — analyze the refund process based on the after-sales policy" (complex → DeepSeek). The model switch mid-conversation degraded the user experience.
- Root Cause: Routing decisions were based solely on the current query, without considering the user's conversation history.
- Solution:
  - Add a historical-conversation complexity assessment before routing: if the user recently asked a complex query, route the current query preferentially to DeepSeek-R1
  - Within a configurable time window, keep the same user's conversation on one routing model to prevent frequent switching
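The sticky-window behavior can be sketched as follows. The model names follow the routing rules above; the window length, the strength ordering, and the injectable clock (which makes the logic testable) are illustrative assumptions:

```python
import time

# Higher value = stronger model; a conversation may escalate but never downgrade
STRENGTH = {"ollama_small_model": 0, "deepseek_r1_private": 1}

class StickyRouter:
    """Within a configurable window, a user's conversation never drops to a
    weaker model than one it already used; pins expire after the window."""
    def __init__(self, window_s=300.0, clock=time.monotonic):
        self.window_s, self.clock = window_s, clock
        self._pins = {}  # user_id -> (model, last_seen)

    def route(self, user_id, proposed):
        now = self.clock()
        model, seen = self._pins.get(user_id, (proposed, now))
        if now - seen >= self.window_s:
            model = proposed                  # pin expired: start fresh
        elif STRENGTH[proposed] > STRENGTH[model]:
            model = proposed                  # escalate to the stronger model
        self._pins[user_id] = (model, now)
        return model

t = [0.0]
r = StickyRouter(window_s=300, clock=lambda: t[0])
assert r.route("u1", "ollama_small_model") == "ollama_small_model"
t[0] = 10
assert r.route("u1", "deepseek_r1_private") == "deepseek_r1_private"  # escalates
t[0] = 20
assert r.route("u1", "ollama_small_model") == "deepseek_r1_private"   # stays sticky
t[0] = 400
assert r.route("u1", "ollama_small_model") == "ollama_small_model"    # pin expired
```

Escalation is one-directional on purpose: switching up to DeepSeek mid-conversation is fine, but dropping back down is exactly the jitter that degraded the user experience in the incident above.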
4.3 Over-Compression Causing Semantic Loss
- Symptom: Aggressive conversation history summarization caused the model to forget that "the product in Order #123 was purchased during the 618 festival and is eligible for additional compensation," leading to answers that violated business rules.
- Root Cause: The generic summarization Prompt did not enforce retention of e-commerce-specific business fields (order numbers, purchase timestamps, promotional activities).
- Solution:
  - Customize the summarization Prompt framework for the e-commerce customer service context, explicitly requiring retention of order numbers, purchase timestamps, promotional activities, and product quality issues
  - Add a post-compression business-field validation check: if any required field is missing, re-compress or fall back to retaining the full conversation history
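The post-compression validation check can be sketched with regex probes. The two field patterns shown are illustrative stand-ins for the production field extractors, and this version falls back to the full history rather than re-compressing:

```python
import re

# Illustrative patterns for required business fields
REQUIRED_PATTERNS = {
    "order_number": r"Order\s*#\d+",
    "promotion": r"\b618\b|Double\s*11",
}

def validate_summary(summary: str, original: str) -> str:
    # If a field present in the original history vanished from the summary,
    # fall back to the full history rather than risk a rule-violating answer
    for pattern in REQUIRED_PATTERNS.values():
        if re.search(pattern, original, re.I) and not re.search(pattern, summary, re.I):
            return original
    return summary

original = "The product in Order #123 was purchased during the 618 festival."
# Summary lost the order number and promotion -> fall back to full history
assert validate_summary("User reports a quality issue.", original) == original
# Summary preserved both fields -> keep the compressed form
kept = validate_summary("Order #123, bought during 618, quality issue.", original)
assert kept.startswith("Order #123")
```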
5. End-to-End Effectiveness Validation
We sampled 10,000 user queries from real production logs (6,000 simple, 2,000 structured, 2,000 complex) and conducted a 7-day live production validation during the 618 shopping festival. Key quantitative results are as follows:
5.1 Core Metrics: Before vs. After Optimization
| Metric | Before (Pure DeepSeek-R1) | After (Three-Layer Optimization) | Change |
|---|---|---|---|
| Total Cache Hit Rate | N/A (no cache) | 75% | +75% |
| Cost Reduction Attributed to Tiered Routing | N/A (no routing) | ~20% | — |
| Avg. Token Consumption per Request | 1,200 tokens | 840 tokens | -30% |
| Avg. Inference Cost per Request | ¥0.014 | ¥0.0042 | -70% |
| Avg. Response Latency (ms) | 1,500 | 800 | -46.7% |
| P99 Latency (ms) | 3,500 | 1,200 | -65.7% |
| Peak Concurrency Capacity (QPS) | 500 | 1,500 | +200% |
| Monthly Inference Cost | ¥72,000 | ¥21,600 | -70% |
| False-Match Rate | N/A (no cache) | 0.9% | Acceptable |
5.2 Live Production Validation
- Cache hit rate: 7-day average held steady at 73%–77% with no significant fluctuation
- Cost: During the 618 festival (7 days), inference costs dropped from an expected ¥16,800 to ¥5,040 — in line with projections
- Latency: 99.9% of requests completed in under 1.5s; no timeouts or cascading failures at peak load (1,200 QPS)
- User satisfaction: Based on post-conversation ratings on a 5-point scale, satisfaction improved from 4.6/5 to 4.8/5, with zero complaints attributable to routing or caching behavior
6. Differentiation: Our Production-Grade Advantages
Compared to general-purpose open-source optimization solutions (e.g., LangChain Cache, native vLLM routing), our three-layer full-pipeline architecture delivers four key advantages in enterprise e-commerce customer service deployments:
| Dimension | General Open-Source Solutions | Our Three-Layer Architecture |
|---|---|---|
| Scene Adaptability | Generic use cases, no industry customization | Deep adaptation to e-commerce customer service: customized semantic cache, tiered routing, and Prompt compression |
| Full-Pipeline Coordination | Single optimization modules requiring manual integration | Dual-layer cache + tiered routing + Prompt compression working in concert for compounding cost reduction |
| Production Stability | Basic functionality only; monitoring, alerting, and fallback must be self-implemented | Complete production-grade monitoring, alerting, graceful degradation, jitter protection, and stampede prevention |
| Stack Integration | Requires custom integration with business systems | Fully inherits the technology stack from the previous six parts — no core architecture refactoring required |
Core value: Our solution is not a simple assembly of isolated optimization modules. It is a complete enterprise-grade optimization system ready for direct production deployment — genuinely solving the three critical requirements of deployability, stability, and meaningful cost reduction.
7. Deployment Boundaries and Series Continuity
7.1 Deployment Boundaries
This three-layer full-pipeline optimization architecture is deeply adapted to e-commerce customer service scenarios. Deployments in heavily regulated industries such as healthcare or finance will require adjustments to cache content policies, routing rules, and Prompt compression strategies to meet industry-specific compliance requirements. Full production deployment also requires customized integration with your business system's monitoring, alerting, and fallback infrastructure.
7.2 Series Continuity
- GitHub Repository: Link TBD
- Backward References: Builds on the MVP architecture, data pipeline, GraphRAG service layer, multi-agent workflow, safety guardrail system, and hybrid knowledge retrieval system from Parts 1–6, completing the production-grade cost and performance optimization layer
- Coming Up — Part 8: The series finale. A complete retrospective covering every architectural decision from MVP to production, a full post-mortem of pitfalls encountered, and a consolidated record of quantifiable outcomes — forming a complete end-to-end engineering practice reference. Stay tuned.