1. Introduction: 8 Weeks from Zero to Production
When I set out to build an enterprise-grade AI customer service system for e-commerce, the goal was never to ship a "toy demo that runs on my laptop." The real objective was to deliver a stable, secure, and cost-efficient production service — one that could handle peak traffic during major shopping festivals, meet data privacy compliance requirements, and significantly reduce the rate of human agent escalations.
Over 8 weeks, I took this system from zero to a stable production deployment through continuous iteration. This article is the capstone of a 7-part technical series, offering a complete view of how a production-grade LLM system is architected, iterated, and hardened — from a single-agent MVP to a multi-agent, cost-optimized, safety-compliant production service. The full series articles and GitHub repository are linked at the end for deep dives into each module.
1.1 Final System Architecture
The production system is built on a three-layer decoupled architecture, internally subdivided into Application, Feature, Technology, Model, Data, and Infrastructure layers — fully separating the underlying platform from the upper business logic. This design enabled rapid MVP delivery while supporting the full scope of production-grade iteration.
┌─────────────────────────────────────────────────────────────────────────────┐
│ LLM Application Architecture Layer │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Application Layer │ │
│ │ · User Service (Login / Register) · Session Service │ │
│ │ · Knowledge Base Service │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Feature Layer │ │
│ │ · Multi-Agent Architecture · Safety Guardrails │ │
│ │ · Text2Cypher Debug · Offline/Online Index Construction │ │
│ │ · Hybrid Knowledge Retrieval │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LLM Technology Architecture Layer │
│ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Agent │ │ RAG │ │ Workflow │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ LangChain / LangGraph / Microsoft GraphRAG │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Vue / FastAPI / SSE / Open API │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LLM Platform Architecture Layer │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Model Layer │ │
│ │ · DeepSeek Online Model · vLLM Model Deployment │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Data Layer │ │
│ │ · MySQL · Redis · Neo4j · Memory · Local Disk · LanceDB │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Infrastructure Layer │ │
│ │ · Cloud Server · GPU Server · Docker Platform │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
2. Architecture Evolution: 4 Iterations from MVP to Production
One of the core differences between a senior LLM engineer and a junior one is the discipline to resist over-engineering on day one — and instead iterate progressively, solving the most critical pain point at each stage. Here is the complete evolution of this system:
System Architecture Evolution Overview
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
v0.1 MVP (Week 1) v0.5 Knowledge Graph (Weeks 2–3)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌─────────────────┐ ┌──────────────────────────────────┐
│ User Input │ │ User Input │
└────────┬────────┘ └──────────────┬───────────────────┘
│ │
┌────────▼────────┐ ┌──────────────▼───────────────────┐
│ Single-Agent │ │ Single-Agent Dialogue │
│ Dialogue │ └──────────────┬───────────────────┘
└────────┬────────┘ │
│ ┌──────────────▼───────────────────┐
┌────────▼────────┐ │ Vector Retrieval (LanceDB) │
│ Vector Search │ │ + Graph Reasoning │
│ (LanceDB) │ │ (Neo4j / GraphRAG) ★ │
└────────┬────────┘ └──────────────┬───────────────────┘
│ │
┌────────▼────────┐ ┌──────────────▼───────────────────┐
│ CLI Output │ │ CLI Output │
└─────────────────┘ │ Data Pipeline: MinerU+LitServe ★ │
│ Chunking: Dynamic-Aware Split ★ │
✗ No structured queries └──────────────────────────────────┘
✗ Accuracy: 70%
✗ No API interface ✗ No automated incremental indexing
✗ No safety / cost controls ✗ Internal testing only, not released
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
v1.0 Multi-Agent + API (Weeks 4–5) v2.0 Production-Grade (Weeks 6–8)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌─────────────────────────┐ ┌──────────────────────────────────┐
│ User Input │ │ User Input │
└───────────┬─────────────┘ └──────────────┬───────────────────┘
│ │
┌───────────▼─────────────┐ ┌──────────────▼───────────────────┐
│ RESTful API Layer ★ │ │ RESTful API Layer │
└───────────┬─────────────┘ └──────────────┬───────────────────┘
│ │
┌───────────▼─────────────┐ ┌──────────────▼───────────────────┐
│ Intent Routing Agent ★ │ │ 3-Layer Safety Guardrails ★ │
└──────┬────┬─────┬───────┘ │ Input → Execution → Output │
│ │ │ └──────────────┬───────────────────┘
┌────▼─┐ ┌▼────┐ ┌▼──────┐ │
│Tool │ │KB │ │Safety │ ┌──────────────▼───────────────────┐
│Call │ │Search│ │Guard │ │ Intent Routing Agent │
│Agent │ │Agent │ │Agent │ └──────┬──────┬──────┬─────────────┘
└────┬─┘ └──┬──┘ └┬──────┘ │ │ │
└───────┴─────┘ ┌────▼─┐ ┌──▼──┐ ┌▼──────┐
│ │Tool │ │KB │ │Safety │
┌───────────▼─────────────┐ │Call │ │Search│ │Guard │
│ Hybrid Knowledge Base │ │Agent │ │Agent │ │Agent │
│ Vector+GraphRAG+ │ └────┬─┘ └──┬──┘ └┬──────┘
│ Text2Cypher │ └───────┴──────┘
└───────────┬─────────────┘ │
│ ┌──────────────▼───────────────────┐
┌───────────▼─────────────┐ │ Semantic Cache Layer ★ │
│ Streaming Response ★ │ │ Tiered Model Routing ★ │
└─────────────────────────┘ │ (Small model / LLM auto-switch) │
└──────────────┬───────────────────┘
✗ No production safety compliance │
✗ Cost overrun risk at scale ┌───────────▼───────────────────┐
│ Streaming + Monitoring & │
│ Alerting │
└────────────────────────────────┘
✓ Accuracy: 94% ✓ Cost reduced 70%
✓ 99.9% availability ✓ 1500 QPS
★ marks the key new components introduced in each version
v0.1 MVP (Week 1): A Functional Baseline
Core capability: Pure vector retrieval + single-agent dialogue, capable of answering simple FAQ queries such as return policies and product specifications.
Core limitations:
- No support for structured data queries (orders, inventory, etc.)
- Answer accuracy only 70%, with frequent hallucinations
- CLI-only interface — no API, no integration with business systems
- No safety controls or cost optimization
v0.5 Knowledge Graph Upgrade (Weeks 2–3): Solving Structured Reasoning
Core upgrade: Introduced Microsoft GraphRAG to layer graph reasoning on top of vector retrieval. Built a multimodal PDF parsing pipeline with MinerU + LitServe, and implemented a heading-hierarchy-aware dynamic chunking strategy.
Problems solved:
- Enabled complex relational queries such as "Which supplier provided the item in Order #123?"
- Baseline accuracy improved to 78%
- Supports both PDF product manuals and CSV order/inventory data sources
Remaining gaps: Multimodal understanding of images and tables still limited; no automated incremental index update; this phase was internal testing only — not released externally.
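The heading-hierarchy-aware chunking strategy mentioned above can be sketched roughly as follows. This is a simplified illustration on markdown text, not the production pipeline (which operates on MinerU's structured PDF output); the function name and chunk format are placeholders:

```python
# Illustrative sketch of heading-hierarchy-aware chunking: split on headings
# and attach the full heading path to each chunk so retrieval keeps context.
def chunk_by_headings(markdown: str) -> list[dict]:
    chunks, path, body = [], [], []

    def flush():
        # Emit the accumulated body under the current heading path.
        if body:
            chunks.append({"path": " > ".join(path), "text": "\n".join(body).strip()})
            body.clear()

    for line in markdown.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            del path[level - 1:]              # drop headings at this level or deeper
            path.append(line.lstrip("# ").strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the heading path with each chunk is what makes this "dynamic-aware": a chunk from a deeply nested section still carries its full document context at retrieval time.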
v1.0 Multi-Agent + API Release (Weeks 4–5): Full Feature Closure
Core upgrade: Built a multi-agent orchestration framework with LangGraph, wrapped GraphRAG as a production-grade RESTful API, and automated incremental index management.
Problems solved:
- LangGraph-based multi-agent orchestration: an Intent Routing Agent dispatches to specialized agents — Tool Call Agent, KB Retrieval Agent, and Safety Guardrail Agent — each handling its own responsibility
- Full automation of index updates and query operations via API, enabling business system integration
- Streaming responses implemented for real-time conversational UX
Remaining gaps: Pre-release testing revealed missing production-grade safety compliance; load testing exposed cost overrun risks at scale — both became the core focus of v2.0.
v2.0 Production-Grade Stable Release (Weeks 6–8): Safety and Cost at Scale
Core upgrade: Introduced a 3-layer safety guardrail system, deployed semantic caching + tiered model routing for cost optimization, and completed full-pipeline performance tuning and load testing.
Final production capabilities:
- Full-pipeline safety and compliance for enterprise-grade scenarios
- Significant inference cost reduction with no degradation in answer quality
- Stable support for peak traffic during major shopping festivals
- Comprehensive monitoring and alerting; 99.9% service availability (validated under load test conditions)
3. Three Core Architecture Decisions
Any production-grade system is ultimately a collection of trade-off decisions. These three decisions were the foundation of this system's successful delivery — each backed by a clear business rationale, an explicit comparison of alternatives, quantifiable outcomes, and a detailed explanation of why popular alternatives were rejected.
Decision 1: Replacing Pure Vector Retrieval with a Hybrid Knowledge Base (GraphRAG + Vector + Text2Cypher)
| Approach | Strengths | Weaknesses |
|---|---|---|
| Pure vector retrieval | Simple to implement, low latency | Poor performance on structured/relational queries; accuracy only 70% |
| Pure GraphRAG | Strong multi-hop reasoning | Inefficient for simple FAQ queries; high latency and operational cost |
| Hybrid knowledge base (chosen) | Combines all three capabilities; smart routing selects the optimal retrieval path | Higher implementation complexity; requires maintaining multiple indexes |
The core data flow of this hybrid architecture is shown below, covering the full pipeline for both structured CSV data and unstructured PDF data in the e-commerce customer service context:
┌──────────────────────────┐
│ Customer Service Agent │
└─────────────┬────────────┘
│ RESTful API
▼
┌─────────────────┐ ┌──────────────────────────┐
│ Backend Data │ │ │
│ Management │───▶│ Microsoft GraphRAG │
│ System │ │ │
│ (Add/Incremental│ └──────┬──────┬─────┬──────┘
│ Update via │ │ │ │
│ RESTful API) │ │ │ │
└─────────────────┘ │ │ │
│ │ │
┌────────────────────┘ │ └──────────────────────┐
│ Natural Language Data │ Non-NL Data │ Multimodal Data
▼ ▼ ▼
┌──────────────────┐ ┌───────────────────┐ ┌─────────────────────┐
│ MySQL │ │ Knowledge Graph │ │ MinerU │
│ │ │ (Neo4j) │ └──────────┬──────────┘
│ [CSV Data] │ │ │ │
│ · Product Data │ │ · Graph-based │ ▼
│ · Order Data │ │ Structured │ ┌─────────────────────┐
│ · Logistics │ │ Knowledge │ │ LitServe │
│ · User Data │ └───────────────────┘ │ │
└──────────────────┘ │ [PDF Data] │
│ · E-commerce │
┌─────────────────────────────────────────── │ Product Manuals │
│ Vector Data │ Parquet Data └─────────────────────┘
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ Vector Store │ │ Parquet Data │
│ (LanceDB) │ │ (Local Disk) │
└──────────────────┘ └──────────────────┘
Diagram: Hybrid knowledge base core data flow — covering multi-source data parsing, GraphRAG index construction and retrieval, and the full upstream customer service agent pipeline.
Why we chose the hybrid architecture:
In e-commerce customer service, 70% of user queries are either structured data requests (orders, inventory) or relational questions (product-supplier relationships). Pure vector retrieval cannot reliably handle these cases — it produces frequent hallucinations and off-topic responses, leading to low user satisfaction and high human escalation rates.
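The routing idea can be sketched as follows. The category names and keyword heuristics here are illustrative stand-ins, not the production classifier (which in this system sits behind the Intent Routing Agent):

```python
# Hypothetical sketch of hybrid retrieval routing: classify the query, then
# dispatch to the cheapest retrieval path that can answer it.
import re

def classify_query(query: str) -> str:
    """Pick a retrieval path for an incoming customer-service query."""
    # Relational / multi-hop questions -> GraphRAG
    if re.search(r"supplier|which .* (provided|supplies)", query, re.IGNORECASE):
        return "graphrag"
    # Structured lookups (orders, inventory) -> Text2Cypher against the graph DB
    if re.search(r"order\s*#?\d+|inventory|stock", query, re.IGNORECASE):
        return "text2cypher"
    # Everything else (FAQ-style) -> plain vector retrieval
    return "vector"
```

The key property is that simple FAQ traffic never pays GraphRAG's latency and cost, while relational queries never fall back to vector search, where they would hallucinate.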
Alternatives we evaluated and rejected:
- Rejected AutoGen/CrewAI in favor of LangGraph for agent orchestration: AutoGen and CrewAI excel at open-ended multi-agent collaboration, but are poorly suited to the deterministic workflows and strict safety controls required in customer service. LangGraph is lower-level and highly customizable — it allows safety guardrails and circuit breakers to be embedded directly into every node of the workflow, which is exactly what a production-grade system with strong control requirements demands.
- Rejected Amazon Neptune/NebulaGraph in favor of Neo4j: Neptune is a cloud-native managed service that cannot meet private deployment compliance requirements. NebulaGraph offers strong distributed capabilities, but its operational complexity far exceeds Neo4j for mid-scale knowledge graphs, and its Python tooling ecosystem is significantly less mature. Neo4j's single-node performance fully meets our business scale, supports private deployment, and has a well-established Cypher ecosystem — the best return on investment.
- Rejected other open-source GraphRAG implementations in favor of Microsoft's official GraphRAG: Community lightweight GraphRAG projects have lower deployment costs, but show a notable gap in long-document community detection and multi-hop reasoning quality compared to the Microsoft official version. The official version is actively maintained, offers a complete API interface, and provides the long-term stability needed for production-grade iteration.
Quantified outcomes (based on 500-sample annotated test set, internal evaluation):
- Answer accuracy improved from 70% to 94%
- Scenario coverage improved from 60% to 98%
- Human escalation rate reduced by approximately 75%
Full implementation details: Series Article 6 — "Full-Pipeline Closure: Hybrid Knowledge Base and Capability Integration"
Decision 2: Replacing Pure Prompt-Based Defense with a 3-Layer Full-Pipeline Safety Guardrail System
| Approach | Strengths | Weaknesses |
|---|---|---|
| Pure prompt defense | Simplest to implement | Only 30% injection attack interception rate in red team testing; easily bypassed |
| Output-layer filtering only | Can block non-compliant content | Cannot prevent unauthorized operations from occurring at the execution layer |
| 3-layer full-pipeline guardrails (chosen) | Closed-loop protection across input → execution → output | Higher implementation complexity; adds ~50ms latency |
Why we chose full-pipeline guardrails:
In an enterprise customer service system, compliance and data security are non-negotiable. A single data breach or unauthorized order operation can trigger regulatory penalties and irreversible brand damage. Pure prompt defense is architecturally incapable of meeting production-grade security requirements.
Alternatives we evaluated and rejected:
We rejected open-source guardrail solutions such as Guardrails AI and NVIDIA NeMo Guardrails in favor of a custom-built full-pipeline system. The core reason: these open-source solutions are general-purpose tools that cannot be deeply integrated with our multi-agent workflow and hybrid knowledge base architecture. E-commerce customer service also requires extensive custom business rule validation (e.g., order status checks, after-sales time window validation) — a custom system embeds these rules directly into every guardrail layer, delivering far superior precision and adaptability compared to any general-purpose solution.
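The shape of the three layers can be sketched as below. The rule contents and function names are hypothetical; the real layers call classifiers and business-rule services rather than hard-coded string checks:

```python
import re

def input_guard(query: str) -> str:
    """Layer 1: reject obvious injection attempts before any model call."""
    banned = ("ignore previous instructions", "system prompt")
    if any(p in query.lower() for p in banned):
        raise PermissionError("blocked at input layer")
    return query

def execution_guard(action: dict, user_id: str) -> dict:
    """Layer 2: verify the agent's planned action against business rules."""
    if action.get("user_id") != user_id:
        raise PermissionError("blocked at execution layer")
    return action

def output_guard(answer: str) -> str:
    """Layer 3: redact sensitive data before the answer leaves the system."""
    return re.sub(r"\b\d{11}\b", "[REDACTED]", answer)  # e.g. mask phone numbers
```

The point of the layering is that each check catches what the previous one cannot: input filtering misses novel injections, but an injected instruction still cannot act outside the user's permissions at layer 2, and anything that leaks through is redacted at layer 3.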
Quantified outcomes (based on internal red team testing covering 50 attack vectors):
- Malicious attack interception rate: 95%
- Full compliance with regulatory requirements including China's Personal Information Protection Law (PIPL)
Full implementation details: Series Article 5 — "Compliance at the Core: Production-Grade LLM Safety Guardrail Architecture"
Decision 3: Replacing Pure LLM Inference with Semantic Caching + Tiered Model Routing
| Approach | Strengths | Weaknesses |
|---|---|---|
| Pure cloud LLM inference | Highest answer quality | Costs scale rapidly and become unsustainable |
| Single local small model | Extremely low cost | Poor performance on complex reasoning; cannot handle after-sales disputes or multi-hop queries |
| Semantic cache + tiered routing (chosen) | Balances cost and quality; repeated queries hit cache, complex reasoning uses LLM | Limited effectiveness during cache cold-start; threshold tuning required |
Why we chose this approach:
In customer service scenarios, 70% of user queries are repeated or semantically near-identical, so invoking full LLM inference for every single query is a massive waste of resources. This approach delivers substantial cost savings with no measured degradation in answer quality.
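A minimal sketch of the tiered-routing decision, with purely illustrative heuristics and model names (the real system routes on classified intent, and thresholds were tuned against the evaluation set):

```python
# Hypothetical tiered model routing: simple queries go to a small local model,
# complex reasoning goes to the large cloud model.
def pick_model(query: str, intent: str) -> str:
    complex_intents = {"after_sales_dispute", "multi_hop"}
    if intent in complex_intents or len(query.split()) > 40:
        return "deepseek-r1"   # large model for hard reasoning
    return "local-7b"          # small vLLM-served model for simple FAQs
```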
Alternatives we evaluated and rejected:
- Rejected full replacement with a local small model: We tested multiple 7B/14B open-source models. While they performed acceptably on simple FAQ queries, their accuracy on after-sales disputes, multi-hop relational queries, and complex rule comprehension was more than 30% lower than DeepSeek-R1 — which would significantly increase human escalation rates and ultimately raise total operational costs.
- Rejected a dedicated vector cache database in favor of Redis: A dedicated vector cache database offers stronger vector query performance, but our semantic cache uses a dual-layer design — exact match first, semantic match as fallback — and Redis fully meets our performance requirements. Redis is already a foundational component of our system architecture, so using it avoids introducing a new storage component and significantly reduces operational complexity and architectural redundancy.
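The dual-layer lookup described above (exact match first, semantic match as fallback) can be sketched as follows. A plain dict and list stand in for Redis, and `embed` is a stub for a real sentence-embedding model; the threshold and names are illustrative:

```python
import hashlib
import math

def embed(text: str) -> list[float]:
    # Stub: deterministic pseudo-embedding. Production uses a real embedding model.
    h = hashlib.sha256(text.lower().encode()).digest()
    return [b / 255 for b in h[:8]]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.exact: dict[str, str] = {}                    # layer 1: normalized query -> answer
        self.vectors: list[tuple[list[float], str]] = []   # layer 2: embedding -> answer
        self.threshold = threshold

    def get(self, query: str):
        key = query.strip().lower()
        if key in self.exact:                  # layer 1: exact hit, O(1)
            return self.exact[key]
        qv = embed(query)                      # layer 2: nearest semantic neighbor
        best = max(self.vectors, key=lambda e: cosine(e[0], qv), default=None)
        if best and cosine(best[0], qv) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, answer: str) -> None:
        self.exact[query.strip().lower()] = answer
        self.vectors.append((embed(query), answer))
```

In the real deployment both layers live in Redis (the exact layer as plain keys, the semantic layer as stored embeddings), which is why no additional vector cache database was needed.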
Quantified outcomes (based on load test environment + production traffic sampling):
- LLM inference cost reduced by approximately 70%
- Average response latency reduced by 46.7% (from 1500ms to 800ms)
- Repeated query cache hit rate: 72%
Full implementation details: Series Article 7 — "Production Optimization: Inference Cost and Performance Control"
4. Five Production Pitfalls (Problems You Only Hit in Real Deployments)
This section is what separates engineers who have actually run LLM systems in production from those who have only built demos. Below are the five most painful problems I encountered during this deployment, along with root causes, solutions, industry context, and the core lessons learned.
Pitfall 1: GPU Out-of-Memory (OOM) During GraphRAG Index Construction
Symptom: When processing PDF product manuals exceeding 100 pages, the GraphRAG index construction pipeline crashed outright — even on an A10G instance.
Root cause: The default pipeline loads all files into memory at once, with no batching or resource management.
Solution:
- Implemented batch processing: maximum 10 files per batch
- Explicitly called `torch.cuda.empty_cache()` after each batch to release VRAM and prevent fragmentation
- Added dynamic memory monitoring, with automatic GC triggered when usage exceeds a threshold
Industry context and key lesson: This issue has extensive discussion in Microsoft GraphRAG's GitHub Issues — many developers encounter OOM when processing documents over 100 pages, and the official pipeline still has no built-in batching mechanism. The lesson: production data pipelines must include resource management logic. You cannot rely on open-source default implementations to handle large-scale data safely.
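The batching fix can be sketched as below. `build_index_for` is a stand-in for the per-document GraphRAG indexing call, and the torch cleanup is skipped gracefully when torch isn't installed:

```python
import gc

def batched(items, size):
    """Yield fixed-size batches from a list of files."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def build_index(files, build_index_for, batch_size: int = 10):
    for batch in batched(files, batch_size):
        for f in batch:
            build_index_for(f)        # index one document
        gc.collect()                  # release Python-side references
        try:
            import torch
            torch.cuda.empty_cache()  # return cached VRAM to the allocator
        except ImportError:
            pass                      # CPU-only environment: nothing to release
```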
Pitfall 2: Semantic Cache False Matches
Symptom: Queries that are semantically similar but logically distinct — such as "What is the return policy?" and "What is the exchange policy?" — were matched to the same cached result, returning incorrect answers to users.
Root cause: Initial similarity threshold was too low (0.85), and there was no keyword-level fallback validation for core business terms.
Solution: Raised the similarity threshold to 0.9, and added keyword-based fallback rules: any query containing critical business keywords such as "return/exchange" or "refund/cancel" is never allowed to share a cached result, regardless of semantic similarity score.
Industry context and key lesson: This is one of the most common caching issues in production LLM deployments. In LangChain community discussions, over 40% of cache-related problems stem from business-semantic false matches. The lesson: semantic caching cannot rely on similarity scores alone — it must include business logic fallback rules to prevent incorrect matches at the application layer.
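The keyword fallback rule can be sketched as a veto on top of the similarity score. The keyword groups below are illustrative; the production rule set covers the full business vocabulary:

```python
# Queries whose critical business terms differ must never share a cache entry,
# regardless of how high the embedding similarity is.
CRITICAL_TERMS = [{"return"}, {"exchange"}, {"refund"}, {"cancel"}]

def critical_signature(query: str) -> frozenset:
    q = query.lower()
    return frozenset(i for i, group in enumerate(CRITICAL_TERMS)
                     if any(term in q for term in group))

def cache_hit_allowed(query: str, cached_query: str, similarity: float,
                      threshold: float = 0.9) -> bool:
    if similarity < threshold:
        return False
    # Business-logic fallback: mismatched critical terms veto the hit.
    return critical_signature(query) == critical_signature(cached_query)
```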
Pitfall 3: Multi-Agent Infinite Retry Loop Causing Service Deadlock
Symptom: When a tool call failed (e.g., database connection timeout), the agent would retry indefinitely, exhausting all system resources and ultimately crashing the service.
Root cause: No circuit breaker mechanism or retry limit was implemented in the agent workflow.
Solution: Added a circuit breaker to the LangGraph workflow: if any agent tool call fails more than 3 times within a single conversation turn, the task is immediately terminated and the user receives a friendly error message. Retries also use exponential backoff (1s → 2s → 4s).
Industry context and key lesson: This issue is frequently raised in LangGraph's official GitHub Issues and is consistently ranked among the top 3 pain points in multi-agent production deployments — many developers have experienced service cascades caused by unbounded retries. The lesson: production multi-agent systems must have failure handling logic built into every critical workflow path. You cannot assume every tool call will succeed.
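The retry-with-breaker logic can be sketched as follows. Names are illustrative; in the real system this wraps LangGraph tool nodes rather than bare functions:

```python
import time

class CircuitOpenError(Exception):
    """Raised when a tool has failed too many times in one conversation turn."""

def call_with_breaker(tool, *args, max_failures: int = 3, base_delay: float = 1.0,
                      sleep=time.sleep):
    failures = 0
    while True:
        try:
            return tool(*args)
        except Exception:
            failures += 1
            if failures >= max_failures:
                # Trip the breaker: fail fast instead of retrying forever.
                raise CircuitOpenError("tool failed repeatedly; aborting this turn")
            sleep(base_delay * 2 ** (failures - 1))  # exponential backoff: 1s, 2s, 4s, ...
```

Injecting `sleep` as a parameter also makes the backoff schedule trivially testable without real delays.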
Pitfall 4: Unauthorized Data Access via Text2Cypher
Symptom: Users could query other customers' order details by supplying a fabricated order number.
Root cause: The initial implementation only validated the syntactic correctness of generated Cypher queries — it did not verify whether the user had permission to access the requested resource.
Solution:
- All structured queries are bound to the current user's ID; every generated Cypher statement must include a `WHERE user_id = $current_user` clause
- Row-Level Security (RLS) enabled at the database layer
- A second permission check is performed before every query execution, forming a dual-layer defense
Industry context and key lesson: This is the #1 security risk in LLM applications that interface with structured data. OWASP Top 10 for LLM Applications explicitly lists unauthorized access as a critical risk. The lesson: never trust the LLM to generate permission-compliant queries. Access control must be enforced at both the query generation layer and the database execution layer — prompt-level constraints alone are never sufficient.
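The query-generation half of the dual defense can be sketched as a validation gate in front of Neo4j (RLS at the database layer is the second, independent check). The function name is illustrative:

```python
import re

def enforce_user_binding(cypher: str, user_id: str) -> dict:
    """Pass a generated Cypher query through only if it is scoped to the caller."""
    if not re.search(r"user_id\s*=\s*\$current_user", cypher):
        raise PermissionError("generated Cypher is not scoped to the current user")
    # Bind the parameter server-side; never trust an inline literal from the LLM.
    return {"query": cypher, "params": {"current_user": user_id}}
```

Binding `current_user` as a driver-level parameter (rather than letting the LLM interpolate a value) is what makes the check meaningful: even a query that names the right predicate cannot smuggle in someone else's ID.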
Pitfall 5: Request Timeouts Under Peak Concurrency
Symptom: Under load test conditions simulating 1000+ QPS peak traffic, 30% of requests timed out — while GPU/inference service resource utilization was only 40% (resources wasted on synchronous blocking).
Root cause: No async request queue optimization was in place, and streaming responses were not implemented. Each request was processed synchronously and independently, causing connection resource waste and high latency.
Solution: Implemented async request queuing + connection pool optimization, raising GPU/inference service resource utilization from 40% to 85%. Also implemented SSE-based streaming responses, reducing user-perceived time-to-first-token from 3s to under 500ms.
Industry context and key lesson: This is a pervasive problem in high-concurrency LLM service deployments, with extensive discussion in both the vLLM and FastAPI communities. The lesson: LLM service performance is never just about the model — it depends equally on how you optimize request scheduling and user experience under high-concurrency conditions.
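The async-queue idea can be sketched with a semaphore that bounds concurrent inference calls, so traffic bursts queue instead of timing out while the backend stays saturated. The limit and timings below are illustrative:

```python
import asyncio

async def bounded_gather(coros, limit: int = 8):
    """Run coroutines concurrently, but never more than `limit` at once."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:            # wait in the queue until a slot frees up
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))

async def demo():
    active = peak = 0

    async def fake_inference(i):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for a model call
        active -= 1
        return i

    results = await bounded_gather([fake_inference(i) for i in range(32)], limit=8)
    return results, peak
```

The streaming half of the fix (SSE time-to-first-token) is orthogonal: it doesn't reduce total latency, but it moves the first visible output from ~3s to under 500ms, which is what users actually perceive.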
5. Full-Pipeline Performance Metrics
The following metrics are drawn from two sources: load test environment (k6 simulated traffic) and internal evaluation set (500 annotated conversations). Data source is noted for each metric.
| Metric | Before Optimization (Baseline) | After Optimization (Production) | Improvement | Data Source |
|---|---|---|---|---|
| Answer accuracy | 70% | 94% | +24 pp | Internal annotated eval set (500 samples) |
| Scenario coverage | 60% | 98% | +38 pp | Internal annotated eval set (500 samples) |
| Inference cost per request | $0.002 | $0.0006 | -70% | Production traffic sampling (1,000 requests) |
| Average response latency | 1500ms | 800ms | -46.7% | k6 load test (500 concurrent users) |
| Peak concurrency supported | 500 QPS | 1500 QPS | +200% | k6 load test (step ramp-up) |
| Prompt injection interception rate | 30% | 95% | +65 pp | Internal red team test (50 attack vectors) |
| Service availability | 95% | 99.9% | +4.9 pp | k6 load test (72-hour stability run) |
6. Best Practices and Future Roadmap
5 Non-Negotiable Best Practices for Production LLM Systems
After 8 weeks of building and iterating, here are the five principles I consider non-negotiable for enterprise LLM application delivery:
- Start with an MVP, iterate progressively: Resist the urge to over-engineer on day one. Ship a working MVP first, then add complexity based on real pain points surfaced in production.
- Safety and compliance are foundations, not afterthoughts: Build full-pipeline safety guardrails before you go live. Prompt-based defenses alone will never meet production-grade security requirements.
- Every architecture decision must be data-driven: Don't choose a technology because it's trending. The only valid reason to choose it is that it solves a real business problem — with measurable, quantifiable results.
- Cost optimization is a core design concern, not a post-launch fix: LLM costs can spiral out of control as you scale. Caching and tiered routing must be designed into the system architecture from day one.
- Design for failure, not just for the happy path: Red team your system. Load test it. Build failure handling into every critical workflow. Production systems will break — your job is to make them fail gracefully.
Future Roadmap
This system is currently running stably in production, with clear directions for future iteration:
- Cross-industry adaptation: The core architecture requires only minor modifications to the retrieval and safety layers to support customer service scenarios in finance, healthcare, and education.
- Multimodal capability expansion: Adding image and voice query support, enabling users to send product fault photos and receive automated after-sales assistance.
- Reinforcement learning optimization: Using positive/negative user feedback to continuously optimize routing strategies, cache thresholds, and prompt templates — making the system smarter over time.
- Core module code is available in the GitHub repository — contributions and discussions are welcome.
7. Full Series, GitHub Repository, and Contact
This article is the capstone of the Production-Grade AI Customer Service System series. The links below provide deep dives into the implementation details of each module:
- Article 1: From Zero to One — Production-Grade AI Customer Service System Architecture Overview
- Article 2: Production-Grade GraphRAG Data Pipeline — From PDF to Knowledge Graph
- Article 3: GraphRAG Service Wrapping — Engineering from CLI to Enterprise API
- Article 4: Multi-Agent Architecture Design — Complex Task Handling with LangGraph
- Article 5: Compliance at the Core — Production-Grade LLM Safety Guardrail Architecture
- Article 6: Full-Pipeline Closure — Hybrid Knowledge Base and Capability Integration
- Article 7: Production Optimization — Inference Cost and Performance Control
- GitHub Repository (full production codebase): llm-customer-service, tag: `v2.0.0-production-ready`
About me: 10+ years of software engineering experience, 3+ years focused on LLM/AI application development. Core expertise: RAG/GraphRAG system design, multi-agent architecture, LLM cost optimization, and production-grade service delivery.
Comments
the semantic cache + tiered model routing combination is underrated. most teams only look at one or the other, but the real cost savings come from the layered effect: cache handles the repetitive queries so the expensive model never sees them. curious how you tuned the similarity threshold for cache hits — did you end up settling on a fixed cosine cutoff or something adaptive based on query complexity?
Good catch! Fixed threshold works fine for high-frequency repetitive queries, but we hit false positives pretty quickly on semantically similar but contextually different ones — classic example being queries that look alike but differ on a key condition.
Ended up going hybrid: aggressive caching for factual/lookup intents, more conservative for policy-type queries with an extra entity-level check, and skipping cache entirely for complaint/escalation — those are just too context-sensitive to cache safely.
Intent classification runs upstream of the cache layer on a lightweight local model, so the overhead stays minimal.
Still experimenting with adaptive thresholds but not convinced it's worth the added complexity in production yet.
Are you running this in front of a single model or a tiered stack?