1. Problem Background: 4 Core Production Pain Points in Enterprise AI Customer Service
Enterprise-grade AI customer service deployments consistently run into four pain points that open-source demos rarely address. Solving them is the core design goal of this project, and shaped the architectural principles I locked in from day one of the MVP stage.
1. Private Deployment & Data Compliance
Customer data, product manuals, and order information in e-commerce and finance are highly sensitive. Public cloud LLM APIs are simply not an option. Full local deployment and model privatization are mandatory — ensuring data never leaves the boundary and complying with data protection regulations. This is a prerequisite, not an optional feature.
→ This article's solution: Private deployment of DeepSeek via Ollama. Zero third-party API calls across the entire pipeline.
2. Performance Bottlenecks Under High Concurrency
Customer service traffic has sharp peaks and valleys. During major sales events, query volume can reach 10–20x the daily average. Traditional LLM services suffer from high response latency, session loss, and cascading failures — unable to guarantee stability under load.
→ This article's solution: FastAPI async architecture + Redis semantic cache, reducing high-frequency query response latency from 1.8s to 0.3s.
3. Multi-Source Knowledge Base Integration
Enterprise knowledge is scattered across structured CSV order/product data, unstructured PDF manuals/service agreements, and business system database interfaces. Traditional full-text search and basic vector retrieval fail to handle cross-page semantic associations and table/image content parsing.
→ This article's solution: Extension interfaces reserved at MVP stage; MinerU + GraphRAG + Neo4j hybrid knowledge base to be integrated in subsequent iterations.
4. Uncontrollable Inference Costs
Over 70% of customer service queries are high-frequency repetitive questions. Calling the LLM for every single query wastes GPU resources in private deployments and drives up API costs in cloud deployments — making operational costs completely unpredictable.
→ This article's solution: Redis semantic similarity cache reduces inference costs for high-frequency queries by 68%.
2-Week Delivery Timeline
| Phase | Timeline | Key Deliverables |
|---|---|---|
| Week 1 | Days 1–7 | Core infrastructure, FastAPI backend, MySQL/Redis storage, Ollama model deployment |
| Week 2 | Days 8–14 | LangChain Agent integration, semantic cache, JWT auth, Vue frontend, local deployment validation |
2. Architecture Overview: From MVP to Production-Grade Design
2.1 MVP Full-Stack Architecture
The core design principle of the MVP is: minimum viable loop validation, with seamless extensibility reserved for production-grade iteration — no over-engineering, no temporary hacks that cause future rewrites.
┌─────────────────────────────────────────────────────────┐
│ Frontend Interaction Layer │
│ Vue Chat Interface │
└──────────────────────────┬──────────────────────────────┘
│ HTTP / SSE
▼
┌─────────────────────────────────────────────────────────┐
│ Application Architecture Layer │
│ FastAPI Backend Service │
└──────────────────────────┬──────────────────────────────┘
│
┌────────────────┴─────────────────┐
│ │
▼ ▼
┌─────────────────────────┐ ┌───────────────────────────┐
│ LLM Technical Layer │ │ LLM Platform Layer │
│ │ │ │
│ ┌─────────────────┐ │ │ ┌─────────────────────┐ │
│ │ Session Mgmt │ │ │ │ Model Layer │ │
│ │ JWT Auth │ │ │ │ Ollama + DeepSeek-R1│ │
│ └─────────────────┘ │ │ │ Private Deployment │ │
│ │ │ └─────────────────────┘ │
│ ┌─────────────────┐ │ │ │
│ │ Dialogue Agent │ │ │ ┌─────────────────────┐ │
│ │ LangChain │ │ │ │ Data Layer │ │
│ └─────────────────┘ │ │ │ MySQL Persistent │ │
│ │ │ │ Storage + Redis │ │
│ ┌─────────────────┐ │ │ │ Cache │ │
│ │ Tool Invocation │ │ │ └─────────────────────┘ │
│ │ Web Search │ │ │ │
│ └─────────────────┘ │ │ ┌─────────────────────┐ │
│ │ │ │ Infrastructure │ │
│ ┌─────────────────┐ │ │ │ GPU Servers │ │
│ │ Semantic Cache │ │ │ │ + Docker Platform │ │
│ │ Redis │ │ │ └─────────────────────┘ │
│ └─────────────────┘ │ │ │
└─────────────────────────┘ └───────────────────────────┘
Each layer's responsibilities, from bottom to top, form a complete business support chain:
- Infrastructure Layer: The hardware foundation — GPU servers with Docker containerization, providing stable compute resources for private model inference.
- Model & Data Layer: The core foundation of the MVP. Ollama handles private deployment of the DeepSeek open-source model. MySQL handles user/session data persistence; Redis handles semantic caching and session management, balancing performance and storage cost.
- Core Technical Layer: FastAPI powers the async backend service; LangChain implements the dialogue agent and tool-calling framework, providing standardized technical capabilities to upper layers.
- Application Service Layer: Encapsulated into three service types — user service, session service, and dialogue service — delivering five core capabilities: user authentication, session management, dialogue inference, tool invocation, and cache optimization.
- Frontend Interaction Layer: Vue-based UI providing chat interface and user login. SSE streaming responses replicate the real-time ChatGPT-style conversation experience.
2.2 Production Target Architecture & MVP Boundary
The ultimate goal of this series is to iterate toward an enterprise-grade, production-ready customer service system; the complete target architecture was designed up front.
Components marked as grayed-out in the architecture diagram (GraphRAG, Neo4j, LanceDB, MinerU multimodal parsing, LangGraph multi-agent architecture, three-layer safety guardrails, vLLM inference service) are planned for v1.0+ production iterations. Extension interfaces have been reserved in the MVP architecture. The MVP currently delivers a complete loop based on basic text Q&A + Ollama private deployment.
2.3 MVP Core Data Flow
The MVP has fully validated the core data flow pipeline. The production version will extend this to handle multi-source data processing.
- User initiates a conversation → JWT authentication + session context validation.
- Request hits the Redis semantic cache layer first — if a matching high-frequency answer exists, return immediately, skipping model inference.
- On cache miss, the dialogue agent determines whether to invoke the web search tool to supplement time-sensitive information beyond the model's knowledge cutoff.
- DeepSeek (privately deployed) handles inference → SSE streaming response returned to user → session history persisted + cache updated.
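The data flow above can be sketched as a cache-first pipeline. This is a minimal illustrative stand-in, not the project's actual code: the `SemanticCache` here is a toy exact-match dict in place of Redis, `infer` stands in for the agent/LLM call, and tool routing and SSE streaming are elided.

```python
# Illustrative sketch of the MVP request pipeline: cache-first lookup,
# inference on miss, then history persistence and cache warm-up.
from dataclasses import dataclass, field

@dataclass
class SemanticCache:
    """Toy exact-match stand-in for the Redis semantic cache."""
    store: dict = field(default_factory=dict)

    def get(self, query):
        return self.store.get(query)

    def put(self, query, answer):
        self.store[query] = answer

def handle_query(query, cache, infer, history):
    # 1) Cache-first: return immediately on a hit, skipping model inference.
    hit = cache.get(query)
    if hit is not None:
        return hit, True
    # 2) Cache miss: run model inference (tool routing elided in this sketch).
    answer = infer(query)
    # 3) Persist session history and warm the cache for future hits.
    history.append((query, answer))
    cache.put(query, answer)
    return answer, False
```

The second occurrence of the same query is served from the cache without touching the model, which is exactly where the latency and cost savings come from.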
3. Tech Stack Decisions: MVP Architecture Trade-offs
The core logic behind every tech decision: prioritize closing the loop fast at MVP stage, while reserving seamless extensibility for production iteration. Every choice involved multi-option comparison and production-scenario fit analysis — not chasing trending tools.
3.1 Backend Framework: FastAPI
Alternatives considered: Flask, Django
Final choice: FastAPI. Key reasons:
- Native async support — perfectly suited for LLM streaming responses and long-latency inference. Far outperforms Flask under high concurrency.
- Auto-generates OpenAPI documentation — significantly reduces frontend-backend integration and third-party system onboarding costs. Meets enterprise-grade engineering standards.
- Built-in type hints and data validation — reduces parameter errors and interface exceptions in production at the code level. Fully compatible with LangChain, LangGraph, and the broader LLM toolchain ecosystem.
3.2 Model Deployment: Ollama (with vLLM adapter reserved for production)
Alternatives considered: vLLM, native Transformers
Final choice: Ollama for MVP, with a seamless vLLM switchover path reserved for production. Key reasons:
- Extremely low deployment friction — a single command downloads, deploys, and runs DeepSeek-R1 and other mainstream open-source models, compressing the MVP validation cycle from one week to one day.
- Built-in multi-GPU load balancing, model quantization, and VRAM optimization — no custom low-level adapter code needed to meet baseline private deployment performance requirements.
- Standard OpenAI-compatible API — switching to vLLM or online models later requires zero changes to core business logic. No technical debt introduced.
Why not vLLM at MVP stage? vLLM delivers stronger high-concurrency performance, but comes with significantly higher deployment complexity and environment setup cost. The MVP goal is to validate the private deployment loop fast — not to optimize for peak throughput. Ollama delivers the best ROI at this stage.
3.3 Storage Architecture: MySQL + Redis
Final choice: MySQL for persistent storage, Redis for caching and session management — the most battle-tested, lowest-ops-overhead storage combination for enterprise applications.
- MySQL: Persists user data, session history, and knowledge base metadata. Transaction support guarantees data consistency in enterprise scenarios. Also sets up the foundation for future Text2SQL structured data queries.
- Redis: Handles active session memory caching, semantic similarity caching, and rate limiting — solving response latency under high concurrency. Implements hot/cold session separation: active sessions in Redis, historical sessions persisted to MySQL, balancing performance and storage cost.
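The fixed-threshold semantic cache strategy can be sketched as follows. This is illustrative only: an in-memory list stands in for Redis, and the `embed` function (query to vector) is a caller-supplied assumption rather than a specific embedding model.

```python
# Fixed-threshold semantic similarity cache (the MVP strategy):
# embed the query, find the most similar cached entry by cosine
# similarity, and reuse its answer if similarity clears the threshold.
import math

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed          # callable: query -> vector
        self.threshold = threshold  # fixed similarity cutoff
        self.entries = []           # [(vector, answer)], Redis stand-in

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        v = self.embed(query)
        best = max(self.entries, key=lambda e: self._cosine(v, e[0]),
                   default=None)
        if best and self._cosine(v, best[0]) >= self.threshold:
            return best[1]  # cache hit: skip model inference
        return None

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

A production version would store vectors in Redis with TTLs and tune the threshold per scenario; as noted in Section 5.2, the MVP deliberately keeps a single fixed threshold.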
3.4 Core Capability Reservations: LangGraph + GraphRAG
Status: Tech selection validated and extension interfaces reserved at MVP stage. Full implementation in production version.
- LangGraph: Compared to CrewAI and Swarm, LangGraph is lower-level, more flexible, and more extensible. It handles multi-agent workflow orchestration and iterative execution loops — perfectly suited for complex task decomposition in customer service scenarios. Currently the most widely adopted agent orchestration framework in production environments.
- GraphRAG: Addresses the fundamental limitations of traditional vector retrieval in long-document and cross-section semantic association scenarios. Entity and relationship extraction combined with community detection enables deep semantic understanding — ideal for processing PDF product manuals and service agreement documents in customer service use cases.
4. MVP Feature Delivery
The MVP's core objective was to close the full business loop and validate the feasibility of the core technical approach. Five core features were delivered and fully validated through local deployment:
- Streaming Dialogue: FastAPI dialogue endpoint with SSE streaming response — replicating real-time ChatGPT-style conversation experience and ensuring responsiveness for user queries.
- Function Calling + Web Search: External tool invocation framework with web search support — addressing knowledge cutoff limitations and expanding the Q&A boundary of the customer service system.
- Semantic Similarity Cache: Redis-based semantic cache for high-frequency repetitive queries — reusing inference results to address the uncontrollable inference cost pain point.
- Standardized Database Schema: MySQL schema covering user table, session table, and message table — persisting user data and conversation history to ensure session context continuity.
- User Authentication & Authorization: JWT-based user login, registration, and authentication — establishing baseline user permission control that meets enterprise-grade security requirements.
Core compliance achievement: The MVP delivers full local deployment. From user conversation to model inference to data storage — zero third-party API calls, zero data leaving the boundary. Fully satisfies baseline enterprise data compliance requirements.
5. MVP Validation Results & Iteration Roadmap
5.1 Validation Results
Tested against 1,000 real e-commerce customer service conversations (covering product inquiry, order query, and after-sales policy — 1 to 8 dialogue turns per conversation).
Test environment: Dual RTX 4090 GPU server, 32GB RAM. Inference model: DeepSeek-R1:14B 4-bit quantized.
Key results:
- All 5 core features fully functional. Complete flow validated: user login → initiate conversation → tool invocation → result returned. Full private local deployment, zero third-party API dependency.
- Semantic cache hit rate: 72% on the 70% high-frequency repetitive query subset of the 1,000-conversation corpus. Corresponding per-request inference cost reduction: 68%. Average response latency reduced from 1.8s → 0.3s.
- Locust load test: 50 concurrent continuous dialogue sessions. Service ran stably with zero crashes and zero session loss. Average response latency < 2s, P99 latency < 5s. Meets the daily customer service load requirements of small-to-medium e-commerce businesses.
5.2 MVP Simplifications
To validate the core loop quickly, the MVP intentionally simplified several areas — these are the primary optimization targets for subsequent iterations:
- Semantic cache uses a fixed-threshold basic matching strategy — no scenario-specific threshold tuning, hot/cold data separation, or automated cache invalidation.
- Function calling supports only a single web search tool — no multi-tool collaboration or complex task decomposition.
- Knowledge base supports basic text Q&A only — no PDF, CSV, or other multi-source structured/unstructured data ingestion.
5.3 Core Production Bottlenecks
The MVP validated the core approach, but three bottlenecks remain that cannot be patched incrementally — these are the focus of subsequent articles in this series:
- High-concurrency performance ceiling: The baseline architecture shows latency spikes and stability degradation at 100+ concurrent sessions. The async FastAPI foundation is already in place — adding vLLM continuous batching, a request queue, and circuit breakers will complete the full-stack performance optimization without requiring a core architecture rewrite.
- Insufficient multi-source data and long-document support: Currently limited to basic text Q&A. Cannot handle PDF long documents, table/image multimodal data, or complex CSV structured queries. MinerU + GraphRAG will address this in the next iteration.
- Missing production-grade security and compliance: No prompt injection protection, privilege escalation prevention, or hallucination validation. Does not meet enterprise compliance requirements. A three-layer full-stack safety guardrail system with red team testing will be built in a later iteration.
5.4 Series Roadmap
Each subsequent article addresses one of the MVP's core bottlenecks, following the evolution path: v0.1 MVP → v0.5 Knowledge Graph Upgrade → v1.0 Multi-Agent + API Release → v2.0 Production-Grade Stable
- Part 2: Production-Grade GraphRAG Pipeline — From PDF Parsing to Knowledge Graph (v0.5 iteration)
- Part 3: GraphRAG Service Encapsulation — From CLI to Enterprise API (v0.5 → v1.0 iteration)
- Part 4: Multi-Agent Architecture — Complex Task Handling & Fault Tolerance with LangGraph (v1.0 iteration)
- Part 5: Compliance Core — Production-Grade LLM Safety Guardrail System (v1.0 → v2.0 iteration)
- Part 6: Full-Stack Closure — Hybrid Knowledge Base & System Capability Completion (v2.0 iteration)
- Part 7: Production Optimization — LLM Inference Cost & Performance Control (v2.0 iteration)
6. Scope & Series Navigation
6.1 Scope
This MVP architecture is designed for basic text Q&A scenarios in small-to-medium e-commerce businesses. Healthcare, finance, and other regulated industries will need to adapt data segmentation rules and security policies to their specific requirements. Production-grade iteration requires adding the GraphRAG hybrid knowledge base, LangGraph multi-agent orchestration, and a three-layer safety guardrail system.
6.2 Series Navigation
- GitHub repository (MVP complete source code): llm-customer-service, tag: v0.1.0-mvp
- Previous article: This is Part 1 — no prerequisites.
- Next up — Part 2: Tackling the "insufficient multi-source data and long-document support" bottleneck head-on. Full implementation of the MinerU + GraphRAG + Neo4j hybrid knowledge base data pipeline. Stay tuned.
- Series finale: Part 8 will provide a complete retrospective of every architecture decision, lessons learned, and quantifiable outcomes from MVP to production — a full end-to-end engineering practice record.