
James Lee
# From 0 to MVP in 2 Weeks: Building a Production-Grade AI Customer Service System

## 1. Background: Four Core Production-Grade Pain Points of Enterprise AI Customer Service

Enterprise AI customer service projects consistently run into four production-grade pain points that open-source demos do not address. These pain points define the core design goals of this project and the architectural principles I anchored from the MVP stage:

  1. Mandatory Private Deployment & Compliance: Sensitive data such as customer data, product manuals, and order information in e-commerce, finance, and other industries cannot be connected to public cloud LLM APIs. Full-process local deployment and private model deployment are required to ensure data stays within the domain and complies with regulatory requirements like the Personal Information Protection Law — this is a prerequisite for project implementation, not an optional feature.
  2. Performance Bottlenecks in High-Concurrency Scenarios: Customer service traffic is highly bursty. During major promotions, consultation volume can reach 10-20 times the daily baseline, and traditional LLM services often suffer from high response latency, session loss, and cascading failures under this load, failing to guarantee stability at high concurrency.
  3. Adaptation Challenges for Multi-Source Knowledge Bases: Enterprise customer service knowledge is scattered across multiple data sources — structured CSV order/product data, unstructured PDF product manuals/service agreements, and database interfaces of business systems. Traditional full-text search and basic vector retrieval cannot solve problems such as lost cross-page semantic associations and failed parsing of table/image content.
  4. Uncontrollable Inference Costs: More than 70% of consultations in customer service scenarios are high-frequency repetitive questions. Invoking a large model for every query wastes GPU resources in private deployments and inflates public cloud API bills, leaving enterprise operating costs completely uncontrollable.

The core goal of this project is to first complete a full closed-loop validation of "private deployment - user dialogue - tool invocation - cost optimization" through the MVP version, and then gradually iterate into a production-ready system, rather than building a toy demo that only runs locally.


## 2. Architecture Overview: Complete Design from MVP to Production Grade

### 2.1 Full MVP Architecture

The core design principle of the MVP version is: validate the minimal closed loop while reserving seamless expansion paths for production-grade iteration, rejecting both over-engineering and throwaway solutions that would force later refactoring. The full architecture is as follows:

Figure 1: MVP Version Full Architecture — Five-Layer Design from Infrastructure to Frontend

Core Responsibilities of Each Layer (forming a complete business support link from bottom to top):

  1. Infrastructure Layer: The hardware foundation of the project, based on GPU servers and Docker containerized deployment, providing stable computing resources for private model inference.
  2. Model & Data Layer: The core foundation of the MVP version, implementing private deployment of DeepSeek open-source models via Ollama, using MySQL for user/session data persistence and Redis for semantic cache and session management, balancing performance and storage costs.
  3. LLM Technical Architecture Layer: Builds an asynchronous backend service based on FastAPI, implements dialogue agents and tool invocation frameworks via LangChain, providing standardized technical capabilities for upper-layer businesses.
  4. Application Service Layer: Encapsulates three standardized interfaces (User Service, Session Service, Dialogue Service), implementing five core business capabilities: user authentication, session management, dialogue inference, tool invocation, and cache optimization.
  5. Frontend Interaction Layer: A user interface built with Vue, providing chat windows, user login, and other visual functions, and delivering a ChatGPT-style real-time chat experience via SSE streaming responses.

### 2.2 Boundary Between the Production-Grade Target Architecture and the MVP

The ultimate goal of this series is to iterate toward an enterprise-grade, production-ready intelligent customer service system. The complete target architecture has already been designed top-down (see the optimized "Production-Grade Target Full Architecture Diagram").

Gray semi-transparent components in the architecture diagram (GraphRAG, Neo4j, LanceDB, MinerU multimodal parsing, LangGraph multi-Agent architecture, three-layer safety guardrail system, vLLM inference service) are planned for v1.0+ production iterations. The MVP architecture diagram only presents the core components currently implemented, with the core closed-loop based on basic text Q&A + Ollama private deployment.

### 2.3 Core Data Flow of the MVP Version

The MVP version fully validates the basic data flow, and the production-grade version will extend it with multi-source data processing. The core data flow is as follows:

  1. A user initiates a dialogue request, which first undergoes JWT identity authentication and session context verification;
  2. The request then enters the Redis semantic cache layer, which matches it against cached answers to high-frequency repetitive questions. On a hit, the cached result is returned directly, skipping model inference;
  3. If the cache is missed, it enters the dialogue agent to determine whether to call the web search tool to supplement time-sensitive content beyond the LLM's knowledge cutoff date;
  4. Finally, the privately deployed DeepSeek model is called for inference, returned to the user via SSE streaming response, and session history persistence and cache update are completed at the same time.
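The four steps above can be condensed into a runnable sketch. Every helper here is a stand-in (the function names and stub bodies are illustrative, not the project's real services), but the control flow mirrors the MVP data flow:

```python
# Minimal stand-ins: each helper below is illustrative, standing in for a
# real service (JWT verification, Redis semantic cache, search tool, Ollama).
def verify_jwt(token: str) -> dict:
    return {"user": "demo"}

def load_session(user: dict, session_id: str) -> list:
    return []

_cache: dict[str, str] = {}

def semantic_cache_lookup(question: str):
    return _cache.get(question)

def semantic_cache_store(question: str, answer: str) -> None:
    _cache[question] = answer

def needs_web_search(question: str) -> bool:
    return "today" in question  # toy heuristic for time-sensitive queries

def web_search(question: str) -> str:
    return "search results"

def call_llm(history: list, context: str, question: str) -> str:
    return f"answer to: {question}"

def persist(session_id: str, question: str, answer: str) -> None:
    pass

def handle_request(user_token: str, session_id: str, question: str) -> str:
    user = verify_jwt(user_token)                  # step 1: auth + session context
    history = load_session(user, session_id)
    cached = semantic_cache_lookup(question)       # step 2: semantic cache
    if cached is not None:
        return cached
    context = web_search(question) if needs_web_search(question) else ""  # step 3
    answer = call_llm(history, context, question)  # step 4: private DeepSeek inference
    semantic_cache_store(question, answer)
    persist(session_id, question, answer)
    return answer
```

A repeated question returns straight from the cache branch without ever reaching `call_llm`, which is exactly the cost lever described in section 1.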

## 3. Technical Selection: Architectural Decisions and Tradeoffs in the MVP Stage

The core logic of the technical selection is: prioritize speed to a working closed loop in the MVP stage while reserving seamless expansion paths for production-grade iteration. Every choice is based on multi-option comparison and production-scenario fit, not on chasing popular technologies.

### 3.1 Backend Framework: FastAPI

Alternatives considered: Flask, Django
Final selection: FastAPI, for three core reasons:

  1. Native support for asynchronous programming, a natural fit for streaming responses and long-running inference in LLM dialogues, with far better performance under high concurrency than Flask;
  2. Automatically generates OpenAPI specification documents, greatly reducing the cost of subsequent front-end and back-end joint debugging and third-party system integration, meeting the engineering requirements of enterprise-level projects;
  3. Built-in type hints and data validation capabilities, reducing parameter errors and interface anomalies in the production environment at the code level, and perfectly compatible with the LLM tool chain ecosystem such as LangChain and LangGraph.

### 3.2 Model Deployment: Ollama (with a vLLM Adaptation Path Reserved for Production)

Alternatives considered: vLLM, native Transformers
Final selection: Ollama for the MVP stage, with a seamless switch to vLLM reserved for production. Core decision reasons:

  1. Extreme deployment convenience: one command to download, deploy, and run mainstream open-source models such as DeepSeek-R1, shortening the MVP verification cycle from one week to one day;
  2. Built-in multi-GPU load balancing, model quantization, and VRAM optimization, meeting the baseline performance requirements of private deployment without writing low-level adaptation code;
  3. Provides a standard OpenAI-compatible API, so no core business code needs to change when later switching to vLLM or a hosted model, avoiding technical debt entirely.

> Why not use vLLM directly in the MVP stage? While vLLM handles high concurrency better, its deployment complexity and environment adaptation cost are higher. The core goal of the MVP stage is to quickly verify the private deployment closed loop, not to chase peak performance; Ollama is the most cost-effective choice.
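The OpenAI-compatible surface is what makes the later swap cheap. As an illustration, the same request builder can target Ollama or vLLM just by changing the base URL; the ports below are the defaults for each server, and the model tag is an example:

```python
import json

# Default local endpoints: Ollama serves its OpenAI-compatible API on
# port 11434, and a vLLM OpenAI server typically listens on port 8000.
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
}

def build_chat_request(backend: str, model: str, user_message: str):
    """Return (url, body) for an OpenAI-compatible chat completion call.

    Because both backends expose the same API surface, switching between
    them only changes the base URL, never the request shape.
    """
    url = f"{BACKENDS[backend]}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": True,  # stream tokens for the SSE frontend
    })
    return url, body
```

Business code holds only the `backend` key in configuration, so promoting the service from Ollama to vLLM is a one-line config change.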

### 3.3 Storage Architecture: MySQL + Redis

Final selection: MySQL for persistent storage and Redis for caching and session management. This is the most mature, lowest-operations storage combination for enterprise applications. Core adaptation logic:

  1. MySQL: Used to persist user information, session history, and knowledge base metadata, supporting transaction features to ensure data consistency in enterprise scenarios, and adapting to subsequent Text2SQL structured data query requirements;
  2. Redis: Used for in-memory caching of active sessions, semantic similarity caching, and request throttling, solving response latency issues in high-concurrency scenarios. It also implements session hot-cold separation — active sessions are stored in Redis, and historical sessions are persisted to MySQL, balancing performance and storage costs.
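The hot-cold separation in point 2 can be sketched with plain dictionaries standing in for Redis and MySQL; the TTL value and function names are illustrative, not the project's real schema:

```python
import time

HOT_TTL = 1800  # seconds an active session stays "hot" (illustrative value)

# In-memory stand-ins: `hot` plays the role of Redis (timestamp + history),
# `cold` plays the role of the MySQL session-history table.
hot: dict[str, tuple[float, list]] = {}
cold: dict[str, list] = {}

def load_session(session_id: str) -> list:
    """Return session history, promoting cold sessions into the hot store."""
    entry = hot.get(session_id)
    if entry and time.time() - entry[0] < HOT_TTL:
        return entry[1]                       # hot hit: serve from memory
    history = cold.get(session_id, [])        # cold fallback: read persistence
    hot[session_id] = (time.time(), history)  # promote back to the hot store
    return history

def append_message(session_id: str, message: dict) -> None:
    """Append a message, refreshing the hot copy and writing through to cold."""
    history = load_session(session_id)
    history.append(message)
    hot[session_id] = (time.time(), history)
    cold[session_id] = history                # write-through persistence
```

Active dialogues are always served from memory, while an expired session costs exactly one cold read before becoming hot again, which is the performance/storage balance the layer is designed for.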

### 3.4 Capability Reservation: LangGraph + GraphRAG

Selection note: in the MVP stage these are validated as choices and given reserved expansion interfaces only; full implementation lands in the production-grade version. Core decision reasons:

  1. LangGraph: Compared with frameworks such as CrewAI and Swarm, LangGraph is lower-level, more flexible, and more scalable, cleanly expressing multi-Agent workflow orchestration and cyclic iterative execution to match the complex task decomposition needs of customer service. It is currently the most widely adopted Agent orchestration framework in production;
  2. GraphRAG: Solves the fatal flaw of traditional vector retrieval in long documents and cross-chapter association scenarios. Through entity and relationship extraction and community detection, it realizes deep semantic understanding, perfectly adapting to long text processing requirements such as PDF product manuals and agreement documents in customer service scenarios.

## 4. MVP Feature Implementation

The core goal of the MVP version is to complete a full business closed loop and verify the feasibility of the core technical approach. Five core features are implemented, all deployed locally and verified as working:

  1. Basic Streaming Dialogue Capability: Implements a dialogue interface on FastAPI with SSE streaming responses, delivering a ChatGPT-style real-time chat experience for user consultations;
  2. Function Call for Web Search: Implements an external tool invocation framework supporting web search, solving timeliness issues caused by the LLM's knowledge cutoff date and expanding the Q&A boundary of customer service;
  3. Semantic Similarity Caching: Implements a basic version of semantic cache based on Redis, reusing inference results for high-frequency repetitive questions and initially solving the pain point of uncontrollable inference costs;
  4. Standardized Database Design: Designs core data structures such as user tables, session tables, and message tables based on MySQL, realizing persistent storage of user data and dialogue history and ensuring the continuity of session context;
  5. User Authentication & Authorization System: Implements JWT-based user login, registration, and authentication functions, completing basic control of user permissions and meeting the basic security requirements of enterprise-level systems.
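The semantic cache in point 3 boils down to: embed the incoming question, compare it against cached question vectors, and return the stored answer when similarity clears a fixed threshold. A stdlib-only sketch follows, with a toy bag-of-words embedding standing in for a real sentence encoder and an illustrative 0.85 threshold:

```python
import math
import re

THRESHOLD = 0.85  # fixed similarity threshold, as in the MVP's basic version
_cache: list[tuple[dict, str]] = []  # (question vector, cached answer) pairs

def embed(text: str) -> dict[str, float]:
    # Toy bag-of-words "embedding"; production would use a sentence encoder
    # and store vectors in Redis rather than a Python list.
    vec: dict[str, float] = {}
    for word in re.findall(r"\w+", text.lower()):
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def lookup(question: str):
    """Return a cached answer if any stored question is similar enough."""
    qvec = embed(question)
    for vec, cached_answer in _cache:
        if cosine(qvec, vec) >= THRESHOLD:
            return cached_answer
    return None

def store(question: str, answer: str) -> None:
    _cache.append((embed(question), answer))
```

Rephrasings of the same high-frequency question land above the threshold and reuse the stored answer, while genuinely new questions fall through to model inference.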

Core Compliance Achievement: The MVP version runs fully on local infrastructure. From user dialogue and model inference to data storage, there are no third-party API calls and no data ever leaves the environment, fully meeting the baseline requirements of enterprise data compliance.


## 5. MVP Validation and Iteration Plan

### 5.1 Validation Results

A full functional and performance test of the MVP was conducted against 1,000 real-world e-commerce customer service conversation logs, covering three core scenarios (product consultation, order query, and after-sales policy) with 1-8 dialogue rounds per conversation.
Test environment: dual RTX 4090 GPU server, 32 GB RAM, inference backend DeepSeek-R1:14B with 4-bit quantization.
Core Verification Results:

  • All core functions are 100% available, completing the full process of "user login → initiate dialogue → tool invocation → result return", supporting private local deployment without third-party API dependencies;
  • Roughly 70% of the 1,000 logged consultations were high-frequency repetitive questions; against these, the semantic cache hit rate reached 72%, cutting per-request inference cost by 68% and reducing average response latency from 1.8s to 0.3s;
  • Using the Locust load testing framework to simulate 50 concurrent continuous dialogue requests, the service runs stably without crashes or session loss, with an average response latency <2s and a 99th percentile latency <5s, meeting the daily customer service needs of small and medium-sized e-commerce enterprises.

### 5.2 Simplifications in the MVP Stage

To validate the core closed loop quickly, the MVP stage deliberately simplifies several areas rather than pursuing production-grade completeness. These are also the core optimization directions for subsequent iterations:

  1. The semantic cache only implements a basic version of matching logic with a fixed threshold, without scenario-based threshold tuning, hot-cold data separation, or automated cache update and invalidation mechanisms;
  2. Function Call only supports a single web search tool, without multi-tool collaboration or complex task decomposition capabilities;
  3. The knowledge base only supports basic text Q&A, without accessing multi-source structured/unstructured data such as PDFs and CSVs.

### 5.3 Core Production-Grade Bottlenecks of the MVP

The MVP version verified the feasibility of the core solution, but there are still three core bottlenecks that cannot be solved by minor fixes before enterprise-level production implementation, which are also the core iteration directions of subsequent articles in this series:

  1. Performance Bottlenecks in High-Concurrency Scenarios: Above roughly 100 concurrent sessions, the current architecture shows surging response latency and degraded stability. The asynchronous FastAPI foundation was completed in the MVP stage, so subsequent iterations only need to plug in vLLM for continuous batching and add request queues and circuit breakers to finish full-link performance optimization, without reworking the core architecture.
  2. Insufficient Support for Multi-Source Data and Long Documents: Currently only supports basic text Q&A, unable to handle complex query requirements for long PDF documents, table/image multimodal data, and CSV structured data. Subsequent iterations will build a complete multimodal data pipeline through MinerU+GraphRAG to solve this core pain point.
  3. Lack of Production-Grade Safety and Compliance Capabilities: There is no three-layer safety guardrail system (prompt-injection protection, unauthorized-operation interception, hallucination verification), which falls short of enterprise compliance requirements. Subsequent iterations will build this three-layer, full-link guardrail system and validate it with red team testing to reach production-grade compliance.

### 5.4 Planned Follow-Up Articles in This Series

Subsequent articles in this series will target the MVP's core bottlenecks, walking through the full iteration from demo to production-grade system along the evolution route "v0.1 MVP → v0.5 Knowledge Graph Upgrade → v1.0 Multi-Agent + API Release → v2.0 Production-Grade Stable Version":

  1. Article 2: Production-Grade GraphRAG Data Pipeline: Full-Link Construction from PDF Parsing to Knowledge Graph (corresponding to v0.5 version iteration)
  2. Article 3: GraphRAG Service Encapsulation: Engineering Transformation from CLI to Enterprise-Level API (corresponding to v0.5→v1.0 version iteration)
  3. Article 4: Multi-Agent Architecture Design: Complex Task Processing and Fault Tolerance Mechanism Based on LangGraph (corresponding to v1.0 version iteration)
  4. Article 5: Compliance Core: Production-Grade LLM Application Safety Guardrail System (corresponding to v1.0→v2.0 version iteration)
  5. Article 6: Full-Link Closing: Hybrid Knowledge Base and System Capability Closed Loop (corresponding to v2.0 version iteration)
  6. Article 7: Production Optimization: LLM Inference Cost and Performance Control (corresponding to v2.0 version iteration)

## 6. Code and Series Closed Loop

  • GitHub Repository Full MVP Code: [Repository Link, to be replaced with actual address], corresponding Tag: v0.1.0-mvp
  • All articles in this series will be expanded based on the production-grade target architecture in this article, with each article corresponding to a version Tag in the repository, forming a complete content and code closed loop;
  • The series closing article (Article 7) will fully review the full architectural decisions, pitfall reviews, and quantifiable results from MVP to production-grade systems, forming a complete end-to-end engineering practice record as a full testament to technical capabilities.
