How I Built a Private Knowledge Base with LangChain + FastAPI — and the 3 Pitfalls That Cost Me 8 Hours

#python #programming

At 1:30 AM, my phone went crazy. The ops chat exploded: “The knowledge base QA endpoint is timing out — users are already cursing us.” I opened Grafana and saw P99 latency soaring to 34 seconds with a 40% error rate. I had confidently launched this LangChain‑based RAG system two weeks ago. The Colab demo ran buttery smooth, but moving it to production caused a total meltdown. Over the next eight hours, I peeled back LangChain’s elegant abstractions and uncovered three critical issues that can instantly kill your service.

Problem Breakdown: The Galaxy‑Sized Gap Between Demo and Production

Our use case is typical: ingest thousands of internal technical documents and runbooks into a vector store, then let engineers ask natural language questions — like “How to troubleshoot MySQL replication lag?” or “What are the steps to scale a Redis cluster?”

The pipeline is straightforward: user question → vector retrieval of relevant document chunks → prompt assembly → LLM generates an answer. In the local demo, with few docs and the model running in‑process, everything was peaceful.
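
To make the four stages concrete, here is a minimal sketch of the flow with the retriever and the LLM passed in as plain callables. The names `PROMPT_TEMPLATE`, `build_prompt`, and `answer_question` are illustrative, not taken from the actual service, and the template wording is an assumption.

```python
from typing import Callable, List

# Hypothetical prompt template; the real service's wording is not shown in the article.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)


def build_prompt(question: str, chunks: List[str]) -> str:
    """Prompt assembly: place all retrieved chunks into a single prompt."""
    context = "\n---\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


def answer_question(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """question -> retrieval -> prompt assembly -> generation."""
    chunks = retrieve(question)              # vector retrieval (stand-in callable)
    prompt = build_prompt(question, chunks)  # prompt assembly
    return generate(prompt)                  # LLM call (stand-in callable)
```

Keeping retrieval and generation as injected callables makes each stage easy to time and to replace, which matters once the retriever and the LLM live on different machines.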

Once in production, three problems hit us at once:

Document count jumped from 10 to 3,000, turning retrieval from milliseconds into seconds and frequently returning irrelevant chunks.
The LLM was on a separate GPU cluster — network overhead plus generation latency meant every query took 15–20 seconds. Users had long switched to other tasks.
Memory kept growing under concurrent users with zero concurrency protection, causing an OOM crash after two hours.
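
For the third problem, one common guard (not necessarily the one this service ended up with) is a semaphore that caps how many queries are in flight at once. A minimal sketch, where `BoundedRunner`, `MAX_IN_FLIGHT`, and the limit of 8 are hypothetical:

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

MAX_IN_FLIGHT = 8  # hypothetical limit; tune against available memory


class BoundedRunner:
    """Caps the number of concurrently running jobs and records the peak."""

    def __init__(self, limit: int) -> None:
        self._sem = asyncio.Semaphore(limit)
        self.active = 0
        self.peak = 0

    async def run(self, job: Callable[[], Awaitable[str]]) -> str:
        async with self._sem:  # callers wait here once the limit is reached
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await job()
            finally:
                self.active -= 1


async def demo(n_requests: int = 50, limit: int = MAX_IN_FLIGHT) -> Tuple[List[str], int]:
    runner = BoundedRunner(limit)

    async def fake_query() -> str:
        await asyncio.sleep(0.01)  # stands in for retrieval plus generation
        return "ok"

    results = await asyncio.gather(*(runner.run(fake_query) for _ in range(n_requests)))
    return list(results), runner.peak
```

Requests beyond the limit queue up instead of allocating memory all at once, which trades some latency under load for a bounded footprint.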

A typical synchronous Flask setup would fail here: each slow query holds a worker for its full 15–20 seconds, so a handful of them is enough to stall everyone else. Worse, many LangChain docs only show the “happy path” and never warn you how things explode in an async production environment.

Architecture Decisions: Why This Stack and Not Others

We settled on: FastAPI + LangChain + Chroma + vLLM + Redis.

Why not Flask?

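Flask is synchronous by default. Even with flask[async], each request still occupies a worker for its full duration, and most extensions remain synchronous. FastAPI is async‑native: with asyncio.to_thread (Python 3.9+) you can offload synchronous LangChain calls to a thread pool so the event loop stays free, which lets one process keep hundreds of requests in flight as long as the thread pool is sized for the blocking calls.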
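
A small sketch of the offloading pattern, using only the standard library. `slow_sync_call` is a hypothetical stand-in for a blocking LangChain or vector store call, and the 0.2 second sleep is arbitrary:

```python
import asyncio
import time
from typing import List, Tuple


def slow_sync_call(question: str) -> str:
    """Hypothetical blocking call (stand-in for a synchronous library call)."""
    time.sleep(0.2)
    return f"answer to: {question}"


async def ask(question: str) -> str:
    # Run the blocking function in the default thread pool so the
    # event loop can keep serving other coroutines in the meantime.
    return await asyncio.to_thread(slow_sync_call, question)


async def main() -> Tuple[List[str], float]:
    start = time.perf_counter()
    answers = await asyncio.gather(*(ask(f"q{i}") for i in range(5)))
    elapsed = time.perf_counter() - start
    return list(answers), elapsed
```

The five calls overlap instead of running back to back. Note that asyncio.to_thread uses the loop's default executor, whose worker count is capped, so offloaded calls beyond that cap wait for a free thread.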

Why not LangServe?

LangServe’s thick abstraction makes it tempting to spin up a RAG service with one line, but when it breaks, you can’t tell if the traceback points to your bug or its internal keep‑alive logic. Production demands transparency, not magic.

Why Chroma over Pinecone or Milvus?

We needed an on‑prem solution so data stays inside the network. Chroma’s single‑node mode handles hundreds of thousands of vectors with near‑zero ops overhead. When we outgrow it, migrating to Milvus is just swapping a LangChain vectorstore wrapper — minimal cost.

Architecture highlights:

Structure‑aware splitter that first splits by paragraph and table boundaries, then by token length.
Stuff chain + question rephrasing instead of ConversationalRetrievalChain, which is a memory black hole.
Two API paths: a regular /ask for background tasks and a streaming /ask/stream using Server‑Sent Events to stream tokens.
Redis caching of final answers for frequent questions — on cache hit, we skip the LLM entirely.
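
The caching item above follows the usual cache‑aside pattern. Below is a minimal sketch with an in‑memory stand‑in so it runs without a Redis server; `InMemoryTTLCache`, `cache_key`, `cached_answer`, the `kbqa:answer:` prefix, and the one‑hour TTL are all assumptions for illustration.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryTTLCache:
    """Stand-in for a Redis client: get/set with a per-key expiry."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (time.monotonic() + ttl_seconds, value)


def cache_key(question: str) -> str:
    """Normalize case and whitespace so trivially different phrasings share a key."""
    normalized = " ".join(question.lower().split())
    return "kbqa:answer:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cached_answer(
    question: str,
    cache: InMemoryTTLCache,
    generate: Callable[[str], str],
    ttl_seconds: float = 3600,
) -> str:
    key = cache_key(question)
    hit = cache.get(key)
    if hit is not None:
        return hit                   # cache hit: the LLM is skipped entirely
    answer = generate(question)      # cache miss: run the full pipeline
    cache.set(key, answer, ttl_seconds)
    return answer
```

With redis-py, the stand‑in's methods map roughly onto `get(key)` and `set(key, value, ex=ttl_seconds)`. Exact‑match keys only catch repeated questions; paraphrases still go to the LLM.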

Core Implementation: If It Doesn’t Run Out of the Box, It’s Not Good Code

1. Document Loading & Structure‑Aware Splitting

What this code solves: Traditional RecursiveCharacterTextSplitter blindly splits tables into fragments, stripping out critical data columns and making the LLM hallucinate. We split by paragraphs and table boundaries first to keep structured data intact.

# preprocess.py
from typing import List
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

def split_with_table_aware(docs: List[Document], chunk_size=1200, overlap=200):
    """Split protectively at table boundaries first, then split long text the usual way."""
    protected_docs = []
    for doc in docs:
        # Assumes tables in the documents start with '|---' or '+---+'.
        # Only the '|---' marker is handled here; '+---+' tables pass through unsplit.
        parts = doc.page_content.split("\n|---")
        for i, part in enumerate(parts):
            if i > 0:
                part = "|---" + part  # restore the table marker consumed by split()
            # Copy the metadata so chunks do not share one mutable dict.
            protected_docs.append(Document(page_content=part, metadata=dict(doc.metadata)))

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", " ", ""]
    )
    return text_splitter.split_documents(protected_docs)

# Load all .md and .txt documents. The loader's pathlib-style glob has no brace
# expansion, so "**/*.{md,txt}" would match nothing; load each pattern separately.
raw_docs = []
for pattern in ("**/*.md", "**/*.txt"):
    loader = DirectoryLoader("./docs", glob=pattern, loader_cls=TextLoader)
    raw_docs.extend(loader.load())
splitted = split_with_table_aware(raw_docs)
print(f"{len(raw_docs)} source documents split into {len(splitted)} chunks")
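
A caveat on the marker‑based split: in a Markdown pipe table the '|---' row is the second row, so splitting on it separates the header row from the body. As an alternative sketch (a different technique from the code above, not the author's method), the function below groups lines into blocks and keeps every run of consecutive lines starting with '|' together. The name `split_blocks_keep_tables` is hypothetical, and it assumes each table row begins with a pipe character.

```python
from typing import List


def split_blocks_keep_tables(text: str) -> List[str]:
    """Split text into blocks on blank lines, keeping each pipe table whole."""
    blocks: List[str] = []
    current: List[str] = []
    in_table = False

    def flush() -> None:
        # Emit the lines collected so far as one block.
        if current:
            blocks.append("\n".join(current))
            current.clear()

    for line in text.splitlines():
        is_table_row = line.lstrip().startswith("|")
        if not line.strip():
            flush()                  # blank line ends the current block
            in_table = False
        elif is_table_row != in_table:
            flush()                  # boundary between prose and table
            current.append(line)
            in_table = is_table_row
        else:
            current.append(line)
    flush()
    return blocks
```

Each table block, header row included, can then become its own Document and be exempted from length‑based splitting, while prose blocks go through the regular splitter.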

2. Vectorization & Async‑Wrapped Retriever

What this code solves: The Chroma client is synchronous. Calling it directly in an async route would block the event loop. We wrap it with asyncio.to_thread so the entire pipeline stays fully async.

# vectorstore.py
import asyncio
from langchain_community.embeddings import OpenAIEmbeddings  # or another OpenAI-compatible local embe
