
Yu-Chen, Lin

How I Built a RAG App from Scratch to Production in 5 Days

Introduction

"I hear about RAG a lot lately, but how do you actually build one?"

This 5-day development log started from that very question. In it, I share what I learned going from zero knowledge to a RAG application running in a production environment.

Completed App: https://github.com/oharu121/rag-demo

What I Built

I built a RAG application that allows users to ask questions about internal documentation using natural language.

Key Features:

  • Streaming responses like ChatGPT
  • Document upload and management
  • Citation display for answers (with filenames and line numbers)
  • Multi-turn conversation

Tech Stack:

| Layer | Technology | Reason for Selection |
| --- | --- | --- |
| Frontend | Next.js 16 / React 19 | App Router + RSC support |
| Backend | FastAPI | Asynchronous processing & type safety |
| Embedding Model | multilingual-e5-large | High accuracy for Japanese text |
| Vector DB | Chroma | Low learning curve & OSS |
| LLM | Gemini 2.0 Flash | Free tier available |
| Deployment | Vercel + HF Spaces | Production-ready for free |

Day 1: Understanding the Basics of RAG

What is RAG?

RAG (Retrieval-Augmented Generation) = Retrieval + Generation

User Question
    ↓
[Retrieval] Fetch relevant documents from Vector DB
    ↓
[Generation] Ask LLM: "Please answer using these materials"
    ↓
Answer with evidence
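
To make the flow concrete, here is a minimal Python sketch of the same two steps, using Chroma with its default embedding function; `call_llm` is a placeholder for whatever LLM client you use, not code from the actual app.

# Minimal RAG loop: retrieve from the vector DB, then generate with the LLM
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(
    ids=["vacation-1"],
    documents=["Employees receive 20 days of paid vacation per year."],
)

def answer(question: str) -> str:
    # [Retrieval] fetch the most relevant chunks
    hits = collection.query(query_texts=[question], n_results=3)
    context = "\n".join(hits["documents"][0])

    # [Generation] ask the LLM to answer using only the retrieved material
    prompt = f"Answer using only these materials:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)  # placeholder for a Gemini (or other) client call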

Why is RAG Necessary?

Problems with standalone LLMs:

  1. Knowledge Cutoff: They don't know information past their training date.
  2. No Access to Private Data: The LLM has never seen your company's documents.
  3. Hallucinations: They confidently tell lies.

RAG solves these issues by "reinforcing with search."

The Importance of Chunking

The first thing I stumbled upon was chunking (text splitting).

The problem with chunk_overlap = 0:

"Mr. Tanaka lives in Tokyo. He is | an engineer."
                          ↑ If split here
                          It becomes unclear who "He" refers to.

Lesson: A chunk_overlap of 10-20% of the chunk_size is recommended to prevent context fragmentation.
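
A minimal chunking sketch with overlap, using LangChain's RecursiveCharacterTextSplitter (one common choice; the post doesn't name the exact splitter used in the app):

# Split with ~15% overlap so sentences that straddle a chunk boundary stay intact
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "Mr. Tanaka lives in Tokyo. He is an engineer. " * 100  # sample document

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # characters per chunk
    chunk_overlap=75,  # 10-20% of chunk_size
)
chunks = splitter.split_text(text)
print(len(chunks), chunks[1][:60])  # the second chunk starts with repeated context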

How Embeddings Work

I wondered, "Why can't we just use simple character code conversion?"

# Naive approach - Cannot capture meaning
"cat" -> [99, 97, 116]
"dog" -> [100, 111, 103]
# "cat" and "feline" would end up in completely unrelated positions!

# Embedding Model - Captures meaning
"cat"      -> [0.82, 0.15, -0.34, ...]
"feline"   -> [0.79, 0.18, -0.31, ...]  # Close!
"airplane" -> [-0.45, 0.67, 0.12, ...]  # Far

Embedding models are trained so that "words used in similar contexts result in similar vectors."
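
Here is a small sketch of that idea with sentence-transformers, assuming the intfloat/multilingual-e5-large checkpoint from the tech stack (E5 models expect "query: " / "passage: " prefixes):

# Similar meanings -> nearby vectors; cosine similarity makes that measurable
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large")
vectors = model.encode(["query: cat", "passage: feline", "passage: airplane"])

print(util.cos_sim(vectors[0], vectors[1]))  # cat vs. feline   -> high
print(util.cos_sim(vectors[0], vectors[2]))  # cat vs. airplane -> low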

Day 2-3: Backend Implementation

Implementing Streaming API with FastAPI

To achieve the "text flowing in" experience like ChatGPT, I adopted Server-Sent Events (SSE).

@router.post("/chat")
async def chat(request: ChatRequest):
    async def generate():
        # 1. Search related documents
        docs = vector_store.similarity_search(request.message)

        # 2. Send source info first (filenames/line numbers built from the docs' metadata)
        yield f"event: sources\ndata: {json.dumps(sources)}\n\n"

        # 3. Stream from LLM token by token (prompt = retrieved docs + user question)
        async for token in llm.stream(prompt):
            yield f"event: token\ndata: {json.dumps({'token': token})}\n\n"

        yield "event: done\ndata: {}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

The FastAPI 404 Issue in Production

When I added a new endpoint, it kept returning a 404 error.

GET /api/documents/abc123/content
→ {"detail": "Not Found"}  # FastAPI's default 404

Since it was FastAPI's default 404 and not my custom error (HTTPException(404, "Document not found")), it meant the route wasn't matching at all.

Debugging Method:

# 1. Add a debug endpoint that lists every registered route
@router.get("/debug/routes")
async def debug_routes():
    return {
        "routes": [
            {"path": r.path, "methods": sorted(getattr(r, "methods", []) or [])}
            for r in app.routes
        ]
    }

# 2. Print the route list at startup
for route in app.routes:
    print(f"{route.methods} {route.path}")

It turned out that the new code simply hadn't been deployed yet.

Lesson: When debugging in production, first verify "is the code actually deployed?"
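
One way to make that check trivial is a version endpoint that echoes a build identifier injected at deploy time. This is a sketch using a hypothetical GIT_SHA environment variable, not something from the original app:

# If /debug/version doesn't match the commit you just pushed, the deploy didn't land
import os
from fastapi import APIRouter

router = APIRouter()

@router.get("/debug/version")
async def debug_version():
    return {"git_sha": os.getenv("GIT_SHA", "unknown")}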

Day 4: Frontend Implementation

Receiving SSE Streaming

export async function* streamChat(message: string, history: Message[]) {
  const response = await fetch(`${API_URL}/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message, history }),
  });

  const reader = response.body?.getReader();
  if (!reader) return;
  const decoder = new TextDecoder();
  let buffer = "";
  let eventType = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });

    // Parse complete SSE lines; keep the trailing partial line in the buffer
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";
    for (const line of lines) {
      if (line.startsWith("event: ")) {
        eventType = line.slice("event: ".length).trim();
      } else if (line.startsWith("data: ")) {
        yield { type: eventType, data: JSON.parse(line.slice("data: ".length)) };
      }
    }
  }
}

The React Stale Closure Problem

This was the bug where I learned the most.

Symptom: The AI answers based on old messages.

1. "Tell me about the vacation policy" → Answers correctly
2. "Tell me about the branch office" → Somehow returns the answer about the vacation policy

Cause: Stale closure in useCallback.

// Buggy code
const sendMessage = useCallback(async (content: string) => {
  // This `messages` is the value from when useCallback was created!
  const history = messages.map(m => ({ role: m.role, content: m.content }));

  await streamChat(content, history); // Sends old history
}, [messages]); // Even with the dependency array, timing issues occur

Solution: Get the current state using setState functional updates.

// Fixed code
const sendMessage = useCallback(async (content: string) => {
  let capturedHistory: Message[] = [];

  // Get current state via functional update
  setMessages((prev) => {
    capturedHistory = prev.map(m => ({
      role: m.role,
      content: m.content
    }));
    return [...prev, userMessage, assistantMessage];
  });

  // capturedHistory is guaranteed to be latest
  await streamChat(content, capturedHistory);
}, []); // Dependency array can be empty

Lesson: If you need the latest state within an async callback:

  • Use useRef to synchronize state
  • Utilize setState functional updates

Day 5: UX Improvements

Document Preview Feature

I added a preview feature so users can see at a glance which documents are in scope for search.

// DocumentChipsBar - List of documents always displayed
<DocumentChipsBar
  documents={allDocuments}
  onPreview={(doc) => setPreviewDoc(doc)}
/>

// Modal display on click
<DocumentPreviewModal
  doc={previewDoc}
  onClose={() => setPreviewDoc(null)}
/>

Onboarding Flow

For first-time visitors, I displayed step-by-step hints:

  1. Tooltip pointing to the "Manage Documents" button.
  2. "Click to preview" hint pointing to the document chips.
// Tooltip position calculation
useEffect(() => {
  if (targetRef.current) {
    const rect = targetRef.current.getBoundingClientRect();
    setPosition({
      top: rect.bottom + 12,
      right: window.innerWidth - rect.right,
    });
  }
}, [targetRef]);

Project Structure

simple-rag-app/
├── frontend/              # Next.js Frontend
│   ├── app/
│   │   ├── components/    # UI Components
│   │   ├── page.tsx
│   │   └── layout.tsx
│   ├── lib/               # API, Constants, Type definitions
│   ├── hooks/             # Custom Hooks (useChat, useDocuments, etc.)
│   └── package.json
│
├── backend/               # Python Backend
│   ├── app/
│   │   ├── routers/       # API Endpoints (chat, documents)
│   │   ├── services/      # Business Logic (RAG, Document Management)
│   │   ├── models/        # Pydantic Schemas
│   │   └── utils/         # Rate limiting, Error handling
│   ├── pyproject.toml
│   └── Dockerfile
│
└── README.md

Why Separate Frontend and Backend?

I considered a monolithic structure (doing everything with Next.js API Routes), but separated them for the following reasons:

  1. Leveraging the Python Ecosystem: AI/ML libraries like LangChain, Chroma, and sentence-transformers are overwhelmingly richer in Python.
  2. Asynchronous Processing: FastAPI's async/await works very well with streaming responses.
  3. Deployment Flexibility: I can choose the optimal platform for both frontend and backend.

Cost Efficiency: Strategy for Free Production

For personal development or learning purposes, keeping costs to zero is crucial. I achieved a completely free production environment with the following configuration:

Deployment Selection

| Component | Deployment | Free Tier | Reason for Selection |
| --- | --- | --- | --- |
| Frontend | Vercel | 100GB bandwidth/mo | Creators of Next.js. Zero-config deployment. |
| Backend | HF Spaces | 2 vCPU, 16GB RAM | Docker support. Easy installation of ML libraries. |
| LLM | Gemini API | 15 RPM, 1M tokens/day | Most generous free tier. |
| Vector DB | Chroma (inside HF) | - | Persisted within HF Spaces. |

Why Vercel + Hugging Face Spaces?

Vercel (Frontend):

  • Since they make Next.js, it has perfect compatibility with the latest features like App Router and RSC.
  • Automatic deployment just by pushing to GitHub.
  • Fast delivery via Edge Network.
  • Commercial use allowed on the free tier.

Hugging Face Spaces (Backend):

  • Docker support allows handling complex dependencies.
  • Pre-installed environment for ML/AI libraries (PyTorch, sentence-transformers, etc.).
  • 16GB RAM allows embedding models to run comfortably.
  • Safe environment variable management via Secrets.
  • There is a cold start issue, but it's acceptable for demo use.

Cost Comparison: LLM Selection

In RAG apps, LLM API fees tend to be the biggest cost.

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Feature |
| --- | --- | --- | --- |
| Gemini Flash | $0.075 | $0.30 | Free tier available |
| GPT-4o mini | $0.15 | $0.60 | Good balance |
| Claude Haiku | $0.25 | $1.25 | Good at Japanese |
| GPT-4o | $2.50 | $10.00 | High precision |

Result: Selected Gemini Flash (Free Tier) for learning and demo purposes.
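
For a sense of scale, here is a rough back-of-the-envelope calculation using the prices above; the usage numbers are illustrative assumptions, not measurements from this app.

# Estimate monthly cost for an assumed workload: 200 queries/day,
# ~3K input tokens (retrieved context + question) and ~500 output tokens each
queries_per_month = 200 * 30
input_mtok = queries_per_month * 3_000 / 1_000_000   # millions of input tokens
output_mtok = queries_per_month * 500 / 1_000_000    # millions of output tokens

for model, in_price, out_price in [
    ("Gemini Flash", 0.075, 0.30),
    ("GPT-4o mini", 0.15, 0.60),
    ("GPT-4o", 2.50, 10.00),
]:
    print(f"{model}: ${input_mtok * in_price + output_mtok * out_price:,.2f}/month")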

Summary of Learnings

Technical Takeaways

  1. Chunking: Overlap is mandatory. 10-20% of chunk_size.
  2. Vector DB: Chroma is sufficient for learning. Consider Pinecone, etc., for production.
  3. Streaming: Implemented with SSE. Contributes significantly to UX.
  4. React Closures: In async processing, use useRef or functional setState.

Architectural Takeaways

[Vercel]          [HF Spaces]        [Chroma]
Next.js    <--->    FastAPI    <--->  Vector DB
Frontend            Backend           Persistence
   ↑                   ↑
   └──── SSE ──────────┘

Separating frontend and backend allows for:

  • Independent scaling
  • Deployment to the optimal platform for each
  • Maximizing the use of free tiers

Process Takeaways

  1. Debug Endpoints: Prepare verification endpoints like /debug/routes in production.
  2. Incremental Logging: Essential for isolating issues.
  3. Deployment Verification: First check "Is the code actually deployed?"

Conclusion

In 5 days, I was able to experience everything from the basics of RAG to production deployment.

Repository: https://github.com/oharu121/rag-demo

What made the biggest impression on me was:

  • The Power of RAG: Realizing "answers with evidence," which is impossible with an LLM alone.
  • Stale Closures: Be very careful with React async processing.
  • Free Production: Achievable with Vercel + HF Spaces + Gemini Free Tier.

RAG is a technology that makes "LLMs usable for actual work." I hope this demo app serves as a useful reference for your learning.
