
Yu-Chen, Lin

How I Built a RAG App from Scratch to Production in 5 Days

Introduction

"I hear about RAG a lot lately, but how do you actually build one?"

This 5-day development log started from that very question. In it, I share what I learned going from zero knowledge to a RAG application running in a production environment.

Completed App: https://github.com/oharu121/rag-demo

What I Built

I built a RAG application that allows users to ask questions about internal documentation using natural language.

Key Features:

  • Streaming responses like ChatGPT
  • Document upload and management
  • Citation display for answers (with filenames and line numbers)
  • Multi-turn conversation

Tech Stack:

| Layer | Technology | Reason for Selection |
| --- | --- | --- |
| Frontend | Next.js 16 / React 19 | App Router + RSC support |
| Backend | FastAPI | Asynchronous processing & type safety |
| Embedding Model | multilingual-e5-large | High accuracy for Japanese text |
| Vector DB | Chroma | Low learning curve & OSS |
| LLM | Gemini 2.0 Flash | Free tier available |
| Deployment | Vercel + HF Spaces | Production-ready for free |

Day 1: Understanding the Basics of RAG

What is RAG?

RAG (Retrieval-Augmented Generation) = Retrieval + Generation

User Question
    ↓
[Retrieval] Fetch relevant documents from Vector DB
    ↓
[Generation] Ask LLM: "Please answer using these materials"
    ↓
Answer with evidence
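
To make the flow concrete, here is a minimal Python sketch of the same two steps, using Chroma with its default embedding function; `call_llm` is a placeholder for whatever LLM client you use, not code from the actual app.

# Minimal RAG loop: retrieve from the vector DB, then generate with the LLM
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(
    ids=["vacation-1"],
    documents=["Employees receive 20 days of paid vacation per year."],
)

def answer(question: str) -> str:
    # [Retrieval] fetch the most relevant chunks
    hits = collection.query(query_texts=[question], n_results=3)
    context = "\n".join(hits["documents"][0])

    # [Generation] ask the LLM to answer using only the retrieved material
    prompt = f"Answer using only these materials:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)  # placeholder for a Gemini (or other) client call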

Why is RAG Necessary?

Problems with standalone LLMs:

  1. Knowledge Cutoff: They don't know information past their training date.
  2. No Access to Private Data: The LLM has never seen your company's documents.
  3. Hallucinations: They confidently tell lies.

RAG solves these issues by "reinforcing with search."

The Importance of Chunking

The first thing I stumbled upon was chunking (text splitting).

The problem with chunk_overlap = 0:

"Mr. Tanaka lives in Tokyo. He is | an engineer."
                          ↑ If split here
                          It becomes unclear who "He" refers to.

Lesson: A chunk_overlap of 10-20% of the chunk_size is recommended to prevent context fragmentation.
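
A minimal chunking sketch with overlap, using LangChain's RecursiveCharacterTextSplitter (one common choice; the post doesn't name the exact splitter used in the app):

# Split with ~15% overlap so sentences that straddle a chunk boundary stay intact
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "Mr. Tanaka lives in Tokyo. He is an engineer. " * 100  # sample document

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # characters per chunk
    chunk_overlap=75,  # 10-20% of chunk_size
)
chunks = splitter.split_text(text)
print(len(chunks), chunks[1][:60])  # the second chunk starts with repeated context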

How Embeddings Work

I wondered, "Why can't we just use simple character code conversion?"

# Naive approach - Cannot capture meaning
"cat" -> [99, 97, 116]
"dog" -> [100, 111, 103]
# "cat" and "feline" would end up in completely unrelated positions!

# Embedding Model - Captures meaning
"cat"      -> [0.82, 0.15, -0.34, ...]
"feline"   -> [0.79, 0.18, -0.31, ...]  # Close!
"airplane" -> [-0.45, 0.67, 0.12, ...]  # Far

Embedding models are trained so that "words used in similar contexts result in similar vectors."
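
Here is a small sketch of that idea with sentence-transformers, assuming the intfloat/multilingual-e5-large checkpoint from the tech stack (E5 models expect "query: " / "passage: " prefixes):

# Similar meanings -> nearby vectors; cosine similarity makes that measurable
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large")
vectors = model.encode(["query: cat", "passage: feline", "passage: airplane"])

print(util.cos_sim(vectors[0], vectors[1]))  # cat vs. feline   -> high
print(util.cos_sim(vectors[0], vectors[2]))  # cat vs. airplane -> low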

Day 2-3: Backend Implementation

Implementing Streaming API with FastAPI

To achieve the "text flowing in" experience like ChatGPT, I adopted Server-Sent Events (SSE).

@router.post("/chat")
async def chat(request: ChatRequest):
    async def generate():
        # 1. Search related documents
        docs = vector_store.similarity_search(request.message)

        # 2. Send source info first (filenames/line numbers built from the docs' metadata)
        yield f"event: sources\ndata: {json.dumps(sources)}\n\n"

        # 3. Stream from LLM token by token (prompt = retrieved docs + user question)
        async for token in llm.stream(prompt):
            yield f"event: token\ndata: {json.dumps({'token': token})}\n\n"

        yield "event: done\ndata: {}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

The FastAPI 404 Issue in Production

When I added a new endpoint, it kept returning a 404 error.

GET /api/documents/abc123/content
→ {"detail": "Not Found"}  # FastAPI's default 404

Since it was FastAPI's default 404 and not my custom error (HTTPException(404, "Document not found")), it meant the route wasn't matching at all.

Debugging Method:

# 1. Add a debug endpoint that lists every registered route
@router.get("/debug/routes")
async def debug_routes():
    return {
        "routes": [
            {"path": r.path, "methods": sorted(getattr(r, "methods", []) or [])}
            for r in app.routes
        ]
    }

# 2. Print the route list at startup
for route in app.routes:
    print(f"{route.methods} {route.path}")

It turned out that the new code simply hadn't been deployed yet.

Lesson: When debugging in production, first verify "is the code actually deployed?"
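
One way to make that check trivial is a version endpoint that echoes a build identifier injected at deploy time. This is a sketch using a hypothetical GIT_SHA environment variable, not something from the original app:

# If /debug/version doesn't match the commit you just pushed, the deploy didn't land
import os
from fastapi import APIRouter

router = APIRouter()

@router.get("/debug/version")
async def debug_version():
    return {"git_sha": os.getenv("GIT_SHA", "unknown")}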

Day 4: Frontend Implementation

Receiving SSE Streaming

export async function* streamChat(message: string, history: Message[]) {
  const response = await fetch(`${API_URL}/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message, history }),
  });

  const reader = response.body?.getReader();
  if (!reader) return;
  const decoder = new TextDecoder();
  let buffer = "";
  let eventType = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });

    // Parse complete SSE lines; keep the trailing partial line in the buffer
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";
    for (const line of lines) {
      if (line.startsWith("event: ")) {
        eventType = line.slice("event: ".length).trim();
      } else if (line.startsWith("data: ")) {
        yield { type: eventType, data: JSON.parse(line.slice("data: ".length)) };
      }
    }
  }
}

The React Stale Closure Problem

This was the bug where I learned the most.

Symptom: The AI answers based on old messages.

1. "Tell me about the vacation policy" → Answers correctly
2. "Tell me about the branch office" → Somehow returns the answer about the vacation policy

Cause: Stale closure in useCallback.

// Buggy code
const sendMessage = useCallback(async (content: string) => {
  // This `messages` is the value from when useCallback was created!
  const history = messages.map(m => ({ role: m.role, content: m.content }));

  await streamChat(content, history); // Sends old history
}, [messages]); // Even with the dependency array, timing issues occur

Solution: Get the current state using setState functional updates.

// Fixed code
const sendMessage = useCallback(async (content: string) => {
  let capturedHistory: Message[] = [];

  // Get current state via functional update
  setMessages((prev) => {
    capturedHistory = prev.map(m => ({
      role: m.role,
      content: m.content
    }));
    return [...prev, userMessage, assistantMessage];
  });

  // capturedHistory is guaranteed to be latest
  await streamChat(content, capturedHistory);
}, []); // Dependency array can be empty

Lesson: If you need the latest state within an async callback:

  • Use useRef to synchronize state
  • Utilize setState functional updates

Day 5: UX Improvements

Document Preview Feature

I added a preview feature so users can see at a glance which documents are in scope for search.

// DocumentChipsBar - List of documents always displayed
<DocumentChipsBar
  documents={allDocuments}
  onPreview={(doc) => setPreviewDoc(doc)}
/>

// Modal display on click
<DocumentPreviewModal
  doc={previewDoc}
  onClose={() => setPreviewDoc(null)}
/>

Onboarding Flow

For first-time visitors, I displayed step-by-step hints:

  1. Tooltip pointing to the "Manage Documents" button.
  2. "Click to preview" hint pointing to the document chips.
// Tooltip position calculation
useEffect(() => {
  if (targetRef.current) {
    const rect = targetRef.current.getBoundingClientRect();
    setPosition({
      top: rect.bottom + 12,
      right: window.innerWidth - rect.right,
    });
  }
}, [targetRef]);

Project Structure

simple-rag-app/
├── frontend/              # Next.js Frontend
│   ├── app/
│   │   ├── components/    # UI Components
│   │   ├── page.tsx
│   │   └── layout.tsx
│   ├── lib/               # API, Constants, Type definitions
│   ├── hooks/             # Custom Hooks (useChat, useDocuments, etc.)
│   └── package.json
│
├── backend/               # Python Backend
│   ├── app/
│   │   ├── routers/       # API Endpoints (chat, documents)
│   │   ├── services/      # Business Logic (RAG, Document Management)
│   │   ├── models/        # Pydantic Schemas
│   │   └── utils/         # Rate limiting, Error handling
│   ├── pyproject.toml
│   └── Dockerfile
│
└── README.md

Why Separate Frontend and Backend?

I considered a monolithic structure (doing everything with Next.js API Routes), but separated them for the following reasons:

  1. Leveraging the Python Ecosystem: AI/ML libraries like LangChain, Chroma, and sentence-transformers are overwhelmingly richer in Python.
  2. Asynchronous Processing: FastAPI's async/await works very well with streaming responses.
  3. Deployment Flexibility: I can choose the optimal platform for both frontend and backend.

Cost Efficiency: Strategy for Free Production

For personal development or learning purposes, keeping costs to zero is crucial. I achieved a completely free production environment with the following configuration:

Deployment Selection

| Component | Deployment | Free Tier | Reason for Selection |
| --- | --- | --- | --- |
| Frontend | Vercel | 100GB bandwidth/mo | Creators of Next.js. Zero-config deployment. |
| Backend | HF Spaces | 2 vCPU, 16GB RAM | Docker support. Easy installation of ML libraries. |
| LLM | Gemini API | 15 RPM, 1M tokens/day | Most generous free tier. |
| Vector DB | Chroma (inside HF) | - | Persisted within HF Spaces. |

Why Vercel + Hugging Face Spaces?

Vercel (Frontend):

  • Since they make Next.js, it has perfect compatibility with the latest features like App Router and RSC.
  • Automatic deployment just by pushing to GitHub.
  • Fast delivery via Edge Network.
  • Commercial use allowed on the free tier.

Hugging Face Spaces (Backend):

  • Docker support allows handling complex dependencies.
  • Pre-installed environment for ML/AI libraries (PyTorch, sentence-transformers, etc.).
  • 16GB RAM allows embedding models to run comfortably.
  • Safe environment variable management via Secrets.
  • There is a cold start issue, but it's acceptable for demo use.

Cost Comparison: LLM Selection

In RAG apps, LLM API fees tend to be the biggest cost.

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Feature |
| --- | --- | --- | --- |
| Gemini Flash | $0.075 | $0.30 | Free tier available |
| GPT-4o mini | $0.15 | $0.60 | Good balance |
| Claude Haiku | $0.25 | $1.25 | Good at Japanese |
| GPT-4o | $2.50 | $10.00 | High precision |

Result: Selected Gemini Flash (Free Tier) for learning and demo purposes.
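
For a sense of scale, here is a rough back-of-the-envelope calculation using the prices above; the usage numbers are illustrative assumptions, not measurements from this app.

# Estimate monthly cost for an assumed workload: 200 queries/day,
# ~3K input tokens (retrieved context + question) and ~500 output tokens each
queries_per_month = 200 * 30
input_mtok = queries_per_month * 3_000 / 1_000_000   # millions of input tokens
output_mtok = queries_per_month * 500 / 1_000_000    # millions of output tokens

for model, in_price, out_price in [
    ("Gemini Flash", 0.075, 0.30),
    ("GPT-4o mini", 0.15, 0.60),
    ("GPT-4o", 2.50, 10.00),
]:
    print(f"{model}: ${input_mtok * in_price + output_mtok * out_price:,.2f}/month")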

Summary of Learnings

Technical Takeaways

  1. Chunking: Overlap is mandatory. 10-20% of chunk_size.
  2. Vector DB: Chroma is sufficient for learning. Consider Pinecone, etc., for production.
  3. Streaming: Implemented with SSE. Contributes significantly to UX.
  4. React Closures: In async processing, use useRef or functional setState.

Architectural Takeaways

[Vercel]          [HF Spaces]        [Chroma]
Next.js    <--->    FastAPI    <--->  Vector DB
Frontend            Backend           Persistence
   ↑                   ↑
   └──── SSE ──────────┘

Separating frontend and backend allows for:

  • Independent scaling
  • Deployment to the optimal platform for each
  • Maximizing the use of free tiers

Process Takeaways

  1. Debug Endpoints: Prepare verification endpoints like /debug/routes in production.
  2. Incremental Logging: Essential for isolating issues.
  3. Deployment Verification: First check "Is the code actually deployed?"

Conclusion

In 5 days, I was able to experience everything from the basics of RAG to production deployment.

Repository: https://github.com/oharu121/rag-demo

What made the biggest impression on me was:

  • The Power of RAG: Realizing "answers with evidence," which is impossible with an LLM alone.
  • Stale Closures: Be very careful with React async processing.
  • Free Production: Achievable with Vercel + HF Spaces + Gemini Free Tier.

RAG is a technology that makes "LLMs usable for actual work." I hope this demo app serves as a useful reference for your learning.
