Introduction
"I hear about RAG a lot lately, but how do you actually build one?"
This 5-day development log started from that very question. I'll share what I learned going from zero knowledge to a RAG application running in a production environment.
Completed App: https://github.com/oharu121/rag-demo
What I Built
I built a RAG application that allows users to ask questions about internal documentation using natural language.
Key Features:
- Streaming responses like ChatGPT
- Document upload and management
- Citation display for answers (with filenames and line numbers)
- Multi-turn conversation
Tech Stack:
| Layer | Technology | Reason for Selection |
|---|---|---|
| Frontend | Next.js 16 / React 19 | App Router + RSC support |
| Backend | FastAPI | Asynchronous processing & Type safety |
| Embedding Model | multilingual-e5-large | High accuracy for Japanese text |
| Vector DB | Chroma | Low learning curve & OSS |
| LLM | Gemini 2.0 Flash | Free tier available |
| Deployment | Vercel + HF Spaces | Production-ready for free |
Day 1: Understanding the Basics of RAG
What is RAG?
RAG (Retrieval-Augmented Generation) = Retrieval + Generation
User Question
↓
[Retrieval] Fetch relevant documents from Vector DB
↓
[Generation] Ask LLM: "Please answer using these materials"
↓
Answer with evidence
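To make the flow concrete, here is a conceptual sketch of the two stages in Python. The names vector_store and llm are placeholders for whatever retriever and model you wire up (Chroma and Gemini in this app), and the prompt wording is illustrative rather than taken from the repo:

# Conceptual sketch of the RAG pipeline; vector_store and llm are placeholders,
# not the actual objects used in the repo.
def answer(question: str, vector_store, llm) -> str:
    # [Retrieval] fetch the chunks most similar to the question
    docs = vector_store.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)

    # [Generation] ask the LLM to answer using only the retrieved material
    prompt = (
        "Please answer using only the materials below.\n\n"
        f"Materials:\n{context}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt)  # with a LangChain chat model, read .content from the result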
Why is RAG Necessary?
Problems with standalone LLMs:
- Knowledge Cutoff: They don't know information past their training date.
- No Access to Private Data: The LLM has never seen your company's documents.
- Hallucinations: They confidently tell lies.
RAG solves these issues by "reinforcing with search."
The Importance of Chunking
The first thing I stumbled upon was chunking (text splitting).
The problem with chunk_overlap = 0:
"Mr. Tanaka lives in Tokyo. He is | an engineer."
↑ If split here
It becomes unclear who "He" refers to.
Lesson: A chunk_overlap of 10-20% of the chunk_size is recommended to prevent context fragmentation.
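As a concrete illustration, here is a minimal chunking sketch using LangChain's RecursiveCharacterTextSplitter. The splitter choice and the numbers are my assumptions for the example, not necessarily what the repo uses:

# Minimal chunking sketch: chunk_overlap is ~15% of chunk_size, so a sentence
# cut at a boundary also appears at the start of the next chunk.
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = "Mr. Tanaka lives in Tokyo. He is an engineer. " * 100  # dummy long text

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # characters per chunk
    chunk_overlap=75,  # ~15% overlap keeps "He" near the sentence that names him
)
chunks = splitter.split_text(document_text)
print(len(chunks), repr(chunks[1][:60]))  # the second chunk starts with repeated context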
How Embeddings Work
I wondered, "Why can't we just use simple character code conversion?"
# Naive approach - Cannot capture meaning
"cat" -> [99, 97, 116]
"dog" -> [100, 111, 103]
# "cat" and "feline" would end up in completely unrelated positions!
# Embedding Model - Captures meaning
"cat" -> [0.82, 0.15, -0.34, ...]
"feline" -> [0.79, 0.18, -0.31, ...] # Close!
"airplane" -> [-0.45, 0.67, 0.12, ...] # Far
Embedding models are trained so that "words used in similar contexts result in similar vectors."
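Here is a small sketch of that idea with the multilingual-e5-large model from the tech stack, using sentence-transformers. One detail that is easy to miss: the e5 family expects "query:" / "passage:" prefixes on the input text.

# Embedding similarity sketch; assumes sentence-transformers is installed and
# downloads intfloat/multilingual-e5-large (a large model) on first run.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")
vectors = model.encode(
    ["query: cat", "passage: feline", "passage: airplane"],
    normalize_embeddings=True,  # normalized vectors: dot product == cosine similarity
)
print(vectors[0] @ vectors[1])  # high score: "cat" and "feline" are close
print(vectors[0] @ vectors[2])  # lower score: "airplane" is far away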
Day 2-3: Backend Implementation
Implementing Streaming API with FastAPI
To achieve the "text flowing in" experience like ChatGPT, I adopted Server-Sent Events (SSE).
@router.post("/chat")
async def chat(request: ChatRequest):
    async def generate():
        # 1. Search related documents
        docs = vector_store.similarity_search(request.message)

        # 2. Send source info (citation metadata) first
        sources = [doc.metadata for doc in docs]
        yield f"event: sources\ndata: {json.dumps(sources)}\n\n"

        # 3. Build the prompt from the retrieved chunks and stream from the LLM token by token
        context = "\n\n".join(doc.page_content for doc in docs)
        prompt = f"Please answer using the materials below.\n\n{context}\n\nQuestion: {request.message}"
        async for token in llm.stream(prompt):
            yield f"event: token\ndata: {json.dumps({'token': token})}\n\n"

        yield f"event: done\ndata: {{}}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
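For reference, here is a sketch of what the llm.stream(prompt) call could look like when backed by Gemini through the google-generativeai SDK. This wiring is my assumption; the repo's actual wrapper (and the GEMINI_API_KEY variable name) may differ.

# Hedged sketch: one possible implementation of llm.stream() on top of the
# google-generativeai SDK; the env var name is a hypothetical example.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

async def stream_llm(prompt: str):
    # stream=True yields partial responses as they are generated
    response = await model.generate_content_async(prompt, stream=True)
    async for chunk in response:
        yield chunk.text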
The FastAPI 404 Issue in Production
When I added a new endpoint, it kept returning a 404 error.
GET /api/documents/abc123/content
→ {"detail": "Not Found"} # FastAPI's default 404
Since it was FastAPI's default 404 and not my custom error (HTTPException(404, "Document not found")), it meant the route wasn't matching at all.
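For contrast, here is a minimal sketch of what the endpoint's own 404 looks like when the route does match but the document is missing. The doc_id lookup and the in-memory store are hypothetical stand-ins, not the repo's code:

# Hedged sketch: when the route matches but the document is missing, you get
# THIS detail message instead of FastAPI's generic {"detail": "Not Found"}.
from fastapi import APIRouter, HTTPException

router = APIRouter()
documents: dict[str, str] = {}  # hypothetical in-memory store

@router.get("/api/documents/{doc_id}/content")
async def get_document_content(doc_id: str):
    if doc_id not in documents:
        raise HTTPException(status_code=404, detail="Document not found")
    return {"content": documents[doc_id]}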
Debugging Method:
# 1. Add a debug endpoint
@router.get("/debug/routes")
async def debug_routes():
    return {"routes": [...]}

# 2. Print route list at startup
for route in app.routes:
    print(f"{route.methods} {route.path}")
It turned out that the new code simply hadn't been deployed yet.
Lesson: When debugging in production, first verify "is the code actually deployed?"
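Building on that, here is a complete version of the /debug/routes idea that dumps every registered path, so you can check at a glance whether the deployed app contains the new endpoint. A sketch, assuming the standard FastAPI route objects:

# Hedged sketch: list all registered routes so a missing endpoint is obvious.
from fastapi import FastAPI
from fastapi.routing import APIRoute

app = FastAPI()

@app.get("/debug/routes")
async def debug_routes():
    return {
        "routes": [
            {"path": r.path, "methods": sorted(r.methods)}
            for r in app.routes
            if isinstance(r, APIRoute)
        ]
    }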
Day 4: Frontend Implementation
Receiving SSE Streaming
export async function* streamChat(message: string, history: Message[]) {
  const response = await fetch(`${API_URL}/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message, history }),
  });
  const reader = response.body?.getReader();
  if (!reader) return;
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // Parse SSE events: events are separated by a blank line,
    // so keep the incomplete tail in the buffer for the next chunk
    const events = buffer.split("\n\n");
    buffer = events.pop() ?? "";

    for (const event of events) {
      const typeMatch = event.match(/^event: (.*)$/m);
      const dataMatch = event.match(/^data: (.*)$/m);
      if (typeMatch && dataMatch) {
        yield { type: typeMatch[1], data: JSON.parse(dataMatch[1]) };
      }
    }
  }
}
The React Stale Closure Problem
This was the bug where I learned the most.
Symptom: The AI answers based on old messages.
1. "Tell me about the vacation policy" → Answers correctly
2. "Tell me about the branch office" → Somehow returns the answer about the vacation policy
Cause: Stale closure in useCallback.
// Buggy code
const sendMessage = useCallback(async (content: string) => {
  // This `messages` is the value from when useCallback was created!
  const history = messages.map(m => ({ role: m.role, content: m.content }));
  await streamChat(content, history); // Sends old history
}, [messages]); // Even with the dependency array, timing issues occur
Solution: Get the current state using setState functional updates.
// Fixed code
const sendMessage = useCallback(async (content: string) => {
  let capturedHistory: Message[] = [];

  // Get current state via functional update
  setMessages((prev) => {
    capturedHistory = prev.map(m => ({
      role: m.role,
      content: m.content,
    }));
    return [...prev, userMessage, assistantMessage];
  });

  // capturedHistory is guaranteed to be latest
  await streamChat(content, capturedHistory);
}, []); // Dependency array can be empty
Lesson: If you need the latest state within an async callback:
- Use useRef to synchronize state
- Utilize setState functional updates
Day 5: UX Improvements
Document Preview Feature
I added a preview feature so users can check "which documents are being targeted for search."
// DocumentChipsBar - list of documents, always displayed
<DocumentChipsBar
  documents={allDocuments}
  onPreview={(doc) => setPreviewDoc(doc)}
/>

// Modal display on click
<DocumentPreviewModal
  doc={previewDoc}
  onClose={() => setPreviewDoc(null)}
/>
Onboarding Flow
For first-time visitors, I displayed step-by-step hints:
- Tooltip pointing to the "Manage Documents" button.
- "Click to preview" hint pointing to the document chips.
// Tooltip position calculation
useEffect(() => {
  if (targetRef.current) {
    const rect = targetRef.current.getBoundingClientRect();
    setPosition({
      top: rect.bottom + 12,
      right: window.innerWidth - rect.right,
    });
  }
}, [targetRef]);
Project Structure
simple-rag-app/
├── frontend/                 # Next.js Frontend
│   ├── app/
│   │   ├── components/       # UI Components
│   │   ├── page.tsx
│   │   └── layout.tsx
│   ├── lib/                  # API, Constants, Type definitions
│   ├── hooks/                # Custom Hooks (useChat, useDocuments, etc.)
│   └── package.json
│
├── backend/                  # Python Backend
│   ├── app/
│   │   ├── routers/          # API Endpoints (chat, documents)
│   │   ├── services/         # Business Logic (RAG, Document Management)
│   │   ├── models/           # Pydantic Schemas
│   │   └── utils/            # Rate limiting, Error handling
│   ├── pyproject.toml
│   └── Dockerfile
│
└── README.md
Why Separate Frontend and Backend?
I considered a monolithic structure (doing everything with Next.js API Routes), but separated them for the following reasons:
- Leveraging the Python Ecosystem: AI/ML libraries like LangChain, Chroma, and sentence-transformers are overwhelmingly richer in Python.
- Asynchronous Processing: FastAPI's async/await works very well with streaming responses.
- Deployment Flexibility: I can choose the optimal platform for both frontend and backend.
Cost Efficiency: Strategy for Free Production
For personal development or learning purposes, keeping costs to zero is crucial. I achieved a completely free production environment with the following configuration:
Deployment Selection
| Component | Deployment | Free Tier | Reason for Selection |
|---|---|---|---|
| Frontend | Vercel | 100GB bandwidth/mo | Creators of Next.js. Zero-config deployment. |
| Backend | HF Spaces | 2vCPU, 16GB RAM | Docker support. Easy installation of ML libraries. |
| LLM | Gemini API | 15 RPM, 1M tokens/day | Most generous free tier. |
| Vector DB | Chroma (Inside HF) | - | Persisted within HF Spaces. |
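For the "persisted within HF Spaces" row above, here is a minimal sketch of pointing Chroma at a directory inside the Space. The path and collection name are hypothetical examples, not taken from the repo:

# Hedged sketch: a persistent Chroma client writing to a local directory so the
# index survives process restarts. "/data/chroma" is a hypothetical path.
import chromadb

client = chromadb.PersistentClient(path="/data/chroma")
collection = client.get_or_create_collection("documents")

# store precomputed embeddings alongside chunk text and citation metadata
collection.add(
    ids=["chunk-0"],
    documents=["Mr. Tanaka lives in Tokyo. He is an engineer."],
    embeddings=[[0.82, 0.15, -0.34]],  # toy vector; real ones come from e5
    metadatas=[{"filename": "intro.md", "line": 1}],
)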
Why Vercel + Hugging Face Spaces?
Vercel (Frontend):
- Since they make Next.js, it has perfect compatibility with the latest features like App Router and RSC.
- Automatic deployment just by pushing to GitHub.
- Fast delivery via Edge Network.
- Commercial use allowed on the free tier.
Hugging Face Spaces (Backend):
- Docker support allows handling complex dependencies.
- Pre-installed environment for ML/AI libraries (PyTorch, sentence-transformers, etc.).
- 16GB RAM allows embedding models to run comfortably.
- Safe environment variable management via Secrets.
- There is a cold start issue, but it's acceptable for demo use.
Cost Comparison: LLM Selection
In RAG apps, LLM API fees tend to be the biggest cost.
| Model | Input ($ / 1M tokens) | Output ($ / 1M tokens) | Notes |
|---|---|---|---|
| Gemini Flash | $0.075 | $0.30 | Free tier available |
| GPT-4o mini | $0.15 | $0.60 | Good balance |
| Claude Haiku | $0.25 | $1.25 | Good at Japanese |
| GPT-4o | $2.50 | $10.00 | High precision |
Result: Selected Gemini Flash (Free Tier) for learning and demo purposes.
Summary of Learnings
Technical Takeaways
- Chunking: Overlap is mandatory; use 10-20% of chunk_size.
- Vector DB: Chroma is sufficient for learning. Consider Pinecone, etc., for production.
- Streaming: Implemented with SSE. Contributes significantly to UX.
- React Closures: In async processing, use useRef or functional setState.
Architectural Takeaways
[Vercel]           [HF Spaces]           [Chroma]
Next.js   <--->    FastAPI     <--->     Vector DB
Frontend           Backend               Persistence
   ↑                  ↑
   └────── SSE ───────┘
Separating frontend and backend allows for:
- Independent scaling
- Deployment to the optimal platform for each
- Maximizing the use of free tiers
Process Takeaways
- Debug Endpoints: Prepare verification endpoints like /debug/routes in production.
- Incremental Logging: Essential for isolating issues.
- Deployment Verification: First check "Is the code actually deployed?"
Conclusion
In 5 days, I was able to experience everything from the basics of RAG to production deployment.
Repository: https://github.com/oharu121/rag-demo
What made the biggest impression on me was:
- The Power of RAG: Realizing "answers with evidence," which is impossible with an LLM alone.
- Stale Closures: Be very careful with React async processing.
- Free Production: Achievable with Vercel + HF Spaces + Gemini Free Tier.
RAG is a technology that makes "LLMs usable for actual work." I hope this demo app serves as a useful reference for your learning.