What I Built
A RAG (Retrieval-Augmented Generation) system that helps Japanese small businesses find government subsidies. Users describe their business situation, the system retrieves relevant subsidies from a vector database, and then it generates a detailed answer using the Claude API.
Why RAG Instead of Just Using an LLM?
LLMs alone have three problems for this use case:
- Hallucination — They confidently make up subsidy details that don't exist
- Stale data — Subsidy information updates frequently; LLM training data can't keep up
- No sources — Users need to verify the information, but LLMs don't cite where it came from
RAG solves all three by retrieving real data first, then passing it to the LLM as context. The LLM generates answers grounded in actual documents, not its training data.
Architecture
User Query
│
▼
┌─────────────┐ ┌──────────────┐
│ Retriever │────▶│ Generator │
│ (embed + │ │ (Claude API) │
│ search) │ └──────┬───────┘
└──────┬──────┘ │
│ ▼
┌──────▼──────┐ AI Answer
│ ChromaDB │ with sources
│ (Vector DB) │
└─────────────┘
The data ingestion pipeline:
JSON Data → Sentence Chunker → Embedding Model → ChromaDB
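That ingestion flow can be sketched end-to-end in a few lines. This is an illustration, not the project's actual code: `fake_embed` and `InMemoryStore` are offline stand-ins for multilingual-e5-large and ChromaDB, so the sketch runs with no external dependencies.

```python
import hashlib

def fake_embed(text: str, dim: int = 8) -> list[float]:
    # Deterministic stand-in for multilingual-e5-large; real code would
    # call SentenceTransformer.encode() here instead.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dim]]

class InMemoryStore:
    """Minimal stand-in for a ChromaDB collection: (id, embedding, document) rows."""
    def __init__(self):
        self.rows = []

    def add(self, ids, embeddings, documents):
        self.rows.extend(zip(ids, embeddings, documents))

def ingest(records: list[dict], store: InMemoryStore) -> int:
    """JSON records (already chunked) -> embeddings -> vector store."""
    for rec in records:
        for i, chunk in enumerate(rec["chunks"]):
            store.add(
                ids=[f"{rec['id']}-{i}"],
                # e5 models expect the "passage: " prefix at index time
                embeddings=[fake_embed(f"passage: {chunk}")],
                documents=[chunk],
            )
    return len(store.rows)

store = InMemoryStore()
n = ingest([{"id": "it-subsidy", "chunks": ["会計ソフトが対象。", "補助率は1/2。"]}], store)
```

Swapping the stand-ins for the real embedder and a `chromadb.PersistentClient` collection keeps the same shape: chunk, prefix, embed, store.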
Tech Stack and Why
| Layer | Choice | Why |
|---|---|---|
| Embedding | multilingual-e5-large | Best multilingual performance for free. Runs locally — no data leaves the machine |
| Vector DB | ChromaDB | Zero config. pip install and it works with local persistence |
| LLM | Claude API | Strong long-context handling and Japanese language quality |
| Frontend | Streamlit | Full UI in ~40 lines of Python |
Why Not LangChain/LlamaIndex?
I intentionally built the pipeline from scratch first. Frameworks abstract away the internals, which is great for production but bad for learning. By writing each step manually — chunking, embedding, storing, retrieving, prompting — I understood exactly what each piece does and why it matters.
Key Implementation Details
1. Sentence-Boundary Chunking
The most impactful decision was how to split documents into chunks.
Naive approach (fixed-size split):
...eligible IT tools include accounting software, ord ← cut mid-sentence
er management software, payment software... ← next chunk
My approach (split at Japanese sentence boundaries 。):
sentences = text.split("。")
# Greedily pack sentences into chunks up to CHUNK_SIZE
# Overlap the tail for context continuity
This preserves semantic meaning in each chunk, which directly improves retrieval accuracy.
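A runnable sketch of that chunker follows. The function name, the default sizes, and the overlap-by-one-sentence detail are my illustration of the approach described above, not the post's exact code:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 1) -> list[str]:
    """Greedily pack 。-delimited sentences into chunks of at most chunk_size
    characters, carrying the last `overlap` sentences into the next chunk."""
    # Re-append the 。 that str.split() strips off
    sentences = [s + "。" for s in text.split("。") if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for sent in sentences:
        if current and len("".join(current)) + len(sent) > chunk_size:
            chunks.append("".join(current))
            current = current[-overlap:]  # overlap the tail for context continuity
        current.append(sent)
    if current:
        chunks.append("".join(current))
    return chunks

chunks = chunk_text("売上向上を支援。対象は中小企業。補助率は2/3。申請は電子申請のみ。", chunk_size=20)
```

Every chunk starts and ends on a sentence boundary, and consecutive chunks share a sentence, so no statement is ever split across an embedding.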
2. Embedding with Prefix
multilingual-e5 requires specific prefixes to distinguish between stored passages and search queries:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# When indexing documents
prefixed = [f"passage: {text}" for text in texts]
doc_embeddings = model.encode(prefixed)

# When searching
query_embedding = model.encode(f"query: {user_query}")
Missing this prefix is a common mistake that silently degrades search quality.
3. Prompt Design for Grounded Answers
The generation prompt explicitly constrains the LLM:
prompt = """You are a Japanese subsidy advisor.
Answer based on the reference information below.
If information is insufficient, state that explicitly.
For critical details like deadlines and amounts, be precise."""
This reduces hallucination by telling the model to stick to the provided context.
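The post doesn't show how the retrieved chunks are stitched into the prompt, so here is one plausible assembly. `build_prompt`, the numbered-source format, and the section headers are my assumptions, not the actual implementation:

```python
SYSTEM = """You are a Japanese subsidy advisor.
Answer based on the reference information below.
If information is insufficient, state that explicitly.
For critical details like deadlines and amounts, be precise."""

def build_prompt(user_query: str, retrieved: list[dict]) -> str:
    # Number each retrieved chunk so the answer can cite its sources ([1], [2], ...)
    context = "\n\n".join(
        f"[{i}] {doc['title']}\n{doc['text']}" for i, doc in enumerate(retrieved, 1)
    )
    return (
        f"{SYSTEM}\n\n"
        f"# Reference information\n{context}\n\n"
        f"# Question\n{user_query}"
    )

prompt = build_prompt(
    "会計ソフトを導入したい",
    [{"title": "IT導入補助金", "text": "会計ソフト等のITツールが対象。補助率は1/2。"}],
)
```

The resulting string would then go to the Claude API as the user message; keeping the instructions and the numbered context in one place makes it easy to show sources alongside the generated answer.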
Evaluation: Measuring What Matters
Building a RAG that "seems to work" isn't enough. I created a test suite with 10 queries, each with an expected correct subsidy:
| Query | Expected Result | Rank |
|---|---|---|
| "Want to introduce accounting software" | Digital/AI Subsidy | 1st |
| "Want to install new manufacturing equipment" | Manufacturing Subsidy | 2nd |
| "Small shop wanting to attract customers via flyers" | Small Business Sustainability Subsidy | 1st |
| "Restaurant wanting to start an e-commerce (EC) business" | New Business Advancement Subsidy | 3rd |
| "Want to convert part-timers to full-time" | Career Up Grant | 1st |
| ... | ... | ... |
Results:
| Metric | Score |
|---|---|
| Hit Rate @5 | 100% (10/10) |
| MRR @5 | 0.817 |
- Hit Rate @5: The correct answer appeared in the top 5 results for every query
- MRR (Mean Reciprocal Rank): On average, the correct answer ranked between 1st and 2nd place
Why These Metrics?
- Hit Rate tells you "can the system find the right answer at all?"
- MRR tells you "how quickly does it find it?" (1st place = 1.0, 2nd = 0.5, 3rd = 0.33)
Both are standard IR (Information Retrieval) metrics. Having these numbers lets you objectively compare different chunking strategies, embedding models, or retrieval parameters.
What I Learned
1. Chunking strategy matters more than the embedding model
Switching from fixed-size to sentence-boundary chunking had a bigger impact on retrieval quality than I expected. The embedding model can only work with what it's given — if chunks are semantically broken, even the best model can't fix that.
2. Evaluation should come before optimization
Building the evaluation script early gave me a baseline to measure improvements against. Without it, I would have been tuning blindly.
3. Start simple, verify, then add complexity
The entire system is ~200 lines of Python (excluding tests). No frameworks, no complex abstractions. This made debugging straightforward — when something went wrong, there were very few places to look.
What's Next
- Reranker: Add a cross-encoder reranker to improve ranking quality (especially for Q4 and Q9 which ranked 3rd)
- Hybrid search: Combine vector similarity with keyword matching (BM25) for better recall
- More data: Scale from 15 to 100+ subsidies with automated collection
- Production frontend: Migrate from Streamlit to Next.js + FastAPI for a proper web application
If you're building a RAG system for the first time, I'd recommend the same approach: build from scratch first, measure with real metrics, then decide if you need a framework. The understanding you gain is worth far more than the time saved by a framework.