What I Built
A RAG (Retrieval-Augmented Generation) system that helps Japanese small businesses find government subsidies. Users describe their business situation, the system retrieves relevant subsidies from a vector database, and then it generates a detailed answer using the Claude API.
Why RAG Instead of Just Using an LLM?
LLMs alone have three problems for this use case:
- Hallucination — They confidently make up subsidy details that don't exist
- Stale data — Subsidy information updates frequently; LLM training data can't keep up
- No sources — Users need to verify the information, but LLMs don't cite where it came from
RAG solves all three by retrieving real data first, then passing it to the LLM as context. The LLM generates answers grounded in actual documents, not its training data.
Architecture
User Query
│
▼
┌─────────────┐ ┌──────────────┐
│ Retriever │────▶│ Generator │
│ (embed + │ │ (Claude API) │
│ search) │ └──────┬───────┘
└──────┬──────┘ │
│ ▼
┌──────▼──────┐ AI Answer
│ ChromaDB │ with sources
│ (Vector DB) │
└─────────────┘
The data ingestion pipeline:
JSON Data → Sentence Chunker → Embedding Model → ChromaDB
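That ingestion flow can be sketched end-to-end in a few lines. This is an illustration, not the project's actual code: `fake_embed` and `InMemoryStore` are offline stand-ins for multilingual-e5-large and ChromaDB, so the sketch runs with no external dependencies.

```python
import hashlib

def fake_embed(text: str, dim: int = 8) -> list[float]:
    # Deterministic stand-in for multilingual-e5-large; real code would
    # call SentenceTransformer.encode() here instead.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dim]]

class InMemoryStore:
    """Minimal stand-in for a ChromaDB collection: (id, embedding, document) rows."""
    def __init__(self):
        self.rows = []

    def add(self, ids, embeddings, documents):
        self.rows.extend(zip(ids, embeddings, documents))

def ingest(records: list[dict], store: InMemoryStore) -> int:
    """JSON records (already chunked) -> embeddings -> vector store."""
    for rec in records:
        for i, chunk in enumerate(rec["chunks"]):
            store.add(
                ids=[f"{rec['id']}-{i}"],
                # e5 models expect the "passage: " prefix at index time
                embeddings=[fake_embed(f"passage: {chunk}")],
                documents=[chunk],
            )
    return len(store.rows)

store = InMemoryStore()
n = ingest([{"id": "it-subsidy", "chunks": ["会計ソフトが対象。", "補助率は1/2。"]}], store)
```

Swapping the stand-ins for the real embedder and a `chromadb.PersistentClient` collection keeps the same shape: chunk, prefix, embed, store.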
Tech Stack and Why
| Layer | Choice | Why |
|---|---|---|
| Embedding | multilingual-e5-large | Best multilingual performance for free. Runs locally — no data leaves the machine |
| Vector DB | ChromaDB | Zero config. pip install and it works with local persistence |
| LLM | Claude API | Strong long-context handling and Japanese language quality |
| Frontend | Streamlit | Full UI in ~40 lines of Python |
Why Not LangChain/LlamaIndex?
I intentionally built the pipeline from scratch first. Frameworks abstract away the internals, which is great for production but bad for learning. By writing each step manually — chunking, embedding, storing, retrieving, prompting — I understood exactly what each piece does and why it matters.
Key Implementation Details
1. Sentence-Boundary Chunking
The most impactful decision was how to split documents into chunks.
Naive approach (fixed-size split):
...eligible IT tools include accounting software, ord ← cut mid-sentence
er management software, payment software... ← next chunk
My approach (split at Japanese sentence boundaries 。):
sentences = text.split("。")
# Greedily pack sentences into chunks up to CHUNK_SIZE
# Overlap the tail for context continuity
This preserves semantic meaning in each chunk, which directly improves retrieval accuracy.
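A runnable sketch of that chunker follows. The function name, the default sizes, and the overlap-by-one-sentence detail are my illustration of the approach described above, not the post's exact code:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 1) -> list[str]:
    """Greedily pack 。-delimited sentences into chunks of at most chunk_size
    characters, carrying the last `overlap` sentences into the next chunk."""
    # Re-append the 。 that str.split() strips off
    sentences = [s + "。" for s in text.split("。") if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for sent in sentences:
        if current and len("".join(current)) + len(sent) > chunk_size:
            chunks.append("".join(current))
            current = current[-overlap:]  # overlap the tail for context continuity
        current.append(sent)
    if current:
        chunks.append("".join(current))
    return chunks

chunks = chunk_text("売上向上を支援。対象は中小企業。補助率は2/3。申請は電子申請のみ。", chunk_size=20)
```

Every chunk starts and ends on a sentence boundary, and consecutive chunks share a sentence, so no statement is ever split across an embedding.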
2. Embedding with Prefix
multilingual-e5 requires specific prefixes to distinguish between stored passages and search queries:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# When indexing documents
prefixed = [f"passage: {text}" for text in texts]
doc_embeddings = model.encode(prefixed)

# When searching
query_embedding = model.encode(f"query: {user_query}")
Missing this prefix is a common mistake that silently degrades search quality.
3. Prompt Design for Grounded Answers
The generation prompt explicitly constrains the LLM:
prompt = """You are a Japanese subsidy advisor.
Answer based on the reference information below.
If information is insufficient, state that explicitly.
For critical details like deadlines and amounts, be precise."""
This reduces hallucination by telling the model to stick to the provided context.
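The post doesn't show how the retrieved chunks are stitched into the prompt, so here is one plausible assembly. `build_prompt`, the numbered-source format, and the section headers are my assumptions, not the actual implementation:

```python
SYSTEM = """You are a Japanese subsidy advisor.
Answer based on the reference information below.
If information is insufficient, state that explicitly.
For critical details like deadlines and amounts, be precise."""

def build_prompt(user_query: str, retrieved: list[dict]) -> str:
    # Number each retrieved chunk so the answer can cite its sources ([1], [2], ...)
    context = "\n\n".join(
        f"[{i}] {doc['title']}\n{doc['text']}" for i, doc in enumerate(retrieved, 1)
    )
    return (
        f"{SYSTEM}\n\n"
        f"# Reference information\n{context}\n\n"
        f"# Question\n{user_query}"
    )

prompt = build_prompt(
    "会計ソフトを導入したい",
    [{"title": "IT導入補助金", "text": "会計ソフト等のITツールが対象。補助率は1/2。"}],
)
```

The resulting string would then go to the Claude API as the user message; keeping the instructions and the numbered context in one place makes it easy to show sources alongside the generated answer.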
Evaluation: Measuring What Matters
Building a RAG that "seems to work" isn't enough. I created a test suite with 10 queries, each with an expected correct subsidy:
| Query | Expected Result | Rank |
|---|---|---|
| "Want to introduce accounting software" | Digital/AI Subsidy | 1st |
| "Want to install new manufacturing equipment" | Manufacturing Subsidy | 2nd |
| "Small shop wanting to attract customers via flyers" | Small Business Sustainability Subsidy | 1st |
| "Restaurant wanting to start an e-commerce (EC) business" | New Business Advancement Subsidy | 3rd |
| "Want to convert part-timers to full-time" | Career Up Grant | 1st |
| ... | ... | ... |
Results:
| Metric | Score |
|---|---|
| Hit Rate @5 | 100% (10/10) |
| MRR @5 | 0.817 |
- Hit Rate @5: The correct answer appeared in the top 5 results for every query
- MRR (Mean Reciprocal Rank): On average, the correct answer ranked between 1st and 2nd place
Why These Metrics?
- Hit Rate tells you "can the system find the right answer at all?"
- MRR tells you "how quickly does it find it?" (1st place = 1.0, 2nd = 0.5, 3rd = 0.33)
Both are standard IR (Information Retrieval) metrics. Having these numbers lets you objectively compare different chunking strategies, embedding models, or retrieval parameters.
What I Learned
1. Chunking strategy matters more than the embedding model
Switching from fixed-size to sentence-boundary chunking had a bigger impact on retrieval quality than I expected. The embedding model can only work with what it's given — if chunks are semantically broken, even the best model can't fix that.
2. Evaluation should come before optimization
Building the evaluation script early gave me a baseline to measure improvements against. Without it, I would have been tuning blindly.
3. Start simple, verify, then add complexity
The entire system is ~200 lines of Python (excluding tests). No frameworks, no complex abstractions. This made debugging straightforward — when something went wrong, there were very few places to look.
What's Next
- Reranker: Add a cross-encoder reranker to improve ranking quality (especially for Q4 and Q9 which ranked 3rd)
- Hybrid search: Combine vector similarity with keyword matching (BM25) for better recall
- More data: Scale from 15 to 100+ subsidies with automated collection
- Production frontend: Migrate from Streamlit to Next.js + FastAPI for a proper web application
If you're building a RAG system for the first time, I'd recommend the same approach: build from scratch first, measure with real metrics, then decide if you need a framework. The understanding you gain is worth far more than the time saved by a framework.