DEV Community

Daichi Koga

Building a RAG-Based Subsidy Matching System from Scratch with Python

What I Built

A RAG (Retrieval-Augmented Generation) system that helps Japanese small businesses find government subsidies. Users describe their business situation, and the system retrieves relevant subsidies from a vector database, then generates a detailed answer using Claude API.

Why RAG Instead of Just Using an LLM?

LLMs alone have three problems for this use case:

  1. Hallucination — They confidently make up subsidy details that don't exist
  2. Stale data — Subsidy information updates frequently; LLM training data can't keep up
  3. No sources — Users need to verify the information, but LLMs don't cite where it came from

RAG solves all three by retrieving real data first, then passing it to the LLM as context. The LLM generates answers grounded in actual documents, not its training data.

Architecture

User Query
    │
    ▼
┌─────────────┐     ┌──────────────┐
│  Retriever  │────▶│  Generator   │
│ (embed +    │     │ (Claude API) │
│  search)    │     └──────┬───────┘
└──────┬──────┘            │
       │                   ▼
┌──────▼──────┐      AI Answer
│  ChromaDB   │      with sources
│ (Vector DB) │
└─────────────┘

The data ingestion pipeline:

JSON Data → Sentence Chunker → Embedding Model → ChromaDB
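Concretely, the ingestion step looks roughly like this. This is a minimal sketch, not the actual code: the record shape, chunk ids, persistence path, and collection name are my assumptions, and the heavy imports are deferred into the function so the helper runs standalone.

```python
def to_passages(records):
    """Flatten pre-chunked records into ids, prefixed passages, and metadata.
    Each record is assumed to look like {"id": ..., "chunks": [...]}."""
    ids, passages, metadatas = [], [], []
    for rec in records:
        for i, chunk in enumerate(rec["chunks"]):
            ids.append(f"{rec['id']}-{i}")
            passages.append(f"passage: {chunk}")  # prefix required by multilingual-e5
            metadatas.append({"subsidy_id": rec["id"]})
    return ids, passages, metadatas


def ingest(records, collection_name="subsidies"):
    """Embed every chunk and persist it in a local ChromaDB collection."""
    # Heavy imports are deferred so to_passages stays usable on its own.
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/multilingual-e5-large")
    client = chromadb.PersistentClient(path="./chroma_db")  # assumed path
    collection = client.get_or_create_collection(collection_name)

    ids, passages, metadatas = to_passages(records)
    embeddings = model.encode(passages, normalize_embeddings=True)
    collection.add(ids=ids, embeddings=embeddings.tolist(),
                   documents=passages, metadatas=metadatas)
```

The "passage: " prefix is explained in section 2 below; baking it in at ingestion time means you can't forget it later.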

Tech Stack and Why

| Layer | Choice | Why |
| --- | --- | --- |
| Embedding | multilingual-e5-large | Best multilingual performance for free. Runs locally — no data leaves the machine |
| Vector DB | ChromaDB | Zero config. pip install and it works with local persistence |
| LLM | Claude API | Strong long-context handling and Japanese language quality |
| Frontend | Streamlit | Full UI in ~40 lines of Python |

Why Not LangChain/LlamaIndex?

I intentionally built the pipeline from scratch first. Frameworks abstract away the internals, which is great for production but bad for learning. By writing each step manually — chunking, embedding, storing, retrieving, prompting — I understood exactly what each piece does and why it matters.

Key Implementation Details

1. Sentence-Boundary Chunking

The most impactful decision was how to split documents into chunks.

Naive approach (fixed-size split):

...eligible IT tools include accounting software, ord  ← cut mid-sentence
er management software, payment software...           ← next chunk

My approach (split at Japanese sentence boundaries, 「。」):

sentences = text.split("。")
# Greedily pack sentences into chunks up to CHUNK_SIZE
# Overlap the tail for context continuity

This preserves semantic meaning in each chunk, which directly improves retrieval accuracy.
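A minimal, runnable version of this chunker might look like the following. The CHUNK_SIZE value and the one-sentence overlap are illustrative assumptions, not the post's actual parameters.

```python
CHUNK_SIZE = 400  # max characters per chunk (assumed value)


def chunk_by_sentence(text, chunk_size=CHUNK_SIZE, overlap=1):
    """Split on Japanese sentence boundaries, then greedily pack sentences
    into chunks of at most chunk_size characters, repeating the last
    `overlap` sentences at the start of the next chunk for continuity."""
    # Re-append the delimiter so each sentence stays a complete sentence.
    sentences = [s + "。" for s in text.split("。") if s]
    chunks, current = [], []
    for sentence in sentences:
        if current and len("".join(current)) + len(sentence) > chunk_size:
            chunks.append("".join(current))
            current = current[-overlap:]  # carry the tail into the next chunk
        current.append(sentence)
    if current:
        chunks.append("".join(current))
    return chunks
```

Re-appending 「。」 keeps every chunk made of complete sentences, which is exactly the property the fixed-size splitter destroys.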

2. Embedding with Prefix

multilingual-e5 requires specific prefixes to distinguish between stored passages and search queries:

# When indexing documents
prefixed = [f"passage: {text}" for text in texts]

# When searching
query_embedding = model.encode(f"query: {user_query}")

Missing this prefix is a common mistake that silently degrades search quality.

3. Prompt Design for Grounded Answers

The generation prompt explicitly constrains the LLM:

prompt = """You are a Japanese subsidy advisor.
Answer based on the reference information below.
If information is insufficient, state that explicitly.
For critical details like deadlines and amounts, be precise."""

This reduces hallucination by telling the model to stick to the provided context.
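Wiring that prompt into the generation call might look like this. A sketch, not the actual implementation: the numbered-context format, the model id, and max_tokens are my assumptions, and the Anthropic SDK import is deferred so the prompt builder runs standalone.

```python
SYSTEM_PROMPT = """You are a Japanese subsidy advisor.
Answer based on the reference information below.
If information is insufficient, state that explicitly.
For critical details like deadlines and amounts, be precise."""


def build_prompt(question, chunks):
    """Number the retrieved chunks so the answer can point back at its sources."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (f"Reference information:\n{context}\n\n"
            f"Question: {question}\n"
            "Cite the numbered sources you relied on.")


def answer(question, chunks):
    """Send the grounded prompt to Claude and return the generated text."""
    import anthropic  # deferred so build_prompt runs without the SDK installed

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model id
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": build_prompt(question, chunks)}],
    )
    return response.content[0].text
```

Numbering the chunks is one cheap way to get citations: the model can only point at sources it was actually given.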

Evaluation: Measuring What Matters

Building a RAG that "seems to work" isn't enough. I created a test suite with 10 queries, each with an expected correct subsidy:

| Query | Expected Result | Rank |
| --- | --- | --- |
| "Want to introduce accounting software" | Digital/AI Subsidy | 1st |
| "Want to install new manufacturing equipment" | Manufacturing Subsidy | 2nd |
| "Small shop wanting to attract customers via flyers" | Small Business Sustainability Subsidy | 1st |
| "Restaurant wanting to start EC business" | New Business Advancement Subsidy | 3rd |
| "Want to convert part-timers to full-time" | Career Up Grant | 1st |
| ... | ... | ... |

Results:

| Metric | Score |
| --- | --- |
| Hit Rate @5 | 100% (10/10) |
| MRR @5 | 0.817 |

  • Hit Rate @5: The correct answer appeared in the top 5 results for every query
  • MRR (Mean Reciprocal Rank): On average, the correct answer ranked between 1st and 2nd place

Why These Metrics?

  • Hit Rate tells you "can the system find the right answer at all?"
  • MRR tells you "how quickly does it find it?" (1st place = 1.0, 2nd = 0.5, 3rd = 0.33)

Both are standard IR (Information Retrieval) metrics. Having these numbers lets you objectively compare different chunking strategies, embedding models, or retrieval parameters.
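Both metrics are only a few lines of Python. A sketch, assuming each query's result has been reduced to the 1-based rank of the correct subsidy (None for a miss):

```python
def hit_rate_at_k(ranks, k=5):
    """Share of queries whose correct subsidy appeared in the top k.
    ranks holds the 1-based rank of the correct answer per query,
    or None when it was not retrieved at all."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr_at_k(ranks, k=5):
    """Mean reciprocal rank: 1st place contributes 1.0, 2nd 0.5, 3rd 1/3;
    anything missing or outside the top k contributes 0."""
    return sum(1 / r for r in ranks if r is not None and r <= k) / len(ranks)
```

With ranks like [1, 2, 1, 3, 1], hit rate @5 is 1.0 and MRR is about 0.767.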

What I Learned

1. Chunking strategy matters more than the embedding model

Switching from fixed-size to sentence-boundary chunking had a bigger impact on retrieval quality than I expected. The embedding model can only work with what it's given — if chunks are semantically broken, even the best model can't fix that.

2. Evaluation should come before optimization

Building the evaluation script early gave me a baseline to measure improvements against. Without it, I would have been tuning blindly.

3. Start simple, verify, then add complexity

The entire system is ~200 lines of Python (excluding tests). No frameworks, no complex abstractions. This made debugging straightforward — when something went wrong, there were very few places to look.

What's Next

  • Reranker: Add a cross-encoder reranker to improve ranking quality (especially for Q4 and Q9 which ranked 3rd)
  • Hybrid search: Combine vector similarity with keyword matching (BM25) for better recall
  • More data: Scale from 15 to 100+ subsidies with automated collection
  • Production frontend: Migrate from Streamlit to Next.js + FastAPI for a proper web application
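For the hybrid-search item, reciprocal rank fusion (RRF) is one common way to merge a vector ranking with a BM25 ranking without having to reconcile their score scales. A sketch with hypothetical rankings and subsidy ids; k=60 is the conventional RRF constant.

```python
def rrf_merge(rankings, k=60):
    """Merge ranked lists of document ids with reciprocal rank fusion:
    each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings from the two retrievers:
vector_hits = ["it-subsidy", "career-up", "monodukuri"]
bm25_hits = ["it-subsidy", "monodukuri", "jizokuka"]
merged = rrf_merge([vector_hits, bm25_hits])
# "it-subsidy" wins because it leads both lists; "monodukuri" climbs
# past "career-up" because two mid-rank appearances beat one.
```

Because RRF only looks at ranks, it needs no score normalization and no tuning beyond k, which makes it a low-risk first step toward hybrid search.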

If you're building a RAG system for the first time, I'd recommend the same approach: build from scratch first, measure with real metrics, then decide if you need a framework. The understanding you gain is worth far more than the time saved by a framework.
