gibs-dev
I built a RAG system where hallucinations aren't acceptable. Here's what actually worked.

I built a regulatory compliance API for EU law in 5 weeks of evenings. Here's what I learned about RAG, citations, and why nobody panics until it's too late.

I work in construction during the day. At night, I build software.

Five weeks ago, I started building Gibs — an API that answers questions about EU regulations (DORA, AI Act, GDPR) and returns the answer with a direct citation to the specific EUR-Lex article. Grounded answers with source citations. No "this is not legal advice" disclaimer walls. Just: question in, cited answer out.

This post is about the technical decisions, the things that broke, and what I learned about building a RAG system where correctness actually matters.

The problem

The EU AI Act entered into force in August 2024, but the high-risk AI system rules (Annex III) don't apply until August 2026. DORA (the Digital Operational Resilience Act) has been live since January 2025. GDPR has been around since 2018, and people still can't get straight answers about Article 22.

If you're a fintech building an AI feature, or a municipality deploying a chatbot, you need to know: What are my obligations? Which articles apply? What's the deadline?

Today, the answer is: hire a lawyer, or spend three days reading EUR-Lex. I wanted to build a third option.

Architecture: the boring version

User question
  → Classifier (which regulation?)
  → Qdrant vector search (relevant chunks)
  → Reranker (sort by relevance)
  → LLM synthesis (answer + citations)
  → Response with EUR-Lex article references

Nothing revolutionary. Standard RAG pipeline. The interesting parts are in the details.
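Stripped of the infrastructure, the control flow is a linear chain. Here is a minimal sketch of that orchestration; every stage below is a stub (the real system uses Qdrant, Cohere, and an LLM), and all names are illustrative:

```python
# Minimal sketch of the pipeline above. Each stage is a stub;
# only the orchestration shape matches the real system.

def classify(question: str) -> str:
    # Stub router; Lesson 2 is about getting this step right
    return "dora" if "ict" in question.lower() else "ai_act"

def retrieve_and_rerank(regulation: str, question: str) -> list[dict]:
    # Stub for Qdrant vector search followed by Cohere rerank
    return [{"text": "...", "article": "Article 19", "regulation": regulation}]

def synthesize(question: str, chunks: list[dict]) -> dict:
    # Stub for LLM synthesis that returns EUR-Lex citations
    return {"answer": "...", "citations": [c["article"] for c in chunks]}

def answer(question: str) -> dict:
    regulation = classify(question)
    chunks = retrieve_and_rerank(regulation, question)
    return synthesize(question, chunks)
```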

Lesson 1: Chunking legal text is harder than you think

EU regulations aren't blog posts. A single DORA article can reference three other articles, two delegated acts, and an implementing technical standard. If you chunk naively by paragraph, you lose the cross-references that make the answer correct.

I ended up building a custom parser that:

  • Preserves article boundaries (never splits mid-article)
  • Attaches metadata: regulation name, article number, section, EUR-Lex URL
  • Handles delegated acts as separate documents with parent references
  • Builds a cross-reference graph — over 3,600 edges mapping which articles reference which other articles, across regulations

The graph matters because legal text is relational. DORA Article 19 references Article 20, which references a delegated act, which supplements the base regulation. If your retrieval only returns Article 19, the answer is incomplete. The graph lets the pipeline follow those edges and pull in related chunks automatically.
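Following those edges can be as simple as a bounded graph walk after retrieval. A sketch, with illustrative edge data (the real graph has over 3,600 edges):

```python
# Sketch of cross-reference expansion: after vector search returns a
# set of articles, follow graph edges to pull in the articles they
# reference. Nodes are (document, article) pairs; edges are made up
# here for illustration.

CROSS_REFS = {
    ("DORA", "Art. 19"): [("DORA", "Art. 20")],
    ("DORA", "Art. 20"): [("DA-2024-1774", "Art. 3")],  # delegated act
}

def expand_with_references(hits: set, max_hops: int = 1) -> set:
    """Add graph neighbors of retrieved articles, up to max_hops away."""
    seen = set(hits)
    frontier = list(hits)
    for _ in range(max_hops):
        frontier = [ref for node in frontier
                    for ref in CROSS_REFS.get(node, [])
                    if ref not in seen]
        seen.update(frontier)
    return seen

expanded = expand_with_references({("DORA", "Art. 19")}, max_hops=2)
# Pulls in Art. 20 and, one hop further, the delegated act it references
```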

The DORA corpus alone is 641 chunks across 12 delegated acts. Getting the metadata right took longer than building the retrieval pipeline.

Lesson 2: Classification routing matters more than embedding quality

My first eval scored 79.9% on DORA questions. Terrible.

The problem wasn't the embeddings or the LLM. It was routing. A question like "What are the RTS requirements for threat-led penetration testing?" doesn't contain the word "DORA" — so my classifier sent it to the AI Act collection and retrieved completely wrong chunks.

The fix was expanding the classifier's keyword patterns:

# Before: only matched "DORA" literally
# After:
DORA_PATTERNS = [
    r'DORA', r'digital\s+operational\s+resilience',
    r'ICT\s+(risk|incident)', r'TLPT',
    r'financial\s+entit', r'RTS\s+on\s+(ict|incident|tlpt)',
    r'delegated\s+act'
]

That single change took accuracy from 79.9% to 90.8%. The lesson: in domain-specific RAG, routing is your biggest lever. Spend your time there, not on prompt engineering.
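Wired into a router, pattern lists like the one above become a first-match lookup. A sketch, where the collection names, the non-DORA patterns, and the fallback are my own illustrations rather than the production classifier:

```python
import re

# Sketch of keyword-pattern routing. The first collection whose
# patterns match wins; everything here besides the DORA patterns
# quoted in the post is illustrative.

COLLECTION_PATTERNS = {
    "dora": [
        r'DORA', r'digital\s+operational\s+resilience',
        r'ICT\s+(risk|incident)', r'TLPT',
        r'financial\s+entit', r'RTS\s+on\s+(ict|incident|tlpt)',
    ],
    "ai_act": [r'AI\s+Act', r'high-risk\s+AI', r'annex\s+iii'],
}

def route(question: str) -> str:
    """Return the first collection whose patterns match, else a default."""
    for collection, patterns in COLLECTION_PATTERNS.items():
        if any(re.search(p, question, re.IGNORECASE) for p in patterns):
            return collection
    return "gdpr"  # illustrative fallback

route("What are ICT incident reporting deadlines?")  # matched by the ICT pattern
```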

Lesson 3: Eval is everything

I built a golden dataset of 140 questions across DORA, AI Act, and GDPR. Each question has:

  • Expected answer keywords (answer_must_contain)
  • Expected source articles (sources_must_contain)
  • Whether the system should abstain (question outside scope)

The eval runner scores every response automatically and breaks down accuracy by category: objective, cross_reference, delegated_act, adversarial, real_user, negative, and should_abstain.
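To make the three fields concrete, here are two made-up entries in the shape described above; the real dataset format and contents may differ:

```python
# Illustrative golden-dataset entries (not from the real dataset):
# one in-scope question and one that the system should refuse.

GOLDEN = [
    {
        "question": "Which DORA article covers ICT incident classification?",
        "answer_must_contain": ["Article 18"],
        "sources_must_contain": ["Regulation (EU) 2022/2554"],
        "should_abstain": False,
        "category": "objective",
    },
    {
        "question": "What's the best CRM for startups?",
        "answer_must_contain": [],
        "sources_must_contain": [],
        "should_abstain": True,
        "category": "should_abstain",
    },
]
```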

This caught bugs I never would have found manually. For example: my scorer was case-sensitive, so "Article 6" matched but "article 6" didn't. Tiny bug, 3% accuracy hit.
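The fix is trivial once you see it: normalize both sides before matching. A sketch of the corrected keyword check (the function name is mine, not the real scorer's):

```python
# Keyword scorer with the case-sensitivity bug fixed: lowercase both
# the answer and the expected keywords before substring matching.

def keywords_present(answer: str, must_contain: list[str]) -> bool:
    text = answer.lower()
    return all(kw.lower() in text for kw in must_contain)

keywords_present("See article 6 of the AI Act.", ["Article 6"])
# True: "article 6" now matches despite the case difference
```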

python eval_runner.py --regulation dora --delay 7000 --retry

If you're building any RAG system, build eval first. I'm serious. Before you optimize anything, know how to measure it.

Lesson 4: Citations are the product

I could have built a chatbot that answers compliance questions. There are dozens of those. What makes Gibs different is that every claim in the response maps to a specific article, and you can click through to EUR-Lex and verify it.

This is non-negotiable for compliance use cases. A lawyer will never trust "According to EU regulations, you need to do X." They will trust "According to Article 6(1) of Regulation (EU) 2024/1689, deployers of high-risk AI systems shall..." with a link.

The synthesis prompt forces the model to cite specific articles and cross-reference the source chunks. If a claim can't be grounded in the retrieved text, the system says so instead of hallucinating.
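One way to enforce that mechanically is a post-synthesis grounding check: reject any answer citing an article that never appeared in the retrieved chunks. A sketch, where the citation regex and chunk shape are assumptions, not the production code:

```python
import re

# Sketch of a grounding check: every article the model cites must
# appear in the metadata of a retrieved chunk. The "Article N(M)"
# citation format is an assumption for illustration.

def ungrounded_citations(answer: str, chunks: list[dict]) -> set[str]:
    """Return cited article numbers with no matching retrieved chunk."""
    cited = set(re.findall(r'Article\s+\d+(?:\(\d+\))?', answer))
    retrieved = {c["article"] for c in chunks}
    return cited - retrieved

chunks = [{"article": "Article 6(1)", "text": "..."}]
ungrounded_citations("Per Article 6(1), deployers shall...", chunks)
# Empty set: the citation is grounded in a retrieved chunk
```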

Lesson 5: Abstention is a feature

90% of compliance questions are in-scope. The other 10% are questions like "What's the best CRM for startups?" — and if your system confidently answers those with made-up regulatory citations, you've lost all trust.

Gibs has an abstention threshold. If the retrieved chunks aren't relevant enough, the system says "This question is outside the scope of the indexed regulations." My abstention accuracy is 96.7% — meaning it almost never hallucinates an answer for out-of-scope questions.
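The mechanism can be as simple as a cutoff on the reranker's relevance scores. A sketch, where the 0.4 threshold and the score field are illustrative, not the production values:

```python
# Sketch of threshold-based abstention on reranker scores. The
# threshold value here is illustrative only.

ABSTAIN_MESSAGE = "This question is outside the scope of the indexed regulations."

def maybe_abstain(reranked: list[dict], threshold: float = 0.4):
    """Abstain when even the best-scoring chunk falls below the threshold."""
    if not reranked or max(c["score"] for c in reranked) < threshold:
        return ABSTAIN_MESSAGE
    return None  # proceed to synthesis

maybe_abstain([{"score": 0.12}, {"score": 0.08}])  # abstains
```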

The stack

  • Vector DB: Qdrant (one collection per regulation — strict isolation)
  • Embeddings + Reranker: Cohere embed-v3 + rerank-v3.5
  • Synthesis: LLM via API (multi-pass: decompose → retrieve → synthesize → verify)
  • Framework: Python + FastAPI + LangGraph
  • Hosting: Self-hosted Docker on a mini PC with 8GB RAM
  • Auth: API key + Stripe for billing
  • SDKs: Python and TypeScript published on PyPI/npm
  • MCP Server: For AI assistant integration (Cursor, Windsurf, etc.)

Total monthly infrastructure cost: roughly what you'd spend on a nice dinner.

What I'd do differently

  1. Start with eval, not with the product. I built the pipeline first and eval second. Should have been the other way around.
  2. Chunk smaller, retrieve more. My initial chunks were too large. Smaller chunks, a higher retrieval count, and heavier reliance on the reranker would have worked better from the start.
  3. Don't underestimate metadata. The EUR-Lex URL, the article number, the regulation name — that metadata is what makes the citations work. It took 40% of my time and it was worth every hour.

Current status

  • DORA: 90.8% accuracy, all 12 delegated acts indexed
  • AI Act: 88% accuracy, all articles + annexes + recitals
  • SDKs: Python and TypeScript packages published on PyPI and npm
  • MCP server: built and listed
  • GDPR: Live, eval in progress
  • API: Live at api.gibs.dev
  • Free tier: 50 requests/month, no credit card

I'm looking for beta users — especially compliance consultants, fintechs dealing with DORA, or anyone building AI systems that need EU AI Act classification.

If you want to try it: gibs.dev

If you want to see the API docs: docs.gibs.dev

If you want to talk about RAG for legal text, citations, or building dev tools while working a day job in construction — I'm in the comments.


Gibs is a developer tool for regulatory research — not a substitute for qualified legal advice.

Built by one person, evenings and weekends, with a lot of coffee and a somewhat irresponsible sleep schedule.
