Want to skip the theory and launch a local RAG benchmark in Docker right now? Check out the repo.
1. Introduction: Breaking the Infrastructure Barrier
In my previous article, we prepped our "shuttle" for launch by containerizing the Meta CRAG infrastructure. It gave us a standardized environment, but we were still tethered to one expensive "ground control" dependency.
The original benchmark baselines are resource-hungry. They expect:
- A paid OpenAI API key for final judging.
- GPU (CUDA) clusters to run inference via vLLM.
Developing a RAG system under these constraints feels like ordering expensive parts by mail when you already have the tools in your garage. You spend your budget on "shipping" (API tokens) and wait for external servers to reply, even though you have plenty of local horsepower sitting idle.
What if you could launch the rocket from your own spaceport? Right on your laptop, with zero cost per request and total autonomy. We’re swapping external APIs for local inference using Ollama and Ray.
2. Architecture: The OpenAI-Compatible Interface
The biggest headache with academic benchmarks is their rigid stack. Meta CRAG expects either vLLM or OpenAI by default. Rewriting the core evaluation logic is a recipe for bugs and broken metrics.
Instead, we’ll take the engineering shortcut:
We implemented a RAGOpenAICompatibleModel class. It uses the standard openai library but "hijacks" the data flow via the base_url setting. This lets us point the benchmark at a local Ollama instance without changing the core evaluation logic.
Why this matters: this gives us hot-swappable brains. Want to test Llama 3? Just change the model key. Want to compare it against Qwen or Gemma? A quick export in your terminal and a few lines in the configuration file are all it takes.
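To make the idea concrete, here is a minimal sketch of that adapter pattern, assuming Ollama is serving its OpenAI-compatible endpoint on the default port (11434). The environment variable names and the `generate_answer` helper are illustrative, not the exact identifiers from the repo.

```python
# Minimal sketch: reuse the standard openai client, but point it at Ollama.
# OLLAMA_BASE_URL / OLLAMA_MODEL are illustrative names, not the repo's exact config keys.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434/v1"),  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",  # Ollama ignores the key, but the client requires a non-empty string
)

def generate_answer(prompt: str) -> str:
    """Send one benchmark prompt to the local model and return its answer."""
    response = client.chat.completions.create(
        model=os.getenv("OLLAMA_MODEL", "llama3:8b"),  # hot-swap models via an env var
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # keep answers deterministic for evaluation runs
    )
    return response.choices[0].message.content
```

With a setup like this, swapping Llama 3 for Qwen or Gemma really is a one-line export (e.g. `export OLLAMA_MODEL=qwen2.5:7b` in this sketch), with no changes to the evaluation code.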
3. Tuning the "Onboard Systems": Ray and HTML Cleanup
In the cloud, you pay for convenience: you can feed raw HTML to the LLM and hope it figures it out. In a local spaceport, resources are finite. Every extra token is dead weight (ballast).
🛠 Parallelism via Ray
Processing hundreds of HTML pages for every question is heavy. We use Ray to distribute the load: while the GPU is busy generating an answer, the idle CPU cores are "scrubbing" data for the next batch in the background.
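As a rough sketch of that overlap (not the repo's exact code): CPU workers clean the next question's pages while the main process waits on the GPU-bound generation call. The `clean_page` and `answer_one_question` names are illustrative.

```python
# Sketch: overlap CPU-side HTML scrubbing (Ray workers) with GPU-side generation.
# Names like clean_page / answer_one_question are illustrative, not the repo's API.
import ray
from bs4 import BeautifulSoup

ray.init(ignore_reinit_error=True)

@ray.remote
def clean_page(raw_html: str) -> str:
    """Strip markup on a CPU worker (the full cleanup pipeline is sketched in the next section)."""
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

def run_benchmark(batches, answer_one_question):
    """batches: list of (question, [raw_html, ...]) pairs; answer_one_question is the GPU-bound call."""
    # Prefetch: start cleaning the first question's pages right away.
    pending = [clean_page.remote(html) for html in batches[0][1]]
    for i, (question, _) in enumerate(batches):
        context = ray.get(pending)  # cleaned pages for the current question
        if i + 1 < len(batches):
            # Submit the next question's cleanup now, so CPU cores scrub HTML
            # in the background while the GPU generates the current answer.
            pending = [clean_page.remote(html) for html in batches[i + 1][1]]
        yield answer_one_question(question, context)
```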
🧹 The "Space Junk" Filter
Using BeautifulSoup to strip tags is a survival requirement: local models with 8k context windows quickly "suffocate" under endless <div> and <script> tags. The cleanup pipeline (sketched after the list below) is simple:
- We clean the HTML.
- Split text into sentences.
- Cap snippets at 1000 characters.
Result: We fit significantly more useful info into the context, boosting accuracy without needing massive model weights.
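Here is a rough sketch of those three steps, assuming BeautifulSoup and a naive regex sentence splitter. The exact helpers and thresholds in the repo may differ; only the 1000-character cap comes from the list above.

```python
# Sketch of the "space junk" filter: strip tags, split into sentences,
# cap snippets at 1000 characters. Helper names here are illustrative.
import re
from bs4 import BeautifulSoup

MAX_SNIPPET_CHARS = 1000  # cap from the list above

def html_to_snippets(raw_html: str) -> list[str]:
    # 1. Drop the junk: scripts, styles, and all markup.
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    text = soup.get_text(separator=" ", strip=True)

    # 2. Naive split on end-of-sentence punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text)

    # 3. Pack sentences into snippets no longer than the cap.
    snippets, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > MAX_SNIPPET_CHARS:
            snippets.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        snippets.append(current)
    return [s[:MAX_SNIPPET_CHARS] for s in snippets]  # hard cap for extra-long sentences
```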
4. Field Testing: Real Metrics
We picked three popular models to see how they handle a real-world ("combat") RAG scenario.
| Model | Accuracy (Correct) | Hallucination | Missing ("I don't know") | Final Score (Correct − Hallucination) |
|---|---|---|---|---|
| Gemma-2-9B | 25% | 20% | 55% | 0.05 |
| Llama-3-8B | 15% | 30% | 55% | -0.15 |
| Qwen-2.5-7B | 0% | 100% | 0% | -1.00 |
Post-Mortem: Why did Qwen crash? 💥
Qwen’s results look catastrophic, but this is a huge engineering lesson. It didn't fail because it was "stupid"—it failed because it violated the protocol.
Typical Qwen output:
"
<think>Okay, let's see. The user is asking about the producers... I need to check the references..."
The model started "thinking out loud" inside a <think> tag, ignoring the instruction to answer succinctly. In CRAG, any text that isn't the direct answer is flagged as a Hallucination.
The Takeaway: Models with forced Chain-of-Thought (CoT) need heavy post-processing (stripping the reasoning tags) or surgically precise prompting to keep them from turning a short answer into a philosophical essay.
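For reference, post-processing along these lines can be as small as a regex filter. This is a hedged example, not the benchmark's own code; the <think> tag pattern is an assumption based on the Qwen output shown above.

```python
# Strip <think>...</think> reasoning blocks before the answer reaches the judge.
# The tag name and regexes are assumptions based on Qwen-style output, not CRAG code.
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL | re.IGNORECASE)
DANGLING_THINK = re.compile(r"<think>.*", flags=re.DOTALL | re.IGNORECASE)

def strip_reasoning(raw_answer: str) -> str:
    """Remove chain-of-thought blocks and return only the final answer."""
    cleaned = THINK_BLOCK.sub("", raw_answer)
    # Also drop an unclosed <think> block (the failure mode shown above),
    # so stray reasoning text is never judged as the answer.
    cleaned = DANGLING_THINK.sub("", cleaned)
    return cleaned.strip()
```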
5. Try it Yourself: Code on GitHub
Stop reading and start launching. I’ve prepped a repository with:
- Docker configs for easy deployment.
- Ollama adapters for local inference.
- Ray scripts for high-speed HTML cleaning.
🚀 Project Repo: astronaut27/CRAG_with_Docker
6. Conclusion: Autonomy Achieved
We’ve proven that you don’t need a corporate budget to do serious RAG engineering.
Our Results:
- Reproducibility: Run the benchmark with a single command.
- Cost: Exactly $0 per iteration.
- Security: Your data never leaves your "space station."
Local evaluation is about building an honest development process where every change is backed by numbers, not just gut feeling.
7. Next Mission: RAGas vs. CRAG
Our spaceport is fully operational. But how does our local ground truth compare to popular metrics like RAGas? In the next post, we’ll pit "RAGas" against the hard facts of CRAG.
See you in orbit! 👨🚀✨
Top comments (2)
Solid post. The HTML cleanup tip is gold — I’ve lost so many tokens to junk tags before. Also appreciate the honesty about the low scores; CRAG is brutal. Will test with Mistral-Nemo next. Thanks for open-sourcing!
Solid work getting CRAG fully local; the OpenAI/vLLM dependency was killing reproducibility for most people.
The Qwen trap is super common in Ollama; a strict "no tags, no reasoning" system prompt + temp=0.1 usually recovers 10–15% Correct without extra post-processing.
Have you tried Trafilatura before BeautifulSoup? It often does a much better job of removing boilerplate and gives a noticeable lift on Task 3.
Repo looks clean, thanks for sharing a practical local setup!