Every ML system is like a spacecraft: powerful, intricate, and temperamental.
But without telemetry, you have no idea where it's headed.
Introduction
The CRAG (Comprehensive RAG Benchmark) from Meta AI is the control panel for Retrieval-Augmented Generation systems.
It measures how well model responses stay grounded in facts, remain robust under noise, and maintain contextual relevance.
As is often the case with research projects, CRAG required engineering adaptation to operate reliably in a modern environment:
incompatible library versions, dependency conflicts, unclear paths, and manual launch steps.
I wanted to bring CRAG to a state where it could be launched with a single command: no dependency chaos, no manual fixes.
The result is a fully reproducible Dockerized environment, available here:
github.com/astronaut27/CRAG_with_Docker
What I Improved
In the original build, several issues made CRAG difficult to run:
- Conflicting library versions;
- An incorrect PYTHONPATH that broke the mock-API launch;
- No unified, reproducible start-up workflow.
Now, everything comes to life with a single command:
docker-compose up --build
After building, two containers start automatically:
- mock-api: an emulator for the web search and Knowledge Graph APIs;
- crag-app: the main container with the benchmark and built-in baseline models (a quick reachability check follows below).
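To confirm lift-off, you can ping the mock API from the host. The snippet below is a minimal smoke test; it assumes only the 8000:8000 port mapping from the compose file shown further down and no particular endpoint path, so any HTTP response at all counts as a sign of life.

# smoke_test.py - quick check that the mock-api container is reachable.
# Assumes only the 8000:8000 port mapping from docker-compose.yml; no specific
# endpoint path is assumed, so even a 404 counts as "alive".
import urllib.error
import urllib.request

MOCK_API_URL = "http://localhost:8000/"

try:
    with urllib.request.urlopen(MOCK_API_URL, timeout=5) as resp:
        print(f"mock-api is up (HTTP {resp.status})")
except urllib.error.HTTPError as exc:
    # The server answered, just not with a 2xx status - still proof of life.
    print(f"mock-api is up (HTTP {exc.code})")
except urllib.error.URLError as exc:
    print(f"mock-api is not reachable: {exc.reason}")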
Pre-Launch Preparation: Handling the Mission Artifacts
Before firing up the Docker build, make sure all mission artifacts (the large data and model files) are present locally.
Because CRAG includes files over 100 MB, it uses Git Large File Storage (LFS). Without these files, your containers won't initialize.
So the first command in your console is essentially fueling the ship with data:
git lfs pull
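If you want to verify that the pull actually replaced the LFS pointer stubs with real files, a small check like the one below helps. It is my own helper, not part of the benchmark; it relies only on the standard Git LFS pointer header and scans the data/ directory that is later mounted into the crag-app container.

# check_lfs.py - verify that Git LFS files were pulled, not left as pointer stubs.
# This helper is not part of CRAG; it only relies on the standard LFS pointer header.
from pathlib import Path

LFS_POINTER_HEADER = b"version https://git-lfs.github.com/spec/v1"

def find_pointer_stubs(root: str = "data") -> list[Path]:
    """Return files under `root` that still look like LFS pointer stubs."""
    stubs = []
    for path in Path(root).rglob("*"):
        # Pointer files are tiny text files, so skip anything larger.
        if path.is_file() and path.stat().st_size < 200:
            if path.read_bytes().startswith(LFS_POINTER_HEADER):
                stubs.append(path)
    return stubs

if __name__ == "__main__":
    stubs = find_pointer_stubs()
    if stubs:
        print("Run `git lfs pull` first - these files are still pointer stubs:")
        for p in stubs:
            print(f"  {p}")
    else:
        print("All good: no LFS pointer stubs found under data/.")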
How It Works
CRAG in Autonomous Mode
- mock-api: simulates the external data sources (web search, KG API) used by the RAG system.
- crag-app: the main container running the benchmark and the model used for response generation (a dummy model at this stage).
- local_evaluation.py: coordinates the pipeline, calls the mock API, and handles metric evaluation (a sketch of this loop follows the list).
- ChatGPT: serves as an LLM judge that evaluates generated responses against CRAG's metrics.
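For intuition, here is a minimal sketch of that loop in plain Python. Every name in it is illustrative rather than CRAG's actual API; the real orchestration, prompts, and judging logic live in local_evaluation.py.

# Illustrative sketch of the evaluation loop - not CRAG's real code.
# `retrieve`, `generate`, and `judge` are hypothetical callables standing in for
# the mock API, the baseline model, and the ChatGPT judge, respectively.
def evaluate(examples, retrieve, generate, judge):
    counts = {"total": 0, "n_correct": 0, "n_hallucination": 0, "n_miss": 0}
    for ex in examples:
        context = retrieve(ex["query"])                     # mock web search / KG API
        answer = generate(ex["query"], context)             # dummy baseline for now
        verdict = judge(ex["query"], answer, ex["answer"])  # "correct" | "hallucination" | "miss"
        counts["total"] += 1
        counts[f"n_{verdict}"] += 1
    return counts

# Tiny stand-in run, just so the sketch executes end to end:
if __name__ == "__main__":
    examples = [{"query": "Who wrote Dune?", "answer": "Frank Herbert"}]
    print(evaluate(
        examples,
        retrieve=lambda q: ["Dune is a 1965 novel by Frank Herbert."],
        generate=lambda q, ctx: "Frank Herbert",
        judge=lambda q, a, gold: "correct" if gold.lower() in a.lower() else "miss",
    ))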
What CRAG Measures: The Telemetry Dashboard
CRAG reports quantitative indicators, a flight log of your system after a test mission:
- total: Total number of evaluated examples.
- n_correct: Count of responses that are fully supported by retrieved context.
- n_hallucination: Number of responses containing unsupported or invented facts.
- n_miss: Responses missing key information or empty answers.
- accuracy / score: Overall accuracy, the ratio of correct responses (n_correct / total).
- hallucination: Ratio = n_hallucination / total.
- missing: Ratio = n_miss / total.
These metrics are the sensors on your RAG ship's dashboard.
If any of them starts flashing red, it's time to check the model's engine. (The arithmetic behind the ratios is sketched right below.)
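In code, those three ratios are just divisions over the counters above. The snippet below mirrors the definitions from the list; it is plain arithmetic, not CRAG's own reporting code.

# Turning the raw counters into the dashboard ratios described above.
def dashboard(total: int, n_correct: int, n_hallucination: int, n_miss: int) -> dict:
    if total == 0:
        raise ValueError("No evaluated examples - nothing to report.")
    return {
        "accuracy": n_correct / total,
        "hallucination": n_hallucination / total,
        "missing": n_miss / total,
    }

# Example run: 100 questions, 62 correct, 23 hallucinated, 15 missed.
print(dashboard(total=100, n_correct=62, n_hallucination=23, n_miss=15))
# {'accuracy': 0.62, 'hallucination': 0.23, 'missing': 0.15}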
Docker Architecture
version: '3.8'

services:
  # Mock API service for RAG data
  mock-api:
    build:
      context: ../mock_api
      dockerfile: ../deployments/Dockerfile.mock-api
    container_name: crag-mock-api
    ports:
      - "8000:8000"
    volumes:
      - ../mock_api/cragkg:/app/cragkg
    environment:
      - PYTHONPATH=/app
    networks:
      - crag-network
    restart: unless-stopped

  # CRAG application container
  crag-app:
    build:
      context: ..
      dockerfile: deployments/Dockerfile.crag-app
    container_name: crag-app
    depends_on:
      - mock-api
    environment:
      # OpenAI for evaluation (optional)
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      # Mock API connection (Docker service)
      - CRAG_MOCK_API_URL=http://mock-api:8000
      # Evaluation model
      - EVALUATION_MODEL_NAME=${EVALUATION_MODEL_NAME:-gpt-4-0125-preview}
    volumes:
      # Mount large data directories (read-only)
      - ../data:/app/data:ro
      - ../results:/app/results
      - ../example_data:/app/example_data:ro
      # Tokenizer (if needed)
      - ../tokenizer:/app/tokenizer:ro
    extra_hosts:
      - "host.docker.internal:host-gateway"
    networks:
      - crag-network
    stdin_open: true
    tty: true
    command: ["python", "local_evaluation.py"]

networks:
  crag-network:
    driver: bridge
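The two ${...} substitutions in the crag-app environment are read from your shell or from an .env file that Docker Compose picks up in the project directory. Something like the following, with a placeholder key of your own:

# .env (placeholder values - substitute your own OpenAI key)
OPENAI_API_KEY=your-openai-api-key
# Optional - gpt-4-0125-preview is already the default in docker-compose.yml
EVALUATION_MODEL_NAME=gpt-4-0125-preview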
Why This Matters
RAG systems are quickly becoming the core engines of modern LLM-based products.
CRAG allows engineers to evaluate their reliability and factual grounding before shipping to production.
This Docker build transforms Meta AIโs research benchmark into a practical engineering environment:
- fully isolated and reproducible;
- runnable locally or in CI pipelines;
- easily extendable with your own models (for example, via LM Studio, coming in the next mission).
The Next Mission
Right now, CRAG runs on its built-in baselines: a test flight before mounting the real engine.
The next step is integrating the LM Studio API and evaluating a live LLM within the same container setup.
That will be Mission II.
Mission Summary
"Sometimes engineering magic isn't about building a brand-new ship,
but about preparing an existing one for its next flight."
CRAG now launches reliably, telemetry is stable, and the mission is a success.
Next up: integrating LM Studio and real models.
For now, the ship holds a steady course.
Mission Repository
github.com/astronaut27/CRAG_with_Docker
License
CRAG is distributed under the MIT License, developed by Meta AI / Facebook Research.
All modifications in CRAG_with_Docker preserve the original copyright notices.