Tahrim Bilal

Building a Safety-First RAG Triage Agent in 24 Hours

Last weekend, I participated in HackerRank Orchestrate 2026 — a 24-hour hackathon where the challenge was deceptively simple: build a terminal-based support triage agent that handles tickets across HackerRank, Claude, and Visa using only a provided support corpus.

The catch? No hallucinations. No unsafe replies. Zero tolerance for wrong answers on fraud or billing tickets.

Here's how I built a hybrid RAG agent that prioritizes safety over speed — and why I burned through 3 API keys in the process.

The Problem: Why Vanilla RAG Wasn't Enough

Most RAG tutorials show how to chunk documents, embed them, and ask questions. That's fine for a blog demo. But for a production support system handling fraud reports, billing disputes, and account compromises, vanilla RAG is dangerous.

What happens when:

  • A user says "My identity was stolen, what should I do?"
  • The retriever finds a doc about "Identity verification for new accounts"
  • The LLM generates a helpful response about uploading ID documents

That's a catastrophic failure. Someone in distress gets a bureaucratic runaround instead of immediate escalation to a human agent.

I needed a system that escalates first, generates second.

The Architecture: 5-Stage Safety-First Pipeline

Stage 1: Classification

One LLM call extracts structured metadata:

# classifier.py
SYSTEM_PROMPT = """You are a support ticket classifier...
Return ONLY a JSON object:
{
  "company": "<HackerRank | Claude | Visa | Unknown>",
  "request_type": "<product_issue | feature_request | bug | invalid>",
  "product_area": "<short phrase>",
  "risk_level": "<low | high>"
}"""

def classify(llm, ticket_company, issue_text):
    # Build the user message (exact wording assumed for illustration)
    user_msg = f"Company: {ticket_company}\nIssue: {issue_text}"
    result = llm.chat_json(SYSTEM_PROMPT, user_msg)
    # Sanitize and fall back to safe defaults on missing keys
    return {
        "company": result.get("company", "Unknown"),
        "request_type": result.get("request_type", "product_issue"),
        "product_area": result.get("product_area", "general").lower(),
        "risk_level": result.get("risk_level", "low").lower()
    }
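
For a ticket like "My candidate report page won't load", the classifier returns something along these lines (illustrative output, not from an actual run):

{
  "company": "HackerRank",
  "request_type": "product_issue",
  "product_area": "candidate reports",
  "risk_level": "low"
}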

Key insight: I keep risk_level for logging but do NOT use it for escalation. The LLM over-flags benign tickets like "how do I delete my account" as high risk. Deterministic rules are more precise.

Stage 2: Safety Gate (Zero LLM Calls)

This is the heart of the system. Before any retrieval or generation, deterministic rules check for danger:

# safety.py
def check(classification, issue_text):
    text_lower = issue_text.lower()

    # 1. Bug reports → always escalate to engineers
    if classification.get("request_type") == "bug":
        return True, "Bug report escalated to technical team"

    # 2. Sensitive product areas
    product_area = classification.get("product_area", "").lower()
    for sensitive in HIGH_RISK_PRODUCT_AREAS:
        if sensitive in product_area:
            return True, f"Product area '{product_area}' is sensitive"

    # 3. Keyword scan
    for kw in ESCALATION_KEYWORDS:
        if kw in text_lower:
            return True, f"Contains sensitive keyword '{kw}'"

    # 4. Assessment integrity (HackerRank-specific)
    integrity_phrases = [
        "increase my score", "change my score", "graded me unfairly",
        "review my answers", "move me to the next round"
    ]
    for phrase in integrity_phrases:
        if phrase in text_lower:
            return True, f"Assessment integrity dispute: '{phrase}'"

    return False, ""

My Keyword List:

ESCALATION_KEYWORDS = [
    # fraud / financial
    "fraud", "unauthorized charge", "chargeback",
    "scam", "identity theft", "money back",
    "refund request", "billing dispute", "payment dispute",
    # account security
    "account hacked", "account compromised", "someone else logged in",
    "account suspended", "account banned", "account terminated",
    # legal
    "lawsuit", "legal action", "attorney", "lawyer", "court",
    # assessment integrity
    "cheating", "plagiarism", "academic integrity", "proctoring dispute",
    "candidate cheated", "unfair disqualification",
    # other high-risk
    "security breach", "vulnerability", "security vulnerability",
    "ban the seller", "ban this seller", "make visa refund",
    "subscription",
]
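
To see the gate in action, here's a hypothetical call; the ticket text and classification are invented for illustration:

classification = {"request_type": "product_issue", "product_area": "general"}
issue = "I see an unauthorized charge on my card and I want my money back."

escalate, reason = check(classification, issue)
if escalate:
    print(f"Escalated: {reason}")  # trips on the 'unauthorized charge' keyword
else:
    print("Safe to proceed to retrieval")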

Stage 3: Retrieval (Free, Local, Fast)

# retriever.py
import json

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

class Retriever:
    def __init__(self):
        # INDEX_FILE, EMBEDDING_MODEL, CHUNKS_FILE come from a config module
        self.index = faiss.read_index(str(INDEX_FILE))
        self.model = SentenceTransformer(EMBEDDING_MODEL)
        with open(CHUNKS_FILE) as f:
            self.chunks = json.load(f)

    def retrieve(self, query, company=None, top_k=6):
        # Embed query (L2-normalized → dot product == cosine)
        q_vec = self.model.encode([query], normalize_embeddings=True)

        # Company filtering
        if company:
            filtered = [(i, c) for i, c in enumerate(self.chunks) 
                       if c["company"].lower() == company.lower()]
            if len(filtered) < top_k:
                filtered = list(enumerate(self.chunks))  # fallback
        else:
            filtered = list(enumerate(self.chunks))

        # Build temporary index on subset
        idxs = np.array([i for i, _ in filtered])
        subset = np.stack([self.index.reconstruct(int(i)) for i in idxs])

        sub_index = faiss.IndexFlatIP(subset.shape[1])
        sub_index.add(subset)

        scores, positions = sub_index.search(q_vec, min(top_k, len(filtered)))

        return [{"text": self.chunks[int(idxs[p])]["text"], 
                 "score": float(scores[0][i]), ...} 
                for i, p in enumerate(positions[0]) if p >= 0]

Why FAISS: No server, no API cost, loads in <2 seconds. 1773 vectors is tiny — no need for approximate search.
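
For reference, building the index is a one-time offline step. A minimal sketch, where the file paths and model name are illustrative rather than the exact ones from my repo:

# build_index.py (one-time offline step; paths and model name illustrative)
import json

import faiss
from sentence_transformers import SentenceTransformer

with open("corpus_chunks.json") as f:  # [{"company": ..., "text": ...}, ...]
    chunks = json.load(f)

model = SentenceTransformer("all-MiniLM-L6-v2")
# normalize_embeddings=True makes inner product equal to cosine similarity
vecs = model.encode([c["text"] for c in chunks], normalize_embeddings=True)

index = faiss.IndexFlatIP(vecs.shape[1])  # exact search; fine at this scale
index.add(vecs)
faiss.write_index(index, "corpus.index")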

Stage 4: Fast vs Careful Lane

# responder.py
def respond(llm, retriever, classification, issue_text):
    chunks = retriever.retrieve(issue_text, company=classification.get("company"))
    best = retriever.best_score(chunks)

    # FAST LANE: High confidence → single LLM call
    if best >= FAST_LANE_THRESHOLD:  # 0.50
        response = generate(llm, issue_text, chunks)
        return {"status": "replied", "lane": "fast", ...}

    # CAREFUL LANE: Low confidence → verify everything
    if best >= SIMILARITY_THRESHOLD:  # 0.35
        relevant = check_relevance(llm, issue_text, chunks)
        if not relevant:
            # Query rewrite + retry
            rewritten = rewrite_query(llm, issue_text)
            new_chunks = retriever.retrieve(rewritten, ...)
            new_best = retriever.best_score(new_chunks)
            if new_best >= SIMILARITY_THRESHOLD:
                chunks = new_chunks
                best = new_best
            else:
                return {"status": "escalated", "lane": "escalated_no_corpus", ...}

    response = generate(llm, issue_text, chunks)

    # SELF-CHECK: Verify every claim is grounded
    grounded, issue = self_check(llm, issue_text, response, chunks)
    if not grounded:
        return {"status": "escalated", "lane": "escalated_ungrounded", ...}

    return {"status": "replied", "lane": "careful", ...}
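
The two helpers in the careful lane are single LLM calls. Here's a sketch of how they could look; the prompts below are illustrative, not the exact ones from my repo, and llm.chat is an assumed plain-text sibling of chat_json:

# helpers.py (illustrative sketches of the careful-lane helpers)
_RELEVANCE_SYSTEM = """Given a support ticket and retrieved context, decide whether
the context actually addresses the ticket.
Return ONLY a JSON object: {"relevant": true/false}"""

_REWRITE_SYSTEM = """Rewrite the support ticket as a short search query using
likely documentation vocabulary. Return ONLY the rewritten query text."""

def check_relevance(llm, issue_text, chunks):
    context = "\n---\n".join(c["text"] for c in chunks)
    result = llm.chat_json(_RELEVANCE_SYSTEM, f"Ticket: {issue_text}\nContext:\n{context}")
    return bool(result.get("relevant", False))  # default to "not relevant" (the safe direction)

def rewrite_query(llm, issue_text):
    return llm.chat(_REWRITE_SYSTEM, issue_text).strip()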

Self-Check Prompt:

_SELFCHECK_SYSTEM = """You are a fact-checking assistant. Review whether a support response is fully grounded in the provided context.
Return ONLY a JSON object: {"grounded": true/false, "issue": "<describe any unsupported claim, or empty string if grounded>"}
A response is grounded if every factual claim it makes can be directly traced to the context."""
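
The function wrapping this prompt is a thin layer over the same llm.chat_json interface used by the classifier. A sketch (the message format is assumed):

def self_check(llm, issue_text, response, chunks):
    # Returns (grounded, issue); a malformed reply defaults to ungrounded, the safe direction
    context = "\n---\n".join(c["text"] for c in chunks)
    user_msg = f"Ticket: {issue_text}\n\nResponse: {response}\n\nContext:\n{context}"
    result = llm.chat_json(_SELFCHECK_SYSTEM, user_msg)
    return bool(result.get("grounded", False)), result.get("issue", "")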

Stage 5: Output Schema

# main.py
OUTPUT_COLS = ["status", "product_area", "response", "justification", "request_type"]

def _build_row(*, status, product_area, response, justification, request_type):
    return {
        "status": status.lower(),
        "product_area": product_area.lower(),
        "response": response.strip(),
        "justification": justification.strip(),
        "request_type": request_type.lower(),
    }
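
Rows built this way go straight into the output file. A minimal sketch, assuming CSV output and an illustrative filename:

import csv

# rows is a list of dicts produced by _build_row(...)
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=OUTPUT_COLS)
    writer.writeheader()
    writer.writerows(rows)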

Challenges Faced

The biggest challenge I faced was Groq's API rate limit. The free tier allows roughly 20-30 requests per minute, and I burned through 3 API keys hitting it. The fix: I increased the delay between calls and added Gemini as a fallback, so that when Groq fails 3 times in a row, Gemini takes over.
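
The fallback itself is a simple retry loop. A sketch of the pattern, assuming hypothetical groq_chat / gemini_chat wrappers around the two clients:

import time

def chat_with_fallback(system, user_msg, max_retries=3, delay=2.0):
    # Try Groq first; on repeated failure (e.g. rate limits), fall back to Gemini
    for attempt in range(max_retries):
        try:
            return groq_chat(system, user_msg)    # hypothetical Groq wrapper
        except Exception:
            time.sleep(delay * (attempt + 1))     # linear backoff between retries
    return gemini_chat(system, user_msg)          # hypothetical Gemini wrapper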

What I'd Do Differently:

  • Train a lightweight ML classifier, which would save me an LLM call.
  • Use Ollama for local inference if possible. (I considered it, but my laptop hardware wasn't suitable.)
  • Use hybrid retrieval: BM25 + vector search (see the sketch after this list).
  • Add a layer of synonyms for query expansion (e.g., "blocked card" and "frozen card").
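
For the hybrid retrieval idea, here's a minimal sketch using the rank_bm25 package alongside the existing vector scores (the blending weight and whitespace tokenization are illustrative choices):

from rank_bm25 import BM25Okapi

# Build a BM25 index over the same chunks the vector index uses
tokenized = [c["text"].lower().split() for c in retriever.chunks]
bm25 = BM25Okapi(tokenized)

def hybrid_scores(query, vector_scores, alpha=0.5):
    # vector_scores must be aligned with retriever.chunks
    bm25_scores = bm25.get_scores(query.lower().split())
    max_bm25 = max(bm25_scores.max(), 1e-9)  # normalize BM25 into [0, 1]
    return [alpha * (b / max_bm25) + (1 - alpha) * v
            for b, v in zip(bm25_scores, vector_scores)]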

REPOSITORY

https://github.com/Tahrim19/hackerrank-orchestrate-may26

You can find the source files in the code directory.
