Last weekend, I participated in HackerRank Orchestrate 2026 — a 24-hour hackathon where the challenge was deceptively simple: build a terminal-based support triage agent that handles tickets across HackerRank, Claude, and Visa using only a provided support corpus.
The catch? No hallucinations. No unsafe replies. Zero tolerance for wrong answers on fraud or billing tickets.
Here's how I built a hybrid RAG agent that prioritizes safety over speed — and why I burned through 3 API keys in the process.
## The Problem: Why Vanilla RAG Wasn't Enough
Most RAG tutorials show how to chunk documents, embed them, and ask questions. That's fine for a blog demo. But for a production support system handling fraud reports, billing disputes, and account compromises, vanilla RAG is dangerous.
What happens when:
- A user says "My identity was stolen, what should I do?"
- The retriever finds a doc about "Identity verification for new accounts"
- The LLM generates a helpful response about uploading ID documents
That's a catastrophic failure. Someone in distress gets a bureaucratic runaround instead of immediate escalation to a human agent.
I needed a system that escalates first, generates second.
## The Architecture: A 5-Stage Safety-First Pipeline
### Stage 1: Classification

One LLM call extracts structured metadata:

```python
# classifier.py
SYSTEM_PROMPT = """You are a support ticket classifier...
Return ONLY a JSON object:
{
  "company": "<HackerRank | Claude | Visa | Unknown>",
  "request_type": "<product_issue | feature_request | bug | invalid>",
  "product_area": "<short phrase>",
  "risk_level": "<low | high>"
}"""

def classify(llm, ticket_company, issue_text):
    user_msg = f"Company: {ticket_company}\nIssue: {issue_text}"
    result = llm.chat_json(SYSTEM_PROMPT, user_msg)
    # Sanitize and fall back to safe defaults for any missing field
    return {
        "company": result.get("company", "Unknown"),
        "request_type": result.get("request_type", "product_issue"),
        "product_area": result.get("product_area", "general").lower(),
        "risk_level": result.get("risk_level", "low").lower(),
    }
```
**Key insight:** I keep `risk_level` for logging but do NOT use it for escalation. The LLM over-flags benign tickets like "how do I delete my account" as high risk. Deterministic rules are more precise.
### Stage 2: Safety Gate (Zero LLM Calls)

This is the heart of the system. Before any retrieval or generation, deterministic rules check for danger:

```python
# safety.py
def check(classification, issue_text):
    """Return (escalate, reason). Runs before any LLM or retrieval call."""
    text_lower = issue_text.lower()

    # 1. Bug reports → always escalate to engineers
    if classification.get("request_type") == "bug":
        return True, "Bug report escalated to technical team"

    # 2. Sensitive product areas
    product_area = classification.get("product_area", "").lower()
    for sensitive in HIGH_RISK_PRODUCT_AREAS:
        if sensitive in product_area:
            return True, f"Product area '{product_area}' is sensitive"

    # 3. Keyword scan
    for kw in ESCALATION_KEYWORDS:
        if kw in text_lower:
            return True, f"Contains sensitive keyword '{kw}'"

    # 4. Assessment integrity (HackerRank-specific)
    integrity_phrases = [
        "increase my score", "change my score", "graded me unfairly",
        "review my answers", "move me to the next round",
    ]
    for phrase in integrity_phrases:
        if phrase in text_lower:
            return True, f"Assessment integrity dispute: '{phrase}'"

    return False, ""
```
**My Keyword List:**

```python
# safety.py (continued)
ESCALATION_KEYWORDS = [
    # fraud / financial
    "fraud", "unauthorized charge", "chargeback",
    "scam", "identity theft", "money back",
    "refund request", "billing dispute", "payment dispute",
    # account security
    "account hacked", "account compromised", "someone else logged in",
    "account suspended", "account banned", "account terminated",
    # legal
    "lawsuit", "legal action", "attorney", "lawyer", "court",
    # assessment integrity
    "cheating", "plagiarism", "academic integrity", "proctoring dispute",
    "candidate cheated", "unfair disqualification",
    # other high-risk
    "security breach", "vulnerability", "security vulnerability",
    "ban the seller", "ban this seller", "make visa refund",
    "subscription",
]
```
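A quick sanity check of the keyword branch, trimmed to stand alone here:

```python
ESCALATION_KEYWORDS = ["fraud", "unauthorized charge", "identity theft"]

def keyword_gate(issue_text):
    """Substring scan over the escalation keywords (trimmed list)."""
    text_lower = issue_text.lower()
    for kw in ESCALATION_KEYWORDS:
        if kw in text_lower:
            return True, f"Contains sensitive keyword '{kw}'"
    return False, ""

print(keyword_gate("There's an Unauthorized Charge on my statement"))
# → (True, "Contains sensitive keyword 'unauthorized charge'")
```

One caveat: plain substring matching also fires on words like "defrauded". For this challenge that bias toward over-escalation is a feature, not a bug; a word-boundary regex would be the stricter alternative.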
### Stage 3: Retrieval (Free, Local, Fast)

```python
# retriever.py
class Retriever:
    def __init__(self):
        self.index = faiss.read_index(str(INDEX_FILE))
        self.model = SentenceTransformer(EMBEDDING_MODEL)
        with open(CHUNKS_FILE) as f:
            self.chunks = json.load(f)

    def retrieve(self, query, company=None, top_k=6):
        # Embed query (L2-normalized → dot product == cosine)
        q_vec = self.model.encode([query], normalize_embeddings=True)

        # Company filtering
        if company:
            filtered = [(i, c) for i, c in enumerate(self.chunks)
                        if c["company"].lower() == company.lower()]
            if len(filtered) < top_k:
                filtered = list(enumerate(self.chunks))  # fallback
        else:
            filtered = list(enumerate(self.chunks))

        # Build a temporary index on the filtered subset
        idxs = np.array([i for i, _ in filtered])
        subset = np.stack([self.index.reconstruct(int(i)) for i in idxs])
        sub_index = faiss.IndexFlatIP(subset.shape[1])
        sub_index.add(subset)
        scores, positions = sub_index.search(q_vec, min(top_k, len(filtered)))
        return [{"text": self.chunks[int(idxs[p])]["text"],
                 "score": float(scores[0][i]), ...}
                for i, p in enumerate(positions[0]) if p >= 0]
```
**Why FAISS:** No server, no API cost, and it loads in under 2 seconds. At 1,773 vectors the corpus is tiny, so exact search is plenty fast and approximate indexes would be overkill.
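The L2-normalization comment in `retrieve` deserves a word: for unit-length vectors the dot product equals cosine similarity, which is why an inner-product index (`IndexFlatIP`) gives cosine ranking. A quick numpy check:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity computed directly from the raw vectors
cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same vectors, L2-normalized: a plain dot product gives the same value
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot = float(np.dot(a_unit, b_unit))

print(round(cos, 4), round(dot, 4))  # → 0.96 0.96
```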
### Stage 4: Fast vs. Careful Lane

```python
# responder.py
def respond(llm, retriever, classification, issue_text):
    chunks = retriever.retrieve(issue_text, company=classification.get("company"))
    best = retriever.best_score(chunks)

    # FAST LANE: high confidence → single LLM call
    if best >= FAST_LANE_THRESHOLD:  # 0.50
        response = generate(llm, issue_text, chunks)
        return {"status": "replied", "lane": "fast", ...}

    # CAREFUL LANE: low confidence → verify everything
    if best >= SIMILARITY_THRESHOLD:  # 0.35
        relevant = check_relevance(llm, issue_text, chunks)
        if not relevant:
            # Query rewrite + retry
            rewritten = rewrite_query(llm, issue_text)
            new_chunks = retriever.retrieve(rewritten, ...)
            new_best = retriever.best_score(new_chunks)
            if new_best >= SIMILARITY_THRESHOLD:
                chunks = new_chunks
                best = new_best
            else:
                return {"status": "escalated", "lane": "escalated_no_corpus", ...}

        response = generate(llm, issue_text, chunks)

        # SELF-CHECK: verify every claim is grounded
        grounded, issue = self_check(llm, issue_text, response, chunks)
        if not grounded:
            return {"status": "escalated", "lane": "escalated_ungrounded", ...}
        return {"status": "replied", "lane": "careful", ...}
```
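`best_score` isn't shown in the snippet; presumably it reduces to taking the top similarity with a safe default, something like this sketch:

```python
def best_score(chunks):
    """Highest retrieval score among the chunks; 0.0 when retrieval
    returned nothing, which routes the ticket toward escalation."""
    return max((c["score"] for c in chunks), default=0.0)
```

The `default=0.0` matters: an empty result list must fall below both thresholds rather than raise.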
**Self-Check Prompt:**

```python
_SELFCHECK_SYSTEM = """You are a fact-checking assistant. Review whether a support response is fully grounded in the provided context.
Return ONLY a JSON object: {"grounded": true/false, "issue": "<describe any unsupported claim, or empty string if grounded>"}
A response is grounded if every factual claim it makes can be directly traced to the context."""
```
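The article doesn't show how this prompt is wired into `self_check`; a fail-closed sketch, where any parse error counts as ungrounded (the message formatting is my assumption):

```python
_SELFCHECK_SYSTEM = "You are a fact-checking assistant..."  # prompt as above

def self_check(llm, issue_text, response, chunks):
    """Return (grounded, issue). Fail closed: errors count as ungrounded."""
    context = "\n\n".join(c["text"] for c in chunks)
    user_msg = (f"Context:\n{context}\n\n"
                f"Ticket:\n{issue_text}\n\n"
                f"Response:\n{response}")
    try:
        verdict = llm.chat_json(_SELFCHECK_SYSTEM, user_msg)
        return bool(verdict.get("grounded", False)), verdict.get("issue", "")
    except Exception:
        return False, "self-check did not return valid JSON"
```

Failing closed is the point: a flaky verifier should escalate a ticket, never wave an unchecked reply through.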
### Stage 5: Output Schema

```python
# main.py
OUTPUT_COLS = ["status", "product_area", "response", "justification", "request_type"]

def _build_row(*, status, product_area, response, justification, request_type):
    return {
        "status": status.lower(),
        "product_area": product_area.lower(),
        "response": response.strip(),
        "justification": justification.strip(),
        "request_type": request_type.lower(),
    }
```
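Rows built this way drop straight into `csv.DictWriter` keyed on `OUTPUT_COLS`; a minimal sketch, writing to an in-memory buffer for illustration:

```python
import csv
import io

OUTPUT_COLS = ["status", "product_area", "response", "justification", "request_type"]

row = {
    "status": "escalated",
    "product_area": "billing",
    "response": "",
    "justification": "Contains sensitive keyword 'fraud'",
    "request_type": "product_issue",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=OUTPUT_COLS)
writer.writeheader()
writer.writerow(row)
print(buf.getvalue().splitlines()[0])
# → status,product_area,response,justification,request_type
```

Keyword-only arguments plus a fixed column list mean a misspelled field fails loudly at build time instead of silently shifting a CSV column.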
## Challenges Faced

The biggest challenge was Groq's API rate limit: the free tier allows roughly 20-30 requests per minute, and I burned through 3 API keys hitting it. The fix? I increased the delay between calls and added Gemini as a fallback, so that when Groq fails three times in a row, Gemini takes over.
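The failover itself is straightforward; roughly this (backoff schedule and method names are illustrative, mirroring the `chat_json` interface used elsewhere):

```python
import time

def chat_with_fallback(primary, fallback, system, user,
                       max_retries=3, base_delay=2.0):
    """Try the primary provider up to max_retries times with exponential
    backoff, then hand the request to the fallback provider."""
    for attempt in range(max_retries):
        try:
            return primary.chat_json(system, user)
        except Exception:
            time.sleep(base_delay * (2 ** attempt))
    return fallback.chat_json(system, user)
```

In production you would catch only rate-limit errors rather than every exception, but under hackathon time pressure a blanket except keeps the batch moving.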
## What I'd Do Differently

- Train a small ML classifier to replace the classification step and save an LLM call.
- Use Ollama for local inference (I considered it, but my laptop's hardware wasn't up to it).
- Use hybrid retrieval: BM25 + vector search.
- Add a synonym layer (e.g. "blocked card" and "frozen card") for query expansion.
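For the hybrid-retrieval bullet, the lexical scores could come from a library such as `rank_bm25`; combining the two rankings is then just score fusion. A sketch of a min-max blend (the weighting is illustrative):

```python
def fuse_scores(vector_scores, bm25_scores, alpha=0.6):
    """Blend two score dicts (doc_id -> score) after min-max normalization.
    alpha weights the vector side, 1 - alpha the lexical side."""
    def norm(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero on uniform scores
        return {k: (v - lo) / span for k, v in scores.items()}

    v, b = norm(vector_scores), norm(bm25_scores)
    return {doc: alpha * v.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
            for doc in set(v) | set(b)}
```

Normalizing first matters because cosine scores live in [-1, 1] while BM25 scores are unbounded; without it one signal drowns out the other.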
## Repository

https://github.com/Tahrim19/hackerrank-orchestrate-may26

The source files are in the `code` directory.