Building RabbitHole broke my brain a little (in a good way)

#langgraph #rag #llm #ai

ok so i've been sitting on this project for weeks now and finally the courtroom actually WORKS end to end so lemme just dump everything about it while it's fresh.

RabbitHole is this multi-agent thing built on LangGraph where instead of asking one LLM "hey what's the answer" and getting one confident paragraph back, i make a bunch of agent personas actually argue about it. like a state advocate vs a privacy activist vs a compliance officer, all pulling from the same retrieved docs but arguing completely different sides, cross-examining each other, and then a judiciary node has to actually rule on it with a confidence score.

why. because normal RAG flattens everything. you ask something with no clean answer (legal stuff, policy tradeoffs, anything genuinely contested) and it still hands you ONE tidy paragraph like the question wasn't messy in the first place. that always bugged me. the messiness is the point sometimes

not deployed yet btw, that's purely a money thing not a "not ready" thing, will get to that

it's actually two graphs

people ask me if it's one big graph and no, there's the outer Courtroom graph (refines your query, calls into RAG, moderator picks who debates, runs the debate in parallel, then stops and waits for you to weigh in before concluding) and then nested INSIDE that is a whole separate RAG sub-graph doing its own thing.

the RAG part alone has more going on than i expected when i started. hybrid search (pinecone dense + BM25 sparse, bc keyword matches on legal citations matter a lot and semantic search alone misses those), jina reranker to cut noise, and then a CRAG loop: a grader checks if the retrieved docs are actually decent, and if not it falls back to web search instead of just yolo-ing with bad context. then on top of THAT there's a self-RAG hallucination check where the final brief gets audited against the raw source before it's even allowed to leave the subgraph.
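rough sketch of the CRAG decision in plain python, just to show the shape of it. this is not the repo code: crag_retrieve, min_relevant and the three callables are names i made up for illustration, and keeping the docs that passed alongside the web results is an assumption too.

```python
from typing import Callable, Dict, List


def crag_retrieve(
    query: str,
    retrieve: Callable[[str], List[str]],
    grade: Callable[[str, str], bool],
    web_search: Callable[[str], List[str]],
    min_relevant: int = 2,  # hypothetical threshold, tune to taste
) -> Dict[str, object]:
    """Retrieve, grade every doc, and fall back to web search if too few pass."""
    docs = retrieve(query)
    # The grader is a yes/no relevance check per document.
    relevant = [doc for doc in docs if grade(query, doc)]
    if len(relevant) >= min_relevant:
        return {"docs": relevant, "source": "index"}
    # Not enough usable context: pull in web results instead of answering
    # on weak docs. Docs that did pass the grader are kept (assumption).
    return {"docs": relevant + web_search(query), "source": "web_fallback"}
```

in the real thing retrieve would be the hybrid search + rerank step and grade would be a lite LLM call, but the control flow is the part that matters here.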

splitting it into two graphs instead of one flat pipeline was honestly one of the better calls i made, purely bc when a verdict came out wrong i could isolate the cause: was it bad retrieval or bad arguing? saved me so much debugging time lol

ok the bug that actually annoyed me the most

so early version, i'd ask for 2 perspectives and get like 6-8 back. system prompt literally said "use exactly 2 perspectives" in caps even lol and the model just. didn't listen. and under any real load this meant burning through groq's rate limit almost instantly, which was NOT fun to watch happen live

took me way too long to realize the fix isn't a better prompt, the fix is not trusting the prompt for this at all. moved the constraint into the state schema itself — moderator node reads a typed field for perspective count straight off state and only ever schedules that many nodes. the LLM literally never gets asked to count, the graph topology just doesn't let it
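minimal sketch of what "put it in the state schema" means, stripped of any langgraph wiring. CourtroomState, perspective_count, candidate_personas and moderator are illustrative names, not necessarily what the repo uses.

```python
from typing import List, TypedDict


class CourtroomState(TypedDict):
    query: str
    perspective_count: int          # the typed field the user controls
    candidate_personas: List[str]   # whatever personas are on offer


def moderator(state: CourtroomState) -> List[str]:
    """Return the personas to schedule, capped by the typed field on state."""
    count = state["perspective_count"]
    if count < 1:
        raise ValueError("perspective_count must be at least 1")
    # The cap is enforced in code. No prompt ever asks the model to count.
    return state["candidate_personas"][:count]
```

whatever fans out the debate only ever sees the list this returns, so asking for 2 can't turn into 6 no matter what the model feels like doing.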

anyway that's the takeaway i keep repeating to myself now — if something is structural, encode it structurally, don't beg the model to behave

rate limits basically designed half the architecture

groq free tier is 30 req/min, 6000 tokens/min on the good models. a courtroom debate running perspectives in parallel eats that in seconds, no exaggeration. so i built this FallbackChatModel wrapper thing that catches 429s and connection errors and just fails over — cerebras to groq to gemini — without the graph state even noticing anything went wrong.
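here's the failover idea boiled down to plain python. it's a sketch under assumptions, not the actual FallbackChatModel: RateLimited is a stand-in for whatever 429 exception each provider SDK raises, and providers are just callables that take a prompt.

```python
from typing import Callable, List, Tuple


class RateLimited(Exception):
    """Stand-in for a provider's HTTP 429 error (hypothetical)."""


class FallbackChat:
    def __init__(self, providers: List[Tuple[str, Callable[[str], str]]]):
        if not providers:
            raise ValueError("need at least one provider")
        self.providers = providers  # tried in the order given

    def invoke(self, prompt: str) -> str:
        errors = []
        for name, call in self.providers:
            try:
                return call(prompt)
            except (RateLimited, ConnectionError) as exc:
                # Only rate limits and connection errors trigger failover;
                # anything else is a real bug and should surface.
                errors.append(f"{name}: {exc!r}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))
```

the caller just sees a string come back, which is why the graph state never has to know a provider got swapped mid-run.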

also at startup it checks whatever keys you actually have in .env and figures out routing order itself for heavy vs lite tasks. and the routing itself matters too, not just failover — structured synthesis (the actual arguments, the verdict) goes to the heavier model, llama 3.3 70b or gemini 1.5 pro, but boring boolean stuff like "is this doc relevant y/n" goes to a lite model, llama 3.1 8b or gemini flash. kept most node calls off the expensive quota entirely
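a sketch of the startup routing, assuming made-up env var names and using model labels rather than real API model ids. which provider serves which model here is also my guess for illustration, not something to copy as-is.

```python
from typing import Dict, List, Mapping

# (provider, env var, heavy model label, lite model label), in failover order.
# Env var names and the provider/model pairing are assumptions.
PROVIDERS = [
    ("cerebras", "CEREBRAS_API_KEY", "llama 3.3 70b", "llama 3.1 8b"),
    ("groq", "GROQ_API_KEY", "llama 3.3 70b", "llama 3.1 8b"),
    ("gemini", "GEMINI_API_KEY", "gemini 1.5 pro", "gemini flash"),
]


def build_routes(env: Mapping[str, str]) -> Dict[str, List[str]]:
    """Build heavy/lite routing order from whichever keys are present."""
    routes: Dict[str, List[str]] = {"heavy": [], "lite": []}
    for provider, key_var, heavy, lite in PROVIDERS:
        if env.get(key_var):  # skip providers with a missing or empty key
            routes["heavy"].append(f"{provider}:{heavy}")
            routes["lite"].append(f"{provider}:{lite}")
    if not routes["heavy"]:
        raise RuntimeError("no provider API keys found")
    return routes
```

at startup you'd call build_routes(os.environ) once, then nodes doing synthesis pick from routes["heavy"] and the y/n graders pick from routes["lite"].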

latency thing that actually made me go woahhh

19.8s down to 9.8s. ~51% cut and honestly it came from like two changes only

running the perspective nodes concurrently with langgraph's async scheduler instead of one by one (should've done this from day 1 tbh), and reranking with jina before synthesis so the context going into the heavy models is smaller — which speeds up inference AND cuts token cost, kind of a two for one
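the concurrency half of that is easy to show with plain asyncio. this is a toy, not the langgraph scheduler: argue fakes an LLM call with a sleep, and all the names are made up.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Tuple


async def argue(persona: str, delay: float = 0.1) -> str:
    await asyncio.sleep(delay)  # pretend this is a slow LLM call
    return f"{persona}: argument"


async def run_sequential(personas: List[str]) -> List[str]:
    # One persona at a time: total time is the sum of all the calls.
    return [await argue(p) for p in personas]


async def run_parallel(personas: List[str]) -> List[str]:
    # All personas at once: total time is roughly the slowest single call.
    return list(await asyncio.gather(*(argue(p) for p in personas)))


def timed(
    runner: Callable[[List[str]], Awaitable[List[str]]],
    personas: List[str],
) -> Tuple[List[str], float]:
    start = time.perf_counter()
    result = asyncio.run(runner(personas))
    return result, time.perf_counter() - start
```

both runners return the same arguments in the same order (gather preserves input order), the parallel one just stops paying for each call back to back.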

nothing exotic here is the thing. the wins were architectural not "swap in a better model"

why not deployed

pls don't come at me for this lol. hosting a multi-provider, multi-agent graph with a pinecone index and reranker calls running 24/7 is not free, and i'd rather wait till i can actually afford to keep it alive than ship it and watch it die in a month. everything runs locally via docker compose right now: docker-compose up --build gets you fastapi backend + react frontend behind nginx in one go. it's a "when" not an "if"

what's next

now that the pipeline actually runs i wanna instrument it properly. order is: per-node cost tracing in langsmith first (rn i can tell a run was expensive but not WHICH node did it, driving me insane), then RAGAS eval on the live pipeline so i'm measuring quality instead of just vibes-checking verdicts, then prompt caching, then model routing/cascades on top of what's already there

repo's here if you wanna poke around: github.com/Somay-kousis/RabbitHole

happy to go deeper on any of this in a follow up if ppl want — the CRAG fallback, the state schema fix, the failover wrapper, whatever