v.j.k.
Battle Mage: We Built a Codebase Expert That Lives in Slack

It reads your repo, cites its sources, and gets smarter every time someone corrects it.


Every engineering team has that one person who knows where everything is. The one who answers "where's the auth module?" without looking up from their coffee. The one who remembers that the payment service was refactored in Q3, that the config moved from YAML to JSON last sprint, and that the weird naming convention in the test suite exists because of a migration from PHPUnit three years ago.

You know who I'm talking about. You've probably pinged them on Slack at 11pm once or twice.

We wanted to put that person in Slack. Not replace them. Free them from being the team's living search engine so they can go back to doing the work only they can do.

So we built Battle Mage.


What It Actually Is

Battle Mage is a Slack agent powered by Claude that answers questions about your GitHub repo in real time. Mention @bm in any channel and ask about code, architecture, open issues, recent PRs. It searches your actual codebase and responds with specific file paths, line numbers, and citations. Not vibes. Not summaries of summaries. The real thing.

It also creates GitHub issues on request (with a confirmation step so it never surprises your team), remembers corrections you give it, and gets smarter over time from the conversations it has with your team.

The entire thing runs on Vercel serverless. No Docker, no Kubernetes, no containers to babysit at 2am. Just a Next.js API route, a few environment variables, and a Slack webhook. The kind of setup you can hand off to a new teammate on their first day and have them running in 20 minutes.


From Wizard to Mage: The Origin Story

Battle Mage grew out of a development methodology called the Wizard skill, an 8-phase approach to building software that prioritizes understanding over velocity. The core idea is that a developer who spends 70% of their time truly understanding a problem will outship a developer who starts typing immediately, every single time.

The Wizard methodology enforces TDD (failing tests before implementation), systematic planning, adversarial self-review ("what happens if this runs twice concurrently?"), and PR-based quality gates. It's opinionated and thorough, and it works.

So we asked: what if we applied the same principles to understanding a codebase rather than building one? What if an agent approached your repo the way a careful architect would: verify before asserting, cite specifically, trust code over docs, admit when it's unsure?

That's Battle Mage. A wizard that fights your codebase battles for you. Hence the name. (And yes, the icon is a knight-mage hybrid with a lightning shield. We leaned in.)


The Smart Parts (Under the Hood)

Topic-Based Repo Indexing

When Battle Mage first connects to your repo, it doesn't read every file. That would be slow, expensive, and honestly a bit rude to the GitHub API. Instead, it builds a topic map, a structured index that maps areas of your codebase to file paths:

authentication: src/services/auth/, config/auth.ts, tests/auth/
deployment:     Dockerfile, .github/workflows/deploy.yml
database:       db/migrations/, src/config/database.ts
testing:        tests/Unit/AuthTest.php, tests/Feature/LoginTest.php (+12 more)

This index is built from a single GitHub API call that returns your entire file tree, then classified by a heuristic rules engine. No AI call needed for the classification. It's fast, deterministic, and free. A single file can appear under multiple topics; tests/Auth/LoginTest.php maps to both testing and authentication.
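A minimal sketch of what such a heuristic classifier could look like (the rule set and names here are illustrative, not Battle Mage's actual code). Each rule maps a path pattern to a topic, and a path that matches several rules lands under multiple topics:

```typescript
// Illustrative topic classifier: deterministic regex rules, no AI call.
type TopicMap = Record<string, string[]>;

const RULES: Array<{ topic: string; test: (path: string) => boolean }> = [
  { topic: "authentication", test: (p) => /auth/i.test(p) },
  { topic: "testing", test: (p) => /(^|\/)tests?\//i.test(p) || /\.test\./.test(p) },
  { topic: "deployment", test: (p) => /dockerfile|\.github\/workflows/i.test(p) },
  { topic: "database", test: (p) => /migrations|database/i.test(p) },
];

function buildTopicMap(paths: string[]): TopicMap {
  const map: TopicMap = {};
  for (const path of paths) {
    for (const { topic, test } of RULES) {
      // One file can appear under multiple topics.
      if (test(path)) (map[topic] ??= []).push(path);
    }
  }
  return map;
}
```

Because the rules are plain predicates over paths, the whole classification is a single pass over the file tree.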

The index lives in Vercel KV and rebuilds lazily: on each question, Battle Mage compares the repo's current HEAD SHA with the one it indexed. If they match, it uses the cache. If the repo has changed, it rebuilds in a couple of seconds. No cron jobs, no webhooks in the target repo, no setup ceremony.
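The rebuild-on-demand logic reduces to a SHA comparison. A sketch, with hypothetical KV and GitHub helpers passed in as functions:

```typescript
// Lazy index rebuild: the cached index carries the SHA it was built from.
interface CachedIndex { sha: string; topics: Record<string, string[]> }

async function getIndex(
  fetchHeadSha: () => Promise<string>,                       // one cheap API call
  kvGet: () => Promise<CachedIndex | null>,
  kvSet: (idx: CachedIndex) => Promise<void>,
  rebuild: (sha: string) => Promise<Record<string, string[]>>,
): Promise<CachedIndex> {
  const head = await fetchHeadSha();
  const cached = await kvGet();
  if (cached && cached.sha === head) return cached;          // repo unchanged: cache hit
  const fresh = { sha: head, topics: await rebuild(head) };  // repo moved: rebuild
  await kvSet(fresh);
  return fresh;
}
```

The nice property is that freshness is checked per question, so there is no background job that can silently die.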

When you ask "how does authentication work?", the agent sees authentication: src/services/auth/ in its prompt and goes straight to the right files instead of wandering around blind. What used to take 10 tool-use rounds now takes 2. This single change was the biggest performance win of the entire project. Do this first if you're building something similar.

Reading Your Project's Own Docs

One detail that makes Battle Mage feel like it actually belongs on your team: if your repo has a CLAUDE.md file (common in projects that use Claude for development), the agent reads it on startup and uses it to understand your project's conventions, architecture, and terminology. It's the equivalent of the onboarding doc that nobody reads, except the bot actually reads it.
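Fetching a single file like this needs no clone; the GitHub contents endpoint returns it base64-encoded. A rough sketch (the endpoint and encoding are standard GitHub API behavior; the helper names are made up here):

```typescript
// Pull CLAUDE.md, if present, via the GitHub contents API.
async function fetchClaudeMd(owner: string, repo: string, token: string): Promise<string | null> {
  const res = await fetch(`https://api.github.com/repos/${owner}/${repo}/contents/CLAUDE.md`, {
    headers: { Authorization: `Bearer ${token}`, Accept: "application/vnd.github+json" },
  });
  if (!res.ok) return null; // no CLAUDE.md: the agent simply skips this step
  const body = (await res.json()) as { content: string };
  return decodeContent(body.content);
}

// GitHub returns file bodies as base64 with embedded newlines.
function decodeContent(b64: string): string {
  return Buffer.from(b64.replace(/\n/g, ""), "base64").toString("utf8");
}
```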

Source-of-Truth Hierarchy

Not all information is created equal, and Battle Mage knows it.

Every answer is assembled using a clear hierarchy:

  1. Source code: the actual implementation. Code doesn't lie (it just sometimes surprises you).
  2. Tests: encode expected behavior. If the tests pass, the tested behavior is correct.
  3. Documentation: describes intent, but can drift from reality over time.
  4. Knowledge base: corrections from your team. Useful, but can go stale.
  5. Feedback signals: thumbs up/down. The weakest signal, used to calibrate tone and style, not as a source of factual truth.

When sources conflict, the agent prefers higher-ranked ones and flags the discrepancy out loud. If the docs say auth uses sessions but the code clearly uses JWTs, Battle Mage tells you both and trusts the code. It will never silently prefer a lower-ranked source over a higher-ranked one.

Weighted Reference Ranking

Every answer includes a reference footer with links to the sources the agent actually used, ranked by trustworthiness:

References:
  📄 src/services/auth/login.ts          <- code the agent read
  📖 tests/auth/login.test.ts            <- tests it verified
  🎫 #1446 Replace supervisor             <- issue it cited
  📜 docs/deployment/setup.md             <- doc for context

Source code files score 50 points, test files 40, anything cited in the answer gets a +20 bonus, and documentation gets 10. References are deduplicated and capped at 7. Fewer links, but all of them meaningful. With the optional .battle-mage.json config, core paths get an extra boost and historic/vendor paths get penalized. A 2023 architecture doc will literally rank below an uncited GitHub issue. As it should.
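The scoring described above fits in a few lines. A sketch using the numbers from the text (code 50, tests 40, docs 10, +20 if cited, dedupe, cap at 7); the path heuristics are illustrative:

```typescript
// Weighted reference ranking: trust source over tests over docs.
interface Ref { path: string; cited: boolean }

function scoreRef(ref: Ref): number {
  let score = /\.test\.|(^|\/)tests?\//.test(ref.path) ? 40
    : /(^|\/)docs\//.test(ref.path) || ref.path.endsWith(".md") ? 10
    : 50;                           // plain source code ranks highest
  if (ref.cited) score += 20;       // anything cited in the answer gets a bonus
  return score;
}

function rankRefs(refs: Ref[]): string[] {
  const seen = new Set<string>();
  return refs
    .filter((r) => !seen.has(r.path) && seen.add(r.path)) // dedupe, keep first
    .sort((a, b) => scoreRef(b) - scoreRef(a))
    .slice(0, 7)                    // cap at 7 meaningful links
    .map((r) => r.path);
}
```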

The Self-Learning Knowledge Base

Here's where it gets interesting. Battle Mage doesn't just answer questions. It learns from being wrong.

When you correct the agent in Slack ("no, auth moved to src/services/auth in the v4 refactor"), it calls a save_knowledge tool and stores that fact in Vercel KV as a timestamped entry. These entries get injected into every future system prompt, but importantly, the system prompt also tells the agent to treat them with appropriate skepticism and always verify against the actual code:

[2026-03-28] Auth module is in src/services/auth, not app/Http/Auth
[2026-03-27] API rate limit is 120 req/min, not 60
[2026-03-25] Deploy pipeline uses Docker Alpine, not Ubuntu

Because the knowledge base is team-wide, a correction from one engineer benefits everyone who asks a related question in the future.
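The storage side is deliberately simple. A sketch, assuming a hypothetical KV client with list semantics; only the timestamped-entry format comes from the text:

```typescript
// Knowledge base entries are timestamped facts, rendered newest-first
// into the system prompt on every invocation.
interface KnowledgeEntry { date: string; fact: string }

function formatKnowledge(entries: KnowledgeEntry[]): string {
  return entries
    .sort((a, b) => b.date.localeCompare(a.date)) // newest first
    .map((e) => `[${e.date}] ${e.fact}`)
    .join("\n");
}

async function saveKnowledge(fact: string, kvPush: (e: KnowledgeEntry) => Promise<void>) {
  await kvPush({ date: new Date().toISOString().slice(0, 10), fact });
}
```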

The feedback system (👍/👎 reactions) is separate from this. Thumbs up/down are lower-signal quality preferences that help calibrate tone and approach, not factual corrections. When you thumbs-down an answer, the auto-correction system analyzes which KB entries might be related using keyword matching against the file paths the agent actually read. It flags those entries to you and asks what was wrong. You confirm what should be removed; nothing gets deleted automatically. The keyword matching is intentionally broad, so false positives are possible, and the human stays in the loop before any knowledge gets wiped.
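The "intentionally broad" matching could be as simple as tokenizing the paths the agent read and flagging any KB entry that mentions one of those tokens. A sketch (illustrative, not the project's actual heuristic):

```typescript
// Thumbs-down triage: flag KB entries that keyword-match the file paths
// the agent read. Flag only; a human confirms before anything is deleted.
function flagRelatedEntries(entries: string[], pathsRead: string[]): string[] {
  const keywords = pathsRead.flatMap((p) =>
    p.toLowerCase().split(/[\/.\-_]/).filter((w) => w.length > 3),
  );
  return entries.filter((e) => {
    const text = e.toLowerCase();
    return keywords.some((k) => text.includes(k));
  });
}
```

Dropping short tokens keeps noise like `src` or file extensions out of the keyword set, but the match is still substring-based, which is exactly why false positives are expected and the confirmation step exists.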

Path Annotations (.battle-mage.json)

Drop this optional config file in the root of your target repo and the agent will know which parts of your codebase to trust, and which to treat with appropriate suspicion:

{
  "paths": {
    "src/": "core",
    "tests/": "core",
    "docs/": "current",
    "docs/archive/": "historic",
    "vendor/": "vendor",
    "node_modules/": "excluded"
  }
}

Five trust levels: core (read first, always), current (normal trust, the default), historic (skipped by default; only consulted for history questions, always qualified with "historically..."), vendor (only for dependency questions), and excluded (completely invisible, never indexed, never read, never referenced).

More specific paths override broader ones, so you can set docs/ to current and then override docs/archive/ to historic. The config is read via the GitHub API on each index rebuild, so no deploy or restart needed.
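"More specific wins" is a longest-matching-prefix lookup. A minimal sketch of that resolution rule:

```typescript
// Longest-prefix-wins trust lookup over the .battle-mage.json paths map.
type Trust = "core" | "current" | "historic" | "vendor" | "excluded";

function trustFor(path: string, config: Record<string, Trust>): Trust {
  let best: { prefix: string; trust: Trust } | null = null;
  for (const [prefix, trust] of Object.entries(config)) {
    if (path.startsWith(prefix) && (!best || prefix.length > best.prefix.length)) {
      best = { prefix, trust };
    }
  }
  return best?.trust ?? "current"; // "current" is the default trust level
}
```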

The agent won't cite a 2024 architecture doc as current fact. It won't dive into vendor code unless you ask about a dependency. And it will never accidentally read your node_modules. We've all been there.


The UX Details That Matter

Live Progress Updates

Nobody wants to stare at a static "thinking..." message while an AI agent goes on a two-minute adventure through their codebase. Battle Mage shows you exactly what it's doing:

🧠 Battle Mage is working... (this may take a minute, go grab some tea)
🔍 Searching for "authentication middleware"...

Then a few seconds later:

🧠 Battle Mage is working... (this may take a minute, go grab some tea)
👓 Reading src/middleware/auth.ts...

The header stays fixed while only the status line updates, which prevents the message from visually jumping in the Slack UI. When the answer arrives, the thinking message is deleted entirely rather than edited in place, so the thread stays clean with just the response.
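The trick is just that every edit reuses the same first line. A tiny sketch (posting, editing, and deleting the message would go through Slack's `chat.postMessage`, `chat.update`, and `chat.delete` Web API methods; the builder here is our own):

```typescript
// Fixed header + variable status line: only the second line ever changes,
// so the message doesn't visually jump as it's edited in place.
const HEADER = "🧠 Battle Mage is working... (this may take a minute, go grab some tea)";

function progressMessage(status: string): string {
  return `${HEADER}\n${status}`;
}
```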

Thread Conversations

After the first @bm mention, you can keep chatting in the same thread without re-mentioning the bot. It detects that it's already participating and responds to follow-ups automatically:

You:  @bm how does the auth middleware work?
Bot:  [explains auth middleware]
You:  what about the refresh token logic?   <- no @bm needed
Bot:  [explains refresh tokens]

Each thread is an independent conversation. The bot has no memory across separate threads, but within a thread it follows along naturally. The implementation required subscribing to Slack's message events and checking whether the bot had already replied before responding, otherwise it would try to answer every message in every channel it's in. Which would get old very fast.
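The guard condition is small but load-bearing. A sketch, assuming simplified message objects (real Slack events carry more fields):

```typescript
// Answer un-mentioned messages only in threads the bot already joined;
// everywhere else, stay quiet unless explicitly mentioned.
interface Msg { user: string; text: string }

function shouldRespond(msg: Msg, thread: Msg[], botId: string): boolean {
  if (msg.text.includes(`<@${botId}>`)) return true; // explicit mention
  return thread.some((m) => m.user === botId);       // bot already replied here
}
```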

Issue Creation Requires Your Say-So

Ask Battle Mage to create a GitHub issue and it drafts one (title, body, suggested labels) and shows you a preview. Nothing gets created until you react with ✅. No reaction, no issue. Creating issues is a write operation visible to your whole team, and the bot should never surprise anyone with unexpected things appearing in the backlog.
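One way to implement that gate (a sketch with hypothetical stand-ins; an in-memory map stands in for the KV store) is to park the draft under the preview message's timestamp and only act on a ✅ reaction to that exact message:

```typescript
// Reaction-gated issue creation: no ✅ on the preview, no GitHub write.
interface Draft { title: string; body: string; labels: string[] }

const pending = new Map<string, Draft>(); // keyed by preview message ts

function parkDraft(ts: string, draft: Draft) {
  pending.set(ts, draft);
}

async function onReaction(ts: string, emoji: string, createIssue: (d: Draft) => Promise<void>) {
  const draft = pending.get(ts);
  if (!draft || emoji !== "white_check_mark") return; // wrong message or wrong emoji
  pending.delete(ts); // consume first, so a second ✅ can't double-create
  await createIssue(draft);
}
```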

5-Minute Time Budget

Complex questions involving many files can take a while, but the agent doesn't run indefinitely. At 4 minutes it gets a quiet hint to start synthesizing with what it has. At 5 minutes it force-stops and returns a partial answer with a note. In practice most answers land in 1 to 3 minutes. The budget is just a safety net for genuinely gnarly questions, the ones where the agent would otherwise chase rabbit holes all day.
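The two-stage budget amounts to an elapsed-time check per loop round. A sketch:

```typescript
// Soft hint at 4 minutes, hard stop at 5, checked once per agent round.
const SOFT_MS = 4 * 60_000;
const HARD_MS = 5 * 60_000;

function budgetState(startedAt: number, now: number): "ok" | "wrap-up" | "stop" {
  const elapsed = now - startedAt;
  if (elapsed >= HARD_MS) return "stop";    // force-stop, return partial answer
  if (elapsed >= SOFT_MS) return "wrap-up"; // inject a "synthesize now" hint
  return "ok";
}
```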


Launching: Simpler Than You Think

The setup requires four things:

  1. A Slack app (created from the included YAML manifest, one click)
  2. A fine-grained GitHub PAT scoped to your target repo
  3. A Vercel project (free tier works; Pro recommended for the 60-second function timeout)
  4. Six environment variables

That's it. No infrastructure to provision, no Terraform, no containers. The bot deploys to Vercel on every push. You'll also want Vercel KV for the knowledge base, feedback storage, and the repo index cache. The free tier covers it for most teams.

Total infrastructure cost at rest: $0. You only pay for Anthropic API usage when the bot is actually answering questions.

One non-obvious tip: fine-grained GitHub PATs expire. Set a calendar reminder to rotate yours before it does. Expired tokens fail silently and the bot just quietly stops being able to read your repo. Ask us how we know.


What We Learned Building This

Prompt engineering is architecture, not copywriting. The system prompt is 200+ lines of carefully structured instructions covering the source-of-truth hierarchy, search strategy, recency rules, brevity constraints, and annotation guidance. Twelve distinct sections, assembled fresh on every agent invocation. Changing one line can dramatically alter behavior in ways that aren't obvious until someone asks a weird question at 2am.

The agent loop is where the complexity lives. The actual AI call is one line of code. Everything around it (tool execution, reference collection, progress updates, error handling, time budgets, thread management) is where the real work is. The loop runs up to 15 rounds, and each round is an opportunity for something creative to go wrong. The Wizard methodology this project was built with has a lot to say about this. Adversarial self-review exists for a reason.
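For a sense of the shape, here is a heavily simplified loop with hypothetical types; the real one also threads through progress updates, reference collection, error handling, and the time budget:

```typescript
// Skeleton of a tool-use agent loop: one model call per round, tool
// results fed back, bounded by a round limit as a safety net.
type Turn = { toolCalls: Array<{ name: string; input: unknown }>; text: string };

async function agentLoop(
  callModel: (history: string[]) => Promise<Turn>,
  runTool: (name: string, input: unknown) => Promise<string>,
  maxRounds = 15,
): Promise<string> {
  const history: string[] = [];
  for (let round = 0; round < maxRounds; round++) {
    const turn = await callModel(history);             // the one-line AI call
    if (turn.toolCalls.length === 0) return turn.text; // model is done answering
    for (const call of turn.toolCalls) {
      history.push(await runTool(call.name, call.input)); // feed results back
    }
  }
  return "Partial answer: round limit reached.";       // never loop forever
}
```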

Keep humans in the feedback loop. Early versions of the thumbs-down handler were more aggressive about auto-removing KB entries. The keyword matching heuristic is broad enough that false positives are common. A thumbs-down about formatting could flag a completely valid knowledge entry. We learned to show users what might be affected and let them decide. The extra confirmation step is worth it.

Recency matters more than completeness. Engineers asking "what's new?" want the last week, not a comprehensive history. Date-aware prompts and sort:updated on API calls made a bigger practical difference than any clever summarization strategy.

The repo index was the biggest win. Before it, every question started with blind GitHub searches. After it, the agent jumps straight to relevant files. Build the index first if you're doing something similar.


Try It

Battle Mage is open source. Clone it, set your environment variables, deploy to Vercel, and your team has a codebase expert in Slack that gets smarter every time someone corrects it.

It won't replace the senior engineer who holds your team's institutional knowledge. But it might free them from answering "where's the config file?" for the hundredth time, so they can go back to the work that actually needs them.

Every team deserves a mage in their corner. This one doesn't need onboarding, never takes PTO, and doesn't mind being pinged at 11pm.


Battle Mage is MIT licensed and available on GitHub. Built with Claude AI, Next.js, Vercel, and the Wizard development methodology.
