Building QAOnFire: how I used prompt caching to make AI QA reports affordable

Radu Ghetu — Tue, 02 Jun 2026 06:38:42 +0000

https://www.producthunt.com/posts/launching-qaonfire-dev/maker-invite?code=QMyQzM

I shipped QAOnFire last week — a GitHub App that posts
a full QA report on every pull request. Manual test scenarios, edge cases,
setup & verification scripts, PM notes. Written for the actual diff, not a
generic checklist.

This post is the architecture writeup I wish I'd had when I started.

The shape of the problem

A typical pull request on a small team gets:

1. Code review (sometimes thorough, often a rubber stamp)
2. Automated tests (usually unit tests; integration is rarely covered)
3. Manual QA (often skipped because there's no QA person)
4. Merge & deploy
5. Bug found in prod
6. Hotfix
7. Repeat

Step 3 is where most bugs get caught when it happens — and where most bugs
ship when it doesn't. I wanted to put a useful approximation of step 3 right
into the PR comment, within seconds of the PR being opened.

Architecture

GitHub webhook → Express endpoint (returns 200 in <100ms)
     ↓
BullMQ job queued in Redis
     ↓
Worker dequeues:
  1. Octokit fetches the PR diff + selected file contents
  2. Read `qabot.md` from repo root (if present)
  3. Build prompt with cached system prompt + cached qabot.md
  4. Call Claude Sonnet
  5. Octokit posts comment to PR
  6. Decrement monthly usage counter

Stack: Node 18, Express, BullMQ, Redis, Postgres, Stripe, @anthropic-ai/sdk,
@octokit/rest, deployed on Railway.
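
The endpoint can answer that quickly because all it has to do is decide whether a delivery deserves a job and hand it to the queue. Here is a minimal sketch of that decision as a pure function. It is an illustration, not the production code: the name `toJob`, the job shape, and the set of triggering actions are assumptions. The fields it reads (`action`, `pull_request.number`, `pull_request.head.sha`, `repository.full_name`, `installation.id`) are standard GitHub webhook payload fields, and the event name comes from the `X-GitHub-Event` header.

```javascript
// Hypothetical helper: decide whether a webhook delivery should become a queue job.
// Assumption: only these pull_request actions trigger a report.
const REPORT_ACTIONS = new Set(['opened', 'reopened', 'synchronize']);

function toJob(eventName, payload) {
  if (eventName !== 'pull_request') return null;
  if (!payload || !REPORT_ACTIONS.has(payload.action)) return null;

  const pr = payload.pull_request;
  const repo = payload.repository.full_name;
  return {
    // Same repo + PR + head commit gives the same key, so a redelivered
    // webhook can be recognised as a duplicate by the queue layer.
    dedupeKey: `${repo}#${pr.number}@${pr.head.sha}`,
    installationId: payload.installation.id,
    repo,
    prNumber: pr.number,
    headSha: pr.head.sha,
  };
}

// Illustrative payload, trimmed to the fields used above.
const exampleJob = toJob('pull_request', {
  action: 'opened',
  pull_request: { number: 42, head: { sha: 'abc123' } },
  repository: { full_name: 'acme/orders' },
  installation: { id: 1001 },
});
console.log(exampleJob);
```

In an Express handler the result would be enqueued (or skipped when it is `null`) before responding, so everything slow stays in the worker.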

Three things that made it work

1. Prompt caching changes the economics entirely

Without caching, every PR for every customer pays full input-token price for
the system prompt (~5 KB) and the per-repo qabot.md (~1-3 KB). For a repo with
30 PRs/month, that's the same 6-8 KB prefix billed at full price 30 times.

With caching enabled on those two prefix blocks, the first PR writes the cache
(at a 25% premium on that prefix) and any PR arriving within the cache lifetime
pays ~10% of the input token cost on that prefix. The default lifetime is 5
minutes, refreshed on every cache hit. For repos where PRs arrive in bursts
(multiple PRs in a short window, which is the pattern PR traffic actually has),
this is the difference between "free tier is impossible" and "5 free PRs/month
is comfortably sustainable."
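
To make that arithmetic concrete, here is a rough back-of-the-envelope sketch. The multipliers are Anthropic's published prompt-caching rates relative to the base input price (1.25× for a cache write with the default lifetime, 0.1× for a cache read). Everything else is an assumption for illustration: a ~2,000-token prefix (roughly 8 KB at about 4 characters per token) and a month in which 20 of 30 PRs land while the cache is warm. The helper `prefixCost` is hypothetical.

```javascript
// Multipliers relative to the base input-token price.
const CACHE_WRITE = 1.25; // a request that has to (re)write the cached prefix
const CACHE_READ = 0.1;   // a request that finds the prefix already cached

// Cost of the shared prefix, in "base-price token equivalents".
function prefixCost(prefixTokens, cacheWrites, cacheReads) {
  return prefixTokens * (cacheWrites * CACHE_WRITE + cacheReads * CACHE_READ);
}

// Assumed workload: 30 PRs/month, 10 cold (cache write) and 20 warm (cache read).
const prefixTokens = 2000;
const uncached = prefixTokens * 30; // every PR pays full price for the prefix
const cached = prefixCost(prefixTokens, 10, 20);

console.log({ uncached, cached, ratio: cached / uncached });
```

Under those assumptions the prefix costs roughly half as much. The break-even is low: one write followed by a single read costs 1.35× the prefix versus 2× uncached, while a write that is never read costs 1.25× versus 1×, so caching only loses on repos whose PRs never overlap a cache window.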

The SDK call looks like:

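import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await anthropic.messages.create({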
  model: 'claude-sonnet-4-6',
  max_tokens: 4000,
  system: [
    {
      type: 'text',
      text: QA_SYSTEM_PROMPT,
      cache_control: { type: 'ephemeral' }
    }
  ],
  messages: [
    {
      role: 'user',
      content: [
        // qabot.md as its own cached block, separate from the dynamic PR data
        {
          type: 'text',
          text: qabotMarkdown,
          cache_control: { type: 'ephemeral' }
        },
        {
          type: 'text',
          text: dynamicPRPayload  // not cached — changes per PR
        }
      ]
    }
  ]
});
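
One detail that call glosses over: `qabot.md` is optional, and the Messages API rejects empty text blocks, so a repo without the file should drop that block and not send an empty string. A small sketch of building the user content conditionally follows. `buildUserContent` is a hypothetical helper name, and this is one way to handle it, not necessarily the only one.

```javascript
// Hypothetical helper: build the user message content, skipping the
// qabot.md block when the repo has no (or an empty) config file.
function buildUserContent(qabotMarkdown, dynamicPRPayload) {
  const content = [];
  if (typeof qabotMarkdown === 'string' && qabotMarkdown.trim() !== '') {
    content.push({
      type: 'text',
      text: qabotMarkdown,
      cache_control: { type: 'ephemeral' }, // cached: stable per repo
    });
  }
  // Always last and never cached: this part changes on every PR.
  content.push({ type: 'text', text: dynamicPRPayload });
  return content;
}

console.log(buildUserContent(null, 'diff for PR #1').length);
console.log(buildUserContent('# Project context', 'diff for PR #1').length);
```

Without the `qabot.md` block, the breakpoint on the system prompt still applies, so those repos keep the larger part of the saving.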

The key insight: cache the static prefix (system prompt + per-repo config),
NOT the per-PR payload.
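
It is worth checking that the cache is actually being hit. Each Messages API response carries a `usage` object with `input_tokens` (tokens that were neither read from nor written to the cache), `cache_creation_input_tokens`, and `cache_read_input_tokens`. The sketch below summarises them. `cacheStats` is a hypothetical helper and the sample numbers are made up for illustration.

```javascript
// Hypothetical helper: summarise cache behaviour from a response's `usage` object.
function cacheStats(usage) {
  const written = usage.cache_creation_input_tokens ?? 0; // prefix written to the cache
  const read = usage.cache_read_input_tokens ?? 0;        // prefix served from the cache
  const uncached = usage.input_tokens ?? 0;               // everything after the last breakpoint
  const total = written + read + uncached;
  return { written, read, uncached, hitRatio: total === 0 ? 0 : read / total };
}

// Illustrative values only: a warm request where the prefix was read from cache.
console.log(cacheStats({
  input_tokens: 900,
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 2100,
  output_tokens: 1500,
}));
```

If both cache fields stay at 0 across consecutive requests, the usual suspects are a prefix shorter than the model's minimum cacheable length (such prompts are processed without caching) or something in the supposedly static prefix changing between requests.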

2. A per-repo config file (`qabot.md`) does more work than the model

Without project context, AI test plans drift toward the generic: "Test the
happy path. Test invalid inputs. Test edge cases." Useless.

Give the model 20 lines describing the actual domain — user roles, what
"correct" means for this app, known fragile areas — and the output transforms.
Test scenarios start referencing real entities. Edge cases align to actual
business rules. Setup steps mention the specific seeding commands your team
actually uses.

The file is dead simple Markdown — no special syntax. Example:

# Project context

## What this app does
Order management for a B2B wholesale distributor.

## User roles
- buyer (places orders)
- vendor (fulfills orders)
- ops (resolves disputes)

## What "correct" means
- Inventory MUST decrement on order placement, not on fulfillment
- Vendor payouts MUST round half-to-even (NEVER half-up — finance audit)
- A buyer MUST NOT see another buyer's order history

## Known fragile areas
- The order-status state machine (10+ states, several non-obvious transitions)
- Multi-currency conversion in /api/checkout

This config was worth far more than any system prompt tuning I did.
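
For the worker step that reads `qabot.md`, GitHub's "get repository content" endpoint returns a file as base64 text with an `encoding` field. Below is a sketch of turning that response body into the string that goes into the prompt. `decodeQabot` and the size cap are assumptions of mine, not something the pipeline above requires.

```javascript
// Assumption: cap the config so the cached prefix stays small and predictable.
const MAX_QABOT_CHARS = 8 * 1024;

// Hypothetical helper: decode a "get repository content" response body.
// Returns null when there is no usable file (missing, a directory, or empty).
function decodeQabot(data) {
  if (!data || data.encoding !== 'base64' || typeof data.content !== 'string') {
    return null;
  }
  // Buffer's base64 decoder ignores the line breaks GitHub inserts in `content`.
  const text = Buffer.from(data.content, 'base64').toString('utf8');
  if (text.trim() === '') return null;
  return text.length > MAX_QABOT_CHARS ? text.slice(0, MAX_QABOT_CHARS) : text;
}

// Illustrative response body, built locally in place of a real API call.
const sample = {
  encoding: 'base64',
  content: Buffer.from('# Project context\n').toString('base64'),
};
console.log(decodeQabot(sample));
```

With @octokit/rest, `data` would come from `octokit.rest.repos.getContent({ owner, repo, path: 'qabot.md' })`, with a 404 caught and treated the same as `null`.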

3. PM notes turned QA-only into "useful to the whole team"

I almost shipped without the PM notes section. Then a beta user said: "this
is great, but my PM keeps asking me to summarize what each PR does in
non-technical language." I added a ## PM notes section to the prompt with:

User-facing impact (or "none — internal change")
Business / product implications
Release-note candidate (yes/no + the line)
Coordination needs (support, sales, marketing)

The PM notes are now the most-quoted part of the report. PMs stop pinging
engineers in Slack. Devs stop translating PRs into English manually. Nobody
expected this; everyone keeps it.

Things I'd reconsider

Postgres might be overkill for v1. SQLite + Litestream would have shipped 2 weeks faster.
I rolled my own usage quota. A library like Schematic or Stigg would have given me free metering UIs and better analytics.
The GitHub App permissions matrix is intimidating. Document yours obsessively for users: they want to know exactly what you can see.
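
For a sense of what "rolling my own usage quota" involves at its simplest, here is an in-memory sketch of a monthly counter. It is a toy under stated assumptions, not the production implementation: `tryConsume`, `monthKey`, and the `Map` store are hypothetical, and a real service would keep the count in the database and update it atomically.

```javascript
const FREE_REPORTS_PER_MONTH = 5; // the free tier described in this post

// "2026-06"-style key (UTC), so a new month starts a fresh counter
// without needing a reset job.
function monthKey(date) {
  return date.toISOString().slice(0, 7);
}

// Returns true and records one report if the installation is under its limit.
function tryConsume(store, installationId, now, limit = FREE_REPORTS_PER_MONTH) {
  const key = `${installationId}:${monthKey(now)}`;
  const used = store.get(key) ?? 0;
  if (used >= limit) return false;
  store.set(key, used + 1);
  return true;
}

const store = new Map();
const now = new Date('2026-06-02T06:38:42Z');
const results = [];
for (let i = 0; i < 6; i++) results.push(tryConsume(store, 1001, now));
console.log(results);
```

Metering libraries earn their keep on everything around this: plan changes mid-month, overage handling, and the usage UI.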

Want to try it?

5 free reports per month, no credit card: https://qaonfire.dev

If you have a qabot.md file you're happy with, send it to me — I'm building
a small public collection of good ones to help new users get started.

Happy to answer architecture questions in the comments.

DEV Community: Radu Ghetu