Radu Ghetu

Building QAOnFire: how I used prompt caching to make AI QA reports affordable

https://www.producthunt.com/posts/launching-qaonfire-dev/maker-invite?code=QMyQzM

I shipped QAOnFire last week — a GitHub App that posts
a full QA report on every pull request. Manual test scenarios, edge cases,
setup & verification scripts, PM notes. Written for the actual diff, not a
generic checklist.

This post is the architecture writeup I wish I'd had when I started.

The shape of the problem

A typical pull request on a small team gets:

  1. Code review (sometimes thorough, often a rubber stamp)
  2. Automated tests (usually unit tests; rarely cover integration)
  3. Manual QA (often skipped because there's no QA person)
  4. Merge & deploy
  5. Bug found in prod
  6. Hotfix
  7. Repeat

Step 3 is where most bugs get caught when it happens — and where most bugs
ship when it doesn't. I wanted to put a useful approximation of step 3 right
into the PR comment, within seconds of the PR being opened.

Architecture

GitHub webhook → Express endpoint (returns 200 in <100ms)
     ↓
BullMQ job queued in Redis
     ↓
Worker dequeues:
  1. Octokit fetches the PR diff + selected file contents
  2. Read `qabot.md` from repo root (if present)
  3. Build prompt with cached system prompt + cached qabot.md
  4. Call Claude Sonnet
  5. Octokit posts comment to PR
  6. Decrement monthly usage counter

Stack: Node 18, Express, BullMQ, Redis, Postgres, Stripe, @anthropic-ai/sdk,
@octokit/rest, deployed on Railway.
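
The handoff at the top of that pipeline (acknowledge the webhook fast, do the slow work in a worker) can be sketched without any dependencies. This is a simplified illustration, not QAOnFire's actual code: the in-memory array stands in for the BullMQ queue, and `handleWebhook`, `drainQueue`, the job shape, and the duplicate check are all assumptions made for the example. The `action` values and payload fields follow GitHub's `pull_request` webhook event.

```javascript
// In-memory stand-in for the Redis-backed BullMQ queue (assumption for this sketch).
const queue = [];

// PR actions worth generating a report for; everything else is acknowledged and ignored.
const REPORTABLE_ACTIONS = new Set(['opened', 'synchronize', 'reopened']);

// Hypothetical handler: does no slow work, only filters and enqueues.
function handleWebhook(eventName, payload) {
  if (eventName !== 'pull_request' || !REPORTABLE_ACTIONS.has(payload.action)) {
    return { status: 200, queued: false };
  }
  const repo = payload.repository.full_name;
  const prNumber = payload.pull_request.number;
  const headSha = payload.pull_request.head.sha;
  // One id per repo + PR + head commit, so a redelivered webhook is not queued twice.
  const id = `${repo}#${prNumber}@${headSha}`;
  const isDuplicate = queue.some((job) => job.id === id);
  if (!isDuplicate) {
    queue.push({ id, repo, prNumber, headSha });
  }
  return { status: 200, queued: !isDuplicate };
}

// Hypothetical worker loop: runs later, outside the request/response cycle.
function drainQueue(processJob) {
  const results = [];
  while (queue.length > 0) {
    results.push(processJob(queue.shift()));
  }
  return results;
}
```

The handler always answers 200 and never waits on the model call, which is what keeps the response time low. With BullMQ, a custom `jobId` in the job options can play a role similar to the duplicate check here.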

Three things that made it work

1. Prompt caching changes the economics entirely

Without caching, every PR for every customer pays full input-token price for
the system prompt (~5 KB) and the per-repo qabot.md (~1-3 KB). For a repo with
30 PRs/month, that's the same 6-8 KB prefix billed at full price 30 times.

With caching enabled on those two prefix blocks, the first PR primes the cache
and any PR that arrives while that entry is still live pays ~10% of the input
token cost on that prefix. The default cache lifetime is 5 minutes, refreshed
on every hit. For repos where PRs arrive in clusters (several in a short
window, which is exactly the bursty pattern PR traffic actually has), this is
the difference between "free tier is impossible" and "5 free PRs/month is
comfortably sustainable."
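
To see where caching pays off, the arithmetic can be sketched directly. The sketch assumes a cache read is billed at 0.1× the base input price (the ~10% above) and a cache write at 1.25×. Those are the multipliers I understand Anthropic to publish for the 5-minute cache, but check them against current pricing. The token counts are made-up round numbers, and `prefixCost` is a hypothetical helper.

```javascript
// Multipliers relative to the base input-token price (assumptions, see the text above).
const CACHE_WRITE_MULTIPLIER = 1.25; // first request, or any request after the entry expired
const CACHE_READ_MULTIPLIER = 0.1;   // request that hits a live cache entry

// Cost of the static prefix over a month, in "base-priced token" units.
// cacheHits = how many of the requests arrived while the cache was still warm.
function prefixCost({ prefixTokens, requests, cacheHits }) {
  const misses = requests - cacheHits;
  const uncached = prefixTokens * requests;
  const cached =
    prefixTokens * (misses * CACHE_WRITE_MULTIPLIER + cacheHits * CACHE_READ_MULTIPLIER);
  return { uncached, cached, savings: 1 - cached / uncached };
}

// 30 PRs/month with a 2,000-token prefix, under three traffic patterns.
for (const cacheHits of [0, 10, 20]) {
  const { uncached, cached, savings } = prefixCost({
    prefixTokens: 2000,
    requests: 30,
    cacheHits,
  });
  console.log(
    `${cacheHits} hits: ${Math.round(cached)} vs ${uncached} (${Math.round(savings * 100)}% saved)`
  );
}
```

Under these assumptions, a repo whose PRs never land inside a warm window pays more with caching than without, because every request is a cache write. The break-even hit rate is roughly 22%, so the saving depends entirely on how clustered the traffic is.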

The SDK call looks like:

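import Anthropic from '@anthropic-ai/sdk';

// The client reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

const response = await anthropic.messages.create({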
  model: 'claude-sonnet-4-6',
  max_tokens: 4000,
  system: [
    {
      type: 'text',
      text: QA_SYSTEM_PROMPT,
      cache_control: { type: 'ephemeral' }
    }
  ],
  messages: [
    {
      role: 'user',
      content: [
        // qabot.md as its own cached block, separate from the dynamic PR data
        {
          type: 'text',
          text: qabotMarkdown,
          cache_control: { type: 'ephemeral' }
        },
        {
          type: 'text',
          text: dynamicPRPayload  // not cached — changes per PR
        }
      ]
    }
  ]
});

The key insight: cache the static prefix (system prompt + per-repo config),
NOT the per-PR payload.
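
One way to confirm the prefix is actually being cached is to inspect the `usage` object on the response, which reports `cache_creation_input_tokens` and `cache_read_input_tokens` alongside `input_tokens`. A minimal sketch, where `describeCacheUsage` is a hypothetical helper and the example numbers are invented:

```javascript
// Classify a Messages API usage block as a cache hit, a cache write, or neither.
// Field names come from the Anthropic API; the helper itself is hypothetical.
function describeCacheUsage(usage) {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const uncached = usage.input_tokens ?? 0;
  // A request can read one cached block and write another; this sketch reports that as a hit.
  let status = 'uncached';
  if (read > 0) status = 'hit';
  else if (written > 0) status = 'write';
  return { status, read, written, uncached };
}

// Hand-written usage object (illustrative values, not real API output).
const example = describeCacheUsage({
  input_tokens: 850,
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 2100,
  output_tokens: 1200,
});
console.log(example.status, example.read);
```

If every request comes back as `write` or `uncached`, the usual suspects are a prefix shorter than the model's minimum cacheable length or a prefix that changes between requests, for example a timestamp interpolated into the system prompt.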

2. A per-repo config file (qabot.md) does more work than the model

Generic prompts drift to generic AI test plans. "Test the happy path.
Test invalid inputs. Test edge cases." Useless.

Give the model 20 lines describing the actual domain — user roles, what
"correct" means for this app, known fragile areas — and the output transforms.
Test scenarios start referencing real entities. Edge cases align to actual
business rules. Setup steps mention the specific seeding commands your team
actually uses.

The file is dead simple Markdown — no special syntax. Example:

# Project context

## What this app does
Order management for a B2B wholesale distributor.

## User roles
- buyer (places orders)
- vendor (fulfills orders)
- ops (resolves disputes)

## What "correct" means
- Inventory MUST decrement on order placement, not on fulfillment
- Vendor payouts MUST round half-to-even (NEVER half-up — finance audit)
- A buyer MUST NOT see another buyer's order history

## Known fragile areas
- The order-status state machine (10+ states, several non-obvious transitions)
- Multi-currency conversion in /api/checkout

This config is so much more valuable than any system prompt tuning I did.
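
Step 2 of the worker (read `qabot.md` if present) might look roughly like the following. This is a sketch, not QAOnFire's code: `loadQabotMarkdown` and `buildUserContent` are hypothetical names, the `octokit` argument is assumed to be an authenticated `@octokit/rest` client, and returning `null` for a missing file is an assumed convention so the caller can leave the cached block out rather than send an empty one.

```javascript
// Fetch qabot.md from the repo root at a given ref; resolve to null when the file is absent.
async function loadQabotMarkdown(octokit, { owner, repo, ref }) {
  try {
    const { data } = await octokit.rest.repos.getContent({
      owner,
      repo,
      path: 'qabot.md',
      ref,
    });
    // getContent returns an array for directories; only a regular file is useful here.
    if (Array.isArray(data) || data.type !== 'file') return null;
    const text = Buffer.from(data.content, 'base64').toString('utf8').trim();
    return text.length > 0 ? text : null;
  } catch (error) {
    if (error.status === 404) return null; // no qabot.md in this repo
    throw error; // auth failures, rate limits, etc. should fail the job
  }
}

// Build the user-message content blocks, adding the cached qabot.md block only when it exists.
function buildUserContent(qabotMarkdown, dynamicPRPayload) {
  const blocks = [];
  if (qabotMarkdown) {
    blocks.push({
      type: 'text',
      text: qabotMarkdown,
      cache_control: { type: 'ephemeral' },
    });
  }
  blocks.push({ type: 'text', text: dynamicPRPayload }); // changes per PR, never cached
  return blocks;
}
```

Repos without a config file then get a single uncached user block, while the system prompt block is still cached as before.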

3. PM notes turned QA-only into "useful to the whole team"

I almost shipped without the PM notes section. Then a beta user said: "this
is great, but my PM keeps asking me to summarize what each PR does in
non-technical language." I added a ## PM notes section to the prompt with:

  • User-facing impact (or "none — internal change")
  • Business / product implications
  • Release-note candidate (yes/no + the line)
  • Coordination needs (support, sales, marketing)

The PM notes are now the most-quoted part of the report. PMs stop pinging
engineers in Slack. Devs stop translating PRs into English manually. Nobody
expected this; everyone keeps it.

Things I'd reconsider

  • Postgres might be overkill for v1. SQLite + Litestream would have shipped 2 weeks faster.
  • I rolled my own usage quota. A library like Schematic or Stigg would have given me free metering UIs and better analytics.
  • The GitHub App permissions matrix is intimidating. Document yours obsessively for users — they want to know exactly what you can see.

Want to try it?

5 free reports per month, no credit card: https://qaonfire.dev

If you have a qabot.md file you're happy with, send it to me — I'm building
a small public collection of good ones to help new users get started.

Happy to answer architecture questions in the comments.

Top comments (1)

xulingfeng

Prompt caching approach mirrors what we do with DeepSeek V4 Flash: split the static system prompt and project context into two cache blocks, only the diff is dynamic. Though we use a different cache strategy: the cache key is tied to the git commit SHA, so any diff change automatically misses without manual invalidation.

The qabot.md point really resonates: getting a team to write 20 lines of domain context is worth 10× more than prompt tuning. We have a similar .qacontext file in our project, and the most valuable outcome wasn't telling the AI what to test, it was forcing the team to write down what "correct" means for their system. That exercise alone caught several implicit assumptions nobody had articulated.

The PM notes section is an unexpected MVP. Have you considered making it an optional module? Some teams might not need that output.