<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nova Elvaris</title>
    <description>The latest articles on DEV Community by Nova Elvaris (@novaelvaris).</description>
    <link>https://dev.to/novaelvaris</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3774356%2F1cb4e0e2-edf0-49b3-a78f-5f7aa4523cb4.png</url>
      <title>DEV Community: Nova Elvaris</title>
      <link>https://dev.to/novaelvaris</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/novaelvaris"/>
    <language>en</language>
    <item>
      <title>Why Your AI Code Review Misses Stateful Bugs (and the 3-Context Fix)</title>
      <dc:creator>Nova Elvaris</dc:creator>
      <pubDate>Sat, 11 Apr 2026 11:53:24 +0000</pubDate>
      <link>https://dev.to/novaelvaris/why-your-ai-code-review-misses-stateful-bugs-and-the-3-context-fix-12pl</link>
      <guid>https://dev.to/novaelvaris/why-your-ai-code-review-misses-stateful-bugs-and-the-3-context-fix-12pl</guid>
      <description>&lt;p&gt;A lot of AI code reviews look sharp right up until they miss the bug that actually matters.&lt;/p&gt;

&lt;p&gt;They catch naming noise, dead comments, maybe a missing null check. But they miss the regression caused by a cache key change, the migration that no longer matches the model, or the new flag that breaks the retry path two services away.&lt;/p&gt;

&lt;p&gt;The pattern I've noticed is simple: the model isn't bad at review; it's under-contextualized.&lt;/p&gt;

&lt;p&gt;Most review prompts only include the diff. Stateful bugs usually live outside the diff.&lt;/p&gt;

&lt;h2&gt;Why the diff alone isn't enough&lt;/h2&gt;

&lt;p&gt;A diff shows what changed. It does &lt;strong&gt;not&lt;/strong&gt; show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what state existed before the change&lt;/li&gt;
&lt;li&gt;what surrounding invariants must still hold&lt;/li&gt;
&lt;li&gt;what hidden dependency the change now violates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a PR changes this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The diff looks syntactically harmless. But if downstream readers still call &lt;code&gt;cache.get(user.id)&lt;/code&gt;, you've just created a bug that only appears in a later request path.&lt;/p&gt;

&lt;p&gt;The model won't reliably catch that if you only hand it the patch.&lt;/p&gt;
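&lt;p&gt;To make the failure concrete, here's a minimal Python sketch of the same mismatch. The function and key names are hypothetical, not from a real codebase:&lt;/p&gt;

```python
# Minimal sketch of the key mismatch above; names are hypothetical.
cache = {}

def save_profile(user_id, profile):
    # after the PR: writes are keyed by email
    cache[profile["email"]] = profile

def load_profile(user_id):
    # downstream reader is still keyed by user id
    return cache.get(user_id)

save_profile("u42", {"email": "nova@example.com", "plan": "pro"})
print(load_profile("u42"))  # None -- the write key and read key diverged
```

&lt;p&gt;Every call site compiles, every line of the diff looks fine, and the read path silently returns nothing.&lt;/p&gt;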

&lt;h2&gt;The 3-context fix&lt;/h2&gt;

&lt;p&gt;I now structure review prompts around three layers of context.&lt;/p&gt;

&lt;h3&gt;1. Change context&lt;/h3&gt;

&lt;p&gt;This is the diff itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Here is the unified diff for the PR. Identify likely logic, state, and integration risks.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Necessary, but not sufficient.&lt;/p&gt;

&lt;h3&gt;2. Runtime context&lt;/h3&gt;

&lt;p&gt;Tell the model what state or workflow the code participates in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Runtime context:
&lt;span class="p"&gt;-&lt;/span&gt; cache keys are always user IDs
&lt;span class="p"&gt;-&lt;/span&gt; writes happen during login
&lt;span class="p"&gt;-&lt;/span&gt; reads happen during profile fetch and billing sync
&lt;span class="p"&gt;-&lt;/span&gt; stale cache entries can cause cross-user data leaks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is usually the missing layer. It gives the model something to reason &lt;em&gt;against&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;3. Invariant context&lt;/h3&gt;

&lt;p&gt;List the rules that must stay true after the change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Invariants:
&lt;span class="p"&gt;-&lt;/span&gt; cache write key must equal cache read key
&lt;span class="p"&gt;-&lt;/span&gt; one user may never read another user's profile
&lt;span class="p"&gt;-&lt;/span&gt; failed sync retries must remain idempotent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Invariants are powerful because they shift review from "does this code look nice?" to "what rule might this break?"&lt;/p&gt;
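&lt;p&gt;A bonus: many invariants can be checked mechanically, not just stated in the prompt. A hedged sketch (helper names are illustrative) of turning two of the invariants above into plain assertions:&lt;/p&gt;

```python
# Illustrative helpers that turn two invariants into mechanical checks.
def keys_match(write_key, read_key):
    # invariant: cache write key must equal cache read key
    return write_key == read_key

def is_idempotent(retry, state):
    # invariant: retrying twice must leave the same state as retrying once
    return retry(retry(dict(state))) == retry(dict(state))

mark_synced = lambda s: {**s, "synced": True}
print(keys_match("user.id", "profile.email"))         # False -- violation
print(is_idempotent(mark_synced, {"synced": False}))  # True
```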

&lt;h2&gt;The prompt template&lt;/h2&gt;

&lt;p&gt;This is the version I keep around:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;You are reviewing a code change for bugs, not style.

&lt;span class="gu"&gt;## Change context&lt;/span&gt;
[paste diff]

&lt;span class="gu"&gt;## Runtime context&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; describe where this code runs
&lt;span class="p"&gt;-&lt;/span&gt; describe stateful dependencies
&lt;span class="p"&gt;-&lt;/span&gt; describe side effects

&lt;span class="gu"&gt;## Invariant context&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; list 3-5 rules that must remain true

&lt;span class="gu"&gt;## Output format&lt;/span&gt;
Return:
&lt;span class="p"&gt;1.&lt;/span&gt; short summary
&lt;span class="p"&gt;2.&lt;/span&gt; likely bug risks
&lt;span class="p"&gt;3.&lt;/span&gt; missing tests
&lt;span class="p"&gt;4.&lt;/span&gt; what additional file/context you would inspect next

Do not comment on naming, formatting, or refactors unless they create a bug risk.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last line matters. Otherwise the model burns attention on surface-level cleanup.&lt;/p&gt;

&lt;h2&gt;A practical example&lt;/h2&gt;

&lt;p&gt;Here's a compact example with a queue consumer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# before
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;mark_failed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# after
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attempts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;mark_failed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks fine, right?&lt;/p&gt;

&lt;p&gt;But if the invariant is "a job gets 3 retries after the initial run," then &lt;code&gt;&amp;gt;= 3&lt;/code&gt; changes the allowed retry count. That's a behavioral bug, not a syntax bug.&lt;/p&gt;

&lt;p&gt;A diff-only review may miss it.&lt;/p&gt;

&lt;p&gt;A review with runtime and invariant context usually flags it immediately.&lt;/p&gt;
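&lt;p&gt;You can also verify the off-by-one yourself. Assuming &lt;code&gt;attempts&lt;/code&gt; counts completed runs (initial run included), a quick sketch of how many runs each condition permits:&lt;/p&gt;

```python
# How many runs happen before each condition marks the job failed,
# assuming `attempts` counts completed runs (initial run included).
def runs_until_failed(should_fail):
    attempts = 0
    while not should_fail(attempts):
        attempts += 1
    return attempts

before = runs_until_failed(lambda a: a > 3)   # old: attempts > 3
after = runs_until_failed(lambda a: a >= 3)   # new: attempts >= 3
print(before, after)  # 4 3 -- the "harmless" diff removed one run
```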

&lt;h2&gt;What changed for me after using this&lt;/h2&gt;

&lt;p&gt;Once I started feeding these three context types into review prompts, the comments got noticeably better:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fewer style nitpicks&lt;/li&gt;
&lt;li&gt;more integration warnings&lt;/li&gt;
&lt;li&gt;better test suggestions&lt;/li&gt;
&lt;li&gt;clearer calls for follow-up inspection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model still doesn't replace a human reviewer. But it stops acting like a linter with opinions and starts acting more like a junior engineer who understands the system constraints.&lt;/p&gt;

&lt;p&gt;That's a much better role.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question for you:&lt;/strong&gt; What's the last bug your AI review missed, and was the missing piece really model quality, or just missing runtime context?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>programming</category>
      <category>beginners</category>
    </item>
    <item>
      <title>5 Fields Every Prompt Contract Needs Before a Team Can Trust It</title>
      <dc:creator>Nova Elvaris</dc:creator>
      <pubDate>Sat, 11 Apr 2026 11:38:20 +0000</pubDate>
      <link>https://dev.to/novaelvaris/5-fields-every-prompt-contract-needs-before-a-team-can-trust-it-3jo6</link>
      <guid>https://dev.to/novaelvaris/5-fields-every-prompt-contract-needs-before-a-team-can-trust-it-3jo6</guid>
      <description>&lt;p&gt;Most prompt failures on teams are not model failures. They're spec failures.&lt;/p&gt;

&lt;p&gt;One person writes a prompt that returns JSON. Someone else adds a tone instruction. A third person pastes it into a different tool that expects markdown. By Friday, everyone is using "the same prompt" and getting different behavior.&lt;/p&gt;

&lt;p&gt;That's why I like Prompt Contracts: a short spec for how a prompt is supposed to work. But a contract only helps if it includes the fields a team actually needs.&lt;/p&gt;

&lt;p&gt;Here are the five fields I now consider mandatory.&lt;/p&gt;

&lt;h2&gt;1. Purpose&lt;/h2&gt;

&lt;p&gt;If a prompt contract can't answer "what job is this prompt for?" in one sentence, it's already too vague.&lt;/p&gt;

&lt;p&gt;Bad:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Generate a useful response about the code.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Better:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Review a git diff and return actionable bug findings before merge.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That single sentence narrows scope. It also makes it obvious when someone tries to turn the same prompt into a style checker, architecture reviewer, and mentoring assistant all at once.&lt;/p&gt;

&lt;h2&gt;2. Inputs&lt;/h2&gt;

&lt;p&gt;Teams break prompts when they silently change the input shape.&lt;/p&gt;

&lt;p&gt;Spell out what the prompt expects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Inputs&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`diff`&lt;/span&gt;: unified git diff, required
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`language`&lt;/span&gt;: programming language, optional
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`focus`&lt;/span&gt;: one of &lt;span class="sb"&gt;`bugs | security | maintainability`&lt;/span&gt;, default &lt;span class="sb"&gt;`bugs`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the downstream caller knows what to send, and a teammate can't casually add "also pass screenshots and ticket IDs" without updating the contract.&lt;/p&gt;
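&lt;p&gt;If the prompt is called from code, the same inputs section can be pinned as a type so callers fail fast instead of drifting. A sketch, with the class name and validation wording as illustrative choices:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

VALID_FOCUS = {"bugs", "security", "maintainability"}

@dataclass
class ReviewInputs:
    """Mirrors the contract's Inputs section; names are illustrative."""
    diff: str                       # unified git diff, required
    language: Optional[str] = None  # programming language, optional
    focus: str = "bugs"             # default per the contract

    def __post_init__(self):
        if not self.diff:
            raise ValueError("diff is required")
        if self.focus not in VALID_FOCUS:
            raise ValueError("focus must be one of " + ", ".join(sorted(VALID_FOCUS)))
```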

&lt;h2&gt;3. Output schema&lt;/h2&gt;

&lt;p&gt;This is the field most teams skip, and it's the one that causes the ugliest breakage.&lt;/p&gt;

&lt;p&gt;If a prompt is part of a workflow, define the output format explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"findings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high|medium|low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"issue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"suggestion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"blocking"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you do this, two good things happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;humans know what a "correct" answer looks like&lt;/li&gt;
&lt;li&gt;scripts can validate output instead of hoping the model behaved&lt;/li&gt;
&lt;/ul&gt;
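&lt;p&gt;And "scripts can validate output" doesn't require a framework. A stdlib-only sketch of shape-checking the schema above before trusting the response:&lt;/p&gt;

```python
import json

ALLOWED_SEVERITY = {"high", "medium", "low"}

def validate_review(raw):
    """Parse and shape-check a review response against the schema above."""
    data = json.loads(raw)
    assert isinstance(data.get("summary"), str), "summary must be a string"
    assert isinstance(data.get("blocking"), bool), "blocking must be a bool"
    for finding in data.get("findings", []):
        assert finding.get("severity") in ALLOWED_SEVERITY, "bad severity"
        for key in ("file", "issue", "suggestion"):
            assert isinstance(finding.get(key), str), key + " must be a string"
    return data

ok = validate_review('{"summary": "ok", "findings": [], "blocking": false}')
print(ok["blocking"])  # False
```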

&lt;h2&gt;4. Error modes&lt;/h2&gt;

&lt;p&gt;A lot of prompts work on the happy path and fall apart the second the input is weird.&lt;/p&gt;

&lt;p&gt;Add an error section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Error modes&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; If &lt;span class="sb"&gt;`diff`&lt;/span&gt; is empty, return an empty findings array
&lt;span class="p"&gt;-&lt;/span&gt; If input exceeds 2,000 lines, return &lt;span class="sb"&gt;`error: split_input`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; If required context is missing, ask exactly one clarifying question
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the difference between a reliable prompt and a polite chaos machine.&lt;/p&gt;
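&lt;p&gt;Two of those error modes are cheap to enforce in the calling code, so the easy cases never reach the model at all. A sketch, with the 2,000-line limit mirroring the contract above:&lt;/p&gt;

```python
MAX_LINES = 2000  # mirrors the contract's size limit

def preflight(diff):
    """Handle the contract's deterministic error modes before spending tokens."""
    if not diff.strip():
        return {"findings": []}          # empty diff -> empty findings
    if len(diff.splitlines()) > MAX_LINES:
        return {"error": "split_input"}  # oversize diff -> split_input
    return None                          # happy path: send to the model

print(preflight(""))            # {'findings': []}
print(preflight("x\n" * 3000))  # {'error': 'split_input'}
```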

&lt;h2&gt;5. Non-goals&lt;/h2&gt;

&lt;p&gt;I love this field because it stops "helpful" scope creep.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Non-goals&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Do not rewrite working code
&lt;span class="p"&gt;-&lt;/span&gt; Do not suggest dependency swaps
&lt;span class="p"&gt;-&lt;/span&gt; Do not flag style issues when focus=bugs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without non-goals, assistants tend to drift toward whatever they can comment on. Suddenly your bug review prompt is giving naming advice and suggesting framework migrations.&lt;/p&gt;

&lt;h2&gt;The compact template&lt;/h2&gt;

&lt;p&gt;Here's the version I actually use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Prompt Contract: Code Review&lt;/span&gt;

&lt;span class="gu"&gt;## Purpose&lt;/span&gt;
Review a diff and find merge-blocking issues.

&lt;span class="gu"&gt;## Inputs&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; diff (required)
&lt;span class="p"&gt;-&lt;/span&gt; language (optional)
&lt;span class="p"&gt;-&lt;/span&gt; focus=bugs|security|maintainability

&lt;span class="gu"&gt;## Output schema&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; summary
&lt;span class="p"&gt;-&lt;/span&gt; findings[]
&lt;span class="p"&gt;-&lt;/span&gt; blocking

&lt;span class="gu"&gt;## Error modes&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; empty diff -&amp;gt; empty findings
&lt;span class="p"&gt;-&lt;/span&gt; oversize diff -&amp;gt; split_input
&lt;span class="p"&gt;-&lt;/span&gt; missing context -&amp;gt; ask one clarifying question

&lt;span class="gu"&gt;## Non-goals&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; no rewrites
&lt;span class="p"&gt;-&lt;/span&gt; no style nits unless requested
&lt;span class="p"&gt;-&lt;/span&gt; no dependency advice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It fits on one page, which matters. If the contract turns into a mini novel, nobody reads it consistently, including the model.&lt;/p&gt;

&lt;h2&gt;Why this works better than prompt folklore&lt;/h2&gt;

&lt;p&gt;Teams often share prompts through Slack messages, Notion pages, or copied snippets in random repos. That's folklore, not a system.&lt;/p&gt;

&lt;p&gt;A contract turns "the prompt Alex swears by" into a reusable asset that can be reviewed, versioned, and tested.&lt;/p&gt;

&lt;p&gt;And once you start doing that, prompt quality stops depending on memory.&lt;/p&gt;

&lt;p&gt;It starts depending on a file.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question for you:&lt;/strong&gt; If you had to add just one field to your current prompts today, would it be inputs, output schema, or non-goals?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Why AI Forgets Your Project's Conventions (and the One-File Fix)</title>
      <dc:creator>Nova Elvaris</dc:creator>
      <pubDate>Fri, 10 Apr 2026 13:15:05 +0000</pubDate>
      <link>https://dev.to/novaelvaris/why-ai-forgets-your-projects-conventions-and-the-one-file-fix-1d7o</link>
      <guid>https://dev.to/novaelvaris/why-ai-forgets-your-projects-conventions-and-the-one-file-fix-1d7o</guid>
      <description>&lt;p&gt;You spend 20 minutes explaining to your AI assistant that this codebase uses tabs not spaces, that every function needs a docstring, that you prefer early returns over nested ifs, that this project uses &lt;code&gt;snake_case&lt;/code&gt; for files even though the team uses &lt;code&gt;kebab-case&lt;/code&gt; elsewhere. It nails the next three files. You're thrilled.&lt;/p&gt;

&lt;p&gt;Then you start a new session the next morning. It's back to spaces, nested ifs, and JSDoc-style comments. Like the last conversation never happened.&lt;/p&gt;

&lt;p&gt;Here's why that's built into how these tools work — and the one-file pattern I use to fix it permanently.&lt;/p&gt;

&lt;h2&gt;Why this keeps happening&lt;/h2&gt;

&lt;p&gt;LLMs have no persistent memory between sessions. Whatever you taught the model in yesterday's chat is gone the moment you open a new window. The only "memory" that survives is whatever you put into the prompt at the start of the new session.&lt;/p&gt;

&lt;p&gt;Most people solve this by re-explaining conventions every time, or by copy-pasting a style guide into every prompt. Both are tedious and error-prone. You'll forget to include the "prefer early returns" rule three times a week, and the model will cheerfully write nested nightmares until you catch it.&lt;/p&gt;

&lt;p&gt;The real fix is to make the conventions a file in your project, and to make reading that file the first thing any AI assistant does when it touches your code.&lt;/p&gt;

&lt;h2&gt;The CONVENTIONS.md pattern&lt;/h2&gt;

&lt;p&gt;Create a file at the root of your project called &lt;code&gt;CONVENTIONS.md&lt;/code&gt;. It's short, specific, and written in imperative voice. Here's the template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Project Conventions&lt;/span&gt;

&lt;span class="gu"&gt;## Tone / voice (for generated content)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Neutral, concise, no emojis in code comments
&lt;span class="p"&gt;-&lt;/span&gt; Prefer active voice

&lt;span class="gu"&gt;## Formatting&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Indentation: 4 spaces (NOT tabs)
&lt;span class="p"&gt;-&lt;/span&gt; Line length: 100 chars
&lt;span class="p"&gt;-&lt;/span&gt; Trailing commas: yes (for diff-friendliness)

&lt;span class="gu"&gt;## Naming&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Files: snake_case.py
&lt;span class="p"&gt;-&lt;/span&gt; Classes: PascalCase
&lt;span class="p"&gt;-&lt;/span&gt; Functions: snake_case
&lt;span class="p"&gt;-&lt;/span&gt; Constants: SCREAMING_SNAKE
&lt;span class="p"&gt;-&lt;/span&gt; Never abbreviate except: &lt;span class="sb"&gt;`db`&lt;/span&gt;, &lt;span class="sb"&gt;`id`&lt;/span&gt;, &lt;span class="sb"&gt;`url`&lt;/span&gt;, &lt;span class="sb"&gt;`http`&lt;/span&gt;

&lt;span class="gu"&gt;## Code style&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Prefer early returns over nested &lt;span class="sb"&gt;`if`&lt;/span&gt;s
&lt;span class="p"&gt;-&lt;/span&gt; No single-letter variable names except in comprehensions
&lt;span class="p"&gt;-&lt;/span&gt; Docstrings on every public function (Google style)
&lt;span class="p"&gt;-&lt;/span&gt; Type hints on every function signature

&lt;span class="gu"&gt;## Libraries&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; HTTP: requests (NOT httpx)
&lt;span class="p"&gt;-&lt;/span&gt; Testing: pytest (NOT unittest)
&lt;span class="p"&gt;-&lt;/span&gt; Dates: datetime with explicit UTC (NOT pendulum, NOT arrow)

&lt;span class="gu"&gt;## Directory rules&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Tests live in tests/, mirroring src/ structure
&lt;span class="p"&gt;-&lt;/span&gt; No code in __init__.py beyond imports
&lt;span class="p"&gt;-&lt;/span&gt; One class per file unless they're tightly coupled

&lt;span class="gu"&gt;## Things to NEVER do&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Don't add new dependencies without asking
&lt;span class="p"&gt;-&lt;/span&gt; Don't refactor unrelated code
&lt;span class="p"&gt;-&lt;/span&gt; Don't "improve" existing code style to match elsewhere
&lt;span class="p"&gt;-&lt;/span&gt; Don't add comments that explain what the code does (only WHY)

&lt;span class="gu"&gt;## Things to ALWAYS do&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Run mypy --strict on any file you modify
&lt;span class="p"&gt;-&lt;/span&gt; Add a test for every new public function
&lt;span class="p"&gt;-&lt;/span&gt; Update CHANGELOG.md for user-facing changes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep it under 100 lines. If it's longer, nobody (human or model) will internalize it.&lt;/p&gt;

&lt;h2&gt;The invocation ritual&lt;/h2&gt;

&lt;p&gt;At the start of every AI-assisted session on this project, paste this one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before you do anything else, read CONVENTIONS.md in the project root. Follow it strictly. If anything I ask contradicts it, ask me which should win before proceeding.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The model now has explicit instructions to (a) load the conventions, (b) follow them, and (c) flag conflicts rather than silently deviating.&lt;/p&gt;

&lt;p&gt;I bolt this onto my project scaffolding template, so every new project starts with a CONVENTIONS.md and the invocation line sits in my session-starter snippet.&lt;/p&gt;
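&lt;p&gt;The ritual is easy to automate, too. A hedged sketch of a session-starter helper that builds the first message from the conventions file; the wording and paths are illustrative, not any specific tool's API:&lt;/p&gt;

```python
from pathlib import Path

INVOCATION = (
    "Before you do anything else, read the conventions below. Follow them "
    "strictly. If anything I ask contradicts them, ask me which should win."
)

def session_opener(project_root="."):
    """Build the first message of an AI session from CONVENTIONS.md."""
    conventions = Path(project_root, "CONVENTIONS.md")
    if not conventions.exists():
        return INVOCATION + "\n\n(no CONVENTIONS.md found)"
    return INVOCATION + "\n\n---\n" + conventions.read_text()
```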

&lt;h2&gt;Why one file beats a styleguide folder&lt;/h2&gt;

&lt;p&gt;You might be tempted to split this across multiple files — naming conventions in one place, formatting in another, library choices in a third. Don't. The model has to read all of them anyway, and when they're split, you lose the ability to just say "read CONVENTIONS.md." You're now juggling "which files do you need to load?" again.&lt;/p&gt;

&lt;p&gt;One file. Root of the project. Always read first. Simple rule.&lt;/p&gt;

&lt;h2&gt;What happens when conventions change&lt;/h2&gt;

&lt;p&gt;When the team decides to switch from &lt;code&gt;requests&lt;/code&gt; to &lt;code&gt;httpx&lt;/code&gt;, you update exactly one line in CONVENTIONS.md. Every future AI session will pick up the change automatically. No hunting through old prompts. No "wait, which library did we standardize on?"&lt;/p&gt;

&lt;p&gt;This is the same reason README files work: &lt;strong&gt;a single canonical source of truth beats a dozen scattered reminders.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;A real test&lt;/h2&gt;

&lt;p&gt;I ran a blind test on a Python project with a complex convention set (40 rules across formatting, naming, libraries, and anti-patterns):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Without CONVENTIONS.md:&lt;/strong&gt; Across 10 new files, the model violated an average of 7.2 conventions per file. Mostly formatting and library choices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With CONVENTIONS.md and the invocation line:&lt;/strong&gt; Across 10 new files, the model violated an average of 0.6 conventions per file. Three files were completely clean.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's an order-of-magnitude improvement from one file and one sentence.&lt;/p&gt;

&lt;p&gt;The broader lesson: &lt;strong&gt;anything you find yourself re-explaining to your assistant is a missing file in your project.&lt;/strong&gt; Write it down once, reference it always.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question for you:&lt;/strong&gt; What's the one convention you're tired of re-explaining to your AI assistant? Mine is "early returns, not nested ifs" — I must have said it 200 times before I wrote it down.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>5 Signs Your Prompt Is Leaking Tokens (and How to Seal Each One)</title>
      <dc:creator>Nova Elvaris</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:24:31 +0000</pubDate>
      <link>https://dev.to/novaelvaris/5-signs-your-prompt-is-leaking-tokens-and-how-to-seal-each-one-57o3</link>
      <guid>https://dev.to/novaelvaris/5-signs-your-prompt-is-leaking-tokens-and-how-to-seal-each-one-57o3</guid>
      <description>&lt;p&gt;A "leaky" prompt is one that burns tokens without contributing to the output. The model gets bigger inputs, the bill goes up, and the quality gets &lt;em&gt;worse&lt;/em&gt; because the signal drowns in noise. I audited my own prompt library last week and found 5 categories of leaks. Here they are, with the fixes.&lt;/p&gt;

&lt;h2&gt;Sign 1: You're pasting the whole file when the model only needs 20 lines&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Your prompts look like &lt;code&gt;"Here's my project: [3000 lines of code]. Now fix the bug in the login function."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it leaks:&lt;/strong&gt; The model has to scan the entire context to find the relevant section. For every leak like this, you pay ingestion cost twice — once to find the relevant code, once to actually reason about it. And the larger the haystack, the more likely the model pulls irrelevant context into its answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Extract the minimal slice. For a bug in &lt;code&gt;login()&lt;/code&gt;, you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;login()&lt;/code&gt; function itself&lt;/li&gt;
&lt;li&gt;Its direct callers (one or two at most)&lt;/li&gt;
&lt;li&gt;The types/interfaces it uses&lt;/li&gt;
&lt;li&gt;NOT the unrelated 2,000 lines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thirty seconds with &lt;code&gt;grep&lt;/code&gt; and &lt;code&gt;sed&lt;/code&gt; gets you there. Most IDEs also have "copy symbol with references" built in.&lt;/p&gt;
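&lt;p&gt;If you'd rather script it than click around the IDE, here's a rough sketch that pulls one top-level function out of a file. It's regex-based and Python-specific, so treat it as a starting point rather than a parser:&lt;/p&gt;

```python
import re

def extract_function(source, name):
    """Return just `name`'s top-level definition, or None if absent."""
    lines = source.splitlines()
    start = None
    for i, line in enumerate(lines):
        if re.match(rf"def {name}\b", line):
            start = i
            break
    if start is None:
        return None
    body = [lines[start]]
    for line in lines[start + 1:]:
        # stop at the next top-level statement
        if line.strip() and not line.startswith((" ", "\t")):
            break
        body.append(line)
    return "\n".join(body).rstrip()
```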

&lt;h2&gt;Sign 2: You keep re-explaining the same context in every prompt&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Every prompt in your session starts with "I'm building a Flask app that uses Postgres and handles user auth via JWT..."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it leaks:&lt;/strong&gt; You're paying ingestion cost for the same context on every single turn. In a 20-turn session, that's 20x the tokens for no benefit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; &lt;strong&gt;Seed files.&lt;/strong&gt; Put the project context in a single file (&lt;code&gt;CONTEXT.md&lt;/code&gt;) and reference it once at the top of the session. Then let the model remember via its own context window, or re-inject only on genuine switches.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;I'm working on the project described in CONTEXT.md (attached).
For this session, focus on: &lt;span class="nt"&gt;&amp;lt;specific&lt;/span&gt; &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done. Context explained once, referenced by name after that.&lt;/p&gt;
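&lt;p&gt;The arithmetic behind the "20x" claim, with illustrative numbers rather than measurements:&lt;/p&gt;

```python
# Back-of-envelope cost of re-pasting the preamble vs. seeding it once.
preamble_tokens = 150  # assumed size of the Flask/Postgres/JWT intro
turns = 20

repeated = preamble_tokens * turns  # pasted fresh on every turn
seeded = preamble_tokens            # CONTEXT.md ingested once

print(repeated, seeded)  # 3000 150 -- a flat 20x on that slice of input
```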

&lt;h2&gt;Sign 3: Your examples are longer than your instructions&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; You give the model three full-length input/output pairs as "few-shot examples" to teach it a format, and the examples eat 80% of your prompt budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it leaks:&lt;/strong&gt; Examples are necessary, but they compound fast. Three 200-word examples = 600 words just to set up a task that should be 100 words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Use &lt;strong&gt;minimal-pair examples.&lt;/strong&gt; Instead of full realistic examples, give tiny toy examples that only demonstrate the format, then describe the real inputs separately.&lt;/p&gt;

&lt;p&gt;Before (leaky):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example 1:
Input: [300 words of realistic text]
Output: [250 words of formatted output]

Example 2:
Input: [300 more words]
Output: [250 more words]

Now process this: [real input]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After (sealed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Format&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;every&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;as:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"urgency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="s2"&gt;"med"&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Example:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;Input:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ship broken"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;Output:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ship broken"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"bug"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"urgency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Now&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;process:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;real&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;input&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same signal, a tenth of the tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sign 4: You're including chat history the model doesn't need
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Your conversation is 15 turns deep and you're still sending all 15 turns to the model for every new message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it leaks:&lt;/strong&gt; The model re-processes the entire history on every turn. Turn 15 pays for context from turns 1-14 &lt;em&gt;every single time&lt;/em&gt;, even if those turns were tool errors, clarifying questions, or false starts that are no longer relevant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; &lt;strong&gt;Compact the history.&lt;/strong&gt; At turn 5, 10, or 15, ask the model to summarize what's been decided and what state we're in. Then start a new thread with just the summary. (See also: The Handoff Prompt — same idea, different trigger.)&lt;/p&gt;

&lt;p&gt;Most chat UIs won't do this automatically. You have to build the habit: "Summarize our progress so far, then I'll start fresh."&lt;/p&gt;
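&lt;p&gt;If you're driving the model through an API rather than a chat UI, the compaction habit can be scripted. The sketch below is mine, not from any particular SDK: &lt;code&gt;summarize&lt;/code&gt; is a stand-in for whatever completion call your stack provides.&lt;/p&gt;

```python
# Sketch of history compaction. "summarize" is a hypothetical callable
# (prompt in, text out) standing in for your chat-completion API.

def compact_history(history, summarize):
    """Collapse a long message list into a single summary turn.

    history: list of {"role": ..., "content": ...} dicts
    summarize: callable that turns the transcript into a short summary
    """
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    summary = summarize(
        "Summarize what's been decided and what state we're in:\n" + transcript
    )
    # Start a fresh thread carrying only the compacted context.
    return [{"role": "user", "content": "Context so far: " + summary}]
```

&lt;p&gt;Turn 16 then pays for one summary turn instead of fifteen stale ones.&lt;/p&gt;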

&lt;h2&gt;
  
  
  Sign 5: You're using verbose natural language where a schema would do
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Prompts like &lt;code&gt;"Return the answer as a JSON object with a field called 'summary' which should be a string containing a brief description, and then a field called 'tags' which should be an array of strings, and also include a field called 'urgency' which can be one of 'low', 'medium', or 'high'..."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it leaks:&lt;/strong&gt; You're using 80 tokens to describe what a 20-token schema fragment could say unambiguously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Write the schema directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Output&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(strict&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;JSON,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;no&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;prose):&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"urgency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="s2"&gt;"medium"&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fewer tokens, less ambiguity, better compliance from the model. Modern LLMs handle type-like notation better than long English descriptions for format instructions.&lt;/p&gt;
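&lt;p&gt;A schema is also checkable on your side. Here's a minimal sketch of a client-side validator for the shape above (the field names come from the schema; the validator itself is my own, not from any library):&lt;/p&gt;

```python
import json

# Minimal check that a model response matches the schema fragment:
# {"summary": string, "tags": string[], "urgency": "low"|"medium"|"high"}

def validate(raw):
    data = json.loads(raw)  # raises if the model wrapped JSON in prose
    assert isinstance(data["summary"], str)
    assert all(isinstance(t, str) for t in data["tags"])
    assert data["urgency"] in ("low", "medium", "high")
    return data

validate('{"summary": "Ship broken", "tags": ["bug"], "urgency": "high"}')
```

&lt;p&gt;If validation fails, you can retry with the error message appended instead of hand-inspecting every response.&lt;/p&gt;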

&lt;h2&gt;
  
  
  Running the audit on your own prompts
&lt;/h2&gt;

&lt;p&gt;Pick your 10 most-used prompts. For each one, ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Am I pasting more context than the task needs?&lt;/li&gt;
&lt;li&gt;Am I re-explaining things the model already knew?&lt;/li&gt;
&lt;li&gt;Are my examples longer than my instructions?&lt;/li&gt;
&lt;li&gt;Is my chat history a graveyard?&lt;/li&gt;
&lt;li&gt;Am I describing structure in prose?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A single "yes" is a leak. Two or more, and you're probably spending 2-3x what you should on tokens without improving quality.&lt;/p&gt;

&lt;p&gt;I cut my token usage by about 40% after doing this audit on my own stuff. The unexpected bonus: the &lt;em&gt;quality&lt;/em&gt; of the outputs also went up, because the signal-to-noise ratio improved in every prompt.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question for you:&lt;/strong&gt; Which of these five leaks is your worst offender? Mine is #4 — I let chat histories run way too long before compacting. Curious where everyone else's leaks live.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Handoff Prompt: Transfer AI Context Between Models Without Losing State</title>
      <dc:creator>Nova Elvaris</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:18:26 +0000</pubDate>
      <link>https://dev.to/novaelvaris/the-handoff-prompt-transfer-ai-context-between-models-without-losing-state-39a4</link>
      <guid>https://dev.to/novaelvaris/the-handoff-prompt-transfer-ai-context-between-models-without-losing-state-39a4</guid>
      <description>&lt;p&gt;Here's a workflow I run almost every day: start a task with a fast cheap model, hit its limits, then escalate to a bigger model for the hard part. Or go the other way — use a big model to design something, then switch to a small one for mechanical execution.&lt;/p&gt;

&lt;p&gt;The problem: every time I switched, I'd paste half my chat history into the new window and the new model would start with the wrong context. Missing decisions, misread scope, re-asking questions I'd already answered.&lt;/p&gt;

&lt;p&gt;So I stopped copy-pasting chats and started writing a &lt;strong&gt;Handoff Prompt&lt;/strong&gt; — a structured summary the outgoing model writes for the incoming one. Here's the pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with raw chat transfer
&lt;/h2&gt;

&lt;p&gt;When you paste a long conversation into a new model, three things go wrong:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Signal-to-noise drops.&lt;/strong&gt; Most chat turns are tool errors, clarifications, false starts. The new model has to re-infer what's still relevant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decisions disappear.&lt;/strong&gt; "We decided to use Postgres instead of DynamoDB" is buried in turn 14. The new model might recommend DynamoDB on turn 1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implicit context is lost.&lt;/strong&gt; Things the old model "knew" from earlier in the conversation aren't spelled out anywhere.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Raw transcripts are the worst possible format for context transfer. They're optimized for humans replaying a conversation, not for an agent picking up work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Handoff Prompt template
&lt;/h2&gt;

&lt;p&gt;At the end of a session (or whenever I want to switch models), I ask the current model to generate a handoff:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;You are about to hand off this task to another AI assistant. Write a HANDOFF PROMPT that contains everything the next assistant needs, and nothing it doesn't. Use this exact structure:

&lt;span class="gh"&gt;# Handoff: &amp;lt;one-line task summary&amp;gt;&lt;/span&gt;

&lt;span class="gu"&gt;## Goal&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;what&lt;/span&gt; &lt;span class="na"&gt;we&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="na"&gt;re&lt;/span&gt; &lt;span class="na"&gt;trying&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt; &lt;span class="na"&gt;accomplish&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt; &lt;span class="err"&gt;1&lt;/span&gt;&lt;span class="na"&gt;-3&lt;/span&gt; &lt;span class="na"&gt;sentences&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="gu"&gt;## Current state&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;what&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="na"&gt;s&lt;/span&gt; &lt;span class="na"&gt;been&lt;/span&gt; &lt;span class="na"&gt;done&lt;/span&gt; &lt;span class="na"&gt;so&lt;/span&gt; &lt;span class="na"&gt;far&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="na"&gt;concrete&lt;/span&gt; &lt;span class="na"&gt;artifacts&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt; &lt;span class="na"&gt;not&lt;/span&gt; &lt;span class="na"&gt;narrative&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="gu"&gt;## Decisions made (do not re-litigate)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="nt"&gt;&amp;lt;decision&amp;gt;&lt;/span&gt;: &lt;span class="nt"&gt;&amp;lt;reason&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="nt"&gt;&amp;lt;decision&amp;gt;&lt;/span&gt;: &lt;span class="nt"&gt;&amp;lt;reason&amp;gt;&lt;/span&gt;

&lt;span class="gu"&gt;## Open questions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="nt"&gt;&amp;lt;question&lt;/span&gt; &lt;span class="na"&gt;the&lt;/span&gt; &lt;span class="na"&gt;user&lt;/span&gt; &lt;span class="na"&gt;hasn&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="na"&gt;t&lt;/span&gt; &lt;span class="na"&gt;answered&lt;/span&gt; &lt;span class="na"&gt;yet&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="gu"&gt;## Constraints&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="nt"&gt;&amp;lt;hard&lt;/span&gt; &lt;span class="na"&gt;constraint&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt; &lt;span class="na"&gt;e.g.&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt; &lt;span class="err"&gt;"&lt;/span&gt;&lt;span class="na"&gt;no&lt;/span&gt; &lt;span class="na"&gt;external&lt;/span&gt; &lt;span class="na"&gt;dependencies&lt;/span&gt;&lt;span class="err"&gt;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="nt"&gt;&amp;lt;soft&lt;/span&gt; &lt;span class="na"&gt;constraint&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt; &lt;span class="na"&gt;e.g.&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt; &lt;span class="err"&gt;"&lt;/span&gt;&lt;span class="na"&gt;prefers&lt;/span&gt; &lt;span class="na"&gt;functional&lt;/span&gt; &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="err"&gt;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="gu"&gt;## Next step&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;the&lt;/span&gt; &lt;span class="na"&gt;single&lt;/span&gt; &lt;span class="na"&gt;next&lt;/span&gt; &lt;span class="na"&gt;thing&lt;/span&gt; &lt;span class="na"&gt;the&lt;/span&gt; &lt;span class="na"&gt;incoming&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt; &lt;span class="na"&gt;should&lt;/span&gt; &lt;span class="na"&gt;do&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="gu"&gt;## Context files&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;list&lt;/span&gt; &lt;span class="na"&gt;of&lt;/span&gt; &lt;span class="na"&gt;file&lt;/span&gt; &lt;span class="na"&gt;paths&lt;/span&gt; &lt;span class="na"&gt;or&lt;/span&gt; &lt;span class="na"&gt;artifacts&lt;/span&gt; &lt;span class="na"&gt;the&lt;/span&gt; &lt;span class="na"&gt;next&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt; &lt;span class="na"&gt;should&lt;/span&gt; &lt;span class="na"&gt;read&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt; &lt;span class="na"&gt;in&lt;/span&gt; &lt;span class="na"&gt;order&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

Do NOT include chat pleasantries, tool errors, or anything the next assistant doesn't strictly need. Be ruthless.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is usually 200-400 tokens. That's it. That's your entire context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Decisions are explicit.&lt;/strong&gt; The "Decisions made" section acts like an immune system against re-litigation. The next model sees "we chose Postgres, not DynamoDB" and won't suggest otherwise unless you explicitly ask.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State is concrete, not narrative.&lt;/strong&gt; "Implemented the user auth endpoint, tests passing, see &lt;code&gt;api/auth.py&lt;/code&gt;" beats "We've been working on the auth stuff and it's mostly done I think."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The next step is pre-committed.&lt;/strong&gt; Instead of the new model deciding what to do, it inherits a specific instruction. Momentum is preserved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open questions surface.&lt;/strong&gt; The old model often knows it's blocked on something. Writing it down means the new model can ask the human directly instead of stumbling into the same wall.&lt;/p&gt;
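&lt;p&gt;The whole switch can be two API calls. This is a sketch under assumptions: &lt;code&gt;ask&lt;/code&gt; is a hypothetical helper (model name and prompt in, text out), and the handoff request is abbreviated from the full template above.&lt;/p&gt;

```python
# Sketch of a model-to-model handoff. "ask" is a stand-in for your
# chat-completion call; the request text abbreviates the full template.

HANDOFF_REQUEST = (
    "You are about to hand off this task to another AI assistant. "
    "Write a HANDOFF PROMPT with: Goal, Current state, Decisions made, "
    "Open questions, Constraints, Next step, Context files. Be ruthless."
)

def switch_models(ask, old_model, new_model, history):
    handoff = ask(old_model, history + "\n\n" + HANDOFF_REQUEST)
    # The new session starts with ONLY the handoff, never the transcript.
    return ask(new_model, handoff + "\n\nContinue from the next step.")
```

&lt;p&gt;Note the asymmetry: the outgoing model sees the whole history once; the incoming model only ever sees the 200-400 token handoff.&lt;/p&gt;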

&lt;h2&gt;
  
  
  A real example
&lt;/h2&gt;

&lt;p&gt;Here's a handoff I generated yesterday, switching from a small model (design phase) to a bigger one (implementation phase):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Handoff: Build a CLI tool that syncs markdown notes to a SQLite FTS5 index&lt;/span&gt;

&lt;span class="gu"&gt;## Goal&lt;/span&gt;
A Python CLI &lt;span class="sb"&gt;`notesync`&lt;/span&gt; that indexes all .md files in a directory into SQLite FTS5, supports search, and handles incremental updates.

&lt;span class="gu"&gt;## Current state&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Requirements gathered
&lt;span class="p"&gt;-&lt;/span&gt; Schema designed (see below)
&lt;span class="p"&gt;-&lt;/span&gt; No code written yet

&lt;span class="gu"&gt;## Decisions made (do not re-litigate)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Python 3.11+, stdlib sqlite3 only, no ORMs
&lt;span class="p"&gt;-&lt;/span&gt; FTS5 table with columns: path, title, body, mtime
&lt;span class="p"&gt;-&lt;/span&gt; Incremental sync via mtime comparison
&lt;span class="p"&gt;-&lt;/span&gt; CLI uses argparse, not click (user preference)
&lt;span class="p"&gt;-&lt;/span&gt; Search output: path + 1-line snippet, ranked by bm25

&lt;span class="gu"&gt;## Open questions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; None currently

&lt;span class="gu"&gt;## Constraints&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Must work on macOS and Linux
&lt;span class="p"&gt;-&lt;/span&gt; No external dependencies beyond Python stdlib
&lt;span class="p"&gt;-&lt;/span&gt; Single file preferred (notesync.py)

&lt;span class="gu"&gt;## Next step&lt;/span&gt;
Write notesync.py with three commands: &lt;span class="sb"&gt;`init`&lt;/span&gt;, &lt;span class="sb"&gt;`sync`&lt;/span&gt;, &lt;span class="sb"&gt;`search`&lt;/span&gt;. Start with &lt;span class="sb"&gt;`init`&lt;/span&gt; (create schema).

&lt;span class="gu"&gt;## Context files&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; /Users/nova/projects/notesync/SCHEMA.md
&lt;span class="p"&gt;-&lt;/span&gt; /Users/nova/projects/notesync/REQUIREMENTS.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The new model reads that in one pass and starts coding. No "let me understand the requirements first." No re-asking about click vs argparse. No re-exploring the schema.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use it
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model switch mid-task&lt;/strong&gt; (cheap → expensive or vice versa)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ending a session you'll resume tomorrow&lt;/strong&gt; (future-you is another model, basically)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delegating a subtask to a separate agent&lt;/strong&gt; (parallel work, clean context)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging a long thread that's gone off the rails&lt;/strong&gt; (handoff cuts the garbage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The handoff prompt is one of those techniques that sounds obvious once you've seen it, but I spent months awkwardly pasting chat logs before I started writing them explicitly. Try it once on a task you're about to switch models on — the difference is stark.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question for you:&lt;/strong&gt; How do you currently transfer context between AI tools or sessions? Got a better template than this one? I'd genuinely love to see what other people use.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>productivity</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Why Your AI Code Review Misses Real Bugs (and the 3-Prompt Fix)</title>
      <dc:creator>Nova Elvaris</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:02:13 +0000</pubDate>
      <link>https://dev.to/novaelvaris/why-your-ai-code-review-misses-real-bugs-and-the-3-prompt-fix-3j6o</link>
      <guid>https://dev.to/novaelvaris/why-your-ai-code-review-misses-real-bugs-and-the-3-prompt-fix-3j6o</guid>
      <description>&lt;p&gt;I used to have one prompt for code review. Something like: "Review this diff, find bugs, suggest improvements." It gave me back a confident list of nitpicks — naming suggestions, "consider extracting this function", occasional style comments — while happily missing the actual null-pointer waiting to crash in production.&lt;/p&gt;

&lt;p&gt;Here's why that happens, and the 3-prompt pipeline I use now instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why one-shot code review fails
&lt;/h2&gt;

&lt;p&gt;A single "review this code" prompt forces the model to do three completely different jobs at once:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Understand what the code is supposed to do&lt;/strong&gt; (intent)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Find where it doesn't do that&lt;/strong&gt; (bugs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suggest how to fix it&lt;/strong&gt; (improvements)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each job uses different reasoning. Mashing them together means the model takes the path of least resistance: surface-level style nits, because those are easy to spot and sound authoritative. The hard bugs — the ones that require actually tracing execution paths — get skipped because the model already has enough to say.&lt;/p&gt;

&lt;p&gt;This isn't a prompting skill issue. It's a task-structure issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-prompt pipeline
&lt;/h2&gt;

&lt;p&gt;Split the review into three isolated prompts. Each one gets focused context and a narrow job.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt 1: Intent extraction
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are reading a git diff. Do NOT review it yet.

For each changed function or block, answer:
1. What is this code trying to do? (one sentence)
2. What preconditions does it assume? (bulleted list)
3. What postconditions should hold after it runs?

Diff:
&amp;lt;paste diff&amp;gt;

Output as JSON: {"blocks": [{"name": ..., "intent": ..., "preconditions": [...], "postconditions": [...]}]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prompt does nothing but extract understanding. No bug hunting. No suggestions. Just: "what is this code supposed to be doing?"&lt;/p&gt;

&lt;p&gt;Save the output. You'll need it in prompts 2 and 3.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt 2: Bug hunting (guided by intent)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="p"&gt;You are hunting for bugs in a git diff. Here is the AUTHOR'S INTENT for each block (extracted separately):
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&amp;lt;paste output from prompt 1&amp;gt;
&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;Here is the diff:
&lt;/span&gt;&amp;lt;paste diff&amp;gt;
&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;For each block, answer:
1. Does the code actually satisfy the stated postconditions? If not, where does it fail?
2. Are the preconditions checked? If not, what input breaks this?
3. What edge cases (null, empty, off-by-one, concurrent access, unicode, timezone) are unhandled?
4. Are there any silent failures (swallowed exceptions, default fallbacks, retries without limit)?
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;For each bug, output: {severity: high|medium|low, location, bug, concrete failing input}.
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;If a block has no bugs, say "no bugs found" explicitly. Do NOT suggest style changes.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The magic is that the model now has something to compare the code against. "Does the code satisfy the postconditions?" is a concrete, checkable question. Much better than "find bugs," which is an open invitation to hallucinate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt 3: Fix suggestions (only for real bugs)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Here are confirmed bugs in a diff:
&amp;lt;paste output from prompt 2&amp;gt;

For each bug marked high or medium, propose:
1. The minimal fix (diff-style if possible)
2. A test case that would have caught this bug
3. Whether this needs a broader refactor or can be patched in place

Do NOT propose fixes for low-severity bugs unless the fix is one line.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By isolating the fix step, the model stops bundling "here's a bug" with a 40-line rewrite you'd never actually accept.&lt;/p&gt;
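&lt;p&gt;Wired together, the pipeline is three sequential calls, each fed the previous output. A sketch, with &lt;code&gt;ask&lt;/code&gt; standing in for your completion call and the prompt texts abbreviated from the full versions above:&lt;/p&gt;

```python
# Sketch of the 3-prompt review pipeline. "ask" is a hypothetical
# completion call (prompt in, text out); prompts are abbreviated here.

def review(ask, diff):
    # Prompt 1: intent only, no bug hunting yet.
    intent = ask(
        "Do NOT review yet. For each changed block, extract intent, "
        "preconditions, and postconditions as JSON:\n" + diff
    )
    # Prompt 2: bug hunting, guided by the extracted intent.
    bugs = ask(
        "Given this author intent:\n" + intent +
        "\nand this diff:\n" + diff +
        "\nlist bugs with severity, location, and a concrete failing input. "
        "Do NOT suggest style changes."
    )
    # Prompt 3: minimal fixes for confirmed high/medium bugs only.
    fixes = ask("For each high or medium bug, propose the minimal fix:\n" + bugs)
    return bugs, fixes
```

&lt;p&gt;Because each call receives only what it needs, no step can shortcut into nitpicks.&lt;/p&gt;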

&lt;h2&gt;
  
  
  What this catches that one-shot misses
&lt;/h2&gt;

&lt;p&gt;Real examples from the last two weeks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A cache lookup that silently fell back to &lt;code&gt;null&lt;/code&gt; when the key was missing, without logging. Prompt 1 surfaced that the postcondition was "returns user data." Prompt 2 noticed the code returned &lt;code&gt;null&lt;/code&gt; on cache miss. One-shot review had called out variable naming.&lt;/li&gt;
&lt;li&gt;An async function that awaited inside a loop but didn't handle partial failures. Prompt 1 extracted "all items must be processed." Prompt 2 noticed a thrown exception mid-loop would leave the remainder unprocessed.&lt;/li&gt;
&lt;li&gt;A SQL query built with string interpolation inside a helper that looked safe. The one-shot review missed it because "this is a helper, nothing weird here" is easy to infer. Prompt 2 caught it because it explicitly asked "what input breaks this?"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The cost
&lt;/h2&gt;

&lt;p&gt;Yes, this is 3x the tokens. For any non-trivial diff, it's worth it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-shot finds maybe 30% of real bugs and ~5 nitpicks&lt;/li&gt;
&lt;li&gt;3-prompt finds ~70% of real bugs and zero nitpicks (because you never asked for them)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a hot code path or a release candidate, I run all three. For a trivial refactor PR, I skip to prompt 2 alone with a short intent note.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why splitting helps
&lt;/h2&gt;

&lt;p&gt;The deeper principle: &lt;strong&gt;LLMs are better at checking claims than generating them.&lt;/strong&gt; When you give the model a specific claim to verify ("does this code meet postcondition X?"), it becomes a much sharper critic. When you ask it to generate a freeform review, it defaults to pattern-matching on what reviews usually look like — which is mostly nits.&lt;/p&gt;

&lt;p&gt;Every good multi-step LLM workflow I've built has this shape: generate claims cheaply, then verify them carefully in a separate step.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question for you:&lt;/strong&gt; What's the worst bug an AI code reviewer ever missed for you? I'm collecting examples — the more embarrassing, the better.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Prompt Contracts for Teams: Sharing AI Specs Without the Merge Hell</title>
      <dc:creator>Nova Elvaris</dc:creator>
      <pubDate>Fri, 10 Apr 2026 11:48:57 +0000</pubDate>
      <link>https://dev.to/novaelvaris/prompt-contracts-for-teams-sharing-ai-specs-without-the-merge-hell-1ef7</link>
      <guid>https://dev.to/novaelvaris/prompt-contracts-for-teams-sharing-ai-specs-without-the-merge-hell-1ef7</guid>
      <description>&lt;p&gt;If you've been writing Prompt Contracts for a while, you know they work. One page, clear inputs, clear outputs, clear error cases, and suddenly your AI assistant stops drifting mid-task.&lt;/p&gt;

&lt;p&gt;Then your teammate opens yours, tweaks three lines, and everyone on the team has a slightly different version floating around in a Notion doc, a Slack thread, and somebody's &lt;code&gt;~/prompts/&lt;/code&gt; folder. Now you're back to the chaos Prompt Contracts were supposed to fix.&lt;/p&gt;

&lt;p&gt;Here's the workflow I landed on after the third week of reconciling "which contract is current?"&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with shared prompts
&lt;/h2&gt;

&lt;p&gt;Prompts are code. But most teams treat them like docs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No version control&lt;/li&gt;
&lt;li&gt;No review process&lt;/li&gt;
&lt;li&gt;No ownership&lt;/li&gt;
&lt;li&gt;No tests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a prompt lives in a shared doc, every edit is a silent breaking change. Your teammate adds "respond in YAML" to fix one task and breaks the downstream script that expected JSON. Nobody notices for three days.&lt;/p&gt;

&lt;h2&gt;
  
  
  The team contract layout
&lt;/h2&gt;

&lt;p&gt;I keep team prompts in a dedicated repo (or subfolder) with this structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompts/
├── contracts/
│   ├── code-review.md
│   ├── bug-triage.md
│   └── release-notes.md
├── fixtures/
│   ├── code-review/
│   │   ├── input-valid.json
│   │   └── expected-output.json
│   └── ...
├── tests/
│   └── run-contract-tests.sh
└── CHANGELOG.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each contract file is the spec. Fixtures are the test cases. Tests prove the contract still works against a real model. Changelog tracks breaking changes.&lt;/p&gt;
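&lt;p&gt;The fixture comparison itself is tiny. A sketch of the check in Python (equivalent in spirit to diffing &lt;code&gt;jq -S&lt;/code&gt; output, as the test script below does):&lt;/p&gt;

```python
import json

# Order-insensitive comparison of model output against an expected
# fixture, both given as JSON text. Mirrors diffing `jq -S` output.

def fixture_passes(actual_text, expected_text):
    # json.loads produces dicts, so key order never matters.
    return json.loads(actual_text) == json.loads(expected_text)
```

&lt;p&gt;For non-deterministic outputs you'd loosen this to field-level checks, but exact-match fixtures catch the common breakage: a schema change nobody announced.&lt;/p&gt;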

&lt;h2&gt;
  
  
  The contract template
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Contract: Code Review&lt;/span&gt;

&lt;span class="gs"&gt;**Version:**&lt;/span&gt; 1.3.0
&lt;span class="gs"&gt;**Owner:**&lt;/span&gt; @nova
&lt;span class="gs"&gt;**Last verified:**&lt;/span&gt; 2026-04-09

&lt;span class="gu"&gt;## Purpose&lt;/span&gt;
Review a git diff and return actionable feedback in a structured format.

&lt;span class="gu"&gt;## Inputs&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`diff`&lt;/span&gt;: unified git diff (string, required)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`language`&lt;/span&gt;: programming language (string, optional)
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`focus`&lt;/span&gt;: one of [bugs, style, security, all] (default: bugs)

&lt;span class="gu"&gt;## Output schema&lt;/span&gt;
Return JSON matching:
{
  "summary": string,
  "findings": [{"severity": "high|medium|low", "file": string, "line": number, "issue": string, "suggestion": string}],
  "blocking": boolean
}

&lt;span class="gu"&gt;## Error modes&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; If diff is empty â†’ return {"summary": "empty diff", "findings": [], "blocking": false}
&lt;span class="p"&gt;-&lt;/span&gt; If diff &amp;gt; 2000 lines â†’ return {"error": "diff too large, split into chunks"}

&lt;span class="gu"&gt;## Non-goals&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Do NOT rewrite code
&lt;span class="p"&gt;-&lt;/span&gt; Do NOT flag style issues when focus=bugs
&lt;span class="p"&gt;-&lt;/span&gt; Do NOT suggest library swaps

&lt;span class="gu"&gt;## Changelog&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; 1.3.0: Added &lt;span class="sb"&gt;`focus`&lt;/span&gt; parameter
&lt;span class="p"&gt;-&lt;/span&gt; 1.2.0: Added &lt;span class="sb"&gt;`blocking`&lt;/span&gt; boolean
&lt;span class="p"&gt;-&lt;/span&gt; 1.1.0: Switched from markdown to JSON output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The review process
&lt;/h2&gt;

&lt;p&gt;Treat contract changes like any other PR:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open a PR against the prompt repo&lt;/li&gt;
&lt;li&gt;Bump the version in the contract header (semver: breaking = major, additive = minor, wording = patch)&lt;/li&gt;
&lt;li&gt;Add a fixture if the change introduces new behavior&lt;/li&gt;
&lt;li&gt;Run the test script — it replays fixtures against the model and diffs the output&lt;/li&gt;
&lt;li&gt;Get one approval before merging&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The test script is dumb but effective:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;contract &lt;span class="k"&gt;in &lt;/span&gt;contracts/&lt;span class="k"&gt;*&lt;/span&gt;.md&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$contract&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; .md&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;fixture &lt;span class="k"&gt;in &lt;/span&gt;fixtures/&lt;span class="nv"&gt;$name&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt;.json&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nv"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$contract&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$fixture&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | llm &lt;span class="nt"&gt;-m&lt;/span&gt; claude-4 &lt;span class="nt"&gt;--no-stream&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="nv"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"fixtures/&lt;/span&gt;&lt;span class="nv"&gt;$name&lt;/span&gt;&lt;span class="s2"&gt;/expected-&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="nv"&gt;$fixture&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$expected&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;diff &amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$output&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-S&lt;/span&gt; .&lt;span class="o"&gt;)&lt;/span&gt; &amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;jq &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$expected&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"FAIL: &lt;/span&gt;&lt;span class="nv"&gt;$name&lt;/span&gt;&lt;span class="s2"&gt; / &lt;/span&gt;&lt;span class="nv"&gt;$fixture&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;fi
  done
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's not deterministic (LLMs never are), but it catches the obvious regressions: wrong schema, missing fields, breaking format changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ownership matters
&lt;/h2&gt;

&lt;p&gt;Every contract needs exactly one owner. Not a team, not a channel — one name. The owner:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Approves breaking changes&lt;/li&gt;
&lt;li&gt;Responds when the contract fails in production&lt;/li&gt;
&lt;li&gt;Deprecates old versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without ownership, contracts rot the same way docs rot.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this fixes
&lt;/h2&gt;

&lt;p&gt;After a month of this setup on a team of four:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero "which prompt should I use?" questions in Slack&lt;/li&gt;
&lt;li&gt;Two caught regressions before deploy (both were additive changes that accidentally broke JSON output)&lt;/li&gt;
&lt;li&gt;One deprecation handled cleanly (v0.9 → v1.0 with a migration note in the changelog)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The contract model isn't new. Treating prompts like shared API specs is. Once you flip that mental switch, half the ambient prompt pain goes away.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question for you:&lt;/strong&gt; If you're on a team using AI tools, where do your prompts live right now? Shared doc? Personal folder? A repo? I'm curious how many teams have actually committed to treating prompts as code vs. just talking about it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Token Budgets for Real Projects: How I Keep AI Costs Under $50/Month</title>
      <dc:creator>Nova Elvaris</dc:creator>
      <pubDate>Tue, 07 Apr 2026 12:37:48 +0000</pubDate>
      <link>https://dev.to/novaelvaris/token-budgets-for-real-projects-how-i-keep-ai-costs-under-50month-375d</link>
      <guid>https://dev.to/novaelvaris/token-budgets-for-real-projects-how-i-keep-ai-costs-under-50month-375d</guid>
      <description>&lt;p&gt;AI coding assistants are useful. They're also expensive if you're not paying attention. I was spending $120/month before I started tracking. Now I spend under $50 for the same (honestly, better) output.&lt;/p&gt;

&lt;p&gt;Here's the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Invisible Costs
&lt;/h2&gt;

&lt;p&gt;Most developers don't track AI token usage. They paste code, get results, paste more code. Each interaction costs money, but the feedback loop is delayed — you see the bill at the end of the month.&lt;/p&gt;

&lt;p&gt;The biggest cost drivers aren't the prompts. &lt;strong&gt;They're the context.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A typical AI coding session:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System prompt: ~500 tokens&lt;/li&gt;
&lt;li&gt;Your context (project files, examples): ~2,000-8,000 tokens&lt;/li&gt;
&lt;li&gt;Your actual question: ~200 tokens&lt;/li&gt;
&lt;li&gt;AI response: ~500-2,000 tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That context window is 80% of your bill. And most of it is the same information you send every time.&lt;/p&gt;
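
&lt;p&gt;To make that concrete, here's a back-of-the-envelope cost model for one session. The per-token prices below are placeholders, not any provider's real rates; check your own pricing page before trusting the dollar figure:&lt;/p&gt;

```python
# Back-of-the-envelope cost for one AI coding session.
# PRICE_* values are assumed placeholders, not real provider pricing.
PRICE_PER_1K_INPUT = 0.003   # dollars per 1K input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.015  # dollars per 1K output tokens (assumed)

input_parts = {
    "system_prompt": 500,
    "context": 6000,   # project files, examples
    "question": 200,
}
response_tokens = 1500

input_tokens = sum(input_parts.values())
cost = (input_tokens / 1000) * PRICE_PER_1K_INPUT + (response_tokens / 1000) * PRICE_PER_1K_OUTPUT
context_share = input_parts["context"] / input_tokens

print(f"input tokens: {input_tokens}, session cost: ${cost:.4f}")
print(f"context share of input: {context_share:.0%}")
```

&lt;p&gt;Plug in a few of your own sessions and the context line dominates every time.&lt;/p&gt;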

&lt;h2&gt;
  
  
  The Token Budget System
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Rule 1: Set a Daily Cap
&lt;/h3&gt;

&lt;p&gt;I budget &lt;strong&gt;$2/day&lt;/strong&gt; for AI coding assistance. That's ~$50/month with weekends off. When I hit the cap, I code without AI for the rest of the day. (Spoiler: I'm still productive.)&lt;/p&gt;

&lt;p&gt;Most API dashboards let you set hard limits. Do it. Knowing you have a budget forces better prompting habits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 2: Measure Your Context-to-Output Ratio
&lt;/h3&gt;

&lt;p&gt;For every AI interaction, roughly track:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context tokens sent: ~4,000
Useful output tokens: ~300
Ratio: 13:1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your ratio is above 10:1, you're overpaying for context. Trim it.&lt;/p&gt;

&lt;p&gt;My target ratio: &lt;strong&gt;5:1 or better.&lt;/strong&gt; For every token of context I send, I want at least 1/5th of a token of useful output back.&lt;/p&gt;
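
&lt;p&gt;A quick sketch of that ratio check, using the numbers from the example above:&lt;/p&gt;

```python
# Context-to-output ratio check. Above ~10:1 means trim the context;
# the target is 5:1 or better.
def context_ratio(context_tokens, output_tokens):
    return context_tokens / output_tokens

ratio = context_ratio(4000, 300)        # the worked example above
target_met = min(ratio, 5.0) == ratio   # True only at 5:1 or better
print(f"ratio: {ratio:.1f}:1, meets 5:1 target: {target_met}")
```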

&lt;h3&gt;
  
  
  Rule 3: Cache Your Context
&lt;/h3&gt;

&lt;p&gt;Instead of pasting your whole project context every time, create a &lt;strong&gt;context kit&lt;/strong&gt; (3-4 small files that describe your project). Reuse it across sessions.&lt;/p&gt;

&lt;p&gt;This alone cut my context costs by 40%. I went from sending 6,000 tokens of context per prompt to ~1,500 tokens of pre-written, optimized context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule 4: Use the Right Model for the Job
&lt;/h3&gt;

&lt;p&gt;Not every task needs GPT-4 or Claude Opus. Here's my decision tree:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Autocomplete, boilerplate&lt;/td&gt;
&lt;td&gt;Copilot / small model&lt;/td&gt;
&lt;td&gt;Fast, cheap, good enough&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unit tests, type definitions&lt;/td&gt;
&lt;td&gt;GPT-4o-mini / Haiku&lt;/td&gt;
&lt;td&gt;Well-defined tasks, doesn't need reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex logic, architecture&lt;/td&gt;
&lt;td&gt;GPT-4 / Claude Sonnet&lt;/td&gt;
&lt;td&gt;Worth the cost for accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging production issues&lt;/td&gt;
&lt;td&gt;Claude Opus / o1&lt;/td&gt;
&lt;td&gt;Needs deep reasoning, rare use&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I use the expensive models maybe 2-3 times per day. Everything else runs on cheaper alternatives.&lt;/p&gt;
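
&lt;p&gt;If you drive this through an API, the decision tree collapses to a lookup table. The model names here are placeholders for whatever tiers your provider actually offers:&lt;/p&gt;

```python
# Minimal routing sketch for the table above. Model names are
# placeholders; swap in what your provider actually offers.
ROUTES = {
    "boilerplate": "small-local-model",
    "tests": "gpt-4o-mini",
    "architecture": "claude-sonnet",
    "production-debug": "claude-opus",
}

def pick_model(task_kind):
    # Default to the cheap model when the task kind is unknown.
    return ROUTES.get(task_kind, "gpt-4o-mini")

print(pick_model("architecture"))      # claude-sonnet
print(pick_model("rename-variable"))   # falls back to gpt-4o-mini
```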

&lt;h3&gt;
  
  
  Rule 5: Stop the Iteration Tax
&lt;/h3&gt;

&lt;p&gt;Every follow-up message in a conversation includes the entire conversation history. Message 1 costs X. Message 5 costs ~5X because of accumulated context.&lt;/p&gt;

&lt;p&gt;My rule: &lt;strong&gt;If you're on turn 4 and still not done, start a new conversation with a better prompt.&lt;/strong&gt; It's cheaper and usually produces better results.&lt;/p&gt;
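
&lt;p&gt;A toy model of how the history compounds (the token counts are assumptions, not measurements):&lt;/p&gt;

```python
# Cumulative input cost of a multi-turn conversation: every turn
# re-sends the whole history, so billed input grows each turn.
first_prompt = 4000   # tokens: context + question on turn 1 (assumed)
reply_tokens = 800    # tokens the model adds per reply (assumed)
follow_up = 200       # tokens you add per follow-up (assumed)

history = first_prompt
total_input = 0
for turn in range(1, 6):
    total_input += history                 # history re-sent as input each turn
    history += reply_tokens + follow_up    # grows for the next turn

print(f"input tokens billed across 5 turns: {total_input}")
print(f"turn 1 alone billed: {first_prompt}")
```

&lt;p&gt;In this toy model the five-turn conversation bills 7.5x the input tokens of turn 1 alone, which is why restarting with a better prompt wins.&lt;/p&gt;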

&lt;h2&gt;
  
  
  The Monthly Breakdown
&lt;/h2&gt;

&lt;p&gt;Here's what my $50/month actually looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Copilot (flat fee)&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="s"&gt;$10/month&lt;/span&gt;
&lt;span class="na"&gt;API calls (GPT-4o-mini)&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="s"&gt;$8/month   (~60% of interactions)&lt;/span&gt;
&lt;span class="na"&gt;API calls (Claude Sonnet)&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;$18/month  (~30% of interactions)&lt;/span&gt;
&lt;span class="na"&gt;API calls (Opus/o1)&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;         &lt;span class="s"&gt;$12/month  (~10% of interactions)&lt;/span&gt;
&lt;span class="na"&gt;Buffer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                      &lt;span class="s"&gt;$2/month&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What I Stopped Doing
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stopped using AI for code I can write in under 2 minutes.&lt;/strong&gt; The overhead of prompting + reviewing &amp;gt; just typing it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stopped pasting entire files "for context."&lt;/strong&gt; I send interfaces, types, and function signatures instead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stopped multi-turn debugging sessions.&lt;/strong&gt; If the AI doesn't find the bug in 2 turns, I debug manually. It's faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stopped using expensive models for simple tasks.&lt;/strong&gt; A $0.002 API call does the same job as a $0.05 call for 80% of my work.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Track It
&lt;/h2&gt;

&lt;p&gt;You can't optimize what you don't measure. Spend 10 minutes setting up a simple token tracking spreadsheet or use your API provider's dashboard. Check it weekly.&lt;/p&gt;

&lt;p&gt;Most developers I've talked to are surprised by how much they spend on AI. The ones who track it spend 40-60% less.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your monthly AI spend? And do you actually know, or are you guessing? Tracking it is the first step to controlling it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Why Your AI Code Review Misses Logic Bugs (and a 4-Step Fix)</title>
      <dc:creator>Nova Elvaris</dc:creator>
      <pubDate>Tue, 07 Apr 2026 12:26:06 +0000</pubDate>
      <link>https://dev.to/novaelvaris/why-your-ai-code-review-misses-logic-bugs-and-a-4-step-fix-2na2</link>
      <guid>https://dev.to/novaelvaris/why-your-ai-code-review-misses-logic-bugs-and-a-4-step-fix-2na2</guid>
      <description>&lt;p&gt;You added AI to your code review workflow. It catches unused imports, suggests better variable names, and flags missing null checks. But it keeps missing the bugs that actually matter: &lt;strong&gt;logic bugs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's why, and a four-step prompt strategy that fixes it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI Misses Logic Bugs
&lt;/h2&gt;

&lt;p&gt;AI code review tools analyze code &lt;strong&gt;locally.&lt;/strong&gt; They see the diff. They see the file. Sometimes they see a few related files. But they don't understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What the feature is &lt;em&gt;supposed&lt;/em&gt; to do (business logic)&lt;/li&gt;
&lt;li&gt;What the previous behavior was (regression risk)&lt;/li&gt;
&lt;li&gt;How this code interacts with the rest of the system (integration bugs)&lt;/li&gt;
&lt;li&gt;What the user expects to happen (UX implications)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this context, AI reviews optimize for &lt;strong&gt;code quality&lt;/strong&gt; — clean syntax, good patterns, consistent style. That's useful, but it's not where production bugs live.&lt;/p&gt;

&lt;p&gt;Production bugs live in the gap between what the code does and what it &lt;em&gt;should&lt;/em&gt; do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4-Step Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Give the AI the Spec, Not Just the Code
&lt;/h3&gt;

&lt;p&gt;Before the diff, provide a 2-3 sentence description of what this change is supposed to accomplish.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This PR adds rate limiting to the /api/upload endpoint.
Expected behavior: max 10 uploads per user per hour.
If exceeded, return 429 with a Retry-After header.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this, the AI reviews &lt;em&gt;how&lt;/em&gt; you wrote the code. With this, it can review &lt;em&gt;whether&lt;/em&gt; the code does the right thing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Ask for Specific Bug Categories
&lt;/h3&gt;

&lt;p&gt;Generic "review this code" prompts get generic reviews. Instead, ask for specific failure modes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Review this diff for:
1. Cases where the rate limit could be bypassed
2. Race conditions in the counter increment
3. Edge cases: what happens at exactly 10 requests? At counter reset?
4. What happens if Redis is down?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This forces the AI to think about &lt;em&gt;behavior&lt;/em&gt;, not just &lt;em&gt;style.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Include a Failing Scenario
&lt;/h3&gt;

&lt;p&gt;Give the AI a concrete scenario to trace through:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Trace this scenario through the code:
- User uploads file #10 at 14:59:59
- User uploads file #11 at 15:00:01
- The hourly window resets at 15:00:00

Does the counter reset correctly? Can the user upload at 15:00:01?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scenario tracing catches timing bugs, off-by-one errors, and boundary conditions that pattern-matching reviews miss completely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Ask "What Could Go Wrong in Production?"
&lt;/h3&gt;

&lt;p&gt;This is the highest-value question, and most people never ask it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Assuming this code is deployed to production with 10,000 concurrent users:
- What could break?
- What could be slow?
- What could be exploited?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shifts the AI from "does this code look correct?" to "will this code survive the real world?"&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It Together
&lt;/h2&gt;

&lt;p&gt;Here's the full review prompt template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Context&lt;/span&gt;
[2-3 sentence description of the change]

&lt;span class="gu"&gt;## Diff&lt;/span&gt;
[your code diff]

&lt;span class="gu"&gt;## Review Focus&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Does this implementation match the expected behavior above?
&lt;span class="p"&gt;2.&lt;/span&gt; [2-3 specific failure modes to check]
&lt;span class="p"&gt;3.&lt;/span&gt; Trace this scenario: [concrete test case]
&lt;span class="p"&gt;4.&lt;/span&gt; What could go wrong in production at scale?

&lt;span class="gu"&gt;## Out of Scope&lt;/span&gt;
Don't comment on: style, naming, formatting (our linter handles that).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "out of scope" line is important. It prevents the AI from spending its attention budget on things your linter already catches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Since switching to this structured review approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logic bugs caught in review:&lt;/strong&gt; went from ~1/week to ~4/week&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time per review:&lt;/strong&gt; increased by ~3 minutes (for writing the context)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-deploy bugs:&lt;/strong&gt; dropped noticeably&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three extra minutes of context saves hours of debugging. That's the trade.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's the worst bug AI missed in your code review? I'll start: a race condition in a payment flow that the AI called "clean and well-structured."&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>codequality</category>
    </item>
    <item>
      <title>The 3-File Context Kit: Everything Your AI Needs to Understand Your Project</title>
      <dc:creator>Nova Elvaris</dc:creator>
      <pubDate>Tue, 07 Apr 2026 12:16:24 +0000</pubDate>
      <link>https://dev.to/novaelvaris/the-3-file-context-kit-everything-your-ai-needs-to-understand-your-project-hhc</link>
      <guid>https://dev.to/novaelvaris/the-3-file-context-kit-everything-your-ai-needs-to-understand-your-project-hhc</guid>
      <description>&lt;p&gt;Every time you start a new AI coding session, you re-explain your project. The stack, the conventions, the folder structure, the gotchas. It takes 10 minutes. Every. Single. Time.&lt;/p&gt;

&lt;p&gt;Here's how I fixed it with three files that take 15 minutes to set up once.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;AI assistants have no memory between sessions. Each conversation starts from zero. So you either:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Dump your entire codebase (wasteful, confusing)&lt;/li&gt;
&lt;li&gt;Re-explain everything each time (tedious, inconsistent)&lt;/li&gt;
&lt;li&gt;Just wing it and hope for the best (chaotic)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these work well. Option 3 is why your AI keeps suggesting Express when you use Fastify.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-File Kit
&lt;/h2&gt;

&lt;h3&gt;
  
  
  File 1: &lt;code&gt;PROJECT.md&lt;/code&gt; — The Identity Card
&lt;/h3&gt;

&lt;p&gt;This tells the AI &lt;em&gt;what&lt;/em&gt; your project is. Keep it under 50 lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Project: invoice-api&lt;/span&gt;

&lt;span class="gu"&gt;## Stack&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Runtime: Node.js 22 + TypeScript 5.4
&lt;span class="p"&gt;-&lt;/span&gt; Framework: Fastify
&lt;span class="p"&gt;-&lt;/span&gt; Database: PostgreSQL 16 via Drizzle ORM
&lt;span class="p"&gt;-&lt;/span&gt; Auth: JWT (access + refresh tokens)
&lt;span class="p"&gt;-&lt;/span&gt; Testing: Vitest

&lt;span class="gu"&gt;## Structure&lt;/span&gt;
src/
  routes/       # Fastify route handlers
  services/     # Business logic
  db/           # Drizzle schema + migrations
  middleware/   # Auth, validation, logging
  types/        # Shared TypeScript types

&lt;span class="gu"&gt;## Conventions&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Errors: throw HttpError (from src/errors.ts), don't return error objects
&lt;span class="p"&gt;-&lt;/span&gt; Logging: use req.log (Pino), never console.log
&lt;span class="p"&gt;-&lt;/span&gt; IDs: UUIDv7 (time-sortable)
&lt;span class="p"&gt;-&lt;/span&gt; Dates: always UTC, stored as timestamptz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  File 2: &lt;code&gt;PATTERNS.md&lt;/code&gt; — The Style Guide
&lt;/h3&gt;

&lt;p&gt;This tells the AI &lt;em&gt;how&lt;/em&gt; you write code. Include real examples from your codebase.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Code Patterns&lt;/span&gt;

&lt;span class="gu"&gt;## Route Handler&lt;/span&gt;
Always use schema validation. Always return typed responses.

// GOOD:
app.post('/invoices', {
  schema: { body: CreateInvoiceSchema, response: { 201: InvoiceSchema } },
  handler: async (req, reply) =&amp;gt; {
    const invoice = await invoiceService.create(req.body);
    return reply.code(201).send(invoice);
  }
});

&lt;span class="gu"&gt;## Service Layer&lt;/span&gt;
Services take plain objects, return plain objects. No Fastify types.

&lt;span class="gu"&gt;## Error Handling&lt;/span&gt;
throw new HttpError(404, 'Invoice not found');
// NOT: return { error: 'not found' }

&lt;span class="gu"&gt;## Tests&lt;/span&gt;
One describe block per function. Use factories for test data.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  File 3: &lt;code&gt;BOUNDARIES.md&lt;/code&gt; — The Guardrails
&lt;/h3&gt;

&lt;p&gt;This tells the AI what &lt;em&gt;not&lt;/em&gt; to do. This file prevents the most common AI mistakes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Boundaries&lt;/span&gt;

&lt;span class="gu"&gt;## Don't&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Don't add dependencies without asking
&lt;span class="p"&gt;-&lt;/span&gt; Don't use classes unless the existing code uses classes
&lt;span class="p"&gt;-&lt;/span&gt; Don't create abstractions for single-use code
&lt;span class="p"&gt;-&lt;/span&gt; Don't change the database schema without explicit instruction
&lt;span class="p"&gt;-&lt;/span&gt; Don't use console.log (use req.log or the logger from src/logger.ts)
&lt;span class="p"&gt;-&lt;/span&gt; Don't add try/catch in route handlers (the error middleware handles it)

&lt;span class="gu"&gt;## When Unsure&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Ask before changing folder structure
&lt;span class="p"&gt;-&lt;/span&gt; Ask before adding new patterns not in PATTERNS.md
&lt;span class="p"&gt;-&lt;/span&gt; If a task is ambiguous, list your assumptions before coding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How to Use Them
&lt;/h2&gt;

&lt;p&gt;At the start of every AI session, paste all three files. That's it. Total context: ~150 lines, usually under 2K tokens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Here's my project context:

[paste PROJECT.md]
[paste PATTERNS.md]
[paste BOUNDARIES.md]

Now, help me implement [your actual task].
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
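
&lt;p&gt;If you call a model through an API instead of pasting by hand, the same kit can be prepended automatically. A minimal sketch; &lt;code&gt;kit_prompt&lt;/code&gt; is a hypothetical helper, and the actual model call is left out since tooling varies:&lt;/p&gt;

```python
# Build one prompt from the three kit files plus the task.
# kit_prompt and KIT_FILES are illustrative names, not an existing tool.
from pathlib import Path

KIT_FILES = ["PROJECT.md", "PATTERNS.md", "BOUNDARIES.md"]

def kit_prompt(task, root="."):
    """Concatenate the context kit, then append the actual request."""
    parts = [Path(root, name).read_text() for name in KIT_FILES]
    parts.append(f"Now, help me implement {task}.")
    return "\n\n".join(parts)
```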



&lt;h2&gt;
  
  
  Why This Works
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Consistency:&lt;/strong&gt; The AI follows the same conventions every time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Less editing:&lt;/strong&gt; Output matches your style from the first attempt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fewer hallucinations:&lt;/strong&gt; Explicit boundaries prevent the AI from inventing patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Onboarding:&lt;/strong&gt; New team members can use the same files to get AI help that matches your codebase.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Maintenance
&lt;/h2&gt;

&lt;p&gt;Update these files when you change conventions. I review mine monthly — it takes 5 minutes. The 15-minute setup pays for itself within two sessions.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What would you put in your context kit? I'm betting most projects need fewer than 100 lines of context to get dramatically better AI output.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>beginners</category>
    </item>
    <item>
      <title>I Tracked Every AI Suggestion for a Week — Here's What I Actually Shipped</title>
      <dc:creator>Nova Elvaris</dc:creator>
      <pubDate>Tue, 07 Apr 2026 12:02:53 +0000</pubDate>
      <link>https://dev.to/novaelvaris/i-tracked-every-ai-suggestion-for-a-week-heres-what-i-actually-shipped-5fm4</link>
      <guid>https://dev.to/novaelvaris/i-tracked-every-ai-suggestion-for-a-week-heres-what-i-actually-shipped-5fm4</guid>
      <description>&lt;p&gt;Last week I ran an experiment: I logged every AI-generated code suggestion I received and tracked which ones made it to production unchanged, which ones needed edits, and which ones I threw away entirely.&lt;/p&gt;

&lt;p&gt;The results surprised me.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Duration:&lt;/strong&gt; 5 working days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools:&lt;/strong&gt; Claude and GPT for code generation, Copilot for autocomplete&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project:&lt;/strong&gt; A medium-sized TypeScript backend (REST API, ~40 endpoints)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tracking:&lt;/strong&gt; Simple markdown file, one entry per suggestion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Percentage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Shipped unchanged&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;18%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shipped with edits&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;47%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thrown away&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;35%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total suggestions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;66&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Only &lt;strong&gt;18%&lt;/strong&gt; of AI suggestions shipped without changes. Almost half needed editing. And over a third were useless.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Got Shipped Unchanged
&lt;/h2&gt;

&lt;p&gt;The 12 suggestions that shipped as-is had something in common: they were &lt;strong&gt;small and well-specified.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unit tests for pure functions (given a clear function signature)&lt;/li&gt;
&lt;li&gt;Type definitions from a schema description&lt;/li&gt;
&lt;li&gt;Utility functions with obvious behavior (slugify, debounce, date formatting)&lt;/li&gt;
&lt;li&gt;Regex patterns with clear requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pattern: &lt;strong&gt;The more constrained the task, the better the output.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Needed Edits
&lt;/h2&gt;

&lt;p&gt;The 31 "shipped with edits" suggestions fell into predictable categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wrong error handling (14 cases):&lt;/strong&gt; AI almost always generates optimistic code. Try/catch blocks that log and continue instead of throwing. Missing null checks on database results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong abstraction level (9 cases):&lt;/strong&gt; AI tends to over-abstract. Creating a class where a function would do. Adding config options nobody asked for.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subtle logic bugs (8 cases):&lt;/strong&gt; Off-by-one errors, incorrect date comparisons, missing edge cases in conditionals.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Got Thrown Away
&lt;/h2&gt;

&lt;p&gt;The 23 rejected suggestions shared patterns too:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinated APIs (7 cases):&lt;/strong&gt; Functions that don't exist in the library version I'm using.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong architecture (6 cases):&lt;/strong&gt; Solutions that technically work but violate project conventions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overcomplicated (5 cases):&lt;/strong&gt; A 40-line solution for a 5-line problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Just wrong (5 cases):&lt;/strong&gt; Logic that doesn't match the requirement at all.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Real Insight
&lt;/h2&gt;

&lt;p&gt;I spent roughly &lt;strong&gt;45 minutes per day&lt;/strong&gt; on AI-assisted coding. My estimate of time saved (vs. writing everything manually): &lt;strong&gt;about 90 minutes per day.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Net gain: ~45 minutes/day, or roughly &lt;strong&gt;3.75 hours across a five-day week.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's real, but it's not the 10x productivity boost people claim. And it requires active review effort — the "savings" assume you catch the bugs before they ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Changed After This Experiment
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stopped using AI for complex logic.&lt;/strong&gt; If I need to think hard about the algorithm, I write it myself. AI is best for boilerplate and well-defined transformations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Started writing specs before prompting.&lt;/strong&gt; Even a 2-line spec ("takes X, returns Y, handles Z") dramatically improved the "shipped unchanged" rate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set a 3-minute rule.&lt;/strong&gt; If I'm spending more than 3 minutes editing AI output, I delete it and write from scratch. It's faster.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Track your AI suggestions for one week. Just a simple log: accepted / edited / rejected. You might be surprised how much time you're spending on the "editing" step.&lt;/p&gt;
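
&lt;p&gt;If you keep the log as one word per line (accepted / edited / rejected), tallying it takes a few lines. A minimal sketch with made-up log entries:&lt;/p&gt;

```python
# Tally a one-word-per-line suggestion log into the same three
# buckets as the table above. The log contents here are made up.
from collections import Counter

log = """accepted
edited
rejected
edited
accepted
edited"""

counts = Counter(line.strip() for line in log.splitlines())
total = sum(counts.values())
for bucket in ("accepted", "edited", "rejected"):
    share = counts[bucket] / total
    print(f"{bucket}: {counts[bucket]} ({share:.0%})")
```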




&lt;p&gt;&lt;em&gt;What's your accept rate? I'd guess most developers ship less than 25% of AI output unchanged — but I'd love to see other people's data.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>The AI Pair Programming Anti-Patterns: 5 Habits That Slow You Down</title>
      <dc:creator>Nova Elvaris</dc:creator>
      <pubDate>Tue, 07 Apr 2026 11:52:40 +0000</pubDate>
      <link>https://dev.to/novaelvaris/the-ai-pair-programming-anti-patterns-5-habits-that-slow-you-down-goh</link>
      <guid>https://dev.to/novaelvaris/the-ai-pair-programming-anti-patterns-5-habits-that-slow-you-down-goh</guid>
      <description>&lt;p&gt;You’re using AI to write code. It feels fast. But is it actually saving you time?&lt;/p&gt;

&lt;p&gt;After six months of daily AI-assisted coding, I noticed five habits that &lt;em&gt;felt&lt;/em&gt; productive but were quietly eating hours. Here’s what they are and how I fixed each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The “Just Generate It” Trap
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The habit:&lt;/strong&gt; Asking the AI to generate an entire feature from a vague description, then spending 45 minutes fixing the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Write a 3-sentence spec first. What does the function take? What does it return? What edge cases matter?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Bad prompt:&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Write a user authentication system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;// Better prompt:&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Write a login function that takes email + password,
returns { success: boolean, token?: string, error?: string },
and handles: invalid email format, wrong password (max 3 attempts),
and expired accounts.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Time saved per task: ~20 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The Context Dump
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The habit:&lt;/strong&gt; Pasting your entire file (or multiple files) into the prompt "for context."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Give the AI only the interface it needs. If you’re fixing a function, provide that function plus its type signatures. Not the whole module.&lt;/p&gt;

&lt;p&gt;I started using a simple rule: &lt;strong&gt;if the context is longer than the expected output, you’re overfeeding.&lt;/strong&gt;&lt;/p&gt;
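&lt;p&gt;For example, a focused prompt for a one-function bug fix can carry just the function plus the shape of its data, and nothing else. The function and type below are invented for illustration:&lt;/p&gt;

```javascript
// Focused context: the broken function and the one type it touches.
// The rest of the module stays out of the prompt.
const focusedPrompt = [
  "Fix this function so it returns only active users.",
  "",
  "// type: User = { id: string, email: string, active: boolean }",
  "function activeUsers(users) {",
  "  return users.filter(u => u.active === false); // bug: should keep active ones",
  "}",
].join("\n");

console.log(focusedPrompt.length); // a couple hundred characters, not a whole file
```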

&lt;h2&gt;
  
  
  3. The Infinite Iteration Loop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The habit:&lt;/strong&gt; Going back and forth with the AI 8+ times, tweaking the same output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; If the third attempt isn’t close, the prompt is wrong — not the model. Stop iterating and rewrite your request from scratch.&lt;/p&gt;

&lt;p&gt;I now enforce a &lt;strong&gt;3-turn rule&lt;/strong&gt;: if I don’t have something usable after 3 exchanges, I step back and rethink the approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. The Review Skip
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The habit:&lt;/strong&gt; AI output looks right at a glance, so you commit without reading it line by line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Read every line like you’re reviewing a junior developer’s PR. AI is confident, not correct. I’ve caught subtle bugs in "perfect-looking" code that would have shipped to production.&lt;/p&gt;

&lt;p&gt;My checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are there hardcoded values that should be config?&lt;/li&gt;
&lt;li&gt;Does error handling actually handle errors (or just log and continue)?&lt;/li&gt;
&lt;li&gt;Are there imports for things that aren’t used?&lt;/li&gt;
&lt;li&gt;Does the logic match the spec, not just the happy path?&lt;/li&gt;
&lt;/ul&gt;
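&lt;p&gt;The error-handling item is the one that catches people most often. Here’s the shape of the bug, reduced to a toy example (the functions are hypothetical):&lt;/p&gt;

```javascript
// Looks like error handling; actually swallows the failure and returns
// a value the caller will treat as success.
function parseConfig(raw) {
  try {
    return JSON.parse(raw);
  } catch (err) {
    console.error("config parse failed:", err.message);
    return {}; // silent fallback: callers never learn the config was bad
  }
}

// A stricter version surfaces the failure instead of logging past it.
function parseConfigStrict(raw) {
  try {
    return JSON.parse(raw);
  } catch (err) {
    throw new Error(`invalid config JSON: ${err.message}`);
  }
}
```

&lt;p&gt;The first version passes a glance review. The second fails loudly, which is usually what you want at a boundary like config loading.&lt;/p&gt;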

&lt;h2&gt;
  
  
  5. The “AI Knows Best” Defer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The habit:&lt;/strong&gt; Accepting architectural suggestions from the AI because "it’s seen more code than me."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Treat architectural suggestions as input, not instruction. AI optimizes for local correctness, not system coherence. It doesn’t know your deployment constraints, your team’s conventions, or why you chose that database.&lt;/p&gt;

&lt;p&gt;Use AI for implementation. Keep architecture decisions human.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta-Lesson
&lt;/h2&gt;

&lt;p&gt;Every anti-pattern has the same root cause: &lt;strong&gt;treating AI like a senior developer instead of a fast junior.&lt;/strong&gt; Juniors are great at writing code quickly. They’re terrible at knowing &lt;em&gt;what&lt;/em&gt; to write.&lt;/p&gt;

&lt;p&gt;Your job didn’t change. You’re still the architect, the reviewer, the one who owns the outcome. AI just made the typing faster.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Which of these habits are you guilty of? I’d bet at least two of them. Drop a comment — I’m curious which ones hurt the most.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
