Our SDR agent's system prompt went through seven iterations before it stopped guessing email addresses. Here is what that process taught us about treating prompts as production code.
We run six AI agents in production on a daily automated schedule. Each agent has a system prompt stored as a markdown file in a git repository. Over three months, those prompts have accumulated more commits than most of our Python scripts, making them the most frequently edited files in the codebase.
This was not what we expected. We expected to write a prompt, tune it for a week, and leave it alone. What actually happened is that prompts behave like code: they have bugs, they need tests, they regress when you change them, and they require review before deploying to production. The tooling and practices around software engineering apply directly.
Here is what we learned.
## Prompts Are Markdown Files in Git
Each agent's system prompt lives in `.claude/agents/{agent-name}.md`. The CMO agent has `cmo.md`. The SDR has `sdr.md`. The CEO orchestrator has instructions in the project's `CLAUDE.md`.
These are not hidden inside a Python string or a JSON config. They are standalone markdown files, version-controlled like everything else in the repository. A `git log --follow .claude/agents/sdr.md` shows every change to the SDR's behavior, when it happened, and (via the commit message) why.
This is the first and most important decision: prompts are files. They live in version control. They have history.
The alternative — prompts embedded in application code, stored in a database, or managed through a UI — makes it harder to review changes, harder to correlate behavior shifts with prompt edits, and harder to roll back when something breaks. We tried embedding prompts in the orchestrator script during the first week. Within three days we had lost track of which version was running. Moving them to standalone files with git history solved this immediately.
## The SDR Agent: A Case Study in Prompt Iteration
The SDR agent generates lead profiles and drafts outreach emails. Its prompt has been edited more than any other file in the repository. Here is a compressed timeline of why.
Version 1: The initial prompt said "research the company and create a lead profile with scoring." The agent produced profiles, but the scoring was inconsistent. Two companies with similar characteristics would get scores 20 points apart. The scores had no justification.
Version 2: We added explicit scoring dimensions — Budget, Authority, Need, Timeline, Fit — with point ranges for each. The agent now had a rubric. Scores became consistent. But the agent started hallucinating company details to fill scoring fields it could not verify.
Version 3: We added "if you cannot verify a field, leave it blank and note the gap." Hallucinations dropped. But the agent started guessing email addresses using pattern inference (`firstname.lastname@company.com`) without verifying them. Eighteen percent of our outreach bounced.
Version 4: We added "do not guess email addresses. Use only verified contact information." The agent mostly complied. But "mostly" means one in ten leads still had guessed emails. At our volume, that was several bounces per week.
Version 5: We removed the email guessing problem architecturally. Instead of telling the agent not to guess, we added SMTP RCPT TO verification in the email sending script. The agent could write whatever it wanted in the contact field — the sending layer would verify before dispatching. The prompt still says "use verified contacts," but the enforcement is in code, not in the prompt.
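That enforcement layer is a small amount of code. Here is a minimal sketch of `RCPT TO` verification using Python's `smtplib`; the MX lookup and our production error handling are omitted, and all names are illustrative, not our actual script:

```python
import smtplib

def rcpt_accepted(code: int) -> bool:
    # Any 2xx reply to RCPT TO means the server accepts the mailbox.
    return 200 <= code < 300

def verify_email(address: str, mx_host: str, helo_domain: str) -> bool:
    """Ask the recipient's mail server whether a mailbox exists,
    without sending a message. mx_host must come from an MX lookup
    (not shown). Some servers accept every address, so True means
    'not provably invalid', not 'guaranteed deliverable'."""
    with smtplib.SMTP(mx_host, 25, timeout=10) as smtp:
        smtp.helo(helo_domain)
        smtp.mail(f"verify@{helo_domain}")
        code, _ = smtp.rcpt(address)
        return rcpt_accepted(code)
```

The key property: the check runs in the sending layer, after the agent, so a bad contact field never reaches a recipient regardless of what the prompt says.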
Version 6: We discovered the agent was writing outreach emails that were too long — 300-400 word walls of text referencing funding rounds and company history. We added explicit length constraints: "4-6 sentences maximum. Lead with the signal. No preamble about the company's funding or history."
Version 7: We added acceptance criteria that the CEO orchestrator checks before using SDR output. If a lead profile is missing a score justification, the output is flagged and excluded from the pipeline until the next run fixes it.
Seven versions in three months. Each version was a response to a specific failure observed in production. Not a single change was speculative.
## The Core Lesson: Architectural Constraints Beat Prompt Engineering
Version 5 of the SDR prompt is the inflection point in our understanding.
We spent two iterations trying to make the agent stop guessing email addresses by refining the prompt. "Do not guess." "Only use verified information." "If you cannot find a verified email, leave the field blank." Each version reduced the failure rate but never eliminated it.
The fix that actually worked was not a prompt change. It was an architectural change: SMTP verification in the sending script. The agent's output is validated by code before it has any external effect.
This pattern repeated across every agent:
- Write restrictions: We could not reliably prevent agents from writing to wrong directories via prompt instructions. The fix was `--allowedTools` at the CLI level, which blocks unauthorized writes before the filesystem is touched.
- Output length: We could not reliably keep social media posts under character limits via prompts. The fix was a validation check in the publishing script that rejects posts exceeding the limit.
- Data freshness: We could not stop the CMO agent from citing outdated information via prompt instructions. The fix was passing the current date as context and having the downstream quality gate flag research that references events older than 30 days.
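The output-length fix, for example, is a few lines in the publishing script. A sketch, assuming a hypothetical 280-character limit (names and the limit are illustrative, not our production code):

```python
MAX_CHARS = 280  # hypothetical platform limit

def validate_post(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Return a list of violations; an empty list means publishable."""
    problems = []
    if not text.strip():
        problems.append("post is empty")
    if len(text) > max_chars:
        problems.append(f"post is {len(text)} chars, limit is {max_chars}")
    return problems
```

The publishing script refuses to dispatch anything for which this returns a non-empty list, so an over-length post fails loudly instead of being truncated by the platform.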
The pattern: if a failure mode matters, do not rely on the prompt to prevent it. Build the constraint into the system around the agent. Prompts are probabilistic. Code is deterministic. Use code for enforcement and prompts for guidance.
This does not mean prompts are unimportant. The SDR produces dramatically better output with version 7 than version 1. But the system is reliable because of the architectural constraints, not because the prompts are perfect.
## Testing Prompts: Quality Gates as Acceptance Tests
Each agent has acceptance criteria defined in the orchestrator's configuration. These function like automated tests for prompt output.
| Agent | Acceptance Criteria |
|---|---|
| CMO | Research cites sources. Covers 3+ companies. Includes ICP fit assessment per company. |
| SDR | All scoring fields populated with evidence. Score justification present. Company URL included. |
| Social Media | Post passes content rules. Has CTA or closing question. Under character limit. |
| CTO | Technical claims include proof points. Follows content guidelines. |
After each agent run, the CEO orchestrator reads the output and checks these criteria. Failed checks are logged, flagged in the weekly brief, and the output is excluded from downstream use.
This is not sophisticated. There is no eval harness running hundreds of test cases against the prompt. It is a set of boolean checks applied to each output. But it catches the failures that matter: missing data, hallucinated details, content rule violations.
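As a sketch, the SDR checks amount to something like the following. Field names here are illustrative; the real criteria live in the orchestrator's configuration:

```python
def check_sdr_output(profile: dict) -> list[str]:
    """Boolean acceptance checks for one SDR lead profile.
    Returns the list of failed checks; empty means the profile passes."""
    failures = []
    for field in ("budget", "authority", "need", "timeline", "fit"):
        if not profile.get("scores", {}).get(field):
            failures.append(f"missing score: {field}")
    if not profile.get("score_justification"):
        failures.append("missing score justification")
    if not profile.get("company_url"):
        failures.append("missing company URL")
    return failures
```

Any non-empty result is logged, flagged in the weekly brief, and the profile is excluded from downstream use.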
The quality gates also serve as regression tests. When we edit a prompt, the next pipeline run validates the output against the same criteria. If a prompt change causes a previously-passing check to fail, we know immediately.
## Monitoring: What You Actually Need
We track three things per agent run:
Token usage (input and output). A sudden spike in input tokens means the agent is reading more context than expected — possibly a file grew or the prompt expanded. A spike in output tokens means the agent is producing more than it should, which usually indicates a loop or an overly verbose response.
Run duration. Each agent has a `max_turns` limit (25-40 turns depending on the agent). If an agent consistently hits its turn limit, the prompt needs to be more focused or the task needs to be decomposed.

Quality gate pass rate. If the SDR agent's output fails acceptance criteria more than once in three consecutive runs, the prompt needs attention.
These three metrics together tell you everything: is the prompt efficient (tokens), is it focused (duration), and is the output correct (quality gates)?
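A sketch of how the per-run record and the anomaly check might look; the 50% drift thresholds are hypothetical, not our production values:

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    agent: str
    input_tokens: int
    output_tokens: int
    duration_s: float
    gates_passed: bool

def flag_anomalies(run: RunMetrics, baseline_in: int, baseline_out: int) -> list[str]:
    # Hypothetical thresholds: flag token usage drifting more than
    # 50% above baseline, plus any quality gate failure.
    flags = []
    if run.input_tokens > baseline_in * 1.5:
        flags.append("input spike: context grew or prompt expanded")
    if run.output_tokens > baseline_out * 1.5:
        flags.append("output spike: loop or verbose response")
    if not run.gates_passed:
        flags.append("quality gate failure")
    return flags
```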
We also send Telegram alerts for agent failures. A failed agent run sends a push notification immediately. This matters because the agents run unattended at 07:00. Without alerts, a failure would sit unnoticed until someone checked the logs.
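A minimal sketch of such an alert via the Telegram Bot API's `sendMessage` method; the bot token and chat ID come from your bot setup, and `format_alert` is an illustrative helper, not our actual code:

```python
import json
import urllib.request

def format_alert(agent: str, error: str) -> str:
    # Kept as a separate pure function so it is easy to test.
    return f"Agent '{agent}' run failed: {error}"

def send_alert(bot_token: str, chat_id: str, text: str) -> None:
    """Push a failure notification through the Telegram Bot API."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)
```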
## Failure Modes We Have Encountered
Three months of daily agent runs produces a catalog of failure modes. These are the ones that taught us something.
Context window overflow. The CMO agent reads market research files that grow over time. After eight weeks, the accumulated research exceeded the context window. The agent started dropping information silently — it would process the first half of the file and ignore the rest. The fix was archiving old research files and keeping only the latest four weeks in the active directory.
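The archiving fix is mechanical. A sketch that moves stale files out of the active directory, assuming file modification time tracks research age (paths and names are illustrative):

```python
import shutil
from datetime import datetime, timedelta
from pathlib import Path

def archive_stale_research(active: Path, archive: Path,
                           keep_days: int = 28) -> list[str]:
    """Move research files older than keep_days into the archive so the
    agent's reading context stays bounded. Returns the moved filenames."""
    cutoff = datetime.now() - timedelta(days=keep_days)
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for f in active.glob("*.md"):
        if datetime.fromtimestamp(f.stat().st_mtime) < cutoff:
            shutil.move(str(f), archive / f.name)
            moved.append(f.name)
    return moved
```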
Prompt-environment mismatch. The Social Media agent's prompt referenced a content calendar file. We renamed the file during a refactor. The agent could not find it, hallucinated a calendar, and produced posts scheduled for dates in the past. The fix was adding a pre-run check that validates all files referenced in the prompt actually exist.
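Such a pre-run check can be a naive path scan over the prompt text. A sketch; the regex is deliberately simple and the extension list illustrative:

```python
import re
from pathlib import Path

def missing_references(prompt_text: str, repo_root: Path) -> list[str]:
    """Find repo-relative paths mentioned in the prompt and report any
    that no longer exist on disk."""
    candidates = re.findall(r"[\w./-]+\.(?:md|json|csv|yaml)", prompt_text)
    return [p for p in candidates if not (repo_root / p).exists()]
```

Running this before each agent invocation turns a silent hallucination into a loud pre-flight failure.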
Cascading state corruption. The CMO agent once wrote a lead profile directly to `crm/leads/` instead of `research/`. The SDR agent read the malformed profile, attempted to enrich it, and produced a corrupted outreach draft. The fix was the write restriction architecture described above. This failure mode has not recurred since.
Drift between prompt and code. The email sending script was updated to include SMTP verification, but the SDR prompt still told the agent to verify emails itself. The agent would spend several turns attempting verification that the code would duplicate downstream. We now treat prompt-code synchronization as part of every code review: if you change the code, check if the prompt references the changed behavior.
## Practical Recommendations
Store prompts as standalone files in git. Not in code, not in a database, not in a UI. Files in git give you history, diffs, blame, and rollback for free.
Edit prompts in response to observed failures, not hypothetical ones. Every version of our SDR prompt was a response to a specific bug in production. We never made a speculative edit that stuck.
Build enforcement into the system, not the prompt. If a constraint matters, enforce it in code. Use the prompt for guidance and the architecture for guarantees.
Track tokens, duration, and output quality per agent. These three metrics are sufficient to detect prompt problems before they cascade.
Version prompts atomically with the code that uses them. If you change the email sending script, check if the SDR prompt references email handling. Prompt-code drift is a real bug category.
Set max turn limits per agent. Without them, a confused agent will loop until it hits the API rate limit or your budget cap, whichever comes first.
Accept that prompts will keep changing. Our most stable prompt has been edited five times in three months. The least stable has been edited twelve times. This is normal. Prompts are code, and code has maintenance costs. Budget for it.
## Key Takeaways
- Prompts are production code. Version them in git, test them with acceptance criteria, review changes before deploying.
- Architectural constraints are more reliable than prompt instructions for preventing failure modes. Use prompts for guidance. Use code for enforcement.
- Each prompt iteration should be a response to a specific observed failure, not a speculative improvement.
- Three metrics per agent run — token usage, duration, quality gate pass rate — are sufficient to monitor prompt health.
- Prompt-code synchronization is a real maintenance concern. Treat it as part of every code review.
## FAQ
### How do you test prompt changes before deploying?
We run the agent manually with the updated prompt against the current state of the filesystem. The quality gate checks run automatically and flag any regressions. For high-risk changes (SDR scoring criteria, outreach templates), we run the agent against three to five known leads and compare output to the previous version before committing.
### How often do prompts need updating?
In the first month, we edited prompts almost daily. By month three, edits dropped to one or two per week, mostly in response to new failure modes or scope changes. The rate decreases as the prompts mature, but it never reaches zero.
### Do you use prompt templates or parameterized prompts?
The system prompt is static markdown. Dynamic context — the current date, the list of leads to process, the target output directory — is injected into the user prompt at invocation time by the orchestrator. This separation keeps the system prompt stable and the dynamic context explicit.
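A sketch of that separation, assuming the file layout described earlier; the user-prompt wording and function name are illustrative:

```python
from datetime import date
from pathlib import Path

def build_invocation(agent: str, leads: list[str],
                     out_dir: str) -> tuple[str, str]:
    """Pair the static, versioned system prompt with a user prompt
    carrying only the dynamic context for this run."""
    system_prompt = Path(f".claude/agents/{agent}.md").read_text()
    user_prompt = (
        f"Today is {date.today().isoformat()}.\n"
        f"Process these leads: {', '.join(leads)}.\n"
        f"Write output to {out_dir}/."
    )
    return system_prompt, user_prompt
```

Because the system prompt file is never edited at invocation time, every behavioral change shows up as a git diff, while the dynamic context is visible in the orchestrator's logs.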
SifrVentures builds dedicated engineering teams for tech companies. Based in Berlin. Learn how we work | Read more on our blog