DEV Community
Building a Prompt Engineering Feedback Loop: The System That Made My AI Prompts 3x More Effective

Most developers treat prompt engineering like a one-time skill. You read a guide, learn a few tricks, then wing it from there. That is how I started too. It did not work.

I run an AI automation agency. I use the Claude API and Claude Code daily for production systems, everything from generating content at scale to building full-stack features. When your prompts power revenue-generating infrastructure, "good enough" prompts cost real money in wasted tokens, bad outputs, and manual rework.

So I built a feedback loop. After three months of running it, I have 9 reusable prompt templates, 6 saved examples I reference constantly, and a documented list of anti-patterns that would otherwise have kept burning me. Here is the system.

The Rating Schema

After every meaningful AI session, I spend 60 seconds recording a rating. The key is making this fast enough that you actually do it.

Rating scale (1-5):

| Score | Meaning    | Trigger                                                     |
|-------|------------|-------------------------------------------------------------|
| 1     | Unusable   | Output required a complete rewrite or was factually wrong   |
| 2     | Poor       | Correct direction but needed heavy editing (>50% changed)   |
| 3     | Acceptable | Usable with moderate edits, nothing surprising              |
| 4     | Good       | Minor tweaks only; output matched intent well               |
| 5     | Excellent  | Copy-paste ready, better than I would have written manually |

What gets recorded per session:

## 2026-02-10 | Score: 4 | Task: API endpoint generation

**Prompt type:** Code generation (Express handler)
**Model:** claude-sonnet-4-20250514
**What worked:** Giving it the existing handler pattern as a reference.
  Output matched project conventions perfectly.
**What failed:** Did not include error handling until I asked.
**Key insight:** Always include "follow the error handling pattern from
  [reference]" in code gen prompts.
**Template updated:** Yes, added to code-gen-v3.md

This takes less than a minute. The discipline is not in the writing. It is in doing it every single time, especially when the session goes well. You learn as much from a 5 as you do from a 1.
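The entry header format is regular enough to automate. Here is a minimal sketch of a logging helper (a hypothetical convenience function, not part of the system itself) that appends a rating to the current month's file:

```python
from datetime import date
from pathlib import Path


def log_rating(score: int, task: str, worked: str, failed: str,
               insight: str, log_dir: str = "prompt-engineering/ratings") -> str:
    """Append one session rating to the current month's log and return the entry."""
    if not 1 <= score <= 5:
        raise ValueError("score must be 1-5")
    today = date.today()
    entry = (
        f"\n## {today.isoformat()} | Score: {score} | Task: {task}\n\n"
        f"**What worked:** {worked}\n"
        f"**What failed:** {failed}\n"
        f"**Key insight:** {insight}\n"
    )
    # One file per month, e.g. ratings/2026-02.md
    path = Path(log_dir) / f"{today:%Y-%m}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(entry)
    return entry
```

Anything that drops the friction below a minute makes the habit easier to keep.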

The Review Cadence

Weekly (15 minutes): Scan the last 7 entries. Look for two things: techniques that keep producing 4s and 5s, and failure modes that keep producing 1s and 2s. If you see the same insight three times, it is a pattern. Write it down.

Monthly (30 minutes): Update your templates. Patterns that proved out across multiple weeks get added to your prompt templates. Repeated failures get added to the anti-patterns list. Delete anything that stopped being useful.
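The weekly scan can be partly automated if the entry headers stay consistent. A sketch (assuming the `## date | Score: N | Task: ...` header format shown above) that surfaces the low scorers for review:

```python
import re
from pathlib import Path
from statistics import mean

# Matches entry headers like: ## 2026-02-10 | Score: 4 | Task: API endpoint generation
ENTRY_RE = re.compile(r"^## (\d{4}-\d{2}-\d{2}) \| Score: ([1-5]) \| Task: (.+)$")


def summarize_ratings(log_dir: str) -> dict:
    """Aggregate rating entries: session count, average score, and low-scoring tasks."""
    scores, low_tasks = [], []
    for path in sorted(Path(log_dir).glob("*.md")):
        for line in path.read_text(encoding="utf-8").splitlines():
            m = ENTRY_RE.match(line)
            if m:
                score, task = int(m.group(2)), m.group(3)
                scores.append(score)
                if score <= 2:
                    low_tasks.append(task)  # repeated task names here = a failure pattern
    return {
        "sessions": len(scores),
        "average": round(mean(scores), 2) if scores else None,
        "low_score_tasks": low_tasks,
    }
```

The script only counts; the judgment calls (is this the same insight three times?) still happen by reading the entries.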

Anti-Patterns: Prompts That Consistently Fail

These are specific patterns I tracked across dozens of sessions. Each one seemed reasonable but produced reliably poor results.

1. The Vague Scope Dump

Bad: "Build me a user authentication system with all the features
     a modern app would need."

Score: 1-2, every time.

This fails because "all the features" is unbounded. The model guesses at scope and inevitably picks the wrong features or implements them at the wrong depth. You get a sprawling mess that matches nothing you actually needed.

2. The Over-Constrained Micromanager

Bad: "Write a Python function that takes exactly two arguments,
     the first being a string of length 1-255, validates it
     using regex pattern ^[a-zA-Z0-9_]+$, raises ValueError
     with message 'Invalid input: {input}' if validation fails,
     logs to stdout using print() not logging, returns a dict
     with keys 'status' and 'result'..."

This looks thorough but it produces brittle, literal-minded code. The model follows your spec so precisely that it misses obvious improvements. Worse, you spend more time writing the prompt than you would writing the code.

3. The Missing Context Assumption

Bad: "Add pagination to the list endpoint."

This assumes the model knows your pagination style, your ORM, your response format, your frontend expectations. It does not. You get generic offset/limit pagination when you needed cursor-based, or you get a completely different response envelope than your other endpoints use.

4. The "Be Creative" Trap

Bad: "Write me a really creative and unique solution for caching
     API responses."

"Creative" is not a technical requirement. This produces over-engineered novelty code, often using obscure patterns the model is less reliable at implementing. Your caching layer does not need to be creative. It needs to work.

Proven Patterns: What Consistently Scores 4-5

1. Reference-Based Generation

Proven: "Here is an existing handler that follows our project
conventions:

[paste 30-50 lines of a real handler]

Write a new handler for the /users/:id/preferences endpoint
that follows the same patterns for error handling, response
format, and input validation."

This works because you are giving the model a concrete target instead of an abstract description. It matches style, structure, and conventions without you having to enumerate every rule.

2. Constraint Sandwich (Context, Task, Boundaries)

Proven: "Context: Python FastAPI service, SQLAlchemy ORM,
Pydantic v2 for schemas. We use repository pattern for
data access.

Task: Write a new endpoint POST /api/v1/reports that accepts
a date range and report type, queries the database, and
returns aggregated results.

Boundaries: Do not add any new dependencies. Use existing
database models. Keep the endpoint under 40 lines. Return
errors as HTTPException with appropriate status codes."

Context sets the environment. Task defines the deliverable. Boundaries prevent scope creep. This three-part structure consistently produces focused, usable output.

3. Iterative Refinement With Explicit Feedback

Instead of one mega-prompt, break it into steps and give explicit feedback:

Step 1: "Write the Pydantic schemas for a report request
and response."

[Review output, then:]

Step 2: "Good. The request schema is right. For the response,
change 'data' to 'rows' and add a 'generated_at' timestamp
field. Now write the repository method that queries the
database using these schemas."

This consistently outperforms trying to get everything right in a single shot. Each step is small enough that the model gets it right, and your corrections compound.
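The same step-wise pattern carries over to the API: keep the whole conversation in the `messages` list so each refinement sees the prior output and your feedback. A sketch (the `refine` helper and its step list are illustrative, not production code):

```python
def send(messages: list[dict]) -> str:
    """One API round trip; returns the assistant's text."""
    import anthropic  # imported here so refine() below is testable offline
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=messages,
    )
    return response.content[0].text


def refine(steps: list[str], send_fn=send) -> tuple[str, list[dict]]:
    """Run multi-step refinement: each step sees the full prior conversation."""
    messages: list[dict] = []
    last = ""
    for step in steps:
        messages.append({"role": "user", "content": step})
        last = send_fn(messages)
        messages.append({"role": "assistant", "content": last})
    return last, messages


# Usage: each step is small, and corrections ride along as user turns
# final, history = refine([
#     "Write the Pydantic schemas for a report request and response.",
#     "Good. Change 'data' to 'rows', add 'generated_at'. Now write the repository method.",
# ])
```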

4. Anti-Pattern Fencing

Proven: "Write a database migration to add a 'status' column
to the orders table.

Do NOT: create a new table, modify existing columns, add
indexes (we will do that separately), or include seed data."

Explicitly stating what you do not want is as valuable as stating what you do. This eliminates the most common failure mode: the model "helping" by doing more than you asked.

Prompt Template: v1 vs v3

Here is a real template from my system. The v1 version is what I started with. The v3 version is what three months of feedback produced.

v1 (Naive):

# Code Generation Prompt

Write [description of what I need] in [language].
Make it production-ready.

v3 (After Feedback Loop):

# Code Generation Prompt v3

## Reference
[Paste 1 existing file that follows project conventions]

## Context
- Language/framework: [e.g., Python 3.12, FastAPI]
- ORM/DB: [e.g., SQLAlchemy 2.0, PostgreSQL]
- Project patterns: [e.g., repository pattern, dependency injection]

## Task
[One clear deliverable, 1-3 sentences max]

## Boundaries
- Do not add new dependencies
- Do not modify existing files unless specified
- Match the error handling pattern from the reference
- [Any other project-specific constraints]

## Output format
- Single file, ready to save
- Include type hints
- Include docstring with one-line description

What changed between v1 and v3:

  • Added Reference section. This was the single biggest improvement. Giving the model a real example to match cut my edit time by half.
  • Split "make it production-ready" into explicit Boundaries. "Production-ready" means different things to every developer. Spelling out the constraints removed ambiguity.
  • Added Output format. Specifying "single file, ready to save" eliminated the model's tendency to split code across multiple files or add explanatory text I had to strip out.
  • Removed vague qualifiers. No more "make it clean" or "make it good." Every instruction is specific and verifiable.

File Structure for Storing the System

prompt-engineering/
  ratings/
    2026-01.md          # Monthly rating logs
    2026-02.md
  templates/
    code-gen-v3.md       # Code generation (current)
    code-review-v2.md    # Code review checklist
    api-design-v1.md     # API endpoint design
    content-gen-v4.md    # Blog/marketing content
    migration-v2.md      # Database migrations
    bug-fix-v3.md        # Debugging assistance
    refactor-v2.md       # Code refactoring
    test-gen-v2.md       # Test writing
    data-transform-v1.md # Data pipeline scripts
  examples/
    great-code-gen.md    # Scored 5, reference prompt+output
    great-refactor.md    # Scored 5, complex refactor
    great-migration.md   # Scored 5, zero-downtime migration
    failed-vague.md      # Scored 1, lesson in specificity
    failed-scope.md      # Scored 1, scope explosion
    failed-creative.md   # Scored 2, over-engineered result
  anti-patterns.md       # Documented failure modes
  patterns.md            # Documented success patterns
  CHANGELOG.md           # Template version history

Everything is plain Markdown. No special tooling. I keep it in a git repo so template changes are tracked, but a folder on your desktop works fine as a starting point. The format matters less than the habit.

Applying This to Production API Calls

The feedback loop becomes even more valuable when you are making programmatic API calls. In a chat session, you can course-correct in real time. In production, a bad prompt runs hundreds of times before you notice.

Here is how I apply the system to API calls in Python:

import json

import anthropic

client = anthropic.Anthropic()

# Template loaded from your templates/ directory
PROMPT_TEMPLATE = """
## Reference
{reference_example}

## Context
Service: Content generation pipeline
Output format: JSON with keys: title, body, meta_description
Word count target: {word_count}

## Task
{task_description}

## Boundaries
- Output valid JSON only, no markdown wrapping
- Do not include placeholder text
- meta_description must be under 160 characters
"""

def generate_content(task: str, reference: str, word_count: int) -> dict:
    prompt = PROMPT_TEMPLATE.format(
        reference_example=reference,
        task_description=task,
        word_count=word_count,
    )

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )

    # The boundaries require raw JSON, so parse the text block directly
    return json.loads(response.content[0].text)

Key differences for production API prompts:

  1. Version your templates. Store them as files, not inline strings. When you update a template, you can diff it against the previous version and measure whether outputs improved.

  2. Log inputs and outputs. Every API call should log the prompt version, the input variables, and a quality score (automated or manual). This is your ratings/ data at scale.

  3. A/B test template changes. When you update from v2 to v3, run both versions on the same inputs for a week. Compare output quality before fully switching.

  4. Set up automated quality checks. For structured outputs, validate the schema. For content, check word count, reading level, and keyword presence. These automated scores supplement your manual ratings.

The same feedback loop applies. You review production logs weekly, identify which prompts produce the most rework, and update those templates first. The difference is that a 10% improvement in a production prompt template saves hundreds of manual corrections per month.

Getting Started

You do not need to build all of this on day one. Start with the rating habit. After every AI session, spend 60 seconds recording a score and one key insight. Do that for two weeks. The patterns will be obvious, and you will naturally start building templates from what works.

The compound effect is real. Three months in, I spend less time prompting and get better results than I did when I started. Not because I memorized tricks, but because I built a system that learns from every session.


I'm Parker Gawne, founder of Syntora. We build custom Python infrastructure for small and mid-size businesses. syntora.io
