AI Game Dev Needs Evidence Gates, Not More Prompt Dumps

Paranoia / FutureArtStudio — Wed, 10 Jun 2026 09:58:15 +0000

Generation is easy to demo. Production needs reviewable claims, validation contracts, small experiments, and rollback.

AI-assisted game development has a strange failure mode:

It can produce more ideas, more critiques, more feature suggestions, more balance notes, more UI opinions, and more implementation plans than a small team can possibly evaluate.

At first, that feels like progress. A model can look at a design note and produce ten alternative mechanics. It can read a prototype description and suggest retention loops. It can review a screenshot and list UX issues. It can turn a vague concept into a roadmap.

But production does not fail because teams have too few suggestions.

Production usually fails because the team cannot tell which suggestions are grounded, which ones are cheap guesses, which ones contradict the current prototype, which ones would damage the game feel, and which ones should be rejected before they turn into design debt.

That is why I think AI game dev does not need more prompt dumps.

It needs evidence gates.

The problem with prompt dumps

A prompt dump is any AI output that skips the hard part:

Here are twenty mechanics.
Here are ten monetization ideas.
Here are five reasons the combat feels weak.
Here is a production plan.
Here is a rewrite of your core loop.
Here is what players will want.

The issue is not that these outputs are always wrong. Sometimes they are useful. Sometimes they surface a possibility the team missed.

The issue is that they often arrive without a clear connection to evidence.

What gameplay clip supports this diagnosis?

Which playtest note points to this problem?

Which screenshot shows the UI failure?

Which constraint makes this proposal realistic?

What would prove the change helped?

What would make us roll it back?

Without answers to those questions, AI suggestions become a kind of production fog. They sound structured, but they do not necessarily reduce uncertainty. They may even increase it by giving every half-formed idea a professional-looking shape.

That is especially dangerous in game design because the important parts are contextual.

The same mechanic can be elegant in one game and noise in another. The same UI simplification can improve readability or erase tension. The same pacing change can make a prototype feel better for one audience and flatter for another. A model can describe design principles, but the prototype decides whether those principles apply.

So the unit of work should not be "generate more advice."

The unit of work should be: make a claim, attach evidence, define a test, and keep rollback possible.

What is an evidence gate?

An evidence gate is a review boundary between AI-generated interpretation and production action.

Before an AI suggestion becomes advice the team acts on, it has to pass through a small set of questions:

What source material is this based on?
What specific issue or opportunity is being claimed?
What validation contract would make the claim testable?
What is the smallest experiment that could check it?
What result would make us keep, revise, or roll back the change?

That gate does not need to be bureaucratic. For a small indie team, it can be a short Markdown block. For a studio, it can become part of an internal review workflow.

The important thing is not the format.

The important thing is that AI-generated work must become inspectable before it becomes operational.
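For teams that prefer a script to a Markdown block, the five questions can be sketched as a tiny data structure. This is only an illustration: `GateEntry`, `unanswered`, and `passes_gate` are hypothetical names, and the only rule assumed here is that a blank answer blocks the gate.

```python
from dataclasses import dataclass, fields


@dataclass
class GateEntry:
    """One AI suggestion, written up against the five gate questions."""
    source_material: str = ""      # what the suggestion is based on
    claim: str = ""                # the specific issue or opportunity
    validation_contract: str = ""  # what would make the claim testable
    smallest_experiment: str = ""  # the cheapest check that could test it
    decision_rule: str = ""        # what result means keep, revise, or roll back


def unanswered(entry: GateEntry) -> list[str]:
    """Return the names of the gate questions that are still blank."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name).strip()]


def passes_gate(entry: GateEntry) -> bool:
    """Assumed rule: a suggestion passes only when every question has an answer."""
    return not unanswered(entry)


# A draft that names its source and claim but nothing else.
draft = GateEntry(
    source_material="Clip 02, playtest notes A and C",
    claim="Enemy hit intent is not readable before damage.",
)
print(unanswered(draft))   # lists the three questions still open
print(passes_gate(draft))  # False
```

A check like this does not judge whether the answers are good. It only makes a missing answer visible, which is the part a reviewer should not have to hunt for.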

A simple evidence-first workflow

Here is the pattern I am using:

source material
-> evidence notes
-> issue cards
-> validation contracts
-> small experiments
-> keep / revise / rollback

1. Source material

Start with something real.

That can be:

a gameplay recording
a screenshot
a playtest note
telemetry
a design document
a bug report
a player comment
a prototype build
a designer's constraint list

The point is to stop asking the model to reason from an empty room.

If the input is only "make the combat more fun," the output will probably be generic. If the input includes a 45-second clip where players fail to read enemy intent, the AI can be asked to reason about a concrete pattern.

The better question is not:

How do we improve combat?

It is:

Given this clip and these playtest notes, what specific readability failures can we identify, and which ones are supported by evidence?

2. Evidence notes

The first AI pass should not be a solution pass.

It should be an evidence extraction pass.

For example:

Observed evidence:
- Player takes damage before the attack intent is visually clear.
- The hit reaction overlaps with the next enemy windup.
- The UI damage indicator competes with the enemy animation.
- Two playtesters described the hit as "random."

Unsupported claims:
- "The combat is too hard."
- "The enemy design is bad."
- "The player needs more abilities."

That distinction matters.

It is completely possible for a prototype to feel bad for reasons that are not obvious. But if the workflow cannot separate observed evidence from interpretation, the team may start solving the wrong problem.
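One way to keep that separation mechanical is to require every note to point at a source. The sketch below assumes a simple rule that is mine, not a standard: a statement with no source reference is treated as unsupported. The `Note` type and `split_notes` helper are hypothetical, and the source strings are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Note:
    text: str
    source: Optional[str] = None  # e.g. a clip timestamp or a playtest note id


def split_notes(notes):
    """Split notes into (observed, unsupported) by whether they cite a source."""
    observed = [n for n in notes if n.source]
    unsupported = [n for n in notes if not n.source]
    return observed, unsupported


notes = [
    Note("Player takes damage before the attack intent is visually clear.",
         source="Clip 02, 00:13-00:17"),
    Note("The attack felt random.", source="Playtest note C"),
    Note("The combat is too hard."),           # interpretation, no source attached
    Note("The player needs more abilities."),  # a solution, not an observation
]

observed, unsupported = split_notes(notes)
print(len(observed), "observed,", len(unsupported), "unsupported")
```

An unsupported note is not thrown away. It simply cannot be promoted to an issue card until someone attaches evidence to it.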

3. Issue cards

Next, turn evidence into issue cards.

An issue card should be small enough to review and specific enough to test.

Example:

Issue: Enemy hit intent is not readable before damage.

Evidence:
- Clip 02, 00:13-00:17: player is hit before the red flash appears.
- Playtest note A: "I did not know what hit me."
- Playtest note C: "The attack felt random."

Risk:
- If unresolved, players may interpret failure as unfairness rather than mastery.

Not claimed:
- This does not prove the whole combat system is too difficult.

This is where AI becomes more useful. It can help convert messy evidence into reviewable units. But the output is still not a production change. It is a claim package.

The team still needs a validation contract.
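If issue cards are stored as data rather than free text, the same card can be checked and rendered consistently. The following is a rough sketch with hypothetical names (`IssueCard`, `is_reviewable`, `render`). The assumption that a card needs at least one evidence item to be reviewable is mine.

```python
from dataclasses import dataclass, field


@dataclass
class IssueCard:
    issue: str
    evidence: list[str] = field(default_factory=list)
    risk: list[str] = field(default_factory=list)
    not_claimed: list[str] = field(default_factory=list)

    def is_reviewable(self) -> bool:
        """Assumed rule: a card with no evidence is still just an opinion."""
        return bool(self.issue.strip()) and len(self.evidence) > 0

    def render(self) -> str:
        """Format the card in the same layout as the example above."""
        def section(title, items):
            return [f"{title}:"] + [f"- {item}" for item in items] + [""]

        lines = [f"Issue: {self.issue}", ""]
        lines += section("Evidence", self.evidence)
        lines += section("Risk", self.risk)
        lines += section("Not claimed", self.not_claimed)
        return "\n".join(lines).rstrip()


card = IssueCard(
    issue="Enemy hit intent is not readable before damage.",
    evidence=[
        "Clip 02, 00:13-00:17: player is hit before the red flash appears.",
        'Playtest note A: "I did not know what hit me."',
    ],
    risk=["Players may interpret failure as unfairness rather than mastery."],
    not_claimed=["This does not prove the whole combat system is too difficult."],
)
print(card.is_reviewable())
print(card.render())
```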

4. Validation contracts

A validation contract defines what would make a suggestion actionable.

For the issue above, the contract might be:

Hypothesis:
If enemy attack intent becomes readable at least 300ms before damage, players will describe hits as avoidable rather than random.

Experiment:
- Add a clearer windup frame.
- Delay damage by 300ms after the warning.
- Keep damage value unchanged.
- Test with 3 players or replay the same encounter internally.

Acceptance signal:
- Players can name the incoming attack before being hit.
- Fewer "random hit" notes.
- No major loss of enemy threat.

Rollback:
- Revert the windup timing if the enemy becomes too easy or the combat loses tension.

This is the missing layer in a lot of AI-assisted production workflows.

Models are good at producing confident recommendations. Teams need recommendations that are safe to evaluate.

The contract turns a suggestion into a testable bet.
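To make the decision rule explicit before the playtest, the contract above could be written as a small function. Treat this as a sketch under stated assumptions: the `SessionResult` fields, the requirement that every session meets the 300 ms lead time, and the comparison against a baseline count of "random hit" notes are all illustrative choices, not part of the contract itself.

```python
from dataclasses import dataclass

MIN_LEAD_MS = 300  # from the hypothesis: intent readable at least 300 ms before damage


@dataclass
class SessionResult:
    warning_ms: int         # timestamp when the windup becomes visible
    damage_ms: int          # timestamp when damage lands
    named_attack: bool      # player could name the incoming attack before the hit
    called_it_random: bool  # player described the hit as random
    threat_lost: bool       # reviewer judged that the enemy lost its threat


def decide(results, baseline_random_notes):
    """Map playtest results to keep / revise / rollback (assumed rules)."""
    if not results:
        raise ValueError("no sessions recorded; nothing to decide")
    # Rollback condition from the contract: the enemy became too easy.
    if any(r.threat_lost for r in results):
        return "rollback"
    lead_ok = all(r.damage_ms - r.warning_ms >= MIN_LEAD_MS for r in results)
    named = all(r.named_attack for r in results)
    fewer_random = sum(r.called_it_random for r in results) < baseline_random_notes
    if lead_ok and named and fewer_random:
        return "keep"
    return "revise"


# Three hypothetical sessions; the baseline build produced two "random hit" notes.
sessions = [
    SessionResult(1000, 1320, True, False, False),
    SessionResult(1000, 1300, True, False, False),
    SessionResult(1000, 1350, True, True, False),
]
print(decide(sessions, baseline_random_notes=2))
```

The value of writing this down first is that the team argues about the thresholds before seeing the results, not after.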

5. Small experiments

The experiment should be small.

That sounds obvious, but AI often pushes work toward large, elegant, totalizing plans:

rebuild the combat system
redesign the UI
change the enemy roster
add a progression layer
restructure the tutorial

Sometimes large changes are needed. But most AI suggestions should first be reduced to a smaller experiment.

Small experiments protect the team from two failure modes:

A plausible AI diagnosis becomes a week of unvalidated work.
A real issue gets buried inside a large change, making it impossible to know what helped.

The smaller the experiment, the cleaner the learning.
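One way to enforce "small" is to express an experiment as a handful of overrides on the current tuning values and refuse anything larger. The sketch below is an assumption-heavy illustration: the setting names, the baseline numbers, and the budget of two changed values are all hypothetical.

```python
MAX_CHANGED_KEYS = 2  # assumed budget; pick whatever "small" means for your team

# Hypothetical baseline tuning for the encounter under test.
BASELINE = {
    "windup_frames": 6,
    "damage_delay_ms": 0,
    "damage": 10,
    "enemy_count": 3,
}


def apply_experiment(baseline, overrides, max_changed=MAX_CHANGED_KEYS):
    """Return a new settings dict, rejecting experiments that change too much."""
    unknown = set(overrides) - set(baseline)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    # Only count values that actually differ from the baseline.
    changed = {k: v for k, v in overrides.items() if baseline[k] != v}
    if len(changed) > max_changed:
        raise ValueError(
            f"experiment changes {len(changed)} settings; budget is {max_changed}"
        )
    return {**baseline, **changed}  # the baseline itself is left untouched


# The windup experiment: clearer windup and delayed damage, damage value unchanged.
trial = apply_experiment(BASELINE, {"windup_frames": 10, "damage_delay_ms": 300})
print(trial)
```

Because the baseline is never mutated, comparing the trial against it shows exactly what the experiment changed.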

6. Rollback

Rollback is not a sign that the workflow failed.

Rollback is part of the workflow.

If a team cannot easily undo an AI-suggested change, the suggestion is more expensive than it looks. A bad idea that can be rejected quickly is tolerable. A bad idea that contaminates the design direction, documentation, tasks, and implementation is not.

That is why every AI-assisted workflow should ask:

How do we get back to the previous state if this does not work?

This is not just an engineering concern. It is a design concern.

Game design is full of attractive changes that make the game worse in context. A workflow that cannot reject its own suggestions will slowly accumulate confident mistakes.
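Rollback gets cheaper when each change records what it replaced and what else it touched. Here is a minimal sketch, assuming tuning values live in a plain dict. `Tuning` and `TrackedChange` are hypothetical names, and the `touched` list is just a reminder of the documents and tasks that would also need cleanup.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class TrackedChange:
    label: str
    before: dict  # snapshot of the settings prior to the change
    touched: list[str] = field(default_factory=list)  # docs, tasks, files to clean up


class Tuning:
    def __init__(self, values):
        self.values = dict(values)
        self.history = []

    def apply(self, label, overrides, touched=()):
        """Record a snapshot first, then apply the change."""
        snapshot = copy.deepcopy(self.values)
        self.history.append(TrackedChange(label, snapshot, list(touched)))
        self.values.update(overrides)

    def rollback(self):
        """Undo the most recent change and report what else needs cleanup."""
        if not self.history:
            raise RuntimeError("nothing to roll back")
        change = self.history.pop()
        self.values = change.before
        return change.touched


tuning = Tuning({"windup_frames": 6, "damage_delay_ms": 0})
tuning.apply(
    "clearer windup",
    {"windup_frames": 10, "damage_delay_ms": 300},
    touched=["combat design doc", "task: retime enemy windup"],
)
leftovers = tuning.rollback()
print(tuning.values)  # back to the original settings
print(leftovers)      # the non-code artifacts that still mention the change
```

The second return value is the design half of rollback: reverting the numbers is easy, while remembering which notes and tasks still assume the change is where teams tend to slip.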

What changes when teams use evidence gates?

The biggest change is not that AI becomes "smarter."

The biggest change is that the team becomes harder to fool.

With evidence gates, AI is no longer treated as a source of authority. It becomes a tool for structuring claims, surfacing options, and compressing review work.

The designer still owns judgment.

The prototype still owns truth.

The playtest still owns surprise.

The team still decides what matters.

AI becomes useful when it makes that judgment easier, not when it tries to replace it.

This is not anti-AI

I am not arguing that AI has no place in game development.

I use it constantly for research synthesis, design critique, workflow drafting, code review support, localization experiments, tool exploration, and production planning.

The point is narrower:

AI-generated work should not enter production as an unreviewable blob.

It should enter as a structured claim that can be inspected, tested, and rejected.

For game teams, that difference matters more than the prompt.

The future of AI-assisted game design will not be decided by who has the longest prompt library. It will be decided by who can build reliable loops between evidence, interpretation, implementation, and learning.

Where ParanoiaSkills fits

I am building this direction into an open-source project called ParanoiaSkills.

The project is a collection of Markdown-based agent skills and workflows for AI-assisted game design. The goal is not to make AI "design the game." The goal is to make AI-assisted work more reviewable:

turn source material into structured evidence
convert observations into issue cards
define validation contracts
plan small experiments
keep rollback explicit
separate candidate ideas from promoted workflow rules

In other words, it is an attempt to move from:

prompt -> confident advice

to:

evidence -> claim -> validation -> experiment -> learning

That shift is small on paper, but it changes the role of AI in production.

The model stops being the voice that tells the team what to do.

It becomes a collaborator that helps the team make better decisions under constraints.

A practical test

If you are using AI in a game project, try this on one small problem:

Pick a real gameplay clip, screenshot, or playtest note.
Ask the model to extract only evidence, not solutions.
Convert that evidence into one issue card.
Write a validation contract before implementing anything.
Run the smallest experiment you can.
Decide in advance what rollback means.

If the AI output cannot survive that process, it probably was not production advice yet.

It was just a suggestion.

And suggestions are cheap.

Validated learning is not.

ParanoiaSkills is here:

https://github.com/DY-2026/ParanoiaSkills

If you are building games with AI, especially in Godot or indie production contexts, I would be interested in how you validate AI-assisted design work before it becomes production guidance.

That is the part I think deserves more attention.

DEV Community: Paranoia / FutureArtStudio