This is Part 3 of our AI verification series.
Part 1: Three AIs analyzed our product. None passed the truth filter →
Part 2: Human in the loop doesn't scale. Human at the edge does. →
Same prompt. Same AI. Different sessions. Different outputs.
Part 1 showed three different AIs diverging on the same question.
That's expected. Different training, different weights, different answers.
But we didn't stop there. We re-ran the same AI on the same prompt in a new session.
We got materially different outputs again.
Both looked authoritative. Neither warned us they disagreed with each other.
What the same AI said twice
Prompt: "Forecast Korea's AI industry in 2027."
Session 1 produced:
- Market size: $10–15B at >25% CAGR
- Global positioning: "Global AI G3 powerhouse"
- Hardware claim: "All Korean electronics AI-native by 2027" — sourced to a single company's roadmap
Session 2 produced:
- Market size: KRW 4.46T (~$3.3B) at 14.3% CAGR
- Global positioning: "Top three AI powers" — framed as government target
- No hardware claim at all
Same prompt. Same AI. Different session. A 4× market size gap. No flags from either run.
This isn't a hallucination. Both outputs were internally coherent. Both read like credible analyst reports. The problem is deeper than hallucination.
Why this happens: AI inference is non-deterministic
We spent months trying to fix output drift with better prompts, more context, stricter instructions.
It didn't work.
Because the issue isn't the prompt.
AI is optimized to sound right.
Not to prove itself.
What we call "hallucination" is mostly context drift — the model's plausibility engine filling gaps differently depending on what's salient in a given session. Different day, different sampling, different emphasis in the context window — different output. Same confidence posture throughout.
You can't prompt your way out of a non-deterministic system. You need verification as a separate step.
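If it helps to see the mechanism, here is a minimal, self-contained sketch in plain Python (not part of GEM2) of why the same prompt can yield different outputs: with any non-zero sampling temperature, the decoder draws the next token from a distribution, so two sessions can legitimately take different paths from identical inputs. The token strings and probabilities below are toy values chosen for illustration.

```python
import random

def sample_next(token_probs, temperature=0.8, rng=None):
    """Pick one token from a next-token distribution, temperature-scaled.

    Scaling by exp(logit / T) is equivalent to p ** (1 / T) after
    renormalizing, so any temperature above zero leaves room for runs
    to diverge even on identical inputs.
    """
    rng = rng or random.Random()
    scaled = {tok: p ** (1.0 / temperature) for tok, p in token_probs.items()}
    total = sum(scaled.values())
    r = rng.random() * total
    cumulative = 0.0
    for tok, weight in scaled.items():
        cumulative += weight
        if r <= cumulative:
            return tok
    return tok  # floating-point edge case: fall back to the last token

# Toy distribution for the next claim after "Korea's AI market will reach ..."
next_token = {"$10-15B": 0.35, "KRW 4.46T": 0.30, "$3.3B": 0.20, "unclear": 0.15}

# Same "model", same prompt, two sessions with independent randomness.
session_1 = sample_next(next_token, rng=random.Random(0))
session_2 = sample_next(next_token, rng=random.Random(1))
print(session_1, session_2)  # often two different picks, each stated with equal confidence
```

Real inference adds further sources of variation (batching, hardware, what happens to be salient in the context window), but the core point is the same: identical inputs do not guarantee identical outputs.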
The truth filter didn't just score. It fingerprinted.
We ran both sessions through gem2_truth_filter — not to get a number, but to understand why the outputs diverged.
Session 1 (avg 35%):
| Provider | Score | Key violation |
|---|---|---|
| Gemini | 24% | L→G: "Global AI G3 — no index cited" |
| ChatGPT | 21% | Δe→∫de: single company → industry-wide claim |
| Claude | 59% | S→T: current AI strength = permanent identity |
Session 2 (avg 43%):
| Provider | Score | Key violation |
|---|---|---|
| Gemini | 45% | S→T: past-tense framing of future events |
| ChatGPT | 32% | Source attribution FAIL |
| Claude | 51% | Scope mixing — 2033 CAGR back-extrapolated to 2027 |
The failure types were different. Session 1 overclaimed about Korea's global position. Session 2 failed on temporal framing and citations.
Same prompt. Different inference paths. Different failure signatures.
This is the key finding: AI output drift is not random. It's traceable.
The filter names the exact reasoning pattern that produced the problem. L→G (local to global), S→T (snapshot to trend), Δe→∫de (thin evidence to broad claim). Named patterns mean auditable drift. Auditable drift means fixable systems.
(Note: Korea AI forecasting is a harder grounding task than product analysis — fewer citable sources, more projection-dependent claims. That's why baseline scores here are lower than the results in Part 1. Same tool, same logic — harder domain.)
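To show why a named pattern is more useful than a bare score, here is a hypothetical sketch of how one audited claim could be represented. The record shape and field names are ours for illustration, not the actual gem2_truth_filter output format, and the findings below are only an illustrative subset of the tables above. The point is that when each finding carries a claim, a named pattern, and an evidence gap, two sessions can be diffed by failure signature rather than eyeballed.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    claim: str       # the sentence being audited
    pattern: str     # named reasoning pattern, e.g. "L→G", "S→T", "Δe→∫de"
    evidence: str    # what the claim was actually grounded in
    severity: float  # 0.0 (minor) .. 1.0 (disqualifying)

# Illustrative subset of the findings paraphrased from the two sessions above.
session_1_findings = [
    Finding("Global AI G3 powerhouse", "L→G", "no global index cited", 0.8),
    Finding("All Korean electronics AI-native by 2027", "Δe→∫de",
            "single company's roadmap", 0.9),
]
session_2_findings = [
    Finding("Top three AI powers", "S→T",
            "government target framed as a settled outcome", 0.6),
]

def drift_signature(a, b):
    """Patterns that fired in one session but not the other."""
    return {f.pattern for f in a} ^ {f.pattern for f in b}

print(drift_signature(session_1_findings, session_2_findings))
# For this subset, every pattern is unique to one session: the runs failed differently.
```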
We stopped trying to fix the output. We fixed the conditions.
This is the shift Part 2 described philosophically. Here's what it looks like in practice.
We didn't rewrite the prompt ourselves. We asked:
"Create a grounded replacement contract prompt using gem2 tools."
One command. The system generated a formal contract — input/output types, invariants, prohibited patterns, confidence requirements. We reviewed it. We approved it. Then we ran the same AI with the contract enforced.
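For concreteness, here is a hypothetical sketch of the kind of structure such a contract could carry. The field names and wording are ours, not the schema tpmn_contract_writer actually produces. What matters is that the constraints are explicit enough for a later audit to check each claim against them.

```python
# Hypothetical contract sketch; illustrative only, not the tpmn_contract_writer schema.
forecast_contract = {
    "task": "Forecast Korea's AI industry in 2027",
    "output_type": "narrative forecast with inline citations",
    "invariants": [
        "Every market-size figure names its source and base year",
        "2027 claims use future or conditional framing, never past-tense certainty",
        "Company-level evidence is never generalized into industry-level claims",
    ],
    "prohibited_patterns": ["L→G", "S→T", "Δe→∫de"],
    "confidence_requirements": {
        "grounded": "state as fact and cite the source",
        "inferred": "label as an estimate and show the basis",
        "extrapolated": "label as a scenario and state the assumption",
    },
}
```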
Session 2, contract-compliant (R2):
| Provider | Score |
|---|---|
| Gemini | 98% |
| Claude | 81% |
| ChatGPT | 64% |
| Average | 81% |
+38 points. Same AI. Same question. Different structural constraints.
The contract doesn't make the AI smarter. It makes the AI's output auditable against a defined standard.
Then the human intervened. Once.
81% — but the output read like a legal document. Every claim cited, scoped, hedged. Epistemically reliable. Practically unreadable.
One instruction:
"Soften the tone. Don't reintroduce any claims the truth filter removed."
Session 2, softened (R3):
| Provider | Score |
|---|---|
| Gemini | 95% |
| Claude | 75% |
| ChatGPT | 57% |
| Average | 75% |
Down 6 points. More readable. Still grounded.
We chose 75%. Not because it's better than 81%. Because 75% is the right trade-off — readable enough to share, grounded enough to trust. We submitted 75% to gem2 calibration as our standard for narrative AI forecasts.
Human reads the audit.
Human decides the trade-off.
Human defines the standard.
Not reviewing every line. Not trusting blindly. Deciding at the right moment.
What the full arc looks like
Session 1 (no filter) → 35% avg
Session 2 (no filter) → 43% avg
Contract applied (R2) → 81% avg
Human softened (R3) → 75% avg ← our standard
Truth is not the score.
Truth is the pattern of drift.
You define the standard.
The workflow: AI audits AI
Human asks → AI executes
AI verifies AI → AI fixes AI
Human decides at the edge
The verification layer — gem2_truth_filter, tpmn_contract_writer, the composer — runs between generation and delivery. The human sees the audit result, decides the acceptable trade-off, sets the calibration standard.
Human-in-the-loop means the human is the bottleneck — every output passes through before it ships. That doesn't scale. Human-at-the-edge means you define "acceptable" once, and the system enforces it automatically. You intervene only when a genuine judgment call is required — like choosing 75% over 81%.
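As a sketch of that difference, the loop below shows the shape of human-at-the-edge in code. Every name here (generate, verify, escalate, the Audit record) is a hypothetical stand-in for your own stack, not a GEM2 API; the one thing taken from this post is the idea of a calibrated threshold the human sets once.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Audit:
    score: int           # e.g. a truth-filter score, 0..100
    findings: List[str]  # named violations, if any

# The standard the human defined once (75% for narrative forecasts in this post).
CALIBRATED_THRESHOLD = 75

def deliver_or_escalate(prompt: str,
                        generate: Callable[[str], str],
                        verify: Callable[[str], Audit],
                        escalate: Callable[[str, Audit], str]) -> str:
    draft = generate(prompt)        # AI executes
    audit = verify(draft)           # AI verifies AI
    if audit.score >= CALIBRATED_THRESHOLD:
        return draft                # meets the pre-set standard: ships without review

    # AI fixes AI: regenerate with the audit findings folded into the request.
    revised = generate(prompt + "\n\nResolve these audit findings: " + "; ".join(audit.findings))
    audit = verify(revised)
    if audit.score >= CALIBRATED_THRESHOLD:
        return revised

    return escalate(revised, audit) # human decides only at the edge
```

The human never reads every draft; they read the audit only when the system cannot meet the standard they already chose.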
TPMN is not a checker
TPMN is not a validator, a linter, or a hallucination detector.
TPMN is an epistemic gauge.
It shows what's grounded, what's inferred, what's extrapolated. It fingerprints why outputs differ across sessions. It generates the contracts that stabilize structure. It collects human calibration signals and turns them into a standard.
It doesn't decide. You do.
We're calling the full suite GEM2 Epistemic Studio — 15 tools across four functional groups: analysis, contract authoring, calibration, and execution. TPMN Checker is one group inside it.
Try it on your own output
- Paste any AI output into your conversation.
- Ask: "Verify this by gem2 truth filter."
- Read the score. See what's grounded vs extrapolated.
- Ask: "Create a grounded replacement prompt using gem2 contract writer."
- Run it again. Watch the difference.
Your AI picks the right tool from 15 available MCP tools automatically. No configuration. No TPMN knowledge required.
The goal isn't a higher score. It's a score you understand and a standard you chose.
→ Try it free at gemsquared.ai
What comes after prompting
The industry is still in the prompting era. Better prompts, longer context, chain-of-thought — all useful, all insufficient.
The next step isn't better prompting. It's verification as infrastructure.
AI generates.
AI verifies.
AI refines.
Human decides at the edge.
We didn't make AI smarter. We made it accountable.
That's measurable: 35% → 75% on the same task, with the same AI, using nothing but a formal contract and one human judgment call.
GEM2 Epistemic Studio — 15 tools, 6 domains, 3 providers. Free to start.
Built by Inseok Seo (David) — GEM²-AI
→ gemsquared.ai
→ TPMN-PSL Specification (open, CC-BY 4.0)
→ GitHub
→ Part 1: Three AIs analyzed our product
→ Part 2: Human at the edge