Stop letting the prompt be your state machine
You shipped an LLM feature six months ago. Now the same user input produces wildly different outputs depending on... nothing you can point to. Was it the sampling? A context window that filled up and dropped a chunk? Nobody knows. This is what happens when the prompt becomes your runtime.
The trap: the prompt as an accidental runtime
Here is what the trap looks like in TypeScript:
async function handleUserRequest(input: string): Promise<string> {
  const prompt = `
    You are a helpful assistant.
    The user said: ${input}
    Previous context: ${someGlobalContext}
    Decide what to do, gather any information you need,
    format the response, and return it.
  `;
  return llm.complete(prompt);
}
The model is doing everything here: deciding the intent, gathering data, formatting output, choosing what to persist. That is a footgun. You handed the runtime to a stochastic function.
Gartner attributes many failed agentic AI projects to unclear value and inadequate risk controls. Deterministic, testable workflows address both. The fix is not a better prompt. The fix is to stop using the prompt as an architecture.
What "deterministic" can and cannot mean here
Be honest about what you can and cannot control.
You cannot control: the model's exact output. It is probabilistic by design.
You can control:
- The shape of the output (structured output plus schema validation)
- The steps that run before and after the model call
- What data enters the model
- What happens when the output fails validation
- Whether a human reviews the result before it commits to anything irreversible
Determinism here means: the same inputs, the same workflow steps, the same guardrails every time. Not the same tokens every time. That is a realistic and achievable target. It is also the thing teams skip when they are moving fast.
Typed workflow steps around the model call
Break the work into discrete typed steps. Each step has a clear input type and a clear output type. The model call is one step in the pipeline, not the whole thing.
type WorkflowInput = {
  userId: string;
  rawRequest: string;
};

type EnrichedInput = WorkflowInput & {
  userContext: UserContext;
  relevantDocs: string[];
};

type ModelOutput = {
  intent: "summarize" | "search" | "draft" | "unknown";
  confidence: number;
  payload: string;
};

type WorkflowResult = {
  response: string;
  audit: {
    intent: string;
    humanReviewed: boolean;
  };
};

async function enrich(input: WorkflowInput): Promise<EnrichedInput> {
  const [userContext, relevantDocs] = await Promise.all([
    fetchUserContext(input.userId),
    fetchRelevantDocs(input.rawRequest),
  ]);
  return { ...input, userContext, relevantDocs };
}

async function classify(enriched: EnrichedInput): Promise<ModelOutput> {
  // Model call is isolated here, not scattered everywhere
  const raw = await llm.complete(buildClassificationPrompt(enriched));
  return parseAndValidate(raw);
}

async function respond(output: ModelOutput): Promise<WorkflowResult> {
  const response = await generateResponse(output);
  return {
    response,
    audit: { intent: output.intent, humanReviewed: false },
  };
}

async function runWorkflow(input: WorkflowInput): Promise<WorkflowResult> {
  const enriched = await enrich(input);
  const classified = await classify(enriched);
  return respond(classified);
}
Each step is independently unit-testable. You can mock classify to return a fixed ModelOutput and test respond in complete isolation. That was impossible when the prompt was the runtime.
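To make the testing claim concrete, here is a minimal sketch of testing respond against a fixed ModelOutput. It assumes one change the code above does not make: the response generator is passed in as a parameter (the hypothetical Generate type) so the test can swap in a stub. The names fixedOutput, stubGenerate, and testRespond are illustrative only.

```typescript
type ModelOutput = {
  intent: "summarize" | "search" | "draft" | "unknown";
  confidence: number;
  payload: string;
};

type WorkflowResult = {
  response: string;
  audit: { intent: string; humanReviewed: boolean };
};

// Assumption: the generator is injected so a test can replace the real model call.
type Generate = (output: ModelOutput) => Promise<string>;

async function respond(
  output: ModelOutput,
  generate: Generate
): Promise<WorkflowResult> {
  const response = await generate(output);
  return {
    response,
    audit: { intent: output.intent, humanReviewed: false },
  };
}

// A fixed ModelOutput stands in for whatever classify would have returned.
const fixedOutput: ModelOutput = {
  intent: "summarize",
  confidence: 0.93,
  payload: "Q3 report",
};

// Deterministic stub: no network call, no sampling.
const stubGenerate: Generate = async (o) => `[${o.intent}] ${o.payload}`;

// Throws if respond does not pass the stubbed text and intent through unchanged.
async function testRespond(): Promise<void> {
  const result = await respond(fixedOutput, stubGenerate);
  if (result.response !== "[summarize] Q3 report") {
    throw new Error(`unexpected response: ${result.response}`);
  }
  if (result.audit.intent !== "summarize" || result.audit.humanReviewed) {
    throw new Error("unexpected audit record");
  }
}
```

In a real project the body of testRespond would live inside whatever test runner you already use. The point is only that no model is involved anywhere in the test.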
Structured output + schema validation as a contract
The model call step should never return a raw string when you need structured data. Use JSON mode, tool calling, or a schema-constrained completion, then validate immediately.
import { z } from "zod";

const ModelOutputSchema = z.object({
  intent: z.enum(["summarize", "search", "draft", "unknown"]),
  confidence: z.number().min(0).max(1),
  payload: z.string().min(1),
});

async function classify(enriched: EnrichedInput): Promise<ModelOutput> {
  const raw = await llm.complete(buildClassificationPrompt(enriched), {
    response_format: { type: "json_object" },
  });

  // Malformed JSON breaks the contract too, so report it as a validation error.
  // Assumes ClassificationValidationError accepts any underlying error as its first argument.
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    throw new ClassificationValidationError(err, raw);
  }

  const result = ModelOutputSchema.safeParse(parsed);
  if (!result.success) {
    throw new ClassificationValidationError(result.error, raw);
  }
  return result.data;
}
Zod gives you a contract. If the model drifts, the validation catches it before the rest of your app sees the output. The answer to "how do you validate LLM responses?" is: schema validation on parse, not on trust.
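As a rough illustration of what "catching drift" looks like, the sketch below runs two made-up completions through the same schema. The raw strings goodRaw and driftedRaw are invented for this example, and the schema is repeated only so the snippet stands on its own.

```typescript
import { z } from "zod";

// Same shape as the schema above, repeated so the sketch is self-contained.
const ModelOutputSchema = z.object({
  intent: z.enum(["summarize", "search", "draft", "unknown"]),
  confidence: z.number().min(0).max(1),
  payload: z.string().min(1),
});

// Hypothetical raw completions: one conforming, one that has drifted.
const goodRaw = '{"intent":"search","confidence":0.91,"payload":"pricing docs"}';
const driftedRaw = '{"intent":"lookup","confidence":"high","payload":""}';

const good = ModelOutputSchema.safeParse(JSON.parse(goodRaw));
const drifted = ModelOutputSchema.safeParse(JSON.parse(driftedRaw));

// Names of the fields that failed validation (empty when the parse succeeded).
const driftedFields: string[] = drifted.success
  ? []
  : drifted.error.issues.map((issue) => issue.path.join("."));

console.log("conforming output accepted:", good.success);
console.log("drifted output accepted:", drifted.success);
console.log("fields that broke the contract:", driftedFields);
```

The drifted completion is still valid JSON, which is why JSON mode alone is not enough: only the schema check notices the unknown intent, the string where a number belongs, and the empty payload.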
Retries, idempotency, and failure gates
Validation failures should neither pass silently nor take down the whole request. Wrap the model call with a retry budget and a typed failure signal:
type ClassifyResult =
  | { ok: true; data: ModelOutput }
  | { ok: false; reason: "validation" | "timeout" | "rate_limit"; raw?: string };

async function classifySafe(
  enriched: EnrichedInput,
  maxAttempts = 2
): Promise<ClassifyResult> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const data = await classify(enriched);
      return { ok: true, data };
    } catch (err) {
      if (err instanceof ClassificationValidationError) {
        if (attempt < maxAttempts) continue; // one retry on schema failure
        return { ok: false, reason: "validation", raw: err.raw };
      }
      if (err instanceof RateLimitError) {
        return { ok: false, reason: "rate_limit" };
      }
      // TimeoutError, like RateLimitError, is assumed to come from your LLM client
      if (err instanceof TimeoutError) {
        return { ok: false, reason: "timeout" };
      }
      throw err; // unknown errors are not validation failures, so let them surface
    }
  }
  // Only reachable when maxAttempts < 1
  return { ok: false, reason: "validation" };
}
Idempotency matters when retries touch external state. If your workflow calls an API inside the model step, wrap it in an idempotency key so a retry does not double the side effect. The workflow layer controls this. The model itself cannot.
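One way to picture the idempotency key is the sketch below. Everything in it is an assumption made for illustration: runOnce, idempotencyKey, and sendEmail are hypothetical names, and the in-memory Map stands in for the durable store (a database table, Redis, or the downstream API's own idempotency support) that a production system would need.

```typescript
// Stand-in for a durable store; an in-memory Map does not survive restarts.
const completed = new Map<string, unknown>();

// Runs the side effect at most once per key and replays the recorded result on retry.
// Note: two concurrent calls with the same key could both run the effect here;
// a real store would need an atomic "insert if absent" to close that gap.
async function runOnce<T>(key: string, effect: () => Promise<T>): Promise<T> {
  if (completed.has(key)) {
    return completed.get(key) as T;
  }
  const result = await effect();
  completed.set(key, result);
  return result;
}

// Hypothetical key scheme: one key per (workflow run, step) pair.
function idempotencyKey(runId: string, step: string): string {
  return `${runId}:${step}`;
}

// Fake side effect that counts how often it actually ran.
let emailsSent = 0;
async function sendEmail(): Promise<string> {
  emailsSent += 1;
  return `email-${emailsSent}`;
}

async function demo(): Promise<{ first: string; second: string; sent: number }> {
  const key = idempotencyKey("run-42", "send-email");
  const first = await runOnce(key, sendEmail);
  const second = await runOnce(key, sendEmail); // simulated retry of the same step
  return { first, second, sent: emailsSent };
}
```

The key is derived by the workflow layer from things it already knows (the run and the step), never from model output, so a retry after a validation failure maps to the same key.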
Where a human gate belongs
A hybrid memory and retrieval approach (automatic retrieval at request start plus explicit storage) keeps agent state predictable. So does knowing when not to automate the final step.
High-impact or irreversible steps should route to a human via a control gate before committing. Not because LLMs are bad. Because some decisions carry real consequences and the cost of a wrong one outweighs the automation gain.
async function runWorkflow(input: WorkflowInput): Promise<WorkflowResult> {
  const enriched = await enrich(input);
  const classifyResult = await classifySafe(enriched);

  if (!classifyResult.ok) {
    return queueForHumanReview(enriched, classifyResult.reason);
  }

  const { data: classified } = classifyResult;

  // A low-confidence "draft" intent routes to human review instead of auto-responding
  if (classified.intent === "draft" && classified.confidence < 0.85) {
    return queueForHumanReview(enriched, "low confidence on draft intent");
  }

  return respond(classified);
}
The control gate is a typed branch in your workflow, not a prompt instruction. "Only do this if you are sure" is not a guardrail. A typed branch is.
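Because the gate is ordinary code, it can also be pulled out into a pure function and tested without a model in the loop. The sketch below is one possible shape, not a prescribed one: GateDecision, gate, and REVIEW_THRESHOLD are hypothetical names, and routing "unknown" intents to review is an added assumption that the workflow above does not make.

```typescript
type ModelOutput = {
  intent: "summarize" | "search" | "draft" | "unknown";
  confidence: number;
  payload: string;
};

type GateDecision =
  | { action: "proceed" }
  | { action: "human_review"; reason: string };

// Same cutoff as the workflow above; tune it against your own data.
const REVIEW_THRESHOLD = 0.85;

// Pure function: same ModelOutput in, same decision out.
function gate(output: ModelOutput): GateDecision {
  // Assumption for this sketch: an unrecognized intent is never auto-handled.
  if (output.intent === "unknown") {
    return { action: "human_review", reason: "unrecognized intent" };
  }
  if (output.intent === "draft" && output.confidence < REVIEW_THRESHOLD) {
    return { action: "human_review", reason: "low confidence on draft intent" };
  }
  return { action: "proceed" };
}

const shaky = gate({ intent: "draft", confidence: 0.6, payload: "refund email" });
const solid = gate({ intent: "draft", confidence: 0.95, payload: "refund email" });
console.log(shaky.action, solid.action);
```

Keeping the decision as a returned value rather than a side effect means the policy can be covered by plain unit tests and changed in one place.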
If you want to go deeper on how this fits into a full system, I wrote up the production architecture for agents including how to wire these patterns together at scale.
FAQ
How do you make LLM output deterministic?
You cannot make the model itself deterministic. You make the system deterministic around it. Schema-validated structured output, typed workflow steps, and retry gates with failure signals are the practical levers. The model is one isolated black-box step in an otherwise typed, testable pipeline.
What is structured output?
Structured output means the model returns data in a schema you define rather than freeform prose. Most providers support JSON mode or function calling. You parse and validate the result immediately with a schema library. If it does not match the schema, treat it as a failed call, not a soft warning.
How do you validate LLM responses?
Parse the response as JSON, then run it through a schema validator. Zod is a common choice in TypeScript projects. A safeParse call gives you a typed result: success with data or failure with an error you can act on. Failure is an exception to handle, not a case to log and move on.
If you want a deeper look at how deterministic workflows fit into a full production system, I cover the complete production architecture for agents on my site.
If you want Next.js for AI products wired up end to end, that is exactly the kind of work I take on.
Drop a comment below. Curious what patterns people use to keep LLM features testable in production.
