Mary Olowu

Posted on May 20

LLMs Are Probabilistic. Your Workflow Shouldn't Be.

#ai #architecture #automation #devops

Most AI app demos make the same mistake:

they treat the model like the application.

Prompt in, answer out, maybe a few tool calls, and then everybody acts surprised when the thing behaves unpredictably in production.

The problem is not that the model is useless.

The problem is that we keep asking a probabilistic component to behave like deterministic workflow code.

That is the wrong boundary.

The better pattern is simpler:

let the model interpret
let software validate
let the workflow own state
let high-risk actions require approval

If you remember nothing else, remember this:

Do not let the model own irreversible state transitions.

Why This Matters

The reliability problem is not hypothetical anymore.

Stanford HAI's 2026 AI Index reports that in a new accuracy benchmark, hallucination rates across 26 top models ranged from 22% to 94%. The same report says documented AI incidents rose from 233 in 2024 to 362 in 2025.

And yet adoption keeps rising. In the 2026 AI Index economy chapter, 88% of surveyed organizations reported using AI in at least one business function in 2025, and 79% reported regular generative-AI use in at least one function.

So yes, teams are shipping this stuff.

But the same chapter also says scaled AI-agent use stayed in the single digits across nearly all business functions.

That makes sense. People want the upside of LLMs without letting them quietly become the database, the rules engine, and the compliance department at the same time.

Even the Model Vendors Are Telling You This

Anthropic's "Building Effective Agents" says the most successful implementations they saw used simple, composable patterns, and draws a clear line between:

workflows, where LLMs and tools are orchestrated through predefined code paths
agents, where models dynamically direct their own processes and tool usage

They also recommend starting with the simplest solution possible and only adding complexity when needed.

OpenAI says something similar in a different way. In the Structured Outputs launch post, they explicitly note that model behavior is inherently non-deterministic, and that improved model performance alone still fell short of the reliability developers need for robust applications. Their answer was not "prompt harder." It was deterministic, constrained decoding around model output.

That is the pattern.

Use the model where probability is acceptable.
Use deterministic engineering where correctness has to be enforced.

The Architectural Rule

Here is the rule I trust:

LLMs should propose actions. Software should decide whether those actions are allowed.

That means the model is great for:

intent extraction
document understanding
summarization
classification
drafting
tool selection in bounded contexts

And it should usually not be the final authority for:

permissions
pricing
payment execution
account state
refunds
contract interpretation
compliance gates
destructive writes

If the model says "refund this order," that is not a refund. That is a recommendation.

Your application should still check:

does the order exist?
is it refundable?
is the amount within policy?
does the caller have permission?
is there already a refund in flight?
does this require human approval?

That logic belongs in code, not in hope.
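To make that concrete, here is a minimal sketch of those checks as plain Python. Everything in it is an assumption made for illustration: the `Order` fields, the `AUTO_APPROVE_LIMIT_CENTS` threshold, and the `check_refund` name are hypothetical, and amounts are held in integer cents to avoid float rounding.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical order record, invented for this sketch.
@dataclass
class Order:
    order_id: str
    paid_cents: int
    refundable: bool
    refund_in_flight: bool

# Assumed policy threshold: refunds above this need a human.
AUTO_APPROVE_LIMIT_CENTS = 10_000

def check_refund(order: Optional[Order], amount_cents: int, caller_can_refund: bool) -> str:
    """Return 'allowed', 'needs_approval', or 'rejected: <reason>'."""
    if order is None:
        return "rejected: order not found"
    if not caller_can_refund:
        return "rejected: caller lacks permission"
    if not order.refundable:
        return "rejected: order not refundable"
    if order.refund_in_flight:
        return "rejected: refund already in flight"
    if amount_cents <= 0 or amount_cents > order.paid_cents:
        return "rejected: amount outside policy"
    if amount_cents > AUTO_APPROVE_LIMIT_CENTS:
        return "needs_approval"
    return "allowed"
```

The model never appears in this function. It only supplies the arguments, and the answer comes from rules you can read, test, and change without touching a prompt.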

The Boring Architecture That Wins

This is the stack I would trust in production.

1. LLM as interpreter

The model turns unstructured input into a candidate action:

{
  "intent": "issue_refund",
  "order_id": "ord_123",
  "reason": "duplicate_charge",
  "amount": 49.0
}

2. Typed output contract

Do not parse vibes. Parse a schema.

OpenAI's Structured Outputs guide exists for a reason: even when model quality improves, you still need deterministic enforcement around output shape.
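As a rough illustration, the sketch below parses the candidate action from step 1 into a typed object using only the standard library. In practice you would probably lean on your provider's structured-output feature or a schema library; the `RefundProposal` class, `parse_proposal` function, and `ALLOWED_INTENTS` set are hypothetical names for this example.

```python
import json
from dataclasses import dataclass

ALLOWED_INTENTS = {"issue_refund"}  # assumed: the only intent this sketch accepts
REQUIRED_FIELDS = {"intent": str, "order_id": str, "reason": str, "amount": (int, float)}

@dataclass(frozen=True)
class RefundProposal:
    intent: str
    order_id: str
    reason: str
    amount: float

def parse_proposal(raw: str) -> RefundProposal:
    """Turn raw model output into a RefundProposal, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != set(REQUIRED_FIELDS):
        diff = sorted(set(data) ^ set(REQUIRED_FIELDS))
        raise ValueError(f"unexpected or missing fields: {diff}")
    for name, expected in REQUIRED_FIELDS.items():
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"field {name!r} has the wrong type")
    if data["intent"] not in ALLOWED_INTENTS:
        raise ValueError(f"unknown intent: {data['intent']!r}")
    return RefundProposal(
        intent=data["intent"],
        order_id=data["order_id"],
        reason=data["reason"],
        amount=float(data["amount"]),
    )
```

Anything that does not match the contract fails loudly at the boundary, before any business logic sees it.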

3. Deterministic validator

Now your real application runs checks:

schema validation
authorization (authz)
business rules
idempotency
resource existence
threshold checks
rate limits
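One way to wire such checks together is an ordered list of small functions plus an idempotency guard, sketched below. The names (`Validator`, `idempotency_key`, `known_order`, `amount_under_threshold`) are hypothetical, the in-memory set stands in for a durable store, and hashing the proposal body is a simplification: a real system would more likely key on a request ID, since two legitimate requests can look identical.

```python
import hashlib
import json
from typing import Callable, Optional

# A check takes the proposal and returns None on success or a failure reason.
Check = Callable[[dict], Optional[str]]

def idempotency_key(proposal: dict) -> str:
    """Stable key: the same proposal content always hashes to the same value."""
    canonical = json.dumps(proposal, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class Validator:
    def __init__(self, checks: list[Check]):
        self.checks = checks
        self.seen: set[str] = set()  # in-memory stand-in for a durable store

    def validate(self, proposal: dict) -> tuple[bool, str]:
        key = idempotency_key(proposal)
        if key in self.seen:
            return False, "duplicate proposal"
        for check in self.checks:
            reason = check(proposal)
            if reason is not None:
                return False, reason
        # Only remember proposals that passed, so a rejected one can be retried.
        self.seen.add(key)
        return True, "ok"

def known_order(proposal: dict) -> Optional[str]:
    # Assumed lookup: a real check would query the order store.
    return None if proposal["order_id"] in {"ord_123"} else "unknown order"

def amount_under_threshold(proposal: dict) -> Optional[str]:
    # Assumed threshold of 100, purely for illustration.
    return None if proposal["amount"] <= 100 else "amount over threshold"
```

Because each check is a plain function returning a reason, the first failure is both the decision and the explanation you log.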

4. Workflow engine or state machine

The model does not own the state transition. Your workflow does.

For example:

requested -> validated -> approved -> executed -> recorded

If validation fails, the workflow branches. If approval is required, it pauses. If execution fails, it retries or dead-letters.
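A small sketch of that idea, assuming a hand-rolled state machine rather than any particular workflow engine. The `REJECTED` and `DEAD_LETTERED` states and the exact transition table are my assumptions about how the branches described above might be modeled.

```python
from enum import Enum

class State(Enum):
    REQUESTED = "requested"
    VALIDATED = "validated"
    APPROVED = "approved"
    EXECUTED = "executed"
    RECORDED = "recorded"
    REJECTED = "rejected"            # assumed branch for failed validation or denial
    DEAD_LETTERED = "dead_lettered"  # assumed branch for exhausted retries

# The only moves the workflow will ever accept.
TRANSITIONS = {
    State.REQUESTED: {State.VALIDATED, State.REJECTED},
    State.VALIDATED: {State.APPROVED, State.REJECTED},
    State.APPROVED: {State.EXECUTED, State.DEAD_LETTERED},
    State.EXECUTED: {State.RECORDED},
    State.RECORDED: set(),
    State.REJECTED: set(),
    State.DEAD_LETTERED: set(),
}

class Workflow:
    def __init__(self):
        self.state = State.REQUESTED
        self.history = [State.REQUESTED]

    def advance(self, target: State) -> None:
        """Move to target if the table allows it, otherwise raise."""
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition: {self.state.value} -> {target.value}")
        self.state = target
        self.history.append(target)
```

No matter what the model proposes, a request cannot jump from requested to executed, because the table has no such edge.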

5. Scoped tools

Tools should be narrow, explicit, and permissioned.

OpenAI's practical guide to building agents recommends assessing tool risk based on things like write access, reversibility, and financial impact, then pausing or escalating to a human for high-risk functions.

That is not compliance theater. That is basic architecture.
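Here is one possible shape for that, as a sketch. The `Tool` and `ToolRegistry` classes, the `ApprovalRequired` exception, and the risk rule itself (money movement or an irreversible write counts as high risk) are assumptions for illustration, not something taken from OpenAI's guide.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[dict], str]
    writes: bool
    reversible: bool
    moves_money: bool

    @property
    def high_risk(self) -> bool:
        # Assumed rule: money movement or an irreversible write needs a human.
        return self.moves_money or (self.writes and not self.reversible)

class ApprovalRequired(Exception):
    """Raised when a high-risk tool is called without a named approver."""

class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, args: dict, approved_by: Optional[str] = None) -> str:
        tool = self._tools.get(name)
        if tool is None:
            # The model can only reach tools that were explicitly registered.
            raise KeyError(f"tool not registered: {name}")
        if tool.high_risk and approved_by is None:
            raise ApprovalRequired(name)
        return tool.run(args)
```

The risk flags live on the tool definition, so the gate does not depend on the model describing its own action honestly.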

6. Tracing, logs, and evals

If you cannot inspect which prompt, tool result, validation step, and branch decision led to an action, you are not debugging a system. You are guessing.
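Even a very small trace object gets you most of the way. The sketch below assumes a hypothetical `RunTrace` class that collects ordered events for one run and serializes them as JSON Lines; a real setup would likely ship these to whatever tracing or logging backend you already use.

```python
import json
import time
import uuid

class RunTrace:
    """Collects ordered events for a single workflow run."""

    def __init__(self):
        self.run_id = str(uuid.uuid4())
        self.events = []

    def record(self, step: str, **detail) -> None:
        # seq preserves ordering even if two events share a timestamp.
        self.events.append({
            "run_id": self.run_id,
            "seq": len(self.events),
            "ts": time.time(),
            "step": step,
            "detail": detail,
        })

    def to_jsonl(self) -> str:
        """One JSON object per line, ready to append to a log file."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)
```

A run might record `trace.record("prompt", template="refund_v1")`, then `trace.record("validation", passed=True)`, then `trace.record("branch", decision="needs_approval")`, which is enough to reconstruct why an action happened.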

The Anti-Pattern

Here is the version I do not trust:

User message goes straight to agent.
Agent decides what the business rule probably is.
Agent writes to the database.
Agent sends the email.
Agent updates the CRM.
Everybody hopes the prompt was good enough.

This looks fast in a demo because all the complexity is hidden.

It breaks in production because all the accountability is hidden too.

When something goes wrong, you will not know whether the failure came from:

bad retrieval
bad prompt assumptions
stale tool data
missing permission checks
race conditions
duplicate writes
a plain old hallucination

And worse, you may not know until after money moved or customers were contacted.

NIST Is Basically Handing You the Blueprint

NIST's AI Risk Management Framework is not just policy paperwork. It is practical engineering guidance if you read it that way.

Some of the most useful parts for builders:

define and differentiate human and AI roles
document knowledge limits and how outputs will be overseen
test systems before deployment and regularly in production
monitor functionality and behavior in production
make sure systems can fail safely beyond their knowledge limits
document risks, controls, and third-party dependencies

That is just good software engineering with a better vocabulary.

NIST's Generative AI Profile goes further and says generative AI may require additional human review, tracking, documentation, and management oversight.

Which is exactly what experienced teams discover after the first few production incidents anyway.

A Good Mental Model

Think about the LLM as a perception layer, not a transaction layer.

It helps you turn messy human input into structured candidates.

It does not get to redefine the core invariants of your system.

So instead of this:

model -> action

Build this:

model -> proposal -> validation -> policy check -> approval (if needed) -> execution -> audit log

That extra machinery is not bureaucracy.

It is the difference between an AI feature and an AI system you can trust.

What I Would Build First

If I were building an AI workflow today, I would start in this order:

One narrow use case with real pain.
One model call that returns typed output.
One validator layer with explicit business rules.
One small set of tools with clear permissions.
One approval path for high-risk actions.
One trace per run.
One eval set based on real failures, not synthetic optimism.
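For the last item, an eval set can start as a list of inputs that once went wrong, replayed against the interpreter step. In this sketch the two cases are invented placeholders, and `run_evals` and `keyword_stub` are hypothetical names; the stub stands in for the real model call so the example runs offline.

```python
from typing import Callable

# Each case pairs an input with the intent it should map to.
# These two cases are invented placeholders, not real incident data.
EVAL_CASES = [
    {"input": "I was charged twice for order ord_123", "expected_intent": "issue_refund"},
    {"input": "Where is my package?", "expected_intent": "track_order"},
]

def run_evals(interpret: Callable[[str], dict], cases: list[dict]) -> dict:
    """Replay every case and collect the ones that do not match."""
    failures = []
    for case in cases:
        try:
            got = interpret(case["input"]).get("intent")
        except Exception as exc:  # a crash counts as a failed case, not a crashed run
            got = f"error: {exc}"
        if got != case["expected_intent"]:
            failures.append({
                "input": case["input"],
                "expected": case["expected_intent"],
                "got": got,
            })
    return {"passed": len(cases) - len(failures), "total": len(cases), "failures": failures}

def keyword_stub(text: str) -> dict:
    """Stand-in for the model call so the sketch runs without a network."""
    return {"intent": "issue_refund" if "charged" in text else "unknown"}
```

Calling `run_evals(keyword_stub, EVAL_CASES)` reports one pass and one failure, which is the point: every production surprise becomes a new case, and the set grows from what actually broke.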

Anthropic's guidance to start simple is right. OpenAI's guardrail guidance is right. The teams getting burned are usually the ones skipping the boring layers because the model looked good in staging.

The Take

AI does not remove the need for software architecture.

It raises the price of getting architecture wrong.

LLMs are powerful because they handle ambiguity well. They are dangerous when you let that ambiguity leak into the parts of your system that are supposed to be exact.

So let the model read.
Let the model classify.
Let the model draft.
Let the model suggest.

But let deterministic systems own:

truth
policy
permissions
state transitions
side effects

AI makes mistakes.

That is not a reason to avoid building with it.

It is a reason to build the part around it like an engineer.

Sources

Stanford HAI, 2026 AI Index, Responsible AI chapter: https://hai.stanford.edu/ai-index/2026-ai-index-report/responsible-ai
Stanford HAI, 2026 AI Index, Economy chapter: https://hai.stanford.edu/assets/files/ai_index_report_2026_chapter_4_economy.pdf
Anthropic, "Building Effective Agents": https://www.anthropic.com/engineering/building-effective-agents
OpenAI, "Introducing Structured Outputs in the API": https://openai.com/index/introducing-structured-outputs-in-the-api/
OpenAI, "A practical guide to building agents": https://openai.com/business/guides-and-resources/a-practical-guide-to-building-ai-agents/
NIST AI RMF Core: https://airc.nist.gov/airmf-resources/airmf/5-sec-core/
NIST Generative AI Profile (NIST-AI-600-1): https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf

DEV Community