Glendel Joubert Fyne Acosta

Posted on May 30

Building AI Workflows Is Easy. Making Them Reliable Is Systems Engineering

#ai #opensource #architecture #machinelearning

Building the first version of an AI workflow is usually easy.

Connect an LLM to a few tools.
Add some instructions.
Let the model decide what to do next.
Run the demo.
It works.

The problem starts later, when that workflow becomes part of a real process.

Suddenly the important questions are not about the prompt anymore.

They are about reliability.

What happens when a tool fails?
What happens when the model retries the wrong thing?
What happens when the workflow changes state but the agent still claims failure?
What happens when the agent claims success but no tool actually ran?
What happens when one agent hands bad context to another agent?

This is where AI workflows stop being prompt engineering.

They become Systems Engineering.

The Demo Is Not The System

A lot of AI workflow demos optimize for the happy path.

The user asks for something.
The agent thinks.
The agent calls a tool.
The tool returns a result.
The agent summarizes the result.
Everyone claps.

But production workflows do not live on the happy path.

They live in the messy reality of:

Partial failures
Bad inputs
Timeout errors
Invalid tool responses
Duplicate retries
Missing context
Permission denials
State inconsistencies
Cost limits
Human approvals
Recovery paths

The first version proves that the idea is possible.

The production version needs to prove that the system is dependable.

Those are very different goals.

Prompts Can Guide Reasoning. They Cannot Manage Reliability.

Prompts are important.

They help the model understand:

What role it is playing
What goal it should pursue
How it should reason
What tone it should use
What constraints it should consider

But prompts should not be responsible for the reliability of the whole workflow.

A prompt should not be the only thing preventing an unsafe action.

A prompt should not be the only thing remembering which step already completed.

A prompt should not be the only thing deciding whether a retry is safe.

A prompt should not be the only thing proving that a tool actually executed.

Once an AI workflow affects real systems, the runtime needs to take responsibility for the parts that require consistency.
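As one illustration of the runtime taking that responsibility, a permission gate can live entirely outside the prompt. The sketch below is a minimal default-deny check. The `AllowRule` shape, the `checkPermission` function, and the actor and action names are assumptions made for illustration, not part of any specific framework.

```typescript
// Hypothetical allow rule: one actor may perform one action.
interface AllowRule {
  actor: string;
  action: string;
}

interface PermissionDecision {
  allowed: boolean;
  reason: string;
}

// Default-deny: an action is allowed only if an explicit rule matches.
// The model's instructions play no part in this decision.
function checkPermission(
  actor: string,
  action: string,
  rules: AllowRule[]
): PermissionDecision {
  const matched = rules.some((r) => r.actor === actor && r.action === action);
  return matched
    ? { allowed: true, reason: "matched an explicit allow rule" }
    : { allowed: false, reason: "no allow rule for this actor and action" };
}

// Example rule set (illustrative names only).
const supportAgentRules: AllowRule[] = [
  { actor: "support-agent-01", action: "send_escalation_email" },
];
```

With these rules, a call such as `checkPermission("support-agent-01", "issue_refund", supportAgentRules)` is denied no matter what the prompt says, because the decision never passes through the model.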

"The model can reason. The system must govern."

The Core Split: Reasoning, Execution, State, Evidence

A reliable AI workflow needs a clean separation between four concerns:

Reasoning: The model handles reasoning.
Execution: The runtime handles execution.
State: The workflow engine manages state.
Evidence: The audit layer records evidence.

When these responsibilities are mixed together, debugging becomes painful.

For example, this is fragile:

const result = await agent.run(`
  Read the customer complaint,
  decide whether it needs escalation,
  send the email if needed,
  and tell me when you're done.
`);

Why?

Because too much is hidden inside one probabilistic step.

Did the agent actually send the email?
Was the action allowed?
Was the customer data valid?
Did the escalation rule trigger?
Did the email tool fail?
Was the final response based on evidence or assumption?

A more reliable architecture separates the work:

const decision = await agent.reason({
  task: "Should this complaint be escalated?",
  context
});

const permission = runtime.permissions.verify({
  actor: agent.id,
  action: "send_escalation_email",
  resource: complaint.id
});

if (!permission.allowed) {
  return runtime.recordDeniedAction(decision, permission);
}

const execution = await runtime.tools.sendEmail({
  to: escalationTeam,
  template: "complaint_escalation",
  complaintId: complaint.id
});

const evidence = runtime.audit.record({
  actor: agent.id,
  decision,
  permission,
  execution
});

return agent.summarize({
  evidenceId: evidence.id,
  executionStatus: execution.status
});

This is less magical.

It is also much easier to trust.

The Retry Problem

Retries are one of the most underestimated problems in AI workflows.

In traditional software, retrying a failed API call is usually straightforward, at least when the call is idempotent.

If the request times out, try again.

But AI workflows introduce different kinds of failure.

A tool call failing is not the same as a model reasoning step failing.
A network timeout is not the same as a bad plan.
A malformed JSON response is not the same as missing business context.
A low-quality answer is not the same as an unavailable dependency.

Different failures need different retry strategies.

For example:

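// Route each failure type to its own recovery path.
// The helper functions are placeholders for runtime-specific logic.
switch (failure.type) {
  case "tool_timeout":
    // Transient: safe to retry only if the tool call is idempotent.
    return retrySameToolCall();

  case "invalid_tool_payload":
    return askModelToRepairPayload();

  case "bad_reasoning":
    return resetContextAndReplan();

  case "permission_denied":
    return escalateToHuman();

  case "cost_budget_exceeded":
    return stopWorkflow();

  default:
    // Unknown failure types fail closed instead of retrying blindly.
    return stopWorkflow();
}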

If every failure is handled with "just run the agent again", the system can become expensive, slow, and unreliable.

Sometimes the correct retry is not retrying.

Sometimes the correct response is:

Reduce scope
Reset context
Ask for clarification
Escalate to a human
Stop the workflow
Record the failure

Cost-aware retries are not just a billing concern.

They are a reliability concern.
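One way to make retries cost-aware is to consult a budget before acting on the failure type. The sketch below assumes a hypothetical `RetryBudget` with attempt and cost limits and reuses the failure categories from the switch statement above. The action names and the `decideNextAction` function are illustrative.

```typescript
// Failure categories taken from the switch statement above.
type FailureKind =
  | "tool_timeout"
  | "invalid_tool_payload"
  | "bad_reasoning"
  | "permission_denied"
  | "cost_budget_exceeded";

type NextAction =
  | "retry_tool_call"
  | "repair_payload"
  | "replan"
  | "escalate_to_human"
  | "stop_workflow";

// Hypothetical budget tracked by the runtime for one workflow run.
interface RetryBudget {
  attemptsUsed: number;
  maxAttempts: number;
  costSpentUsd: number;
  maxCostUsd: number;
}

const ACTION_BY_FAILURE: Record<FailureKind, NextAction> = {
  tool_timeout: "retry_tool_call",
  invalid_tool_payload: "repair_payload",
  bad_reasoning: "replan",
  permission_denied: "escalate_to_human",
  cost_budget_exceeded: "stop_workflow",
};

// Pick the next action from the failure type, then let the budget override it.
function decideNextAction(failure: FailureKind, budget: RetryBudget): NextAction {
  const planned = ACTION_BY_FAILURE[failure];
  // An exhausted cost budget always stops the run, whatever the failure was.
  if (planned === "stop_workflow" || budget.costSpentUsd >= budget.maxCostUsd) {
    return "stop_workflow";
  }
  // Out of attempts: hand over to a person instead of looping.
  if (budget.attemptsUsed >= budget.maxAttempts) {
    return "escalate_to_human";
  }
  return planned;
}
```

Checking the budget first means a cheap, retryable failure such as a timeout still cannot push the run past its limits.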

State Must Be Explicit

A workflow that cannot explain its current state cannot be reliably recovered.

If an agent is halfway through a process, the system should know:

Which step is running
Which steps completed
Which tools executed
Which outputs were produced
Which approvals are pending
Which errors occurred
What can safely happen next

Without explicit state, recovery becomes guesswork.

This is especially dangerous when the workflow mutates external systems.

Imagine a workflow that:

1. Reads a customer complaint.
2. Creates an internal ticket.
3. Sends an escalation email.
4. Updates the CRM.
5. Marks the complaint as handled.

If the workflow fails at step 4, what should happen?

Should it restart from step 1?
Should it send the email again?
Should it create a duplicate ticket?
Should it mark the complaint as handled?

The answer depends on state.

Reliable workflows need checkpoints.

workflow.checkpoint("ticket_created", {
  ticketId,
  complaintId,
  timestamp
});

workflow.checkpoint("email_sent", {
  messageId,
  recipient,
  timestamp
});

Checkpoints make recovery possible.

They also make debugging possible.
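To show how checkpoints enable recovery, the sketch below resumes a workflow by skipping every step that already has a checkpoint. The `CheckpointLog` class, the `WorkflowStep` shape, and `runFromCheckpoints` are hypothetical names, and the log is held in memory only for illustration; a real one would need durable storage.

```typescript
// Hypothetical in-memory checkpoint log; a real one would be durable storage.
class CheckpointLog {
  private readonly entries: Record<string, unknown> = {};

  record(step: string, data: unknown): void {
    this.entries[step] = data;
  }

  has(step: string): boolean {
    return Object.prototype.hasOwnProperty.call(this.entries, step);
  }
}

interface WorkflowStep {
  name: string;
  // Performs the side effect and returns the data to store in the checkpoint.
  run: () => unknown;
}

// Run steps in order, skipping any step that already has a checkpoint.
// Returns the names of the steps that actually executed in this pass.
function runFromCheckpoints(steps: WorkflowStep[], log: CheckpointLog): string[] {
  const executed: string[] = [];
  for (const step of steps) {
    if (log.has(step.name)) {
      continue; // already completed: do not repeat the side effect
    }
    const data = step.run(); // if this throws, earlier checkpoints are kept
    log.record(step.name, data);
    executed.push(step.name);
  }
  return executed;
}
```

If the CRM update fails on the first pass, a second call to `runFromCheckpoints` with the same log runs only the CRM step; the ticket and email steps are skipped. One gap remains: a crash after a side effect but before its checkpoint is written would still repeat that step, which is why tool-level idempotency keys are a useful complement.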

Evidence Beats Claims

One of the most dangerous failure modes in AI workflows is false completion.

The agent says:

"Done, I sent the email."

But no email was sent.

Or the email tool failed.

Or permission was denied.

Or the agent never called the tool.

The model's final answer is not evidence.

It is a claim.

A reliable workflow should be able to prove what happened.

An evidence record might include:

{
  "actor": "support-agent-01",
  "action": "send_email",
  "permission": "granted",
  "tool": "email_sender",
  "status": "success",
  "messageId": "msg_123",
  "timestamp": "2026-05-29T14:32:10Z",
  "auditId": "audit_789"
}

Now the system can answer:

Who acted
What was requested
Whether it was allowed
What executed
What result came back
When it happened
What proves it

That is the difference between trusting the agent and trusting the system.
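A small sketch of that difference: before a completion claim is accepted, it can be checked against the recorded evidence. The `EvidenceRecord` fields follow the JSON example above; the `AgentClaim` shape and the `verifyClaim` function are assumptions for illustration.

```typescript
// Evidence record shape, based on the JSON example above.
interface EvidenceRecord {
  auditId: string;
  actor: string;
  action: string;
  permission: "granted" | "denied";
  status: "success" | "failure";
}

// What the agent says it did. This is a claim, not proof.
interface AgentClaim {
  actor: string;
  action: string;
}

type ClaimVerdict =
  | { verified: true; auditId: string }
  | { verified: false; reason: string };

// Accept a claim only if a matching record shows a permitted, successful run.
function verifyClaim(claim: AgentClaim, records: EvidenceRecord[]): ClaimVerdict {
  const matches = records.filter(
    (r) => r.actor === claim.actor && r.action === claim.action
  );
  if (matches.length === 0) {
    return { verified: false, reason: "no evidence record for this action" };
  }
  for (const record of matches) {
    if (record.permission === "granted" && record.status === "success") {
      return { verified: true, auditId: record.auditId };
    }
  }
  return { verified: false, reason: "action was recorded but did not succeed" };
}
```

The agent's summary can then cite the returned `auditId` instead of asserting success on its own authority.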

Multi-Agent Workflows Make Reliability Harder

Multi-agent systems (MAS) amplify every reliability problem.

In a single-agent workflow, one model may lose context or make a bad assumption.

In a multi-agent workflow, one agent's unsupported claim can become another agent's input.

For example:

Research Agent says it collected the correct data.
Analyst Agent uses that data to generate a report.
Reviewer Agent approves the report.
Communication Agent sends it to the customer.

If the first claim was wrong, the entire workflow becomes unreliable.

The final output may look coherent.

But the foundation is broken.

That is why multi-agent workflows need strong boundaries:

Explicit handoffs
Scoped context
Evidence records
Validation gates
Responsibility tracking
State checkpoints

Agents should not pass vague natural-language summaries to each other as if they were verified facts.
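One way to enforce this is a validation gate that the runtime applies before the receiving agent sees anything. The sketch below assumes a hypothetical `HandoffEnvelope` with structured fields (sender, receiver, task, artifact, evidence reference, status, and scope) and a list of evidence IDs the audit layer has actually recorded. The `validateHandoff` name and its messages are illustrative.

```typescript
// Hypothetical structured handoff between two agents.
interface HandoffEnvelope {
  from: string;
  to: string;
  task: string;
  artifactId: string;
  evidenceId: string;
  status: "verified" | "unverified";
  scope: string;
}

// Return a list of problems; an empty list means the handoff may proceed.
function validateHandoff(
  handoff: Partial<HandoffEnvelope>,
  knownEvidenceIds: string[]
): string[] {
  const problems: string[] = [];
  const required: Array<keyof HandoffEnvelope> = [
    "from", "to", "task", "artifactId", "evidenceId", "status", "scope",
  ];
  for (const field of required) {
    if (!handoff[field]) {
      problems.push(`missing field: ${field}`);
    }
  }
  if (handoff.status && handoff.status !== "verified") {
    problems.push("status is not verified");
  }
  // The evidence reference must point at something the audit layer recorded.
  if (handoff.evidenceId && knownEvidenceIds.indexOf(handoff.evidenceId) === -1) {
    problems.push("evidenceId does not match a recorded audit entry");
  }
  return problems;
}
```

A handoff that carries only a sender and a receiver fails this gate with a list of missing fields, so an unsupported claim stops at the boundary instead of becoming the next agent's input.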

A good handoff should include:

{
  "from": "research-agent",
  "to": "analyst-agent",
  "task": "analyze_customer_churn",
  "artifactId": "dataset_456",
  "evidenceId": "audit_123",
  "status": "verified",
  "scope": "Q1 customer data only"
}

That is much more reliable than:

"I collected the data. You can continue."

Observability Is Not Optional

Once AI workflows become operational, observability becomes foundational.

A useful trace should show:

What the model intended
What context it received
What action it requested
Whether permission was granted
What tool executed
What state changed
What evidence was recorded
What the agent claimed afterward

Without this, teams end up debugging through transcripts and guesses.

That does not scale.

Traditional logs tell you that something happened.

AI workflow observability needs to explain why something happened, what the model believed, what the runtime allowed, and what actually executed.

That means observability must include both:

Reasoning traces
Runtime evidence

One without the other is incomplete.
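As a rough sketch of what combining the two could look like, the trace below keeps the reasoning side and the runtime side in one record and flags where they disagree. The `WorkflowTrace` shape, its field names, and `findTraceGaps` are assumptions for illustration, not a standard tracing format.

```typescript
// What the model intended (hypothetical shape).
interface ReasoningTrace {
  intent: string;
  requestedAction: string;
}

// What the runtime allowed and executed (hypothetical shape).
interface RuntimeEvidence {
  permissionGranted: boolean;
  toolExecuted: boolean;
  evidenceId: string | null;
}

interface WorkflowTrace {
  traceId: string;
  reasoning: ReasoningTrace | null;
  runtime: RuntimeEvidence | null;
  agentClaimedSuccess: boolean;
}

// List what is missing or inconsistent in a single trace.
function findTraceGaps(trace: WorkflowTrace): string[] {
  const gaps: string[] = [];
  if (trace.reasoning === null) {
    gaps.push("no reasoning trace");
  }
  if (trace.runtime === null) {
    gaps.push("no runtime evidence");
    if (trace.agentClaimedSuccess) {
      gaps.push("agent claimed success without runtime evidence");
    }
    return gaps;
  }
  const { permissionGranted, toolExecuted, evidenceId } = trace.runtime;
  if (toolExecuted && !permissionGranted) {
    gaps.push("tool executed without permission");
  }
  if (toolExecuted && evidenceId === null) {
    gaps.push("tool executed but no evidence was recorded");
  }
  if (trace.agentClaimedSuccess && !toolExecuted) {
    gaps.push("agent claimed success but no tool executed");
  }
  return gaps;
}
```

An empty result means the model's account and the runtime's account line up; anything else is a concrete place to start debugging.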

The Architecture Pattern

A production AI workflow should not be one big prompt chain.

It should look more like this:

User Request
     ↓
Intent Resolution
     ↓
Context Assembly
     ↓
Model Reasoning
     ↓
Action Request
     ↓
Permission Check
     ↓
Tool Execution
     ↓
Evidence Record
     ↓
State Checkpoint
     ↓
Agent Summary
     ↓
Verification / Escalation

The model is still important.

But it is no longer responsible for everything.

It reasons inside a system that manages boundaries, execution, and recovery.

That is the shift.
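The ordering in the diagram can itself be enforced by the runtime. The sketch below is one hypothetical way to do it: a `PipelineRun` that only allows a move to the stage directly after the current one, so tool execution cannot happen without a permission check first. The stage identifiers are derived from the diagram; the class and method names are assumptions.

```typescript
// The stages from the diagram above, in the order they must occur.
const PIPELINE_STAGES = [
  "user_request",
  "intent_resolution",
  "context_assembly",
  "model_reasoning",
  "action_request",
  "permission_check",
  "tool_execution",
  "evidence_record",
  "state_checkpoint",
  "agent_summary",
  "verification_or_escalation",
] as const;

type PipelineStage = (typeof PIPELINE_STAGES)[number];

// Hypothetical guard: a run may only move to the stage right after its current one.
class PipelineRun {
  private position = 0; // every run starts at "user_request"

  currentStage(): PipelineStage {
    return PIPELINE_STAGES[this.position];
  }

  advanceTo(next: PipelineStage): void {
    if (this.position >= PIPELINE_STAGES.length - 1) {
      throw new Error("pipeline is already at its final stage");
    }
    const expected = PIPELINE_STAGES[this.position + 1];
    if (next !== expected) {
      throw new Error(
        `cannot move from ${this.currentStage()} to ${next}; expected ${expected}`
      );
    }
    this.position += 1;
  }
}
```

A run sitting at `action_request` that calls `advanceTo("tool_execution")` throws, because `permission_check` has not happened yet. A strictly linear order is a simplification; a real workflow would also need branches for denial, retries, and escalation.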

AI Workflows Are Operational Systems

When an AI workflow becomes part of a business process, it needs the same engineering discipline as any other operational system.

It needs:

Clear inputs
Explicit state
Bounded execution
Permission checks
Retry policies
Failure handling
Observability
Audit trails
Recovery paths
Verification gates

This is not bureaucracy.

This is what makes the workflow dependable.

The more responsibility we give AI agents, the more important the surrounding system becomes.

Conclusion

Building an AI workflow is easy.

Making it reliable is the hard part.

The future of AI agents will not be won only by better prompts or bigger models.

It will be won by better runtime architecture.

Prompts guide reasoning.

But reliable AI workflows need:

Checkpoints
Retries
Permissions
Execution boundaries
Observability
Audit trails
Evidence
Recovery

That is why production AI workflows are not just prompt engineering.

They are Systems Engineering.

Top comments (2)

Harjot Singh • May 31

This title is the whole thesis of production AI in one line. The demo is a prompt; reliability is retries, idempotency, timeouts, fallbacks, observability, graceful degradation when the model returns garbage - i.e. classic distributed-systems engineering, just with a non-deterministic component in the middle. The non-determinism is what makes it harder than normal systems work: you can't assume the same input gives the same output, so every step needs to be defensive.

The mental shift that helped me: treat the LLM as an unreliable network call, not a function. You'd never call a flaky external API without retries, validation, and a timeout - same posture for the model. That framing is the backbone of Moonshift (a multi-agent pipeline shipping a prompt to a real SaaS) - the orchestration treats each agent step as a fallible service with verification gates around it. ~$3 flat, first run free. Excellent framing - what's the reliability primitive people underestimate most? Mine's idempotency, because agents love to retry and double-execute side effects.

Glendel Joubert Fyne Acosta • Jun 1

Great framing: "treat the LLM as an unreliable network call" is exactly the right mental model.

I agree that idempotency is probably the most underestimated primitive, especially once agents can trigger side effects.

A retry is harmless when the step is pure reasoning. It becomes dangerous when the step sends an email, creates a ticket, charges a card, updates a CRM record, or triggers another workflow.

For me, the minimum reliable pattern is:

Stable operation IDs
Explicit workflow state
Side-effect deduplication
Tool-level idempotency keys
Evidence records for executed actions
Retry policy based on failure type

The hard part is that agents often retry semantically, not mechanically. They do not always repeat the exact same API call. They may reformulate the task and accidentally create a second side effect.

That is why retries need to be governed at the workflow/runtime layer, not left entirely to the model.