DEV Community: Glendel Joubert Fyne Acosta

AI Agents Don't Need More Memory. They Need Governed Recall.

Glendel Joubert Fyne Acosta — Thu, 18 Jun 2026 01:51:06 +0000

Most discussions of AI agent memory start from the same assumption:

If the agent forgets, give it more memory.

More chat history.
More retrieved documents.
More summaries.
More vector storage.
More context window.
More persistence.

But the more I look at real agent workflows, the more I think this framing is incomplete.

The hard problem is not simply giving agents more memory.

The hard problem is deciding what the agent is allowed to recall.

That is a different architectural problem.

And it matters a lot.

More Memory Is Not Always Better

At first, adding memory makes agents look smarter.

They remember previous conversations.
They reuse past decisions.
They recover project details.
They avoid asking the same questions again.
They feel more continuous.

But after a while, something strange happens.

The agent starts getting worse.

It recalls stale assumptions.
It treats old context as current state.
It uses generated summaries as if they were facts.
It mixes user preferences with workflow evidence.
It retrieves private or irrelevant information.
It acts on something that was true yesterday, but false today.

The agent is not failing because it forgot.

It is failing because it remembered without governance.

That is the uncomfortable truth:

More memory can make agents less reliable.

The Real Problem Is Recall

Memory is usually framed as a storage problem.

Where do we store it?
A vector database?
A relational database?
Files?
A graph?
A long context window?
A model's own weights?

Those are important implementation choices, but they do not answer the deeper question.

For any specific task, the system still needs to decide:

What should be recalled?
Who is allowed to recall it?
Is it still fresh?
Where did it come from?
What authority does it have?
Does newer evidence override it?
Should it be shown to this agent?
Should it affect this decision?

That is not just retrieval.

That is recall policy.

And recall policy is where agent memory becomes a runtime architecture problem.

Retrieval Is Not Governance

A retrieval system can answer:

"What information is semantically similar to this query?"

But an agent memory system needs to answer:

"What information is this agent allowed to use for this task right now?"

Those are not the same question.

Semantic similarity is useful, but it is not enough.

A stale memory can be semantically relevant.
A private document can be semantically relevant.
A low-authority summary can be semantically relevant.
A model-generated assumption can be semantically relevant.
A superseded workflow state can be semantically relevant.

That does not mean it should enter the prompt.

Retrieval finds candidates.

Governed recall decides what is allowed to become active.

Memory Needs Authority

Not all memory should have the same power over future agent behavior.

A previous chat message is not the same as a tool result.
A generated summary is not the same as an approved policy.
A model assumption is not the same as runtime evidence.
A user preference is not the same as workflow state.
A retrieved document is not automatically more trustworthy than a current system record.

Yet many agent systems flatten these into the same prompt as plain text.

Once that happens, the model has to infer authority from language.

That is fragile.

A production memory system should distinguish between different kinds of memory:

Runtime evidence
Workflow state
Approved policies
User preferences
Retrieved knowledge
Generated summaries
Model assumptions
Prior messages
External observations
Human approvals

These should not enter context as equal facts.

The runtime should preserve their authority before the model reasons over them.
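
One way to preserve authority is to attach it as data before anything is rendered into the prompt. The sketch below is only an illustration. The kind names mirror the list above, but the strongest-first ordering, the `TaggedMemory` shape, and `renderWithAuthority` are assumptions made for this example, not a reference design.

```typescript
// Memory kinds from the list above. The strongest-first ordering below
// is an assumption for this sketch; a real system would define its own.
type MemoryKind =
  | "runtime_evidence"
  | "human_approval"
  | "workflow_state"
  | "approved_policy"
  | "external_observation"
  | "user_preference"
  | "retrieved_knowledge"
  | "prior_message"
  | "generated_summary"
  | "model_assumption";

const AUTHORITY_ORDER: MemoryKind[] = [
  "runtime_evidence", "human_approval", "workflow_state", "approved_policy",
  "external_observation", "user_preference", "retrieved_knowledge",
  "prior_message", "generated_summary", "model_assumption",
];

interface TaggedMemory {
  kind: MemoryKind;
  text: string;
}

// Lower index in AUTHORITY_ORDER means higher authority.
function authorityIndex(kind: MemoryKind): number {
  return AUTHORITY_ORDER.indexOf(kind);
}

// Render strongest-first and keep the kind as an explicit label,
// so the model does not have to infer authority from wording.
function renderWithAuthority(memories: TaggedMemory[]): string {
  return memories
    .slice()
    .sort((a, b) => authorityIndex(a.kind) - authorityIndex(b.kind))
    .map((m) => "[" + m.kind + "] " + m.text)
    .join("\n");
}

const renderedContext = renderWithAuthority([
  { kind: "model_assumption", text: "The customer probably prefers option A." },
  { kind: "runtime_evidence", text: "The customer selected option B in the form." },
]);
console.log(renderedContext);
```

The labels do not make the model obey the hierarchy. They only stop the runtime from throwing the distinction away before the model sees the text.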

Runtime Evidence Should Beat Model Assumptions

This boundary is critical.

If the model says:

"I sent the email".

That is a claim.

If the email API returns a message ID and timestamp, that is evidence.

If the model says:

"The customer probably prefers option A".

That is an assumption.

If the customer explicitly selected option B in a form, that is evidence.

If the model says:

"This task is already complete".

That is a claim.

If the workflow state shows required artifacts are missing, the task is not complete.

Agent systems become dangerous when claims, assumptions, summaries, and evidence all enter memory with the same authority.

Governed recall means the system knows the difference.

The model can reason.

But the runtime should know what actually happened.

Freshness Matters

A memory can be accurate when it is written and still be dangerous.

Because it may no longer be true.

This is one of the biggest problems in long-running agent workflows.

An agent may remember:

"The deployment is blocked".

But the deployment was unblocked an hour ago.

It may remember:

"The customer has not paid".

But payment cleared this morning.

It may remember:

"Approval is still pending".

But approval was granted yesterday.

It may remember:

"The user prefers short answers".

But that preference may apply only to casual updates, not technical reports.

Freshness is not a small detail.

It determines whether memory should still influence behavior.
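
A freshness check can be small. The sketch below assumes each memory carries a recording time, an optional time-to-live, and an optional pointer to a newer record. The field names (`ttlMs`, `supersededBy`) and the thirty-minute TTL are hypothetical choices for illustration.

```typescript
interface DatedMemory {
  text: string;
  recordedAt: Date;
  ttlMs?: number;        // optional time-to-live; omitted means no time-based expiry
  supersededBy?: string; // id of a newer record that replaces this one
}

// A memory is usable only if nothing newer replaced it and its TTL has not elapsed.
function isStillValid(memory: DatedMemory, now: Date): boolean {
  if (memory.supersededBy !== undefined) return false;
  if (memory.ttlMs === undefined) return true;
  return now.getTime() - memory.recordedAt.getTime() <= memory.ttlMs;
}

const THIRTY_MINUTES_MS = 30 * 60 * 1000;

const blockedDeployment: DatedMemory = {
  text: "The deployment is blocked.",
  recordedAt: new Date("2026-06-18T09:00:00Z"),
  ttlMs: THIRTY_MINUTES_MS, // assumed: deployment status goes stale quickly
};

const checkedAt = new Date("2026-06-18T10:00:00Z");
const usableNow = isStillValid(blockedDeployment, checkedAt); // false: one hour old, TTL is 30 minutes
console.log(usableNow);
```

An expired memory does not have to be deleted. The runtime can re-check the live system instead of recalling the stale value.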

A memory system should not only ask:

"Have we seen something like this before?"

It should ask:

"Is this still valid?"

Scope Matters

An organization does not give every person access to every memory.

A finance role sees different information than a support role.
A contractor sees different information than an executive.
A customer-facing workflow sees different context than an internal strategy workflow.

AI Agents need the same boundaries.

Memory should be scoped by:

Agent role
User
Organization
Workflow
Task
Permission level
Data sensitivity
Operational context

Without scope, memory becomes a leak.

The issue is not only that the agent may retrieve the wrong information.

The issue is that the agent may retrieve information it should never have seen.

In real systems, memory access is authorization.
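
Treating recall as authorization means the check runs before retrieval results are returned, not after. The sketch below is one possible shape, assuming a numeric sensitivity scale and role lists. `MemoryScope`, `RecallCaller`, and `mayRecall` are hypothetical names, not an existing API.

```typescript
interface MemoryScope {
  organization: string;
  workflow?: string;      // omitted means any workflow inside the organization
  allowedRoles: string[];
  sensitivity: number;    // assumed scale: 0 = public, 3 = restricted
}

interface RecallCaller {
  organization: string;
  workflow: string;
  role: string;
  clearance: number;      // highest sensitivity this caller may see
}

// Every condition must hold; semantic relevance plays no part in this decision.
function mayRecall(caller: RecallCaller, scope: MemoryScope): boolean {
  return (
    caller.organization === scope.organization &&
    (scope.workflow === undefined || scope.workflow === caller.workflow) &&
    scope.allowedRoles.indexOf(caller.role) !== -1 &&
    caller.clearance >= scope.sensitivity
  );
}

const payrollScope: MemoryScope = {
  organization: "acme",
  allowedRoles: ["finance"],
  sensitivity: 3,
};

const supportAgent: RecallCaller = { organization: "acme", workflow: "ticket_triage", role: "support", clearance: 1 };
const financeAgent: RecallCaller = { organization: "acme", workflow: "payroll_run", role: "finance", clearance: 3 };

console.log(mayRecall(supportAgent, payrollScope)); // false: wrong role and insufficient clearance
console.log(mayRecall(financeAgent, payrollScope)); // true
```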

Provenance Matters

A memory without provenance is dangerous because the system no longer knows how much to trust it.

Where did this memory come from?
Was it written by a human?
Was it inferred by a model?
Was it extracted from a document?
Was it generated as a summary?
Was it produced by a tool call?
Was it approved?
Was it observed?
Was it imported from an external system?
Was it created during a failed workflow?

These distinctions matter.

A model-generated summary should not carry the same weight as the original source.
A user comment should not carry the same weight as an approved policy.
A tool result should not carry the same weight as a model's interpretation of that result.

Provenance is what prevents memory from becoming anonymous context.

And anonymous context is hard to trust.

The Model Should Not Govern Its Own Recall

One tempting pattern is to give the model access to a memory store and ask it to decide what it needs.

This can work in demos.

But for real workflows, it creates a weak boundary.

The same probabilistic system that will reason over the memory is also deciding what memory it should see.

That is risky.

The model may retrieve too much.
It may retrieve stale context.
It may retrieve unauthorized context.
It may overvalue its own previous assumptions.
It may ignore stronger runtime evidence.
It may fail to notice that a memory has been superseded.

So the runtime needs to sit between memory and the model.

The model should not receive memory just because memory exists.

The runtime should curate recall.

Governed Recall

Governed recall means memory access is controlled before context reaches the model.

The runtime asks:

Is this memory relevant to the current task?
Is the agent allowed to see it?
Is it fresh enough?
What is its source?
What authority does it carry?
Does stronger evidence override it?
Is it scoped to this workflow?
Has it expired?
Has it been superseded?
Should it be summarized?
Should it be hidden?
Should it trigger a human review?

Only after those checks should memory enter the model context.
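
The checklist above can be sketched as a gate that sits between retrieval and the prompt. This is a simplified illustration under several assumptions. The candidate fields, the thresholds, the order of the checks, and the three outcomes (`admit`, `hide`, `review`) are invented for this example. Each decision keeps a reason so the gate itself can be audited.

```typescript
type RecallOutcome = "admit" | "hide" | "review";

interface RecallCandidate {
  id: string;
  similarity: number;        // score from the retrieval step
  callerAuthorized: boolean; // result of a scope and permission check
  expired: boolean;
  superseded: boolean;
  authority: number;         // assumed scale: higher means a more trusted source
}

interface RecallDecision {
  id: string;
  outcome: RecallOutcome;
  reason: string;
}

// Checks run from hard boundaries (authorization) to softer ones (authority).
function decideRecall(c: RecallCandidate, minSimilarity: number, minAuthority: number): RecallDecision {
  if (!c.callerAuthorized) return { id: c.id, outcome: "hide", reason: "caller not authorized" };
  if (c.expired) return { id: c.id, outcome: "hide", reason: "expired" };
  if (c.superseded) return { id: c.id, outcome: "hide", reason: "superseded" };
  if (c.similarity < minSimilarity) return { id: c.id, outcome: "hide", reason: "not relevant enough" };
  if (c.authority < minAuthority) return { id: c.id, outcome: "review", reason: "low authority" };
  return { id: c.id, outcome: "admit", reason: "passed all checks" };
}

const recallCandidates: RecallCandidate[] = [
  { id: "mem_1", similarity: 0.9, callerAuthorized: true, expired: false, superseded: false, authority: 5 },
  { id: "mem_2", similarity: 0.95, callerAuthorized: false, expired: false, superseded: false, authority: 5 },
  { id: "mem_3", similarity: 0.8, callerAuthorized: true, expired: false, superseded: true, authority: 4 },
  { id: "mem_4", similarity: 0.8, callerAuthorized: true, expired: false, superseded: false, authority: 1 },
];

const recallDecisions = recallCandidates.map((c) => decideRecall(c, 0.5, 3));
const admittedIds = recallDecisions.filter((d) => d.outcome === "admit").map((d) => d.id);
console.log(admittedIds); // only mem_1 reaches the model context
```

Note that `mem_2` has the highest similarity score and is still hidden. That is the point of the gate.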

This is the difference between retrieval and governed recall.

Retrieval says:

"This looks similar".

Governed recall says:

"This is allowed, relevant, current, scoped, and trustworthy enough to influence this task".

Memory Is Policy

Once agents start operating inside real workflows, memory becomes policy.

What the agent remembers determines what it believes.
What it believes influences what it does.
What it does affects real systems.

So memory is not neutral.

It is an operational control surface.

If an agent recalls the wrong thing, it may take the wrong action.
If it recalls stale state, it may repeat work.
If it recalls private information, it may leak data.
If it recalls a weak assumption as fact, it may produce bad decisions.
If it fails to recall an obligation at the right time, it may miss a commitment.

Memory shapes behavior.

That means memory needs governance.

The Future Problem: Knowing When to Remember

There is another layer beyond what to recall.

When should memory become active?

Most systems retrieve memory reactively.

A user asks something.
The system searches.
The model receives context.

But many organizational workflows require memory to activate later.

For example:

"Follow up with this customer if payment has not cleared by Friday".

That is not just a fact to store.

It is an intention with future activation conditions.

The memory should become relevant when time passes or when an event happens.

Most systems solve this with cron jobs, workflow engines, reminders, or external orchestration.

That works, but it shows something important:

Agent memory is not only about answering questions.

Sometimes memory needs to trigger action.

That is a much deeper problem.

And it is one of the reasons memory belongs in the runtime architecture, not only in the prompt.
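
An intention with activation conditions can be modeled as data plus a predicate. The sketch below assumes a time condition (`notBefore`) and a state condition (`stillNeeded`) that is evaluated at activation time, not at storage time. The names, the `AccountFacts` shape, and the concrete Friday deadline are hypothetical. Something external, such as a scheduler or workflow engine, still has to call `dueIntentions`.

```typescript
interface AccountFacts {
  paymentCleared: boolean;
}

interface DeferredIntention {
  description: string;
  notBefore: Date;                                // time-based activation condition
  stillNeeded: (facts: AccountFacts) => boolean;  // checked against current facts when activating
}

// Returns the intentions that should become active right now.
function dueIntentions(intentions: DeferredIntention[], now: Date, facts: AccountFacts): DeferredIntention[] {
  return intentions.filter(
    (intent) => now.getTime() >= intent.notBefore.getTime() && intent.stillNeeded(facts)
  );
}

const followUp: DeferredIntention = {
  description: "Follow up with this customer about the unpaid invoice.",
  notBefore: new Date("2026-06-19T17:00:00Z"), // assumed Friday deadline
  stillNeeded: (facts) => !facts.paymentCleared,
};

const thursdayNoon = new Date("2026-06-18T12:00:00Z");
const fridayEvening = new Date("2026-06-19T18:00:00Z");

const dueThursday = dueIntentions([followUp], thursdayNoon, { paymentCleared: false });     // nothing yet
const dueFridayUnpaid = dueIntentions([followUp], fridayEvening, { paymentCleared: false }); // follow-up activates
const dueFridayPaid = dueIntentions([followUp], fridayEvening, { paymentCleared: true });    // condition no longer holds
console.log(dueThursday.length, dueFridayUnpaid.length, dueFridayPaid.length);
```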

A Better Mental Model

Instead of:

"The agent has memory".

Think:

"The system governs what the agent can recall".

This small shift changes the design.

The model is no longer treated as the owner of memory.
The runtime owns memory access.
The workflow owns state.
The tools produce evidence.
Permissions define boundaries.
Policies define authority.
The model receives curated context and reasons over it.

That is a much safer architecture.

Why This Matters

The AI world is moving very fast.

Every week, a new model appears.

A better brain.
A larger context window.
A stronger coding model.
A faster reasoning model.

Those improvements matter.

But smarter brains are not enough.

If AI Agents are going to operate inside real organizations, they need architecture around them.

They need permissions.
They need runtime boundaries.
They need workflow state.
They need evidence.
They need memory governance.
They need recall policies.

A powerful model without governed recall can still act on stale, unauthorized, or low-authority context.

That is not an intelligence problem.

That is a Systems Engineering problem.

Final Thought

AI agents do not need more memory by default.

They need better rules for what memory is allowed to become active.
They need memory with scope, provenance, freshness, permissions, authority, and evidence.
They need runtime-governed recall.

Because the real question is not:

"How much can the agent remember?"

The real question is:

"Can we trust what the agent is allowed to recall?"

AI Agent Memory Is Not Chat History

Glendel Joubert Fyne Acosta — Thu, 11 Jun 2026 00:12:05 +0000

Most AI agent systems start with a simple idea:

"Let's give the Agent Memory".

At first, this usually means saving previous messages, retrieving similar chunks, and injecting them back into the prompt.

That works for demos.

It does not work reliably for real organizational workflows.

Because chat history is not memory.
A vector database is not memory.
A bigger context window is not memory.

Those are storage and retrieval mechanisms. Useful, yes. But memory in an AI Agent System is not just about remembering more information.

It is about deciding what should influence future behavior.

And that is a much harder problem.

The Simple Version

When people say "Agent Memory", they often mix together very different things:

Conversation history
User preferences
Workflow state
Previous tool results
Retrieved documents
Task summaries
Business rules
Approved policies
Model-generated assumptions
Evidence of completed actions

But these should not all be treated the same way.

A user saying "I usually prefer short answers" is not the same kind of memory as "invoice #123 was paid".
A model saying "the client is probably interested" is not the same as a CRM record.
A previous chat message is not the same as a runtime audit log.
An approved company policy is not the same as a generated summary.

When all of these are thrown into the same context window, the agent may look smarter for a while.

Then it slowly becomes unreliable.

More Context Can Make Agents Worse

A common instinct is to give the agent more context.

More history.
More documents.
More summaries.
More retrieved chunks.
More memory.

But more context does not automatically mean better reasoning.

Sometimes it means more noise.
Sometimes it means stale information.
Sometimes it means private information leaking into the wrong task.
Sometimes it means the model starts treating old assumptions as current facts.
Sometimes it means low-authority memory overrides high-authority evidence.

This is one of the strange things about AI Agents:

An agent can get worse because it remembers too much without knowing what should matter.

The problem is not only forgetting.

The problem is remembering without governance.

Memory Needs Scope

A human organization does not give every worker access to every memory.
A salesperson does not automatically see payroll data.
A support agent does not automatically see executive board notes.
A contractor does not automatically see internal security policies.

Access depends on role, task, permission, and context.

AI agents need the same kind of boundaries.

If an agent has a role, its memory should be scoped to that role.

A finance agent should not recall unrelated HR details.
A support agent should not receive private strategy documents unless explicitly authorized.
A research agent should not inherit operational permissions just because it saw previous context.

Memory without scope becomes a data leak waiting to happen.

Memory Needs Provenance

Not all memory has the same authority.

Where did this memory come from?
Was it written by a user?
Was it retrieved from a document?
Was it produced by another agent?
Was it inferred by a model?
Was it approved by a human?
Was it produced by a tool execution?
Was it recorded by the runtime?

These distinctions matter.

For example:

"The agent thinks the customer is unhappy".

is not the same as:

"The customer wrote: 'I am unhappy with the delay'".

And neither of those is the same as:

"A support ticket was escalated by a human manager".

If the system does not track provenance, the model may treat all memory as equally trustworthy.

That is dangerous.

A model-generated assumption should not have the same authority as runtime evidence.

Memory Needs Freshness

Some memories expire.
Some facts change.
Some decisions are superseded.
Some preferences are temporary.
Some business rules are updated.
Some project states become obsolete.

If the memory layer does not understand freshness, agents can become confidently wrong.

This is especially dangerous in long-running workflows.

An agent might remember:

"The client prefers option A".

But maybe the client changed their mind yesterday.

An agent might remember:

"The deployment is blocked".

But maybe the deployment was completed two hours ago.

An agent might remember:

"This task is waiting for approval".

But maybe approval was already granted.

Memory should not only answer:

"Have I seen something like this before?"

It should also answer:

"Is this still true?"

Memory Needs Authority Levels

A production agent memory system should distinguish between different authority levels.

For example:

Runtime Evidence:
What actually happened: tool calls, outputs, timestamps, approvals, errors.
Approved Knowledge:
Policies, procedures, user-approved facts, business rules.
Observed Facts:
Information extracted from emails, documents, tickets, repositories, databases.
User Preferences:
Stable preferences explicitly stated by the user.
Generated Summaries:
Useful compression, but lossy and potentially wrong.
Model Assumptions:
Hypotheses, guesses, interpretations, incomplete reasoning.

These should not have equal weight.

A generated summary should not override a tool result.
A model assumption should not override a policy.
A retrieved chunk should not override a runtime audit log.

Memory needs hierarchy.

Otherwise the agent is just reasoning over a pile of mixed-authority text.

Workflow State Is Not Memory

One major mistake is treating workflow state as memory.

Workflow state is not "something the agent remembers".

Workflow state is something the system owns.

For example:

Current step
Completed step
Failed step
Pending approval
Retry count
Tool result
Assigned agent
Deadline
Execution status

This should not depend on the model remembering correctly.

The runtime should know.

If an agent claims:

"I sent the email".

The system should be able to verify whether the email was actually sent.

If an agent claims:

"The task is complete".

The system should be able to check whether the required artifact exists.

If an agent claims:

"I already asked for approval".

The system should know whether an approval request was actually created.

Workflow state belongs outside the model.

The model can reason about state.

But the runtime should own state.
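
Checking a claim against runtime-owned state can be sketched as a lookup. The example below assumes the runtime keeps one record per step and knows which steps must produce an artifact. `StepRecord`, `AgentClaim`, and `verifyClaim` are hypothetical names used only for illustration.

```typescript
type StepStatus = "pending" | "running" | "completed" | "failed";
type ClaimVerdict = "verified" | "contradicted" | "unknown";

// Written by the runtime when tools actually run, never by the model.
interface StepRecord {
  step: string;
  status: StepStatus;
  requiresArtifact: boolean;
  artifactId?: string;
}

// What the model says happened.
interface AgentClaim {
  step: string;
  claimedStatus: StepStatus;
}

function verifyClaim(claim: AgentClaim, records: StepRecord[]): ClaimVerdict {
  const record = records.filter((r) => r.step === claim.step)[0];
  if (record === undefined) return "unknown"; // the runtime has no trace of this step
  if (record.status !== claim.claimedStatus) return "contradicted";
  // A completed step that should have produced an artifact must actually have one.
  if (record.status === "completed" && record.requiresArtifact && record.artifactId === undefined) {
    return "contradicted";
  }
  return "verified";
}

const stepRecords: StepRecord[] = [
  { step: "send_email", status: "failed", requiresArtifact: false },
  { step: "generate_report", status: "completed", requiresArtifact: true, artifactId: "artifact_1" },
];

const emailVerdict = verifyClaim({ step: "send_email", claimedStatus: "completed" }, stepRecords);
const reportVerdict = verifyClaim({ step: "generate_report", claimedStatus: "completed" }, stepRecords);
const approvalVerdict = verifyClaim({ step: "request_approval", claimedStatus: "completed" }, stepRecords);
console.log(emailVerdict, reportVerdict, approvalVerdict); // contradicted verified unknown
```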

Memory Is Not Just Retrieval

RAG is useful.
Vector search is useful.
Embeddings are useful.
Long context is useful.

But none of them solve memory by themselves.

Retrieval answers:

"What information is semantically similar to this query?"

Agent memory needs to answer:

"What information should this agent be allowed to use for this task right now?"

That is a different question.

A memory system should consider:

Relevance
Permission
Freshness
Provenance
Authority
Task scope
Privacy
Retention
Evidence
Lifecycle

Without those controls, memory becomes a context injection mechanism.

And context injection is not governance.

The Runtime Should Curate Memory

In a reliable AI Agent System, the model should not receive memory simply because memory exists.

There should be a runtime or context layer that decides what enters the prompt.

That layer should ask:

Is this memory relevant to the current task?
Is this agent allowed to access it?
Is this memory still valid?
What source created it?
What authority level does it have?
Has it expired?
Has it been superseded?
Does it conflict with stronger evidence?
Should this memory be summarized or passed directly?
Should this memory be hidden from the model?

This is where agent memory becomes an architectural problem.

It is not just about storing text.

It is about governing recall.

A Better Mental Model

Instead of thinking:

"The agent has memory".

Think:

"The system controls what the agent is allowed to recall".

That small shift changes the architecture.

The agent does not own memory.
The runtime owns memory access.
The model reasons.
The runtime curates context.
The system records evidence.
The workflow tracks state.
Permissions control access.
Policies define boundaries.

This separation is important because models are probabilistic.

Memory governance should not be.

A Practical Architecture

A more reliable Agent Memory Architecture might separate memory into layers:

1. Conversation Context:

Recent interaction history.

Useful for continuity.

Not authoritative by default.

2. Working State:

The current task state.

Owned by the runtime, not the model.

3. Episodic Memory:

Past events and interactions.

Useful, but should include timestamps, sources, and scope.

4. Semantic Knowledge:

Documents, knowledge bases, policies, procedures.

Should include provenance and authority.

5. Runtime Evidence:

Tool calls, approvals, outputs, logs, completed actions.

This should have higher authority than model claims.

6. Preferences:

User or organization preferences.

Should be explicit, scoped, and editable.

7. Summaries:

Compressed context.

Useful, but lossy. Should not be treated as truth without source references.

The key is not only storing these separately.

The key is applying different rules to each one.
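
"Different rules for each layer" can start as a plain lookup table. The sketch below encodes the seven layers above with two assumed rule fields: whether a layer is authoritative by default, and which metadata an item must carry before it may enter context. The field names and the specific defaults are illustrative guesses based on the descriptions above, not a fixed scheme.

```typescript
type MemoryLayer =
  | "conversation"
  | "working_state"
  | "episodic"
  | "semantic"
  | "runtime_evidence"
  | "preferences"
  | "summaries";

interface LayerRule {
  authoritativeByDefault: boolean;
  requiredMetadata: string[];
}

// Assumed defaults. Semantic items may still carry their own per-item authority.
const LAYER_RULES: Record<MemoryLayer, LayerRule> = {
  conversation: { authoritativeByDefault: false, requiredMetadata: [] },
  working_state: { authoritativeByDefault: true, requiredMetadata: [] },
  episodic: { authoritativeByDefault: false, requiredMetadata: ["timestamp", "source", "scope"] },
  semantic: { authoritativeByDefault: false, requiredMetadata: ["provenance", "authority"] },
  runtime_evidence: { authoritativeByDefault: true, requiredMetadata: ["timestamp"] },
  preferences: { authoritativeByDefault: false, requiredMetadata: ["scope"] },
  summaries: { authoritativeByDefault: false, requiredMetadata: ["sourceRef"] },
};

// Lists the metadata fields an item still lacks for its layer.
// An empty result means the item satisfies the layer's rule.
function missingMetadata(layer: MemoryLayer, provided: string[]): string[] {
  return LAYER_RULES[layer].requiredMetadata.filter((field) => provided.indexOf(field) === -1);
}

const summaryGaps = missingMetadata("summaries", []);
const episodicGaps = missingMetadata("episodic", ["timestamp", "source"]);
console.log(summaryGaps, episodicGaps); // a summary without a source reference is held back
```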

Why This Matters More in Multi-Agent Systems

Memory gets even harder when multiple agents are involved.

If Agent A writes something into shared memory, should Agent B trust it?
Should Agent B see it?
Was it an observation, an inference, or a completed action?
Did a human approve it?
Was it generated from stale context?
Was it meant to be private to one workflow?

In Multi-Agent Systems, memory becomes a coordination surface.

Bad memory can propagate across agents.

One agent makes an assumption.
Another agent reads it as fact.
A third agent acts on it.

Now the system has transformed an uncertain inference into operational behavior.

That is how unreliable agent systems drift.

Multi-Agent Memory needs boundaries, ownership, and evidence.

Not just shared context.
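
One way to stop an inference from silently turning into a fact is to make every shared entry declare what it is. The sketch below is a minimal illustration. `SharedMemoryEntry`, `readableBy`, and `usableAsFact`, and the rule that only evidence or human approval upgrades an entry, are assumptions for this example.

```typescript
type SharedEntryKind = "observation" | "inference" | "completed_action";

interface SharedMemoryEntry {
  writtenBy: string;
  kind: SharedEntryKind;
  text: string;
  humanApproved: boolean;
  evidenceId?: string;        // link to a runtime evidence record, if any
  privateToWorkflow?: string; // omitted means visible to every workflow
}

// Visibility: private entries are only readable inside their own workflow.
function readableBy(entry: SharedMemoryEntry, readerWorkflow: string): boolean {
  return entry.privateToWorkflow === undefined || entry.privateToWorkflow === readerWorkflow;
}

// Trust: an inference never counts as fact unless a human approved it.
// Observations and actions count only when backed by evidence.
function usableAsFact(entry: SharedMemoryEntry): boolean {
  if (entry.humanApproved) return true;
  if (entry.kind === "inference") return false;
  return entry.evidenceId !== undefined;
}

const churnGuess: SharedMemoryEntry = {
  writtenBy: "research-agent",
  kind: "inference",
  text: "The customer is likely to churn.",
  humanApproved: false,
};

const ticketEscalated: SharedMemoryEntry = {
  writtenBy: "support-agent",
  kind: "completed_action",
  text: "The support ticket was escalated.",
  humanApproved: false,
  evidenceId: "audit_123",
  privateToWorkflow: "support",
};

console.log(usableAsFact(churnGuess));              // false: stays an inference for every reader
console.log(usableAsFact(ticketEscalated));         // true: backed by evidence
console.log(readableBy(ticketEscalated, "billing")); // false: private to the support workflow
```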

The Real Problem

The real problem is not:

"How do we make agents remember more?"

The real problem is:

"How do we make agents remember safely?"

That means memory must be:

Scoped
Permissioned
Current
Traceable
Auditable
Ranked by authority
Connected to evidence
Separated from workflow state
Governed by runtime rules

Without this, agent memory becomes another source of hallucination.

A very convincing one.

Final Thought

AI agent memory is not chat history.
It is not a vector database.
It is not a bigger context window.
It is not a pile of summaries.

Real agent memory is governed recall.

For agents operating inside real organizations, memory must answer more than:

"What might be useful?"

It must also answer:

"What is allowed, current, relevant, trustworthy, and supported by evidence?"

That is the difference between an agent that remembers things and an agent whose memory can be trusted.

Building AI Workflows Is Easy. Making Them Reliable Is Systems Engineering

Glendel Joubert Fyne Acosta — Sat, 30 May 2026 02:14:54 +0000

Building the first version of an AI workflow is usually easy.

Connect an LLM to a few tools.
Add some instructions.
Let the model decide what to do next.
Run the demo.
It works.

The problem starts later, when that workflow becomes part of a real process.

Suddenly the important questions are not about the prompt anymore.

They are about reliability.

What happens when a tool fails?
What happens when the model retries the wrong thing?
What happens when the workflow changes state but the agent still claims failure?
What happens when the agent claims success but no tool actually ran?
What happens when one agent hands bad context to another agent?

This is where AI workflows stop being prompt engineering.

They become Systems Engineering.

The Demo Is Not The System

A lot of AI workflow demos optimize for the happy path.

The user asks for something.
The agent thinks.
The agent calls a tool.
The tool returns a result.
The agent summarizes the result.
Everyone claps.

But production workflows do not live on the happy path.

They live in the messy reality of:

Partial failures
Bad inputs
Timeout errors
Invalid tool responses
Duplicate retries
Missing context
Permission denials
State inconsistencies
Cost limits
Human approvals
Recovery paths

The first version proves that the idea is possible.

The production version needs to prove that the system is dependable.

Those are very different goals.

Prompts Can Guide Reasoning. They Cannot Manage Reliability.

Prompts are important.

They help the model understand:

What role it is playing
What goal it should pursue
How it should reason
What tone it should use
What constraints it should consider

But prompts should not be responsible for the reliability of the whole workflow.

A prompt should not be the only thing preventing an unsafe action.

A prompt should not be the only thing remembering which step already completed.

A prompt should not be the only thing deciding whether a retry is safe.

A prompt should not be the only thing proving that a tool actually executed.

Once an AI workflow affects real systems, the runtime needs to take responsibility for the parts that require consistency.

"The model can reason. The system must govern."

The Core Split: Reasoning, Execution, State, Evidence

A reliable AI workflow needs a clean separation between four concerns:

Reasoning: The model handles reasoning.
Execution: The runtime handles execution.
State: The workflow engine manages state.
Evidence: The audit layer records evidence.

When these responsibilities are mixed together, debugging becomes painful.

For example, this is fragile:

const result = await agent.run(`
  Read the customer complaint,
  decide whether it needs escalation,
  send the email if needed,
  and tell me when you're done.
`);

Why?

Because too much is hidden inside one probabilistic step.

Did the agent actually send the email?
Was the action allowed?
Was the customer data valid?
Did the escalation rule trigger?
Did the email tool fail?
Was the final response based on evidence or assumption?

A more reliable architecture separates the work:

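// Sketch only: `agent`, `runtime`, and their methods are illustrative, not a specific SDK.
async function handleComplaint(agent, runtime, complaint, context, escalationTeam) {
  // 1. Reasoning: the model only produces a decision.
  const decision = await agent.reason({
    task: "Should this complaint be escalated?",
    context
  });

  // `shouldEscalate` is an assumed field on the decision object.
  if (!decision.shouldEscalate) {
    // Nothing to execute; record the decision so the outcome stays traceable.
    return runtime.audit.record({ actor: agent.id, decision });
  }

  // 2. Permission: the runtime, not the prompt, decides whether the action is allowed.
  const permission = runtime.permissions.verify({
    actor: agent.id,
    action: "send_escalation_email",
    resource: complaint.id
  });

  if (!permission.allowed) {
    return runtime.recordDeniedAction(decision, permission);
  }

  // 3. Execution: the runtime calls the tool and captures its result.
  const execution = await runtime.tools.sendEmail({
    to: escalationTeam,
    template: "complaint_escalation",
    complaintId: complaint.id
  });

  // 4. Evidence: what was decided, allowed, and executed is recorded together.
  const evidence = runtime.audit.record({
    actor: agent.id,
    decision,
    permission,
    execution
  });

  // 5. Summary: the model summarizes from recorded evidence, not from its own claim.
  return agent.summarize({
    evidenceId: evidence.id,
    executionStatus: execution.status
  });
}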

This is less magical.

It is also much easier to trust.

The Retry Problem

Retries are one of the most underestimated problems in AI workflows.

In traditional software, retrying a failed API call is usually straightforward.

If the request times out, try again.

But AI workflows introduce different kinds of failure.

A tool call failing is not the same as a model reasoning step failing.
A network timeout is not the same as a bad plan.
A malformed JSON response is not the same as missing business context.
A low-quality answer is not the same as an unavailable dependency.

Different failures need different retry strategies.

For example:

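// Sketch only: the handler functions are illustrative placeholders.
function handleFailure(failure) {
  switch (failure.type) {
    case "tool_timeout":
      // Transient failure: repeat the same call (assumes the tool call is idempotent).
      return retrySameToolCall();

    case "invalid_tool_payload":
      // The plan was fine but the payload was malformed: ask for a repair only.
      return askModelToRepairPayload();

    case "bad_reasoning":
      // Retrying with the same context tends to repeat the same mistake.
      return resetContextAndReplan();

    case "permission_denied":
      // Retrying cannot change a permission decision.
      return escalateToHuman();

    case "cost_budget_exceeded":
      return stopWorkflow();

    default:
      // Unknown failures should surface loudly instead of triggering a blind retry.
      throw new Error(`Unhandled failure type: ${failure.type}`);
  }
}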

If every failure is handled with "just run the agent again", the system can become expensive, slow, and unreliable.

Sometimes the correct retry is not retrying.

Sometimes the correct response is:

Reduce scope
Reset context
Ask for clarification
Escalate to a human
Stop the workflow
Record the failure

Cost-aware retries are not just a billing concern.

They are a reliability concern.
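
A retry policy that is cost-aware can be expressed as a pure decision function, which also makes it easy to test. The sketch below reuses the failure types from the example above. The action names, `RetryBudget`, `RetryUsage`, and the concrete limits are assumptions made for illustration.

```typescript
type FailureType =
  | "tool_timeout"
  | "invalid_tool_payload"
  | "bad_reasoning"
  | "permission_denied"
  | "cost_budget_exceeded";

type RecoveryAction =
  | "retry_same_call"
  | "repair_payload"
  | "replan"
  | "escalate_to_human"
  | "stop_workflow";

// Default strategy per failure type, mirroring the switch statement above.
const DEFAULT_ACTION: Record<FailureType, RecoveryAction> = {
  tool_timeout: "retry_same_call",
  invalid_tool_payload: "repair_payload",
  bad_reasoning: "replan",
  permission_denied: "escalate_to_human",
  cost_budget_exceeded: "stop_workflow",
};

interface RetryBudget {
  maxAttempts: number;
  maxCostUsd: number;
}

interface RetryUsage {
  attempts: number; // recovery attempts already made
  costUsd: number;  // money already spent on this workflow run
}

// The budget check comes first, so no strategy can loop forever or overspend.
function nextRecoveryAction(failure: FailureType, usage: RetryUsage, budget: RetryBudget): RecoveryAction {
  if (usage.attempts >= budget.maxAttempts || usage.costUsd >= budget.maxCostUsd) {
    return "stop_workflow";
  }
  return DEFAULT_ACTION[failure];
}

const retryBudget: RetryBudget = { maxAttempts: 3, maxCostUsd: 0.5 };

console.log(nextRecoveryAction("tool_timeout", { attempts: 1, costUsd: 0.1 }, retryBudget));  // retry_same_call
console.log(nextRecoveryAction("tool_timeout", { attempts: 3, costUsd: 0.1 }, retryBudget));  // stop_workflow
console.log(nextRecoveryAction("bad_reasoning", { attempts: 1, costUsd: 0.6 }, retryBudget)); // stop_workflow
```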

State Must Be Explicit

A workflow that cannot explain its current state cannot be reliably recovered.

If an agent is halfway through a process, the system should know:

Which step is running
Which steps completed
Which tools executed
Which outputs were produced
Which approvals are pending
Which errors occurred
What can safely happen next

Without explicit state, recovery becomes guesswork.

This is especially dangerous when the workflow mutates external systems.

Imagine a workflow that:

Reads a customer complaint.
Creates an internal ticket.
Sends an escalation email.
Updates the CRM.
Marks the complaint as handled.

If the workflow fails at step 4, what should happen?

Should it restart from step 1?
Should it send the email again?
Should it create a duplicate ticket?
Should it mark the complaint as handled?

The answer depends on state.

Reliable workflows need checkpoints.

workflow.checkpoint("ticket_created", {
  ticketId,
  complaintId,
  timestamp
});

workflow.checkpoint("email_sent", {
  messageId,
  recipient,
  timestamp
});

Checkpoints make recovery possible.

They also make debugging possible.
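
Recovery from checkpoints can be sketched as "run only what has no checkpoint yet". The example below models the five-step complaint workflow above. The step and checkpoint names beyond `ticket_created` and `email_sent`, and the `remainingActions` helper, are hypothetical.

```typescript
interface WorkflowStep {
  action: string;     // what the runtime executes
  checkpoint: string; // name recorded once the action has succeeded
}

const complaintSteps: WorkflowStep[] = [
  { action: "read_complaint", checkpoint: "complaint_read" },
  { action: "create_ticket", checkpoint: "ticket_created" },
  { action: "send_escalation_email", checkpoint: "email_sent" },
  { action: "update_crm", checkpoint: "crm_updated" },
  { action: "mark_handled", checkpoint: "complaint_handled" },
];

// Returns the actions that still need to run, in their original order.
// Steps with a recorded checkpoint are skipped, so side effects are not repeated.
function remainingActions(steps: WorkflowStep[], recordedCheckpoints: string[]): string[] {
  return steps
    .filter((step) => recordedCheckpoints.indexOf(step.checkpoint) === -1)
    .map((step) => step.action);
}

// The workflow failed while updating the CRM (step 4).
const recordedSoFar = ["complaint_read", "ticket_created", "email_sent"];
const toResume = remainingActions(complaintSteps, recordedSoFar);
console.log(toResume); // resumes at update_crm; no duplicate ticket, no second email
```

This only works if a checkpoint is written after the action succeeds and before the next action starts. A crash between the action and its checkpoint still needs idempotent tools or a lookup against the external system.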

Evidence Beats Claims

One of the most dangerous failure modes in AI workflows is false completion.

The agent says:

"Done, I sent the email."

But no email was sent.

Or the email tool failed.

Or permission was denied.

Or the agent never called the tool.

The model's final answer is not evidence.

It is a claim.

A reliable workflow should be able to prove what happened.

An evidence record might include:

{
  "actor": "support-agent-01",
  "action": "send_email",
  "permission": "granted",
  "tool": "email_sender",
  "status": "success",
  "messageId": "msg_123",
  "timestamp": "2026-05-29T14:32:10Z",
  "auditId": "audit_789"
}

Now the system can answer:

Who acted
What was requested
Whether it was allowed
What executed
What result came back
When it happened
What proves it

That is the difference between trusting the agent and trusting the system.

Multi-Agent Workflows Make Reliability Harder

Multi-Agent Systems (MAS) amplify every reliability problem.

In a Single-Agent workflow, one model may lose context or make a bad assumption.

In a Multi-Agent workflow, one agent's unsupported claim can become another agent's input.

For example:

Research Agent says it collected the correct data.
Analyst Agent uses that data to generate a report.
Reviewer Agent approves the report.
Communication Agent sends it to the customer.

If the first claim was wrong, the entire workflow becomes unreliable.

The final output may look coherent.

But the foundation is broken.

That is why Multi-Agent workflows need strong boundaries:

Explicit handoffs
Scoped context
Evidence records
Validation gates
Responsibility tracking
State checkpoints

Agents should not pass vague natural-language summaries to each other as if they were verified facts.
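
A validation gate can enforce this on the receiving side before any work starts. The sketch below assumes a handoff is a structured object and that the runtime can list the evidence IDs it has recorded. The `AgentHandoff` fields, `handoffProblems`, and the sample IDs are illustrative assumptions.

```typescript
interface AgentHandoff {
  from: string;
  to: string;
  task: string;
  status: string;
  artifactId?: string;
  evidenceId?: string;
  scope?: string;
}

// Returns every reason to reject the handoff. An empty list means it may be accepted.
function handoffProblems(handoff: AgentHandoff, recordedEvidenceIds: string[]): string[] {
  const problems: string[] = [];
  if (handoff.status !== "verified") problems.push("status is not verified");
  if (handoff.artifactId === undefined) problems.push("no artifact referenced");
  if (handoff.evidenceId === undefined) {
    problems.push("no evidence referenced");
  } else if (recordedEvidenceIds.indexOf(handoff.evidenceId) === -1) {
    // The sender cited evidence the runtime never recorded.
    problems.push("evidence id not found in audit log");
  }
  if (handoff.scope === undefined) problems.push("no scope declared");
  return problems;
}

const auditLogIds = ["audit_123"]; // ids the runtime itself recorded

const structuredHandoff: AgentHandoff = {
  from: "research-agent",
  to: "analyst-agent",
  task: "analyze_customer_churn",
  status: "verified",
  artifactId: "dataset_456",
  evidenceId: "audit_123",
  scope: "Q1 customer data only",
};

const vagueHandoff: AgentHandoff = {
  from: "research-agent",
  to: "analyst-agent",
  task: "analyze_customer_churn",
  status: "done",
};

console.log(handoffProblems(structuredHandoff, auditLogIds).length); // 0: accepted
console.log(handoffProblems(vagueHandoff, auditLogIds));             // four problems: rejected
```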

A good handoff should include:

{
  "from": "research-agent",
  "to": "analyst-agent",
  "task": "analyze_customer_churn",
  "artifactId": "dataset_456",
  "evidenceId": "audit_123",
  "status": "verified",
  "scope": "Q1 customer data only"
}

That is much more reliable than:

"I collected the data. You can continue."

Observability Is Not Optional

Once AI workflows become operational, observability becomes foundational.

A useful trace should show:

What the model intended
What context it received
What action it requested
Whether permission was granted
What tool executed
What state changed
What evidence was recorded
What the agent claimed afterward

Without this, teams end up debugging through transcripts and guesses.

That does not scale.

Traditional logs tell you that something happened.

AI workflow observability needs to explain why something happened, what the model believed, what the runtime allowed, and what actually executed.

That means observability must include both:

Reasoning traces
Runtime evidence

One without the other is incomplete.

The Architecture Pattern

A production AI workflow should not be one big prompt chain.

It should look more like this:

User Request
     ↓
Intent Resolution
     ↓
Context Assembly
     ↓
Model Reasoning
     ↓
Action Request
     ↓
Permission Check
     ↓
Tool Execution
     ↓
Evidence Record
     ↓
State Checkpoint
     ↓
Agent Summary
     ↓
Verification / Escalation

The model is still important.

But it is no longer responsible for everything.

It reasons inside a system that manages boundaries, execution, and recovery.

That is the shift.
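The flow above can be sketched as a list of stages run by a small loop that snapshots state after each one. This is only an illustration: `runPipeline`, the `Stage` shape, and the deletion rule in the permission stage are assumptions, and every other stage is stubbed to succeed.

```typescript
interface WorkflowState {
  request: string;
  completed: string[];
  haltedAt: string | null;
}

interface Stage {
  name: string;
  // Returns false to stop the workflow at this stage.
  run: (state: WorkflowState) => boolean;
}

function runPipeline(stages: Stage[], request: string) {
  const state: WorkflowState = { request, completed: [], haltedAt: null };
  const checkpoints: WorkflowState[] = [];
  for (const stage of stages) {
    if (stage.run(state)) {
      state.completed.push(stage.name);
    } else {
      state.haltedAt = stage.name;
    }
    // Snapshot after every stage so a stopped run can be inspected later.
    checkpoints.push({ ...state, completed: [...state.completed] });
    if (state.haltedAt !== null) break;
  }
  return { state, checkpoints };
}

// Stub for stages that always succeed in this sketch.
const pass = (name: string): Stage => ({ name, run: () => true });

const stages: Stage[] = [
  pass("Intent Resolution"),
  pass("Context Assembly"),
  pass("Model Reasoning"),
  pass("Action Request"),
  // Hypothetical policy: this runtime refuses any request that mentions deletion.
  { name: "Permission Check", run: (s) => !s.request.toLowerCase().includes("delete") },
  pass("Tool Execution"),
  pass("Evidence Record"),
];

const blocked = runPipeline(stages, "Delete the customer record");
console.log(blocked.state.haltedAt, blocked.state.completed);
```

The model is one stage among several. The loop, not the model, decides whether the next stage runs.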

AI Workflows Are Operational Systems

When an AI workflow becomes part of a business process, it needs the same engineering discipline as any other operational system.

It needs:

Clear inputs
Explicit state
Bounded execution
Permission checks
Retry policies
Failure handling
Observability
Audit trails
Recovery paths
Verification gates

This is not bureaucracy.

This is what makes the workflow dependable.

The more responsibility we give AI Agents, the more important the surrounding system becomes.
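As one small example of that discipline, here is a sketch of bounded execution with a retry policy that logs every attempt. `runWithRetry` and its types are hypothetical, and the operation is synchronous to keep the example short; a real runtime would more likely await the call and back off between attempts.

```typescript
interface AttemptLog {
  attempt: number;
  ok: boolean;
  error?: string;
}

interface RetryOutcome<T> {
  ok: boolean;
  value?: T;
  attempts: AttemptLog[];
}

// Runs the operation at most maxAttempts times and keeps a log of every attempt.
function runWithRetry<T>(operation: () => T, maxAttempts: number): RetryOutcome<T> {
  const attempts: AttemptLog[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const value = operation();
      attempts.push({ attempt, ok: true });
      return { ok: true, value, attempts };
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      attempts.push({ attempt, ok: false, error });
    }
  }
  // The retry limit was reached: report failure instead of pretending success.
  return { ok: false, attempts };
}

// Simulated flaky tool: fails twice, then succeeds.
let calls = 0;
const flakyTool = (): string => {
  calls += 1;
  if (calls < 3) throw new Error(`timeout on call ${calls}`);
  return "ticket_created";
};

const outcome = runWithRetry(flakyTool, 5);
console.log(outcome.ok, outcome.value, outcome.attempts.length);
```

The attempt log is what turns a retry from a hidden loop into something an audit trail can show.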

Conclusion

Building an AI workflow is easy.

Making it reliable is the hard part.

The future of AI agents will not be won only by better prompts or bigger models.

It will be won by better runtime architecture.

Prompts guide reasoning.

But reliable AI workflows need:

Checkpoints
Retries
Permissions
Execution Boundaries
Observability
Audit Trails
Evidence
Recovery

That is why production AI workflows are not just prompt engineering.

They are Systems Engineering.

Evidence Beats Claims: Why AI Agents Need Runtime Proof

Glendel Joubert Fyne Acosta — Tue, 26 May 2026 01:49:17 +0000

An AI agent saying "I did it" is not proof that anything happened.

"I sent the email."

"I updated the database."

"I escalated the issue."

"I published the post."

Those are claims.

In a real production system, claims are not enough.

If an AI Agent performs work that affects users, data, money, operations, or another system, the runtime must be able to prove what actually happened.

The Problem

Language models are very good at producing confident completion statements.

That confidence can be useful in conversation, but dangerous in infrastructure.

A model may say:

"Done, I sent the email."

But what actually happened?

Maybe the email tool succeeded.

Maybe the permission check failed.

Maybe the API timed out.

Maybe the retry limit was reached.

Maybe the tool was never called.

Maybe the model only assumed the action happened because that was the most natural response in the conversation.

This is one of the most important differences between a demo and a production AI system.

In a demo, the agent saying "done" feels impressive.

In production, "done" needs evidence.

Model Claims vs Runtime Evidence

A model claim is what the AI says happened.

Runtime evidence is what the system can prove happened.

Those are not the same thing.

A serious AI Agent system should separate them clearly.

For example:

const response = await agent.run("Send the customer follow-up email");

// This is only a model-generated claim
console.log(response.message);
// "Done, I sent the email."

That message is not enough.

A production system should also have a runtime record:

{
  "actor": "support-agent-01",
  "tool": "send_email",
  "permission": "granted",
  "input": {
    "to": "customer@example.com",
    "template": "follow_up"
  },
  "status": "success",
  "providerMessageId": "msg_abc123",
  "timestamp": "2026-05-25T14:32:10Z",
  "auditId": "audit_789"
}

Now the system can answer:

who requested the action
which tool executed
whether permission was granted
what input was used
what result came back
when it happened
what audit record proves it

That is the difference between trusting text and trusting infrastructure.
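A minimal sketch of that separation, assuming runtime records shaped like a subset of the JSON above: the runtime classifies a model claim by looking for a matching record, and never reads the model's wording. `checkClaim` and the status labels are illustrative names.

```typescript
// Subset of the runtime record shown above.
interface RuntimeRecord {
  tool: string;
  permission: "granted" | "denied";
  status: "success" | "failure";
  auditId: string;
}

type ClaimStatus = "supported" | "contradicted" | "unsupported";

// Classifies a claim such as "I sent the email" using runtime records only.
function checkClaim(claimedTool: string, records: RuntimeRecord[]): ClaimStatus {
  const matching = records.filter((r) => r.tool === claimedTool);
  if (matching.length === 0) return "unsupported";
  const succeeded = matching.some(
    (r) => r.permission === "granted" && r.status === "success",
  );
  return succeeded ? "supported" : "contradicted";
}

const records: RuntimeRecord[] = [
  { tool: "send_email", permission: "granted", status: "success", auditId: "audit_789" },
  { tool: "update_record", permission: "denied", status: "failure", auditId: "audit_790" },
];

console.log(checkClaim("send_email", records));
console.log(checkClaim("update_record", records));
console.log(checkClaim("escalate_issue", records));
```

The three results differ on purpose: a record that shows failure contradicts the claim, while a missing record only leaves it unsupported.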

Why This Matters

AI agents are moving from chat interfaces into real workflows.

They are not just answering questions anymore.

They are:

sending messages
creating tickets
updating records
calling APIs
reading customer data
triggering workflows
escalating incidents
generating reports

Once agents do real work, organizations need more than fluent responses.

They need accountability.

If an agent says it updated a record, the system must prove the record was updated.

If an agent says it escalated a complaint, the system must prove the escalation happened.

If an agent says it sent a message, the system must prove the message was sent.

Otherwise, the organization is not operating on evidence.

It is operating on model confidence.

The Dangerous Failure Mode

The dangerous failure mode is not always a loud crash.

Sometimes the agent simply says:

"Done."

And everyone believes it.

But behind the scenes:

the tool failed
the permission was denied
the payload was invalid
the API returned an error
the action was never executed
the workflow stopped halfway

This creates a false sense of completion.

The user thinks the task is finished.

The agent thinks the task is finished.

The organization acts as if the task is finished.

But the runtime has no proof that the task ever happened.

That is a serious reliability problem.

Multi-Agent Systems Make This Worse

This problem becomes even more dangerous in Multi-Agent Systems (MAS).

Imagine this flow:

Agent A says it collected the customer data.
Agent B uses that claim to draft a response.
Agent C sends the response.
Agent D summarizes the case as resolved.

If Agent A's claim was unsupported, the entire chain becomes unreliable.

One unsupported claim becomes another agent's input.

The error propagates across the system.

By the end, the final result may look coherent, but the foundation is wrong.

This is why Multi-Agent Systems need runtime evidence at every important boundary.

Agents should not pass around unsupported claims as if they were facts.

They should pass around claims connected to evidence.
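One way to do that is a gate on the receiving side that refuses any handoff whose evidence cannot be found. The sketch below assumes a hypothetical `Handoff` shape and an in-memory set standing in for the evidence store.

```typescript
// Hypothetical handoff shape: the claim travels with a reference to its evidence.
type HandoffStatus = "verified" | "unverified" | "failed";

interface Handoff {
  from: string;
  to: string;
  task: string;
  artifactId: string;
  evidenceId: string;
  status: HandoffStatus;
}

interface GateResult {
  accepted: boolean;
  reason: string;
}

// Illustrative stand-in for an evidence store, keyed by evidence ID.
const knownEvidence: ReadonlySet<string> = new Set(["audit_123"]);

// Validation gate the receiving side runs before it uses the artifact.
function acceptHandoff(handoff: Handoff, receiver: string): GateResult {
  if (handoff.to !== receiver) {
    return { accepted: false, reason: "handoff addressed to a different agent" };
  }
  if (handoff.status !== "verified") {
    return { accepted: false, reason: `status is ${handoff.status}` };
  }
  if (!knownEvidence.has(handoff.evidenceId)) {
    return { accepted: false, reason: "evidence record not found" };
  }
  return { accepted: true, reason: "ok" };
}

const handoff: Handoff = {
  from: "agent-a",
  to: "agent-b",
  task: "collect_customer_data",
  artifactId: "dataset_456",
  evidenceId: "audit_123",
  status: "verified",
};

console.log(acceptHandoff(handoff, "agent-b"));
// An unverified handoff is rejected no matter how confident the sender sounds.
console.log(acceptHandoff({ ...handoff, status: "unverified" }, "agent-b"));
```

With a gate like this at each boundary, an unsupported claim stops at the first handoff and does not reach the agents further down the chain.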

The Architecture Pattern

A better architecture separates three things:

Reasoning
Execution
Evidence

The AI agent reasons about what should happen.

The runtime executes the action if it is allowed.

The system records evidence of what actually happened.

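// One agent step: reason, check permission, execute, then record evidence.
async function runAgentStep(agent, runtime, context) {
  // Reasoning: the model proposes an action but does not run it.
  const request = await agent.decideNextAction(context);

  const permission = runtime.permissions.verify(request);

  if (!permission.allowed) {
    // Denied requests are recorded too, so the refusal itself is auditable.
    return runtime.recordDeniedAction(request, permission.reason);
  }

  // Execution: only the runtime touches the tool.
  const result = await runtime.execute(request);

  const evidence = await runtime.recordEvidence({
    request,
    permission,
    result
  });

  // The summary carries the evidence ID, so the claim can be traced back.
  return agent.summarizeResult({
    result,
    evidenceId: evidence.id
  });
}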

The model can still explain the result to the user.

But the explanation is now grounded in runtime evidence.

The agent is no longer saying:

"Trust me."

It is saying:

"Here is what happened, and here is the evidence."

What Runtime Evidence Should Include

At minimum, an evidence record should capture:

actor identity
requested action
permission result
tool or workflow used
input payload
execution result
timestamps
failure reason, if any
retry attempts
audit/reference ID

For sensitive systems, it may also include:

approval record
policy version
resource identifier
provider response metadata
verification result
human review state

The goal is not to create bureaucracy.

The goal is to make AI work inspectable, debuggable, and trustworthy.
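As a sketch, the two lists above map onto a record type with required fields plus a few optional ones for sensitive systems. The field names, the `recordEvidence` helper, and the counter-based audit ID are assumptions made for illustration.

```typescript
interface EvidenceRecord {
  auditId: string;
  actor: string;
  action: string;
  permission: "granted" | "denied";
  tool: string;
  input: Record<string, unknown>;
  status: "success" | "failure";
  failureReason: string | null;
  retryAttempts: number;
  startedAt: string;
  finishedAt: string;
  // Optional fields for sensitive systems.
  approvalId?: string;
  policyVersion?: string;
  resourceId?: string;
}

type EvidenceInput = Omit<EvidenceRecord, "auditId" | "startedAt" | "finishedAt">;

// Illustrative ID source; a real system would use a durable, unique ID.
let nextAuditNumber = 1;

function recordEvidence(input: EvidenceInput, startedAt: Date, finishedAt: Date): EvidenceRecord {
  if (input.status === "failure" && input.failureReason === null) {
    throw new Error("a failed action must carry a failure reason");
  }
  const record: EvidenceRecord = {
    ...input,
    auditId: `audit_${nextAuditNumber++}`,
    startedAt: startedAt.toISOString(),
    finishedAt: finishedAt.toISOString(),
  };
  // Frozen so later code cannot quietly rewrite what happened.
  return Object.freeze(record);
}

const evidence = recordEvidence(
  {
    actor: "support-agent-01",
    action: "send follow-up email",
    permission: "granted",
    tool: "send_email",
    input: { to: "customer@example.com", template: "follow_up" },
    status: "success",
    failureReason: null,
    retryAttempts: 0,
    policyVersion: "2026-05",
  },
  new Date("2026-05-25T14:32:09Z"),
  new Date("2026-05-25T14:32:10Z"),
);

console.log(evidence.auditId, evidence.status, evidence.finishedAt);
```

Rejecting a failure without a reason is a small design choice, but it keeps the record useful for debugging rather than merely present.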

The Rule

A simple rule for production AI systems:

If the agent claims an external action happened, the runtime should have evidence.

No evidence means the claim is unsupported.

Not necessarily false.

But unsupported.

That distinction matters.

An unsupported claim should not be treated as completed work.

It should trigger one of three outcomes:

retry
verify
escalate

That is how AI Systems become operationally reliable.
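Here is one possible way to encode the rule as a small decision function. The order of the checks (verify before retry) and the inputs in `ClaimContext` are assumptions of this sketch, not a prescribed policy.

```typescript
type NextStep = "accept" | "verify" | "retry" | "escalate";

// Hypothetical facts the runtime knows about a claimed external action.
interface ClaimContext {
  hasEvidence: boolean;
  canVerifyExternally: boolean;
  safeToRetry: boolean;
  attemptsUsed: number;
  maxAttempts: number;
}

function nextStepFor(claim: ClaimContext): NextStep {
  // Evidence exists: the claim counts as completed work.
  if (claim.hasEvidence) return "accept";
  // Check the target system first, so a retry does not repeat an action that did happen.
  if (claim.canVerifyExternally) return "verify";
  if (claim.safeToRetry && claim.attemptsUsed < claim.maxAttempts) return "retry";
  // Nothing automatic is left: hand the case to a human.
  return "escalate";
}

const unsupported: ClaimContext = {
  hasEvidence: false,
  canVerifyExternally: false,
  safeToRetry: true,
  attemptsUsed: 1,
  maxAttempts: 3,
};

console.log(nextStepFor(unsupported));
console.log(nextStepFor({ ...unsupported, attemptsUsed: 3 }));
```

Verifying before retrying matters for actions like sending an email, where a blind retry could deliver the same message twice.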

From Chatbots To Organizational AI Systems

Chatbots can get away with claims.

Organizational AI Systems cannot.

When AI agents operate inside real organizations, they need:

permissions
execution boundaries
audit trails
verification gates
runtime evidence
human escalation paths

The more responsibility we give agents, the more important evidence becomes.

A confident answer is not enough.

A fluent summary is not enough.

A completed-looking workflow is not enough.

The system must be able to prove what happened.

Conclusion

AI Agents should reason.

Runtimes should execute.

Evidence should prove.

That separation is what turns agent behavior from conversation into infrastructure.

If we want AI Agents to operate inside real organizations, we need to stop treating model-generated claims as proof of completed work.

Evidence beats claims.

AI Agents Don't Have Permissions — Runtimes Do

Glendel Joubert Fyne Acosta — Thu, 21 May 2026 00:45:50 +0000

Right now, many Multi-Agent Systems are implementing permissions inside prompts.

"You may access the CRM."

"You are allowed to send emails."

"Do not modify billing records."

This is becoming one of the biggest architectural mistakes in modern AI systems.

A prompt is not a security boundary.

Language models are probabilistic reasoning engines. They are excellent at planning, summarizing, reasoning, and interpreting context. But they are not deterministic authorization systems.

If your application's security model depends on the LLM consistently obeying natural-language instructions, your system does not actually have runtime governance.

It has probabilistic behavior shaping.

The Problem

I keep seeing architectures where the agent itself is expected to decide whether an action is allowed:

// Anti-pattern: the model itself is asked to make the authorization decision.
const prompt = `
You are an AI Agent.

The user wants to delete a customer record.
The user's permissions are: ${permissions}.

Should you allow this action?
`;

const decision = await llm.generate(prompt);

This looks flexible.

It also creates several major problems immediately:

prompts can conflict
context windows drift
instructions can be overridden
reasoning can hallucinate
behavior changes across models
authorization becomes non-auditable

And once you move into multi-agent systems, the situation becomes even worse.

One agent may interpret permissions differently from another. Handoffs may lose constraints. Context summarization may remove critical security instructions entirely.

Now your governance model depends on whether probabilistic agents correctly preserve natural-language policy across multiple reasoning steps.

That is not enterprise architecture.

The Runtime Must Enforce Boundaries

The AI should reason about what needs to happen.

The runtime should determine whether it is allowed to happen.

This distinction is critical.

A governed architecture should look more like this:

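// The runtime, not the model, decides whether the requested action may run.
if (!runtime.permissions.verify({
  agent: agentId,
  action: "delete_customer",
  resource: customerId
})) {
  // Denied requests never reach the executor.
  throw new UnauthorizedError();
}

// Execution happens only after the deterministic check has passed.
const result = await executor.deleteCustomer(customerId);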

The LLM may request the action.

The deterministic runtime decides whether execution is permitted.

That is a real security boundary.
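For illustration, here is a self-contained version of that check with a static policy table behind it. The agent IDs, the `policy` map, and the audit log format are hypothetical, and the delete itself is stubbed.

```typescript
type Action = "read_customer" | "delete_customer" | "send_email";

interface PermissionRequest {
  agent: string;
  action: Action;
  resource: string;
}

class UnauthorizedError extends Error {
  constructor(request: PermissionRequest) {
    super(`${request.agent} may not ${request.action} on ${request.resource}`);
    this.name = "UnauthorizedError";
  }
}

// Hypothetical static policy: which actions each agent may perform.
const policy = new Map<string, ReadonlySet<Action>>([
  ["support-agent-01", new Set<Action>(["read_customer", "send_email"])],
  ["admin-agent-01", new Set<Action>(["read_customer", "delete_customer"])],
]);

const auditLog: string[] = [];

// Deterministic check: the same request always gets the same answer, and every decision is logged.
function verify(request: PermissionRequest): boolean {
  const allowed = policy.get(request.agent)?.has(request.action) ?? false;
  const verdict = allowed ? "ALLOW" : "DENY";
  auditLog.push(`${request.agent} ${request.action} ${request.resource} ${verdict}`);
  return allowed;
}

function deleteCustomer(agent: string, customerId: string): string {
  const request: PermissionRequest = { agent, action: "delete_customer", resource: customerId };
  if (!verify(request)) throw new UnauthorizedError(request);
  // Stub for the real delete.
  return `deleted ${customerId}`;
}

console.log(deleteCustomer("admin-agent-01", "cust_42"));
try {
  deleteCustomer("support-agent-01", "cust_42");
} catch (err) {
  console.log((err as Error).message);
}
console.log(auditLog);
```

No prompt appears anywhere in this path, so nothing the model says can change the outcome of the check.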

The Cognitive Layer vs The Deterministic Layer

I think a lot of confusion in the current AI ecosystem comes from mixing these two responsibilities together.

The Cognitive Layer:

reasoning
planning
interpretation
summarization
decision support

The Deterministic Layer:

permissions
schema validation
execution
workflows
retries
state transitions
audit logs
policy enforcement

The AI should not govern itself.

The framework must govern the AI.

Why This Matters More In Multi-Agent Systems

Single-agent systems are already difficult to debug.

Multi-agent systems amplify the problem dramatically:

context drift compounds
handoff failures appear
responsibilities blur
state becomes harder to trace
authorization assumptions leak between agents

Without deterministic runtime enforcement, governance becomes almost impossible to reason about operationally.

And when systems fail, the incident report becomes:

"The model ignored the instruction."

No serious infrastructure team will accept that as a security architecture.

Organizational AI Systems Need Runtime Authority

As AI systems move into real organizations, governance stops being optional.

Enterprises need:

auditability
traceability
deterministic enforcement
runtime evidence
policy validation
observability

Natural-language instructions alone cannot provide these guarantees.

The future of Organizational AI Systems will depend on separating probabilistic reasoning from deterministic governance.

AI Agents should reason.

Runtimes should govern.

The AI FOMO Trap: Why your Multi-Agent System is brittle (and how to fix it)

Glendel Joubert Fyne Acosta — Thu, 14 May 2026 00:45:30 +0000

A developer on Reddit recently told me: "Companies right now are risking the LLM-led parts of their architecture due to FOMO. We'll see how far they get".

He is absolutely right. Fear Of Missing Out is driving engineering teams to ship "Autonomous Agents" at breakneck speed. But in the rush to production, we are abandoning 20 years of established software engineering principles.

We are letting probabilistic models control deterministic runtimes.

If you are routing network traffic, validating data schemas, or checking user permissions using an LLM prompt, you are not building a resilient system. You are building a fragile prompt-chain wrapped in hope. When it fails (and it will), it will be slow, expensive, and completely un-auditable. InfoSec won't accept "the model hallucinated the auth check" as a valid incident report.

The Cure: The Manager-Executor Pattern

To build enterprise-grade Multi-Agent Systems, we must separate the Cognitive from the Deterministic.

1. The Manager (Probabilistic): This is the LLM. Its only job is to reason, plan, and analyze context. It decides what needs to be done. It does not execute code. It does not manage its own memory. It requests actions via strict JSON schemas.
2. The Executor (Deterministic): This is your runtime framework. It acts as the boundary. When the Manager requests an action, the Executor:

Verifies the agent's permissions.
Validates the payload against a strict schema.
Checks the token/cost budget.
Executes the code (API call, DB write).
Returns the exact result to the Manager.
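The five Executor steps can be sketched as a single function. This is a minimal illustration under stated assumptions: `createExecutor`, the request shape, and the cost units are hypothetical, the only supported payload is an email, and the send itself is stubbed.

```typescript
// What the Manager (the LLM) is allowed to produce: a request, never an execution.
interface ActionRequest {
  agentId: string;
  action: string;
  payload: unknown;
  estimatedCost: number;
}

type ExecutorResult =
  | { ok: true; result: string }
  | { ok: false; stage: "permission" | "schema" | "budget"; reason: string };

interface EmailPayload {
  to: string;
  template: string;
}

// Hand-written schema check; a schema library could fill the same role.
function isEmailPayload(value: unknown): value is EmailPayload {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return typeof candidate["to"] === "string" && typeof candidate["template"] === "string";
}

function createExecutor(permissions: Map<string, Set<string>>, budget: number) {
  let spent = 0;
  return function handle(request: ActionRequest): ExecutorResult {
    // 1. Verify the agent's permissions.
    if (!permissions.get(request.agentId)?.has(request.action)) {
      return { ok: false, stage: "permission", reason: "action not permitted for this agent" };
    }
    // 2. Validate the payload against a strict schema.
    const payload = request.payload;
    if (!isEmailPayload(payload)) {
      return { ok: false, stage: "schema", reason: "payload does not match the email schema" };
    }
    // 3. Check the cost budget.
    if (spent + request.estimatedCost > budget) {
      return { ok: false, stage: "budget", reason: "budget exceeded" };
    }
    // 4. Execute (stubbed here) and account for the cost.
    spent += request.estimatedCost;
    // 5. Return the exact result to the Manager.
    return { ok: true, result: `sent ${payload.template} to ${payload.to}` };
  };
}

const handle = createExecutor(new Map([["support-agent-01", new Set(["send_email"])]]), 10);

console.log(handle({
  agentId: "support-agent-01",
  action: "send_email",
  payload: { to: "customer@example.com", template: "follow_up" },
  estimatedCost: 4,
}));
console.log(handle({
  agentId: "support-agent-01",
  action: "delete_customer",
  payload: {},
  estimatedCost: 1,
}));
```

Every refusal names the stage that stopped it, which gives the Manager something concrete to reason about and gives the audit trail something concrete to store.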

The Framework Controls the AI

The fundamental shift required in MAS architecture is understanding that the framework must control the LLM; the LLM must never control the framework.

Right now, developers are having to build these custom state machines and validation layers from scratch because popular frameworks default to LLM-routing. It's time we standardize this. We need "A Real Framework" for Multi-Agent Systems—a framework that enforces the Manager-Executor pattern by default.

Stop relying on vibes-based engineering. Let's get back to rigorous software architecture.

The Token Waste Problem: Why your AI Agents shouldn't evaluate permissions

Glendel Joubert Fyne Acosta — Sat, 09 May 2026 00:47:02 +0000

We are burning millions of API tokens on problems that if statements solved 20 years ago.

I speak with developers building Multi-Agent Systems (MAS) every day, and I keep seeing the same massive architectural anti-pattern: Routing everything through the AI model.

Need to check an agent's permissions? "Ask the LLM."
Need to route a message? "Ask the LLM."
Need to validate a data schema? "Ask the LLM."

Language models are extraordinary reasoning engines. But they are also expensive, probabilistic, and relatively slow. If a problem has a deterministic, correct answer (like checking an access policy), it should be evaluated by runtime code, not guessed by a neural network.

The Anti-Pattern

Instead of doing this (Probabilistic):

// BAD: Asking the LLM to check permissions
const prompt = `You are an agent. The user wants to delete a file. 
Here are their permissions: ${user.permissions}. 
Should you allow it?`;

const decision = await llm.generate(prompt);

The Solution

We need to get back to doing this (Deterministic):

// GOOD: Let code handle policy, let AI handle reasoning
if (!user.hasPermission('delete_file')) {
  throw new Error("Unauthorized"); 
}

// Only call the LLM for actual cognitive tasks
const plan = await agent.reasonAboutFile(file);

AI should decide what to do. Deterministic code should execute it and enforce the boundaries.

Are we forgetting basic software engineering principles just because AI is exciting? The MAS space doesn't need more wrappers; we need standardized frameworks that enforce these boundaries. Let's get back to building solid infrastructure.