Building AI Agents That Actually Execute Workflows, Not Just Answer Questions

Daniel R. Foster — Thu, 07 May 2026 03:04:55 +0000


Most AI agent demos look impressive because the environment is clean.

A user asks something. The model understands it. The agent calls a tool. A nice response comes back.

It feels like automation.

But in a real business, that is usually the easiest part.

The harder question is not:

Can the AI call an API?

The harder question is:

Should the AI call this API, with this data, under this condition, for this customer, at this point in the workflow, without creating operational risk?

That is where most “AI agents” start to break.

A chatbot can answer a question.

A workflow agent has to make progress through a business process.

Those are different systems.

Businesses do not run on prompts

A lot of AI products still assume the main interface is conversation.

The user types:

“Can this customer get a refund?”

The AI responds:

“Based on the policy, this customer may be eligible.”

That is useful, but it is not execution.

In a real company, the refund process probably involves several steps:

Check the order status
Verify payment settlement
Read the refund policy
Check customer history
Detect abuse patterns
Calculate refund amount
Decide whether approval is required
Create an internal note
Trigger the refund
Notify the customer
Update CRM
Log the decision

That workflow may touch Stripe, HubSpot, Zendesk, Postgres, internal admin tools, Slack, and a finance dashboard.

The AI response is only one small part.

The actual value is in moving the process forward safely.

A chatbot explains. A workflow agent executes.

A chatbot is optimized for interaction.

A workflow agent is optimized for controlled execution.

The difference is not only technical. It changes the entire architecture.

A basic chatbot usually looks like this:

User message  -> LLM  -> Response

A tool-using chatbot looks like this:

User message  -> LLM  -> Tool call  -> Tool result  -> Response

A real workflow agent needs something closer to this:

Trigger  -> Intent classification  -> Context retrieval  -> Policy/rule evaluation  -> Risk scoring  -> Action planning  -> Permission check  -> Human approval if needed  -> Tool execution  -> State update  -> Audit log  -> Final user/internal response

The LLM is still useful, but it is not the whole system.

The core system is the execution layer around the LLM.

Tool calling is not workflow automation

Tool calling is often treated as the definition of an AI agent.

That is a weak definition.

If an LLM can call refundCustomer() or updateTicketStatus(), that does not mean the business process is automated.

It only means the model has access to a dangerous button.

The real work is everything around that button.

For example, imagine this tool:

type RefundCustomerInput = {
  customerId: string;
  orderId: string;
  amount: number;
  reason: string;
};

async function refundCustomer(input: RefundCustomerInput) {
  // Create refund through payment provider
}

The tool is simple.

The workflow is not.

Before calling it, the system needs to know:

Question	Why it matters
Is the order refundable?	Prevents policy violations
Has the payment settled?	Avoids invalid refund attempts
Is the request inside the refund window?	Enforces business rules
Has this customer requested too many refunds?	Detects abuse
Is the amount above the auto-approval threshold?	Controls financial risk
Is there an open chargeback?	Prevents duplicate financial actions
Is the product category excluded?	Handles special cases
Was partial credit already issued?	Avoids over-refunding

The tool call is one line.

The decision boundary is the hard part.
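
The precondition questions in the table above can be gathered into one guard that runs before the tool is ever invoked. This is a minimal sketch, not a real payment integration: the `RefundContext` fields, the 14-day window, the recent-refund limit, and the `findRefundBlockers` name are all assumptions made for illustration. The auto-approval threshold is left out on purpose, since it is a policy decision rather than a precondition.

```typescript
// Hypothetical snapshot of facts gathered from billing and CRM before acting.
type RefundContext = {
  orderRefundable: boolean;
  paymentSettled: boolean;
  daysSincePurchase: number;
  refundsInLast90Days: number;
  openChargeback: boolean;
  excludedCategory: boolean;
  alreadyCreditedUsd: number;
  orderTotalUsd: number;
};

// Assumed limits for illustration; real values belong in policy config.
const REFUND_WINDOW_DAYS = 14;
const MAX_RECENT_REFUNDS = 2;

// Returns every reason the refund must not proceed automatically.
// An empty array means no precondition blocks the refund.
function findRefundBlockers(ctx: RefundContext, amountUsd: number): string[] {
  const blockers: string[] = [];
  if (!ctx.orderRefundable) blockers.push("order_not_refundable");
  if (!ctx.paymentSettled) blockers.push("payment_not_settled");
  if (ctx.daysSincePurchase > REFUND_WINDOW_DAYS) blockers.push("outside_refund_window");
  if (ctx.refundsInLast90Days > MAX_RECENT_REFUNDS) blockers.push("too_many_recent_refunds");
  if (ctx.openChargeback) blockers.push("open_chargeback");
  if (ctx.excludedCategory) blockers.push("excluded_product_category");
  if (amountUsd + ctx.alreadyCreditedUsd > ctx.orderTotalUsd) blockers.push("would_over_refund");
  return blockers;
}
```

Returning all blockers instead of stopping at the first one gives a human reviewer the full picture when the request is escalated.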

The agent should not be the source of truth

One common mistake is letting the LLM “decide” business policy from natural language alone.

That is risky.

The agent should understand the request, summarize context, and propose next actions.

But business rules should live outside the model where possible.

For example:

refund_policy:
  auto_approve:
    max_amount_usd: 100
    within_days: 14
    customer_risk_score_below: 0.35
  require_human_approval:
    amount_above_usd: 100
    customer_has_prior_refunds: true
    fraud_signal_detected: true
    open_chargeback: true
  never_refund_automatically:
    product_type:
      - enterprise_contract
      - custom_service
    account_status:
      - suspended_for_abuse

A better pattern is:

Component	Role
LLM	Reasoning and language interface
Rules engine	Business constraints
Tools	Execution
Workflow engine	State and orchestration
Human operator	Approval for risk
Logs	Accountability

The LLM can interpret messy inputs.

The rules engine should decide what is allowed.

This keeps the AI useful without giving it unchecked authority.
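
As a rough illustration of rules living outside the model, the policy above can be evaluated by plain deterministic code. The sketch below makes several assumptions that the YAML does not state: the `require_human_approval` conditions are treated as "any of these", anything that falls outside the auto-approve limits defaults to human review, and names such as `evaluateRefundPolicy` and `RefundRequest` are hypothetical.

```typescript
// Policy expressed as data, mirroring the YAML above. In practice it would be
// loaded from versioned config rather than hard-coded.
const refundPolicy = {
  autoApprove: { maxAmountUsd: 100, withinDays: 14, customerRiskScoreBelow: 0.35 },
  neverRefundAutomatically: {
    productTypes: ["enterprise_contract", "custom_service"],
    accountStatuses: ["suspended_for_abuse"],
  },
};

type RefundRequest = {
  amountUsd: number;
  daysSincePurchase: number;
  customerRiskScore: number;
  hasPriorRefunds: boolean;
  fraudSignalDetected: boolean;
  openChargeback: boolean;
  productType: string;
  accountStatus: string;
};

type PolicyDecision = "auto_approve" | "require_human_approval" | "never_automatic";

function evaluateRefundPolicy(req: RefundRequest): PolicyDecision {
  const never = refundPolicy.neverRefundAutomatically;
  if (
    never.productTypes.indexOf(req.productType) !== -1 ||
    never.accountStatuses.indexOf(req.accountStatus) !== -1
  ) {
    return "never_automatic";
  }

  const auto = refundPolicy.autoApprove;
  // Assumption: any single approval trigger is enough to require a human.
  const needsHuman =
    req.amountUsd > auto.maxAmountUsd ||
    req.hasPriorRefunds ||
    req.fraudSignalDetected ||
    req.openChargeback;
  if (needsHuman) return "require_human_approval";

  const withinAutoLimits =
    req.daysSincePurchase <= auto.withinDays &&
    req.customerRiskScore < auto.customerRiskScoreBelow;
  // Assumption: when in doubt, fall back to human review, never to auto-approval.
  return withinAutoLimits ? "auto_approve" : "require_human_approval";
}
```

The LLM never appears in this function. It can fill in a `RefundRequest` from a messy ticket, but the decision comes from the policy data.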

Example: support ticket automation

Consider a SaaS company receiving this support ticket:

“I was charged twice this month. Please refund the duplicate payment.”

A chatbot might say:

“I’m sorry about that. I can help check your billing.”

A workflow agent should do more.

It should run a controlled process:

Identify customer account from ticket
Retrieve invoices from billing provider
Check duplicate payment condition
Compare invoice IDs, timestamps, and payment status
Check refund eligibility
Determine whether the amount is within auto-refund limit
Draft customer response
If safe, initiate refund
Add internal note to ticket
Update ticket status
Log every action

This is what the agent execution might look like internally:

{
  "workflow": "duplicate_payment_refund",
  "ticket_id": "TCK-48291",
  "customer_id": "cus_10928",
  "detected_intent": "billing_duplicate_charge",
  "confidence": 0.91,
  "retrieved_context": {
    "invoices_found": 2,
    "duplicate_payment_detected": true,
    "payment_provider": "stripe",
    "amount_usd": 49
  },
  "policy_result": {
    "auto_refund_allowed": true,
    "requires_approval": false,
    "reason": "Duplicate charge confirmed; amount below threshold"
  },
  "planned_actions": [
    "create_refund",
    "add_ticket_note",
    "send_customer_reply",
    "close_ticket"
  ]
}

The important part is not that the AI wrote a polite answer.

The important part is that the system verified the condition, checked policy, executed the refund, and left an audit trail.
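
Steps such as "check duplicate payment condition" and "compare invoice IDs, timestamps, and payment status" are ordinary deterministic code, not something to leave to the model. Here is one possible sketch. The `Invoice` shape is a simplified, hypothetical record (real billing APIs return far more fields), and the 24-hour window is an assumed heuristic, not a rule from any provider.

```typescript
// Hypothetical, simplified invoice record for a single customer.
type Invoice = {
  id: string;
  subscriptionId: string;
  amountUsd: number;
  status: "paid" | "open" | "void" | "refunded";
  paidAt: string; // ISO 8601 timestamp
};

// Assumed heuristic: two paid invoices for the same subscription and amount
// within this window are treated as a suspected duplicate charge.
const DUPLICATE_WINDOW_MS = 24 * 60 * 60 * 1000;

// Returns [original, suspectedDuplicate], or null if nothing looks duplicated.
function findDuplicateCharge(invoices: Invoice[]): [Invoice, Invoice] | null {
  const paid = invoices
    .filter((inv) => inv.status === "paid")
    .sort((a, b) => Date.parse(a.paidAt) - Date.parse(b.paidAt));

  for (let i = 0; i < paid.length; i++) {
    for (let j = i + 1; j < paid.length; j++) {
      const earlier = paid[i];
      const later = paid[j];
      const gapMs = Date.parse(later.paidAt) - Date.parse(earlier.paidAt);
      // The list is sorted, so every later invoice is even further away.
      if (gapMs > DUPLICATE_WINDOW_MS) break;
      if (
        earlier.subscriptionId === later.subscriptionId &&
        earlier.amountUsd === later.amountUsd
      ) {
        return [earlier, later];
      }
    }
  }
  return null;
}
```

A result of `null` should stop the refund path and route the ticket to a person, since the customer's claim could not be confirmed from billing data.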

Production agents need state

A lot of agent demos are stateless.

They run once, return an answer, and disappear.

Business workflows are rarely like that.

A real workflow may pause, wait for data, require approval, retry later, or resume after a human decision.

Example:

Ticket received  -> Agent checks account  -> Missing invoice data  -> Agent requests billing sync  -> Workflow pauses  -> Billing sync completes  -> Agent resumes  -> Refund requires approval  -> Manager approves  -> Agent executes refund  -> Ticket closes

This requires workflow state.

Not just chat history.

Chat history tells you what was said.

Workflow state tells you what has been done, what is pending, what failed, and what can happen next.

A useful workflow state might include:

{
  "workflow_id": "wf_78321",
  "current_step": "waiting_for_manager_approval",
  "completed_steps": [
    "classify_ticket",
    "retrieve_customer",
    "check_invoice",
    "evaluate_policy"
  ],
  "pending_actions": [
    "manager_approval"
  ],
  "blocked_reason": "refund_amount_above_auto_threshold",
  "next_allowed_actions": [
    "approve_refund",
    "reject_refund",
    "request_more_info"
  ]
}

Without state, the agent is just improvising every time.

That is not acceptable for operations.
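
One way to keep the agent from improvising is to derive the allowed next actions from an explicit transition table instead of from the model. The sketch below is a reduced version of the state shown above: the step and action names are illustrative assumptions, and a production system would persist this state in a workflow engine or database.

```typescript
type Step =
  | "evaluate_policy"
  | "waiting_for_manager_approval"
  | "execute_refund"
  | "closed"
  | "rejected";

type Action =
  | "policy_allows_auto_refund"
  | "policy_requires_approval"
  | "approve_refund"
  | "reject_refund"
  | "refund_succeeded";

type WorkflowState = {
  workflowId: string;
  currentStep: Step;
  completedSteps: Step[];
};

// Every legal move is listed here; anything else is rejected.
const TRANSITIONS: Record<Step, Partial<Record<Action, Step>>> = {
  evaluate_policy: {
    policy_allows_auto_refund: "execute_refund",
    policy_requires_approval: "waiting_for_manager_approval",
  },
  waiting_for_manager_approval: {
    approve_refund: "execute_refund",
    reject_refund: "rejected",
  },
  execute_refund: { refund_succeeded: "closed" },
  closed: {},
  rejected: {},
};

function nextAllowedActions(state: WorkflowState): Action[] {
  return Object.keys(TRANSITIONS[state.currentStep]) as Action[];
}

// Returns a new state; throws if the action is not legal in the current step.
function applyAction(state: WorkflowState, action: Action): WorkflowState {
  const next = TRANSITIONS[state.currentStep][action];
  if (next === undefined) {
    throw new Error(`Action "${action}" is not allowed in step "${state.currentStep}"`);
  }
  return {
    ...state,
    currentStep: next,
    completedSteps: state.completedSteps.concat(state.currentStep),
  };
}
```

With this shape, a paused workflow can be resumed days later: the stored state alone says what has happened and what may happen next.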

Human approval is not a weakness

There is a strange assumption in AI automation that full autonomy is always the goal.

In enterprise workflows, that is often wrong.

The goal is not to remove humans from every decision.

The goal is to remove unnecessary human labor while keeping humans in control of high-risk decisions.

Actions that often need approval:

Refunds above a threshold
Account suspension
Contract changes
Production infrastructure changes
High-value credit issuance
Data deletion
Security exceptions
Legal or compliance-sensitive responses

A practical approval flow may look like this:

Agent prepares recommendation  -> Shows evidence  -> Lists proposed action  -> Explains policy match  -> Human approves/rejects  -> Agent executes approved action  -> System logs approver and timestamp

This design is much safer than asking the AI to act autonomously in every case.

It also fits how businesses already operate.

Most companies do not want magic.

They want reliable delegation.
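
The approval flow above can be reduced to a small gate: the agent prepares a request, and nothing runs until a recorded human decision exists. This is only a sketch under assumed names (`ApprovalRequest`, `ApprovalDecision`, `executeIfApproved`); how the decision reaches the system, for example through a Slack button or an admin UI, is left out.

```typescript
// What the agent hands to the human reviewer.
type ApprovalRequest = {
  workflowId: string;
  proposedAction: string;
  evidence: string[];
  policyMatch: string;
};

// What the human sends back.
type ApprovalDecision = {
  approved: boolean;
  approver: string;
  decidedAt: string; // ISO 8601 timestamp
};

type ExecutionRecord = {
  workflowId: string;
  action: string;
  approver: string;
  approvedAt: string;
};

// Runs the proposed action only when an explicit approval is present.
// A missing or negative decision means the action is never executed.
function executeIfApproved(
  request: ApprovalRequest,
  decision: ApprovalDecision | null,
  run: (action: string) => void
): ExecutionRecord | null {
  if (decision === null || !decision.approved) {
    return null;
  }
  run(request.proposedAction);
  return {
    workflowId: request.workflowId,
    action: request.proposedAction,
    approver: decision.approver,
    approvedAt: decision.decidedAt,
  };
}
```

The returned record carries the approver and timestamp, so the log entry for the action can name the person who authorized it.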

Agents need permission boundaries

A real AI agent should not have access to everything.

It should have scoped permissions based on role, workflow, and risk level.

For example:

Support Refund Agent

Can:

Read customer profile
Read invoice history
Create refunds up to $100
Draft ticket replies
Add internal notes

Cannot:

Refund above $100 without approval
Delete customer data
Modify subscription plans
Issue account credits manually
Access unrelated customer records

This matters because LLMs are probabilistic.

Even if the model is good, the system should assume mistakes can happen.

Good architecture limits the blast radius.

The agent should not be trusted because it is intelligent.

It should be trusted because the system around it constrains what it can do.
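
A permission boundary like the one listed above can be enforced in code that sits between the model and every tool. The following is a sketch under assumptions: the permission strings, the `AgentScope` shape, and the `authorize` function are hypothetical, and a refund of exactly $100 is treated as within the limit.

```typescript
type Permission =
  | "customer:read"
  | "invoice:read"
  | "refund:create"
  | "ticket:draft_reply"
  | "ticket:add_note"
  | "customer:delete"
  | "subscription:modify";

type AgentScope = {
  agent: string;
  allowed: Permission[];
  maxAutoRefundUsd: number;
};

// Scope for the Support Refund Agent described above.
const supportRefundAgent: AgentScope = {
  agent: "support_refund_agent",
  allowed: ["customer:read", "invoice:read", "refund:create", "ticket:draft_reply", "ticket:add_note"],
  maxAutoRefundUsd: 100,
};

type ToolCall = { permission: Permission; customerId: string; amountUsd?: number };
type AuthResult = { allowed: boolean; reason: string };

// Checks a proposed tool call against the agent's scope and the workflow's customer.
function authorize(scope: AgentScope, call: ToolCall, workflowCustomerId: string): AuthResult {
  if (scope.allowed.indexOf(call.permission) === -1) {
    return { allowed: false, reason: "permission_not_granted" };
  }
  if (call.customerId !== workflowCustomerId) {
    return { allowed: false, reason: "customer_outside_workflow" };
  }
  if (call.permission === "refund:create") {
    // Fail closed: a refund with no amount, or above the limit, needs a human.
    if (call.amountUsd === undefined || call.amountUsd > scope.maxAutoRefundUsd) {
      return { allowed: false, reason: "refund_requires_approval" };
    }
  }
  return { allowed: true, reason: "ok" };
}
```

Because the check runs outside the model, a confused or manipulated prompt cannot widen the agent's access.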

Logs are part of the product

For internal AI systems, audit logs are not optional.

If an agent performs an action, the company needs to know:

What triggered the workflow?
What data did the agent retrieve?
What did the agent decide?
Which policy was applied?
Which tools were called?
What changed in external systems?
Did a human approve it?
What was the final outcome?

A weak log looks like this:

Agent refunded customer.

A useful audit log looks like this:

{
  "event": "refund_created",
  "workflow_id": "wf_78321",
  "actor": "ai_agent:support_refund_agent",
  "human_approver": null,
  "customer_id": "cus_10928",
  "amount_usd": 49,
  "policy_version": "refund_policy_v3",
  "reason": "duplicate_payment_confirmed",
  "tool_called": "stripe.refunds.create",
  "external_reference": "re_12345",
  "timestamp": "2026-05-07T10:24:18Z"
}

This is important for debugging, compliance, customer disputes, and internal trust.

If people cannot inspect what the agent did, they will not trust it with real work.
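
To make entries like the one above hard to skip or alter, the log can be written through a small typed, append-only interface. This is an in-memory sketch with assumed names (`AuditEvent`, `AuditLog`); a real system would write to durable storage.

```typescript
type AuditEvent = {
  event: string;
  workflowId: string;
  actor: string;
  humanApprover: string | null;
  customerId: string;
  amountUsd: number;
  policyVersion: string;
  reason: string;
  toolCalled: string;
  externalReference: string;
  timestamp: string; // ISO 8601, set by the log, not by the caller
};

class AuditLog {
  private readonly entries: AuditEvent[] = [];

  // Stamps the event, freezes it so it cannot be edited later, and stores it.
  append(event: Omit<AuditEvent, "timestamp">, now: Date = new Date()): AuditEvent {
    const entry: AuditEvent = Object.freeze({ ...event, timestamp: now.toISOString() });
    this.entries.push(entry);
    return entry;
  }

  // Returns a copy of all entries recorded for one workflow, in order.
  forWorkflow(workflowId: string): AuditEvent[] {
    return this.entries.filter((e) => e.workflowId === workflowId);
  }
}
```

Requiring every field at the type level means a tool wrapper cannot log "Agent refunded customer." without also saying which policy version and which external reference were involved.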

The agent must handle failure like software, not like a chatbot

APIs fail.

Databases return incomplete records.

CRMs contain stale data.

Customers provide wrong information.

Internal tools time out.

A workflow agent needs explicit failure handling.

Example:

If payment provider timeout:
  -> retry twice
  -> if still failing, pause workflow
  -> notify support operator
  -> do not tell customer refund was created

If customer account not found:
  -> ask for additional identifier
  -> do not guess account

If policy conflict detected:
  -> escalate to human
  -> include conflict explanation

This is where many AI systems become dangerous.

When an LLM lacks data, it may still produce a confident answer.

A workflow system should do the opposite.

When required data is missing, it should stop.
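
The failure rules above translate naturally into a pure function that maps each failure to an explicit next step. The sketch below assumes hypothetical `Failure` and `NextStep` types; the actual retrying, pausing, and notifying would be done by the workflow engine that calls it.

```typescript
type Failure =
  | { kind: "payment_provider_timeout"; retriesSoFar: number }
  | { kind: "customer_account_not_found" }
  | { kind: "policy_conflict"; explanation: string };

type NextStep =
  | { step: "retry" }
  | { step: "pause_and_notify_operator"; mayTellCustomerRefundCreated: false }
  | { step: "ask_customer_for_identifier" }
  | { step: "escalate_to_human"; explanation: string };

// Matches "retry twice" in the rules above.
const MAX_RETRIES = 2;

function decideNextStep(failure: Failure): NextStep {
  switch (failure.kind) {
    case "payment_provider_timeout":
      if (failure.retriesSoFar < MAX_RETRIES) {
        return { step: "retry" };
      }
      // Still failing after the retries: stop, and never claim success.
      return { step: "pause_and_notify_operator", mayTellCustomerRefundCreated: false };
    case "customer_account_not_found":
      // Do not guess the account; ask for more information instead.
      return { step: "ask_customer_for_identifier" };
    case "policy_conflict":
      return { step: "escalate_to_human", explanation: failure.explanation };
  }
}
```

Because `Failure` is a closed union, adding a new failure kind without handling it leaves a path with no return value, which the compiler can flag, so new failure modes are harder to forget.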

A better architecture for operational agents

A practical enterprise agent architecture might look like this:

                 ┌────────────────────┐
                 │ Incoming request    │
                 │ ticket/email/event  │
                 └─────────┬──────────┘
                           │
                           ▼
                 ┌────────────────────┐
                 │ Intent classifier   │
                 └─────────┬──────────┘
                           │
                           ▼
                 ┌────────────────────┐
                 │ Context retrieval   │
                 │ CRM, DB, API, docs  │
                 └─────────┬──────────┘
                           │
                           ▼
                 ┌────────────────────┐
                 │ Policy evaluation   │
                 │ rules, SOPs, limits │
                 └─────────┬──────────┘
                           │
                           ▼
                 ┌────────────────────┐
                 │ Action planner      │
                 └─────────┬──────────┘
                           │
              ┌────────────┴────────────┐
              ▼                         ▼
    ┌──────────────────┐       ┌──────────────────┐
    │ Safe execution   │       │ Human approval   │
    │ allowed actions  │       │ risky actions    │
    └────────┬─────────┘       └────────┬─────────┘
             │                          │
             ▼                          ▼
    ┌────────────────────────────────────────┐
    │ Tool execution + state update + logs   │
    └────────────────────────────────────────┘

This is less flashy than a demo agent.

But it is much closer to what companies actually need.
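
The diagram can be read as a chain of stages in which any stage may stop the run, for example to wait for a human. As a rough sketch, with stage names and the `TicketContext` fields invented for illustration and each stage body reduced to a stub:

```typescript
type StageResult<C> =
  | { status: "continue"; context: C }
  | { status: "halt"; reason: string; context: C };

type Stage<C> = { name: string; run: (context: C) => StageResult<C> };

type PipelineOutcome<C> = {
  completed: string[];
  haltedAt: string | null;
  reason: string | null;
  context: C;
};

// Runs stages in order and stops at the first one that halts.
function runPipeline<C>(stages: Stage<C>[], initial: C): PipelineOutcome<C> {
  let context = initial;
  const completed: string[] = [];
  for (let i = 0; i < stages.length; i++) {
    const stage = stages[i];
    const result = stage.run(context);
    context = result.context;
    if (result.status === "halt") {
      return { completed, haltedAt: stage.name, reason: result.reason, context };
    }
    completed.push(stage.name);
  }
  return { completed, haltedAt: null, reason: null, context };
}

// Hypothetical context and stub stages for a refund ticket.
type TicketContext = { amountUsd: number; intent: string | null };

const refundStages: Stage<TicketContext>[] = [
  {
    name: "classify_intent",
    run: (c) => ({ status: "continue", context: { ...c, intent: "billing_duplicate_charge" } }),
  },
  {
    name: "evaluate_policy",
    run: (c) =>
      c.amountUsd > 100
        ? { status: "halt", reason: "needs_human_approval", context: c }
        : { status: "continue", context: c },
  },
  { name: "execute_tools", run: (c) => ({ status: "continue", context: c }) },
];

const smallRefund = runPipeline(refundStages, { amountUsd: 49, intent: null });
const largeRefund = runPipeline(refundStages, { amountUsd: 400, intent: null });
```

In this sketch `smallRefund` completes all three stages, while `largeRefund` halts at `evaluate_policy` before any tool runs. The halted outcome is what a workflow engine would persist until an approver responds.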

The most important design principle

The most useful AI agents are not the ones with the most autonomy.

They are the ones with the clearest operating boundaries.

A good workflow agent should know:

What it is allowed to do
What it is not allowed to do
What data it needs before acting
When it must ask for approval
How to recover from failure
How to explain what happened

That is the difference between a toy agent and an operational system.

Where AI agents are actually useful today

The best use cases are usually not broad, open-ended jobs.

They are narrow, repetitive workflows with clear rules and frequent human review.

Workflow	Why it works well
Customer support triage	High volume, repeatable patterns
Refund and billing workflows	Clear rules, measurable outcomes
Lead qualification	Structured enrichment and scoring
CRM enrichment	Repetitive data work
Internal report generation	Recurring operational summaries
Compliance checklist review	Rule-based review process
Logistics exception handling	Many edge cases but clear escalation paths
Hosting abuse investigation	Requires evidence gathering and action control
Finance back-office operations	Repetitive but sensitive
Vendor onboarding	Multi-step process with approvals

These workflows are valuable because they are repetitive but not always simple.

They require judgment, but also structure.

That is exactly where AI can help.

Not by replacing the entire operation.

By handling the repetitive execution path and escalating the exceptions.

A simple test for whether an AI agent is real

When evaluating an AI agent, ask these questions:

Can it complete a workflow across multiple systems?
Can it preserve state between steps?
Can it enforce business rules?
Can it refuse unsafe actions?
Can it ask for human approval?
Can it recover when a tool fails?
Can it produce an audit trail?
Can a human understand why it acted?

If the answer to any of these is no, it may still be a useful chatbot.

But it is not yet an operational agent.

Final thought

The future of enterprise AI is not just better answers.

It is better execution.

The companies that get the most value from AI will not be the ones that simply add a chatbot to their website.

They will be the ones that connect AI to real workflows:

safely
observably
with business rules
with approval gates
with system integrations
with clear ownership

AI agents should not just talk about work.

They should help move work through the system.

That is the real shift.

At Tactas AI, we build custom AI agents for business operations — agents that connect with internal tools, follow business rules, execute approved actions, and keep human oversight where it matters.

DEV Community: Tactas AI
