Daniel R. Foster for Tactas AI

Building AI Agents That Actually Execute Workflows, Not Just Answer Questions

Most AI agent demos look impressive because the environment is clean.

A user asks something. The model understands it. The agent calls a tool. A nice response comes back.

It feels like automation.

But in a real business, that is usually the easiest part.

The harder question is not:

Can the AI call an API?

The harder question is:

Should the AI call this API, with this data, under this condition, for this customer, at this point in the workflow, without creating operational risk?

That is where most “AI agents” start to break.

A chatbot can answer a question.

A workflow agent has to make progress through a business process.

Those are different systems.


Businesses do not run on prompts

A lot of AI products still assume the main interface is conversation.

The user types:

“Can this customer get a refund?”

The AI responds:

“Based on the policy, this customer may be eligible.”

That is useful, but it is not execution.

In a real company, the refund process probably involves several steps:

  • Check the order status
  • Verify payment settlement
  • Read the refund policy
  • Check customer history
  • Detect abuse patterns
  • Calculate refund amount
  • Decide whether approval is required
  • Create an internal note
  • Trigger the refund
  • Notify the customer
  • Update CRM
  • Log the decision

That workflow may touch Stripe, HubSpot, Zendesk, Postgres, internal admin tools, Slack, and a finance dashboard.

The AI response is only one small part.

The actual value is in moving the process forward safely.


A chatbot explains. A workflow agent executes.

A chatbot is optimized for interaction.

A workflow agent is optimized for controlled execution.

The difference is not only technical. It changes the entire architecture.

A basic chatbot usually looks like this:

User message  -> LLM  -> Response

A tool-using chatbot looks like this:

User message  -> LLM  -> Tool call  -> Tool result  -> Response

A real workflow agent needs something closer to this:

Trigger  -> Intent classification  -> Context retrieval  -> Policy/rule evaluation  -> Risk scoring  -> Action planning  -> Permission check  -> Tool execution  -> State update  -> Audit log  -> Human approval if needed  -> Final user/internal response

The LLM is still useful, but it is not the whole system.

The core system is the execution layer around the LLM.
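As a rough sketch of that idea, the execution layer can be modeled as a pipeline of explicit stages, where the model is just one stage and any stage can halt the run. Everything below (the `Step` type, the stub stages, their outputs) is hypothetical illustration, not a real framework:

```typescript
// Hypothetical sketch: the LLM is one stage inside an explicit pipeline.
// Any stage can halt the workflow; no stage calls external tools directly.
type StepResult =
  | { status: "continue"; data: Record<string, unknown> }
  | { status: "halt"; reason: string };

type Step = (ctx: Record<string, unknown>) => StepResult;

function runPipeline(steps: Step[], initial: Record<string, unknown>) {
  let ctx = { ...initial };
  for (const step of steps) {
    const result = step(ctx);
    if (result.status === "halt") {
      return { completed: false, reason: result.reason, ctx };
    }
    ctx = { ...ctx, ...result.data };
  }
  return { completed: true, reason: null, ctx };
}

// Stub stages standing in for a real classifier and a real rules engine.
const classifyIntent: Step = () =>
  ({ status: "continue", data: { intent: "billing_duplicate_charge" } });

const evaluatePolicy: Step = (ctx) =>
  ctx.intent === "billing_duplicate_charge"
    ? { status: "continue", data: { allowed: true } }
    : { status: "halt", reason: "no_policy_match" };
```

The point of the shape is that stopping is a first-class outcome: a halted run carries a reason instead of a generated answer.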


Tool calling is not workflow automation

Tool calling is often treated as the definition of an AI agent.

That is a weak definition.

If an LLM can call refundCustomer() or updateTicketStatus(), that does not mean the business process is automated.

It only means the model has access to a dangerous button.

The real work is everything around that button.

For example, imagine this tool:

type RefundCustomerInput = {
  customerId: string;
  orderId: string;
  amount: number;
  reason: string;
};

async function refundCustomer(input: RefundCustomerInput) {
  // Create refund through payment provider
}

The tool is simple.

The workflow is not.

Before calling it, the system needs to know:

| Question | Why it matters |
| --- | --- |
| Is the order refundable? | Prevents policy violations |
| Has the payment settled? | Avoids invalid refund attempts |
| Is the request inside the refund window? | Enforces business rules |
| Has this customer requested too many refunds? | Detects abuse |
| Is the amount above the auto-approval threshold? | Controls financial risk |
| Is there an open chargeback? | Prevents duplicate financial actions |
| Is the product category excluded? | Handles special cases |
| Was partial credit already issued? | Avoids over-refunding |

The tool call is one line.

The decision boundary is the hard part.
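One way to make that boundary concrete is a gate function that runs every check before the tool is ever invoked. This is a minimal sketch, assuming hypothetical field names; in practice each field would be fetched from the billing provider, order system, and risk service:

```typescript
// Hypothetical pre-flight checks around refundCustomer().
// Every field stands in for data fetched from real systems.
type RefundGateContext = {
  orderRefundable: boolean;
  paymentSettled: boolean;
  daysSincePurchase: number;
  priorRefundCount: number;
  amountUsd: number;
  openChargeback: boolean;
};

type GateResult = { allowed: boolean; failures: string[] };

function checkRefundGates(ctx: RefundGateContext): GateResult {
  const failures: string[] = [];
  if (!ctx.orderRefundable) failures.push("order_not_refundable");
  if (!ctx.paymentSettled) failures.push("payment_not_settled");
  if (ctx.daysSincePurchase > 14) failures.push("outside_refund_window");
  if (ctx.priorRefundCount >= 3) failures.push("refund_abuse_suspected");
  if (ctx.amountUsd > 100) failures.push("above_auto_approval_threshold");
  if (ctx.openChargeback) failures.push("open_chargeback");
  // Collect every failure rather than stopping at the first one,
  // so the audit trail shows the full picture.
  return { allowed: failures.length === 0, failures };
}
```

Only when `allowed` is true does the system reach for the dangerous button, and the failure list becomes part of the log either way.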


The agent should not be the source of truth

One common mistake is letting the LLM “decide” business policy from natural language alone.

That is risky.

The agent should understand the request, summarize context, and propose next actions.

But business rules should live outside the model where possible.

For example:

refund_policy:
  auto_approve:
    max_amount_usd: 100
    within_days: 14
    customer_risk_score_below: 0.35
  require_human_approval:
    amount_above_usd: 100
    customer_has_prior_refunds: true
    fraud_signal_detected: true
    open_chargeback: true
  never_refund_automatically:
    product_type:
      - enterprise_contract
      - custom_service
    account_status:
      - suspended_for_abuse

A better pattern is:

| Component | Role |
| --- | --- |
| LLM | Reasoning and language interface |
| Rules engine | Business constraints |
| Tools | Execution |
| Workflow engine | State and orchestration |
| Human operator | Approval for risk |
| Logs | Accountability |

The LLM can interpret messy inputs.

The rules engine should decide what is allowed.

This keeps the AI useful without giving it unchecked authority.
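To make this concrete, here is a hedged sketch of what a rules engine evaluating the refund policy above might look like in code. The thresholds mirror the example YAML values, and every name here is illustrative:

```typescript
// Hypothetical rules-engine sketch mirroring the refund_policy example.
// Thresholds are the example values from the YAML, not recommendations.
type RefundRequest = {
  amountUsd: number;
  daysSincePurchase: number;
  riskScore: number;
  hasPriorRefunds: boolean;
  fraudSignal: boolean;
  openChargeback: boolean;
  productType: string;
};

type Decision = "auto_approve" | "require_human_approval" | "never_auto_refund";

const NEVER_AUTO = new Set(["enterprise_contract", "custom_service"]);

function evaluateRefundPolicy(req: RefundRequest): Decision {
  // Hard exclusions first: these never refund automatically.
  if (NEVER_AUTO.has(req.productType)) return "never_auto_refund";
  // Any single risk condition routes the request to a human.
  if (
    req.amountUsd > 100 ||
    req.hasPriorRefunds ||
    req.fraudSignal ||
    req.openChargeback
  ) {
    return "require_human_approval";
  }
  // Auto-approval only inside the narrow safe window.
  if (req.daysSincePurchase <= 14 && req.riskScore < 0.35) {
    return "auto_approve";
  }
  return "require_human_approval";
}
```

Note the default: anything that does not match the auto-approve conditions falls through to human approval, never to silent execution.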


Example: support ticket automation

Consider a SaaS company receiving this support ticket:

“I was charged twice this month. Please refund the duplicate payment.”

A chatbot might say:

“I’m sorry about that. I can help check your billing.”

A workflow agent should do more.

It should run a controlled process:

  1. Identify customer account from ticket
  2. Retrieve invoices from billing provider
  3. Check duplicate payment condition
  4. Compare invoice IDs, timestamps, and payment status
  5. Check refund eligibility
  6. Determine whether the amount is within auto-refund limit
  7. Draft customer response
  8. If safe, initiate refund
  9. Add internal note to ticket
  10. Update ticket status
  11. Log every action

This is what the agent execution might look like internally:

{
  "workflow": "duplicate_payment_refund",
  "ticket_id": "TCK-48291",
  "customer_id": "cus_10928",
  "detected_intent": "billing_duplicate_charge",
  "confidence": 0.91,
  "retrieved_context": {
    "invoices_found": 2,
    "duplicate_payment_detected": true,
    "payment_provider": "stripe",
    "amount_usd": 49
  },
  "policy_result": {
    "auto_refund_allowed": true,
    "requires_approval": false,
    "reason": "Duplicate charge confirmed; amount below threshold"
  },
  "planned_actions": [
    "create_refund",
    "add_ticket_note",
    "send_customer_reply",
    "close_ticket"
  ]
}

The important part is not that the AI wrote a polite answer.

The important part is that the system verified the condition, checked policy, executed the refund, and left an audit trail.


Production agents need state

A lot of agent demos are stateless.

They run once, return an answer, and disappear.

Business workflows are rarely like that.

A real workflow may pause, wait for data, require approval, retry later, or resume after a human decision.

Example:

Ticket received  -> Agent checks account  -> Missing invoice data  -> Agent requests billing sync  -> Workflow pauses  -> Billing sync completes  -> Agent resumes  -> Refund requires approval  -> Manager approves  -> Agent executes refund  -> Ticket closes

This requires workflow state.

Not just chat history.

Chat history tells you what was said.

Workflow state tells you what has been done, what is pending, what failed, and what can happen next.

A useful workflow state might include:

{
  "workflow_id": "wf_78321",
  "current_step": "waiting_for_manager_approval",
  "completed_steps": [
    "classify_ticket",
    "retrieve_customer",
    "check_invoice",
    "evaluate_policy"
  ],
  "pending_actions": [
    "manager_approval"
  ],
  "blocked_reason": "refund_amount_above_auto_threshold",
  "next_allowed_actions": [
    "approve_refund",
    "reject_refund",
    "request_more_info"
  ]
}

Without state, the agent is just improvising every time.

That is not acceptable for operations.
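One way to stop the improvising is an explicit transition table: the agent can only take actions that are legal from the current step. A minimal sketch, with a hypothetical subset of steps and actions:

```typescript
// Hypothetical sketch: workflow state plus an explicit transition table,
// so the agent can only take actions allowed in the current step.
type WorkflowState = {
  workflowId: string;
  currentStep: string;
  completedSteps: string[];
};

// Which actions are legal from each step (illustrative subset).
const TRANSITIONS: Record<string, string[]> = {
  waiting_for_manager_approval: [
    "approve_refund",
    "reject_refund",
    "request_more_info",
  ],
  approved: ["execute_refund"],
  executed: ["close_ticket"],
};

function applyAction(
  state: WorkflowState,
  action: string,
  nextStep: string,
): WorkflowState {
  const allowed = TRANSITIONS[state.currentStep] ?? [];
  if (!allowed.includes(action)) {
    // Illegal transitions fail loudly instead of being improvised around.
    throw new Error(`Action ${action} not allowed in step ${state.currentStep}`);
  }
  return {
    ...state,
    currentStep: nextStep,
    completedSteps: [...state.completedSteps, state.currentStep],
  };
}
```

The state object, not the chat transcript, is what the workflow engine persists and resumes from.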


Human approval is not a weakness

There is a strange assumption in AI automation that full autonomy is always the goal.

In enterprise workflows, that is often wrong.

The goal is not to remove humans from every decision.

The goal is to remove unnecessary human labor while keeping humans in control of high-risk decisions.

Actions that often need approval:

  • Refunds above a threshold
  • Account suspension
  • Contract changes
  • Production infrastructure changes
  • High-value credit issuance
  • Data deletion
  • Security exceptions
  • Legal or compliance-sensitive responses

A practical approval flow may look like this:

Agent prepares recommendation  -> Shows evidence  -> Lists proposed action  -> Explains policy match  -> Human approves/rejects  -> Agent executes approved action  -> System logs approver and timestamp

This design is much safer than asking the AI to act autonomously in every case.

It also fits how businesses already operate.

Most companies do not want magic.

They want reliable delegation.
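The approval flow above can be sketched as a routing function: risky actions go to an approval queue with their evidence attached, everything else executes. The action names and the $100 threshold below are illustrative, echoing the earlier examples:

```typescript
// Hypothetical approval gate: risky actions are queued for a human
// together with the evidence, instead of being executed directly.
type PlannedAction = { name: string; amountUsd?: number };

type RoutedAction =
  | { route: "execute"; action: PlannedAction }
  | { route: "approval_queue"; action: PlannedAction; evidence: string };

// Actions that always require a human, regardless of amount (illustrative).
const ALWAYS_APPROVE = new Set(["suspend_account", "delete_data"]);

function routeAction(action: PlannedAction, evidence: string): RoutedAction {
  const overThreshold = (action.amountUsd ?? 0) > 100;
  if (ALWAYS_APPROVE.has(action.name) || overThreshold) {
    return { route: "approval_queue", action, evidence };
  }
  return { route: "execute", action };
}
```

The evidence string travels with the queued action so the approver sees the agent's reasoning, not just a yes/no button.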


Agents need permission boundaries

A real AI agent should not have access to everything.

It should have scoped permissions based on role, workflow, and risk level.

For example:

Support Refund Agent

Can:

  • Read customer profile
  • Read invoice history
  • Create refund below $100
  • Draft ticket replies
  • Add internal notes

Cannot:

  • Refund above $100 without approval
  • Delete customer data
  • Modify subscription plans
  • Issue account credits manually
  • Access unrelated customer records

This matters because LLMs are probabilistic.

Even if the model is good, the system should assume mistakes can happen.

Good architecture limits the blast radius.

The agent should not be trusted because it is intelligent.

It should be trusted because the system around it constrains what it can do.
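The scoped permissions for the Support Refund Agent above could be expressed as data and checked before every tool call. A minimal sketch with hypothetical tool names:

```typescript
// Hypothetical permission scope, checked before every tool call.
type Permission = { tool: string; maxAmountUsd?: number };

// The Support Refund Agent's scope from the example above.
const SUPPORT_REFUND_AGENT: Permission[] = [
  { tool: "read_customer_profile" },
  { tool: "read_invoice_history" },
  { tool: "create_refund", maxAmountUsd: 100 },
  { tool: "draft_ticket_reply" },
  { tool: "add_ticket_note" },
];

function isPermitted(
  perms: Permission[],
  tool: string,
  amountUsd = 0,
): boolean {
  const p = perms.find((x) => x.tool === tool);
  // Default-deny: anything not explicitly granted is refused.
  if (!p) return false;
  if (p.maxAmountUsd !== undefined && amountUsd > p.maxAmountUsd) return false;
  return true;
}
```

Default-deny is the important design choice: the agent's capabilities are an allowlist, not "everything minus a blocklist."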


Logs are part of the product

For internal AI systems, audit logs are not optional.

If an agent performs an action, the company needs to know:

  • What triggered the workflow?
  • What data did the agent retrieve?
  • What did the agent decide?
  • Which policy was applied?
  • Which tools were called?
  • What changed in external systems?
  • Did a human approve it?
  • What was the final outcome?

A weak log looks like this:

Agent refunded customer.

A useful audit log looks like this:

{
  "event": "refund_created",
  "workflow_id": "wf_78321",
  "actor": "ai_agent:support_refund_agent",
  "human_approver": null,
  "customer_id": "cus_10928",
  "amount_usd": 49,
  "policy_version": "refund_policy_v3",
  "reason": "duplicate_payment_confirmed",
  "tool_called": "stripe.refunds.create",
  "external_reference": "re_12345",
  "timestamp": "2026-05-07T10:24:18Z"
}

This is important for debugging, compliance, customer disputes, and internal trust.

If people cannot inspect what the agent did, they will not trust it with real work.


The agent must handle failure like software, not like a chatbot

APIs fail.

Databases return incomplete records.

CRMs contain stale data.

Customers provide wrong information.

Internal tools time out.

A workflow agent needs explicit failure handling.

Example:

If payment provider timeout:
  -> retry twice
  -> if still failing, pause workflow
  -> notify support operator
  -> do not tell customer refund was created

If customer account not found:
  -> ask for additional identifier
  -> do not guess account

If policy conflict detected:
  -> escalate to human
  -> include conflict explanation

This is where many AI systems become dangerous.

When an LLM lacks data, it may still produce a confident answer.

A workflow system should do the opposite.

When required data is missing, it should stop.


A better architecture for operational agents

A practical enterprise agent architecture might look like this:

                 ┌────────────────────┐
                 │ Incoming request    │
                 │ ticket/email/event  │
                 └─────────┬──────────┘
                           │
                           ▼
                 ┌────────────────────┐
                 │ Intent classifier   │
                 └─────────┬──────────┘
                           │
                           ▼
                 ┌────────────────────┐
                 │ Context retrieval   │
                 │ CRM, DB, API, docs  │
                 └─────────┬──────────┘
                           │
                           ▼
                 ┌────────────────────┐
                 │ Policy evaluation   │
                 │ rules, SOPs, limits │
                 └─────────┬──────────┘
                           │
                           ▼
                 ┌────────────────────┐
                 │ Action planner      │
                 └─────────┬──────────┘
                           │
              ┌────────────┴────────────┐
              ▼                         ▼
    ┌──────────────────┐       ┌──────────────────┐
    │ Safe execution   │       │ Human approval   │
    │ allowed actions  │       │ risky actions    │
    └────────┬─────────┘       └────────┬─────────┘
             │                          │
             ▼                          ▼
    ┌────────────────────────────────────────┐
    │ Tool execution + state update + logs   │
    └────────────────────────────────────────┘

This is less flashy than a demo agent.

But it is much closer to what companies actually need.


The most important design principle

The most useful AI agents are not the ones with the most autonomy.

They are the ones with the clearest operating boundaries.

A good workflow agent should know:

  • What it is allowed to do
  • What it is not allowed to do
  • What data it needs before acting
  • When it must ask for approval
  • How to recover from failure
  • How to explain what happened

That is the difference between a toy agent and an operational system.


Where AI agents are actually useful today

The best use cases are usually not broad, open-ended jobs.

They are narrow, repetitive workflows with clear rules and frequent human review.

| Workflow | Why it works well |
| --- | --- |
| Customer support triage | High volume, repeatable patterns |
| Refund and billing workflows | Clear rules, measurable outcomes |
| Lead qualification | Structured enrichment and scoring |
| CRM enrichment | Repetitive data work |
| Internal report generation | Recurring operational summaries |
| Compliance checklist review | Rule-based review process |
| Logistics exception handling | Many edge cases but clear escalation paths |
| Hosting abuse investigation | Requires evidence gathering and action control |
| Finance back-office operations | Repetitive but sensitive |
| Vendor onboarding | Multi-step process with approvals |

These workflows are valuable because they are repetitive but not always simple.

They require judgment, but also structure.

That is exactly where AI can help.

Not by replacing the entire operation.

By handling the repetitive execution path and escalating the exceptions.


A simple test for whether an AI agent is real

When evaluating an AI agent, ask these questions:

  • Can it complete a workflow across multiple systems?
  • Can it preserve state between steps?
  • Can it enforce business rules?
  • Can it refuse unsafe actions?
  • Can it ask for human approval?
  • Can it recover when a tool fails?
  • Can it produce an audit trail?
  • Can a human understand why it acted?

If the answer to several of these is no, it may still be a useful chatbot.

But it is not yet an operational agent.


Final thought

The future of enterprise AI is not just better answers.

It is better execution.

The companies that get the most value from AI will not be the ones that simply add a chatbot to their website.

They will be the ones that connect AI to real workflows:

  • safely
  • observably
  • with business rules
  • with approval gates
  • with system integrations
  • with clear ownership

AI agents should not just talk about work.

They should help move work through the system.

That is the real shift.


At Tactas AI, we build custom AI agents for business operations — agents that connect with internal tools, follow business rules, execute approved actions, and keep human oversight where it matters.
