Originally published on AIdeazz — cross-posted here with canonical link.
Most definitions of AI agents are either too academic ("autonomous entities that perceive and act") or too marketing-driven ("ChatGPT but with buttons!"). After building and deploying multiple agent systems in production — from Telegram bots handling thousands of daily queries to multi-agent workflows on Oracle Cloud — I've developed a more practical definition.
The Core Loop: Observe → Decide → Act → Persist
An AI agent is software that runs this loop continuously:
- Observe: Gather context from multiple sources (messages, APIs, database state, other agents)
- Decide: Use LLMs or other models to determine next actions based on observations
- Act: Execute those actions (send messages, call APIs, update databases, trigger workflows)
- Persist: Maintain state across interactions for continuity
This differs fundamentally from chat-only wrappers that simply pipe user input to an LLM and return the response. The key distinction? Agents do things beyond returning text.
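The loop above can be sketched in a few lines. This is a minimal illustration, not the production implementation; `llm_decide` and `run_action` are hypothetical stand-ins for a real model call and real side effects:

```python
class Agent:
    def __init__(self, store):
        self.store = store  # any key-value persistence layer (dict here)

    def handle(self, user_id: str, message: str) -> str:
        # Observe: merge the new message with persisted context
        state = self.store.get(user_id, {"history": []})
        state["history"].append({"role": "user", "text": message})

        # Decide: ask the model for a structured next action
        action = self.llm_decide(state)

        # Act: execute side effects, not just text generation
        result = self.run_action(action)

        # Persist: save state so follow-up messages have continuity
        state["last_action"] = action
        self.store[user_id] = state
        return result

    def llm_decide(self, state):
        # Placeholder: a real system would call an LLM here
        return {"type": "reply", "text": f"Echo: {state['history'][-1]['text']}"}

    def run_action(self, action):
        return action.get("text", "")
```

The point is structural: the LLM call is one step inside a loop that also reads and writes durable state.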
Here's a concrete example from one of our production systems: A user messages our Telegram agent asking about their order status. The agent:
- Observes the message and retrieves the user's ID from Telegram metadata
- Decides it needs order information, checking its permission scope
- Acts by querying our Oracle database for order records
- Persists the interaction context for follow-up questions
The user might then ask "Can you expedite shipping?" The agent already has the order context, checks business rules, and could actually modify the order priority in the system — not just explain how expediting works.
Architecture Patterns That Actually Scale
When people ask "what is an AI agent," they often imagine a single monolithic system. In practice, production agents are usually specialized components in larger systems.
Our typical architecture:
- Router Agent: Analyzes incoming requests and delegates to specialized agents
- Task Agents: Handle specific domains (customer service, data analysis, document processing)
- Coordinator Agent: Manages multi-step workflows across task agents
- Monitor Agent: Tracks system health and intervenes when needed
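The router pattern at the top of this list can be sketched as follows. The intent labels and agent registry are illustrative assumptions, and the keyword check stands in for a fast LLM classification call:

```python
def classify_intent(message: str) -> str:
    # Placeholder for a fast classification call (e.g. a small, low-latency model)
    if "order" in message.lower():
        return "customer_service"
    if "report" in message.lower():
        return "data_analysis"
    return "general"

# Registry mapping intents to specialized task agents (stubs here)
AGENTS = {
    "customer_service": lambda msg: f"[orders agent] handling: {msg}",
    "data_analysis":    lambda msg: f"[analysis agent] handling: {msg}",
    "general":          lambda msg: f"[general agent] handling: {msg}",
}

def route(message: str) -> str:
    return AGENTS[classify_intent(message)](message)
```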
This isn't arbitrary complexity. Single-agent systems hit walls quickly:
- Context windows overflow with state management
- One prompt template can't handle diverse tasks well
- Failure in one area cascades everywhere
- Testing becomes impossible
With specialized agents, each maintains focused state, uses optimized prompts, and fails independently. Our router agent uses Groq for fast classification (under 200ms), then delegates complex reasoning to Claude-3.5-Sonnet agents that might take 2-3 seconds but handle nuanced tasks.
The tradeoff: coordination overhead. Agents must pass context efficiently, handle partial failures, and avoid infinite delegation loops. We've found explicit state schemas (JSON) work better than natural language for inter-agent communication.
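An explicit handoff schema of the kind described above might look like this. The field names are illustrative, not our exact production schema; the idea is that a malformed handoff fails loudly at validation instead of being misread downstream:

```python
import json

# Fields every inter-agent handoff must carry (illustrative)
HANDOFF_FIELDS = {"request_id", "from_agent", "to_agent", "task", "context"}

def validate_handoff(raw: str) -> dict:
    msg = json.loads(raw)
    missing = HANDOFF_FIELDS - msg.keys()
    if missing:
        raise ValueError(f"handoff missing fields: {sorted(missing)}")
    return msg

handoff = json.dumps({
    "request_id": "req-42",
    "from_agent": "router",
    "to_agent": "orders",
    "task": "lookup_order_status",
    "context": {"user_id": "user_123", "order_id": "A-1001"},
})
msg = validate_handoff(handoff)
```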
State Management: The Difference Between Toy and Production
Chat wrappers maintain conversation history. Agents maintain operational state. This distinction separates demos from production systems.
Consider our WhatsApp scheduling agent:
User: "Book a meeting with Sarah next Tuesday at 2pm"
Agent: "I'll check availability..."
[Agent queries calendar API, finds conflict]
Agent: "Sarah has a conflict at 2pm. She's free at 10am or 3pm. Which works?"
User: "Actually make it Wednesday instead"
A chat wrapper would need the entire conversation to understand "it" refers to the meeting. Our agent maintains structured state:
```json
{
  "pending_action": "schedule_meeting",
  "participants": ["user_123", "sarah_456"],
  "proposed_time": null,
  "constraints": ["tuesday_2pm_conflict"],
  "alternatives": ["tuesday_10am", "tuesday_3pm"]
}
```
When the user says "Wednesday instead," the agent updates the specific field rather than reinterpreting everything. This approach:
- Reduces token usage by 60-80%
- Enables resuming conversations after connection drops
- Allows other agents to understand ongoing tasks
- Supports compliance logging
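A targeted update against the structured state shown above might look like this sketch. The helper and the update shape (`{"day": "wednesday"}`, as extracted by the LLM) are assumptions for illustration:

```python
state = {
    "pending_action": "schedule_meeting",
    "participants": ["user_123", "sarah_456"],
    "proposed_time": None,
    "constraints": ["tuesday_2pm_conflict"],
    "alternatives": ["tuesday_10am", "tuesday_3pm"],
}

def apply_update(state: dict, update: dict) -> dict:
    """Merge an LLM-extracted update like {'day': 'wednesday'} into state."""
    if "day" in update:
        # The old day's conflicts and alternatives no longer apply
        state["constraints"] = []
        state["alternatives"] = []
        state["proposed_time"] = f"{update['day']}_2pm"  # keep the requested hour
    return state

state = apply_update(state, {"day": "wednesday"})
```

Only the fields touched by the user's correction change; everything else carries over untouched.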
We persist this state in Oracle Autonomous JSON Database, which handles concurrent updates and provides ACID guarantees — critical when multiple agents might update the same user's state.
The LLM Is Just One Component
A common misconception: AI agents are just LLMs with extra steps. In our production systems, LLM calls represent maybe 20-30% of execution time.
Real agent loop timing breakdown (WhatsApp order processing):
- Message decryption/validation: 50ms
- State retrieval from cache/DB: 80-120ms
- LLM decision call: 200-800ms (Groq) or 1-3s (Claude)
- Business logic validation: 100ms
- External API calls: 200ms-5s
- State persistence: 50-100ms
- Response encryption/sending: 50ms
The LLM provides reasoning capability, but agents need:
- Message queue integration for reliable async processing
- Caching layers to avoid repeated LLM calls
- Circuit breakers for external dependencies
- Retry logic with exponential backoff
- Monitoring/alerting for production issues
Our Oracle Cloud infrastructure provides much of this — OCI Queue service for message handling, Redis for caching, and built-in monitoring. But even with good infrastructure, agent complexity lives in orchestration logic, not LLM prompts.
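Retry with exponential backoff, one item in the list above, can be as small as this wrapper. The attempt counts and delays are illustrative defaults, not our production values:

```python
import random
import time

def with_retries(fn, attempts=4, base_delay=0.5):
    """Call fn, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            # Backoff doubles each attempt (0.5s, 1s, 2s, ...) plus jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

In a real system you would catch a narrower exception type and log each retry for the monitoring layer.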
Multi-Agent Coordination: Beyond Pipeline Thinking
Single agents hit complexity ceilings. Multi-agent systems break through but introduce coordination challenges. The naive approach — agents calling each other like functions — creates brittle pipelines.
Our production pattern uses event-driven coordination:
- Agents publish state changes to a shared event bus
- Other agents subscribe to relevant event types
- A coordinator agent manages workflow-level concerns
- Each agent maintains local state, syncing through events
Example from our document processing system:
- Upload agent receives PDF, publishes `document_received`
- OCR agent subscribes to this event, processes, publishes `text_extracted`
- Classification agent takes extracted text, publishes `document_classified`
- Multiple specialized agents handle different document types in parallel
This architecture handles partial failures gracefully. If the classification agent crashes, documents queue up but OCR continues. When classification recovers, it processes the backlog without losing work.
The challenge: event ordering and consistency. We use Oracle Streaming Service with exactly-once semantics and explicit sequence numbers. Agents checkpoint their progress, enabling clean recovery from any point.
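A toy in-process version of this publish/subscribe flow looks like the sketch below. A production system would put a durable stream (such as Oracle Streaming Service) where `EventBus` is, but the shape of the coordination is the same:

```python
from collections import defaultdict

class EventBus:
    """Minimal in-memory pub/sub bus; illustration only, not durable."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self.subscribers[event_type]:
            handler(payload)

bus = EventBus()
processed = []

# OCR agent reacts to uploads, then publishes its own event
bus.subscribe("document_received",
              lambda doc: bus.publish("text_extracted", {"doc": doc, "text": "..."}))
# Downstream agent reacts to extracted text
bus.subscribe("text_extracted", lambda evt: processed.append(evt["doc"]))

bus.publish("document_received", "invoice.pdf")
```

Because agents only depend on event types, adding a new consumer (say, an archival agent) means one more `subscribe` call, with no changes to the publishers.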
Common Failure Modes and Mitigation
Production agents fail in predictable ways:
Context corruption: Agents lose track of conversation state or mix up users. Mitigation: Explicit session IDs, regular state validation, automatic reset after idle periods.
Infinite loops: Agent A delegates to Agent B who delegates back to Agent A. Mitigation: Loop detection via request IDs, maximum delegation depth, circuit breakers on agent communication.
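Delegation-depth protection can be a counter carried on the request itself. The field names and the depth limit below are illustrative:

```python
MAX_DELEGATION_DEPTH = 5  # illustrative limit

def delegate(request: dict, target_agent) -> dict:
    """Forward a request to another agent, refusing runaway chains."""
    depth = request.get("delegation_depth", 0)
    if depth >= MAX_DELEGATION_DEPTH:
        raise RuntimeError(
            f"delegation depth {depth} exceeded; possible A->B->A loop "
            f"(request {request.get('request_id')})"
        )
    forwarded = {**request, "delegation_depth": depth + 1}
    return target_agent(forwarded)
```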
Prompt injection: Users manipulate agents into unintended behaviors. Mitigation: Structured output formats (JSON schema validation), privilege separation between agents, sanitization of user inputs before prompt inclusion.
Cost explosion: Recursive agent calls or large context accumulation. Mitigation: Token budgets per interaction, cost attribution to user/session, automatic fallback to cheaper models.
Latency cascades: Slow responses compound in multi-agent flows. Mitigation: Aggressive timeouts, parallel processing where possible, caching of intermediate results.
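A token-budget guard against the cost-explosion mode above can be a few lines of session accounting. The budget figure and model names here are placeholders:

```python
BUDGET_TOKENS = 20_000  # illustrative per-session budget

def choose_model(session: dict, estimated_tokens: int) -> str:
    """Fall back to a cheaper model once the session budget is at risk."""
    used = session.get("tokens_used", 0)
    if used + estimated_tokens > BUDGET_TOKENS:
        return "cheap-model"      # automatic fallback
    return "frontier-model"

def record_usage(session: dict, tokens: int) -> None:
    session["tokens_used"] = session.get("tokens_used", 0) + tokens
```

Attributing `tokens_used` to a session (or user) is also what makes per-interaction cost monitoring possible.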
Our monitoring tracks these failure modes explicitly. We measure not just success rates but loop detection triggers, context reset frequency, and cost per interaction. This data drives architectural improvements.
Building Your First Production Agent
Start with a single, focused agent that does one thing well. Our recommendation based on what works:
- Choose a narrow scope: "Schedule meetings via Telegram" beats "AI assistant for everything"
- Design state schema first: What must persist between interactions?
- Build the non-LLM parts: Message handling, state storage, external integrations
- Add LLM decision-making: Start with simple prompts, iterate based on real usage
- Implement monitoring early: Track decisions, not just errors
Avoid these common mistakes:
- Starting with multi-agent systems before mastering single agents
- Putting everything in prompts instead of code
- Ignoring state management until it's too late
- Optimizing LLM costs before validating the use case
The Reality of Production AI Agents
What is an AI agent? It's not a chatbot with API access or an LLM with a for-loop. It's a system that observes its environment, makes decisions, takes actions, and maintains state — reliably, at scale, with production constraints.
Our agents handle thousands of daily interactions across Telegram and WhatsApp, coordinate complex workflows, and integrate with enterprise systems. They're not perfect. They require constant monitoring, regular prompt tuning, and occasional manual intervention. But they deliver real value by automating tasks that would otherwise require human attention.
The key insight from running these systems: agents are software engineering challenges more than AI challenges. The LLM provides reasoning capability, but production value comes from reliable orchestration, state management, and system integration. Focus there, and agents become powerful tools rather than impressive demos.