AI agents are no longer just chatbots; they are autonomous workers running real operations. But how do we know they are doing a good job? AgentOps is the answer. In this post, I break down the three layers of AgentOps using a hospital scenario anyone can follow.
Introduction: AI Agents Have Grown Up
A few years ago, when most people heard "AI", they pictured a simple chatbot: you ask a question, you get an answer, end of story. That world is gone.
Today's AI agents can think for themselves, talk to other software systems, read and write files, call APIs, and even hand off tasks to other agents. They are starting to behave less like a search box and more like a junior employee who shows up every day, takes assignments, and tries to get things done.
This is exciting. But it raises a serious question: how do we know these "employees" are doing a good job? What happens if an agent makes a mistake — especially in a setting like a hospital where mistakes can affect a patient's life?
This is exactly the gap that AgentOps fills. This post is built on notes I took while watching an IBM Technology video on the topic. My goal is to write something that even someone who has never heard the term "AI agent" can read and walk away understanding.
So, What Is AgentOps?
Here is the simplest way to define it:
AgentOps is the discipline of monitoring, evaluating, and improving AI agents in production.
If you have a software background, you may know "DevOps". DevOps is about continuously running, watching, and improving software systems. AgentOps is the same idea, but for AI agents. The difference is that the thing you are monitoring is not a static piece of code — it's a non-deterministic, decision-making "AI worker".
AgentOps is built on three layers:
- Observability
- Evaluation
- Optimization
We will walk through each layer using one running example so the concepts don't feel abstract.
Our Running Example: Two AI Agents in a Hospital
Imagine a hospital. A doctor prescribes a new medication to a patient. Before the patient can pick it up, the insurance company needs to approve it. Traditionally this is a long, painful process involving phone calls, faxes, paperwork, and lots of human waiting.
Now imagine we automate it with two AI agents:
1. Clinical Documentation Agent
This agent connects to the hospital's Electronic Health Record (EHR) system. It pulls the doctor's notes, lab results, the patient's medical history, and any prior treatments. Then it bundles all the relevant information into a clean package that an insurance company would expect.
2. Payer Authorization Agent
This second agent takes that package, logs into the insurance company's portal, fills out the authorization form, submits it, and waits for approval. Once it gets a green light, it notifies the pharmacy and the doctor.
These two agents talk to each other (we call this A2A, short for agent-to-agent) and they each call out to external systems like the EHR, the insurance portal, and the pharmacy.
Sounds great. But how reliable is this system? How fast? How expensive? What if it makes a mistake one day? That is why we need AgentOps.
Layer 1: Observability
Observability answers the question: "What is my agent doing right now, and how long is it taking?" Instead of treating the agent like a mysterious black box, you turn it into a glass box.
In our hospital example, there are four key signals we want to watch.
End-to-End (E2E) Trace Duration
This is the total time from the moment the user makes a request to the moment they get an answer back. In our case, from the second the doctor says "get insurance approval for this patient" to the second that approval is confirmed.
If that takes 10 seconds, fantastic. If it takes 4 hours, something is wrong somewhere in the pipeline and you need to find it.
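A minimal sketch of how you might capture that trace, using a hypothetical span recorder and stand-in functions for the two agents (real systems would use a tracing library such as OpenTelemetry):

```python
import time
from contextlib import contextmanager

# Collected spans: (name, duration in seconds)
spans = []

@contextmanager
def span(name):
    """Record how long a named step takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, time.perf_counter() - start))

# Hypothetical stand-ins for the two agents' real work.
def gather_clinical_docs():
    time.sleep(0.01)

def submit_authorization():
    time.sleep(0.01)

with span("e2e_trace"):
    with span("clinical_documentation_agent"):
        gather_clinical_docs()
    with span("payer_authorization_agent"):
        submit_authorization()

e2e_seconds = dict(spans)["e2e_trace"]
```

Because the child spans are nested inside the outer one, the E2E duration also tells you how much time was spent *between* the agents, not just inside them.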
Agent-to-Agent (A2A) Handoff Latency
In our example, the Clinical Documentation Agent finishes its work and hands the task off to the Payer Authorization Agent. How long does that handoff take?
This matters more than people realize. Sometimes the time spent passing tasks between agents is longer than the time spent doing the actual work. A clean handoff protocol can save you minutes per request.
Tool Execution Latency
Agents rarely work alone. They use tools — calling the EHR system is a tool, opening the insurance portal is a tool, sending a message to the pharmacy is a tool.
How long does each tool take to respond? Maybe the insurance portal is slow and bottlenecking everything. Tool execution latency lets you spot exactly where the slowdown is, instead of blaming "the AI" in general.
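One common way to capture tool latency is a decorator wrapped around every tool function. This is a sketch with an invented `ehr_lookup` tool; the names and the sleep are placeholders:

```python
import time
from functools import wraps

tool_latencies = {}  # tool name -> list of call durations (seconds)

def timed_tool(name):
    """Decorator that records how long each call to a tool takes."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                tool_latencies.setdefault(name, []).append(
                    time.perf_counter() - start)
        return wrapper
    return decorate

@timed_tool("ehr_lookup")
def fetch_patient_record(patient_id):
    time.sleep(0.01)  # stand-in for a real EHR call
    return {"id": patient_id}

fetch_patient_record("p-123")

# The tool whose slowest call is the slowest overall is your first suspect.
slowest = max(tool_latencies, key=lambda t: max(tool_latencies[t]))
```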
Cost per Authorization
How much does a single insurance approval actually cost us? AI agents are not free. Every model call burns tokens, every tool call costs money. If a single approval costs $50 to run while a human staff member could do the same job for $10, the math doesn't work.
This is the metric that tells you whether your AI investment is actually paying off — or quietly bleeding the budget.
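A back-of-the-envelope version of this metric is easy to sketch. The per-token prices and the flat tool-call fee below are invented for illustration; real rates vary by model and provider:

```python
# Assumed prices -- invented for illustration, not real rates.
PRICE_PER_1K_INPUT = 0.003   # dollars per 1,000 input tokens
PRICE_PER_1K_OUTPUT = 0.015  # dollars per 1,000 output tokens
TOOL_CALL_FEE = 0.01         # assumed flat cost per external tool call

def cost_per_authorization(input_tokens, output_tokens, tool_calls):
    """Rough dollar cost of one end-to-end authorization run."""
    llm_cost = (input_tokens / 1000) * PRICE_PER_1K_INPUT \
             + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    return llm_cost + tool_calls * TOOL_CALL_FEE

# One hypothetical run: 12k input tokens, 2k output tokens, 5 tool calls.
cost = cost_per_authorization(12_000, 2_000, 5)
```

Track this per request and you can answer "is the agent cheaper than the human process?" with data instead of vibes.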
Layer 2: Evaluation
Observability tells you what is happening. Evaluation tells you whether what's happening is actually good. An agent that responds in two seconds but gives wrong answers is worse than useless — it's dangerous.
Task Completion Rate
Out of every 100 requests, how many does my agent finish successfully without a human having to step in? If that number is 95%, great. If it's 40%, your humans are still doing more than half the work and the automation barely exists.
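Computing this from an outcome log is a one-liner once you tag each request with how it ended. A toy example with invented outcome labels:

```python
from collections import Counter

# Hypothetical outcome log: one label per request.
outcomes = ["completed"] * 95 + ["needs_human"] * 4 + ["failed"] * 1

counts = Counter(outcomes)
completion_rate = counts["completed"] / len(outcomes)
```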
Factual Accuracy
Are the things your agent says actually true? In healthcare this is not optional — it's life or death.
If the Clinical Documentation Agent records "no penicillin allergy" when the patient actually has one, the consequences can be catastrophic. Factual accuracy measures whether the agent's outputs match the real underlying data.
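One simple (and deliberately naive) way to score this is to compare the agent's output field-by-field against the source record in the EHR. The record fields here are invented:

```python
# Ground truth pulled from the EHR vs. what the agent wrote into the package.
ehr_record = {"allergies": ["penicillin"], "age": 42, "weight_kg": 70}
agent_output = {"allergies": [], "age": 42, "weight_kg": 70}  # allergy dropped!

def field_accuracy(truth, output):
    """Fraction of fields where the agent's output matches the source data."""
    matches = sum(1 for k in truth if output.get(k) == truth[k])
    return matches / len(truth)

accuracy = field_accuracy(ehr_record, agent_output)  # 2 of 3 fields match
```

Real factual-accuracy evaluation is harder than exact matching (free-text notes, paraphrases), but even this crude check would have caught the missing penicillin allergy.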
Guardrail Violations
Think of "guardrails" like the railings on a highway — boundaries the agent should never cross. For a hospital, those might be:
- Never leak patient information to anyone unauthorized
- Always comply with HIPAA and similar privacy laws
- Never make decisions outside its authority
The guardrail violation rate measures how often the agent crosses one of those lines. The closer this number is to zero, the safer your system.
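As a toy illustration of a guardrail check, here is a pattern scan that flags outputs containing something shaped like a US Social Security number. Real HIPAA-grade guardrails are far more sophisticated than one regex; this only shows how a violation *rate* falls out of per-output checks:

```python
import re

# Toy guardrail: flag outputs that leak patterns resembling a US SSN.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def violates_guardrail(text):
    return bool(SSN_PATTERN.search(text))

outputs = [
    "Authorization submitted for patient P-1042.",
    "Patient SSN 123-45-6789 attached for reference.",  # violation
    "Pharmacy notified of approval.",
]
violation_rate = sum(violates_guardrail(o) for o in outputs) / len(outputs)
```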
Clinical Appropriateness
This is a healthcare-specific check. Are the agent's decisions medically reasonable? For example, approving an adult dosage for a pediatric patient is technically a "decision" — but it's not clinically appropriate. This is usually scored using rules designed by actual clinicians.
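A clinician-designed rule can be as simple as a predicate over the case data. The age and dosage thresholds below are invented purely to illustrate the pediatric-dosage example; they are not medical guidance:

```python
def dosage_appropriate(age_years, dose_mg, adult_dose_mg=500, adult_age=18):
    """Toy rule: pediatric patients must get less than a full adult dose.
    Thresholds are invented for illustration, not medical guidance."""
    if age_years < adult_age:
        return dose_mg < adult_dose_mg
    return dose_mg <= adult_dose_mg

ok_adult = dosage_appropriate(age_years=42, dose_mg=500)
ok_child = dosage_appropriate(age_years=9, dose_mg=500)  # adult dose for a child
```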
First-Pass Approval Rate
Of all the insurance authorizations the agent submits, what percentage get approved on the first try? This single metric secretly measures two things at once:
- How well the agent prepares documentation
- How well the agent understands the insurance company's rules
If your first-pass approval rate is 85%, the other 15% have to be redone, which means lost time and lost money.
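Given a submission log that records the attempt number, the metric is a small set computation. The log entries here are hypothetical:

```python
# Hypothetical submission log: (request_id, attempt, approved).
submissions = [
    ("rx-1", 1, True),
    ("rx-2", 1, False), ("rx-2", 2, True),  # approved only on resubmission
    ("rx-3", 1, True),
    ("rx-4", 1, True),
]

requests = {rid for rid, _, _ in submissions}
first_pass = {rid for rid, attempt, approved in submissions
              if attempt == 1 and approved}
first_pass_rate = len(first_pass) / len(requests)  # 3 of 4 requests
```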
Layer 3: Optimization
The first two layers tell you what is happening and whether it's good. The third layer answers: "How do we make it better?"
Prompt Token Efficiency
Language models are billed by "tokens". Every token is money. This metric asks: "How much output quality am I getting for each token I spend?"
If you're shoving a 5000-token mega-prompt into the model and only getting a one-line answer back, you are wasting money. If you can get the same quality with 500 tokens, you just made that call 10x cheaper. Multiplied over thousands of requests a day, this is a huge deal.
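A crude sketch of the ratio, using whitespace splitting as a stand-in tokenizer (real systems would count tokens with the model's own tokenizer) and invented prompt text:

```python
def rough_token_count(text):
    """Crude proxy: real systems use the model's own tokenizer."""
    return len(text.split())

def tokens_per_useful_char(prompt, answer):
    """Prompt tokens spent per character of answer. Lower is better."""
    return rough_token_count(prompt) / max(len(answer), 1)

bloated = "patient history " * 500          # a mega-prompt, ~1000 tokens
lean = "Summarize the relevant allergies."  # a small focused prompt
answer = "No known drug allergies."         # same answer either way

bloated_ratio = tokens_per_useful_char(bloated, answer)
lean_ratio = tokens_per_useful_char(lean, answer)
```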
Flow Step Efficiency
How many steps does the agent take to complete a task? Sometimes agents loop, double-check things they already know, or ask the same question twice. For example, an agent might query the patient's name five times when it could have stored it once and reused it.
Cutting unnecessary steps makes the agent faster and cheaper at the same time.
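The "queried the patient's name five times" problem has a classic fix: cache the lookup so repeated requests reuse the first answer. A sketch with a hypothetical EHR lookup:

```python
from functools import lru_cache

ehr_queries = 0  # count how many times we actually hit the EHR system

@lru_cache(maxsize=None)
def patient_name(patient_id):
    """Cached lookup: only the first call per patient reaches the EHR."""
    global ehr_queries
    ehr_queries += 1
    return {"p-1": "Jane Doe"}.get(patient_id)  # stand-in for a real query

# The agent "asks" for the name five times, but only one query goes out.
names = [patient_name("p-1") for _ in range(5)]
```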
Retrieval Precision (at K)
This one is a little technical, but it's important. Most agents don't know everything — they pull information from a knowledge base on demand. This is called RAG, short for Retrieval-Augmented Generation. For example, an agent might pull the patient's previous lab reports to figure out their condition.
"Retrieval Precision at K" measures: out of the K documents the agent pulled, how many were actually relevant? If the agent grabs 10 documents but only 2 are useful, the other 8 are noise. Noise slows the agent down, confuses the model, and can lead to wrong decisions.
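The metric itself is a short function. The document IDs below are made up to mirror the 10-documents-2-useful example (scaled to K=5):

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant_ids) / len(top_k)

retrieved = ["lab-9", "note-3", "bill-7", "lab-2", "memo-5"]
relevant = {"lab-9", "lab-2"}  # only the lab reports actually matter
p_at_5 = precision_at_k(retrieved, relevant, k=5)  # 2 of 5 are relevant
```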
Handoff Success Rate
How often do handoffs between agents succeed cleanly? The Clinical Documentation Agent prepares a file, but does the Payer Authorization Agent actually receive it correctly? Or does it get a half-broken version? Failed handoffs are silent killers — the system looks like it's running but the wrong things are flowing through it.
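One cheap defense against silent half-broken handoffs is for the receiving agent to validate the package before acting on it. A sketch with an invented package schema:

```python
# Fields the Payer Authorization Agent requires before it will proceed.
REQUIRED_FIELDS = {"patient_id", "medication", "clinical_notes"}

def handoff_ok(package):
    """Receiving agent checks the package is complete before acting on it."""
    return isinstance(package, dict) and REQUIRED_FIELDS <= package.keys()

handoffs = [
    {"patient_id": "p-1", "medication": "amoxicillin", "clinical_notes": "..."},
    {"patient_id": "p-2", "medication": "ibuprofen"},  # notes missing: failure
]
handoff_success_rate = sum(handoff_ok(p) for p in handoffs) / len(handoffs)
```

Rejecting an incomplete package loudly at handoff time is far better than letting a broken one flow silently through the rest of the pipeline.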
Improvement Velocity
How fast is your agent actually getting better over time? This is a meta-metric. Are you continuously testing, measuring, tweaking, and re-deploying? Or is the agent today exactly as good as it was on day one? The real power of AgentOps is creating a tight feedback loop where the system improves itself week after week.
Back to the Hospital: What the Numbers Could Look Like
Imagine we apply all three layers in our hospital. After a few months of running this system with proper AgentOps, we might see results like:
- Time to get an insurance approval reduced by 85% (from days to hours)
- Cases requiring human intervention dropped by 50%
- Cost per authorization down to $0.47
These are not just cute numbers on a dashboard. They translate into:
- Patients getting their medications faster
- Nurses and doctors spending less time on paperwork and more time with patients
- The hospital running more efficiently as a business
That's the real promise of agents in healthcare — and AgentOps is the discipline that makes it possible.
Why You Cannot Skip AgentOps
Putting an AI agent into a hospital, a bank, or any other critical environment is like handing the car keys to a teenager. Can they drive? Maybe. Would you take your eyes off them? Absolutely not.
AgentOps is the "eyes" of your AI system. It watches, it grades, it improves. Running an AI agent without AgentOps is like:
- Driving a car with no speedometer
- Running a company with no financial reports
- Treating a patient without a thermometer
It's technically possible, but it is asking for trouble.
Final Thoughts
The thing that hit me the hardest watching the IBM video was this: AI agents are no longer "experiments". They are becoming real operational systems. And real systems need real metrics.
To recap the three layers in plain English:
- Observability: What is happening, how long is it taking, and how much does it cost?
- Evaluation: Is what's happening correct, safe, and useful?
- Optimization: How do we make it better, cheaper, and faster?
If you ever deploy your own AI agent into something that matters, do not skip these three layers. Building a flashy demo is easy. Building an agent that runs reliably every hour of every day is only possible with a solid AgentOps practice behind it.
I'm trying to learn one new AI concept per day by watching IBM Technology videos. If you're doing something similar, I highly recommend turning what you watch into written notes like this one. Writing forces you to actually understand the material — much more than just hitting play and nodding along.
This post is based on notes I took from an IBM Technology video on AgentOps. If you want to go deeper, I recommend watching the original.