
Custodia-Admin

Posted on • Originally published at pagebolt.dev

Measuring and Maintaining SLA Reliability for AI Agent Workflows


Your team deployed an AI agent to handle customer support tickets. It processes 100 tickets per day. It's been running for two weeks.

Then someone asks: "What's the SLA?"

SLA. Service Level Agreement. The promise you make about how often the service will be available and working correctly.

You don't have one. Because you never defined one.

This is the state of most agent deployments: powerful, production-critical, but no reliability guarantees.

Why Agent SLAs Are Different

Traditional service SLAs are straightforward:

  • Uptime: The service is available 99.9% of the time
  • Response time: Requests resolve in <200ms
  • Error rate: <0.1% of requests fail

Agent SLAs are more complex:

  • Uptime: Is the agent process running? That's not the full story.
  • Execution rate: Does the agent complete its task, or does it get stuck?
  • Accuracy rate: Does the agent do the task correctly?
  • Latency: How long does a task take?

An agent can be "up" (the process running) but "down" (stuck in a loop, unable to make progress).
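One way to catch the "up but stuck" case is to track progress, not just process liveness. A minimal sketch, assuming the agent calls back on every completed step (the `ProgressWatchdog` name and 300-second threshold are illustrative):

```python
import time

class ProgressWatchdog:
    """Flags an agent as unhealthy when it stops making progress,
    even if the process itself is still alive."""

    def __init__(self, stall_after_seconds=300):
        self.stall_after_seconds = stall_after_seconds
        self.last_progress = time.monotonic()

    def record_progress(self):
        # Call whenever the agent finishes a step (tool call, task, etc.)
        self.last_progress = time.monotonic()

    def is_healthy(self):
        # "Up" means making progress within the window, not just running.
        return (time.monotonic() - self.last_progress) < self.stall_after_seconds
```

Your availability metric should count time spent stalled as downtime, the same as a crash.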

Defining Agent SLA Metrics

Start with these:

1. Availability

  • Percentage of time the agent is able to accept new tasks
  • Target: 99.5% (roughly 1.7 hours of allowed downtime per two weeks)

2. Task Completion Rate

  • Percentage of tasks that complete without error
  • Target: 99% (acceptable: 1 failure per 100 tasks)

3. Execution Time (P95 latency)

  • 95th percentile time to complete a task
  • Target: <5 minutes for most tasks

4. Accuracy Rate

  • Percentage of completed tasks that are correct
  • Target: 99%+ (depends on use case)

5. Recovery Time (MTTR)

  • Time to recover from failure
  • Target: <1 hour
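P95 is the value that 95% of task durations fall under; a nearest-rank calculation is enough for SLA reporting. A sketch (the sample durations are made up):

```python
def percentile(values, pct):
    """Nearest-rank percentile: sort, then take the value at the
    ceiling of pct% of the sample count."""
    ordered = sorted(values)
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]

# Task durations in seconds; one slow outlier dominates the tail.
durations = [42, 38, 55, 61, 47, 39, 44, 300, 50, 46]
p95 = percentile(durations, 95)
```

Note how a single outlier sets the P95 here; that's the point of the metric: it surfaces the slow tail that an average would hide.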

Implementing SLA Monitoring

You can't hit an SLA you don't measure. Instrument every agent workflow:

from datetime import datetime

class AgentTask:
    def __init__(self, task_id, task_description):
        self.task_id = task_id
        self.task_description = task_description
        self.start_time = datetime.now()
        self.end_time = None
        self.duration = None
        self.status = "pending"
        self.result = None
        self.error = None

    def execute(self, agent):
        """Execute task and collect metrics."""
        try:
            self.result = agent.run(self)
            self.status = "success"
        except Exception as e:
            self.status = "failed"
            self.error = str(e)
        finally:
            self.end_time = datetime.now()
            self.duration = (self.end_time - self.start_time).total_seconds()
            self.log_metrics()

    def log_metrics(self):
        """Send metrics to monitoring system."""
        metrics = {
            "task_id": self.task_id,
            "status": self.status,
            "duration_seconds": self.duration,
            "timestamp": self.start_time.isoformat(),
            "error": self.error
        }

        # Send to your monitoring system (DataDog, New Relic, etc.);
        # monitoring_service stands in for whatever client you use.
        monitoring_service.record_metric("agent.task", metrics)

Now you have data. Track it over time.
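Tracking over time means aggregating the logged records into a rolling window. A sketch of a weekly report, assuming task records shaped like the `log_metrics` dictionaries above (the `sla_report` function is illustrative):

```python
from datetime import datetime, timedelta

def sla_report(tasks, window=timedelta(days=7)):
    """Summarise task metric records into SLA figures over a
    rolling window. Returns None when no tasks fall in the window."""
    cutoff = datetime.now() - window
    recent = [t for t in tasks if datetime.fromisoformat(t["timestamp"]) >= cutoff]
    if not recent:
        return None

    completed = [t for t in recent if t["status"] == "success"]
    durations = sorted(t["duration_seconds"] for t in recent)
    # Nearest-rank P95 over the window.
    p95 = durations[max(0, int(round(0.95 * len(durations))) - 1)]

    return {
        "tasks": len(recent),
        "completion_rate": len(completed) / len(recent),
        "p95_seconds": p95,
    }
```

Run this on a schedule and alert when any figure drops below target, rather than waiting for someone to ask "what's the SLA?"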

Responding to SLA Violations

When your agent misses SLA:

1. Identify the root cause

  • Did the agent crash?
  • Did it get stuck?
  • Did it make an error?
  • Was it a dependency failure (API down, database slow)?

2. Classify the failure

  • Agent bug: needs code fix
  • Infrastructure: needs capacity planning
  • External dependency: needs escalation
  • Expected failure: does SLA need adjustment?

3. Implement the fix

  • Deploy updated agent code
  • Scale infrastructure
  • Negotiate SLA with dependency provider
  • Re-evaluate targets
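The classification step can start as a simple triage table keyed on error text. A minimal sketch; the keyword rules are illustrative and should be tuned to the errors your agent actually produces:

```python
def classify_failure(error_message):
    """Map an error string to a failure class for SLA triage.
    Order matters: dependency signals are checked first because
    they are the most common cause of agent SLA misses."""
    msg = error_message.lower()
    if any(k in msg for k in ("timeout", "connection refused", "503")):
        return "external_dependency"   # escalate to the provider
    if any(k in msg for k in ("out of memory", "disk full")):
        return "infrastructure"        # capacity planning
    if any(k in msg for k in ("assertion", "keyerror", "traceback")):
        return "agent_bug"             # needs a code fix
    return "needs_review"              # a human decides
```

Feeding the `error` field from the task metrics above through a classifier like this turns a pile of failures into an actionable breakdown by fix type.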

Real-World Example

Agent: Process invoice submissions

Target SLA:

  • Availability: 99.5%
  • Task completion: 99%
  • Accuracy: 99%
  • Execution time (P95): 2 minutes

Week 1 metrics:

  • Availability: 98% (one 2-hour outage)
  • Task completion: 94% (6 of 100 tasks failed)
  • Accuracy: 97% (3 tasks processed incorrectly)
  • Execution time (P95): 3 minutes

What's failing?

  • Task failures: Agent timeout during PDF upload (infrastructure issue)
  • Accuracy failures: Agent misreading handwritten dates (model accuracy issue)
  • Execution time: Slow database queries (dependency issue)

Actions:

  • Increase timeout threshold (infrastructure)
  • Retrain model on handwritten input (agent improvement)
  • Optimize database queries (dependency escalation)

Week 2 metrics: Back to target.
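The week-over-week check can be automated by comparing measured values against targets, remembering that latency is an upper bound while the rates are lower bounds. A sketch using the targets from this example (names are illustrative):

```python
TARGETS = {
    "availability": 0.995,
    "completion": 0.99,
    "accuracy": 0.99,
    "p95_minutes": 2.0,   # upper bound, unlike the rates above
}

def sla_misses(measured, targets=TARGETS):
    """Return only the metrics that breached target, with both values,
    so the report shows what to triage."""
    misses = {}
    for name, target in targets.items():
        value = measured[name]
        breached = value > target if name.startswith("p95") else value < target
        if breached:
            misses[name] = {"measured": value, "target": target}
    return misses

week1 = {"availability": 0.98, "completion": 0.94,
         "accuracy": 0.97, "p95_minutes": 3.0}
```

For the week-1 numbers above, all four metrics come back as misses; an empty result means the week is on target.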

The Business Case for Agent SLAs

Defining and tracking SLAs:

  • Builds confidence — Stakeholders trust systems with published SLAs
  • Drives improvements — Metrics highlight bottlenecks
  • Enables scaling — You know what's working and what needs investment
  • Facilitates compensation — When an SLA is missed, you have the data to adjust pricing or issue credits

Without SLAs, agents are "best effort." With SLAs, they're infrastructure.


Try it free: 100 requests/month on PageBolt—capture visual proof of your agent's execution at every SLA checkpoint. No credit card required.
