Speaker: Jimmy Soh @ AWS Amarathon 2025
Summary by Amazon Nova
Key Challenges in LLM Operations
- Tracking usage across tenants and models
- Preventing abuse and prompt injection
- Optimizing cost without sacrificing SLAs
Blind spots in usage, security, and cost can sink LLM operations at scale.
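As a minimal sketch of the first challenge, here is one way per-tenant, per-model token accounting could look; the `UsageTracker` class, tenant names, and model names are illustrative assumptions, not the speaker's implementation:

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical sketch: per-tenant, per-model token accounting.
@dataclass
class UsageTracker:
    # (tenant_id, model_id) -> running input/output token totals
    totals: dict = field(
        default_factory=lambda: defaultdict(lambda: {"input": 0, "output": 0})
    )

    def record(self, tenant_id: str, model_id: str,
               input_tokens: int, output_tokens: int) -> None:
        bucket = self.totals[(tenant_id, model_id)]
        bucket["input"] += input_tokens
        bucket["output"] += output_tokens

    def report(self) -> None:
        for (tenant, model), t in sorted(self.totals.items()):
            print(f"{tenant}/{model}: in={t['input']} out={t['output']}")

tracker = UsageTracker()
tracker.record("acme", "nova-pro", 1200, 350)
tracker.record("acme", "nova-lite", 400, 90)
tracker.report()
```

In production this accounting would typically be emitted as metrics to the monitoring stack rather than held in process memory.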
Real-Time Insights Driving AIOps Decisions
- Monitor every prompt, token, and latency
- Detect anomalies and abuse patterns
- Drive intelligent automation
Live metrics turn anomalies into instant, automated fixes.
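One hedged illustration of anomaly detection over live metrics is a rolling z-score on token throughput; the window size, threshold, and sample values below are assumptions made for the sketch:

```python
import statistics
from collections import deque

# Hypothetical sketch: flag anomalous token rates with a rolling z-score.
class TokenRateDetector:
    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.samples = deque(maxlen=window)  # recent tokens/minute samples
        self.threshold = threshold

    def observe(self, tokens_per_minute: float) -> bool:
        """Return True if this sample looks anomalous versus recent history."""
        anomalous = False
        if len(self.samples) >= 10:  # wait for a minimal baseline
            mean = statistics.fmean(self.samples)
            stdev = statistics.pstdev(self.samples) or 1e-9
            anomalous = abs(tokens_per_minute - mean) / stdev > self.threshold
        self.samples.append(tokens_per_minute)
        return anomalous

detector = TokenRateDetector()
for rate in [900, 950, 920, 980, 940, 910, 930, 960, 945, 925, 15000]:
    if detector.observe(rate):
        print(f"anomaly: {rate} tokens/min")  # e.g. a runaway or abusive client
```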
Fair Pricing Through Smart Observability
- Align cloud spend with true LLM usage
- Identify under-utilized resources and right-size automatically
- Trigger cost-saving actions (scale-to-zero, burst capacity)
Pay only for value – usage-driven metering that right-sizes itself.
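A sketch of what usage-driven metering plus a scaling decision could look like; the prices and thresholds are placeholder assumptions, not real AWS or Toby AI rates:

```python
# Hypothetical sketch: usage-driven metering with a scale-to-zero decision.
PRICE_PER_1K_INPUT = 0.0008   # assumed $/1K input tokens
PRICE_PER_1K_OUTPUT = 0.0032  # assumed $/1K output tokens

def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    """Meter spend directly from observed token usage."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

def scaling_action(requests_last_hour: int) -> str:
    """Map observed demand to a capacity action."""
    if requests_last_hour == 0:
        return "scale-to-zero"   # idle tenant: release capacity entirely
    if requests_last_hour > 10_000:
        return "burst-capacity"  # spike: add replicas ahead of the queue
    return "steady-state"

print(monthly_cost(12_000_000, 3_000_000))  # -> 19.2 (illustrative)
print(scaling_action(0))                    # -> scale-to-zero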
From Observability to Optimization with AIOps
- Smarter automation drives faster incident resolution
- Continuous cost efficiency without manual tuning
- High-performing AI workloads
Transform LLM observability into intelligent AIOps actions.
Architecture
1. Chat / AI services usage: customers interact with Toby AI
SaaS Cluster:
- Toby AI Services
- Application Performance Monitoring
- Real User Monitoring
- Logs and Metrics Analytics
- Synthetic Monitoring
- Monitoring Rules and Alerts
2. Telemetry data flows to the SaaS self-monitoring cluster, which runs the same stack: Toby AI Services, Application Performance Monitoring, Real User Monitoring, Logs and Metrics Analytics, Synthetic Monitoring, and Monitoring Rules and Alerts
3. Subscription check with the AIOps Orchestrator (sketched below):
- Context Enrichment
- Policy Decision
- Trigger Actions
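A hedged sketch of the orchestrator's enrich, decide, act loop; the alert shape, subscription tiers, and action names are assumptions for illustration, not the talk's actual API:

```python
from dataclasses import dataclass

# Hypothetical sketch of the AIOps Orchestrator pipeline.
@dataclass
class Alert:
    tenant: str
    kind: str          # e.g. "token-spike", "latency-breach"
    value: float

SUBSCRIPTIONS = {"acme": "enterprise", "initech": "free"}  # assumed lookup table

def enrich(alert: Alert) -> dict:
    """Context enrichment: attach the tenant's subscription tier to the alert."""
    return {"alert": alert, "tier": SUBSCRIPTIONS.get(alert.tenant, "free")}

def decide(ctx: dict) -> str:
    """Policy decision: pick an action based on alert kind and tier."""
    alert, tier = ctx["alert"], ctx["tier"]
    if alert.kind == "token-spike":
        return "burst-capacity" if tier == "enterprise" else "throttle"
    if alert.kind == "latency-breach":
        return "run-runbook"
    return "notify-only"

def trigger(action: str, alert: Alert) -> None:
    """Trigger actions: hand off to the DevOps orchestrator (stubbed here)."""
    print(f"[{alert.tenant}] {alert.kind} -> {action}")

for a in (Alert("acme", "token-spike", 15000), Alert("initech", "token-spike", 9000)):
    trigger(decide(enrich(a)), a)
```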
4. Runbook, scale, throttle, and optimize via the DevOps Orchestrator
- GitOps Runbooks (see the sketch below)
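To ground the GitOps runbook idea: remediation is expressed as a commit to a config repository, which a GitOps agent (such as Argo CD or Flux) reconciles into the cluster. The repo path, file layout, and replica field below are assumptions for the sketch:

```python
import pathlib
import subprocess

# Hypothetical sketch of a GitOps runbook step: the fix is a commit,
# not an imperative API call.
def gitops_scale(repo: str, deployment_file: str, replicas: int, reason: str) -> None:
    """Rewrite the desired replica count in the manifest and commit it;
    the GitOps agent reconciles the live cluster to match."""
    path = pathlib.Path(repo) / deployment_file
    updated = []
    for line in path.read_text().splitlines():
        if line.strip().startswith("replicas:"):
            indent = line[: len(line) - len(line.lstrip())]
            updated.append(f"{indent}replicas: {replicas}")
        else:
            updated.append(line)
    path.write_text("\n".join(updated) + "\n")
    subprocess.run(["git", "-C", repo, "commit", "-am",
                    f"aiops: scale to {replicas} ({reason})"], check=True)
    subprocess.run(["git", "-C", repo, "push"], check=True)

# Example (assumed repo and file):
# gitops_scale("/srv/platform-config", "toby-ai/deployment.yaml", 0,
#              "scale-to-zero: idle tenant")
```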