If you’ve ever lost sleep debugging a latency spike across microservices, you know observability isn’t just about metrics—it’s about context.
For months, my team ran a multi-tier SaaS application on Alibaba Cloud using Prometheus on ACK (Alibaba Cloud Container Service for Kubernetes). It worked—until scaling demands, fragmented tooling, and alert fatigue caught up with us.
We recently migrated to Application Real-Time Monitoring Service (ARMS), Alibaba Cloud’s managed observability platform. The result? 40% faster incident resolution, far less operational overhead, and a unified view across our entire stack.
Here’s how—and why—it worked for us.
Why We Hit Prometheus’ Limits
Prometheus is fantastic for metrics—but in a production-grade, multi-environment setup, we kept running into walls:
- Operational drag: Managing HA Prometheus, storage, retention, and scrape configs across clusters ate up valuable engineering time.
- No end-to-end visibility: Metrics lived in Prometheus, logs in a separate system, and traces in another. Correlating a frontend slowdown to a slow database query meant manual detective work.
- Unpredictable scaling: During traffic spikes, we either over-provisioned (wasting money) or risked gaps in our metrics.
- Noisy, low-context alerts: Pages like “High CPU” or “API error rate up” arrived with no indication of which service or dependency was the actual root cause.
We needed something that gave us metrics + traces + logs in one place, with minimal maintenance.
Why We Chose ARMS
ARMS isn’t just “Prometheus-as-a-service.” It’s a unified observability platform that supports:
- Full Prometheus remote write/pull compatibility
- Automatic discovery of cloud resources (Kubernetes workloads, load balancers, managed databases, etc.)
- Built-in distributed tracing (with auto-instrumentation for Java, Python, Go, and more)
- Tight integration with Alibaba Cloud services—no extra agents or complex networking
Most importantly: we didn’t have to throw away our existing investment. ARMS ingested our existing Prometheus metrics while adding deeper context from the underlying cloud infrastructure.
Our Migration Strategy: Safe and Incremental
We didn’t flip a switch. Observability is too critical for that. Here’s our phased approach:
1. Parallel ingestion: Configured ARMS to receive metrics via Prometheus remote write—same exporters, same metrics.
2. Validation period: Ran both systems side by side for two weeks, comparing dashboards, alert fidelity, and data completeness.
3. Gradual cutover: Shifted Grafana dashboards to use ARMS as a data source, then migrated alert rules.
4. Decommission: Once confident, we scaled down our self-managed Prometheus stack and reclaimed resources.
Zero downtime. Minimal risk.
💡 Pro tip: Start with ARMS’ Prometheus Monitoring mode—it’s the gentlest on-ramp if you’re already using Prometheus.
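To give a feel for how small the parallel-ingestion change is, here’s roughly what it looks like in `prometheus.yml`. The endpoint URL and credentials below are placeholders—pull the real values for your instance from the ARMS console, and note that the auth mechanism can vary by setup:

```yaml
# prometheus.yml -- existing scrape configs stay untouched.
# We only add a remote_write target pointing at ARMS.
remote_write:
  - url: "https://<your-arms-remote-write-endpoint>/api/v1/write"  # placeholder
    basic_auth:
      username: "<access-key-id>"      # placeholder credentials
      password: "<access-key-secret>"
    queue_config:
      max_samples_per_send: 2000       # tune batching for spike tolerance
```

Because this is additive, your existing storage and dashboards keep working while ARMS receives a duplicate stream—which is exactly what makes the side-by-side validation period possible.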
Observability Across All Tiers—Finally
Our stack spans:
- Frontend (CDN + static assets on OSS)
- API gateway (via Application Load Balancer)
- Stateless microservices (on ACK)
- Caching layer (managed Redis)
- Persistent storage (managed PostgreSQL)
Before ARMS, answering “Why is checkout slow?” meant jumping between 3+ tools.
Now, in a single ARMS dashboard, we see:
- Frontend performance (real user metrics)
- API latency and error rates
- Service dependency map with live traces
- Slow database queries auto-linked to calling services
It’s no longer about “collecting data”—it’s about understanding behavior.
Results That Mattered
- 40% reduction in mean time to resolution (MTTR)
- 60% less time spent managing observability infrastructure
- Fewer but higher-signal alerts (with service-level context)
- Faster onboarding—new engineers explore the system through ARMS, not tribal knowledge
If You’re Considering a Similar Move
Here’s what helped us:
- Use ARMS’ Prometheus-compatible endpoint to avoid config rewrites.
- Enable Application Monitoring for auto-instrumentation—just add a few JVM args or SDK calls.
- Set up RAM roles (Alibaba Cloud’s IAM) for secure, credential-free access.
- Still push custom metrics? ARMS supports OpenTelemetry and Prometheus client libraries.
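Custom metrics need no rework because ARMS scrapes the same Prometheus text exposition format your exporters already emit. As a dependency-free sketch of what that wire format looks like (the metric name and labels are invented for illustration—in production you’d use the official `prometheus_client` or OpenTelemetry SDKs rather than hand-rolling this):

```python
# Dependency-free sketch of the Prometheus text exposition format that
# ARMS (or any Prometheus-compatible backend) ingests. The metric below
# is hypothetical; real code should use prometheus_client or OpenTelemetry.

def render_counter(name: str, help_text: str,
                   samples: dict[tuple, float],
                   label_names: tuple[str, ...]) -> str:
    """Render one counter family in Prometheus exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for label_values, value in samples.items():
        labels = ",".join(
            f'{k}="{v}"' for k, v in zip(label_names, label_values)
        )
        lines.append(f"{name}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"

# Illustrative business metric: checkout failures by downstream dependency.
payload = render_counter(
    "checkout_errors_total",
    "Checkout failures by downstream dependency",
    {("postgres",): 3.0, ("redis",): 1.0},
    ("dependency",),
)
print(payload)
```

Anything your services expose in this format today keeps flowing after the cutover—no config rewrites, no renamed metrics.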
Final Thought
This wasn’t just a tool swap—it was a shift from reactive monitoring to proactive understanding. By leaning into a managed, cloud-native observability platform, we got our weekends back and our SLOs under control.
If you’re running Prometheus on Kubernetes and feeling the pain of scale, it might be time to explore what a unified observability platform can do for your team—regardless of your cloud provider.
Have you migrated from self-managed Prometheus to a managed solution? I’d love to hear your lessons in the comments!
About the author: I’m a DevOps engineer and 2x Alibaba Cloud MVP. I spend my days building resilient, observable systems in the cloud—and sharing what works (and what doesn’t).