Azure Observability

#azure

Originally published on lavkesh.com

I see monitoring as telling you when something is wrong, whereas observability tells you why. Traditional monitoring relies on predefined metrics and alerts, which works until something breaks in an unexpected way. Observability takes it further by looking at a system's external outputs to understand its internal state, even for problems you didn't plan for.

In Azure, this covers the full stack: infrastructure, application code, and services. It's about understanding how your system behaves, not just when it breaks.

The three pillars of observability in Azure are metrics, logs, and traces. Metrics are numeric measurements over time, such as CPU usage, request rate, error count, and latency percentiles. Azure Monitor collects these metrics from VMs, databases, App Services, and most other Azure resources.

For example, I've seen systems where 95th percentile latency was the key metric, because the slowest requests were what users actually noticed. In one case, we reduced average latency by 30%, but 95th percentile latency improved by only 10%, and the 95th percentile was the number that mattered to users. To see that gap, we used Azure Monitor to collect metrics from our App Service and Azure Storage, then analyzed the data in Azure Data Explorer to identify bottlenecks.
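To make the average-versus-percentile gap concrete, here is a minimal Python sketch. The latency samples are invented for illustration (they are not the numbers from the system above), and `percentile` is a hypothetical helper using the simple nearest-rank method, not an Azure Monitor API.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 18 fast requests and 2 slow ones.
before = [100] * 18 + [1000, 1000]
# After an optimization that mostly helps the fast path.
after = [60] * 18 + [900, 900]

for label, samples in (("before", before), ("after", after)):
    print(label, "mean:", mean(samples), "p95:", percentile(samples, 95))
```

In this made-up data the mean drops by roughly a quarter while the 95th percentile drops by only a tenth, which is the same shape of result: the average can look much better while the slow tail barely moves.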

You can build dashboards and set alerts on top of those metrics to catch performance problems early.

Logs, on the other hand, provide detailed records of events, errors, and state changes. Azure Log Analytics gives you centralized log storage and querying with KQL, and Microsoft Sentinel adds threat detection on top.

I've found that centralizing logs is crucial, especially when dealing with large-scale systems. In one instance, we had a system with over 500 VMs, and trying to SSH into each box to grep logs was impractical. By using Azure Log Analytics, we were able to query logs across all VMs and identify the root cause of an issue in under an hour, saving us several hours of manual work.
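The payoff of centralizing is that one filter-and-group query replaces hundreds of per-box greps. In Log Analytics you would write that query in KQL; the Python below is only a sketch of the same shape over in-memory records. The field names (`host`, `level`, `message`) and the sample rows are assumptions for illustration, not a Log Analytics schema.

```python
from collections import Counter

# Hypothetical log records, as they might look once shipped to one central store.
records = [
    {"host": "vm-001", "level": "ERROR", "message": "disk latency high"},
    {"host": "vm-002", "level": "INFO", "message": "health check ok"},
    {"host": "vm-001", "level": "ERROR", "message": "disk latency high"},
    {"host": "vm-417", "level": "ERROR", "message": "connection reset"},
]


def errors_by_host(rows):
    """Count ERROR records per host, most affected host first."""
    counts = Counter(row["host"] for row in rows if row["level"] == "ERROR")
    return counts.most_common()


print(errors_by_host(records))
```

Sorting hosts by error count is often the first cut when narrowing down a root cause: it tells you whether the problem is fleet-wide or concentrated on a few machines.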

Traces come from distributed tracing, which follows a single request as it moves across services. In a microservices system, one user request might touch five services. Azure Application Insights shows you the end-to-end transaction, where the latency comes from, and which service is the bottleneck.
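As a rough illustration of how trace data points at a bottleneck, the sketch below computes each service's "self time" (its span duration minus the time spent in its child calls). The span records, service names, and field names are hypothetical, and the calculation assumes child spans run sequentially without overlapping; it is not how Application Insights computes anything internally.

```python
from collections import defaultdict

# Hypothetical spans for a single trace; durations in milliseconds.
spans = [
    {"id": "a", "parent": None, "service": "gateway", "duration_ms": 500},
    {"id": "b", "parent": "a", "service": "orders", "duration_ms": 450},
    {"id": "c", "parent": "b", "service": "inventory", "duration_ms": 50},
    {"id": "d", "parent": "b", "service": "payments", "duration_ms": 350},
]


def self_time_by_service(trace_spans):
    """Time each service spent on its own work, excluding its child calls.

    Assumes child spans are sequential and do not overlap.
    """
    # Total duration of the direct children of each span.
    child_total = defaultdict(int)
    for span in trace_spans:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["duration_ms"]

    # Self time = own duration minus time spent waiting on children.
    result = defaultdict(int)
    for span in trace_spans:
        result[span["service"]] += span["duration_ms"] - child_total[span["id"]]
    return dict(result)


times = self_time_by_service(spans)
bottleneck = max(times, key=times.get)
print(times, "bottleneck:", bottleneck)
```

The gateway span is the longest in raw duration, but almost all of that is time spent waiting on downstream calls. Self time is what separates the service that is slow from the services that merely sit above it in the call chain.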

Most Azure services emit platform metrics by default, while resource logs usually need a diagnostic setting before they are routed anywhere. For your own application code, you can instrument with the Application Insights SDK or OpenTelemetry. OpenTelemetry is vendor-neutral, so you're not locked into Azure tooling if requirements change. For instance, we used OpenTelemetry to instrument our .NET application, which allowed us to track requests across multiple services and identify performance issues.
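What lets a request be tracked across services is a trace context passed along with each call. OpenTelemetry's default propagation uses the W3C Trace Context `traceparent` header, and the sketch below shows that header's shape using only the Python standard library. The function names are hypothetical, and in a real service the SDK does this for you; this is only meant to show what the instrumentation carries from hop to hop. The sketch also skips the spec's check that IDs are not all zeros.

```python
import re
import secrets

# traceparent format: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with a random trace ID and span ID."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming):
    """Build the header for an outgoing call: same trace ID, new span ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Malformed or missing header: start a fresh trace instead of failing.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()
outgoing = child_traceparent(incoming)
print(incoming)
print(outgoing)
```

Because every hop keeps the same trace ID and only swaps in its own span ID, the backend can stitch spans from separate services into one end-to-end transaction.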

When it comes to practices, I define the metrics that matter before I get paged at 2am. I know my SLOs and set alerts based on those, not arbitrary thresholds. I centralize logs rather than SSHing into boxes and grepping around, and instrument services with distributed tracing from the start.
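One common way to turn an SLO into an alert, rather than picking an arbitrary threshold, is to alert on how fast the error budget is being spent. The sketch below is a minimal version of that idea. The 99.9% target, the paging threshold of 10, and the function names are all assumptions for illustration, not values from my systems or an Azure Monitor feature.

```python
def burn_rate(failed, total, slo_target):
    """How fast the error budget is being spent.

    1.0 means errors arrive at exactly the rate the SLO allows;
    higher values mean the budget will run out early.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # allowed fraction of failed requests
    return (failed / total) / error_budget


def should_page(failed, total, slo_target=0.999, threshold=10.0):
    """Page only when the budget is burning much faster than allowed."""
    return burn_rate(failed, total, slo_target) >= threshold


# Hypothetical one-hour windows of request counts.
print(should_page(failed=120, total=10_000))  # 1.2% errors against a 0.1% budget
print(should_page(failed=5, total=10_000))    # 0.05% errors, within budget
```

The alert is tied to what the SLO promises, so a brief blip that stays within budget does not wake anyone up, while a sustained burn does.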

For tooling, Azure Monitor and Azure Log Analytics are the two I rely on most for observability in Azure. Azure Monitor handles near-real-time metrics and alerts, while Azure Log Analytics handles log search and analysis. Used together, they let you tie a metric spike back to the log entries behind it, so you can identify issues before they affect users.

I also build runbooks for common alerts so whoever is on call knows what to do. And I review observability data regularly, not just during incidents. The goal is to catch problems before users report them and resolve them in minutes instead of hours.

By following these practices, you can make the most of Azure's observability features and improve the reliability, performance, and security of your applications and services.
