Haripriya Veluchamy
Building an End-to-End Monitoring Architecture in Azure for a Multi-Service Product

When a product depends on multiple interconnected services, reliability becomes non-negotiable. In our case, two critical services form the core of the system:

  • A scheduled data collection service that crawls external sources
  • An API service that processes, transforms, and serves enriched data

To ensure uninterrupted operations, a full-stack monitoring architecture was developed on Azure, covering everything from infrastructure metrics to business KPIs, with automated alerting routed directly into Slack.

This post breaks down the layers of that monitoring system and the principles behind its design.


🔹 Layer 1: Platform Diagnostics with Azure App Service

Both services run on Azure App Service. Azure provides extensive diagnostic categories, and every relevant category was enabled to maximize visibility:

✔ HTTP Logs
✔ Console Logs
✔ Application Logs
✔ Access Audit Logs
✔ IPSecurity Audit Logs
✔ Platform Logs
✔ Authentication Logs
✔ AllMetrics (CPU, Memory, Connections, Threads)

All logs flow into a centralized Log Analytics workspace.
This centralization is key: issues that span services become much easier to correlate.
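
As a concrete illustration of that benefit, the sketch below runs one KQL query against the shared workspace with the azure-monitor-query SDK. It assumes the HTTP Logs category lands in the resource-specific AppServiceHTTPLogs table; the workspace ID and time range are placeholders, not values from the real setup.

```python
# A minimal sketch of cross-service correlation against the central workspace,
# assuming the azure-monitor-query and azure-identity packages and that HTTP
# Logs flow into the resource-specific AppServiceHTTPLogs table.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"  # hypothetical placeholder

# KQL: 5xx counts per App Service over the last hour. Grouping by _ResourceId
# in a single query is only possible because both services share one workspace.
QUERY = """
AppServiceHTTPLogs
| where ScStatus >= 500
| summarize errors = count() by _ResourceId, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(
    workspace_id=WORKSPACE_ID,
    query=QUERY,
    timespan=timedelta(hours=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```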


🔹 Layer 2: Deep Telemetry with Application Insights

Platform logs only reveal surface-level issues. To understand what happens inside the applications, both services were integrated with Application Insights, enabling:

  • Exception tracking
  • Full request traces
  • Dependency telemetry (database, external APIs, network calls)
  • Performance metrics
  • Custom instrumentation for business workflows

This added the ability to detect internal bottlenecks, slow external services, and logical failures that App Service logs would never capture.
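
For the custom instrumentation piece, here is a minimal sketch assuming the services are Python apps wired up with the azure-monitor-opentelemetry distro (the post does not specify the stack, and enrich_record is a hypothetical workflow step used only to carry a custom span).

```python
# A sketch of custom instrumentation for a business workflow, assuming the
# azure-monitor-opentelemetry distro and that APPLICATIONINSIGHTS_CONNECTION_STRING
# is set. Span and function names are hypothetical examples.
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Wires OpenTelemetry traces, metrics and logs to Application Insights.
configure_azure_monitor()

tracer = trace.get_tracer(__name__)

def enrich_record(record: dict) -> dict:
    """Hypothetical business step, shown only to demonstrate a custom span."""
    with tracer.start_as_current_span("enrich_record") as span:
        span.set_attribute("record.source", record.get("source", "unknown"))
        # ... transformation logic would go here ...
        return record

if __name__ == "__main__":
    enrich_record({"source": "crawler"})
```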


🔹 Layer 3: Smart Health Check Endpoints

Each service exposes a real health check endpoint, not just a bare “200 OK”.
These checks validate:

  • Database connectivity
  • Cache availability
  • External dependency readiness
  • Background job health
  • Internal workflow availability

They act as early warning signals for issues that have not yet impacted the user-facing experience.
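
A minimal sketch of such an endpoint is shown below, using Flask and cheap TCP probes; the framework choice, dependency hosts, and ports are assumptions for illustration, not the production implementation.

```python
# A minimal health endpoint sketch using Flask (framework choice assumed).
# Each check is a cheap TCP probe; hosts and ports are hypothetical
# placeholders for the real dependencies.
import socket

from flask import Flask, jsonify

app = Flask(__name__)

DEPENDENCIES = {
    "database": ("db.internal.example", 5432),   # hypothetical
    "cache": ("cache.internal.example", 6379),   # hypothetical
}

def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the dependency succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

@app.route("/health")
def health():
    results = {name: tcp_check(host, port) for name, (host, port) in DEPENDENCIES.items()}
    healthy = all(results.values())
    # Return per-dependency detail so the alert (and the on-call engineer)
    # can see which layer is degraded, not just a bare 200/503.
    return jsonify({"checks": results, "healthy": healthy}), (200 if healthy else 503)
```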


🔹 Layer 4: Heartbeat Validation for Scheduled Workloads

The data collector runs on a time-based schedule, so a heartbeat mechanism ensures silent failures are caught immediately.

The mechanism:

  1. Updates a timestamp file or marker after each successful crawl
  2. A validator process monitors the timestamp freshness
  3. Alerts are fired when the heartbeat stops updating

This prevents scenarios where the crawler appears healthy from the outside but has quietly stopped running.
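
A stdlib-only sketch of that mechanism: the crawler touches a marker file after each successful run, and a separate validator treats a stale marker as a failure. The file path and freshness threshold are illustrative, not the real values.

```python
# Heartbeat pattern: the crawler records a timestamp after each successful
# crawl, and a validator process alerts when the timestamp goes stale.
from datetime import datetime, timedelta, timezone
from pathlib import Path

HEARTBEAT_FILE = Path("/tmp/crawler_heartbeat")   # hypothetical location
MAX_AGE = timedelta(minutes=90)                   # schedule interval plus slack

def record_heartbeat() -> None:
    """Called by the crawler after every successful crawl."""
    HEARTBEAT_FILE.write_text(datetime.now(timezone.utc).isoformat())

def heartbeat_is_fresh() -> bool:
    """Called by the validator on its own schedule."""
    if not HEARTBEAT_FILE.exists():
        return False
    last = datetime.fromisoformat(HEARTBEAT_FILE.read_text().strip())
    return datetime.now(timezone.utc) - last < MAX_AGE

if __name__ == "__main__":
    record_heartbeat()
    print("fresh" if heartbeat_is_fresh() else "stale -> fire alert")
```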


🔹 Performance Optimization with Server-Side Caching

To keep the API responsive, server-side caching was added with:

  • TTL-driven expiry
  • LRU eviction
  • Cache pre-warming for high-traffic endpoints

This reduces dependency load and ensures predictable performance even during spikes.
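
The sketch below hand-rolls the described behaviour (TTL expiry, LRU eviction, pre-warming) to make the semantics explicit; in practice a library such as cachetools.TTLCache offers the same. Endpoint names and payloads are hypothetical.

```python
# A hand-rolled TTL + LRU cache sketch with a simple pre-warming step.
import time
from collections import OrderedDict

class TTLLRUCache:
    def __init__(self, maxsize: int = 1024, ttl: float = 300.0):
        self.maxsize, self.ttl = maxsize, ttl
        self._data = OrderedDict()   # key -> (expires_at, value)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if expires_at < time.monotonic():        # TTL-driven expiry
            del self._data[key]
            return None
        self._data.move_to_end(key)              # mark as recently used
        return value

    def set(self, key, value):
        self._data[key] = (time.monotonic() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.maxsize:    # LRU eviction
            self._data.popitem(last=False)

# Pre-warm high-traffic endpoints at startup (names are illustrative).
cache = TTLLRUCache(maxsize=512, ttl=120)
for endpoint in ("/api/top-products", "/api/summary"):
    cache.set(endpoint, f"precomputed payload for {endpoint}")
```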


🔹 Alerting Strategy: Meaningful, Not Noisy

Alert rules were designed around realistic system behavior, avoiding alert fatigue while keeping detection actionable.

Configured alerts include:

  • Surge in HTTP 4xx/5xx errors
  • CPU and memory anomalies
  • Slow or failed health checks
  • Data collection delays
  • Processing throughput drops
  • Dependency timeout spikes

Each alert uses:

  • Tuned thresholds (based on real baselines)
  • Proper look-back windows
  • Severity mapping
  • Evaluation frequency configured for the metric type

This ensures alerts represent real incidents, not transient noise.
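
To make those knobs concrete, here is one alert rule expressed as code with the azure-mgmt-monitor SDK. Treat it as a sketch: the resource IDs are placeholders, and the metric name, threshold, window, and frequency are illustrative stand-ins for values that, in the real setup, came from measured baselines.

```python
# Sketch: a single metric alert rule (HTTP 5xx surge) created via azure-mgmt-monitor.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertAction,
    MetricAlertResource,
    MetricAlertSingleResourceMultipleMetricCriteria,
    MetricCriteria,
)

SUBSCRIPTION_ID = "<subscription-id>"          # hypothetical placeholders
APP_RESOURCE_ID = "<app-service-resource-id>"
ACTION_GROUP_ID = "<action-group-resource-id>"

client = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

rule = MetricAlertResource(
    location="global",
    description="Surge in HTTP 5xx responses on the API service",
    severity=2,                                # severity mapping
    enabled=True,
    scopes=[APP_RESOURCE_ID],
    evaluation_frequency="PT5M",               # how often the rule is evaluated
    window_size="PT15M",                       # look-back window
    criteria=MetricAlertSingleResourceMultipleMetricCriteria(
        all_of=[
            MetricCriteria(
                name="http5xx-surge",
                metric_name="Http5xx",
                metric_namespace="Microsoft.Web/sites",
                operator="GreaterThan",
                threshold=20,                  # tuned against real baselines
                time_aggregation="Total",
            )
        ]
    ),
    actions=[MetricAlertAction(action_group_id=ACTION_GROUP_ID)],
)

client.metric_alerts.create_or_update("<resource-group>", "api-5xx-surge", rule)
```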


🔹 Slack Notification Integration (Action Group → Webhook)

All alerts flow straight into the team's Slack workspace.
This is done through:

  • An Azure Action Group
  • A Slack Incoming Webhook
  • Mapped severity → Slack channel routing

This integration ensures every critical event reaches the team instantly, without relying on email.

Example alert flow:

Azure Monitor → Action Group → Slack Webhook → Incident Channel

This drastically reduces reaction time during incidents.
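
The post does not detail the transformation step, but the Action Group's webhook payload (common alert schema JSON) is not in Slack's message format, so a thin relay, for example a small Azure Function, commonly reshapes it before posting to the Incoming Webhook. A sketch of that relay step, with placeholder webhook URLs and a simple severity-to-channel map:

```python
# Sketch: reshape an Azure Monitor common-alert-schema payload into a Slack
# message and post it to an Incoming Webhook. URLs and routing are placeholders.
import json
import urllib.request

SLACK_WEBHOOKS = {                       # severity -> channel webhook (hypothetical)
    "Sev0": "https://hooks.slack.com/services/XXX/critical",
    "Sev1": "https://hooks.slack.com/services/XXX/critical",
    "Sev2": "https://hooks.slack.com/services/XXX/incidents",
    "Sev3": "https://hooks.slack.com/services/XXX/incidents",
}

def forward_alert(alert_payload: dict) -> None:
    """Turn the alert's 'essentials' block into a short Slack message."""
    essentials = alert_payload["data"]["essentials"]
    severity = essentials.get("severity", "Sev3")
    text = (
        f":rotating_light: *{essentials.get('alertRule', 'Unknown rule')}* "
        f"({severity}) is {essentials.get('monitorCondition', 'Fired')}"
    )
    request = urllib.request.Request(
        SLACK_WEBHOOKS.get(severity, SLACK_WEBHOOKS["Sev3"]),
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)
```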


🔹 Unified Visualization with Grafana

Grafana provides the single-pane-of-glass view of the entire system.

Setup included:

  1. Creating a Grafana instance
  2. Assigning it the Monitoring Reader role
  3. Adding Azure Monitor as a data source
  4. Building dashboards that combine:
  • Infrastructure metrics
  • Application Insights telemetry
  • Custom business KPIs

These dashboards allow instant diagnosis and trend analysis, especially during incident reviews.
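
For step 3, here is a hedged sketch of registering Azure Monitor as a data source through Grafana's HTTP API. The Grafana URL, API token, and credential values are placeholders, and the jsonData keys follow the Azure Monitor plugin's provisioning fields, which is an assumption about the exact schema rather than a copy of the real configuration.

```python
# Sketch: create the Azure Monitor data source via Grafana's /api/datasources
# endpoint, assuming app-registration (client secret) authentication.
import json
import urllib.request

GRAFANA_URL = "https://grafana.example.com"      # hypothetical
GRAFANA_TOKEN = "<grafana-api-token>"            # hypothetical

datasource = {
    "name": "Azure Monitor",
    "type": "grafana-azure-monitor-datasource",
    "access": "proxy",
    "jsonData": {
        "azureAuthType": "clientsecret",
        "cloudName": "azuremonitor",
        "tenantId": "<tenant-id>",
        "clientId": "<client-id>",
        "subscriptionId": "<subscription-id>",
    },
    "secureJsonData": {"clientSecret": "<client-secret>"},
}

request = urllib.request.Request(
    f"{GRAFANA_URL}/api/datasources",
    data=json.dumps(datasource).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {GRAFANA_TOKEN}",
    },
)
urllib.request.urlopen(request)
```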


🔹 Key Design Principles

A few core ideas shaped the monitoring system:

1. Layered Observability

Platform → Application → Business → Scheduled Jobs
Each layer answers questions the others cannot.

2. Baseline-Driven Alerting

Thresholds were tuned using real traffic and performance patterns.

3. Centralized Logging

All logs and telemetry feed into a single workspace.

4. Cost-Aware Logging

Diagnostic categories were selected strategically to avoid unnecessary Log Analytics ingestion costs.

5. Fast Notification Path

Slack, not email, because delay kills reliability.


🚀 Final Thoughts

This monitoring system does more than detect failures: it helps prevent incidents by surfacing early symptoms.
Proactive insights, clean alerts, and unified dashboards help maintain product reliability without overwhelming the team.

For anyone building observability for multi-service systems, designing a monitoring stack around clarity, actionable alerts, and proper layering makes all the difference.

