Haripriya Veluchamy
Building an End-to-End Monitoring Architecture in Azure for a Multi-Service Product

When a product depends on multiple interconnected services, reliability becomes non-negotiable. In our case, two critical services form the core of the system:

  • A scheduled data collection service that crawls external sources
  • An API service that processes, transforms, and serves enriched data

To ensure uninterrupted operations, a full-stack monitoring architecture was developed on Azure, covering everything from infrastructure metrics to business KPIs, with automated alerting routed directly into Slack.

This post breaks down the layers of that monitoring system and the principles behind its design.


🔹 Layer 1: Platform Diagnostics with Azure App Service

Both services run on Azure App Service. Azure provides extensive diagnostic categories, and every relevant category was enabled to maximize visibility:

✔ HTTP Logs
✔ Console Logs
✔ Application Logs
✔ Access Audit Logs
✔ IPSecurity Audit Logs
✔ Platform Logs
✔ Authentication Logs
✔ AllMetrics (CPU, Memory, Connections, Threads)

All logs flow into a centralized Log Analytics workspace.
This centralization is key: issues that span services become much easier to correlate.
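
As a concrete illustration of that benefit, the sketch below runs one KQL query against the shared workspace with the azure-monitor-query SDK. It assumes the HTTP Logs category lands in the resource-specific AppServiceHTTPLogs table; the workspace ID and time range are placeholders, not values from the real setup.

```python
# A minimal sketch of cross-service correlation against the central workspace,
# assuming the azure-monitor-query and azure-identity packages and that HTTP
# Logs flow into the resource-specific AppServiceHTTPLogs table.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"  # hypothetical placeholder

# KQL: 5xx counts per App Service over the last hour. Grouping by _ResourceId
# in a single query is only possible because both services share one workspace.
QUERY = """
AppServiceHTTPLogs
| where ScStatus >= 500
| summarize errors = count() by _ResourceId, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(
    workspace_id=WORKSPACE_ID,
    query=QUERY,
    timespan=timedelta(hours=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```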


🔹 Layer 2: Deep Telemetry with Application Insights

Platform logs only reveal surface-level issues. To understand what happens inside the applications, both services were integrated with Application Insights, enabling:

  • Exception tracking
  • Full request traces
  • Dependency telemetry (database, external APIs, network calls)
  • Performance metrics
  • Custom instrumentation for business workflows

This added the ability to detect internal bottlenecks, slow external services, and logical failures that App Service logs would never capture.
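
For the custom instrumentation piece, here is a minimal sketch assuming the services are Python apps wired up with the azure-monitor-opentelemetry distro (the post does not specify the stack, and enrich_record is a hypothetical workflow step used only to carry a custom span).

```python
# A sketch of custom instrumentation for a business workflow, assuming the
# azure-monitor-opentelemetry distro and that APPLICATIONINSIGHTS_CONNECTION_STRING
# is set. Span and function names are hypothetical examples.
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

# Wires OpenTelemetry traces, metrics and logs to Application Insights.
configure_azure_monitor()

tracer = trace.get_tracer(__name__)

def enrich_record(record: dict) -> dict:
    """Hypothetical business step, shown only to demonstrate a custom span."""
    with tracer.start_as_current_span("enrich_record") as span:
        span.set_attribute("record.source", record.get("source", "unknown"))
        # ... transformation logic would go here ...
        return record

if __name__ == "__main__":
    enrich_record({"source": "crawler"})
```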


🔹 Layer 3: Smart Health Check Endpoints

Each service exposes a real health check endpoint, not just a bare “200 OK”.
These checks validate:

  • Database connectivity
  • Cache availability
  • External dependency readiness
  • Background job health
  • Internal workflow availability

They act as early warning signals for issues that have not yet impacted the user-facing experience.
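
A minimal sketch of such an endpoint is shown below, using Flask and cheap TCP probes; the framework choice, dependency hosts, and ports are assumptions for illustration, not the production implementation.

```python
# A minimal health endpoint sketch using Flask (framework choice assumed).
# Each check is a cheap TCP probe; hosts and ports are hypothetical
# placeholders for the real dependencies.
import socket

from flask import Flask, jsonify

app = Flask(__name__)

DEPENDENCIES = {
    "database": ("db.internal.example", 5432),   # hypothetical
    "cache": ("cache.internal.example", 6379),   # hypothetical
}

def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the dependency succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

@app.route("/health")
def health():
    results = {name: tcp_check(host, port) for name, (host, port) in DEPENDENCIES.items()}
    healthy = all(results.values())
    # Return per-dependency detail so the alert (and the on-call engineer)
    # can see which layer is degraded, not just a bare 200/503.
    return jsonify({"checks": results, "healthy": healthy}), (200 if healthy else 503)
```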


🔹 Layer 4: Heartbeat Validation for Scheduled Workloads

The data collector runs on a time-based schedule, so a heartbeat mechanism ensures silent failures are caught immediately.

The mechanism:

  1. Updates a timestamp file or marker after each successful crawl
  2. A validator process monitors the timestamp freshness
  3. Alerts are fired when the heartbeat stops updating

This prevents scenarios where the crawler appears healthy from the outside but has quietly stopped running.
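
A stdlib-only sketch of that mechanism: the crawler touches a marker file after each successful run, and a separate validator treats a stale marker as a failure. The file path and freshness threshold are illustrative, not the real values.

```python
# Heartbeat pattern: the crawler records a timestamp after each successful
# crawl, and a validator process alerts when the timestamp goes stale.
from datetime import datetime, timedelta, timezone
from pathlib import Path

HEARTBEAT_FILE = Path("/tmp/crawler_heartbeat")   # hypothetical location
MAX_AGE = timedelta(minutes=90)                   # schedule interval plus slack

def record_heartbeat() -> None:
    """Called by the crawler after every successful crawl."""
    HEARTBEAT_FILE.write_text(datetime.now(timezone.utc).isoformat())

def heartbeat_is_fresh() -> bool:
    """Called by the validator on its own schedule."""
    if not HEARTBEAT_FILE.exists():
        return False
    last = datetime.fromisoformat(HEARTBEAT_FILE.read_text().strip())
    return datetime.now(timezone.utc) - last < MAX_AGE

if __name__ == "__main__":
    record_heartbeat()
    print("fresh" if heartbeat_is_fresh() else "stale -> fire alert")
```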


🔹 Performance Optimization with Server-Side Caching

To keep the API responsive, server-side caching was added with:

  • TTL-driven expiry
  • LRU eviction
  • Cache pre-warming for high-traffic endpoints

This reduces dependency load and ensures predictable performance even during spikes.
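
The sketch below hand-rolls the described behaviour (TTL expiry, LRU eviction, pre-warming) to make the semantics explicit; in practice a library such as cachetools.TTLCache offers the same. Endpoint names and payloads are hypothetical.

```python
# A hand-rolled TTL + LRU cache sketch with a simple pre-warming step.
import time
from collections import OrderedDict

class TTLLRUCache:
    def __init__(self, maxsize: int = 1024, ttl: float = 300.0):
        self.maxsize, self.ttl = maxsize, ttl
        self._data = OrderedDict()   # key -> (expires_at, value)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if expires_at < time.monotonic():        # TTL-driven expiry
            del self._data[key]
            return None
        self._data.move_to_end(key)              # mark as recently used
        return value

    def set(self, key, value):
        self._data[key] = (time.monotonic() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.maxsize:    # LRU eviction
            self._data.popitem(last=False)

# Pre-warm high-traffic endpoints at startup (names are illustrative).
cache = TTLLRUCache(maxsize=512, ttl=120)
for endpoint in ("/api/top-products", "/api/summary"):
    cache.set(endpoint, f"precomputed payload for {endpoint}")
```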


🔹 Alerting Strategy: Meaningful, Not Noisy

Alert rules were designed around realistic system behavior, avoiding alert fatigue while keeping detection actionable.

Configured alerts include:

  • Surge in HTTP 4xx/5xx errors
  • CPU and memory anomalies
  • Slow or failed health checks
  • Data collection delays
  • Processing throughput drops
  • Dependency timeout spikes

Each alert uses:

  • Tuned thresholds (based on real baselines)
  • Proper look-back windows
  • Severity mapping
  • Evaluation frequency configured for the metric type

This ensures alerts represent real incidents, not transient noise.
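
To make those knobs concrete, here is one alert rule expressed as code with the azure-mgmt-monitor SDK. Treat it as a sketch: the resource IDs are placeholders, and the metric name, threshold, window, and frequency are illustrative stand-ins for values that, in the real setup, came from measured baselines.

```python
# Sketch: a single metric alert rule (HTTP 5xx surge) created via azure-mgmt-monitor.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertAction,
    MetricAlertResource,
    MetricAlertSingleResourceMultipleMetricCriteria,
    MetricCriteria,
)

SUBSCRIPTION_ID = "<subscription-id>"          # hypothetical placeholders
APP_RESOURCE_ID = "<app-service-resource-id>"
ACTION_GROUP_ID = "<action-group-resource-id>"

client = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

rule = MetricAlertResource(
    location="global",
    description="Surge in HTTP 5xx responses on the API service",
    severity=2,                                # severity mapping
    enabled=True,
    scopes=[APP_RESOURCE_ID],
    evaluation_frequency="PT5M",               # how often the rule is evaluated
    window_size="PT15M",                       # look-back window
    criteria=MetricAlertSingleResourceMultipleMetricCriteria(
        all_of=[
            MetricCriteria(
                name="http5xx-surge",
                metric_name="Http5xx",
                metric_namespace="Microsoft.Web/sites",
                operator="GreaterThan",
                threshold=20,                  # tuned against real baselines
                time_aggregation="Total",
            )
        ]
    ),
    actions=[MetricAlertAction(action_group_id=ACTION_GROUP_ID)],
)

client.metric_alerts.create_or_update("<resource-group>", "api-5xx-surge", rule)
```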


🔹 Slack Notification Integration (Action Group → Webhook)

All alerts flow straight into the team's Slack workspace.
This is done through:

  • An Azure Action Group
  • A Slack Incoming Webhook
  • Mapped severity → Slack channel routing

This integration ensures every critical event reaches the team instantly, without relying on email.

Example alert flow:

Azure Monitor → Action Group → Slack Webhook → Incident Channel

This drastically reduces reaction time during incidents.
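
The post does not detail the transformation step, but the Action Group's webhook payload (common alert schema JSON) is not in Slack's message format, so a thin relay, for example a small Azure Function, commonly reshapes it before posting to the Incoming Webhook. A sketch of that relay step, with placeholder webhook URLs and a simple severity-to-channel map:

```python
# Sketch: reshape an Azure Monitor common-alert-schema payload into a Slack
# message and post it to an Incoming Webhook. URLs and routing are placeholders.
import json
import urllib.request

SLACK_WEBHOOKS = {                       # severity -> channel webhook (hypothetical)
    "Sev0": "https://hooks.slack.com/services/XXX/critical",
    "Sev1": "https://hooks.slack.com/services/XXX/critical",
    "Sev2": "https://hooks.slack.com/services/XXX/incidents",
    "Sev3": "https://hooks.slack.com/services/XXX/incidents",
}

def forward_alert(alert_payload: dict) -> None:
    """Turn the alert's 'essentials' block into a short Slack message."""
    essentials = alert_payload["data"]["essentials"]
    severity = essentials.get("severity", "Sev3")
    text = (
        f":rotating_light: *{essentials.get('alertRule', 'Unknown rule')}* "
        f"({severity}) is {essentials.get('monitorCondition', 'Fired')}"
    )
    request = urllib.request.Request(
        SLACK_WEBHOOKS.get(severity, SLACK_WEBHOOKS["Sev3"]),
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)
```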


🔹 Unified Visualization with Grafana

Grafana provides the single-pane-of-glass view of the entire system.

Setup included:

  1. Creating a Grafana instance
  2. Assigning it the Monitoring Reader role
  3. Adding Azure Monitor as a data source
  4. Building dashboards that combine:
  • Infrastructure metrics
  • Application Insights telemetry
  • Custom business KPIs

These dashboards allow instant diagnosis and trend analysis, especially during incident reviews.
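
For step 3, here is a hedged sketch of registering Azure Monitor as a data source through Grafana's HTTP API. The Grafana URL, API token, and credential values are placeholders, and the jsonData keys follow the Azure Monitor plugin's provisioning fields, which is an assumption about the exact schema rather than a copy of the real configuration.

```python
# Sketch: create the Azure Monitor data source via Grafana's /api/datasources
# endpoint, assuming app-registration (client secret) authentication.
import json
import urllib.request

GRAFANA_URL = "https://grafana.example.com"      # hypothetical
GRAFANA_TOKEN = "<grafana-api-token>"            # hypothetical

datasource = {
    "name": "Azure Monitor",
    "type": "grafana-azure-monitor-datasource",
    "access": "proxy",
    "jsonData": {
        "azureAuthType": "clientsecret",
        "cloudName": "azuremonitor",
        "tenantId": "<tenant-id>",
        "clientId": "<client-id>",
        "subscriptionId": "<subscription-id>",
    },
    "secureJsonData": {"clientSecret": "<client-secret>"},
}

request = urllib.request.Request(
    f"{GRAFANA_URL}/api/datasources",
    data=json.dumps(datasource).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {GRAFANA_TOKEN}",
    },
)
urllib.request.urlopen(request)
```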


🔹 Key Design Principles

A few core ideas shaped the monitoring system:

1. Layered Observability

Platform → Application → Business → Scheduled Jobs
Each layer answers questions the others cannot.

2. Baseline-Driven Alerting

Thresholds were tuned using real traffic and performance patterns.

3. Centralized Logging

All logs and telemetry feed into a single workspace.

4. Cost-Aware Logging

Diagnostic categories were selected strategically to avoid unnecessary Log Analytics ingestion costs.

5. Fast Notification Path

Slack, not email, because delay kills reliability.


🚀 Final Thoughts

This monitoring system does more than detect failures: it helps prevent incidents by surfacing early symptoms.
Proactive insights, clean alerts, and unified dashboards help maintain product reliability without overwhelming the team.

For anyone building observability for multi-service systems, designing a monitoring stack around clarity, actionable alerts, and proper layering makes all the difference.

