When a product depends on multiple interconnected services, reliability becomes non-negotiable. In our case, two critical services form the core of the system:
- A scheduled data collection service that crawls external sources
- An API service that processes, transforms, and serves enriched data
To ensure uninterrupted operations, a full-stack monitoring architecture was developed on Azure, covering everything from infrastructure metrics to business KPIs, with automated alerting routed directly into Slack.
This post breaks down the layers of that monitoring system and the principles behind its design.
🔹 Layer 1: Platform Diagnostics with Azure App Service
Both services run on Azure App Service. Azure provides extensive diagnostic categories, and every relevant category was enabled to maximize visibility:
✅ HTTP Logs
✅ Console Logs
✅ Application Logs
✅ Access Audit Logs
✅ IPSecurity Audit Logs
✅ Platform Logs
✅ Authentication Logs
✅ AllMetrics (CPU, Memory, Connections, Threads)
All logs flow into a centralized Log Analytics workspace.
This centralization is key: issues across services become much easier to correlate.
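For example, a single KQL query over the shared workspace can surface error spikes from both services side by side. The sketch below is illustrative only: it assumes the `azure-monitor-query` and `azure-identity` packages and a placeholder workspace ID, and relies on the `AppServiceHTTPLogs` table populated by the HTTP Logs diagnostic category.

```python
# Minimal sketch: correlating HTTP errors across both services from the
# shared Log Analytics workspace. Workspace ID is a placeholder.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"  # placeholder

# AppServiceHTTPLogs is where App Service HTTP diagnostics land;
# ScStatus is the returned HTTP status code.
KQL = """
AppServiceHTTPLogs
| where ScStatus >= 500
| summarize errors = count() by bin(TimeGenerated, 5m), _ResourceId
| order by TimeGenerated desc
"""

client = LogsQueryClient(DefaultAzureCredential())
result = client.query_workspace(WORKSPACE_ID, KQL, timespan=timedelta(hours=1))

for table in result.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```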
🔹 Layer 2: Deep Telemetry with Application Insights
Platform logs only reveal surface-level issues. To understand what happens inside the applications, both services were integrated with Application Insights, enabling:
- Exception tracking
- Full request traces
- Dependency telemetry (database, external APIs, network calls)
- Performance metrics
- Custom instrumentation for business workflows
This added the ability to detect internal bottlenecks, slow external services, and logical failures that App Service logs would never capture.
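As a rough illustration, custom instrumentation can be as small as wrapping a business workflow in a span. The sketch below assumes the Azure Monitor OpenTelemetry distro (`azure-monitor-opentelemetry`); the actual services may use a different Application Insights SDK, and the connection string and `enrich_record` workflow are placeholders.

```python
# Minimal sketch of custom instrumentation with the Azure Monitor
# OpenTelemetry distro (an assumption; connection string is a placeholder).
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

configure_azure_monitor(
    connection_string="InstrumentationKey=<placeholder>",
)

tracer = trace.get_tracer(__name__)

def enrich_record(record: dict) -> dict:
    # Custom span: shows up in Application Insights, so business-workflow
    # latency and failures become queryable alongside requests.
    with tracer.start_as_current_span("enrich_record") as span:
        span.set_attribute("record.source", record.get("source", "unknown"))
        try:
            # ... transformation / external enrichment calls go here ...
            return record
        except Exception as exc:
            span.record_exception(exc)  # exception tracking
            raise
```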
🔹 Layer 3: Smart Health Check Endpoints
Each service exposes a real health check endpoint, not just a “200 OK”.
These checks validate:
- Database connectivity
- Cache availability
- External dependency readiness
- Background job health
- Internal workflow availability
They act as early warning signals for issues that have not yet impacted the user-facing experience.
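A minimal sketch of such an endpoint is shown below, with FastAPI assumed for illustration and stub probe functions standing in for the real checks.

```python
# Minimal sketch of a "real" health endpoint (FastAPI assumed; the probe
# functions are stand-ins for actual dependency checks).
from fastapi import FastAPI, Response

app = FastAPI()

async def check_database() -> bool:
    # e.g. run `SELECT 1` against the primary database
    return True

async def check_cache() -> bool:
    # e.g. verify a known key can be read from the cache
    return True

async def check_external_api() -> bool:
    # e.g. lightweight call to a critical upstream dependency
    return True

async def check_background_jobs() -> bool:
    # e.g. verify the job queue is draining / last run is recent
    return True

@app.get("/health")
async def health(response: Response):
    checks = {
        "database": await check_database(),
        "cache": await check_cache(),
        "external_api": await check_external_api(),
        "background_jobs": await check_background_jobs(),
    }
    healthy = all(checks.values())
    # 503 lets platform health checks and alert rules treat degradation
    # as a failure, not just a slow page.
    response.status_code = 200 if healthy else 503
    return {"status": "ok" if healthy else "degraded", "checks": checks}
```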
🔹 Layer 4: Heartbeat Validation for Scheduled Workloads
The data collector runs on a time-based schedule, so a heartbeat mechanism ensures silent failures are caught immediately.
The mechanism:
- The crawler updates a timestamp file or marker after each successful crawl
- A validator process monitors the freshness of that timestamp
- An alert fires when the heartbeat stops updating
This prevents scenarios where the crawler appears healthy from the outside but has quietly stopped running.
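A minimal sketch of the pattern, with an illustrative file path and freshness threshold:

```python
# Minimal sketch of the heartbeat pattern: the crawler touches a marker
# after each successful run, and a separate validator alerts when the
# marker goes stale. Path and threshold are illustrative.
import time
from datetime import datetime, timezone
from pathlib import Path

HEARTBEAT_FILE = Path("/tmp/crawler.heartbeat")
MAX_AGE_SECONDS = 2 * 60 * 60  # crawler is expected to run at least every 2 hours

def record_heartbeat() -> None:
    """Called by the crawler after each successful crawl."""
    HEARTBEAT_FILE.write_text(datetime.now(timezone.utc).isoformat())

def heartbeat_is_fresh() -> bool:
    """Called by the validator on its own schedule."""
    if not HEARTBEAT_FILE.exists():
        return False
    age = time.time() - HEARTBEAT_FILE.stat().st_mtime
    return age < MAX_AGE_SECONDS

if __name__ == "__main__":
    if not heartbeat_is_fresh():
        # In the real setup this raises an Azure Monitor / Slack alert;
        # a non-zero exit is enough for a scheduled validator job.
        raise SystemExit("Heartbeat is stale: crawler may have silently stopped")
```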
🔹 Performance Optimization with Server-Side Caching
To keep the API responsive, server-side caching was added with:
- TTL-driven expiry
- LRU eviction
- Cache pre-warming for high-traffic endpoints
This reduces dependency load and ensures predictable performance even during spikes.
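A minimal sketch of this combination using `cachetools` (the library choice, TTL value, and endpoint names are assumptions, not the production setup):

```python
# Minimal sketch of TTL + LRU caching with cachetools.
from cachetools import TTLCache

# TTLCache expires entries after `ttl` seconds and falls back to LRU
# eviction once `maxsize` is reached.
cache: TTLCache = TTLCache(maxsize=1024, ttl=300)

def fetch_enriched_data(endpoint: str) -> dict:
    # Stand-in for the expensive dependency call the cache protects.
    return {"endpoint": endpoint, "payload": "..."}

def get_cached(endpoint: str) -> dict:
    if endpoint not in cache:
        cache[endpoint] = fetch_enriched_data(endpoint)
    return cache[endpoint]

def prewarm(high_traffic_endpoints: list[str]) -> None:
    # Pre-warming: populate the cache before traffic arrives so the first
    # requests after a deploy or restart don't hit cold dependencies.
    for endpoint in high_traffic_endpoints:
        get_cached(endpoint)

prewarm(["/api/items/popular", "/api/summary"])
```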
🔹 Alerting Strategy: Meaningful, Not Noisy
Alert rules were designed around realistic system behavior, avoiding alert fatigue while ensuring actionable detection.
Configured alerts include:
- Surge in HTTP 4xx/5xx errors
- CPU and memory anomalies
- Slow or failed health checks
- Data collection delays
- Processing throughput drops
- Dependency timeout spikes
Each alert uses:
- Tuned thresholds (based on real baselines)
- Proper look-back windows
- Severity mapping
- Evaluation frequency configured for the metric type
This ensures alerts represent real incidents, not transient noise.
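As an illustration, one such rule (a 5xx surge on the API app) could be defined with the `azure-mgmt-monitor` SDK roughly as follows; the resource IDs, threshold, and windows are placeholders rather than the real tuned values.

```python
# Minimal sketch of one tuned metric alert rule. All IDs, thresholds and
# windows are placeholders; the real rules were tuned against baselines.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertAction,
    MetricAlertResource,
    MetricAlertSingleResourceMultipleMetricCriteria,
    MetricCriteria,
)

SUBSCRIPTION_ID = "<subscription-id>"
APP_RESOURCE_ID = "<app-service-resource-id>"
ACTION_GROUP_ID = "<action-group-resource-id>"

client = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

rule = MetricAlertResource(
    location="global",
    description="Surge in HTTP 5xx responses on the API service",
    severity=1,                      # severity mapping
    enabled=True,
    scopes=[APP_RESOURCE_ID],
    evaluation_frequency="PT5M",     # how often the rule is evaluated
    window_size="PT15M",             # look-back window
    criteria=MetricAlertSingleResourceMultipleMetricCriteria(
        all_of=[
            MetricCriteria(
                name="http5xx_surge",
                metric_name="Http5xx",
                operator="GreaterThan",
                threshold=25,        # tuned from the real baseline
                time_aggregation="Total",
            )
        ]
    ),
    actions=[MetricAlertAction(action_group_id=ACTION_GROUP_ID)],
)

client.metric_alerts.create_or_update("<resource-group>", "api-5xx-surge", rule)
```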
🔹 Slack Notification Integration (Action Group → Webhook)
All alerts flow straight into the team's Slack workspace.
This is done through:
- An Azure Action Group
- A Slack Incoming Webhook
- Mapped severity → Slack channel routing
This integration ensures every critical event reaches the team instantly, without relying on email.
Example alert flow:
Azure Monitor → Action Group → Slack Webhook → Incident Channel
This drastically reduces reaction time during incidents.
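When severity-based routing needs more than a single webhook, a small relay (for example, an Azure Function sitting behind the Action Group) can parse the alert payload and pick the channel. The sketch below is illustrative: it assumes the common alert schema is enabled on the Action Group and uses placeholder webhook URLs.

```python
# Minimal sketch of severity -> channel routing for incoming Azure Monitor
# alerts (common alert schema assumed). Webhook URLs are placeholders.
import requests

SEVERITY_TO_WEBHOOK = {
    "Sev0": "https://hooks.slack.com/services/<critical-channel-webhook>",
    "Sev1": "https://hooks.slack.com/services/<critical-channel-webhook>",
    "Sev2": "https://hooks.slack.com/services/<incident-channel-webhook>",
    "Sev3": "https://hooks.slack.com/services/<low-priority-channel-webhook>",
}

def relay_alert(payload: dict) -> None:
    # Common alert schema: the key fields live under data.essentials.
    essentials = payload["data"]["essentials"]
    severity = essentials.get("severity", "Sev3")
    text = (
        f":rotating_light: *{essentials.get('alertRule')}* "
        f"({severity}, {essentials.get('monitorCondition')})"
    )
    webhook = SEVERITY_TO_WEBHOOK.get(severity, SEVERITY_TO_WEBHOOK["Sev3"])
    requests.post(webhook, json={"text": text}, timeout=5)
```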
🔹 Unified Visualization with Grafana
Grafana provides the single-pane-of-glass view of the entire system.
Setup included:
- Creating a Grafana instance
- Assigning it the Monitor Reader role
- Adding Azure Monitor as a data source
- Building dashboards that combine:
  - Infrastructure metrics
  - Application Insights telemetry
  - Custom business KPIs
These dashboards allow instant diagnosis and trend analysis, especially during incident reviews.
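For reference, the data source step can also be scripted against Grafana's HTTP API. The sketch below is an assumption-heavy illustration: it presumes managed-identity auth (which the Monitor Reader role assignment enables), placeholder URL, token, and subscription ID, and field names from the Azure Monitor data source plugin that should be checked against your Grafana version.

```python
# Minimal sketch: registering Azure Monitor as a Grafana data source via
# Grafana's HTTP API. URL, token and subscription ID are placeholders.
import requests

GRAFANA_URL = "https://<grafana-instance>"
GRAFANA_TOKEN = "<api-token>"

datasource = {
    "name": "Azure Monitor",
    "type": "grafana-azure-monitor-datasource",
    "access": "proxy",
    "jsonData": {
        "azureAuthType": "msi",            # use the managed identity
        "subscriptionId": "<subscription-id>",
    },
}

resp = requests.post(
    f"{GRAFANA_URL}/api/datasources",
    json=datasource,
    headers={"Authorization": f"Bearer {GRAFANA_TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
```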
🔹 Key Design Principles
A few core ideas shaped the monitoring system:
1. Layered Observability
Platform → Application → Business → Scheduled Jobs
Each layer answers questions the others cannot.
2. Baseline-Driven Alerting
Thresholds were tuned using real traffic and performance patterns.
3. Centralized Logging
All logs and telemetry feed into a single workspace.
4. Cost-Aware Logging
Diagnostic categories were selected strategically to avoid unnecessary Log Analytics ingestion costs.
5. Fast Notification Path
Slack, not email, because delay kills reliability.
🔹 Final Thoughts
This monitoring system does more than detect failures: it prevents incidents by surfacing early symptoms.
Proactive insights, clean alerts, and unified dashboards help maintain product reliability without overwhelming the team.
For anyone building observability for multi-service systems, designing a monitoring stack around clarity, actionable alerts, and proper layering makes all the difference.
