At 2am on a Tuesday, an IP address change in Microsoft
infrastructure silently broke our entire integration
pipeline at Blue Yonder. Messages stopped flowing
between Salesforce and ServiceNow. No errors surfaced
in the application logs. The systems reported themselves
as healthy. But nothing was moving.
It was Azure Application Insights that found it.
The Application Map showed 100% failure rate on the
Service Bus node. I clicked the connection line. Saw
connection refused errors on the dependency calls.
Traced the root cause to the IP change within minutes.
I wrote a Logic App to inspect the dead-letter queue
and replay every failed message. Zero data loss.
Zero SLA breach.
That incident permanently shaped how I think about
observability. App Insights is not a monitoring tool you add
at the end. It is the foundation you build everything on
from day one.
This post covers everything I learned using App Insights
in production - the queries, the alerts, the patterns,
and the lessons that only come from real incidents.
What App Insights Collects Automatically
The first thing that surprises most engineers is how
much App Insights collects with zero configuration.
Add one NuGet package and one line in Program.cs and
you immediately get:
Every HTTP request your API receives - the URL, method,
response code, duration, and whether it succeeded.
Every dependency call your app makes outbound - SQL
queries, HTTP calls to external APIs, Service Bus
operations, Blob Storage reads. Each one tracked with
duration and success status.
Every unhandled exception with the full stack trace,
the exact line of code that failed, and all exception
properties including inner exceptions.
Every log message you write with ILogger - Information,
Warning, Error levels all captured with custom properties.
Performance counters including CPU usage, memory
consumption, and request queue length collected
automatically on App Service.
Setting It Up in C# ASP.NET Core
Three steps and you are done.
Install the NuGet package:
Microsoft.ApplicationInsights.AspNetCore
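Add one line in Program.cs:
builder.Services.AddApplicationInsightsTelemetry();
Add the connection string to appsettings.json under
ApplicationInsights:ConnectionString, or set it in
Azure App Service Configuration (never hardcode it).
The parameterless call reads that key from configuration.
The overload that takes a single string expects an
instrumentation key, not a connection string.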
That is it. Every ILogger call in your code now goes
to App Insights automatically. No code changes needed
in your controllers or services.
For custom events and metrics - tracking business-level
events beyond technical telemetry - inject TelemetryClient
and call TrackEvent() with a name and custom properties.
I used this at Blue Yonder to track every integration
completion with the system name and record count as
properties, so I could query processing volumes by
system over time.
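A minimal sketch of that pattern, assuming the
TelemetryClient that AddApplicationInsightsTelemetry
registers in dependency injection. The class name, the
event name IntegrationCompleted, and the property keys
are hypothetical, not the names used at Blue Yonder.
Putting the record count in the metrics dictionary is a
choice made for this sketch so the value stays numeric:

```csharp
using System.Collections.Generic;
using Microsoft.ApplicationInsights;

// Hypothetical wrapper; all names here are illustrative.
public sealed class IntegrationTelemetry
{
    private readonly TelemetryClient _telemetry;

    // TelemetryClient is resolved from dependency injection.
    public IntegrationTelemetry(TelemetryClient telemetry) => _telemetry = telemetry;

    // String properties end up in customDimensions.
    public static Dictionary<string, string> BuildProperties(string systemName, string correlationId) =>
        new Dictionary<string, string>
        {
            ["System"] = systemName,
            ["CorrelationId"] = correlationId
        };

    public void TrackCompletion(string systemName, string correlationId, int recordCount)
    {
        // Numeric values end up in customMeasurements, so they can be summed in KQL.
        var metrics = new Dictionary<string, double> { ["RecordCount"] = recordCount };

        _telemetry.TrackEvent("IntegrationCompleted", BuildProperties(systemName, correlationId), metrics);
    }
}
```

The event then appears in the customEvents table, with
System under customDimensions and RecordCount under
customMeasurements.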
KQL - The Query Language That Changes Everything
KQL (Kusto Query Language) is what makes App Insights
powerful rather than just a log viewer. It reads left
to right with pipe operators - each step filters or
transforms the result of the previous step.
The basic structure is always the same:
Start with a table name, then pipe through operators
like where, project, summarize, order by, and extend.
Once you understand five queries you can write almost
any investigation query you need. Here are the five
I used most in production:
All failed requests in the last hour:
requests
| where timestamp > ago(1h)
| where success == false
| project timestamp, name, url, resultCode, duration
| order by timestamp desc
This was my first query every morning and after every
deployment. If it returned rows, I had work to do.
Slowest API endpoints today:
requests
| where timestamp > ago(24h)
| summarize AvgDuration = avg(duration), Count = count() by name
| order by AvgDuration desc
| take 10
This identified performance regressions immediately
after deployments before customers reported them.
All exceptions with full detail:
exceptions
| where timestamp > ago(4h)
| project timestamp, type, outerMessage, innermostMessage, method
| order by timestamp desc
The innermostMessage field is the one you want -
it has the root cause, not the wrapper exception.
Dependency failures - what external calls failed:
dependencies
| where timestamp > ago(1h)
| where success == false
| project timestamp, type, target, name, duration, resultCode
| order by timestamp desc
This was how I found the Microsoft IP change.
Service Bus dependencies all showing connection refused,
all starting at the same timestamp.
Error rate by hour - spot patterns:
requests
| where timestamp > ago(24h)
| summarize Total=count(), Failed=countif(success==false) by bin(timestamp,1h)
| extend ErrorRate = round(Failed*100.0/Total, 2)
| order by timestamp asc
This chart pattern shows you whether failures are
random noise or a systematic problem getting worse.
The Queries I Wrote at Blue Yonder
Beyond the standard queries, I built several patterns
specific to integration monitoring that I have not seen
documented elsewhere.
Cross-system correlation - following one record through
every system it touched. Every request in our pipeline
carried a CorrelationId custom property. With this query
I could trace a single Salesforce case through Logic App
orchestration, Function App transformation, Service Bus
messaging, and the final ServiceNow API call - seeing
exact timestamps and durations at each step:
union requests, dependencies, traces, exceptions
| where timestamp > ago(24h)
| where tostring(customDimensions["CorrelationId"]) == "your-id"
| project timestamp, itemType, name, message, duration, success
| order by timestamp asc
Token refresh monitoring - tracking the 3-month
Salesforce and 6-month ServiceNow credential refresh
cycles that were a significant operational risk before
I centralized them in Key Vault:
customEvents
| where name == "TokenRefreshed"
| extend System = tostring(customDimensions["System"])
| summarize LastRefresh=max(timestamp), SuccessCount=countif(tobool(customDimensions["Success"])==true) by System
Integration pipeline health - the Monday morning query
that showed overnight processing status for every
integration flow:
requests
| where timestamp > ago(12h)
| where name contains "Integration"
| summarize Success=countif(success==true), Failed=countif(success==false), AvgDuration=avg(duration) by name
| extend Status = iff(Failed > 0, "DEGRADED", "HEALTHY")
| order by Status asc
Smart Alerts - Stop Watching Dashboards
The real power of App Insights is alerts that find
problems for you.
Metric alerts fire when a number crosses a threshold -
failed requests greater than 5 in 5 minutes, response
time average greater than 2 seconds, exception count
greater than 10 per hour.
Log alerts run a KQL query on a schedule and fire if
the results meet a condition. This is how I monitored
the dead-letter queue - a query that ran every 5 minutes
and fired on its next run after any message appeared in the DLQ.
In production, a non-empty DLQ is always a signal that
something needs investigation.
Smart detection requires no configuration. App Insights
learns your baseline automatically and alerts on
anomalies - unusual failure rate spikes, abnormal
response time degradation, memory leak patterns.
It caught two issues at Blue Yonder that I would
not have noticed from metrics alone.
Live Metrics During Deployments
Live Metrics shows you what is happening with less than
one second latency - incoming requests per second,
exception rate, dependency call rate, CPU and memory
of every running instance.
I had Live Metrics open on a second monitor during
every production deployment. If the exception rate
spiked within 30 seconds of a deploy I knew immediately
to roll back. If it stayed flat for 2 minutes the
deployment was clean.
This practice caught one bad deployment at Blue Yonder
that would have caused a production incident if we had
waited for customer reports. The rollback took 90 seconds.
The alternative would have been hours of incident response.
Application Map
The Application Map is the fastest way to understand
what broke and where. It shows every component of your
system as a node - your API, the SQL database, Service Bus,
external APIs - with connection lines showing call volume
and failure rate between them.
When the Microsoft IP change broke our pipeline, the
Service Bus node on the Application Map turned red
with a 100% failure rate. I clicked the connection
line between our API and Service Bus. The details pane
showed connection refused with the target IP address.
That one click saved 30 minutes of log digging.
Distributed Tracing
Every request in App Insights gets a unique Operation ID
that flows through every instrumented component it touches.
A single user request can pass through APIM, Logic App,
Function App, Service Bus, and SQL, all under the same
Operation ID.
In Transaction Search, paste the Operation ID and see
every step in chronological order with exact timestamps
and durations. The full story of one request across
your entire distributed system in one view.
The practical implication for code: add a CorrelationId
to your custom log entries and custom events. Then you
can find every log entry related to one business
transaction even when it crosses system boundaries.
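One way to do this with ILogger is a logging scope.
The sketch below assumes the Application Insights logger
provider is registered and keeps its default of including
scopes, in which case the scope values are attached to
each trace as custom dimensions. The class, method, and
parameter names are hypothetical:

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.Logging;

// Hypothetical handler; names are illustrative.
public sealed class CaseSyncHandler
{
    private readonly ILogger<CaseSyncHandler> _logger;

    public CaseSyncHandler(ILogger<CaseSyncHandler> logger) => _logger = logger;

    public void Handle(string correlationId, string caseId)
    {
        // Everything logged inside the scope carries the CorrelationId.
        var scopeState = new Dictionary<string, object> { ["CorrelationId"] = correlationId };

        using (_logger.BeginScope(scopeState))
        {
            _logger.LogInformation("Syncing case {CaseId}", caseId);
            // ... call downstream systems here ...
            _logger.LogInformation("Finished case {CaseId}", caseId);
        }
    }
}
```

Setting the scope once at the entry point avoids passing
the ID into every individual log call.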
Connecting App Insights Across the Azure Stack
App Insights integrates with each service in a typical
Azure integration stack. Logic Apps
through diagnostic settings send all run history to the
same Log Analytics workspace you query with KQL.
Function Apps with AddApplicationInsightsTelemetry()
track every function execution automatically. APIM
connected to App Insights logs every API call with the
caller's subscription key. Service Bus diagnostic
settings expose message counts and DLQ depth as metrics.
The goal is one workspace where you can query across
all these data sources simultaneously. When an incident
spans multiple systems - which in integration work they
always do - you want one place to look, not five dashboards.
Key Lessons From Production
Set up App Insights before you write business logic.
Retrofitting observability into an existing system is
ten times harder than building it in from the start.
Use structured logging with named parameters.
log.LogInformation("Processing {OrderId}", order.Id)
is far more queryable than string concatenation.
The named parameter lands in customDimensions as OrderId,
so you can filter on it in KQL.
Add CorrelationId to everything. Every log entry,
every custom event, every Service Bus message property.
Cross-system tracing without it is guesswork.
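For the Service Bus part, a minimal sketch assuming the
Azure.Messaging.ServiceBus package. The helper class and
the "CorrelationId" application property key are
hypothetical. Setting both the built-in CorrelationId
field and an application property is a choice made for
this sketch, not a requirement:

```csharp
using Azure.Messaging.ServiceBus;

// Hypothetical helper; names are illustrative.
public static class CorrelatedMessages
{
    public static ServiceBusMessage Create(string body, string correlationId)
    {
        // Built-in broker property, usable in subscription filters.
        var message = new ServiceBusMessage(body) { CorrelationId = correlationId };

        // Application property, so consumers that only read custom
        // properties can still pick the ID up and log it.
        message.ApplicationProperties["CorrelationId"] = correlationId;

        return message;
    }
}
```

The consumer reads the same value back and puts it on its
own log entries, which is what keeps the chain unbroken.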
Learn five queries well rather than memorizing dozens.
The queries in this post covered about 90% of the real
incident investigations I ran at Blue Yonder.
Set DLQ alerts before going live. Silent message loss
in a Service Bus integration is the hardest production
bug to diagnose after the fact. The alert costs
5 minutes to set up. The incident it prevents costs days.
Enable sampling in production. A busy integration
platform generates GBs of telemetry per day at full
collection. Adaptive sampling reduces telemetry volume
to stay within cost limits, and it can be configured
to exclude exceptions so that none of them are dropped.
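A sketch of a sampling setup that keeps every exception,
assuming the classic Microsoft.ApplicationInsights.AspNetCore
2.x SDK. The limit of 5 items per second is an
illustrative value, not a recommendation:

```csharp
using Microsoft.ApplicationInsights.AspNetCore.Extensions;
using Microsoft.ApplicationInsights.Extensibility;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Turn off the built-in adaptive sampling so it can be
// replaced with a customized processor below.
builder.Services.AddApplicationInsightsTelemetry(new ApplicationInsightsServiceOptions
{
    EnableAdaptiveSampling = false
});

builder.Services.Configure<TelemetryConfiguration>(config =>
{
    var chain = config.DefaultTelemetrySink.TelemetryProcessorChainBuilder;

    // Sample everything except exceptions (illustrative rate limit).
    chain.UseAdaptiveSampling(maxTelemetryItemsPerSecond: 5, excludedTypes: "Exception");
    chain.Build();
});

var app = builder.Build();
app.Run();
```

Excluded types are never sampled out, so exception counts
in KQL stay exact while request and dependency volume
is reduced.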
App Insights in the TechStack Blog
The C# ASP.NET Core API powering this blog has
App Insights enabled. Every visit to techstackblog.com
that triggers an API call is tracked - the request
duration, the SQL query to Azure SQL, any exceptions.
The connection string lives in Azure App Service
Configuration, not in the code or GitHub.
If you are reading this post and wondering whether
it is being monitored - it is.
Summary
Azure Application Insights gives you complete visibility
into every layer of your Azure stack. Automatic telemetry
collection means zero-config coverage from day one.
KQL gives you the ability to answer any question about
your system's behavior. Smart alerts find problems
before your customers do. Application Map shows you
what broke. Distributed tracing shows you why.
Build it in from day one. Set your alerts. Learn your
five queries. Then stop firefighting and start preventing.
Originally published at TechStack Blog:
https://www.techstackblog.com
Follow for weekly posts on Azure integration, C#,
observability, and cloud engineering.