Azure APIM MCP Audit Logging Without Breaking Everything
In Part 2, we locked down security. Now let's talk about observability.
You need audit logging for compliance. You need distributed tracing for debugging. You need error handling for production resilience.
Here's the problem: by default, you have none of that.
Out of the box, APIM passes MCP requests through with zero logging. You have no idea if someone's calling tools/list or tools/call. You can't tell which API requests are actually MCP traffic. No audit trail. No visibility. Just requests flowing through in the dark. Not what you want, not what compliance wants, not what security wants.
And when you try to add logging? That's when you discover the fun part: accessing the response body in APIM policies causes requests to hang indefinitely.
I found this out the hard way. Days of debugging. Requests hanging. Timeouts everywhere. The fix? Don't touch the response body. Ever.
This issue has been reported to Microsoft and is on their radar. But here's the reality: MCP servers are in preview. When you work with preview functionality, you carry the burden of working around the rough edges yourself. That's the tax you pay for being early.
Let me show you how to get the observability you need, without the trial and error I went through.
The Response Body Problem
What You'd Normally Do
In a typical APIM setup, especially during development, you'd log the response body:
<outbound>
<log-to-eventhub logger-id="my-logger">
@{
return new JObject(
new JProperty("request", context.Request.Body.As<string>()),
new JProperty("response", context.Response.Body.As<string>()) // HANGS
).ToString();
}
</log-to-eventhub>
</outbound>
This will hang your requests. The response never completes. Clients time out.
Why This Happens
When you access context.Response.Body in the <outbound> section:
I don't have access to the APIM MCP internals, but my suspicion: APIM tries to buffer the entire response, which can be large, and which can be streaming. If it's an in-memory buffer, large payloads may simply trigger errors there. If it's streaming, the read blocks until the whole body arrives, which never happens because the client is still waiting on APIM. In short: a deadlock. No bueno, and you break production.
The frustrating part? This behavior isn't called out clearly in the main policy documentation. The policy saves without any warnings, and you're left wondering why you're suddenly getting timeout errors. You figure it out after a lot of trial and error, or, if you're lucky, someone (Hi there ;) ) warns you first. Probably just one of those things that happens in preview.
The good news: there's a reliable workaround that gives you everything compliance needs.
The Solution: Log Metadata, Not Payloads
You can't log response bodies. Fine. Log everything else instead—and it turns out that's exactly what you need for compliance anyway:
<outbound>
<log-to-eventhub logger-id="mcp-audit-logger">
@{
var mcpMethod = "";
var mcpId = "";
// Extract MCP method from request body (safely)
try {
var requestBody = context.Request.Body.As<JObject>(preserveContent: true);
mcpMethod = requestBody["method"]?.ToString() ?? "";
mcpId = requestBody["id"]?.ToString() ?? "";
} catch {
mcpMethod = "parse-error";
}
return new JObject(
// Request metadata
new JProperty("timestamp", DateTime.UtcNow.ToString("o")),
new JProperty("requestId", context.RequestId),
new JProperty("subscriptionId", context.Subscription?.Id ?? "none"),
new JProperty("subscriptionName", context.Subscription?.Name ?? "none"),
// MCP-specific
new JProperty("mcpMethod", mcpMethod),
new JProperty("mcpId", mcpId),
// Response metadata (SAFE - no body access)
new JProperty("statusCode", context.Response.StatusCode),
new JProperty("statusReason", context.Response.StatusReason),
// Timing
new JProperty("elapsedMs", context.Elapsed.TotalMilliseconds),
// Client info
new JProperty("clientIp", context.Request.IpAddress),
new JProperty("userAgent", context.Request.Headers.GetValueOrDefault("User-Agent", ""))
).ToString();
}
</log-to-eventhub>
<base />
</outbound>
Key points:
- context.Request.Body.As<JObject>(preserveContent: true) is safe in <outbound>
- context.Response.StatusCode is safe
- context.Response.Body is NOT safe
- Wrap parsing in try-catch to handle malformed requests
- Any other headers you want to log can be added to the JObject just as easily
Logging Tool Discovery
Tracking who executes tools is necessary, but what about who's discovering your tools? Anyone who has run a public server has seen IP port scans in their life; tool discovery is not much else than that. To log these requests, add a policy snippet to the inbound section, since a discovery request will never hit any outbound services (and therefore never hits outbound policies).
<inbound>
<!-- After security checks, before backend -->
<choose>
<when condition="@(context.Request.Body.As<JObject>(preserveContent: true)["method"]?.ToString() == "tools/list")">
<log-to-eventhub logger-id="mcp-audit-logger" partition-id="0">
@{
return new JObject(
new JProperty("eventType", "discovery"),
new JProperty("timestamp", DateTime.UtcNow.ToString("o")),
new JProperty("requestId", context.RequestId),
new JProperty("subscriptionId", context.Subscription?.Id ?? "none"),
new JProperty("subscriptionName", context.Subscription?.Name ?? "none"),
new JProperty("clientIp", context.Request.IpAddress),
new JProperty("userAgent", context.Request.Headers.GetValueOrDefault("User-Agent", "")),
new JProperty("mcpMethod", "tools/list")
).ToString();
}
</log-to-eventhub>
</when>
</choose>
</inbound>
Why log discovery separately?
- Security monitoring: Who is checking for tools, and are they allowed to? Discovery is an easily forgotten attack surface, so log it.
- Usage analytics: Track which clients are discovering tools vs actually using them. An overload of discovery calls can point to a misbehaving agent.
- Compliance: Auditors want to know who accessed what capabilities, even if they didn't invoke them.
- Performance: A separate partition for discovery logs (partition-id="0") keeps them isolated from invocation logs.
This gives you the complete picture: discovery → invocation → response → errors.
Setting Up Event Hub Logging
Reality check: You need Event Hub for production audit logging. Application Insights alone won't give you the retention, queryability, and compliance guarantees you need. Yes, it's another Azure service. Yes, it adds complexity. But it's the right tool for immutable audit trails. Event Hub also integrates easily with tools like Datadog or Grafana, or whatever SIEM you use for your end product.
If you're running a small pilot, you can skip this and use App Insights only. For regulated environments? Event Hub is non-negotiable.
1. Create Event Hub
# Variables
RESOURCE_GROUP="rg-apim-mcp"
LOCATION="westeurope"
EVENTHUB_NAMESPACE="eh-apim-mcp-logs"
EVENTHUB_NAME="mcp-audit-logs"
# Create namespace
az eventhubs namespace create \
--resource-group $RESOURCE_GROUP \
--name $EVENTHUB_NAMESPACE \
--location $LOCATION \
--sku Standard
# Create event hub
az eventhubs eventhub create \
--resource-group $RESOURCE_GROUP \
--namespace-name $EVENTHUB_NAMESPACE \
--name $EVENTHUB_NAME \
--partition-count 4 \
--message-retention 7
# Get connection string
az eventhubs namespace authorization-rule keys list \
--resource-group $RESOURCE_GROUP \
--namespace-name $EVENTHUB_NAMESPACE \
--name RootManageSharedAccessKey \
--query primaryConnectionString -o tsv
2. Create APIM Logger
# Get APIM instance
APIM_NAME="apim-mcp-prod"
# Create logger
az apim logger create \
--resource-group $RESOURCE_GROUP \
--service-name $APIM_NAME \
--logger-id "mcp-audit-logger" \
--logger-type azureEventHub \
--connection-string "Endpoint=sb://eh-apim-mcp-logs.servicebus.windows.net/;..." \
--description "MCP audit logging"
Or via Azure Portal:
- Navigate to your APIM instance
- APIs → Loggers → + Add
- Name: mcp-audit-logger
- Type: Azure Event Hub
- Connection string: (paste from above)
- Create
3. Configure Named Values (Optional)
If you're managing multiple environments (dev/staging/prod), use named values. If you're just testing, hard-code it and move on:
az apim nv create \
--resource-group $RESOURCE_GROUP \
--service-name $APIM_NAME \
--named-value-id "audit-logger-id" \
--display-name "audit-logger-id" \
--value "mcp-audit-logger"
Then reference in policy:
<log-to-eventhub logger-id="{{audit-logger-id}}">
Complete Production Logging Policy
Here's the full policy with audit logging integrated:
<policies>
<inbound>
<!-- Security policies from Part 2 -->
<choose>
<when condition="@(context.Request.Headers.GetValueOrDefault(\"Ocp-Apim-Subscription-Key\", \"\") == \"\")">
<return-response>
<set-status code="401" reason="Unauthorized" />
<set-body>{"error": "Subscription key required"}</set-body>
</return-response>
</when>
</choose>
<set-variable name="originalSubKey"
value="@(context.Request.Headers.GetValueOrDefault(\"Ocp-Apim-Subscription-Key\", \"\"))" />
<set-variable name="originalJwtToken"
value="@(context.Request.Headers.GetValueOrDefault(\"Authorization\", \"\"))" />
<!-- Store request timestamp for elapsed time calculation -->
<set-variable name="requestStartTime" value="@(DateTime.UtcNow)" />
<base />
<rate-limit calls="100" renewal-period="60" />
<choose>
<when condition="@(context.Subscription == null || context.Subscription.Id == null)">
<return-response>
<set-status code="401" reason="Unauthorized" />
<set-body>{"error": "Invalid subscription key"}</set-body>
</return-response>
</when>
</choose>
<set-header name="Ocp-Apim-Subscription-Key" exists-action="override">
<value>@((string)context.Variables["originalSubKey"])</value>
</set-header>
<set-header name="Authorization" exists-action="override">
<value>@((string)context.Variables["originalJwtToken"])</value>
</set-header>
<!-- Distributed tracing - preserve existing or generate new -->
<set-header name="X-Request-ID" exists-action="skip">
<value>@(context.RequestId.ToString())</value>
</set-header>
<!-- Log tool discovery requests -->
<choose>
<when condition="@(context.Request.Body.As<JObject>(preserveContent: true)["method"]?.ToString() == "tools/list")">
<log-to-eventhub logger-id="mcp-audit-logger" partition-id="0">
@{
return new JObject(
new JProperty("eventType", "discovery"),
new JProperty("timestamp", DateTime.UtcNow.ToString("o")),
new JProperty("requestId", context.RequestId),
new JProperty("subscriptionId", context.Subscription?.Id ?? "none"),
new JProperty("subscriptionName", context.Subscription?.Name ?? "none"),
new JProperty("clientIp", context.Request.IpAddress),
new JProperty("userAgent", context.Request.Headers.GetValueOrDefault("User-Agent", "")),
new JProperty("mcpMethod", "tools/list")
).ToString();
}
</log-to-eventhub>
</when>
</choose>
</inbound>
<backend>
<base />
</backend>
<outbound>
<!-- Audit logging (SAFE - no response body access) -->
<log-to-eventhub logger-id="mcp-audit-logger">
@{
var mcpMethod = "";
var mcpId = "";
var userId = "";
// Extract MCP details
try {
var requestBody = context.Request.Body.As<JObject>(preserveContent: true);
mcpMethod = requestBody["method"]?.ToString() ?? "";
mcpId = requestBody["id"]?.ToString() ?? "";
} catch {
mcpMethod = "parse-error";
}
// Extract user ID from JWT (if present)
try {
var authHeader = context.Request.Headers.GetValueOrDefault("Authorization", "");
if (!string.IsNullOrEmpty(authHeader) && authHeader.StartsWith("Bearer ")) {
var token = authHeader.Substring(7);
// APIM policy expressions ship a built-in JWT helper (AsJwt), so no external library is needed
var jwt = token.AsJwt();
userId = jwt?.Claims.GetValueOrDefault("sub", "") ?? "";
}
} catch {
userId = "jwt-parse-error";
}
return new JObject(
// Timestamps
new JProperty("timestamp", DateTime.UtcNow.ToString("o")),
new JProperty("requestStartTime", ((DateTime)context.Variables["requestStartTime"]).ToString("o")),
// Request metadata
new JProperty("requestId", context.RequestId),
new JProperty("operationId", context.Operation?.Id ?? ""),
new JProperty("apiId", context.Api?.Id ?? ""),
// Subscription details
new JProperty("subscriptionId", context.Subscription?.Id ?? "none"),
new JProperty("subscriptionName", context.Subscription?.Name ?? "none"),
// User context
new JProperty("userId", userId),
new JProperty("clientIp", context.Request.IpAddress),
new JProperty("userAgent", context.Request.Headers.GetValueOrDefault("User-Agent", "")),
// MCP-specific
new JProperty("mcpMethod", mcpMethod),
new JProperty("mcpId", mcpId),
// Response metadata (SAFE)
new JProperty("statusCode", context.Response.StatusCode),
new JProperty("statusReason", context.Response.StatusReason),
// Performance
new JProperty("elapsedMs", context.Elapsed.TotalMilliseconds),
new JProperty("backendTimeMs", context.Response.Headers.GetValueOrDefault("X-Backend-Time", "0")),
// Errors
new JProperty("isError", context.Response.StatusCode >= 400),
new JProperty("lastError", context.LastError?.Message ?? "")
).ToString();
}
</log-to-eventhub>
<!-- Response headers - return the request ID (original or generated) -->
<set-header name="X-Request-ID" exists-action="skip">
<value>@(context.Request.Headers.GetValueOrDefault("X-Request-ID", context.RequestId))</value>
</set-header>
<set-header name="X-RateLimit-Limit" exists-action="override">
<value>100</value>
</set-header>
<set-header name="X-RateLimit-Window" exists-action="override">
<value>60</value>
</set-header>
<base />
</outbound>
<on-error>
<!-- Error logging -->
<log-to-eventhub logger-id="mcp-audit-logger">
@{
return new JObject(
new JProperty("timestamp", DateTime.UtcNow.ToString("o")),
new JProperty("requestId", context.RequestId),
new JProperty("subscriptionId", context.Subscription?.Id ?? "none"),
new JProperty("isError", true),
new JProperty("errorSource", context.LastError?.Source ?? ""),
new JProperty("errorReason", context.LastError?.Reason ?? ""),
new JProperty("errorMessage", context.LastError?.Message ?? ""),
new JProperty("statusCode", context.Response.StatusCode),
new JProperty("elapsedMs", context.Elapsed.TotalMilliseconds)
).ToString();
}
</log-to-eventhub>
<base />
</on-error>
</policies>
Distributed Tracing
Request ID Propagation
Of course you want to be able to trace a request end-to-end. Otherwise, what's the use, besides burning money on logs?
The policy above includes:
<!-- Inbound: Preserve existing or generate new -->
<set-header name="X-Request-ID" exists-action="skip">
<value>@(context.RequestId)</value>
</set-header>
<!-- Outbound: Return the request ID (original or generated) -->
<set-header name="X-Request-ID" exists-action="skip">
<value>@(context.Request.Headers.GetValueOrDefault("X-Request-ID", context.RequestId.ToString()))</value>
</set-header>
Use exists-action="skip" instead of "override". This preserves any X-Request-ID that's already present from upstream services (AI agents, proxies, gateways). Only generate a new one if it doesn't exist.
The inbound header makes sure we actually send the ID to the backend service; the outbound header echoes it back to the caller. Of course, it's your responsibility to make sure the backend uses that ID for its own logging and propagates it downstream as well.
We don't want to break the chain.
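A quick way to verify the chain is to send a known ID and check that it comes back. A minimal sketch in Python, assuming the policy above; the gateway URL and subscription key are placeholders:
import uuid
import requests

req_id = str(uuid.uuid4())
resp = requests.post(
    "https://apim-mcp-prod.azure-api.net/mcp",  # placeholder gateway URL
    headers={
        "Ocp-Apim-Subscription-Key": "<your-subscription-key>",  # placeholder
        "X-Request-ID": req_id,
        "Content-Type": "application/json",
    },
    json={"jsonrpc": "2.0", "id": 1, "method": "tools/list"},
)
# The outbound policy (exists-action="skip") should echo our ID back unchanged
assert resp.headers.get("X-Request-ID") == req_id, "request ID chain is broken"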
What to Monitor
Once Event Hub is streaming your logs, pipe them to whatever observability platform you're using:
Common destinations:
- Grafana Cloud - Event Hub → Grafana Alloy → Grafana Cloud (uses Loki backend)
- Datadog - Event Hub → Azure Function forwarder → Datadog
- Splunk - Event Hub → Splunk HEC connector
- Application Insights - Built-in Azure integration if you're all-in on Azure
The Event Hub JSON format (shown in the policy above) works with pretty much any log aggregator. You're getting structured JSON with timestamps, request IDs, and all the metadata you need.
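To sanity-check that events are actually flowing before you wire up one of those platforms, you can tail the hub directly. A minimal sketch using the azure-eventhub Python SDK (pip install azure-eventhub), assuming the Event Hub from step 1:
import json
from azure.eventhub import EventHubConsumerClient

client = EventHubConsumerClient.from_connection_string(
    conn_str="Endpoint=sb://eh-apim-mcp-logs.servicebus.windows.net/;...",  # from step 1
    consumer_group="$Default",
    eventhub_name="mcp-audit-logs",
)

def on_event(partition_context, event):
    # Each event body is one JSON audit record emitted by the policies above
    record = json.loads(event.body_as_str())
    print(record["timestamp"], record["mcpMethod"], record["statusCode"])

with client:
    # starting_position="-1" reads from the start of the retained stream
    client.receive(on_event=on_event, starting_position="-1")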
Key metrics to track:
Request Metrics: Requests/min, error rate by status code, P50/P95/P99 latency, rate limit hits (429s)
MCP-Specific: tools/list vs tools/call distribution, per-tool error rates, per-subscription usage patterns
Security: Unauthorized attempts (401/403), invalid subscription keys, unusual client IPs, rate limit violations
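Before the dashboards exist, you can compute a few of these straight from a batch of consumed events. A minimal sketch, assuming the JSON schema emitted by the policy above:
from collections import Counter
from statistics import quantiles

def summarize(events: list[dict]) -> dict:
    # events: decoded JSON audit records, e.g. collected by the consumer above
    latencies = sorted(e["elapsedMs"] for e in events)
    errors = sum(1 for e in events if e.get("isError"))
    return {
        "requests": len(events),
        "errorRate": errors / max(len(events), 1),
        "p95Ms": quantiles(latencies, n=20)[18] if len(latencies) >= 2 else None,
        "rateLimited429s": sum(1 for e in events if e.get("statusCode") == 429),
        "byMcpMethod": Counter(e.get("mcpMethod", "unknown") for e in events),
    }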
Performance Considerations
What to Log (And What Not To)
Always log: Request metadata, response status codes, timing, security context, and errors. These are safe, fast, and give you what compliance needs.
Never log: Response bodies (hangs requests), large payloads (kills performance), or sensitive data (violates compliance). Anything that slows down the critical path or exposes PII or PCI is out.
The metadata-only approach handles millions of requests without breaking a sweat. You don't want verbose logging slowing down the hot path and losing money as users bail out.
Sampling Strategy
For high-volume APIs:
<!-- Log 100% of errors, 10% of successes -->
<choose>
<when condition="@(context.Response.StatusCode >= 400)">
<log-to-eventhub logger-id="mcp-audit-logger">
@{ /* full logging */ }
</log-to-eventhub>
</when>
<when condition="@(new Random().Next(100) < 10)">
<log-to-eventhub logger-id="mcp-audit-logger">
@{ /* sampled logging */ }
</log-to-eventhub>
</when>
</choose>
Compliance & Audit Requirements
Data Retention
(Do check this with your compliance team; these are just guidelines.)
- Event Hub: 7-90 days retention
- Log Analytics: 30-365 days retention
- Archive Storage: Unlimited (cold storage)
Audit Trail Fields
For compliance (SOC2, ISO 27001), your audit trail should capture fields like these. Note that toolName isn't extracted by the policy above; for tools/call requests you'd pull it from params.name in the request body, the same way we pull method:
{
"timestamp": "2025-11-05T10:30:45.123Z",
"requestId": "abc-123-def-456",
"userId": "user@company.com",
"subscriptionName": "Company-Production",
"mcpMethod": "tools/call",
"toolName": "processPayment",
"statusCode": 200,
"clientIp": "203.0.113.42",
"action": "execute",
"resource": "/api/payment",
"result": "success"
}
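If your auditors hand you a required-fields list, it's cheap to enforce it continuously on the log stream. A minimal sketch; the required set below is an example, so align it with your actual audit requirements:
REQUIRED_AUDIT_FIELDS = {"timestamp", "requestId", "subscriptionName",
                         "mcpMethod", "statusCode", "clientIp"}

def missing_audit_fields(record: dict) -> list[str]:
    # Returns the compliance fields absent from a single audit record
    return sorted(REQUIRED_AUDIT_FIELDS - record.keys())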
Note: If you have a WAF in front of your APIM, make sure to collect WAF logs as well, because they contain important security events (blocked attacks, suspicious IPs, etc).
GDPR Considerations
During my career I have made my mistakes, and seen others make them. To save you from retroactively sanitizing logs, which is probably going to ruin your weekend, here are some guidelines:
For user data:
- Don't log full request/response bodies (they can contain PII, PCI data)
- Log user IDs, but only if needed (pseudonymized from JWT claims; see the sketch after this list)
- Support data deletion requests (Event Hub retention handles this)
- Implement retention policies (compliance will ask for this)
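Here's what pseudonymization can look like wherever you process the logs. A minimal sketch; the key handling is an assumption (keep the real secret in Key Vault), and keyed hashing is just one option:
import hashlib
import hmac

PSEUDO_KEY = b"load-me-from-key-vault"  # hypothetical secret; never hard-code in production

def pseudonymize(user_id: str) -> str:
    # Deterministic pseudonym: the same user always maps to the same token,
    # so you can still correlate activity without the raw ID ever hitting the logs
    return hmac.new(PSEUDO_KEY, user_id.encode(), hashlib.sha256).hexdigest()[:16]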
Before You Go Live
Core Requirements (Everyone needs this):
- Event Hub logger configured
- Audit logging policy deployed
- Response body access removed (no hanging, although you will notice this quite fast)
- Request IDs propagated to backends (preserving upstream IDs)
- Log retention policies set (check with compliance team)
- Sampling configured for high volume (if needed)
Monitoring Stack (Choose your poison):
- Event Hub → Your SIEM/observability platform (Datadog, Grafana, Splunk, etc.)
- Or Application Insights (if you're all-in on Azure)
- Dashboards created (whatever tool you use)
- Alerts configured for errors/latency/security events to disrupt your sleep
What's Next
Security: locked down. Observability: sorted. Now let's tackle the elephant in the room: automation.
Coming soon: Part 4 - GitOps for Azure APIM MCP: Custom Automation Guide
MCP servers don't support Terraform or ARM templates yet. Microsoft knows this is a gap and it's on the roadmap. In the meantime, I'll show you ideas for automating deployments using custom REST API scripts and CI/CD pipelines, because manual Portal clicks don't scale. This is still something we have to implement, but let's drill down on the concept!
I'm a Product Architect at Backbase, where I design cloud-native banking platforms serving millions of users. The patterns in this series come from real production implementations at enterprise scale. Views are my own.
How are you handling APIM observability? Share your patterns in the comments.
Follow me on LinkedIn for more Azure and platform engineering content.