DEV Community: Richard Nye

Setting up Azure MCP with Service Principal

Richard Nye — Mon, 01 Jun 2026 19:58:45 +0000

Today I'm going to focus on how to set up Azure MCP Server, but more specifically how to use a service principal to authenticate. During my setup, and I've done this with both macOS using npm and Windows using Docker, I really struggled to find a simple guide about how to use Azure MCP with a service principal, so I thought I'd document my own setup to hopefully help the community.

I've also created a follow-on post detailing a real-world usage of AI to streamline an Azure environment that you can read for inspiration.

Why use Azure MCP?

I don't know about you, but our Azure estate is vast. In fact it doesn't take long for an Azure estate to become difficult to analyse, in my opinion. Yes there's Azure Advisor, but that takes time and sometimes feels like a bit of an upsell - yes I'm aware my VMs are running as B-series SKUs, I did that for a reason! We're seeing all these articles and buzz about speeding up app development using MCP, and for good reason, but what about Infrastructure?

Potential uses for Azure MCP

The world really is your oyster here. Personally I've used Azure MCP to highlight inefficiencies within our environment and I'm excited by the potential. For example, I wanted it to act as an Azure architect and FinOps expert to highlight cost savings, and it did just that. It found a VM that looked fine on the surface but was failing to deallocate, numerous potentially orphaned Veeam backup resources, and inefficient use of public IPs on VMs that were behind an Azure firewall. All basic stuff, but really useful when you have a team with varying Azure experience and skill levels.

It wasn't perfect, though, and there's the usual disclaimer when it comes to AI - it's a tool that should never drive you. Every recommendation it's outputted (in a lovely dedicated markdown doc, no less) will be manually checked and confirmed before I go deleting resources. But the speed at which it analysed, and the thorough report it produced, has left me excited to see what's next. One glimpse at the Azure MCP Server docs and the tools it offers should be enough to get the creative juices flowing for anyone reading this!

Where I personally want to go next is getting it to analyse log analytics, triaging alerts before escalating to a human (with context provided by Confluence docs and the Azure estate), and the like. I think it's worth creating a plan, with phases, as part of the PoC here. Document what business challenges you're trying to solve, and how much money and time this might help you save. What are your pain points as a team?

But it can go beyond that. We have two Azure subscriptions that are currently not infrastructure-as-code, and it was going to be a daunting task creating Terraform that not only encompasses all resources that already exist, but also follows our other subscriptions in IaC style. What if I combine access to the IaC repository with Azure MCP? Suddenly this task potentially becomes weeks quicker. I'm yet to test that but I'm excited to.

Prerequisites before configuring Azure MCP with service principal

First, a brief overview of a few things. I won't be covering what MCP is here, as there's already a plentiful supply of learning resources online, but I will be covering the why and how.

Why a service principal?

Now I've noticed that Microsoft documentation like this has you doing an interactive sign-in via Azure CLI or PowerShell first, az login for instance. But this didn't sit well with me for a few reasons:

1) I use a separate admin account which has permission to create, edit, and delete resources, but I wanted this PoC to be strictly read-only.
2) I want to start treating AI like a unique identity.
3) For this PoC I wanted to make authentication a bit more portable so that other team members can try it. It's read-only after all, and limited to one subscription.

So I created a new app registration in Entra, created a client token/secret, and provided that app registration with the permission I wanted. I felt an entirely new identity was best here because I could guarantee it wouldn't have existing permission elsewhere. I wanted a good guardrail in place because this is unknown territory - yes the AI model usually asks for permission before acting, but a single misclick or lapse in concentration could give it permission to modify resources; a huge no for what I was attempting here. The service principal provides another guardrail at a higher level.

Choosing how to run the MCP Server

The documentation gives a variety of options, mostly centred around using some form of package manager: NuGet, npm, and PyPI are the three listed here. But I've been playing around learning MCP by taking advantage of the Docker Desktop MCP Catalog of MCP Servers, and I really wanted consistency in how I'm running MCP with Claude Desktop. Microsoft also handily document exactly that scenario here.

Note that I did also have good experiences with macOS and npm, so really it's up to you how you run this. What I will say is: make sure you're familiar with the generic Azure SDK authentication, because the environment variables available to you stem from that; they're not unique to or created for Azure MCP.

Choosing an MCP Host

This is ultimately a personal decision. I've personally got this working with both Claude Desktop (my preferred MCP host) and VS Code using the GitHub Copilot extension. What I've noticed is that the flow seems fairly similar: you'll set up the MCP server, plan how you'll pass in environment variables at the time of Azure authentication, and modify a JSON file to configure the MCP host to use the MCP server.

Setting up Azure MCP Server with Service Principal

Creating the Service Principal

I won't go into great detail here, this isn't a tutorial on creating a service principal, but you use Entra to first create an app registration, configure how you'll authenticate (I went for a client secret with a short expiry, but certificates are a good option here too), and then assign the app registration/service principal some Azure permissions. It's worth getting this done first because you need the tenant ID, client/app ID of this app registration, and the client secret or cert later on.

Configuring Azure RBAC permissions

As mentioned, I strictly wanted this PoC to be read-only. And the best guardrail I could think of here was Azure RBAC. I wanted something at a layer above the AI so that even if it attempted to make a change it considered worthwhile then it wouldn't be able to. But it's not just about that, it's also about the scope of the PoC, and this will be personal choice. I personally wanted to restrict it to one subscription, but the subscription in its entirety, so I went for the Reader role. For additional security, you could also choose to time-limit this assignment, which when combined with a certificate or short-dated token, could prove effective.

Configuring the environment

Docker Desktop

Basic Docker Setup
Obviously you need Docker Desktop installed and in your PATH environment variable. You should be able to run docker commands in the terminal. If you want to speed up the first run, consider pulling the latest Azure MCP build found here to guarantee the image is available locally.

Authentication
Next up is authentication. With Docker, Microsoft recommends a .env file containing AZURE_TENANT_ID, AZURE_CLIENT_ID, and AZURE_CLIENT_SECRET variables, though I assume other variables for certificates like AZURE_CLIENT_CERTIFICATE_PATH and AZURE_CLIENT_CERTIFICATE_PASSWORD will also work here. It's worth noting that if you do go the certificate route then you'll need to pass the cert file into the container as part of the Docker run config you'll do in the next step. You could use a bind mount for that, but I've not tested it. I went with a client secret.

It's also worth noting that you're storing the client secret in plain text here, and you can also inspect the container to see it. Not ideal and well worth spending some time investigating better and more secure methods. This was acceptable for me given the guardrail in place of Azure RBAC, plus it was all local on my machine, but well worth assessing the risk before proceeding.
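Before pointing Docker at the file, it can help to confirm the three variables are actually in it. The following is a minimal sketch, assuming the file uses the plain KEY=value layout that docker run --env-file reads (blank lines and # comments skipped, no quote handling). The function names are made up for illustration and are not part of Azure MCP or the Azure SDK.

```python
REQUIRED_VARS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value
    return values


def missing_azure_vars(text: str) -> list[str]:
    """Return the required variable names that are absent or empty."""
    values = parse_env_text(text)
    return [name for name in REQUIRED_VARS if not values.get(name)]
```

Calling missing_azure_vars on the contents of your .env file returns an empty list when all three variables are set. It only checks presence; it says nothing about whether the secret is valid or has expired.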

Configuring Claude Desktop
This was thankfully quite simple thanks to Microsoft. They have a great example MCP config file and I worked with Claude to get mine working. Here it is:

    "Azure MCP Server": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "--env-file",
        <ENV FILE PATH HERE>,
        "mcr.microsoft.com/azure-sdk/azure-mcp:latest"
      ]
    }

I've redacted my .env file path; simply replace it with something like "C:\\path\\to\\.env" and note the need for escaped backslashes! On Windows, Claude Desktop stores its config file by default in %APPDATA%\Claude\claude_desktop_config.json but you can also go to Settings > Developer to find it in Claude Desktop.
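If hand-escaping backslashes feels error-prone, one option is to let a JSON library do it. This is a rough sketch rather than anything Microsoft or Anthropic provide: the helper name and the example path are hypothetical, and the mcpServers wrapper is the top-level key I'd expect in the Claude Desktop config, so check it against your own file before pasting.

```python
import json


def azure_mcp_entry(env_file_path: str) -> dict:
    """Build the server entry shown above for a given .env file path."""
    return {
        "Azure MCP Server": {
            "command": "docker",
            "args": [
                "run", "-i", "--rm",
                "--env-file", env_file_path,
                "mcr.microsoft.com/azure-sdk/azure-mcp:latest",
            ],
        }
    }


# Hypothetical path. The raw string holds single backslashes;
# json.dumps doubles them when it writes the JSON text.
config = {"mcpServers": azure_mcp_entry(r"C:\path\to\.env")}
rendered = json.dumps(config, indent=2)
print(rendered)
```

The printed text contains the path with doubled backslashes, which is the form the config file needs.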

Testing
Simply restart Claude Desktop and it should launch a Docker container. Copy your subscription ID and do a test prompt like: "Please test my Azure MCP Server connection by pulling back all resource groups. The subscription ID is 1111111-fffffff-bbbbbb-ggggggg and you should have access."

If you get resource group names returned then setup is complete!

Lessons Learned

I'll start with the negatives before moving onto the exciting potential. Firstly, I don't like the idea of having these environment variables in plaintext on my machine, and I think it's worthwhile seeing exactly how I can take advantage of something like Azure Key Vault or the myriad of secret management solutions available going forwards. Even if that means some sort of script to launch this MCP server. I need to spend time strengthening this side of it.

Secondly, the recommendations weren't always fantastic. It lacked context of the wider estate, so don't expect this to be a silver bullet. Like with everything AI, you really do need to put the work into not only crafting a great prompt but also making sure it has the context it needs to make informed decisions. For me, that looks like configuring Confluence, repository access, and who knows what else. It needs thought.

Lastly, you really need to consider data governance here. You're opening up an Azure estate to an LLM. The use of a business/enterprise plan should go without saying, and ensuring the LLM won't be using your Azure estate to train future models is critical.

Now onto the exciting stuff! I found great benefit in going the service principal route rather than interactive auth. I really didn't like the idea of this AI agent acting as me, with all the permissions of my account. This is precisely the problem that service principals are designed to solve; it's a third-party app accessing Entra/Azure resources, treat it as such. The fact it's Azure SDK under the hood really enables you to tweak it to your needs, and you can take some solace in the fact this is using battle-hardened tech underneath.

It's also given me several recommendations to take forwards. Yes, I might have moaned about how they weren't always correct (and that's mostly on me anyway), but it only took a couple of minutes to completely analyse my estate, understand how Veeam for 365/Azure works, realise that's what we were using and tailor suggestions to it. It found resources we weren't aware needed cleanup (because as a team we're simply too busy), and would even help me craft a change request if I asked it to. That task is easily several hours, if not more considering the breadth of tech we often use in this industry.

I'm really excited to see how it can solve our other problems.

Google changed the way it crawls our site - and exposed several Azure Front Door misconfigurations

Richard Nye — Fri, 29 May 2026 15:33:18 +0000

Originally published at https://rnye.tech

Hi all, today's post details an interesting problem that faced a website thanks to undocumented Google crawl behaviour that hit us suddenly. The website used Azure Front Door for global CDN/WAF capability but only had one origin - hosted in Azure in the UK South region. This should have been fine given it's a UK-centric site that receives very little global traffic - that is until Google starts crawling you from the West Coast of the US suddenly. Let's dive in.

The problem: what Google Search Console was telling us

All was well with the site from the UK: cache hit ratios were in the 80%+ range and response times were generally rapid even if cache was missed. Google typically crawled the site thousands of times a day. Then suddenly a ticket came in detailing a drastic drop-off in crawl requests in mid-April. Average response times were never great according to Google (700ms) but they'd suddenly jumped to almost double that number (1.3s) with seemingly no explanation. There was absolutely no denying the correlation between crawl requests and average response time, and indeed this is documented behaviour - if response times increase, Google backs off. They claim it's to prevent overloading the site, and I believe that, but I also feel it's likely to ensure they're not wasting their crawl compute resources on long-loading pages. Either way, Google Search Console offered zero explanation as to why.

So the team did what most dev/devops teams do - review latest changes, any Azure Front Door configuration changes in particular, as well as wider site changes. And nothing correlated. AFD hadn't been changed for two weeks, it was that stable, and other changes weren't remotely related. Besides, genuine traffic in the UK wasn't seeing the impact. Response times were still good in the P50/P90/P99 metrics.

Tracing the cause with Azure Front Door logs and AI

As I've touched on in my last post, AI can be fantastic at quick data analysis. That's not to say a human couldn't do it, but I'm telling you from experience that AI found this random behaviour change a lot quicker than humans would have. I'd also recommend reading my post about using service principals and the Azure Reader role for guard-railed AI access to ensure you're doing AI analysis in a safe and controlled way.

I ensured the prompt contained the subscription ID, the AFD name and resource ID, and told it the general problem: on x date at y time, we saw a dramatic decrease in Google crawl rate and response times shot up. I instructed it to solely use the data and try to identify a pattern, and I was intentionally vague about certain details. My first attempt failed to identify the cause because I'd mentioned a change, even though that change had occurred a week after the issue started. It got fixated on that and wasn't objective in its analysis. I'd go with a prompt like:

You're an Azure site reliability engineer with extensive Google SEO experience and specialism in monitoring eCommerce sites. We've noticed a drastic drop-off in Google crawl requests and response times have increased. Please analyse the Azure Front Door logs between 15:00 and 19:00 on #th April 2026. For any hypotheses you have, please analyse data from a baseline of the day before (where everything was normal) before outputting them to me. Do not hypothesise about the cause using sources other than Azure Front Door logs. Clarify any uncertainties with me before arriving at your conclusion. Also output the KQL used in your analysis. The Azure Front Door resource name is xyz, resource id is 124-4534346-4577567sdfsdfg and subscription id is 1234567-858-hfhfhfhd.

This prompt addressed the following problems:
1) Claude tried to solve the problem by utilising other sources or its own knowledge of Google SEO. That was too vague here, we'd tried that already. I wanted it to focus solely on the data.
2) Claude had no knowledge of our environment and the first attempt had it finding problems, yes, but problems that were normal for us and unrelated. As soon as I specifically told it to compare to baseline data, it became fantastic. I could see the inner monologue finding issues, checking the baseline data, and realising it wasn't the cause.
3) Keywords including Azure, Google SEO, and Azure Front Door made sure it tapped into the right areas of its knowledge.
4) Having the KQL provided allowed for manual confirmation.

I'd also recommend outputting its analysis as HTML - it made sharing with the team far easier. But only when you're happy the findings are worth sharing - save those tokens!

Useful KQL

These might not be perfect (I've seen Google note that user agents are often spoofed and that you should look up the requester IP) but they did the job for me.

Chart Googlebot requests (based on user agent)

// Googlebot requests for last 60 days, in 12h increments. Adjust those times as necessary.
 AzureDiagnostics
 | where Category == "FrontDoorAccessLog"
 | where userAgent_s contains "googlebot"
 | where TimeGenerated >= ago(60d)
 | summarize Requests=count() by bin(TimeGenerated, 12h)
 | render timechart

Chart Googlebot requests by AFD PoP

// Googlebot requests by pop - adjust times as necessary.
 AzureDiagnostics
 | where Category == "FrontDoorAccessLog"
 | where userAgent_s contains "googlebot"
 | where TimeGenerated >= ago(7d)
 | summarize Requests=count() by bin(TimeGenerated, 4h), pop_s
 | render timechart
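Both queries above trust the user agent string, which ties back to the earlier caveat about spoofing. Google's published check is a reverse DNS lookup on the requester IP, a test that the hostname sits under one of Google's crawler domains, then a forward lookup to confirm the hostname resolves back to the same IP. Here is a minimal sketch of that check. The function name is my own, the lookups are injectable so it can be tried without network access, and the default forward lookup only handles IPv4.

```python
import socket

# Domains Google lists for its crawlers' reverse DNS names.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-resolve to confirm."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    # gethostbyname_ex returns (hostname, aliases, ipv4_addresses).
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip)
        if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
            return False
        return ip in forward_lookup(hostname)
    except OSError:
        # DNS failures (socket.herror, socket.gaierror) count as unverified.
        return False
```

Running a sample of client IPs from the access log through a check like this gives more confidence that the traffic being charted really is Googlebot.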

The cause - Google changed where it crawls from and hit different Azure Front Door PoPs

AI immediately noticed the difference in the pop_s column - Google changed its crawl location from Atlanta to the West Coast (BY/LAX/SJC AFD PoP abbreviations, if interested). While Google crawls our site from all over the globe, what's apparent from AFD is that a couple of PoPs dominate serving Googlebot requests. And rendering the KQL as a timechart made it obvious - Google transitioned to West US over a four-hour period and response times jumped as a result. Our site has a UK South origin - that extra geographic distance was enough for crawling to suffer sufficiently that Google backed off.
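The same pattern can be checked outside the timechart. As a rough sketch, assuming the per-PoP query results have been exported as (time bin, PoP, request count) rows, this finds the busiest PoP in each bin and reports when it changes. The function name is hypothetical, the counts are made up, and "ATL" is only a placeholder label for Atlanta rather than a confirmed AFD code.

```python
from collections import defaultdict


def dominant_pop_changes(rows):
    """rows: iterable of (time_bin, pop, requests).

    Returns [(time_bin, pop)] for the first bin and for every later bin
    whose busiest PoP differs from the previous bin's busiest PoP.
    """
    per_bin = defaultdict(dict)
    for time_bin, pop, requests in rows:
        per_bin[time_bin][pop] = per_bin[time_bin].get(pop, 0) + requests

    changes, previous = [], None
    for time_bin in sorted(per_bin):
        top = max(per_bin[time_bin], key=per_bin[time_bin].get)
        if top != previous:
            changes.append((time_bin, top))
            previous = top
    return changes


# Made-up counts, keyed by hours since the start of the window.
sample = [
    (0, "ATL", 900), (0, "LAX", 50),
    (4, "ATL", 400), (4, "LAX", 450),
    (8, "LAX", 700), (8, "SJC", 200),
]
print(dominant_pop_changes(sample))
```

With real data, a change in the dominant PoP that lines up with the response time jump is the signal to look for.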

It was also apparent that Google was missing cache frequently, in fact 70% of requests were missing cache, and without enough natural US traffic (it's a UK-based site!) there was nothing to warm the resources it crawled naturally. Essentially, unless Google crawled a page twice in quick succession (which it does seem to), it would suffer a 1.3s+ response time.
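To see why the miss ratio matters so much, here is a small worked calculation. It's only a sketch: the 70% miss ratio and 1.3s miss time come from above, while the 0.1s cache-hit time and the 20% miss comparison are assumed figures picked for illustration.

```python
def expected_response_time(miss_ratio, miss_seconds, hit_seconds):
    """Weighted mean of cache-miss and cache-hit response times."""
    if not 0.0 <= miss_ratio <= 1.0:
        raise ValueError("miss_ratio must be between 0 and 1")
    return miss_ratio * miss_seconds + (1.0 - miss_ratio) * hit_seconds


HIT_SECONDS = 0.1  # assumed edge-hit latency, not a measured figure

# 70% misses at 1.3s each, as seen for Googlebot.
crawler = expected_response_time(0.70, 1.3, HIT_SECONDS)
# Hypothetical comparison: the same miss latency with only 20% misses.
warm = expected_response_time(0.20, 1.3, HIT_SECONDS)
print(f"70% misses: {crawler:.2f}s, 20% misses: {warm:.2f}s")
```

Under those assumptions the mean drops to roughly a third just by raising the hit ratio, before the origin gets any closer.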

And to be fair to Google they're quite upfront about that - they can crawl you from wherever they feel like, although they do seem to prioritise response time (we'll touch on this later). If anything, having a global CDN arguably shot us in the foot here - if we only gave good responses from the UK/Europe, Google likely would've focused solely on those locations. But because response times were occasionally great if cache was hit, and were within the acceptable <1s limit from Atlanta, Google's crawler decided to swap us to the West Coast.

The Three Fixes

I'll keep this short.

1) Sort your cache lifetimes out.

We were caching HTML pages for only 10 minutes. I'm still not sure why. We also don't incrementally invalidate the cache based on which pages are changed in a build. Increase your cache lifetimes based on how stable the content is - our content is generally stable but without incremental and selective purge, we've gone with a cache lifetime that's the same as the time between site builds.

2) Check how query parameters are handled by the cache

We had to tweak our list of query parameters that could hit the cache instead of bypassing it. This will depend on your request source tracking, for example.
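The idea is easier to see in code. Below is a minimal sketch of the concept, not AFD's implementation: tracking parameters that don't change the page are dropped before the URL is used as a cache key, so requests that differ only in tracking share one cached copy. The parameter list and the function name are hypothetical; your own list depends on what your tracking adds.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking parameters that do not affect the rendered page.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid"}


def cache_key(url: str) -> str:
    """Return the URL with ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

For example, https://example.com/shoes?utm_source=news&colour=red and https://example.com/shoes?colour=red&gclid=abc both map to https://example.com/shoes?colour=red, so the second request can be served from the copy cached by the first.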

3) Deploy an origin to Central US

This is the big one and deserves its own section.

The Main Fix: deploying another Azure Front Door origin in Central US

It's a UK-based site, as I've mentioned many times. Why should we deploy infrastructure specifically to serve US requests? Because Google demands it, that's why. I won't detail what that change was for privacy reasons, but it's standard AFD stuff. It leverages AFD's latency-based routing.

What is interesting is how Google responded. Within a couple of hours of deploying infrastructure to Azure Central US, the PoPs that Googlebot primarily hit to crawl us changed from West Coast to Iowa and Minnesota. Almost immediately. And response times rapidly improved, obviously, to 300-500ms even with cache misses.

What I will say is that while Google is quick to drop its crawl rate, it seems hesitant to trust again. We're seeing the crawl rate increase more slowly than we'd like, albeit on the up again. Something to be aware of if you run into something similar.

What Microsoft don't tell you about Azure Front Door CDN caching

I found Microsoft's documentation about AFD caching architecture to be seriously lacking. From what I could tell, AFD Classic SKU used to have the concept of 'Origin Shield' - a tiered cache strategy that you could control. If the edge PoP's cache didn't contain the resource, you could manually specify the next cache to try. This would've resolved everything for us if we were able to set the origin shield to UK South. We have enough natural UK traffic that our UK cache is almost always warm (80-90% hit percentage), so traffic never would've hit our origin, even if a UK round-trip still occurred.

So how does the AFD Premium SKU tiered caching actually work? I couldn't fully tell you because Microsoft don't appear to document it. I can see 'REMOTE_HIT' in the AFD logs, so there's clearly some two-tiered caching architecture going on somewhere, but from what I can see AFD generally caches per PoP and then there's a select few tiered caches globally. But the docs on this are sparse; the best I could find was a random Microsoft support/Q&A/Learn thread.

That means it's incredibly difficult to warm a cache yourself via a script that loads all resources every x minutes (believe me, I tried). It did work, but the issue is you're relying on Google hitting those same PoPs, and there's no guarantee they will.
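For anyone wanting to try the warming approach anyway, here is a rough sketch of the kind of script I mean, not the one I actually ran. It requests each URL once and records the status plus the X-Cache response header, assuming your AFD profile returns that header. The function names and user agent string are made up, and scheduling it every x minutes is left to cron or Task Scheduler.

```python
import urllib.request


def fetch_cache_status(url, timeout=10):
    """GET the URL and return (HTTP status, X-Cache header or None)."""
    request = urllib.request.Request(url, headers={"User-Agent": "cache-warmer-sketch"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.headers.get("X-Cache")


def warm_cache(urls, fetch=fetch_cache_status):
    """Request every URL once; failures are recorded rather than raised."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except OSError as error:
            # urllib's URLError and HTTPError both derive from OSError.
            results[url] = (None, str(error))
    return results
```

Bear in mind this only warms whichever PoP is nearest the machine running it, which is exactly the limitation described above.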

I also found a PoP in the AFD logs that Microsoft literally do not have documented in their PoP lists, either by location so that I could try and guess what 'BY' meant, or their actual abbreviation list. I can only assume they've added a PoP recently and not updated the docs. Infuriating when trying to confirm where Googlebot requests are coming from!

Summary

If I had to give one main takeaway from this debacle it's that Google can, and will, choose to crawl you from wherever it feels like and reserves the right to suddenly change where it crawls you from. This will highlight how good your global response times are (albeit with a heavy US-centric bias) as well as your caching strategy and hit ratios. Our monitoring should've caught this sooner, but that's always an ongoing battle.

So even if you're a site serving primarily one country, if you're heavily reliant on SEO and Google generally then you're not as local as you think.