So I am in something of a complicated relationship with Azure.
I like that (in general) it makes my life easier.
I like that hooking up continuous integration is so incredibly easy.
I like that managing deployment slots and setting up new ones is logical and can be done quickly (albeit with something of a deployment wait); and I like that you can configure instances that will scale up or down depending on the demands that are made on their resources.
I don't like how long everything seems to take to update/deploy/propagate.
I don't like that the UI seems to have been built by about 200 people in simultaneous development so that sometimes things happen automagically and sometimes you have to hit 8 different confirm buttons before it registers that yes, you really do want to do that.
I don't like trying to troubleshoot performance issues when there are so many different places for logs/analytics/insights.
And I don't like that occasionally their idea of an error message is an unhappy cloud.
I got 99 problems (well not concurrently, but that's how it felt)
Recently, I was trying to get to the bottom of some rather frustrating performance issues on our Azure cloud catalog.
The symptoms included:
- intermittent downtime
- slow app restarts
- laggy front-end performance
One .NET Core app in particular was very sickly and would typically take 7-8 minutes to find its wee feet again when restarted. Bafflingly, it was also one of our simplest, smallest, lowest-traffic apps, so what gives?
Have you tried switching it off and on again?
Cue a montage (although in reality it was more an increasingly frustrating, ever-decreasing spiral) of trawling through spiky graph after spiky graph in Application Insights, downloading memory dumps, clicking hopefully through every log folder on blob storage and tentatively poking through various routes on the "Diagnose and Solve Problems" dashboard, which wants to "chat" to you. Endearing.
I started using phrases like "possible thread starvation" when colleagues asked how I was getting on, and spent enough time reading about startup configuration in .NET Core that I was able to troubleshoot app bootstrapping at 50 paces, and yet still felt no closer to a solution.
Although, that's not strictly true. I knew a little more about why things were happening...
- we have ~7 production sites sitting within one App Service Plan, and this plan scaled up and down on a schedule (7am and 10pm) as well as when resources were under pressure or released outwith this period
- when the plan scaled, the app service instances within it were either spun up or wound down for the sites, and it was this period that made the poor wee .NET Core app the most unhappy
- the .NET Core app was the one which Pingdom kept pulling up for downtime issues, but actually all of the apps had a bit of a wobble during the restarts (they were just sitting under different alert criteria, doh!)
With this information, I could at least narrow my conversation with Google from the abstract and teenage-angst flavoured "but why?" to a more concrete "managing azure app restarts" and "configuring multiple instances of net core apps". This was small but hopeful progress.
Cutting a long (and, oh, terribly exciting) story a bit short
Further investigation and coding montages led me to a set of guidance that I'll set out here for future reference, and for anyone else who is trying to nurse sickly Azure Web Apps back to health:
First, and the biggest win for me: the AlwaysOn setting on the Application Settings tab. For those not familiar:
When Always On is enabled on a site, Windows Azure will automatically ping your Web Site regularly to ensure that the Web Site is always active and in a warm/running state. This is useful to ensure that a site is always responsive (and that the app domain or worker process has not paged out due to lack of external HTTP requests).
Extracted from Scott Guthrie's blog
Sounds sensible, eh? And it is - on production sites. But, and here is the small hole we'd dug for ourselves having been lent a shovel by Microsoft, it is not a slot-specific option, so - to avoid production sites idling by accident after a staging swap - we had the AlwaysOn option always on. On every slot. On every environment. For every project.
That means that every time our 7 production sites scaled up, we'd get (e.g.) 2 instances of each, and both of these would get restarted and warmed up and then pinged to ensure they are AlwaysOn. So far, so good. But then all of the staging and dev slots would be pinged and forced to start up too, and the sheer volume of I/O would totally destroy the performance of, well, pretty much everything, giving us the perceived downtime. Why do Azure Web Apps suffer so much from this? That's a different kettle of fish.
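If you want to see how far AlwaysOn has crept across your own slots, the Azure CLI can report it per slot. A rough sketch rather than our actual setup - the resource group, app and slot names below are placeholders:

```bash
# Sketch only: check whether AlwaysOn is enabled on the production site
# and on a named slot. All names are placeholders.
RESOURCE_GROUP="my-resource-group"
APP_NAME="my-web-app"

# Production
az webapp config show \
  --resource-group "$RESOURCE_GROUP" \
  --name "$APP_NAME" \
  --query alwaysOn -o tsv

# Staging slot (repeat for dev or any other slots)
az webapp config show \
  --resource-group "$RESOURCE_GROUP" \
  --name "$APP_NAME" \
  --slot staging \
  --query alwaysOn -o tsv
```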
There's no nice way of managing this for us at the moment - if you handle slot swaps with a script, I imagine you can toggle the AlwaysOn option post-swappage. We've just had to add it as a manual check at the end of a deployment. It's not the end of the world, but it's certainly an irritating little nuance to be aware of!
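For anyone who does script their swaps, here's roughly what that post-swap step might look like with the Azure CLI - a sketch rather than our actual pipeline, with placeholder names throughout:

```bash
# Sketch: swap staging into production, then switch AlwaysOn back off on the
# freshly swapped-out staging slot so it isn't kept warm (and pinged awake)
# alongside the production instances. All names are placeholders.
az webapp deployment slot swap \
  --resource-group my-resource-group \
  --name my-web-app \
  --slot staging \
  --target-slot production

az webapp config set \
  --resource-group my-resource-group \
  --name my-web-app \
  --slot staging \
  --always-on false
```

Production stays warm; the slot you've just swapped out of goes back to being allowed to idle.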
Other, smaller wins included: moving from IMemoryCache to IDistributedCache on the .NET Core app (to minimise I/O storage writing, and to enable us to take future advantage of load balancing), and ensuring that the HTTPS Only flag is set to true so that the app initializer isn't bounced around anywhere silly on startup.
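As an aside, the HTTPS Only flag can be flipped from the Azure CLI as well as from the portal - again just a sketch with placeholder names:

```bash
# Sketch: enforce HTTPS at the platform level so the first request after a
# restart isn't bounced through an in-app HTTP-to-HTTPS redirect.
# Names are placeholders.
az webapp update \
  --resource-group my-resource-group \
  --name my-web-app \
  --https-only true
```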
Top comments (6)
I'm quite sure that in the next weeks/months you'll abandon the Azure Portal and move to the Azure CLI or Azure PowerShell. Many of my friends went through the Azure Portal stage. The command line is more up to date than the Azure Portal, too ;)
Aye that seems like the next logical step for us - gives me one less thing to whinge about as well ;)
If you're running into trouble running your apps on App Services, maybe you should consider a different solution? Dockerize the apps and deploy them into some orchestration cluster? There are many options around the cloud ;)
Aye there are a number of different ways to skin the proverbial cat :) Much happier with the performance now though so we'll see how it goes!
This is interesting, I’ll share this with the team tomorrow! We’re definitely finding some performance issues.
Also... “take 7-8 minutes to find its wee feet again” - your Scottish is showing 😂
My Scottish is always showing a little 😛 people are lulled into a false sense of security with my accent and then confused when I use phrases like “pure dinghied” 😇