DEV Community: Stefano d'Antonio

Business continuity and disaster recovery blueprints for enterprises

Stefano d'Antonio — Thu, 25 Nov 2021 17:17:12 +0000

Planning for disasters and having recovery processes in place is critical to any business; whether your domain is an e-commerce platform or a financial institution, a disruption of IT services means loss in revenue and reputation for your enterprise.

What changes is the tolerance to those outages.

If your system is down for 10 minutes it can either mean that the users will happily come back and play your online game later or that you will lose millions in critical transaction fees and your clients will go somewhere else where the system is more reliable.

Why do I need to account for downtimes?

Hardware failures are inevitable; a rack of servers can suffer from faulty power supplies, network switches, overheating and so on, despite all good prevention measures in place. Software can also fail, and bugs or attacks can cause disruptions to the service.

If a component develops a fault, your workload on that hardware/software can be disrupted and data can also be lost.

Terminology

Before we dive deeper into the different options, let's clarify a few terms that may not be familiar if you do not regularly deal with system reliability:

Recovery Time Objective (RTO)

This is what the business defines as the maximum acceptable time it takes to get the system back up in case of an outage. It can be informally or contractually agreed with consumers.

E.G.

Your system is deployed in the West Europe Azure region on Virtual Machines within a single Availability Zone and a single fault/update domain.

The rack of servers that hosts your VM fails, users cannot reach your website anymore.

How long do you deem acceptable to wait until the system is back up in a different rack/datacenter? That's RTO in a nutshell.

Recovery Point Objective (RPO)

How much data can you afford to lose? RPO is measured in time from the outage.

This is all about data; it has nothing to do with your system coming back to life.

E.G.

Your rack of servers from the previous example develops a fault on its hard drives and data is lost.

Assuming you back up your data regularly, how old is your last backup? That's the RPO.

If you back up the data at 9:00PM every day, worst case scenario is that the outage happens at 8:59PM, then your backup will be 1 day old and you would have lost ~24 hours of data. 24h is your RPO. You could be lucky and the outage may happen at 9:01PM right after a completed backup, then you have lost 1 minute of data, but you need to account for the worst case scenario.
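The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and it assumes a single daily backup that completes instantly at the scheduled hour.

```python
from datetime import datetime, timedelta


def data_loss_window(outage: datetime, backup_hour: int = 21) -> timedelta:
    """Time between the last completed daily backup and the outage.

    Assumption: the backup runs once a day at backup_hour and completes instantly.
    """
    last_backup = outage.replace(hour=backup_hour, minute=0, second=0, microsecond=0)
    if last_backup > outage:
        # Today's backup has not run yet, so the latest one is yesterday's.
        last_backup -= timedelta(days=1)
    return outage - last_backup


if __name__ == "__main__":
    # Outage one minute before the 9:00PM backup: almost a full day is lost.
    print(data_loss_window(datetime(2021, 11, 25, 20, 59)))  # 23:59:00
    # Outage one minute after the backup: only a minute is lost.
    print(data_loss_window(datetime(2021, 11, 25, 21, 1)))  # 0:01:00
```

The worst case approaches the full backup interval, which is why the interval itself is what you should quote as RPO.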

Service-level agreement (SLA)

This is probably the most familiar term; likely you have heard of SLA as it is all over the place for cloud resources.

This is the maximum contractual downtime for a service over the year.

In Azure, if a service is down for more than that time, you can ask for credits.

This is to indicate the confidence level of the provider in the availability of the service; things could still go wrong and an outage could last longer, but it is extremely unlikely as this figure comes out of careful Microsoft BCDR planning internally for each service.

You often hear "three nines, four nines, ..."; this refers to the number of nines in the percentage. E.G. 99.9 -> three nines, 99.9999 -> six nines...

How do we translate that into time? Let's consider a 99.99% SLA; that means the service may be unavailable for up to 0.01% of the year. OK, what's that in actual time the system can be down?

Daily: 8s
Weekly: 1m 0s
Monthly: 4m 22s
Quarterly: 13m 8s
Yearly: 52m 35s

You could very well have a server unreachable for 8 seconds every day of the year, or as a one-off for 52 minutes. Having your system down for an hour could have a massive impact in certain domains, and yet 99.99% is quite a good number.
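If you want to reproduce the figures above for a different SLA, the conversion is a one-liner. The sketch below is my own illustration (function names are made up); it assumes an average year of 365.2425 days and truncates to whole seconds, which matches the numbers listed for 99.99%.

```python
YEAR_DAYS = 365.2425  # assumption: average Gregorian year

PERIODS_IN_DAYS = {
    "Daily": 1,
    "Weekly": 7,
    "Monthly": YEAR_DAYS / 12,
    "Quarterly": YEAR_DAYS / 4,
    "Yearly": YEAR_DAYS,
}


def allowed_downtime_seconds(sla_percent: float, period_days: float) -> float:
    """Seconds of downtime an SLA percentage permits over a period."""
    return (100 - sla_percent) / 100 * period_days * 24 * 60 * 60


def format_duration(seconds: float) -> str:
    """Render seconds as 'Xm Ys', truncating fractions of a second."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m {secs}s"


if __name__ == "__main__":
    for name, days in PERIODS_IN_DAYS.items():
        print(f"{name}: {format_duration(allowed_downtime_seconds(99.99, days))}")
```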

Composite SLA

If you consider a single service, 99.99% is a common figure, but your system is unlikely to be that simple; it will usually be composed of multiple chained components.

E.G.

One web app talking to a back-end API talking to a database server.

Even if each component has 99.99% SLA, in the worst case scenario, they could all go down sequentially: Web app is down for 52m then is back up, when the Web App is back, the API is down for 52m and then is back up, when the API is back up, the DB is down for 52m... 2 hours and 36 minutes. OK, that syzygy is quite unlikely to happen, but nonetheless is possible and there is no responsibility on the service provider if that happens.

The provider would have respected their contractual SLAs for each component, so no credits for you, but your system would have still been down for hours.

You can use this calculator to convert percentage into time over the year: Calculate time from SLA

Those are the Azure services SLAs in a nice map: Azure charts SLA

As discussed, a given system is usually composed of different parts, it is possible to calculate the composite SLA with the formulas here: Composite SLA
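As a rough sketch of those formulas (this is not the code of my calculator, and the function names are illustrative): availabilities of components chained in series multiply, while independent redundant copies only fail when all of them fail at the same time.

```python
from math import prod


def serial_sla(*slas_percent: float) -> float:
    """Composite SLA of components that all must be up (chained dependencies)."""
    return prod(sla / 100 for sla in slas_percent) * 100


def redundant_sla(sla_percent: float, copies: int) -> float:
    """Composite SLA of independent copies where one healthy copy is enough."""
    return (1 - (1 - sla_percent / 100) ** copies) * 100


if __name__ == "__main__":
    # Web app -> API -> database, each with a 99.99% SLA.
    print(f"{serial_sla(99.99, 99.99, 99.99):.4f}%")
    # The same 99.9% service deployed in two independent regions.
    print(f"{redundant_sla(99.9, 2):.4f}%")
```

The three-tier chain from the example ends up around 99.97%, lower than any single component, which is the point of looking at the composite figure.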

I have built a tool to do this for you by adding components and defining dependencies, it's on GitHub: https://github.com/UnoSD/SlaCalculator

2022-02-03 update: I have finally published a graphical web app version of the SlaCalculator, please find it here: http://wiki.unosd.com/slacalculator/

This article is more for business guidance so I will not dive deeper into the technical aspects, but in Azure you can leverage within datacenter distribution (Availability sets), across datacenters distribution (Availability zones) and across regions distribution (Region pairs) to maximise the SLA for your solution, see the picture below:

Picture from: Azure resiliency infographic

High availability (HA)

This is quite a generic term and, frankly, you will find it mostly in marketing papers. It means a system resilient to failures, but it does not bear a unit of measure: what I consider to be highly available could be something that has a 70% SLA with a manual failover, while someone else would say HA only applies to a 99.999% SLA. If you would like a less "abrupt" explanation, have a look at the comprehensive Wikipedia page.

Disaster recovery (DR)

By now, you should know this is our main focus here; a dictionary definition could be: a set of policies, tools and processes to recover data/compute from an unforeseen disaster. All the options available to implement this are in the next section.

Different levels of DR

Bear in mind that different levels can apply to data and compute. If your web portal is inaccessible, that's still a disaster, but it is not usually as bad as when there is data loss; as long as you have a recent backup of your data, you are in a much better position: you can tolerate the system being out of service for a while and then bring it back up in the same state where you left it.

No DR

This is the most self-explanatory option; no plan, no resources and no costs.

This can still be a valid option in certain scenarios where your service is not critical and users can wait days/months to access your system again.

If my blog was down for a month, I would be disappointed, but I could just start all over again on a new platform from scratch. There are also real businesses which could tolerate this, but it is quite rare. It could apply to some internal employee system in certain companies.

You could still have a backup of your articles (like I do) on your laptop and you could restore that on a different platform if dev.to is unavailable; if I have a process to back that up on every change, to me that would count as having a manual DR plan for data, but no DR for compute.

Manual

This option has some overlap with the previous one. When I refer to "manual", I do not include someone noticing the system is down and trying to fix it by clicking around cloud portals and uploading a site somewhere else; I will classify that as "no DR" for the purpose of this article.

Manual means a well-documented, step-by-step procedure to react to outages.

A cloud operations team of system administrators will get a notification of service disruption and will be able to respond accordingly and start the failover process.

The process could be going into the Azure portal, creating a new virtual machine, uploading a web application, switching the DNS servers configuration to point to the new server. All documented and the environment could be back up in a matter of hours.

This is how many organisations approach DR, but this approach is slow (it can take from hours to days). Problems can happen outside office hours, so you need to have a team on-call during the night, make sure they are skilled and that there is no single point of failure (people on holidays, sick leave, leaving the company and so on...); despite all of this, human error is still a significant factor.

This approach may sound appealing as there is no additional cost for redundant infrastructure, but the TCO (total cost of ownership) of the solution must take into account people, training and errors.

Infrastructure as code

This is the first automated approach. We are removing human error from the equation, RTO will be much more predictable, you can perform regular tests of this and measure reaction times.

You need to make sure your development teams understand and build scripted environments (or your DevOps engineers, or sysadmins). This is good practice in any case, but not always the reality in the IT world. Most of the legacy applications are still deployed manually on bespoke infrastructure.

We think of "the cloud" as virtually unlimited scalability, but, in reality, the cloud is just yet another datacentre with its own physical and virtual limitations. There is the chance that, when you have an outage and try to redeploy your entire infrastructure in a different zone/region, you can hit capacity limits. This could be a disaster from which you cannot recover if you are not prepared.

A solution in the Azure world would be to purchase capacity reservations; you pay for the guarantee that you will have that capacity when you need to failover. This option significantly increases the cost compared to just keeping a repository with your scripts, but can save the day during an emergency.

What you will save is the cost of management of the inactive resources: no OS patching, no upgrading applications, no security alerts and so on.

The whole deployment process should be ideally automated with CD pipelines to create environments and deploy application workloads with minimal to no configuration effort in case of a disaster.

Cold environment

Now definitions start to become more woolly and more technology specific.

A cold environment is an environment that is already deployed, but stopped.

Cost savings may vary depending on the resources.

In Azure this could mean: VM in a deallocated state, Azure Firewall and Application Gateway stopped et cetera.

When a VM is deallocated, you don't pay for it, you just pay for the storage disks which preserve the state (a tiny fraction of the cost). This means that you can start it up in half the time and you don't have to install software on it afterwards, as you have everything ready on your disk. The same goes for other resources and for more auto-scaling cloud-native services (Azure SQL serverless, Cosmos DB et cetera).

You can still have capacity problems and you can solve them in the same way you would with the previous approach (capacity reservations). The main difference is that this approach will be faster as you only have to start your environment, not provision and prepare it, which can take 50% or more of the recovery time.

This approach seems better than the IaC/CD one, but has a significant downside:

You still have to keep your environment up to date!

Your applications and the OSs are already deployed into the storage disks or hosting platforms; you need to have a frequent (possibly automated) process for also updating your inactive environment in case it is needed.

Imagine you install Windows Server 2016 with the 2017/11 security patches and on 2017/12/14 a new dangerous security bug surfaces; you do not upgrade the system as it is in hibernation, then you need to fail over in 2018/01 and you end up with a vulnerable environment that could be hacked during the fail-over period.

Warm standby environment

This strategy bears similar/equal costs to a fully replicated production environment.

You have a second production environment, you keep it up to date with applications, configuration and OS patches, and you treat it as if it were production; the difference?

This environment does not process any data, does not actively run any task or interact with any user. It is just there, burning money.

Why would you do that? To have an almost-instantaneous fail-over. If the primary system has an outage, you can automate a DNS change or load balancer to immediately direct the traffic to the secondary environment. In Azure, all the load balancers (DNS/Layer 7/Layer 4) include health probes to automatically fail-over to a secondary if a primary environment does not provide the expected response.
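Conceptually, a priority-based fail-over with health probes boils down to "use the first endpoint that answers the probe". The sketch below is a toy model of that decision, not how any Azure load balancer is implemented; the endpoint names and the probe function are made up for the example.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints_by_priority: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the highest-priority endpoint whose health probe succeeds."""
    for endpoint in endpoints_by_priority:
        if is_healthy(endpoint):
            return endpoint
    return None  # every environment is down


if __name__ == "__main__":
    # Hypothetical probe results: the primary does not give the expected response.
    probe_results = {"primary.example.com": False, "standby.example.com": True}
    target = pick_endpoint(
        ["primary.example.com", "standby.example.com"],
        lambda endpoint: probe_results[endpoint],
    )
    print(target)  # standby.example.com
```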

To save on costs, you could have a "smaller scale" warm standby environment (fewer CPUs, less RAM, cheaper SKUs for services et cetera); you will need your users to tolerate a slower experience in the "rare" event of outages.

Active/active

Finally, we analyse the "state of the art" of BCDR strategies.

This is the same as the warm standby, but with a small difference with a massive impact: Both environments are active production environments.

This usually involves a load balancer at the ingress of the system that distributes the load to both environments, usually in a round robin fashion, but can also be more sophisticated and distribute load based on geographic latency and, in case of an outage in one region, direct all the traffic to the only working environment.
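To make the difference with warm standby concrete, here is a toy round-robin balancer that skips unhealthy environments. It is only a sketch of the idea under simple assumptions (health is a set you update yourself, region names are examples); real load balancers add probing, weights and latency-based routing on top.

```python
from itertools import cycle
from typing import Iterable


class RoundRobinBalancer:
    """Distribute requests across active environments, skipping unhealthy ones."""

    def __init__(self, environments: Iterable[str]) -> None:
        self._environments = list(environments)
        self._ring = cycle(self._environments)
        self.healthy = set(self._environments)  # updated by health probes

    def next_environment(self) -> str:
        # Look at each environment at most once per request.
        for _ in range(len(self._environments)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy environment available")


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["west-europe", "north-europe"])
    print([balancer.next_environment() for _ in range(4)])
    balancer.healthy.discard("west-europe")  # simulate a regional outage
    print([balancer.next_environment() for _ in range(3)])
```

With both regions healthy the traffic alternates; after the simulated outage every request lands on the surviving region.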

The massive difference between this and the previous option is that you can now make the most of both environments. Theoretically you could reduce performance (and costs) of both environments to 50%, but also send half of the users to one and the other half to the other environment keeping response times consistent. Most of the cloud-native services also offer no-downtime scaling, so you could scale up one environment in case of failure of the other one.

Why do people still use warm standby (active/passive) when they could do active/active? In reality, applications (in particular legacy ones) are often stateful: you cannot just handle one request on a server and the next one on a different server, as this can break existing applications or not-so-well designed new applications. For this reason warm standby is quite popular, and it often still requires careful planning, as switching to fail-over could be complicated and could corrupt the state; often it requires connection draining or remediation if the application does not support handling traffic on multiple hosts.

This is truly what should be the goal for migrating legacy systems and the design for all new systems.

Active/active strategy in practice

Active/active is "easy" in theory; in practice, as we have discussed, it can be hard for stateful applications and single-tenanted solutions (where each user/group needs to have dedicated infrastructure).

Start small: these ideas can apply to a whole solution, or individually to parts of it, such as new or rewritten components.

E.G.

You need to add a new background service to your application.

Instead of running this on the same infrastructure, consider building a separate stateless microservice,

add a load balancer to distribute the tasks,

think about concurrency when storing the results to a data store,

use asynchronous message queues to send requests,

create two queues, one per region, and add retries with fail-over to your application and distributed reads on the microservice,

avoid strict ordering requirements et cetera.
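The "two queues with retries and fail-over" part of the list above could look roughly like the sketch below. It is deliberately generic: the senders are plain callables standing in for whatever queue client you use, and the names are made up. A real implementation would also wait between retries.

```python
from typing import Callable, Sequence


def send_with_failover(
    message: str,
    senders: Sequence[Callable[[str], None]],
    retries_per_queue: int = 3,
) -> int:
    """Try each regional queue in order; return the index of the one that accepted."""
    last_error = None
    for index, send in enumerate(senders):
        for _ in range(retries_per_queue):
            try:
                send(message)
                return index
            except ConnectionError as error:
                last_error = error  # retry, then fall through to the next region
    raise RuntimeError("all regional queues are unavailable") from last_error


if __name__ == "__main__":
    secondary_queue = []  # in-memory stand-in for the second region's queue

    def primary_send(message: str) -> None:
        raise ConnectionError("primary region is down")  # simulated outage

    used = send_with_failover("resize-image:42", [primary_send, secondary_queue.append])
    print(used, secondary_queue)  # 1 ['resize-image:42']
```

Because a message could be delivered to either queue, the consumers need to read from both and must not rely on strict ordering, which is why those two points are in the list.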

This is just a simple example and we could go on forever with best practices to build a resilient highly available solution.

Education

Education is key to innovation; it is a culture that needs to be encouraged by leadership and built from the ground up in every single line of code.

With strong guidance and a good enterprise skilling program, you can educate developers to build resilient systems in each piece of code and that would enable a good modern architecture for the whole system.

Stateful applications cannot be made stateless with infrastructure and architecture, that needs to start at code level.

How can you drive this? Hire talent with a growth mindset and nurture it by providing opportunities for learning in-person, virtually and on-demand, and by promoting cloud adoption.

Microsoft Learn, LinkedIn learning, Pluralsight and so on; there are plenty of platforms with excellent material on stateless, cloud-native, modern architectures.

RTO/RPO image: "Graphic representation of RPO and RTO in case of an incident" by Own work is licensed under CC BY-SA 4.0.

API Management and Azure Functions secured without secret keys leveraging managed identities (RSS of TV series' new episodes)

Stefano d'Antonio — Sat, 20 Nov 2021 11:57:20 +0000

I wanted to keep up with new episodes of my favourite TV shows as they aired; being a big RSS feed fan, I wanted to centralise all the news in one place and I built a simple API that checks for new episodes and turns them into RSS XML for my feed reader and hosted it on cheap Azure Functions on consumption plan.

Link to the GitHub repository: TvShowRss

This is the architecture:

Securing with function keys

By making the Azure Function public, I would have opened my solution to abuse.

E.G.

Someone knowing the URL could invoke my functions on a massive scale; given that Azure Functions scales seamlessly with virtually no limit, I could end up with a colossal bill on my Azure subscription.

So I decided to add an API Management instance in consumption plan in front of the functions API and set quotas, so only a certain amount of executions would be allowed per minute.

I also wanted to restrict access to my functions to the outbound IP ranges of my APIM instance, but the consumption plan does not have a specific outbound range, so I had to secure it only with function authorisation (for now).

To secure the back-end I initially restricted the functions with function keys, so only people in possession of that key could invoke my functions.

This well-known pattern carries a big burden: to make sure your solution is secure, you need to have a key rotation system in place, in case someone manages to steal that secret value.

Authentication policies

Sometime in 2021, a new feature was introduced in API Management, the authentication policies: https://docs.microsoft.com/en-us/azure/api-management/api-management-authentication-policies

With this feature, I could leverage the managed identity of the APIM instance and get rid of the keys completely.

Easy Auth

Another amazing feature of App Services/Functions is required: Easy Auth. This name is not well known, and the feature in the portal is just called Authentication/Authorization, which is quite a generic term and makes it harder to find in search engines. This is the link: https://docs.microsoft.com/en-us/azure/app-service/configure-authentication-provider-aad

This feature takes the burden of auth away from your application code and implements it at PaaS level.

You just configure your app to block non-authenticated traffic and it does the rest for you (create AAD App Registration, configure it, block anonymous access, check authentication token on calls):

Now you can set up your Functions as the back-end of API Management and add the authentication-managed-identity policy to the inbound section, which will automatically add the Authorization header to the requests and authenticate seamlessly to your functions, without stored secrets.

Unexpected problem

I was confident that it was going to take a few minutes to enable this and remove all the references to the old function key, but when I ran my test I had the following error:

IDX10214: Audience validation failed. Audiences: ‘[PII is hidden]’. Did not match: validationParameters.ValidAudience: ‘[PII is hidden]’ or validationParameters.ValidAudiences: ‘[PII is hidden]’.

Helpful, isn't it?

I kept looking at my audience in the JWT token from the APIM trace and at the allowed audiences in Easy Auth and they were exactly the same!

APIM trace

JWT.MS inspection

Easy Auth config

OK, the data is obfuscated, but you can trust me on that one: the rest of the 27c65941- GUID matched everywhere.

The PII hidden did not help as I was not able to see the difference in the error message. If you try and find how to allow PII to be shown in debug environments, that can be done by setting a property in code, but the Easy Auth code is not something we can change.

Then I stumbled across this article from a fellow CSA and I figured it out!

Despite the error message talking about audience, the problem was with the issuer! If you look at the Easy Auth configuration, it is requesting v2.0, whereas the auto-generated AAD application issues v1.0 by default.
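If you hit the same wall, a quick way to compare versions is to decode the token payload yourself and look at the iss and ver claims, which is what jwt.ms does for you. The helper below is a small sketch for debugging only: it does not validate the signature, and the function names are mine. As far as I know, v1.0 tokens carry an issuer like https://sts.windows.net/<tenant>/ while v2.0 tokens use https://login.microsoftonline.com/<tenant>/v2.0.

```python
import base64
import json


def decode_claims(jwt_token: str) -> dict:
    """Decode the payload of a JWT WITHOUT verifying it (debugging only)."""
    payload = jwt_token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore the stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def token_version(claims: dict) -> str:
    """Return the 'ver' claim ('1.0' or '2.0' for Azure AD tokens), if present."""
    return claims.get("ver", "unknown")


if __name__ == "__main__":
    # Build a fake, unsigned token just to exercise the helpers.
    fake_claims = {"iss": "https://sts.windows.net/<tenant>/", "ver": "1.0"}
    body = base64.urlsafe_b64encode(json.dumps(fake_claims).encode()).decode().rstrip("=")
    claims = decode_claims(f"header.{body}.signature")
    print(claims["iss"], token_version(claims))
```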

So I changed that in the app manifest et voilà... authentication worked and I was able to get rid of the secrets.

Infrastructure as code

OK, that's sorted, but it was time to script the changes and make sure my Pulumi deployment created the app registration with the correct configuration.

I started typing in Visual Studio in C#, under the Application definition: AccessTokenAccep<ctrl+space><ctrl+space><ctrl+space><ctrl+space> but no joy from IntelliSense...

This property wasn't in the Pulumi class! Which is odd, considering Pulumi auto generates code from the REST API (I presume also for the Azure AD module now).

So I looked it up in the Azure AD application REST API docs and... nothing. Turns out (see here) it was only available on the beta API.

So I had to include the REST API call in Pulumi after the creation of the app registration, using a workaround to force the call; that is in my code, but I will not discuss it in this article (see code here).
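For reference, the call itself is small. The sketch below is not the code from my repository: it only builds (and does not send) the request, and it assumes the Microsoft Graph application resource, where, to my knowledge, the manifest's accessTokenAcceptedVersion is exposed as api.requestedAccessTokenVersion. The object ID and the token are placeholders you would have to supply.

```python
import json
import urllib.request

GRAPH_BASE_URL = "https://graph.microsoft.com/v1.0"  # assumption: property available on v1.0


def build_token_version_patch(
    app_object_id: str, access_token: str, version: int = 2
) -> urllib.request.Request:
    """Build the PATCH request asking the app registration to issue v2.0 tokens."""
    body = {"api": {"requestedAccessTokenVersion": version}}
    return urllib.request.Request(
        url=f"{GRAPH_BASE_URL}/applications/{app_object_id}",
        data=json.dumps(body).encode(),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_token_version_patch("<app object ID>", "<token>")
    print(request.get_method(), request.full_url)
    # urllib.request.urlopen(request) would send it; left out on purpose.
```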

Leveraging Logic Apps to prevent over-provisioning owner access to subscriptions

Stefano d'Antonio — Thu, 11 Nov 2021 08:06:51 +0000

It often happens that agility and freedom conflict with security.

(Aaron Paul voice) Has this ever happened to you?

Have you ever had developer teams request ownership of a full subscription to be able to freely experiment?

You still want to keep isolation, segregate responsibilities and permissions.

Ability to experiment freely is paramount to innovation, but uncontrolled proliferation of subscriptions can bear a significant management overhead.

Can we have the best of both worlds? The short answer is: yes.

A resource group can be an effective boundary as it still allows its contributors to create any resource, while restricting the scope of access within a subscription.

You can also enforce tags and Azure policies to control costs and enforce security.

But now a team is restricted to creating resources within a single resource group; it can get messy quite quickly and permissions within the team are not very granular.

What if we could allow teams to create their own resource groups within a subscription with contributor access, without being able to read/write other resource groups?

We can quickly set up a Logic App to enable this. Orchestrating creation and role assignment of resource groups within a single workflow, this enables:

Invoking the Logic App manually through a REST API
Invoking from a DevOps pipeline to create resources as part of a dev/test automated environment

The Logic App can create a resource group and assign a certain contributor based on the input payload from the HTTP trigger.

Warning

Setting up this workflow could lead to a security loophole: What if someone uses the name of an existing resource group so the workflow grants access to other teams' resources? We need to make sure we address this concern when building our Logic App as, usually, Azure management operations are idempotent and the Logic App won't fail if we pass the name of an existing resource group.

The Logic App

Let's have a look at the flow:

The logic is pretty simple and most of the operations we require have a native connector; the only missing one is "Create role assignment", but we can easily perform the operation by invoking the Azure REST API and we do not have to worry about authentication as the managed identity will do this for us.

Now, you may have noticed that there is no condition stating: "If the group already exists, interrupt the flow"; but if you look at the picture above, you may notice a red dotted line between two operations, this is because we changed the Run after settings of our create operation:

As you can see, the operation of creation (and so the role assignment after) will only occur if the Read resource group failed, hence the group does not exist; this will block the loophole described above.
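In code, the same control flow would read as "create and grant access only when the lookup fails". The sketch below mimics it with an in-memory dictionary standing in for the subscription; it is just to show the logic of the Run after setting, nothing Azure specific, and the names are illustrative.

```python
from typing import Dict, List


def provision_resource_group(
    name: str, principal_id: str, existing_groups: Dict[str, List[str]]
) -> bool:
    """Create the group and assign the contributor only if it does not exist yet.

    Returns True when the group was created, False when the flow stopped early.
    """
    if name in existing_groups:
        # "Read a resource group" succeeded: skip creation and role assignment,
        # otherwise we would grant access to another team's resources.
        return False
    existing_groups[name] = [principal_id]  # create, then assign contributor
    return True


if __name__ == "__main__":
    subscription = {"rg-other-team": ["someone-else"]}
    print(provision_resource_group("rg-test2", "my-object-id", subscription))  # True
    print(provision_resource_group("rg-other-team", "my-object-id", subscription))  # False
    print(subscription["rg-other-team"])  # ['someone-else']
```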

Now, we can prevent people from having broad access, but our Logic App's managed identity still requires owner permissions on the subscription; more secure, but we can do even better. Let's create a custom role that has only enough permissions to read/create a resource group and assign permissions to it:

We can now use the UI to create a custom role, but we may also want to script it and define it in JSON format:
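As an illustration of what such a definition can contain, the snippet below generates a role definition in the format accepted by the Azure CLI. Treat it as a sketch under assumptions: the role name, the description and the exact list of actions are my guess at the minimum needed (read/write resource groups and write role assignments), so review them against your own requirements.

```python
import json


def build_role_definition(subscription_id: str) -> dict:
    """Custom role limited to reading/creating resource groups and assigning roles."""
    return {
        "Name": "Resource Group Creator",  # hypothetical role name
        "IsCustom": True,
        "Description": "Create resource groups and assign roles on them.",
        "Actions": [
            "Microsoft.Resources/subscriptions/resourceGroups/read",
            "Microsoft.Resources/subscriptions/resourceGroups/write",
            "Microsoft.Authorization/roleAssignments/write",
        ],
        "NotActions": [],
        "AssignableScopes": [f"/subscriptions/{subscription_id}"],
    }


if __name__ == "__main__":
    print(json.dumps(build_role_definition("<subscription ID>"), indent=4))
```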

Let's now assign that to the managed identity of our Logic App:

Now we should have all permissions in place. If you also want to use a security group, your Logic App identity may also require Directory.Read.All permissions on your Azure AD instance.

Creating role assignment

I mentioned above that all the other actions can be performed with native Logic App connectors, but the role assignment, at the time of writing, requires the HTTP connector to invoke the Azure REST API, let's have a look at that:

Even if we cannot do it in idiomatic Logic App, it is still pretty simple.

This is the API documentation: https://docs.microsoft.com/en-us/rest/api/authorization/role-assignments/create

We just need to set the right method (PUT), the correct URL, using the resource group ID as scope from the output of the previous connector and we can auto-generate a random GUID as name for the assignment using Logic App expressions (guid()). The body must contain the role definition ID, which needs to be the built-in contributor GUID under our subscription ID, I have used a variable for that to improve clarity:

The subscription ID is one of the inputs of our workflow and the hardcoded GUID can be found here: Contributor
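Outside of Logic Apps, the same request can be assembled as follows. This is a sketch that only builds the method, URL and body described above without sending anything; the API version is an assumption (check the documentation linked earlier for the current one) and the function name is mine.

```python
import uuid
from typing import Tuple

CONTRIBUTOR_ROLE_GUID = "b24988ac-6180-42a0-ab88-20f7382dd24c"  # built-in Contributor
API_VERSION = "2022-04-01"  # assumption: any supported api-version works here


def role_assignment_request(
    subscription_id: str, resource_group_name: str, principal_id: str
) -> Tuple[str, str, dict]:
    """Return (method, url, body) for creating a contributor role assignment."""
    scope = f"/subscriptions/{subscription_id}/resourceGroups/{resource_group_name}"
    assignment_name = str(uuid.uuid4())  # same idea as guid() in Logic App expressions
    url = (
        f"https://management.azure.com{scope}"
        f"/providers/Microsoft.Authorization/roleAssignments/{assignment_name}"
        f"?api-version={API_VERSION}"
    )
    body = {
        "properties": {
            "roleDefinitionId": (
                f"/subscriptions/{subscription_id}/providers"
                f"/Microsoft.Authorization/roleDefinitions/{CONTRIBUTOR_ROLE_GUID}"
            ),
            "principalId": principal_id,
        }
    }
    return "PUT", url, body


if __name__ == "__main__":
    method, url, body = role_assignment_request("<subscription ID>", "rg-test2", "<object ID>")
    print(method, url)
    print(body)
```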

To get values as input from the workflow invocation, we need to set the input JSON schema in the HTTP trigger:

We need:

principalId: the object ID of the assignee of the contributor role; this can be found by looking up the user in Azure AD from the portal
resourceGroupLocation, resourceGroupName, subscriptionId: quite self-explanatory arguments

{
    "properties": {
        "principalId": {
            "type": "string"
        },
        "resourceGroupLocation": {
            "type": "string"
        },
        "resourceGroupName": {
            "type": "string"
        },
        "subscriptionId": {
            "type": "string"
        }
    },
    "type": "object"
}

After adding the schema above to the trigger, those values will be available as variables in the rest of the workflow.
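If you call the workflow from a script or a pipeline, it can be handy to check the payload before sending it. The helper below is a sketch of such a client-side check; note that it is stricter than the schema above, which declares the types but does not mark any property as required.

```python
from typing import List

EXPECTED_FIELDS = (
    "principalId",
    "resourceGroupLocation",
    "resourceGroupName",
    "subscriptionId",
)


def validate_payload(payload: dict) -> List[str]:
    """Return a list of problems; an empty list means the payload looks fine."""
    problems = []
    for field in EXPECTED_FIELDS:
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], str):
            problems.append(f"field is not a string: {field}")
    return problems


if __name__ == "__main__":
    print(validate_payload({"resourceGroupName": "rg-test2", "subscriptionId": 42}))
```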

Testing

All that is left now is to test; let's run our workflow with the following input:

{
    "subscriptionId": "<Target subscription ID to create groups>",
    "resourceGroupName": "rg-test2",
    "resourceGroupLocation": "West Europe",
    "principalId": "<Object ID GUID of your user from AAD>"
}

And this is what happens, assuming rg-test2 does not exist:

Looks good, all the steps we wanted to run were successful.

Now, let's try and run this again with the same inputs:

OK, as you can see from the grey circles next to the actions below Read a resource group, none of the other operations were performed, exactly as expected.

Now let's have a look at our newly created resource group IAM blade:

Exactly what we wanted; a new resource group of which I am contributor without requiring any permission on the subscription.

The full template for the Logic App is available on my GitHub so you can save the extra 5 minutes it took me to create it to enable this security feature for your teams.

You can also enhance security of the Logic App to prevent unauthorised users from calling it by fronting it with API Management or you can use Azure Active Directory Authorization Policies on the Logic App itself or a combination of the two.

No-rewrite free TLS offloading, WAF and more for legacy web applications with Azure Front Door

Stefano d'Antonio — Fri, 05 Nov 2021 15:27:29 +0000

You have one or more legacy web applications running on virtual machines, and no time or resources to rewrite them and implement security; you can get all those features and more just with infrastructure, leveraging Azure Front Door.

There are two flavours we are going to explore:

1) Front Door forwarding requests to your VM on its public IP
2) Front Door Premium (preview at the time of writing) with private origins forwarding requests directly into your VNET without public internet exposure of the machine

Azure Front Door forwarding requests to your VM's public IP

In this scenario you can set up the backend of Azure Front Door directly to the public IP attached to the VM.

This has the following advantages:

Works with Azure Front Door classic (GA)
Simpler infrastructure and configuration

This approach has the downside of being less secure. Using an NSG against the VM (subnet/NIC level), you can still ensure that traffic originating from the internet never hits your VM directly by allowing only traffic from the Azure Front Door service tag; any other source will be blocked. Two pitfalls remain:

1) DDoS attacks could still block access to the VM (although they can be mitigated by adding standard DDoS protection for VNETs)
2) A sophisticated and knowledgeable attacker (perhaps an internal agent) who manages to find the public IP of the VM could spin up their own Front Door instance and point it to your public IP, bypassing the existing Front Door security.

Those threats can still be mitigated, but we will not explore that in this article; just to share them at a high level:

You can filter incoming traffic by checking the X-Azure-FDID header that Front Door adds to the forwarded requests with its unique ID; using that and the service tag will ensure that the traffic is coming only from your Front Door.

If you still do not want to make any application changes, you can add an instance of Application Gateway and let it do this filtering for you before forwarding the traffic to your VM.

If you go through a threat modelling exercise and decide this risk still needs mitigation, you can go for the second option below.
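Whichever component does the filtering, the X-Azure-FDID check mentioned above is conceptually a header comparison. Here is a minimal sketch of it; the function name and the way headers are passed in are assumptions, and the expected ID is the one of your own Front Door instance.

```python
import hmac
from typing import Mapping


def is_from_my_front_door(headers: Mapping[str, str], expected_front_door_id: str) -> bool:
    """Accept the request only if X-Azure-FDID matches our own Front Door ID."""
    # HTTP header names are case-insensitive, so normalise before looking up.
    normalised = {name.lower(): value for name, value in headers.items()}
    received = normalised.get("x-azure-fdid", "")
    # Constant-time comparison on bytes avoids leaking how much of the ID matched.
    return hmac.compare_digest(received.encode(), expected_front_door_id.encode())


if __name__ == "__main__":
    my_id = "00000000-0000-0000-0000-000000000000"  # placeholder Front Door ID
    print(is_from_my_front_door({"X-Azure-FDID": my_id}, my_id))  # True
    print(is_from_my_front_door({"X-Azure-FDID": "someone-else"}, my_id))  # False
    print(is_from_my_front_door({}, my_id))  # False
```

On its own the header can be spoofed by anyone who can reach the VM, which is why it needs to be combined with the service tag restriction.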

Azure Front Door Premium forwarding requests to your VM's internal IP via Private Link

This approach significantly improves your security posture, but bears increased costs (Front Door Premium is required for private origins support) and a slightly more complex architecture and configuration (requires extra components such as a Standard Load Balancer, which also adds to the costs, and Private Link).

Using this approach, you can get rid of the public IP completely and project your Front Door instance directly into your VNET, eliminating the attack vectors of DDoS and Front Door hijacking.

Without the IP, an attacker has no endpoint to use besides Front Door, which carries enhanced security and can optionally be set up with a WAF (Web Application Firewall), mitigating the most common attacks with several security rules.

Advantages:

No public endpoint for the VM
No DDoS risk
No Front Door hijacking attack

Both options enable your application to leverage advanced security without application changes. You get:

Free TLS certificates with Azure Front Door also on your custom domains
WAF capabilities (protect against OWASP common attacks and more)
Dynamic site acceleration
HTTP/2 support
Global HA with load balancing if you have multiple instances of your application

Multiple instances with path-based routing

Now, let's pretend you love this, but have multiple apps to secure behind Azure Front Door and want to use a single domain in front of those applications:

mydomain.com/webapp1
mydomain.com/webapp2
...

You can easily set up routing rules to forward all the traffic to the right backend depending on the first URL segment.

You just set up multiple backend pools in Front Door and create routing rules that forward the traffic whose path starts with /webapp1/ to backend 1, and so on.

Assuming again that you do not want to make any changes to your web application, you can also use the rules to rewrite the URL so that the /webapp1/ segment is stripped and only the rest of the URL is forwarded.

So your application does not need to be aware of Front Door at all.
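To make the rewrite concrete, this is a small Python sketch of the transformation such a rule performs. It is not Front Door configuration, only the logic; the function name is hypothetical and /webapp1 is the example prefix used above.

```python
def rewrite_path(path: str, prefix: str = "/webapp1") -> str:
    """Strip the routing prefix so the backend only sees its own URL space."""
    if path == prefix:
        # A request for the bare prefix maps to the application root.
        return "/"
    if path.startswith(prefix + "/"):
        # Keep the leading slash of the remainder: /webapp1/api/x -> /api/x
        return path[len(prefix):]
    # Anything else (for example /webapp10/...) is not ours and is left untouched.
    return path
```

Matching on the prefix followed by a slash, rather than on the bare prefix, avoids accidentally rewriting paths that merely start with the same characters.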

What if your application, being unaware of Front Door, then redirects the user to a root-relative /* endpoint?

Imagine that your application calls /api/dosomething or /login: that call will not preserve the /webapp1 prefix, hence Front Door will not know where to direct the traffic.

Once again, we can have a pure infrastructure solution, without application changes.

The idea is this: on the first call to /webapp1, Azure Front Door alters the response and adds a new header. This can simply be a Set-Cookie header that sets a value unique to the web app 1 backend. On subsequent calls, the browser includes that cookie, and the Front Door rules engine can override the routing configuration based on the content of the Cookie header the browser sends back, forwarding /* traffic to /webappX/*. See the configuration below:

This sets the cookie for the correct backend on the initial call; the full header value contains:

BACKEND=backendIdentified; Domain=mydomain.com; Path=/

You must specify the path explicitly; if it is omitted, the browser assumes that the cookie is specific to mydomain.com/webapp1/* and will not send it on calls to mydomain.com/*.

Now the rule to override the behaviour on calls to mydomain.com/*:

Now you need to add the rules engine rules to your routing configuration and you are done.

You can optionally set up an error page backend for when the cookie is not present and someone requests mydomain.com directly.

Et voilà, with a few simple infrastructure changes, you have seamlessly added free TLS, WAF and more to your legacy web applications without changing a single line of code.

Virtual machine scale sets flexible orchestration mode (and benefits over regular VMs)

Stefano d'Antonio — Wed, 11 Aug 2021 09:25:56 +0000

Introduction and VMSS benefits

When we close our eyes and we try and picture "the cloud", two quintessential IaaS services come to mind: Virtual machines and Virtual machine scale sets (in Azure).

Historically, VMs and VMSSs are the first iteration of our cloud migration journey; a VM facilitates the lift-and-shift pattern and a VMSS our first step towards leveraging cloud scaling for our workload.

A traditional VMSS sacrifices control in favour of simpler deployment, simpler management, faster recovery and faster horizontal scaling.

You define an image and the service stamps out instances of that blueprint on demand; it can auto-scale based on usage, and replace instances on failure or on updates, all on your behalf.

Downsides of VMSS

Those glorious benefits have some drawbacks:

1) VMSS API diverges from the standard VM API for individual instances
2) Lack of granular RBAC permissions per VM
3) Lack of Azure Site Recovery and Azure Backup support

In addition, there are two different patterns for handling VM availability, which makes the experience inconsistent between VMSSs and VMs: for redundancy within a datacenter, individual VMs must use availability sets, whereas VMSSs have native support for distribution across fault and update domains.

Enter VMSS flexible

A new option for VMSS has been created, the "orchestration mode".

The classic experience has been re-branded as uniform orchestration, but is no different from the traditional VMSS experience.

The new option, still in preview at the time of writing, is called flexible orchestration and it promises to:

Unify the experience: no more availability sets, availability will be handled in the same way for individual VMs and VMSSs
Individual instances have full control with the same VM API as regular VMs
VMs can be added to VMSS flex after creation
Custom instance naming
Target individual VMs with extensions
Assign machines to specific fault domains
In-guest OS security patching (without re-imaging)
Mix Windows and Linux in the same set

Why should I use that instead of regular VMs?

VMSS flex shines when managing a substantial number of VMs (30+).

It provides a single control plane for distribution of the machines across a datacenter with automatic and optimized spread.

When dealing with many machines, it can be a significant overhead to manage availability.

It also supports template-based scaling like a uniform scale set, should you need that, but I believe its main purpose is to unify the experience between VMs and VMSSs.

It supports up to 1000 instances spread across fault domains, whereas uniform supports only up to 100. Fault domains are also treated the same as update domains.

There are obvious advantages (and some cons) over uniform mode, including the ability to mix different machine sizes and the other features listed above; in addition, there is a comprehensive table in the official Microsoft documentation that would be pointless to copy and paste here.

Current limitations

VMSS flex does not currently support availability zones.

It does not support single placement groups either.

How to get started

To try VMSS flex you need to register the feature in your subscription, see the full guide here: https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-orchestration-modes#register-for-flexible-orchestration-mode

Quickstart ARM template to deploy VMSS flex: https://github.com/Azure/azure-quickstart-templates/tree/master/quickstarts/microsoft.compute/vm-vmss-flexible-orchestration-mode

Infrastructure as code in 2021

Stefano d'Antonio — Fri, 16 Apr 2021 19:12:55 +0000

There are numerous advantages to using IaC; here are a few examples:

Automation: scripted deployments reduce the surface for human error and are faster to execute
Reproducibility: you can recreate the same environment every time, or multiple identical environments based on the same templates
Security: permissions to production environments can be granted only to deployment service accounts, reducing risks
Costs: being able to destroy and recreate environments quickly enables fast de-provisioning of expensive resources when they are not used

What are the options?

This article is Azure-centric, but most of what will be discussed also applies to AWS, GCP and other clouds and targets. Cloud-specific systems such as ARM templates/Bicep are comparable to AWS CloudFormation, GCP Cloud Deployment Manager and Kubernetes YAML for on-premises targets.

The code examples below all deploy exactly the same resources

Full working examples on GitHub here

ARM templates

Azure-native way of scripting resources; the language used is JSON.

I have worked extensively with ARM templates, but I still get a headache when I open one for the first time.

JSON is human readable, but not intuitive.

It requires a great deal of boilerplate and, JSON being JSON, naturally requires an abundance of curly braces, double quotes and other symbols that represent a substantial distraction from the actual meaningful content and hence easily lead to cognitive overload. Many people I spoke to dislike it for this precise reason.

The nesting doesn't help and it handles modules poorly.

Enough being negative, what's great about it? It's Azure's mother-tongue. It supports all Azure resources as soon as they are available.

In addition, if you use the Azure Resource Explorer, you will find exactly what has been deployed in your subscription, in JSON that is compatible with your ARM templates.

If you create something in the portal, you can easily export its current configuration as an ARM template.

This is a template that creates a resource group and a storage account:

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "rgName": {
      "type": "string",
      "defaultValue": "rg-arm"
    },
    "rgLocation": {
      "type": "string",
      "defaultValue": "West Europe"
    }
  },
  "variables": {},
  "resources": [
    {
      "type": "Microsoft.Resources/resourceGroups",
      "apiVersion": "2018-05-01",
      "location": "[parameters('rgLocation')]",
      "name": "[parameters('rgName')]",
      "properties": {}
    },
    {
      "type": "Microsoft.Resources/deployments",
      "apiVersion": "2017-05-10",
      "name": "storageDeployment",
      "resourceGroup": "[parameters('rgName')]",
      "dependsOn": [
        "[resourceId('Microsoft.Resources/resourceGroups/', parameters('rgName'))]"
      ],
      "properties": {
        "mode": "Incremental",
        "template": {
          "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
          "contentVersion": "1.0.0.0",
          "parameters": {},
          "variables": {},
          "resources": [
            {
              "type": "Microsoft.Storage/storageAccounts",
              "apiVersion": "2017-10-01",
              "name": "[concat('sa', uniquestring(subscription().id))]",
              "location": "West Europe",
              "kind": "StorageV2",
              "sku": {
                "name": "Standard_LRS"
              },
              "properties": {
                "supportsHttpsTrafficOnly": true
              }
            }
          ],
          "outputs": {}
        }
      }
    }
  ],
  "outputs": {}
}

Pros

Native deployment history tracking
Always up to date with new resources
Supported by Microsoft

Cons

Unfriendly language
Hard to manage modules
Complexity increases exponentially for large environments

Bicep

OK, now take ARM templates, remove all the downsides and here you have Bicep.

A Bicep template is pretty much the same as an ARM template, in fact it transpiles to ARM JSON to use the same underlying deployment system, but Bicep addresses the two main cons of ARM:

The first massive advantage over ARM is the language; it's a bespoke DSL, easy to write and understand.

Modularisation is also easier than in ARM, as it allows you to reference other templates in the same container (or in the same directory locally).

Being the same as ARM also means that it is Azure-only, which, currently, is the main drawback.

Same resources from the ARM template above, but in Bicep this time:

targetScope = 'subscription'

resource rg 'Microsoft.Resources/resourceGroups@2020-01-01' = {
  name: 'rg-bicep'
  location: 'West Europe'
  scope: subscription()
}

module stgModule './storageAccount.bicep' = {
  name: 'storageDeploy'
  scope: rg
  params: {
    location: rg.location
  }
}

// Contents of ./storageAccount.bicep, the module referenced above
param location string

resource stg 'Microsoft.Storage/storageAccounts@2019-06-01' = {
  name: 'sa${uniqueString(resourceGroup().id)}'
  location: location
  kind: 'StorageV2'
  properties: {
    supportsHttpsTrafficOnly: true
  }  
  sku: {
    name: 'Standard_LRS'
  }
}

Even without syntax highlighting from the blog engine, this immediately looks awesome: way more readable and succinct. In my opinion, it is a great option if you are Azure-only and want to avoid the burden of state management. You will miss out on the clean-up of removed resources that Pulumi and Terraform achieve with an external state, but it is worth evaluating.

Pros

Same pros as ARM
Expressive (concise) language
Good modularisation support

Cons

Azure-only
No resources clean-up

Terraform

Terraform is the (first, I believe) exceptional attempt at a multi-cloud, human-readable, infrastructure as code tool; a successful attempt; it is quickly becoming the industry standard for cloud deployments.

The HashiCorp tool uses external state storage; the state records all the resources deployed, which enables destruction of resources that have been removed from the templates and clean-up of an entire environment on demand.

It uses a custom DSL called HCL, which is quite friendly and understandable by anyone at a glance without prior training.

One major downside is that it is a limited language; it covers basic conditionals, loops and variables (poorly, in my opinion).

It's a beautiful solution for simple deployments, but it gets pretty frustrating when attempting to be more clever with the logic.

Same resources as above in the example here, in Terraform this time. I am using Azure blob as a state storage, but it has several options including the local file system.

One big downside is that often you find yourself in a situation where a new resource or a new feature of a resource comes up in Azure (and I presume the same for other providers), but you have to wait for the Terraform team to implement it to leverage it in idiomatic Terraform; you can always deploy ARM templates from within Terraform, but it is ugly and you miss a richer diff experience.

Modules support in Terraform is also great and it allows you to reference modules directly from external Git repositories.

terraform {
  backend "azurerm" {
    resource_group_name  = "rg-iac-demo"
    storage_account_name = "saiacdemo"
    container_name       = "terraform"
    key                  = "demo.tfstate"
  }
}

variable "subscription_id" {
  type      = string
  sensitive = true
}

variable "client_id" {
  type      = string
  sensitive = true
}

variable "client_secret" {
  type      = string
  sensitive = true
}

variable "tenant_id" {
  type      = string
  sensitive = true
}

provider "azurerm" {
  features {}

  subscription_id = var.subscription_id
  client_id       = var.client_id
  client_secret   = var.client_secret
  tenant_id       = var.tenant_id
}

resource "random_id" "storage_account" {
  byte_length = 8
}

resource "azurerm_resource_group" "example" {
  name     = "rg-terraform"
  location = "West Europe"
}

resource "azurerm_storage_account" "example" {
  name                      = "sa${lower(random_id.storage_account.hex)}"
  resource_group_name       = azurerm_resource_group.example.name
  location                  = azurerm_resource_group.example.location
  account_tier              = "Standard"
  account_replication_type  = "LRS"
  enable_https_traffic_only = true
}

Pros

Unofficial industry standard as of 2021 (endorsed by several organisations)
Language and CLI are incredibly easy to understand and to use

Cons

HCL has its limits and simple logic can turn into complex templates (and can be frustrating to code)
Storage of the state is in plain text, including secrets; responsibility for securing it lies with the user
Tooling and code completion is not always great and misses "compile"-time checks
Being an open-source tool, it is supported by the community only, unless you pay for Terraform Enterprise
Resources support is delayed, sometimes quite heavily; bugs can stay unfixed for years

Pulumi

Dulcis in fundo... my favourite IaC tool as of today.

The people at Pulumi had a great intuition:

Using general purpose programming languages to define infrastructure.

It was such a good idea that a few months later, Terraform published a preview of its CDK to write Terraform in TypeScript (and I believe they now also support other languages).

With all the pressure for a DevOps culture, this fits really well, as it enables developers to use a familiar language to also define infrastructure (it also allows interop between app and infra code).

It supports .NET languages (C#/F#/VB.NET/...), Go, TypeScript, Python.

The ability to use the Azure management SDKs is not new, we could always do that in those languages, but Pulumi manages all the resources and dependencies for us.

You can specify the resources to create in a declarative way; you just build a list of stuff to create and Pulumi works out dependencies, changes and everything else for you.

It also features an encryption capability for secrets in the state; you can use external providers to encrypt the content moving the responsibility of securing the state more towards the tool.

The only downside may be that, for system administrators, picking up a programming language may have a steeper learning curve than learning HCL or Bicep and you are more likely to find ops talents on the job market that know Terraform rather than C#.

The support story is similar to Terraform. Pulumi offers a paid plan for storage, but the actual tool is open-source and community-supported.

One feature that Terraform has, but that is missing in Pulumi, is the ability to save the planned changes to an output file and then apply that file later; this eliminates the risk of race conditions if you plan and deploy independently, and I find it really useful in CD pipelines (I've already created a GitHub issue to request the feature).

There is no roadmap for it at the moment, but if they support PowerShell as a language in the future, that may remove the need for sysadmins to learn a new language and I believe it could ramp up adoption.

using Pulumi;
using Pulumi.AzureNative.Resources;
using Pulumi.AzureNative.Storage;
using Pulumi.AzureNative.Storage.Inputs;

class MyStack : Stack
{
    public MyStack()
    {
        var resourceGroup = new ResourceGroup("rg-pulumi");

        new StorageAccount("sa", new StorageAccountArgs
        {
            ResourceGroupName = resourceGroup.Name,
            Sku = new SkuArgs
            {
                Name = SkuName.Standard_LRS
            },
            Kind = Kind.StorageV2,
            EnableHttpsTrafficOnly = true
        });
    }
}

Pros

The AzureNative provider is always up-to-date with new Azure resources and features
If you are a developer, you do not need to learn a new language
It gives you the full power of a real programming language, whatever you can do in C# (or Python etc...) you can do in a Pulumi project
Great CLI, quiet in the output by default

Cons

Niche, less likely to find experts and documentation is poor
More hostile to pick up for sysadmins than Terraform
No output to the planning phase

Pulumi.FSharp.Extensions

I wanted to take Pulumi a step further. I love the technology, but I still do not like the verbosity of C#, and using Pulumi in F# was ugly: it is not made for this, and to mimic property initialisation in C# you end up with code that looks like this:

let infra () =
    let resourceGroup = ResourceGroup("rg-pulumi")

    StorageAccount("sa", StorageAccountArgs(
        ResourceGroupName = resourceGroup.Name,
        Sku = SkuArgs(
            Name = SkuName.Standard_LRS
        ),
        Kind = Kind.StorageV2,
        EnableHttpsTrafficOnly = true
    ));

So I decided to write an extension to use Pulumi in F#, but make it look even better and simpler than Terraform, using computation expressions:

let rg =
    resourceGroup {
        name                   "rg-pulumi"
    }

let sa =
    storageAccount {
        name                   "sa"
        resourceGroup          rg.Name
        accountReplicationType SkuName.Standard_LRS
        accountTier            Kind.StorageV2
        enableHttpsTrafficOnly true
    }

Link to the GitHub repo here.

Bird's eye view and lines of code

Purely looking at the core of the template, Pulumi in C# is probably the shortest, but, to be fair, it also needs an external Pulumi.yaml file with project and state configuration (the equivalent configuration is included in my Terraform example) and language-specific files such as project files (csproj), solution etc. Terraform and Bicep are also quite short. The ARM template is the most verbose, as anticipated.

Other options

PSArm

A new option that recently came up (announced a few days before this blog post) is: PSArm

I have not had a chance to play with it and I will update this article later, but PSArm seems to answer the prayers of many sysadmins fed up with learning new languages.

It is a way of writing ARM templates using idiomatic PowerShell which is a familiar language to ops.

It sounds quite appealing to those who want to reuse existing skills to embrace IaC and PowerShell is almost as flexible as a programming language which may help overcome some limitations of Terraform.

Farmer

Honourable mention goes to Farmer, which lets you write nice-looking F# that generates ARM templates; I have not played with it much, but I will try to see if there is any advantage in using it over Pulumi (and Pulumi.FSharp.Extensions if you want it to look pretty). Bear in mind that, because it generates ARM, it works only on Azure.

Azure CLI (Bash/PS) and PowerShell (Az module)

In my opinion, a less honourable mention. Many infrastructure engineers adopted this method because they did not know any better (in the early days, when ARM/Terraform/etc. were not so popular or did not exist at all). I personally see no benefit in using this approach nowadays, as it moves a massive burden onto the engineer:

You have to worry about dependencies
You have to worry about error handling
You have to make sure it is idempotent (and cmdlets are not always)
...

I would not recommend this option or the ARM templates route as of today; I believe that there is no compelling reason to write IaC in this way. Please let me know in the comments if you have a good use case and I will happily update the article to include it.

Comparison table

This is a comparison table where I evaluate features for each tool.

Features comparison

Declarative

That is the difference between getting into a shop and asking for a "chocolate cake with cream filling and 30 candles" and telling the baker: OK, now shake the eggs, mix with sugar, add milk etc...

Writing declarative code means telling the system what you want, not how to do it. You lose (unnecessary) control in favour of a feature-rich simplicity. That results also in less verbosity.

Most of the options are declarative: you can define resources in whichever order you prefer and the tools will work out dependencies and parallelisation for you. They will also manage retries, error handling and so on, without you having to explicitly code for it.
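As a toy illustration of what "working out dependencies and parallelisation" means, here is a minimal Python sketch using the standard library graphlib module. It is not how any of the tools above is implemented; the resource names and the function name are hypothetical.

```python
from graphlib import TopologicalSorter


def deployment_batches(resources: dict[str, set[str]]) -> list[list[str]]:
    """Group resources into batches; everything in one batch can be deployed in parallel.

    `resources` maps each resource name to the set of resources it depends on,
    declared in any order.
    """
    sorter = TopologicalSorter(resources)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all dependencies already satisfied
        batches.append(ready)
        sorter.done(*ready)  # mark as deployed, which unblocks dependants
    return batches


# Declared "out of order": the storage account is listed before its resource group.
declared = {
    "storage_account": {"resource_group"},
    "virtual_network": {"resource_group"},
    "resource_group": set(),
}
batches = deployment_batches(declared)
```

Here the resource group ends up alone in the first batch, and the two resources that depend on it share the second one, regardless of the order in which they were declared.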

The only options that are imperative are Azure CLI and PowerShell (or directly using the REST API/SDK) to create the resources. I would not recommend this to anyone. It is worth upskilling (if you are a sysadmin) to understand Pulumi or Terraform and avoid PowerShell (or potentially try PSArm or wait for Pulumi to support PowerShell).

Idempotency

All the options can be idempotent; idempotency means that you can rerun the same deployment as many times as you like and, as long as the resources are unchanged, it will do nothing. If a resource has drifted away from the configuration or does not exist, that will be picked up and corrected.

Azure CLI and PowerShell are in yellow as you can still achieve this, but you have to code for it yourself in certain cases; many cmdlets and AzCLI commands are idempotent, but there is no guarantee.
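The "ensure" pattern behind idempotent deployments can be sketched in a few lines of Python. The dictionary stands in for the cloud, and the function name is hypothetical; a real script would query and update the provider instead.

```python
def ensure_resource(cloud: dict[str, dict], name: str, desired: dict) -> str:
    """Bring one resource to its desired configuration and report what was done."""
    current = cloud.get(name)
    if current is None:
        cloud[name] = dict(desired)  # does not exist yet: create it
        return "created"
    if current != desired:
        cloud[name] = dict(desired)  # drifted from the configuration: correct it
        return "updated"
    return "unchanged"               # already as desired: do nothing


cloud = {}
desired = {"sku": "Standard_LRS", "https_only": True}
first_run = ensure_resource(cloud, "sa", desired)
second_run = ensure_resource(cloud, "sa", desired)
```

Running it twice with the same input creates the resource the first time and does nothing the second time; this is the check you have to write yourself in an imperative script whenever a command does not already behave this way.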

Fallback mechanism

Both Terraform and Pulumi can include ARM templates in their code. If a resource is not supported yet, you can temporarily use an ARM template and then replace it later when the provider gets updated.

Pulumi is unlikely to be out of date as it auto-generates from the Azure REST API; the folks there just need to kick off another build and in a matter of minutes a new Pulumi library is ready with the new Azure resources supported.

ARM/Bicep are updated immediately, AzCLI/PS almost immediately. I have never seen a resource available only in the REST API and not in all those.

Modularisation

ARM has an awful way of modularising templates; Bicep improves on that significantly.

Terraform works nicely with modules and also supports modules directly from a Git repository and Pulumi is as good as the language you pick (which is very good for all the languages); you can use NuGet packages in .NET, npm packages with TypeScript, I presume pip with Python and I am sure that also applies to Go with its own package system.

Legacy deployments

ARM is not that flexible, but it won't matter at all as most of the time it will not need to use any legacy code; I consider ARM itself the "legacy".

Bicep has an automated tool to convert from ARM.

From Terraform you can invoke commands locally (including PowerShell/Bash scripts).

Pulumi, again, can do whatever TS/C#/Python/Go can do, which is pretty much everything your computer can do, including invoking Terraform, calling REST APIs, buying a pizza on every deployment using your favourite pizza place's API, feeding your cat in your smart home or playing a fanfare when a resource is created...

It is also worth noting that Pulumi has tf2pulumi, a tool that converts your Terraform to Pulumi in your chosen language. I got an exception the first time I tried using it; I will persist before judging it too harshly, but it did not seem mature enough at the time.

Supportability

ARM/Bicep are supported by Microsoft, not much else to say there, it is a massive plus.

Terraform and Pulumi are supported by the community, but you can get paid support plans if you use their storage; even then, you may be covered for the storage only, and if the tooling has a bug you may have to wait in line like any other mere mortal.

AzCLI/PS you get the obvious support for the tools, but, if your custom code goes wrong, you're on your own and it will be mostly your custom code that will fail as there is so much more to write to achieve what the tools above achieve naturally.

Error handling/Plan/Clean up

This is all managed for you by all the tools, except CLI/PS, where you have to look after it yourself: write conditional code and specify retries and what to do if it all goes bad.
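To give an idea of the kind of plumbing you end up writing by hand in the imperative approach, here is a minimal retry helper in Python. The function name, the defaults and the backoff policy are illustrative assumptions, not something prescribed by any of the tools.

```python
import time


def with_retries(operation, attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Run `operation`, retrying with exponential backoff if it raises."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # Wait 1x, 2x, 4x... the base delay before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter exists only so that the waiting can be replaced in tests; in a real script you would also want to retry only on errors that are known to be transient.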

I will soon publish a repository on GitHub with working examples in each language.

This is a live article: I will try to keep it up to date with new developments and to complete the missing bits. If you want to suggest a change, please submit a pull request to this repository.

Cover image: "A visual representation of the DevOps workflow" by Kharnagy (edited) is licensed under CC BY-SA 4.0.

High availability for Event Hubs processors

Stefano d'Antonio — Tue, 10 Nov 2020 16:23:04 +0000

Azure Event Hubs with Stream Analytics is a powerful combination for quasi real time high throughput data processing.

It’s a great solution if you want fast reporting on live data and save your architecture from extra complexity.

Use case: your service receives real-time, high-volume user data, e.g. website tracking, application telemetry or IoT.

The traditional process for generating reports would involve slow and painful steps such as:

Sensors/source sending data to a queue
Service A constantly fetching messages from the queue
Service A storing data
Service B querying the data persistence layer
Service B processing/aggregating the data
Service B storing the result of the processor

And, of course, the process involves development resources for the implementation and several machine resources.

The solution: Event Hub, Stream Analytics, Service Bus

This can be simplified by letting Azure take care of most of the steps:

Sensors sending data to Event Hub
Stream Analytics aggregating/processing the data in time windows and storing the result or handing it over to a service.

In the diagram, the result is sent to a Service Bus queue to be stored/processed by a service later, but Stream Analytics can also store results directly in many data layers.

Event Hub is chronological stream storage: there is no concept of message locking or deletion, and it stores data in partitions, allowing parallel access for multiple consumers. We don’t have to worry about removing messages once processed, as our progress can be stored in checkpoints, which record the point where we stopped processing so that we can resume later.
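The difference between a queue and a checkpointed stream can be shown with a small in-memory model in Python. This is a conceptual sketch only, not the Event Hubs SDK; the class and method names are hypothetical.

```python
class Partition:
    """Append-only log: events are never locked or removed by consumers."""

    def __init__(self) -> None:
        self._events: list[str] = []

    def append(self, event: str) -> None:
        self._events.append(event)

    def read_from(self, offset: int) -> list[str]:
        return self._events[offset:]


class Consumer:
    """Each consumer keeps its own checkpoint, i.e. the offset it has reached."""

    def __init__(self, partition: Partition) -> None:
        self.partition = partition
        self.checkpoint = 0

    def process_available(self) -> list[str]:
        events = self.partition.read_from(self.checkpoint)
        self.checkpoint += len(events)  # record progress so we can resume later
        return events
```

Because reading never removes anything, a second consumer starting from offset 0 still sees every event, which is what makes parallel, independent consumers possible.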

Data is stored there safely and resiliently, and it is deleted only once it passes the expiration set in the retention option; there is no manual or accidental deletion.

The architecture described in the diagram is the cheapest and simplest but it doesn’t take into account high availability; if the Azure Event Hub in our region has an outage we will lose all the messages during that period.

We can set up a failover stream in a different region to accept messages in case the primary hub is unavailable:

In this example, our web application has to implement failover logic that tries to send messages to the primary hub and fails over to the secondary during downtimes (i.e. in case of exceptions on sends).
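The failover logic on the sending side can be as simple as the following Python sketch. The two send functions are hypothetical stand-ins for whatever client calls you use to publish to the primary and secondary hubs.

```python
def send_with_failover(event, primary_send, secondary_send) -> str:
    """Try the primary hub first; on any send error, fall back to the secondary."""
    try:
        primary_send(event)
        return "primary"
    except Exception:
        # The primary is treated as unavailable for this event. If the secondary
        # fails too, the exception propagates and the caller decides what to do.
        secondary_send(event)
        return "secondary"
```

Returning which hub accepted the event is optional, but it makes it easy to log or count failovers.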

We now have a good solution to prevent data loss from the web application side, as the event will find its way into one hub or the other. Assuming we set up the secondary in a different Azure region, we have achieved geo-redundancy and we also qualify for the Azure 99.9% SLA, making our messages resilient to alien invasions (I do not take responsibility if they target your two Azure clusters specifically).

Is it over? Are we happy our customers will never contact our support again? Of course not...

What happens then if the Stream Analytics service has an outage? Messages will be safe and sound but their journey will be delayed for the duration of the outage.

It is true that, looking at the history of outages, there has never been one involving Stream Analytics, but we cannot rely on hope when it comes to the danger of pissing off customers.

Let’s try and solve this, too. On a Microsoft blog, the recommended solution is similar to the following:

Now we have geo-redundancy for all the resources, but what happens when only one Stream Analytics job has an outage? It’s unlikely that the whole region will be down, and we have no guarantee the services will be in the same fault/update domain. That might only happen if a bomb lands on the whole data centre set, but in that eventuality we have bigger fish to fry, so let's go back to it later.

If only Stream Analytics is down, the primary Event Hub and our web application will be unaware and will continue to store messages in the stream that will not be processed further down the chain; we will be unaware until we hear customers shouting.

We need to make sure a Stream Analytics works as failover and we could do so by checking the status of the service before sending messages... but:

There is no public Azure API to check the status of the service
Our high throughput and performance critical application receiving millions of requests would have to delay the completion to make a new check request to a service on every call.

So, even if we had the API, it would not be the best way to go. Maybe a separate background service could check the status, but we would still risk losing messages in the delay between the real outage and our service reporting it.

So let's try something else:

But now, both Stream Analytics jobs will process the same messages and generate an output which is likely to be similar, but extremely unlikely to be exactly the same (resource contention, unaligned time windows, et cetera).

This is still a good solution if you do not care about having potential duplicates in the data.

But if this is not your case, how do we know what’s in the primary and what’s in the secondary? How do we avoid duplication?

At the time I’m writing this there is no solution.

Event Hubs works with checkpoints and Stream Analytics stores its checkpoint somewere and a good solution would be to share checkpoints across the two Stream Analytics as they will work in synergy and never overlap each other; but Azure does not support this feature at the moment.

So we go back to the original solution if we care about the precision of our data, and we have to hope the weak link (Stream Analytics) will not fail (which is unlikely, as it is also based on Service Fabric).

I will stress the fact that we never lose data in this configuration so we might only delay the processing for the duration of the outage:

And yet we have another problem… What happens when the output Service Bus is down? I verified empirically that Stream Analytics has an internal cache for messages, but there is no guarantee the message will eventually land, and the retention can be days, minutes or seconds (source: Microsoft). So we can send identical copies of each message to two geo-redundant Service Bus queues to prevent loss.

We would also like our services not to process the same message twice, though, so what can we do?

My first idea was to add a unique identifier to the messages so our service can cache it and make sure it doesn’t process it twice (in a distributed cache/database if we want to make the service scalable over multiple instances). Great, so let’s ask Stream Analytics to generate a GUID for us and attach it to the two messages… failed. ASA cannot generate GUIDs.

OK, not a problem, I will just get the current date and time which should be unique enough if we have time windows… failed. ASA doesn’t have a GETDATE() or any time function so we need to rely on some data in our message to generate a sort of “hash” or unique identifier for the message.

I chose to use the combination of the first event date and the last event date, since together they precisely define our window.
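On the consuming side, the idea can be sketched roughly as follows. This is an illustration under my own assumptions, not the code from the original project: the names `window_key` and `Deduplicator` are hypothetical, the timestamps are assumed to be timezone-aware, and an in-memory set stands in for the distributed cache/database mentioned earlier.

```python
import hashlib
from datetime import datetime, timezone


def window_key(first_event: datetime, last_event: datetime) -> str:
    """Derive a deterministic ID from the window's first and last event times."""
    # Normalise to UTC so both copies of a message hash to the same value
    # (assumes timezone-aware datetimes).
    parts = [d.astimezone(timezone.utc).isoformat() for d in (first_event, last_event)]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


class Deduplicator:
    """Remembers keys already handled; the set is a stand-in for a shared cache."""

    def __init__(self) -> None:
        self._seen = set()

    def should_process(self, key: str) -> bool:
        """Return True the first time a key is seen, False for later copies."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Each service instance would compute `window_key(...)` for an incoming message and skip it when `should_process` returns `False`. With several instances, the check-and-add step would need to be atomic in the shared store (a set-if-absent style operation), which this single-process sketch does not attempt.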

Now our architecture is as highly available as possible (still with a chance of delaying the results in case of an ASA outage), so we should be able to sleep reasonably well knowing our service is (almost) always available (bombs/aliens aside).

Migrating my blog from WordPress.com

Stefano d'Antonio — Tue, 10 Nov 2020 16:20:29 +0000

Moving my old https://rocket.science.blog/ here, so I can carry on ghosting a different blog.

I'll copy the only blog post I have in the hope that new ones will follow.