DEV Community: Mithun Shanbhag

CloudSkew featured at Azure Community Conference

Mithun Shanbhag — Sat, 26 Dec 2020 11:14:40 +0000

A big thanks to the Azure Community Conference for featuring CloudSkew. Here's the recording in which Mithun Shanbhag (CloudSkew creator) talks about the internals of cloudskew.com and how it was built on top of Azure.

CloudSkew crosses 20K user signups

Mithun Shanbhag — Sat, 05 Sep 2020 10:58:14 +0000

cloudskew.com is acquiring users at a steady clip and just crossed 20K user signups this week.

CloudSkew featured at Cloud Community Days conference

Mithun Shanbhag — Mon, 20 Jul 2020 20:33:04 +0000

A big thanks to the Cloud Community Days conference for featuring CloudSkew. Here's the recording in which Mithun Shanbhag (CloudSkew creator) talks about the internals of cloudskew.com and how it was built on top of Azure (skip to the 3:49:00 mark for the start of the talk).

CloudSkew featured on the Cloud Lunch & Learn show

Mithun Shanbhag — Mon, 20 Jul 2020 20:05:04 +0000

A big thanks to the Cloud Lunch & Learn show for featuring CloudSkew last week. Here's the recording in which Mithun Shanbhag (CloudSkew creator) and Hugo Barona (host) talk about the internals of cloudskew.com and how it was built on top of Azure.

Please do subscribe to Cloud Lunch & Learn's YouTube channel for great Azure content!

How Cloudskew.com was built on Azure

Mithun Shanbhag — Tue, 14 Jul 2020 04:47:17 +0000

CloudSkew Architecture

CloudSkew is a free online diagram editor for sketching cloud architecture diagrams (see a quick demo video). Icons for AWS, Azure, GCP, Kubernetes, Alibaba Cloud, Oracle Cloud (OCI) etc are already preloaded in the app. All diagrams are securely saved in the cloud. Here are some sample diagrams created with CloudSkew. The full list of CloudSkew's features & capabilities can be seen here. Currently, the product is in public preview.

In this document, we'll do a deep-dive on CloudSkew's building blocks while also discussing the lessons learnt, key decisions & trade-offs made (this living document will be frequently updated as the architecture evolves). The diagram below represents the overall architecture of CloudSkew.

CloudSkew Architecture

CloudSkew's infrastructure has been built on top of various Azure services - snapped together like lego blocks. Let's now take a look at the individual pieces.

This article is a part of the #FestiveTechCalendar2020 (https://festivetechcalendar.com/), #AppliedCloudStoriesContest (aka.ms/applied-cloud-stories) and #AzureDevStories (http://konf.me/ds) initiatives.

The video below recaps how CloudSkew was built. You can skip reading the rest of the article and watch this video instead!

Apps

At its core, CloudSkew's front-end consists of two web apps:

The landing page is a static VuePress site, with all pages authored in markdown. The default VuePress theme is used without any customization, although we're loading some marketplace plugins for image zoom, google analytics, sitemap generation etc. All images on this site are loaded from a CDN. The choice of VuePress for SSG was mainly down to its simplicity.
The diagram editor is an Angular 8 SPA written in TypeScript (more details on the internals of this app will be shared in future articles). To access the app, users are required to login using their GitHub or LinkedIn credentials. This app too loads all its static assets from a CDN, while relying on the back-end web APIs for fetching dynamic content. The choice of Angular as the front-end framework was mainly driven by our familiarity with it from prior projects.

Web APIs

The back-end consists of two web API apps, both authored using ASP.NET Core 3.1:

The CloudSkew APIs facilitate CRUD operations over diagrams, diagram templates and user profiles.
The DiagramHelper APIs are required for printing or exporting (as PNG/JPG) diagrams. These APIs are isolated in a separate app since their memory footprint is higher, which causes the process to recycle more often.

Using ASP.NET Core's middleware, we ensure that:

JWT authentication is enforced. Use of policy-based authorization for RBAC ensures that claims mapping to user permissions are present in the JWT.
Only the diagram editor (front-end app) can invoke these APIs (CORS settings).
Brotli response compression is enabled for reducing payload sizes.

The web APIs are stateless and operate under the assumption that they can be restarted/redeployed any time. No sticky sessions & affinities, no in-memory state, all state is persisted to DBs using EF Core (an ORM).

Separate DTO/REST and DBContext/SQL models are maintained for all entities, with AutoMapper rules being used for conversions between the two.

Identity, AuthN & AuthZ

Auth0 is used as the (OIDC compliant) identity platform for CloudSkew. Users can log in via GitHub or LinkedIn; the handshake with these identity providers is managed by Auth0 itself. Using implicit flow, ID and access tokens (JWTs) are granted to the diagram editor app. The Auth0.JS SDK makes all this really trivial to implement. All calls to the back-end web APIs use the access token as the bearer.

Auth0 creates & maintains the user profiles for all signed-up users. Authorization/RBAC is managed by assigning Auth0 roles to these user profiles. Each role contains a collection of permissions that can be assigned to the users (they show up as custom claims in the JWTs).

Auth0 rules are used to inject custom claims in the JWT and whitelist/blacklist users.
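
To make the claims-based check concrete, here is a minimal sketch of how an API could test a decoded token for a required permission. It assumes the token has already been signature-verified and decoded into a dictionary; the claim name `permissions`, the permission strings and the helper `has_permissions` are all illustrative and not taken from CloudSkew's code.

```python
from typing import Any, Iterable, Mapping

# Assumption: the JWT was verified and decoded elsewhere; only the payload is used here.
def has_permissions(claims: Mapping[str, Any], required: Iterable[str]) -> bool:
    """Return True only if every required permission appears in the claims."""
    granted = set(claims.get("permissions", []))  # hypothetical claim name
    return set(required) <= granted

# Hypothetical decoded token payload for a signed-in user.
claims = {
    "sub": "github|12345",
    "permissions": ["read:diagrams", "write:diagrams"],
}

print(has_permissions(claims, ["write:diagrams"]))    # True
print(has_permissions(claims, ["delete:templates"]))  # False
```

A request whose token lacks a required permission would be rejected before any handler logic runs.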

Databases

SQL Azure is used for persisting user data; primarily three entities: Diagram, DiagramTemplate and UserProfile. User credentials are not stored in CloudSkew's database (that part is handled by Auth0). User contact details like emails are MD5 hashed.
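
As a small illustration of the hashing step mentioned above, the sketch below computes an MD5 digest of an email address with Python's standard library. Trimming and lowercasing the address first is an assumption made here so that equivalent addresses hash to the same value; the function name `hash_email` is hypothetical.

```python
import hashlib

def hash_email(email: str) -> str:
    """Return the hex MD5 digest of a normalized email address."""
    # Assumption: normalize (trim + lowercase) so equal addresses produce equal hashes.
    normalized = email.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

digest = hash_email("  Jane.Doe@Example.com ")
print(len(digest))  # 32 (hex characters)
print(digest == hash_email("jane.doe@example.com"))  # True
```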

Because of CloudSkew's auto-save feature, updates to the Diagram table happen very frequently. Some steps have been taken to optimize this:

Debouncing the auto-save requests from the diagram editor UI to the Web API.
Use of a queue for load-leveling the update requests (see this section for details).
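
The debouncing idea can be sketched as follows. The real editor is an Angular app, so this Python version is only a language-neutral illustration: it uses explicit timestamps instead of real timers so it runs deterministically, and the class name `Debouncer` and the 2-second quiet period are assumptions.

```python
from typing import Any, Optional

class Debouncer:
    """Trailing-edge debouncer driven by explicit timestamps (in seconds)."""

    def __init__(self, wait: float) -> None:
        self.wait = wait
        self._pending: Optional[Any] = None
        self._last_change: Optional[float] = None

    def submit(self, now: float, payload: Any) -> None:
        # Each new edit replaces the pending payload and restarts the quiet period.
        self._pending = payload
        self._last_change = now

    def poll(self, now: float) -> Optional[Any]:
        # Release the pending payload once no edit has arrived for `wait` seconds.
        if self._last_change is not None and now - self._last_change >= self.wait:
            payload, self._pending, self._last_change = self._pending, None, None
            return payload
        return None

debouncer = Debouncer(wait=2.0)
sent = []
# Three quick edits, a pause, then one more edit: (timestamp, diagram snapshot).
for ts, snapshot in [(0.0, "v1"), (0.5, "v2"), (1.0, "v3"), (5.0, "v4")]:
    ready = debouncer.poll(ts)  # flush anything whose quiet period has elapsed
    if ready is not None:
        sent.append(ready)
    debouncer.submit(ts, snapshot)
final = debouncer.poll(10.0)
if final is not None:
    sent.append(final)
print(sent)  # ['v3', 'v4']
```

Four edits result in only two save requests, which is the reduction in write volume that debouncing is meant to provide.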

For the preview version, the SQL Azure SKU being used in production is Standard/S0 with 20 DTUs (single database). Currently, the DB is only available in one region. Auto-failover groups & active geo-replication (read-replicas) are not being used at present.

SQL Azure's built-in geo-redundant DB backups offer weekly full DB backups, differential DB backups every 12 hours and transaction log backups every 5 - 10 minutes. SQL Azure internally stores the backups in RA-GRS storage for 7 days. RTO is 12 hrs and RPO is 1 hr. Perhaps less than ideal, but we'll look to improve matters here once CloudSkew's usage grows.

Azure CosmosDB's usage is purely experimental at this point, mainly for the analysis of anonymized, read-only user data in graph format over gremlin APIs (more details on this will be shared in a future article). Technically speaking, this database can be removed without any impact to user-facing features.

Hosting & Storage

Two Azure Storage Accounts are provisioned for hosting the front-end apps: landing page & diagram editor. The apps are served via the $web blob containers for static sites.

Two more storage accounts are provisioned for serving the static content (mostly icon SVGs) and user-uploaded images (PNG, JPG files) as blobs.

Two Azure App Services on Linux are also provisioned for hosting the containerized back-end web APIs. Both app services share the same App Service Plan.

For CloudSkew's preview version we're using the B1 (100 ACU, 1.75 GB Mem) plan which unfortunately does not include automatic horizontal scale-outs (i.e. scale-outs have to be done manually).
Managed Identity is enabled for both app services, required for accessing the Key Vault.
The Always On settings have been enabled.
An Azure Container Registry is also provisioned. The deployment pipeline packages the API apps as docker images and pushes to the container registry. The app services pull from it (using webhook notifications).

Caching & Compression

An Azure CDN profile is provisioned with four endpoints, the first two using the hosted front-end apps (landing page & diagram editor) as origins and the other two pointing to the storage accounts (for icon SVGs & user-uploaded images).

In addition to caching at global POPs, content compression at POPs is also enabled.

Subdomains & DNS records

All CDN endpoints have <subdomain>.cloudskew.com custom domain hostnames enabled on them. This is facilitated by using Azure DNS to create CNAME records that map <subdomain>.cloudskew.com to their CDN endpoint counterparts.

HTTPS & TLS Certificates

Custom domain HTTPS is enabled and the TLS certificates are managed by Azure CDN itself.

HTTP-to-HTTPS redirection is also enforced via CDN rules.

Externalized Configuration & Self-Bootstrapping

Azure Key Vault is used as a secure, external, central key-value store. This helps decouple back-end web API apps from their configuration settings (passwords, connection strings, endpoint urls, IP addresses, hostnames etc).

The web API apps have managed identities which are RBAC'ed for Key Vault access.

The web API apps self-bootstrap by reading their configuration settings from the Key Vault at startup. The handshake with the Key Vault is facilitated using the Key Vault Configuration Provider.
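
The bootstrapping step can be pictured with the sketch below, which turns flat secret names into a nested settings dictionary at startup. It mirrors the convention used by ASP.NET Core's Key Vault configuration provider, where `--` in a secret name stands for a configuration section separator. The secret names and values are hypothetical, and the dictionary `vault` is an in-memory stand-in for a real vault.

```python
from typing import Any, Dict

def load_settings(secrets: Dict[str, str]) -> Dict[str, Any]:
    """Build a nested settings dict from flat secret names ('--' separates sections)."""
    settings: Dict[str, Any] = {}
    for name, value in secrets.items():
        *sections, leaf = name.split("--")
        node = settings
        for section in sections:
            node = node.setdefault(section, {})  # create intermediate sections on demand
        node[leaf] = value
    return settings

# Hypothetical secrets standing in for what the vault would return at startup.
vault = {
    "ConnectionStrings--Sql": "Server=example;Database=diagrams",
    "Logging--LogLevel--Default": "Warning",
    "CdnBaseUrl": "https://cdn.example.com",
}
settings = load_settings(vault)
print(settings["Logging"]["LogLevel"]["Default"])  # Warning
```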

Queue-Based Load Leveling

Even after debouncing calls to the API, the volume of PUT (UPDATE) requests generated by the auto-save feature causes the SQL Azure DB's DTU consumption to spike, resulting in service degradation. To smooth out this burst of requests, an Azure Service Bus is used as an intermediate buffer. Instead of writing directly to the DB, the web API queues up all PUT requests into the service bus, to be drained asynchronously later.

An Azure Function app is responsible for serially dequeueing the brokered messages off the bus using the service bus trigger. Once the function receives a peek-locked message, it commits the PUT (UPDATE) to the SQL Azure DB. If the function fails to process any messages, those messages automatically get pushed onto the service bus' dead-letter queue. An Azure Monitor alert is triggered when this happens.

The Azure Function app shares the same app service plan as the back-end web APIs (i.e. uses the dedicated app service plan instead of the regular consumption plan).

Overall this queue-based load-leveling pattern has helped plateau the load on the SQL Azure DB.
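
The pattern can be simulated in a few lines. The sketch below uses an in-memory deque in place of the service bus and a dictionary in place of the database; the retry budget `MAX_DELIVERIES`, the message shape and the function names are assumptions made for illustration, not CloudSkew's actual settings.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

MAX_DELIVERIES = 3  # assumption: attempts allowed before a message is dead-lettered

def drain(queue: Deque[Dict], apply_update: Callable[[Dict], None]) -> List[Dict]:
    """Process messages one at a time; return the messages that were dead-lettered."""
    dead_letter: List[Dict] = []
    while queue:
        message = queue.popleft()
        try:
            apply_update(message)  # e.g. commit the UPDATE to the database
        except Exception:
            message["deliveries"] = message.get("deliveries", 0) + 1
            if message["deliveries"] >= MAX_DELIVERIES:
                dead_letter.append(message)  # give up; an alert would fire here
            else:
                queue.append(message)  # make it available for another attempt
    return dead_letter

database: Dict[str, str] = {}  # in-memory stand-in for the Diagram table

def apply_update(message: Dict) -> None:
    if message["body"] is None:
        raise ValueError("malformed diagram payload")
    database[message["diagram_id"]] = message["body"]

bus: Deque[Dict] = deque([
    {"diagram_id": "d1", "body": "rev-1"},
    {"diagram_id": "d1", "body": "rev-2"},
    {"diagram_id": "d2", "body": None},  # always fails, ends up dead-lettered
])
dead = drain(bus, apply_update)
print(database)   # {'d1': 'rev-2'}
print(len(dead))  # 1
```

Because the consumer drains the queue serially, the database sees a steady trickle of writes regardless of how bursty the producers are.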

APM

The Application Insights SDK is used by the diagram editor (front-end Angular SPA) to get some user insights.

E.g. We're interested in tracking the names of icons that the users couldn't find in the icon palette (via the icon search box). This helps us add these frequently searched icons into the palette later on.

App Insights' custom events help us log such information. KQL queries are used to mine the aggregated data.
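
The kind of aggregation described above can be approximated outside of KQL as well. The following sketch counts the search terms carried by "icon not found" events; the event name `IconSearchMiss`, the event shape and the sample terms are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

def top_missing_icons(events: Iterable[Dict[str, str]], limit: int = 3) -> List[Tuple[str, int]]:
    """Count search terms from 'icon not found' events, most frequent first."""
    counts = Counter(
        event["term"].strip().lower()          # fold case/whitespace variants together
        for event in events
        if event["name"] == "IconSearchMiss"   # hypothetical custom event name
    )
    return counts.most_common(limit)

events = [
    {"name": "IconSearchMiss", "term": "Kafka"},
    {"name": "IconSearchMiss", "term": "kafka "},
    {"name": "DiagramSaved", "term": ""},
    {"name": "IconSearchMiss", "term": "Terraform"},
]
print(top_missing_icons(events))  # [('kafka', 2), ('terraform', 1)]
```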

The App Insights SDK is also used for logging traces. The log verbosity is configured via app config (externalized config using Azure Key Vault).

Infrastructure Monitoring

Azure Portal Dashboards are used to visualize metrics from the various Azure resources deployed by CloudSkew.

Incident Management

Azure Monitor's metric-based alerts are being used to get incident notifications over email & slack. Some examples of conditions that trigger alerts:

[Sev 0] 5xx errors in the web APIs required for printing/exporting diagrams.
[Sev 1] 5xx errors in other CloudSkew web APIs.
[Sev 1] Any messages in the Service Bus' dead-letter queue.
[Sev 2] Response time of web APIs crossing specified thresholds.
[Sev 2] Spikes in DTU consumption in SQL Azure DBs.
[Sev 3] Spikes in E2E latency for blob storage requests.

Metrics are evaluated/sampled at 15 mins frequency with 1 hr aggregation windows.

Currently, 100% of the incoming metrics are sampled. Over time, as usage grows, we'll start filtering out outliers at P99.
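
To show how a 15-minute evaluation frequency interacts with a 1-hour aggregation window, here is a small simulation. It assumes the alert compares the window's average against a threshold (the aggregation type is not stated above), and the sample values and the 50% threshold are made up.

```python
from typing import List, Tuple

WINDOW_MIN = 60     # aggregation window from the text
FREQUENCY_MIN = 15  # evaluation frequency from the text

def evaluate(samples: List[Tuple[int, float]], threshold: float,
             end_min: int) -> List[Tuple[int, float, bool]]:
    """Return (evaluation minute, window average, fired?) for each evaluation tick."""
    results = []
    for tick in range(WINDOW_MIN, end_min + 1, FREQUENCY_MIN):
        # Only samples from the trailing window [tick - 60, tick) are considered.
        window = [value for minute, value in samples if tick - WINDOW_MIN <= minute < tick]
        if not window:
            continue
        average = sum(window) / len(window)  # assumption: average aggregation
        results.append((tick, average, average > threshold))
    return results

# Hypothetical DTU-percentage samples: (minute, value). Load jumps at minute 60.
samples = [(0, 20.0), (15, 20.0), (30, 20.0), (45, 20.0), (60, 100.0), (75, 100.0)]
results = evaluate(samples, threshold=50.0, end_min=90)
for tick, average, fired in results:
    print(tick, average, fired)
# 60 20.0 False
# 75 40.0 False
# 90 60.0 True
```

Note how the spike that starts at minute 60 only trips the alert at minute 90: a long aggregation window smooths out noise at the cost of slower detection.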

Resource Provisioning

Terraform scripts are used to provision all of the Azure resources & services shown in the architecture diagram (storage accounts, app services, CDN, DNS zone, container registry, functions, sql server, service bus etc). Use of terraform allows us to easily achieve parity in dev, test & prod environments. Although these three environments are mostly identical clones of each other, there are some minor differences:

Across the dev, test and prod environments, the app configuration data stored in the Key Vaults will have the same key names but different values. This helps apps to bootstrap accordingly.
The dev environments are ephemeral, created on demand and are disposed when not in use.
For cost reasons, smaller resource SKUs are used in dev & test environments (e.g. Basic/B 5 DTUs SQL Azure in test environment as compared to Standard/S0 20 DTU in production).
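
The "same key names, different values" rule lends itself to an automated check. The sketch below compares the configuration keys of each environment and reports any that are missing; the environment contents and the helper name `missing_keys` are hypothetical.

```python
from typing import Dict, Set

def missing_keys(environments: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """For each environment, report keys that exist in some other environment but not there."""
    all_keys: Set[str] = set()
    for config in environments.values():
        all_keys |= set(config)
    report = {}
    for name, config in environments.items():
        missing = all_keys - set(config)
        if missing:
            report[name] = missing
    return report

# Hypothetical per-environment settings: identical keys, different values.
environments = {
    "dev": {"SqlConnection": "dev-db", "LogLevel": "Debug"},
    "test": {"SqlConnection": "test-db", "LogLevel": "Information"},
    "prod": {"SqlConnection": "prod-db"},  # LogLevel was forgotten here
}
print(missing_keys(environments))  # {'prod': {'LogLevel'}}
```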

The Auth0 tenant has been set up manually since there are no terraform providers for it. However it looks like it might be possible to automate the provisioning using Auth0's Deploy CLI.

CloudSkew's provisioning scripts are being migrated from terraform to pulumi. This article will be updated as soon as the migration is complete.

Continuous Integration

The source code is split across multiple private Azure Repos. The "one repository per app" rule of thumb is enforced here. An app is deployed to dev, test & prod environments from the same repo.

Feature development & bug fixes happen in private/feature branches which are ultimately merged into master branches via pull requests.

Azure Pipelines are used for continuous integration: checkins are built, unit tested, packaged and deployed to the test environment. CI pipelines are automatically triggered both on pull request creation as well as checkins to master branches.

The pipelines are authored in YAML and executed on Microsoft-hosted Ubuntu agents.

Azure Pipelines' built-in tasks are heavily leveraged for deploying changes to Azure app services, functions, storage accounts, container registry etc. Access to Azure resources is authorized via service connections.

Deployment & Release

The deployment & release process is very simple at the moment (blue-green deployments, canary deployments and feature flags are not being used). Checkins that pass the CI process become eligible for release to the production environment.

Azure Pipelines deployment jobs are used to target the releases to the production environment.

Manual approvals are used to authorize the releases.

Future Architectural Changes

As more features are added and as usage grows, some architectural enhancements will have to be considered:

HA with multi-regional deployments and using Traffic Manager for routing traffic.
Move to a higher App Service SKU to avail of slot swapping, horizontal auto-scaling etc.
Use of caching in the back-end (Azure Cache for Redis, ASP.NET's IMemoryCache)
Changes to the deployment & release model with blue-green deployments and adoption of feature flags etc.
PowerBI/Grafana dashboard for tracking business KPIs.

Again, any of these enhancements will ultimately be need-driven.

Closing Notes

CloudSkew is in the very early stages of development and there are some simple rules of thumb it abides by:

Preferring PaaS/serverless over IaaS: Pay as you go, no server management overhead (aside: this is also why K8s clusters are not in the picture yet).
Preferring microservices over monoliths: Individual lego blocks can be independently deployed & scaled up/out.
Always keeping the infrastructure stable: Everything infra-related is automated: from provisioning to scaling to monitoring. An "it just works" infra helps maintain the core focus on user-facing features.
Releasing Frequently: The goal is to rapidly go from idea -> development -> deployment -> release. Having ultra-simple CI, deployment & release processes goes a long way in helping achieve that.
No premature optimization: All changes for making things more "efficient" are done just-in-time and have to be need-driven (e.g: Redis cache is currently not required at the back-end since API response times are within acceptable thresholds).

When CloudSkew reaches critical mass in the future, this playbook will of course have to be modified.

Please feel free to email us in case you have any questions, comments or suggestions regarding this article. Happy Diagramming!

High Availability in Azure: App Service, Function Apps

Mithun Shanbhag — Fri, 10 Jul 2020 21:00:43 +0000

Azure App Service Apps (web apps)

An Azure App Service Plan is pinned to a specific Azure Region. Any App Service Apps created in the App Service Plan will be provisioned in that same region. If your app needs additional redundancies in other regions or geographies, you'll have to:

Provision them yourself (you'll need to create new App Service Plans in those regions, if they don't already exist).
Use Azure Traffic Manager to route traffic to all available redundancies (you can only specify one App Service endpoint per region in a Traffic Manager profile). More details here.

The SLA for Azure App Services guarantees a 99.95% uptime for each regional deployment.

Azure Function Apps

Azure Function Apps too have regional deployments. If you're using the consumption plan, then you explicitly specify the region. If on the App Service Plan, then the region is the same as that of the App Service Plan.

Similar to App Services above, any additional redundancies will have to be explicitly created and traffic to these will have to be routed via Azure Traffic Manager.

The SLA for Azure Functions guarantees a 99.95% uptime for each regional deployment (for both app service plan and consumption plan).
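
To put the 99.95% figure in perspective, the following sketch converts an availability target into permitted downtime and estimates what a second regional deployment adds. It assumes regional failures are independent and ignores the availability of whatever routes traffic between regions, so treat the second number as an optimistic upper bound.

```python
def monthly_downtime_minutes(sla: float, days: int = 30) -> float:
    """Minutes of downtime per month that an availability target still permits."""
    return (1.0 - sla) * days * 24 * 60

def multi_region(sla: float, regions: int) -> float:
    """Availability of N independent regional deployments when any one suffices."""
    return 1.0 - (1.0 - sla) ** regions

single = 0.9995  # per-region SLA quoted above
print(round(monthly_downtime_minutes(single), 1))  # 21.6
print(round(multi_region(single, 2), 8))           # 0.99999975
```

In other words, a single regional deployment may be down for roughly 21.6 minutes in a 30-day month while still meeting its SLA.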

Miscellaneous

Horizontally scaled instances

As I've previously mentioned, horizontal auto-scaling exists to address performance concerns rather than high-availability concerns.

App Service Apps: When horizontal auto-scaling is enabled on a parent App Service Plan, additional instances are created, and each instance hosts all App Service Apps contained in the parent App Service Plan. All instances are created in the same WebSpace. The App Service's integrated load-balancer (non-accessible) manages the traffic. Note that all scaled out instances of an app will still have the same endpoint URL.

Function Apps: Based on a combination of factors (trigger types, rate of incoming requests, language/runtime and perhaps the host health-monitor stats), the scale controller will create additional instances of an Azure Function App (max limit of 200 instances). Note that the scaling unit is the Function App (host) itself and not individual functions.

Bonus reading:

Read more about the scaling limits imposed on App Service Apps based on pricing tiers.
Read more about ARR affinity and ARRAffinity cookies for scaled out instances.
You can now enable per-app horizontal scaling. More details in this blog post.
Read more about the scaling behavior of Function Apps.

The "Always On" setting

If you have an App Service App or a Function App associated with an App Service Plan in the production or isolated tier, then you should consider enabling the "always on" setting. This ensures that your app is always running and never unloaded (default behavior is to deactivate/unload idle apps to conserve resources).

Notes:

This setting is not available for App Service Apps in dev/test tier.
Idle Function Apps in the consumption plan will be subject to cold start latency.

Cloning and Moving App Service Apps

Using Azure PowerShell, it is possible to create clones of an existing App Service App within the same region or in a new region. Please note that there are some caveats/restrictions though.

You can also move an App Service App to another App Service plan as long as both the source plan and the destination plan are within the same WebSpace.

FWIW, I've never tried this out myself.

And yes, like any other Azure Resource, App Service Plans and App Service Apps can be moved between resource groups.

WebSpaces

WebSpaces are units of deployment for Azure App Service Plans. An App Service Plan's WebSpace is identified by the combination of its resource group and the region in which it is deployed. Any additional App Service Plans deployed to the same resource group + region combination get assigned to the same WebSpace. See more details here.

To see the WebSpace associated with an App Service App or App Service Plan, navigate to that resource in the Azure Resource Explorer (via the Azure Portal or via the website) and see the WebSpace and SelfLink properties.

High Availability in Azure: Traffic Management

Mithun Shanbhag — Fri, 10 Jul 2020 20:59:53 +0000

Azure Traffic Manager

Azure Traffic Manager routes a client's DNS query to an appropriate service endpoint, selected based on a combination of factors:

traffic routing methods (user selected)
health of the endpoints (user configured probing/monitoring rules)
latency tables (internally maintained map of ip address ranges to regions)

Some scenarios that can be addressed with Azure Traffic Manager are:

always routing to primary endpoint (with failover to secondary when primary endpoint's health degrades).
always routing to endpoint with lowest latency.
always routing to specific regional endpoint for data sovereignty compliance.
enabling blue/green deployments with weighted routing.

The Azure Traffic Manager SLA is 99.99%.

Things it is not (or does not do)

not a gateway or a proxy. Traffic between the client and the service endpoint does not pass through the traffic manager. Once the traffic manager points a client to a service endpoint, the client communicates with the endpoint directly.
not a layer-7 (application level) solution.
not a DNS server.
not a WAF.
does not offer TLS termination / SSL offload.
does not offer sticky sessions.

The holy trinity

Azure Traffic Manager is used in conjunction with Azure Application Gateways and Azure Load Balancers. Here is a nice article that explains how the trio complement each other.

Traffic routing methods

The official docs capture all the traffic routing methods in great detail. However let me provide a quick recap below:

performance routing

(official docs | tutorial)

Use this when you need to route traffic to a service endpoint with the lowest network latency (as measured from the client IP address).

The Azure Traffic Manager maintains an internal "latency table" that maps the latencies of IP address ranges to various Azure regions. Upon an incoming recursive DNS request, it looks up the request's source IP address (typically that of the client's recursive DNS resolver rather than the client itself) and detects the IP address range that it falls under. For that address range, it picks an available service endpoint from the Azure region with the lowest latency. If multiple service endpoints are detected within the same Azure region, then the Azure Traffic Manager distributes traffic evenly across them.
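
The lookup can be sketched roughly as below. The real latency table is internal to Azure, so the address ranges, region names and latency values here are invented, and `pick_region` is a hypothetical helper that only illustrates the "find the range, then take the lowest-latency healthy region" logic.

```python
import ipaddress
from typing import Dict, List, Optional

# Hypothetical latency table: source address range -> latency (ms) per region.
LATENCY_TABLE: Dict[str, Dict[str, int]] = {
    "203.0.113.0/24": {"westeurope": 20, "eastus": 95},
    "198.51.100.0/24": {"westeurope": 110, "eastus": 15},
}

def pick_region(source_ip: str, healthy_regions: List[str]) -> Optional[str]:
    """Return the healthy region with the lowest latency for the source address range."""
    address = ipaddress.ip_address(source_ip)
    for cidr, latencies in LATENCY_TABLE.items():
        if address in ipaddress.ip_network(cidr):
            candidates = {r: ms for r, ms in latencies.items() if r in healthy_regions}
            return min(candidates, key=candidates.get) if candidates else None
    return None  # address range not present in the table

print(pick_region("203.0.113.7", ["westeurope", "eastus"]))  # westeurope
print(pick_region("203.0.113.7", ["eastus"]))                # eastus
```

The second call shows the fallback: when the closest region is unhealthy, the next-best region for that address range is returned.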

priority routing

(official docs | tutorial)

Use this when you want to route all traffic to a primary endpoint (with a secondary on standby).

All service endpoints are assigned a priority number (value between 1 and 1000 with 1 being highest priority and 1000 being lowest). The primary gets assigned the highest priority (i.e. lowest number) and as a result all traffic gets routed to it. If the primary's health degrades, all traffic gets routed to the secondary, which has the next highest priority. Manual fail-overs can be initiated by bumping the secondary to higher priority.
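
A minimal sketch of the selection rule, assuming each endpoint is a dictionary with a name, a priority number and a health flag (this shape and the function name `pick_by_priority` are illustrative only):

```python
from typing import Dict, List, Optional

def pick_by_priority(endpoints: List[Dict]) -> Optional[str]:
    """Return the healthy endpoint with the lowest priority number (1 = highest priority)."""
    healthy = [e for e in endpoints if e["healthy"]]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e["priority"])["name"]

endpoints = [
    {"name": "primary", "priority": 1, "healthy": True},
    {"name": "secondary", "priority": 2, "healthy": True},
]
print(pick_by_priority(endpoints))  # primary
endpoints[0]["healthy"] = False     # the primary's health degrades
print(pick_by_priority(endpoints))  # secondary
```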

weighted routing

(official docs | tutorial)

Use this when you need to do staggered roll-outs, blue/green deployments.

All service endpoints are assigned a weight (value between 1 and 1000, 1 being lowest weight and 1000 being highest). The traffic manager will attempt to distribute traffic across the available service endpoints in proportion to their weights.

Note: Weighted routing is different from priority routing mentioned above. In priority routing, only the highest priority endpoint is selected and others are ignored (until the highest priority endpoint's health degrades). With weighted routing, the traffic manager does route traffic to all endpoints, but uses the assigned weights to choose a specific endpoint on each incoming request.
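
The difference from priority routing is easiest to see in code. In this sketch every request picks one healthy endpoint at random, in proportion to its weight; the 900/100 split, the endpoint names and the fixed random seed are assumptions chosen to make the example repeatable.

```python
import random
from collections import Counter
from typing import Dict, List

def pick_by_weight(endpoints: List[Dict], rng: random.Random) -> str:
    """Choose one healthy endpoint at random, in proportion to its weight."""
    healthy = [e for e in endpoints if e["healthy"]]
    weights = [e["weight"] for e in healthy]
    return rng.choices(healthy, weights=weights, k=1)[0]["name"]

endpoints = [
    {"name": "blue", "weight": 900, "healthy": True},   # current release
    {"name": "green", "weight": 100, "healthy": True},  # new release getting ~10% of traffic
]
rng = random.Random(42)  # seeded only so the sketch is repeatable
tally = Counter(pick_by_weight(endpoints, rng) for _ in range(10_000))
print(tally["blue"] > tally["green"])  # True
```

Shifting the weights gradually from blue to green is what makes staggered roll-outs possible with this method.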

geographic routing

(official docs | tutorial | faqs)

Use this when you need to geo-fence your users to specific regions/geographies (for data sovereignty reasons etc).

Per configuration, client requests will get serviced by endpoints from the specified region (this may or may not be the endpoint with lowest latency). Regional endpoints can be assigned at the following granularities:

world (highest granularity)
regional grouping (roughly the same as Azure Geographies)
country
state (lowest granularity, only available for USA, Canada and Australia as of the time of writing this blog post).

Lookup always starts from the lowest granularity and proceeds to the highest granularity; the first match found is returned.
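
The lookup order can be sketched as follows, using the same terminology as above (state is the lowest granularity, world the highest). The mapping table, location codes and endpoint names are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical geographic mappings, keyed by (level, code).
GEO_MAP: Dict[Tuple[str, str], str] = {
    ("state", "US-CA"): "endpoint-us-west",
    ("country", "US"): "endpoint-us",
    ("region", "Europe"): "endpoint-eu",
    ("world", "WORLD"): "endpoint-default",
}

def pick_by_geo(state: Optional[str], country: str, region: str) -> Optional[str]:
    """Check state, then country, then regional grouping, then world; first match wins."""
    lookups = [("state", state), ("country", country), ("region", region), ("world", "WORLD")]
    for level, code in lookups:
        endpoint = GEO_MAP.get((level, code))
        if endpoint is not None:
            return endpoint
    return None

print(pick_by_geo("US-CA", "US", "North America"))  # endpoint-us-west
print(pick_by_geo("US-TX", "US", "North America"))  # endpoint-us
print(pick_by_geo(None, "FR", "Europe"))            # endpoint-eu
print(pick_by_geo(None, "JP", "Asia"))              # endpoint-default
```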

subnet routing

(official docs | tutorial | faqs)

Use this when you need to map specific client IP address ranges to specific service endpoints.

multivalue routing

(official docs | faqs)

Just mentioning it for completeness' sake; I haven't actually used it ever.

High Availability in Azure: Storage Redundancies

Mithun Shanbhag — Fri, 10 Jul 2020 20:58:05 +0000

Azure Storage Account

In Azure, the following entities are backed by Azure storage accounts: blobs, file shares, queues, NoSQL table storages, Data Lake Storage (gen2) and unmanaged disks. In this blog post, we'll go over the various redundancy options available for these storage accounts. We'll compare & contrast them based on the following parameters:

Replication latency: How soon before all replicas are in full-sync?
Disaster scenarios: Are you looking at partial data loss or fully unrecoverable data? How easy (or difficult) is it to get back on track once things have hit rock bottom?
SLAs: How many 9s?

Hopefully this blog post will serve as a cheat-sheet and help you choose the right Azure storage redundancy options for your use cases.

LRS (locally-redundant storage)

With LRS, your data is replicated thrice across multiple fault domains & update domains within a single storage scale unit (all within a single datacenter). Note that all three replicas are addressed by a single endpoint (i.e. you can't target individual replicas for read/write operations).

Replication latency: No replication latency, data is synchronously written to all three replicas on every write request.

Disaster scenarios:

disaster type	service interruption?	data loss?	recovery possible?
hardware failure in physical rack/node	NO	NO¹	N/A
datacenter disaster	YES	YES	NO²
availability zone disaster	"	"	"
regional disaster	"	"	"
geographic disaster	"	"	"
worldwide disaster	"	"	"

¹ Since the replicas are spread across multiple fault domains.
² Assuming all three replicas within the storage scale unit are affected, your data is permanently lost & unrecoverable.

SLAs:

object durability (over a given year) >= 99.999999999% (11 nines)
read requests (hot tier) >= 99.9% (3 nines)
read requests (cool tier) >= 99% (2 nines)
write requests (hot tier) >= 99.9% (3 nines)
write requests (cool tier) >= 99% (2 nines)

ZRS (zone-redundant storage)

With ZRS, your data is replicated across three availability zones within the same region (please note that currently not all regions support availability zones). As in the earlier case with LRS, all three replicas are addressed by a single endpoint.

Replication latency: Very low latency, data is synchronously written to all three replicas on every write request.

Disaster scenarios:

disaster type	service interruption?	data loss?	recovery possible?
hardware failure in physical rack/node	NO	NO¹	N/A
datacenter disaster	"	"	"
availability zone disaster	YES²	NO	N/A
regional disaster	YES	YES	NO³
geographic disaster	"	"	"
worldwide disaster	"	"	"

¹ Only one replica will be affected, since the replicas are spread across different availability zones.
² Temporary service interruption until Azure finishes DNS updates (not entirely sure how long these updates take, the official docs do not mention this). To mitigate this, best to use transient fault-handling patterns (retries with back-offs and circuit breakers) for all reads/writes on the storage account. More details can be found here and here.
³ Assuming all three replicas across the availability zones are affected, your data is permanently lost & unrecoverable.
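
The retry-with-back-off pattern mentioned in the notes above can be sketched like this. It is a generic illustration, not the Azure Storage SDK's built-in retry policy: the attempt count, the base delay and the use of `ConnectionError` as the "transient" error type are all assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(operation: Callable[[], T], attempts: int = 4, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, retrying transient failures with exponential back-off and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * (2 ** attempt)          # 0.5s, 1s, 2s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter avoids synchronized retries
    raise AssertionError("unreachable")

calls = {"count": 0}

def flaky_read() -> str:
    # Stand-in for a blob read that fails twice while the endpoint is unavailable.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("storage endpoint unavailable")
    return "blob-bytes"

# The sleep function is swapped out so the example finishes instantly.
print(with_retries(flaky_read, sleep=lambda seconds: None))  # blob-bytes
```

A circuit breaker would sit one level above this: after repeated exhausted retries it would stop calling the storage account for a while instead of retrying every request.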

SLAs:

object durability (over a given year) >= 99.9999999999% (12 nines)
read requests (hot tier) >= 99.9% (3 nines)
read requests (cool tier) >= 99% (2 nines)
write requests (hot tier) >= 99.9% (3 nines)
write requests (cool tier) >= 99% (2 nines)

GRS (geo-redundant storage)

With GRS, your data is replicated across two paired-regions (within the same Azure geography) in a primary region + secondary region setup. This ensures that one regional replica will be available in the event of a regional disaster.

The primary region & the secondary region are addressed by separate endpoints. The secondary endpoint is generally inaccessible. However in case of a fail-over, the secondary is promoted to primary and read + write access is enabled for this endpoint. Fail-overs are automatically initiated by Azure in the event of a regional disaster. Azure is also introducing user-initiated fail-overs, which are currently in preview as of the time of writing this post.

Note: Both GRS (geo-redundant storage) and RA-GRS (read-access geo-redundant storage) are misnomers. They don't create redundant copies across Azure geographies, only across paired-regions within the same Azure geography.

Replication latency: Your data is first replicated synchronously within the primary region via LRS. The data is then replicated asynchronously to the secondary region (eventually consistent). Within the secondary region, it is replicated synchronously using LRS. The official SLA for Azure storage does not make any guarantees about the time needed for geo-replication.

Disaster scenarios:

disaster type	service interruption?	data loss?	recovery possible?
hardware failure in physical rack/node	NO	NO	N/A
datacenter disaster	YES¹	POSSIBLE²	YES³
availability zone disaster	"	"	"
regional disaster	"	"	"
geographic disaster	YES	YES	NO
worldwide disaster	"	"	"

¹ Within the primary & secondary regions themselves, the data is replicated via LRS. In the event of a datacenter disaster in the primary region, it is possible that all replicas within the storage scale unit are affected and the primary endpoint will now be both inaccessible & unrecoverable. Although the secondary region has replica data, its endpoint will be inaccessible until a fail-over is initiated (the data will be inaccessible until the fail-over is complete).
² With GRS, the replication from primary to secondary regions is asynchronous. In the event of the primary being destroyed before it has completely replicated the data to secondary, the secondary will have a stale copy and un-replicated writes will be permanently lost.
³ Only once a fail-over has completed does the secondary endpoint become the new primary, accessible for read + write operations, with LRS replication.

SLAs:

object durability (over a given year) >= 99.99999999999999% (16 nines)
read requests (hot tier) >= 99.9% (3 nines)
read requests (cool tier) >= 99% (2 nines)
write requests (hot tier) >= 99.9% (3 nines)
write requests (cool tier) >= 99% (2 nines)

RA-GRS (read-access geo-redundant storage)

Same as GRS, but you always have read-only access to the secondary replica.

Replication latency: Same as GRS.

Disaster scenarios:

disaster type	service interruption?	data loss?	recovery possible?
hardware failure in physical rack/node	NO	NO	N/A
datacenter disaster	YES¹	POSSIBLE²	YES³
availability zone disaster	"	"	"
regional disaster	"	"	"
geographic disaster	YES	YES	NO
worldwide disaster	"	"	"

¹ In the event of the primary replica being destroyed, the secondary region still offers read-only access, even without a fail-over being initiated (unlike GRS, where the secondary is inaccessible until a fail-over has completed).
² If the primary is destroyed before it has completely replicated its data to the secondary, the secondary will have a stale copy and un-replicated writes will be permanently lost (same as GRS).
³ Prior to fail-over, the secondary offers read-only access. After fail-over, the secondary becomes the new primary, with read + write access and LRS replication.
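
The difference between GRS and RA-GRS boils down to what the secondary endpoint allows before and after a fail-over. Here's a small sketch that encodes the notes above as a lookup (the `Access` enum and `secondary_access` function are names I made up for illustration, not part of any Azure SDK):

```python
from enum import Enum


class Access(Enum):
    NONE = "inaccessible"
    READ = "read-only"
    READ_WRITE = "read + write"


def secondary_access(redundancy: str, failed_over: bool) -> Access:
    """Access level of the secondary endpoint, as described in this post."""
    if redundancy not in ("GRS", "RA-GRS"):
        raise ValueError(f"unsupported redundancy option: {redundancy}")
    if failed_over:
        # After fail-over the secondary is the new primary (with LRS replication).
        return Access.READ_WRITE
    # Before fail-over only RA-GRS exposes the secondary, and only for reads.
    return Access.READ if redundancy == "RA-GRS" else Access.NONE


for option in ("GRS", "RA-GRS"):
    for failed_over in (False, True):
        print(option, "| failed over:", failed_over, "|", secondary_access(option, failed_over).value)
```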

SLAs:

object durability >= 99.99999999999999% (16 nines)
read requests (hot tier) >= 99.99% (4 nines)
read requests (cool tier) >= 99.9% (3 nines)
write requests (hot tier) >= 99.9% (3 nines)
write requests (cool tier) >= 99% (2 nines)

High Availability in Azure: Availability Zones

Mithun Shanbhag — Fri, 10 Jul 2020 20:53:44 +0000

Azure Availability Zones

In the opening post of this blog series we talked about availability zones and how resources can be classified as zone-redundant, zonal (zone-specific) or non-zonal (regional). If you haven't seen that post, please take a minute to do so.

Availability zones exist to shield your resources against a datacenter-level disaster.

As of the time of writing this blog post, only a few Azure regions support availability zones.

Availability zones are free (you're only charged for the VMs and resources placed in the availability zones).

Supported Azure Resources

Only a few Azure resource types support availability zones. We're highlighting some of the important ones below; the complete list is available here.

Virtual Machines: During creation, a VM can be configured as zonal. Its managed disk and public IP address (standard sku only) are then automatically placed in that same zone.
Managed Disks: During creation, a managed disk can be configured as zonal or non-zonal. Snapshots of any managed disks (zonal or otherwise) can be persisted to zone-redundant storage.
Public IPs: During creation, a Public IP address (standard sku only) can be configured as zone-redundant (default) or zonal. Public IPs with basic sku are non-zonal.
Storage Accounts: With zone-redundant storage, your data is replicated across three availability zones within the same region. We already covered ZRS storage in part 5 of this blog series.
Load Balancers: During creation, a load balancer (standard sku only) can be configured as zone-redundant or zonal. Load balancers with basic sku are non-zonal.

Availability Sets vs Availability Zones

Availability sets provide redundancies within a datacenter, while availability zones provide redundancies within a region. The former shields you against hardware failures in a physical rack, while the latter shields you against a datacenter-level disaster.
SLA for VMs in availability zones is predictably higher (99.99% uptime guarantee) than that of VMs in availability sets (99.95% uptime guarantee). Full SLA details here.
With an availability set, all VMs in it must belong to the same VNET and the same resource group. An availability zone, however, imposes no such restrictions (zonal VMs can belong to any VNET and any resource group within the region).
When placing a VM in an availability set, you cannot specify its placement (fault domain, update domain etc.). When placing a VM in an availability zone, however, you have to specify its zone.
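
To get a feel for what the two SLA figures above mean in practice, here's a quick back-of-the-envelope calculation. It assumes a 30-day month and treats the SLA as a plain uptime percentage; the actual SLA document defines how uptime is measured, so take the numbers as illustrative only.

```python
from decimal import Decimal

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes, assuming a 30-day month


def max_downtime_minutes(sla_percent: str) -> Decimal:
    """Downtime per month that still satisfies the given uptime percentage."""
    allowed_fraction = (Decimal(100) - Decimal(sla_percent)) / Decimal(100)
    return allowed_fraction * MINUTES_PER_MONTH


for label, sla in (("availability set", "99.95"), ("availability zones", "99.99")):
    print(f"{label}: {sla}% uptime allows {max_downtime_minutes(sla):.2f} minutes of downtime per month")
```

`Decimal` is used instead of floats so that `100 - 99.95` doesn't pick up binary rounding noise.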

Caveats, restrictions, gotchas & tidbits

Zonal resources, once created, cannot be moved to other availability zones within the region. It is however possible to use Azure Site Recovery to move non-zonal VMs to availability zones in another region.
The VMs in an availability zone need not all be identical.
To ensure redundancy in every tier of your n-tier application, each tier should ideally be spread across multiple availability zones (a tier confined to a single zone goes down with that zone).
Some additional caveats with zonal VMs:
- A zonal VM can only attach to a public IP address that is zone-redundant or zonal (i.e. standard sku only. Basic skus don't have zonal support).
- A zonal VM can only attach a managed disk from the same availability zone. A non-zonal VM can however attach any managed disk from the same region, irrespective of whether it's zonal or not.
- It's not possible for a zonal VM to use unmanaged disks.
Pro tip: Pair zonal VMs with a zone-redundant load balancer (standard sku) to distribute traffic evenly amongst the VMs across their availability zones. All the zonal VMs must be connected to the same VNET.
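
The attachment caveats for zonal VMs listed above can be written down as a couple of checks. This is only a sketch of the rules as stated in this post, not an Azure API: the `VM`, `Disk` and `PublicIP` classes and both functions are hypothetical, and anything the post doesn't cover is marked as an assumption in the comments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VM:
    region: str
    zone: Optional[int] = None  # None means non-zonal (regional)


@dataclass(frozen=True)
class Disk:
    region: str
    managed: bool = True
    zone: Optional[int] = None  # None means non-zonal


@dataclass(frozen=True)
class PublicIP:
    sku: str  # "basic" or "standard"


def can_attach_disk(vm: VM, disk: Disk) -> bool:
    if vm.region != disk.region:
        return False
    if vm.zone is not None:
        # Zonal VM: managed disks only, and only from the VM's own zone.
        return disk.managed and disk.zone == vm.zone
    # Non-zonal VM: any managed disk in the region, zonal or not.
    # (Unmanaged disks are allowed here too; that part is my assumption.)
    return True


def can_attach_public_ip(vm: VM, ip: PublicIP) -> bool:
    if vm.zone is not None:
        # Only standard sku IPs have zonal support.
        return ip.sku == "standard"
    return True  # assumption: the list above places no restriction on non-zonal VMs


zonal_vm = VM(region="westeurope", zone=1)
print(can_attach_disk(zonal_vm, Disk(region="westeurope", zone=2)))
print(can_attach_public_ip(zonal_vm, PublicIP(sku="basic")))
```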

High Availability in Azure: Availability Sets

Mithun Shanbhag — Fri, 10 Jul 2020 20:49:48 +0000

Azure Availability Sets

We've already discussed the concepts of fault domains, update domains and availability sets in the first post of this series. Visually, you can represent an availability set with a table as follows:

-------	FD0	FD1	FD2
UD0	VM1		VM6
UD1		VM2
UD2			VM3
UD3	VM4
UD4		VM5

No two VMs in an availability set share the same combination of fault domain & update domain. This ensures that at least one VM stays available during planned maintenance (where an entire update domain is affected) or a hardware failure (where an entire fault domain is affected). The SLA for Azure VMs guarantees that if an availability set has two or more VMs, then at least one VM will be available 99.95% of the time.
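
If you'd like to convince yourself of this, here's a tiny sketch that takes the six VMs from the table above and checks which ones survive when any single fault domain or update domain goes down. The `PLACEMENT` dictionary and `survivors` function are illustrative names, nothing more.

```python
# (fault domain, update domain) for each VM, copied from the table above.
PLACEMENT = {
    "VM1": (0, 0), "VM2": (1, 1), "VM3": (2, 2),
    "VM4": (0, 3), "VM5": (1, 4), "VM6": (2, 0),
}


def survivors(placement, *, fault_domain=None, update_domain=None):
    """VMs still running when the given fault and/or update domain is down."""
    return sorted(
        vm for vm, (fd, ud) in placement.items()
        if fd != fault_domain and ud != update_domain
    )


for fd in range(3):
    print(f"FD{fd} fails  ->", survivors(PLACEMENT, fault_domain=fd))
for ud in range(5):
    print(f"UD{ud} patched ->", survivors(PLACEMENT, update_domain=ud))
```

Every line of the output lists at least one VM, which is exactly the property the SLA relies on.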

Availability sets are free (you're only charged for the VMs and resources placed in the availability sets).

Caveats, restrictions, gotchas & tidbits

A VM must be placed in an availability set at the time of creation. Once created, it can't be moved into an availability set. Also it's not possible to change an existing VM's availability set.
An availability set forces all its associated VMs to:
- Be in the same resource group and region (technically, they all reside in the same datacenter).
- Have their network interfaces associated with the same VNET.
For HA, a VM can be placed in an availability set or in an availability zone. But NOT both. The former offers HA within a datacenter, the latter offers HA within a region.
The VMs in an availability set need not all be identical, but there are hardware size constraints. Use the Get-AzVMSize PowerShell cmdlet to list all the VM sizes available for a particular availability set (more details).
For an availability set with (say) 3 FDs and 5 UDs, the placement of the VMs will generally be as follows:
- 1st VM: FD0, UD0
- 2nd VM: FD1, UD1
- 3rd VM: FD2, UD2
- 4th VM: FD0, UD3
- 5th VM: FD1, UD4
- 6th VM: FD2, UD0
- and so on...
Generally an availability set is paired with a load balancer for traffic equi-distribution amongst the VMs in that availability set.
Pro tip: To ensure redundancies in all tiers of your n-tier application, each tier should be placed in a separate availability set.
Pro tip: Use managed disks & managed availability sets for higher availability. Read more below.
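
The placement order listed above (3 FDs, 5 UDs) follows a simple round-robin pattern, which can be sketched as below. Keep in mind this only reproduces the "general" pattern from the list; the actual placement is decided by Azure, and `place_vms` is a hypothetical helper.

```python
def place_vms(vm_count: int, fault_domains: int = 3, update_domains: int = 5):
    """Round-robin placement: the n-th VM (0-based) lands in FD n % FDs, UD n % UDs."""
    return [
        (f"VM{i + 1}", f"FD{i % fault_domains}", f"UD{i % update_domains}")
        for i in range(vm_count)
    ]


for vm, fd, ud in place_vms(6):
    print(f"{vm}: {fd}, {ud}")
```

With 3 FDs and 5 UDs, the (FD, UD) combination only repeats from the 16th VM onwards, since 3 and 5 have no common factor.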

Managed disks and managed availability sets

The issue with unmanaged disks in an availability set

The storage accounts associated with unmanaged disks in an availability set are all placed in a single storage scale unit (stamp), which then becomes a single point of failure.

Benefits of managed disks

With Azure managed disks, you no longer have to explicitly provision storage accounts to back your disks. Managed disks provide a convenient abstraction over storage accounts, blob containers and page blobs. Internally, managed disks use LRS storage (3 redundant copies within a storage scale unit inside a single datacenter).

Managed disks go in managed availability sets

If you plan to use managed disks, please ensure you select the "aligned" option while creating the availability set. This effectively creates a managed availability set.

To migrate VMs in an existing availability set to managed disks, the availability set itself needs to be converted to a managed availability set. This can be done via the Azure portal or via the Update-AzAvailabilitySet PowerShell cmdlet. Once converted, only VMs with managed disks can be added to the availability set (existing VMs with unmanaged disks in the availability set will continue to operate as before).

Please note that the max number of managed FDs will depend on the availability set's region.

Managed availability sets get it right

The managed disks in an availability set are placed across multiple storage scale units (stamps), aligned with the VMs' FDs, avoiding a single point of failure. In the event of a storage scale unit failing, only VMs with managed disks in that storage scale unit will fail (other VMs will be unaffected). This increases the overall availability of the VMs in that availability set.
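
Here's a toy comparison of the two layouts. The stamp names and the VM-to-stamp mappings are made up for illustration; the point is only to show how many VMs go down when a single storage scale unit fails.

```python
def vms_affected(vm_to_stamp: dict[str, str], failed_stamp: str) -> list[str]:
    """VMs whose disks live in the storage scale unit (stamp) that failed."""
    return sorted(vm for vm, stamp in vm_to_stamp.items() if stamp == failed_stamp)


# Unmanaged disks: every VM's disks end up in one stamp (hypothetical names).
unmanaged = {"VM1": "stamp-A", "VM2": "stamp-A", "VM3": "stamp-A"}

# Managed availability set: disks are spread over stamps, aligned with the VM FDs.
managed = {"VM1": "stamp-A", "VM2": "stamp-B", "VM3": "stamp-C"}

print("unmanaged, stamp-A fails:", vms_affected(unmanaged, "stamp-A"))
print("managed, stamp-A fails:  ", vms_affected(managed, "stamp-A"))
```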

High Availability in Azure: The basics

Mithun Shanbhag — Fri, 10 Jul 2020 13:56:05 +0000

High availability what now?

In order to understand high availability in Azure, we first need to dig into some underlying Azure concepts. To explain these, I've cobbled together a diagram (it's not 100% accurate, but it does make it simpler to explain things).

Geography

The "highest-level" entity that exists to meet data residency, compliance and sovereignty requirements. Currently there are 4 Azure geographies - Americas, Europe, Asia Pacific and Middle East + Africa. An Azure geography contains two or more Azure regions within it.

Region

As of the time of writing this post, there are 53 Azure regions (with 8 more announced) spread across 4 Azure geographies. Each Azure region contains an interconnected set of datacenters (all datacenters within an Azure region are connected via a dedicated regional low-latency network).

Some Azure regions support availability zones (each such region contains 3 or more availability zones).

Paired Regions

It is recommended that your redundancies span a set of paired regions in order to meet data residency & compliance requirements even during planned platform maintenance & outages. Azure ensures that during planned platform maintenance, only one region in each pair is updated at a time. Also, during multi-regional outages, Azure ensures that at least one region in each pair will be prioritized for recovery.

Availability Zone

An availability zone comprises one or more datacenters. Each availability zone has its own autonomous, independent infrastructure for power, cooling, and networking.

The Azure resources that support availability zones are listed here. Please note that these Azure resources can be categorized as follows:

zone-specific (zonal) resources: Azure ensures that the resources are contained within a specific availability zone. VMs, managed disks and IP addresses fall in this category.
zone-redundant resources: Azure automatically replicates the resources across multiple availability zones. Zone-redundant storage accounts and SQL databases fall in this category.
non-zonal (regional) resources: Azure resources that do not support availability zones.

I'll talk about availability zones in detail in a future blog post in this series.

Datacenter

You can watch one of Mark Russinovich's excellent presentations (link1, link2, link3 and link4) to peek into what an Azure datacenter comprises. You can also take a virtual tour of an Azure datacenter.

Fault domain (physical server rack)

A single physical rack is considered a fault domain, since all servers in that rack share common points of failure (a common power source and a common network switch).

Update domain

An update domain is a logical grouping of machines that Azure upgrades/patches simultaneously during planned platform maintenance.

Availability Set

It's always a bad idea to run a production workload on a single VM. Best to provision multiple VMs in an availability set, which is a logical grouping of VMs within a datacenter across multiple fault & update domains.

When you create multiple VMs within an availability set, Azure distributes them across these fault & update domains. This ensures that at least one VM remains running in the event of either planned platform maintenance (only one update domain in an availability set is patched at a time) or a server rack facing hardware failure, network outage or power supply issues.

My next blog post will explore availability sets for VMs in detail.

Preemptive FAQs

What about VM scale sets?

VM scale sets exist for horizontal scaling under load. In my opinion, they have almost nothing to do with redundancies for high availability. So I'll be excluding them from this particular blog series. Perhaps I'll address them in a future series on horizontal & vertical scaling for Azure resources.

Aside: Horizontal scaling & high availability address slightly different issues (performance & reliability respectively). The former adds additional instances when under load to ensure performant service. The latter adds redundant instances (irrespective of load) to prevent service disruption during outages.

Will I address high availability on Azure's government cloud?

No. I know very little about Azure's government cloud. You're welcome to read the documentation yourself.

CloudSkew crosses 14K user signups

Mithun Shanbhag — Fri, 10 Jul 2020 12:22:29 +0000

cloudskew.com just crossed 14K user signups. It's been a good week (we made it to the HackerNews front page). Onwards & upwards!