<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Denis Cooper</title>
    <description>The latest articles on DEV Community by Denis Cooper (@denis_cooper_2e25f94af904).</description>
    <link>https://dev.to/denis_cooper_2e25f94af904</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1960529%2Fb721eded-b241-4a81-a2e2-e8b8b54ed498.png</url>
      <title>DEV Community: Denis Cooper</title>
      <link>https://dev.to/denis_cooper_2e25f94af904</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/denis_cooper_2e25f94af904"/>
    <language>en</language>
    <item>
      <title>Creating a shared configuration module in Terraform</title>
      <dc:creator>Denis Cooper</dc:creator>
      <pubDate>Wed, 19 Nov 2025 20:47:40 +0000</pubDate>
      <link>https://dev.to/denis_cooper_2e25f94af904/creating-a-shared-configuration-module-in-terraform-d5f</link>
      <guid>https://dev.to/denis_cooper_2e25f94af904/creating-a-shared-configuration-module-in-terraform-d5f</guid>
      <description>&lt;h2&gt;
  
  
  💭 The Problem
&lt;/h2&gt;

&lt;p&gt;I found myself committing a cardinal sin in Terraform — repeating configuration across multiple projects.&lt;/p&gt;

&lt;p&gt;When deploying Azure resources, we often need region- or environment-specific values. For example, virtual networks may need different routes or DNS servers depending on the location.&lt;/p&gt;

&lt;p&gt;I was managing these using lookup tables in local variables, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;firewall_ip&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"westeurope"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.1"&lt;/span&gt;
    &lt;span class="s2"&gt;"northeurope"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.2"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked fine… until it didn’t.&lt;/p&gt;

&lt;p&gt;Every time a region or IP changed, I had to update the same table in multiple projects — a maintenance nightmare.&lt;/p&gt;

&lt;h2&gt;
  
  
  💡 The Realisation
&lt;/h2&gt;

&lt;p&gt;What I really needed was a global lookup table — a shared, central source of truth I could query across all Terraform projects.&lt;/p&gt;

&lt;p&gt;The solution turned out to be beautifully simple:&lt;/p&gt;

&lt;p&gt;✅ A centralised shared configuration Terraform module.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧱 The Shared Configuration Module
&lt;/h2&gt;

&lt;p&gt;All my Terraform modules live in Azure DevOps Git Repos, so I can reference them directly rather than duplicating code.&lt;/p&gt;

&lt;p&gt;My shared configuration module only needs one file — outputs.tf — that defines all the shared values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"firewall_ips"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Firewall IP Addresses By Region"&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;prod&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"westeurope"&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.27.0.4"&lt;/span&gt;
        &lt;span class="s2"&gt;"centralus"&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.28.0.4"&lt;/span&gt;
        &lt;span class="s2"&gt;"australiaeast"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.29.0.4"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;dev&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"westeurope"&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.27.1.4"&lt;/span&gt;
        &lt;span class="s2"&gt;"centralus"&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.28.1.4"&lt;/span&gt;
        &lt;span class="s2"&gt;"australiaeast"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.29.1.4"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="err"&gt;`&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"dns_servers"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DNS Servers by Regions"&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"westeurope"&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.27.1.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"10.27.1.2"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="s2"&gt;"centralus"&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.28.1.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"10.28.1.2"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="s2"&gt;"australiaeast"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.29.1.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"10.29.1.2"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This module simply outputs all the shared configuration values I want to reference elsewhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  🪄 Calling the module
&lt;/h3&gt;

&lt;p&gt;In any Terraform project, I reference it like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"shared_configuration"&lt;/span&gt; 
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"gitssh://git@ssh.dev.azure.com/v3/example/Terraform-Modules/tf_module_shared_configuration"&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
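
&lt;p&gt;One refinement worth considering is pinning the module to a Git tag, so that a change to the shared values reaches a project only when that project moves to the new tag. The sketch below uses Terraform’s standard &lt;code&gt;ref&lt;/code&gt; argument for Git sources; the tag name &lt;code&gt;v1.0.0&lt;/code&gt; is a placeholder, not a real release:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;module "shared_configuration" {
  # "v1.0.0" is an assumed tag name; use whichever tag or branch your repo publishes
  source = "git::ssh://git@ssh.dev.azure.com/v3/example/Terraform-Modules/tf_module_shared_configuration?ref=v1.0.0"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Without a &lt;code&gt;ref&lt;/code&gt;, Terraform fetches the repository’s default branch, so each project sees the latest values the next time it downloads the module. With a &lt;code&gt;ref&lt;/code&gt;, updates become opt-in per project, which trades some convenience for predictability.&lt;/p&gt;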



&lt;p&gt;Then I use a locals.tf file to do my lookups:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;
&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;# Get region-specific values from the shared module &lt;/span&gt;
  &lt;span class="nx"&gt;firewall-ip&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;shared_configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;firewall_ips&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;dns-servers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;shared_configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dns_servers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
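
&lt;p&gt;The lookups above assume the calling project declares &lt;code&gt;var.environment&lt;/code&gt; and &lt;code&gt;var.location&lt;/code&gt;. A minimal sketch of those declarations follows; the descriptions are illustrative wording rather than anything the module requires:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;variable "environment" {
  type        = string
  description = "Deployment environment; must match a key in the firewall_ips output, e.g. prod or dev"
}

variable "location" {
  type        = string
  description = "Azure region name; must match a key in the shared outputs, e.g. westeurope"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I have deliberately left out validation blocks that list the allowed values, because that would copy the region list back into every project. If a value has no matching key in the shared module, the lookup fails at plan time with an invalid index error.&lt;/p&gt;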



&lt;p&gt;Here’s what’s happening:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;firewall_ip&lt;/code&gt; selects the right address for the target environment and region.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dns_servers&lt;/code&gt; fetches the region-specific DNS servers.&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;Whenever I reference local.firewall_ip or local.dns_servers in my Terraform code, I’m automatically pulling the correct config for that deployment.&lt;/p&gt;
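
&lt;p&gt;As an illustration, the two locals could feed a virtual network and a default route like this. It is a sketch only: the resource names and the address range are hypothetical, while the argument names are those of the &lt;code&gt;azurerm&lt;/code&gt; provider:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# All names and address ranges below are placeholders for illustration
resource "azurerm_resource_group" "network" {
  name     = "rg-network-${var.environment}"
  location = var.location
}

resource "azurerm_virtual_network" "spoke" {
  name                = "vnet-spoke-${var.environment}-${var.location}"
  location            = azurerm_resource_group.network.location
  resource_group_name = azurerm_resource_group.network.name
  address_space       = ["10.50.0.0/16"]
  dns_servers         = local.dns_servers # region-specific list from the shared module
}

resource "azurerm_route_table" "spoke" {
  name                = "rt-spoke-${var.environment}"
  location            = azurerm_resource_group.network.location
  resource_group_name = azurerm_resource_group.network.name
}

resource "azurerm_route" "default_via_firewall" {
  name                   = "default-via-firewall"
  resource_group_name    = azurerm_resource_group.network.name
  route_table_name       = azurerm_route_table.spoke.name
  address_prefix         = "0.0.0.0/0"
  next_hop_type          = "VirtualAppliance"
  next_hop_in_ip_address = local.firewall_ip # environment- and region-specific address
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Nothing in these resources names a region or an IP address directly, so the same code can be deployed to any environment and region the shared module knows about.&lt;/p&gt;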

&lt;h2&gt;
  
  
  🔁 The Benefit
&lt;/h2&gt;

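&lt;p&gt;Now, when I need to add a region or update a value, I just modify the shared module once.&lt;br&gt;
Every project that references it picks up the change the next time it downloads the module, for example on the next &lt;code&gt;terraform init&lt;/code&gt; in a clean pipeline run.&lt;/p&gt;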

&lt;p&gt;No duplication.&lt;br&gt;
No missed updates.&lt;br&gt;
No drift between environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚙️ Summary
&lt;/h2&gt;

&lt;p&gt;By creating a shared configuration Terraform module, you can:&lt;br&gt;
✅ Keep environment and region settings consistent&lt;br&gt;
✅ Avoid copy/paste duplication&lt;br&gt;
✅ Simplify maintenance and updates&lt;br&gt;
✅ Scale your Terraform codebase cleanly&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sometimes the best solutions are the simplest ones — we just need to stop and think about the problem first.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>terraform</category>
      <category>azure</category>
    </item>
    <item>
      <title>Is the Cloud Really Fault Proof?</title>
      <dc:creator>Denis Cooper</dc:creator>
      <pubDate>Fri, 07 Nov 2025 20:44:52 +0000</pubDate>
      <link>https://dev.to/denis_cooper_2e25f94af904/is-the-cloud-really-fault-proof-2hmk</link>
      <guid>https://dev.to/denis_cooper_2e25f94af904/is-the-cloud-really-fault-proof-2hmk</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What is the Cloud, Really?&lt;/li&gt;
&lt;li&gt;
Designing for Cloud Reliability&lt;ul&gt;
&lt;li&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fs.w.org%2Fimages%2Fcore%2Femoji%2F16.0.1%2F72x72%2F1f9f1.png" alt="🧱" width="72" height="72"&gt; The Reliability Pillar – Building Resilient Systems&lt;/li&gt;
&lt;li&gt;Summary&lt;/li&gt;
&lt;li&gt;Final Thoughts&lt;/li&gt;
&lt;li&gt;Further reading&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Recent weeks have seen large-scale disruptions across two major cloud platforms: first AWS, then Azure. These high-profile outages have sparked fresh questions about cloud reliability and design resilience. I’ve seen plenty of posts and comments on forums and LinkedIn suggesting that “the cloud is doomed” and that we should all go back to running everything on-premises.&lt;/p&gt;

&lt;p&gt;It seems there are still some cloud skeptics out there – and rightly so. The recent outages are a good reminder that simply running your systems in “the cloud” doesn’t make them immune to failure.&lt;/p&gt;

&lt;p&gt;In this article, I want to explore what we can (and should) do to design resilient systems — and bust a few myths along the way. Using a cloud provider doesn’t automatically mean your systems are protected, backed up, or guaranteed 100% uptime. Reliability is something we design for, not something that comes out of the box.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the Cloud, Really?
&lt;/h2&gt;

&lt;p&gt;Let’s start by being clear about what “the cloud” actually is. It’s a collection of interconnected data centres offering platforms and services you can deploy to or consume. Whether you’re using IaaS, PaaS, or a mix of both, you still have responsibility for how your workloads are configured, deployed, and maintained.&lt;/p&gt;

&lt;p&gt;Yes, cloud providers offer automation and protection for certain failure scenarios, but if you want to guarantee uptime, you must design for it.&lt;/p&gt;

&lt;p&gt;I’ve heard people argue that relying on a single cloud provider is “putting all your eggs in one basket.” When an AWS or Azure region experiences issues, multiple businesses are impacted at once — and the scale of those events makes headlines.&lt;/p&gt;

&lt;p&gt;But we should remember: the same could be said for on-premises solutions. Consider the impact if VMware pushed an update that broke virtual environments globally, or if a major telecom provider like BT, AT&amp;amp;T, or Verizon suffered a nationwide outage. Those events would take down thousands of businesses too.&lt;/p&gt;

&lt;p&gt;The real takeaway isn’t where we host workloads — cloud or on-premises — but how we design them to handle real-world disruptions. Resilience comes from engineering systems that anticipate and mitigate failure, regardless of the platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing for Cloud Reliability
&lt;/h2&gt;

&lt;p&gt;There are far too many technologies and architectures to cover in one post, so instead of listing specific tools, let’s focus on the core design principles from the Microsoft Well-Architected Framework and how they influence reliability.&lt;/p&gt;

&lt;p&gt;The Five Pillars of the Well-Architected Framework&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reliability&lt;/strong&gt; – Ensure your applications recover from failures and continue to function.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt; – Protect applications and data from threats.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Optimisation&lt;/strong&gt; – Deliver business value by managing costs effectively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Excellence&lt;/strong&gt; – Keep systems running smoothly through automation and continuous improvement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Efficiency&lt;/strong&gt; – Ensure your solution scales to meet demand efficiently.
&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;True cloud reliability comes from understanding your shared responsibility model and architecting for redundancy, not from assuming the platform will never fail.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s focus on the first pillar – &lt;strong&gt;Reliability&lt;/strong&gt; – and explore how to apply it across both cloud and on-premises environments.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsmupj2dfhzk0c3gzk5n.png" alt="🧱" width="72" height="72"&gt; The Reliability Pillar – &lt;em&gt;Building Resilient Systems&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Ensure a system can recover from failures, continue operating correctly, and meet availability commitments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mindset&lt;/strong&gt;: Reliability isn’t about never failing; it’s about failing gracefully and recovering predictably.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fochlv9u5hxxyibve49b7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fochlv9u5hxxyibve49b7.png" alt="🧩" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;1. Design for Failure and Graceful Degradation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Cloud&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assume everything can and will fail – design stateless services, redundant regions, and fault-tolerant architectures.&lt;/li&gt;
&lt;li&gt;Use managed services with built-in SLAs (e.g., Azure SQL HA, AWS RDS Multi-AZ).&lt;/li&gt;
&lt;li&gt;Implement retry logic with exponential backoff and circuit breaker patterns.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;On-Premises&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use redundant hardware (power, NICs, storage paths).&lt;/li&gt;
&lt;li&gt;Implement clustering and heartbeat monitoring (e.g., Windows Failover Cluster, VMware HA).&lt;/li&gt;
&lt;li&gt;Regularly test failover procedures.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b2y0m4hrgrahy569su0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b2y0m4hrgrahy569su0.png" alt="✅" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;&lt;em&gt;Key principle: Failure is expected – resilience is engineered.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fenglqpve3l9y7mwzcm27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fenglqpve3l9y7mwzcm27.png" alt="🌍" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;2. Redundancy and High Availability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Cloud&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy across Availability Zones and paired regions.&lt;/li&gt;
&lt;li&gt;Replicate data asynchronously (e.g., Azure GRS Storage, AWS S3 Cross-Region Replication).&lt;/li&gt;
&lt;li&gt;Use load balancers and global routing (e.g., Front Door, Traffic Manager) for failover. Remember, global services can and do fail too, so don’t assume global means resilient. Combine multiple options where necessary.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;On-Premises&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design redundant power, network, and cluster paths.&lt;/li&gt;
&lt;li&gt;Use stretched clusters or DR sites with replication (e.g., SQL Always On, Veeam).
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b2y0m4hrgrahy569su0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b2y0m4hrgrahy569su0.png" alt="✅" width="72" height="72"&gt;&lt;/a&gt; Key principle: No single point of failure.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
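
&lt;p&gt;As one small, concrete instance of the replication bullet above, this Terraform sketch creates a geo-redundant storage account. The names are hypothetical, and storage account names must be globally unique, so treat it as an illustration rather than something to apply as-is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Placeholder names throughout; adjust to your own naming convention
resource "azurerm_resource_group" "data" {
  name     = "rg-data-example"
  location = "westeurope"
}

resource "azurerm_storage_account" "replicated" {
  name                     = "stexamplegrs001" # hypothetical; must be globally unique
  resource_group_name      = azurerm_resource_group.data.name
  location                 = azurerm_resource_group.data.location
  account_tier             = "Standard"
  account_replication_type = "GRS" # data is copied asynchronously to the paired region
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Setting &lt;code&gt;account_replication_type&lt;/code&gt; to &lt;code&gt;GZRS&lt;/code&gt; instead also spreads the primary copy across availability zones, which covers both the zone and the region bullets in one setting. Because the cross-region copy is asynchronous, a regional failover can lose the most recent writes, so it still needs to be weighed against your RPO.&lt;/p&gt;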




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwdgarc6q3hxtmh2b1rn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwdgarc6q3hxtmh2b1rn.png" alt="🧠" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;3. Monitoring, Telemetry, and Health Modelling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Cloud&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use telemetry and health probes (Azure Monitor, Application Insights, CloudWatch).&lt;/li&gt;
&lt;li&gt;Automate recovery actions and alerting.&lt;/li&gt;
&lt;li&gt;Detect degradation early with availability tests and service health alerts.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;On-Premises&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Centralise monitoring (SCOM, Prometheus, Zabbix, or Nagios).&lt;/li&gt;
&lt;li&gt;Correlate logs in a SIEM (Sentinel, Splunk).&lt;/li&gt;
&lt;li&gt;Measure end-to-end service health, not just uptime.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b2y0m4hrgrahy569su0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b2y0m4hrgrahy569su0.png" alt="✅" width="72" height="72"&gt;&lt;/a&gt; Key principle: You can’t fix what you can’t see.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
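
&lt;p&gt;To make the “detect degradation early” point more tangible, here is a hedged Terraform sketch of an Azure Monitor metric alert on storage account availability. The two input variables, the resource names, the email address, and the 99.9 threshold are all assumptions chosen for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Hypothetical inputs: an existing resource group and the storage account to watch
variable "resource_group_name" {
  type = string
}

variable "storage_account_id" {
  type = string
}

# Who gets notified; the receiver details are placeholders
resource "azurerm_monitor_action_group" "ops" {
  name                = "ag-ops-example"
  resource_group_name = var.resource_group_name
  short_name          = "opsalerts"

  email_receiver {
    name          = "ops-team"
    email_address = "ops@example.com"
  }
}

# Alert when average availability drops below the assumed 99.9 threshold
resource "azurerm_monitor_metric_alert" "storage_availability" {
  name                = "alert-storage-availability"
  resource_group_name = var.resource_group_name
  scopes              = [var.storage_account_id]
  description         = "Average storage availability has dropped below 99.9 percent"
  severity            = 1

  criteria {
    metric_namespace = "Microsoft.Storage/storageAccounts"
    metric_name      = "Availability"
    aggregation      = "Average"
    operator         = "LessThan"
    threshold        = 99.9
  }

  action {
    action_group_id = azurerm_monitor_action_group.ops.id
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Defining alerts in code alongside the resources they watch means a new environment gets its monitoring at the same time as its infrastructure, rather than as an afterthought.&lt;/p&gt;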




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyp1n25ik3en5b9fp8t9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyp1n25ik3en5b9fp8t9r.png" alt="🔁" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;4. Backup, Recovery, and Disaster Recovery (DR)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define RPO and RTO per workload.&lt;/li&gt;
&lt;li&gt;Use Azure Backup, Site Recovery, or multi-region replication.&lt;/li&gt;
&lt;li&gt;Automate DR testing and validate recovery playbooks.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On-Premises&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use snapshot-based backups and offsite replication.&lt;/li&gt;
&lt;li&gt;Test restores regularly — not just backups.&lt;/li&gt;
&lt;li&gt;Use immutable or offline storage to defend against ransomware.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b2y0m4hrgrahy569su0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b2y0m4hrgrahy569su0.png" alt="✅" width="72" height="72"&gt;&lt;/a&gt; Key principle: Backup is not recovery until tested.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70gkenl9jgjk4oreyxvp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70gkenl9jgjk4oreyxvp.png" alt="🧮" width="72" height="72"&gt;&lt;/a&gt; 5. Capacity and Scalability Planning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use autoscaling (VMSS, AKS, App Service).&lt;/li&gt;
&lt;li&gt;Design for scale-out, not scale-up.&lt;/li&gt;
&lt;li&gt;Use queue-based load levelling to handle burst traffic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On-Premises&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Forecast capacity and monitor utilisation trends.&lt;/li&gt;
&lt;li&gt;Use HCI or modular infrastructure for flexibility.&lt;/li&gt;
&lt;li&gt;Consider hybrid cloud bursting for peak loads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b2y0m4hrgrahy569su0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b2y0m4hrgrahy569su0.png" alt="✅" width="72" height="72"&gt;&lt;/a&gt; &lt;strong&gt;&lt;em&gt;Key principle: Reliability fails when capacity is exhausted.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9u6voxh5465880xi3rd1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9u6voxh5465880xi3rd1.png" alt="🔄" width="72" height="72"&gt;&lt;/a&gt; 6. Change Management and Chaos Testing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use IaC and CI/CD pipelines for predictable environments.&lt;/li&gt;
&lt;li&gt;Deploy updates gradually with Blue/Green or Canary models.&lt;/li&gt;
&lt;li&gt;Test resilience with chaos engineering (Azure Chaos Studio, AWS FIS).
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On-Premises&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manage configuration with version control (Ansible, DSC, Puppet).&lt;/li&gt;
&lt;li&gt;Validate updates in staging before production rollout.&lt;/li&gt;
&lt;li&gt;Maintain rollback plans for firmware and software.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b2y0m4hrgrahy569su0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b2y0m4hrgrahy569su0.png" alt="✅" width="72" height="72"&gt;&lt;/a&gt; Key principle: Reliability is operational discipline, not luck.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsmupj2dfhzk0c3gzk5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsmupj2dfhzk0c3gzk5n.png" alt="🧱" width="72" height="72"&gt;&lt;/a&gt; 7. Dependency Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Map dependencies with Application Insights or Service Map.&lt;/li&gt;
&lt;li&gt;Use queues and event-driven design to decouple services.&lt;/li&gt;
&lt;li&gt;Prefer managed dependencies (databases, DNS, storage).
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On-Premises&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Segment workloads with VLANs or SDN.&lt;/li&gt;
&lt;li&gt;Document dependencies in your CMDB.&lt;/li&gt;
&lt;li&gt;Use APIs and message buses for internal decoupling.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b2y0m4hrgrahy569su0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b2y0m4hrgrahy569su0.png" alt="✅" width="72" height="72"&gt;&lt;/a&gt; Key principle: Loosely coupled systems fail independently, not catastrophically.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Principle&lt;/th&gt;
&lt;th&gt;Cloud Focus&lt;/th&gt;
&lt;th&gt;On-Prem Focus&lt;/th&gt;
&lt;th&gt;Key Concept&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Design for Failure&lt;/td&gt;
&lt;td&gt;Fault-tolerant microservices&lt;/td&gt;
&lt;td&gt;Clustered services&lt;/td&gt;
&lt;td&gt;Fail gracefully&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redundancy&lt;/td&gt;
&lt;td&gt;Multi-zone, multi-region&lt;/td&gt;
&lt;td&gt;Hardware &amp;amp; site redundancy&lt;/td&gt;
&lt;td&gt;Eliminate SPOFs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;Azure Monitor, Log Analytics&lt;/td&gt;
&lt;td&gt;SCOM, SNMP, SIEM&lt;/td&gt;
&lt;td&gt;Detect &amp;amp; respond early&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backup &amp;amp; DR&lt;/td&gt;
&lt;td&gt;Geo-redundant, automated&lt;/td&gt;
&lt;td&gt;Offsite &amp;amp; tested&lt;/td&gt;
&lt;td&gt;Recover predictably&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Autoscale, scale-out&lt;/td&gt;
&lt;td&gt;Capacity planning&lt;/td&gt;
&lt;td&gt;Avoid resource exhaustion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change Control&lt;/td&gt;
&lt;td&gt;IaC, pipelines, chaos testing&lt;/td&gt;
&lt;td&gt;Config management, rollback&lt;/td&gt;
&lt;td&gt;Controlled evolution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependency Mgmt&lt;/td&gt;
&lt;td&gt;Queues, retries, isolation&lt;/td&gt;
&lt;td&gt;Segmentation, decoupling&lt;/td&gt;
&lt;td&gt;Contain failure domains&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Cloud isn’t fault-proof — and it never will be. But neither is on-premises. Outages are inevitable, regardless of where systems live. What truly matters is how we design for those failures.&lt;/p&gt;

&lt;p&gt;If you design with reliability in mind (building for redundancy, automating recovery, monitoring intelligently, and testing relentlessly), you can deliver systems that stay resilient in the face of almost anything. Not every system needs the same level of resilience, though: focus your investment and design effort on the mission-critical ones.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Remember, cloud reliability isn’t about perfection — it’s about anticipating failure, mitigating impact, and keeping mission-critical systems running no matter where they live.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because reliability isn’t a checkbox you tick once; it’s a discipline you live by.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;p&gt;You can gain further insight through the official Microsoft and AWS documentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure Well-Architected Reliability Pillar:
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdh89ekwiq02gxjzadfjc.png" alt="👉" width="72" height="72"&gt; &lt;a href="https://learn.microsoft.com/en-us/azure/architecture/framework/resiliency/overview" rel="noopener noreferrer"&gt;https://learn.microsoft.com/en-us/azure/architecture/framework/resiliency/overview&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS Reliability Pillar:
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdh89ekwiq02gxjzadfjc.png" alt="👉" width="72" height="72"&gt; &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dr</category>
      <category>azure</category>
      <category>guide</category>
      <category>ha</category>
    </item>
  </channel>
</rss>
