<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David R</title>
    <description>The latest articles on DEV Community by David R (@davidstevenrojas).</description>
    <link>https://dev.to/davidstevenrojas</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F309099%2F5b7cccb2-ba03-4275-9ac3-86a594e293e7.png</url>
      <title>DEV Community: David R</title>
      <link>https://dev.to/davidstevenrojas</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/davidstevenrojas"/>
    <language>en</language>
    <item>
      <title>Always Up, Never Chill: A Friendly Intro to Availability in Software</title>
      <dc:creator>David R</dc:creator>
      <pubDate>Sat, 29 Nov 2025 13:41:52 +0000</pubDate>
      <link>https://dev.to/davidstevenrojas/always-up-never-chill-a-friendly-intro-to-availability-in-software-55a5</link>
      <guid>https://dev.to/davidstevenrojas/always-up-never-chill-a-friendly-intro-to-availability-in-software-55a5</guid>
      <description>&lt;p&gt;Imagine your favorite coffee shop. Every time you show up, the doors are locked and there’s a sticky note saying “Back in 5 minutes” that’s clearly been there since the Stone Age.&lt;br&gt;
Technically the shop &lt;em&gt;exists&lt;/em&gt;, the coffee machine is shiny, the barista is on payroll… but for you, that place has &lt;strong&gt;terrible&lt;/strong&gt; availability.&lt;/p&gt;

&lt;p&gt;Software is the same: users don’t care how elegant the code is if the “doors” (APIs, UIs, services) are often closed, flaky, or too slow to be usable.&lt;br&gt;
Availability is about making sure your digital coffee shop is open, serving, and not spilling espresso on people when they walk in.&lt;/p&gt;




&lt;h3&gt;What “availability” actually means&lt;/h3&gt;

&lt;p&gt;Availability is the percentage of time a system is up, reachable, and doing its job correctly when users need it.&lt;br&gt;
Put simply: if users can hit your app and it behaves as promised, it’s available; if it’s down, unreachable, or constantly erroring, it’s not.&lt;/p&gt;

&lt;p&gt;Many definitions boil down to uptime over total time, often expressed as a percentage of how often a workload is “available for use” and performing its agreed function successfully.&lt;br&gt;
This can take into account not just binary up/down but also errors, timeouts, DNS issues, and failures along the chain from user to backend.&lt;/p&gt;




&lt;h3&gt;Availability vs reliability vs performance&lt;/h3&gt;

&lt;p&gt;Availability: “Is it there and responding?” Reliability: “Does it keep working correctly over time?”&lt;br&gt;
A service might be technically up but frequently return wrong results or crash mid‑request, which makes it available but unreliable.&lt;/p&gt;

&lt;p&gt;Performance is about how fast and how much—latency and throughput—not simply whether the system responds at all.&lt;br&gt;
If responses are so slow that users give up, you’ve crossed from “bad performance” into “practically unavailable,” even if your uptime metric still looks decent.&lt;/p&gt;




&lt;h3&gt;Measuring availability (and the math bits)&lt;/h3&gt;

&lt;p&gt;At a high level, availability is often computed as &lt;code&gt;availability = (1 − downtime / total time) × 100%&lt;/code&gt;, which gives a nice uptime percentage.&lt;br&gt;
Industry definitions also describe it as the percentage of time a workload or application is available for use and meeting its agreed function.&lt;/p&gt;
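&lt;p&gt;If you like seeing the arithmetic, the uptime formula is tiny in code (the downtime number here is invented for illustration):&lt;/p&gt;

```python
# Availability as "uptime over total time", expressed as a percentage.
def availability_pct(downtime_minutes, total_minutes):
    return (1 - downtime_minutes / total_minutes) * 100

# Example: 43.8 minutes of downtime in a 30-day month (43,200 minutes).
month_minutes = 30 * 24 * 60
print(round(availability_pct(43.8, month_minutes), 3))  # 99.899
```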

&lt;p&gt;A more reliability‑engineering style formula is &lt;code&gt;A = MTBF / (MTBF + MTTR)&lt;/code&gt;, where MTBF is mean time between failures and MTTR is mean time to repair.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MTBF captures how long the system typically runs before failing again.&lt;/li&gt;
&lt;li&gt;MTTR captures how long it takes, on average, to restore service once something breaks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Shrinking MTTR via good monitoring, on‑call, and automation can significantly boost availability without making failures themselves rarer.&lt;/p&gt;
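&lt;p&gt;A quick sketch of that effect, with invented numbers: recovery gets ten times faster while failures stay exactly as frequent, and availability still jumps.&lt;/p&gt;

```python
# A = MTBF / (MTBF + MTTR), both in hours.
def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Failing every 500 hours and taking 1 hour to recover:
print(round(availability(500, 1), 4))    # 0.998
# Same failure rate, but recovery automated down to 6 minutes:
print(round(availability(500, 0.1), 4))  # 0.9998
```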




&lt;h3&gt;The famous “nines” of availability&lt;/h3&gt;

&lt;p&gt;Teams usually describe targets as “nines” like 99%, 99.9%, 99.99%, and 99.999%.&lt;br&gt;
Each extra nine drastically cuts allowed downtime per year: for example, 99.9% allows roughly 8.76 hours of unplanned downtime annually, while 99.999% allows only about 5.3 minutes.&lt;/p&gt;
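&lt;p&gt;The per‑year math behind those numbers is easy to reproduce (this uses a 365.25‑day year; some sources use 365, which shifts the figures slightly):&lt;/p&gt;

```python
# Allowed downtime per year for common "nines" targets.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes

for target in (99.0, 99.9, 99.99, 99.999):
    allowed = (1 - target / 100) * MINUTES_PER_YEAR
    print(f"{target}% allows {allowed:.1f} min/year ({allowed / 60:.2f} hours)")
```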

&lt;p&gt;High‑availability (HA) systems typically aim for at least 99.5% and often 99.9% or 99.99% uptime, depending on how critical they are.&lt;br&gt;
Ultra‑critical industries like healthcare, finance, and transportation may push towards five nines or “fault‑tolerant” designs to keep outages to a bare minimum.&lt;/p&gt;




&lt;h3&gt;Availability in the CAP theorem sense&lt;/h3&gt;

&lt;p&gt;In CAP theorem, availability has a very specific, stricter meaning: every request to a non‑failing node must result in a response, without guaranteeing it’s the latest data.&lt;br&gt;
This CAP‑availability definition differs from high availability SLAs: it’s more about never rejecting requests during partitions than about long‑term uptime percentages.&lt;/p&gt;

&lt;p&gt;CAP forces a choice, under a network partition, between strict consistency and availability; systems that favor availability will keep serving responses even if some are stale.&lt;br&gt;
For example, an AP‑leaning database cluster might let users keep reading and writing on both sides of a partition at the cost of temporary inconsistencies.&lt;/p&gt;
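&lt;p&gt;Here’s a deliberately toy picture of that AP behavior: two dicts stand in for replicas, and &lt;code&gt;min()&lt;/code&gt; stands in for a real conflict‑resolution strategy (real systems use vector clocks, last‑writer‑wins, CRDTs, and so on).&lt;/p&gt;

```python
# Two replicas of a key-value store, modeled as plain dicts.
left, right = {"stock": 10}, {"stock": 10}

# Network partition: each side keeps accepting writes (availability)...
left["stock"] = 9    # one sale handled by the left side
right["stock"] = 8   # two sales handled by the right side

# ...so reads now disagree (no consistency during the partition).
print(left["stock"], right["stock"])  # 9 8

# When the partition heals, some reconciliation rule must pick a value;
# taking the minimum here is a naive stand-in for real conflict resolution.
merged = min(left["stock"], right["stock"])
print(merged)  # 8
```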




&lt;h3&gt;High‑availability architecture basics&lt;/h3&gt;

&lt;p&gt;High‑availability design focuses on keeping systems accessible and functional despite hardware failures, software bugs, network blips, and maintenance.&lt;br&gt;
The core idea is eliminating single points of failure, building in redundancy, and automating detection and failover.&lt;/p&gt;

&lt;p&gt;Key ingredients include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple instances behind load balancers so one crashing node doesn’t take everything down.&lt;/li&gt;
&lt;li&gt;Health checks and automatic rerouting away from sick instances.&lt;/li&gt;
&lt;li&gt;Replication of state (databases, queues, storage) so a single node or disk dying doesn’t lose data or halt traffic.&lt;/li&gt;
&lt;li&gt;Clear failover strategies so standby nodes or clusters can take over quickly.&lt;/li&gt;
&lt;/ul&gt;
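&lt;p&gt;The first two ingredients can be sketched in a few lines. The instance names and health flags below are made up; a real load balancer would learn health from periodic checks rather than a hardcoded field.&lt;/p&gt;

```python
import itertools

# Hypothetical instance pool; "healthy" would come from real health checks.
instances = [
    {"name": "app-1", "healthy": True},
    {"name": "app-2", "healthy": False},  # failed its health check
    {"name": "app-3", "healthy": True},
]

def pick_instance(pool, counter=itertools.count()):
    """Round-robin over healthy instances only, routing around sick ones."""
    healthy = [i for i in pool if i["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy instances: the service is down")
    return healthy[next(counter) % len(healthy)]

print([pick_instance(instances)["name"] for _ in range(4)])
# app-1 and app-3 alternate; app-2 never receives traffic
```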




&lt;h3&gt;Cloud, regions, and infrastructure choices&lt;/h3&gt;

&lt;p&gt;Cloud platforms give a lot of high‑availability primitives “out of the box”: multi‑AZ databases, managed load balancers, auto‑scaling groups, and global CDNs.&lt;br&gt;
Using multiple availability zones or regions protects you from data center‑level outages, at the cost of more complex networking and consistency trade‑offs.&lt;/p&gt;

&lt;p&gt;CDNs can keep static content or cached versions of your app available even if core infrastructure is having a bad day, sometimes in a limited “read‑only but still up” mode.&lt;br&gt;
Cloud‑native HA design often combines load balancing, caching, DDoS protection, and global routing to shield applications from localized failures.&lt;/p&gt;




&lt;h3&gt;Application‑level tactics to stay “up”&lt;/h3&gt;

&lt;p&gt;At the app and service layer, patterns focus on avoiding cascading failures and degrading gracefully instead of just falling over.&lt;br&gt;
Retry logic with exponential backoff, circuit breakers, and timeouts help services survive transient downstream issues without turning a small glitch into a full outage.&lt;/p&gt;
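&lt;p&gt;A minimal sketch of retries with exponential backoff and jitter; the flaky downstream is simulated, and a production version would also cap total elapsed time and retry only idempotent operations.&lt;/p&gt;

```python
import random
import time

def call_with_retries(operation, max_attempts=4, base_delay=0.1):
    """Retry a flaky call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the failure propagate
            # 0.1s, 0.2s, 0.4s, ... plus jitter so clients don't retry in lockstep
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.05))

# A stand-in downstream call that fails twice, then recovers.
attempts = {"count": 0}
def flaky_call():
    attempts["count"] += 1
    if attempts["count"] > 2:
        return "ok"
    raise ConnectionError("transient glitch")

print(call_with_retries(flaky_call))  # ok
```

&lt;p&gt;In real services you’d pair this with a circuit breaker, so a downstream that stays down stops receiving retries at all instead of being hammered.&lt;/p&gt;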

&lt;p&gt;Stateless services can be replaced, scaled, and rolled out more easily, improving availability during deploys and failures.&lt;br&gt;
For stateful components, replication, sharding, and careful data partitioning can spread risk and reduce the blast radius of any one node’s failure.&lt;/p&gt;




&lt;h3&gt;Monitoring, SLOs, and error budgets&lt;/h3&gt;

&lt;p&gt;To keep availability high, you first need to see when it drops; that’s where metrics, logs, and traces come in.&lt;br&gt;
External synthetic checks (pings from outside your network or multiple regions) give a realistic view of whether users can actually reach your service.&lt;/p&gt;

&lt;p&gt;Service level objectives (SLOs) often define availability as a percentage, and error budgets quantify “how much failure is allowed” in a time window.&lt;br&gt;
These error budgets guide trade‑offs: if availability is burning too fast, you slow down risky changes; if it’s healthy, you can ship more aggressively.&lt;/p&gt;
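&lt;p&gt;Error‑budget math is refreshingly simple; here’s a sketch for a hypothetical 99.9% SLO over a 30‑day window, with an invented 25‑minute incident:&lt;/p&gt;

```python
# Error budget for a 99.9% availability SLO over a 30-day window.
slo = 0.999
window_minutes = 30 * 24 * 60            # 43,200 minutes
budget_minutes = (1 - slo) * window_minutes

print(round(budget_minutes, 1))          # 43.2 minutes of "allowed" downtime

# After an incident, we can see how much budget is left for the window:
incident_minutes = 25
remaining = budget_minutes - incident_minutes
print(round(remaining / budget_minutes * 100, 1))  # 42.1 (% of budget remaining)
```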




&lt;h3&gt;Planned downtime and “zero‑downtime” dreams&lt;/h3&gt;

&lt;p&gt;Even planned maintenance and deployments affect whether the system is “available for use,” depending on how you define your SLA.&lt;br&gt;
High‑availability setups aim to perform as much maintenance as possible without noticeable downtime using rolling updates, blue‑green deployments, and online schema migrations.&lt;/p&gt;

&lt;p&gt;Some SLAs only count unplanned downtime, but it’s important to be explicit so customers know what “99.9%” really means.&lt;br&gt;
By taking pieces of the system out of rotation incrementally, you can patch, upgrade, and reconfigure while the overall service remains available.&lt;/p&gt;
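&lt;p&gt;That incremental idea boils down to a loop. Assuming a hypothetical four‑instance pool, each instance is drained, upgraded, and returned to rotation while the rest keep serving:&lt;/p&gt;

```python
# Rolling update sketch: take one instance out of rotation at a time,
# so serving capacity never drops to zero (names are illustrative).
instances = ["app-1", "app-2", "app-3", "app-4"]
in_rotation = set(instances)

def deploy(instance):
    """Stand-in for drain, upgrade, and health-check of one instance."""
    return f"{instance} upgraded"

for instance in instances:
    in_rotation.discard(instance)   # drain: stop sending it traffic
    assert in_rotation              # the rest keep serving users
    deploy(instance)
    in_rotation.add(instance)       # healthy again: back into rotation

print(sorted(in_rotation))  # all four instances upgraded, none missing
```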




&lt;h3&gt;Trade‑offs and reality checks&lt;/h3&gt;

&lt;p&gt;Chasing more nines is expensive and complex: redundancy, geo‑replication, and fault‑tolerant hardware all drive cost and operational overhead.&lt;br&gt;
At some point, the marginal value of shrinking downtime from an hour per year to a few minutes only pays off for very high‑stakes use cases.&lt;/p&gt;

&lt;p&gt;Distributed systems also run into CAP‑style trade‑offs: favoring high availability may require relaxing strict consistency or accepting eventual consistency for some operations.&lt;br&gt;
In practice, teams pick availability targets that match business impact, then layer defenses—good architecture, cloud primitives, observability, and strong ops—to hit those numbers.&lt;/p&gt;




&lt;p&gt;Wrapping up, availability in software is really about a simple promise: “when you need this, it’ll be there and it’ll work.” Under the hood that promise is backed by math (MTBF, MTTR, the “nines”), design patterns (redundancy, failover, graceful degradation), and good engineering habits (monitoring, testing, and thoughtful incident response).&lt;/p&gt;




&lt;h3&gt;Where to read more&lt;/h3&gt;

&lt;p&gt;If you want to keep nerding out on availability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Definitions and examples of application availability and uptime metrics.[&lt;a href="https://www.cloudflare.com/learning/performance/glossary/application-availability/" rel="noopener noreferrer"&gt;1&lt;/a&gt;][&lt;a href="https://www.pluralsight.com/resources/blog/tech-operations/uptime-availability-metrics-app-reliability" rel="noopener noreferrer"&gt;2&lt;/a&gt;][&lt;a href="https://www.a10networks.com/glossary/what-is-application-availability/" rel="noopener noreferrer"&gt;3&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;High availability design and “nines” math for different types of systems.[&lt;a href="https://en.wikipedia.org/wiki/High_availability_software" rel="noopener noreferrer"&gt;4&lt;/a&gt;][&lt;a href="https://www.penguinsolutions.com/en-us/resources/blog/rule-nines-availability-always-on-world" rel="noopener noreferrer"&gt;5&lt;/a&gt;][&lt;a href="https://www.nobl9.com/service-availability/high-availability-design" rel="noopener noreferrer"&gt;6&lt;/a&gt;][&lt;a href="https://dev.to/anwaar/high-availability-mathematics-and-mission-critical-systems-51j9"&gt;7&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;CAP theorem’s view of availability and consistency trade‑offs in distributed data stores.[&lt;a href="https://en.wikipedia.org/wiki/CAP_theorem" rel="noopener noreferrer"&gt;8&lt;/a&gt;][&lt;a href="https://blog.algomaster.io/p/cap-theorem-explained" rel="noopener noreferrer"&gt;9&lt;/a&gt;][&lt;a href="https://www.bmc.com/blogs/cap-theorem/" rel="noopener noreferrer"&gt;10&lt;/a&gt;][&lt;a href="https://milvus.io/ai-quick-reference/what-is-availability-in-the-cap-theorem" rel="noopener noreferrer"&gt;11&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;Cloud provider reliability and availability guidance (e.g., AWS reliability pillar).[&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/availability.html" rel="noopener noreferrer"&gt;12&lt;/a&gt;][&lt;a href="https://www.techtarget.com/searchcloudcomputing/definition/cloud-computing" rel="noopener noreferrer"&gt;13&lt;/a&gt;]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treat your system like that coffee shop: keep the doors open, the line moving, and only run out of beans once every few years.[&lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/availability.html" rel="noopener noreferrer"&gt;12&lt;/a&gt;][&lt;a href="https://www.cloudflare.com/learning/performance/glossary/application-availability/" rel="noopener noreferrer"&gt;1&lt;/a&gt;]&lt;/p&gt;

</description>
      <category>availability</category>
      <category>cloud</category>
      <category>software</category>
      <category>sla</category>
    </item>
  </channel>
</rss>
