{ Abhilash Kumar Bhattaram : Follow on LinkedIn }
Moving on to the hot topic of clouds for enterprises: Uptime, MTTR and MTBF
Cloud providers advertise 99.99% or 99.999% availability, but very few teams pause to understand what is actually being promised. Availability SLAs are defined at the service infrastructure layer — not necessarily at the business service layer. The SLA may cover compute instance availability, storage durability, or control plane uptime — but it does not automatically include application responsiveness, database performance degradation, misconfiguration, architectural saturation, or customer-managed failover delays. The devil is in the definitions: what counts as downtime, what is excluded (planned maintenance, regional events, customer actions), and how availability is measured.
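To put those percentages in perspective, here is a minimal sketch that converts an availability figure into a downtime budget. It assumes a 365-day year and a 30-day month, and the function name `downtime_budget_minutes` is only illustrative; a real SLA defines its own measurement period and exclusions.

```python
# Convert an advertised availability percentage into a downtime budget.
# Assumption: 365-day year and 30-day month; actual SLAs define their own
# measurement period and exclusions.

MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600
MINUTES_PER_MONTH = 30 * 24 * 60    # 43,200


def downtime_budget_minutes(availability_pct: float, period_minutes: int) -> float:
    """Minutes of downtime permitted in a period at the given availability."""
    return period_minutes * (1 - availability_pct / 100)


for pct in (99.9, 99.99, 99.999):
    per_year = downtime_budget_minutes(pct, MINUTES_PER_YEAR)
    per_month = downtime_budget_minutes(pct, MINUTES_PER_MONTH)
    print(f"{pct}% -> {per_year:.2f} min/year, {per_month:.2f} min/month")
```

Under these assumptions, 99.99% leaves roughly 52.6 minutes of downtime per year and 99.999% roughly 5.3 minutes, which is why the definition of "downtime" matters more than the headline number.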
MTTR (Mean Time to Repair): for OCI databases (DB Systems and ExaCS) this is customer defined; for Autonomous Databases it is OCI defined.
Therefore, MTBF (Mean Time Between Failures) is a maturity indicator.
ExaCS HA Documentation here
ADB HA Documentation here
1. Ground Zero: Where Challenges Start
+--------------------------------------------------------------------------------------------------+
| 1. Ground Zero: Where Challenges Start |
+--------------------------------------------------------------------------------------------------+
| - Uptime reported as “database was open” |
| - No distinction between planned vs unplanned downtime |
| - MTTR calculated inconsistently |
| - Incident severity not standardized |
| - No historical trend tracking |
| |
| Typical KPI Mistakes |
| • Counting partial outages as “available” |
| • Ignoring performance degradation in uptime calculation |
| • Measuring MTTR from ticket assignment, not incident start |
| • No clear incident start/end timestamp discipline |
| • DR failovers counted as uptime improvement without context |
| |
| >> Numbers exist. Reliability insight does not. |
+--------------------------------------------------------------------------------------------------+
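The MTTR timestamp mistake listed above is easy to see with numbers. The sketch below uses two made-up incidents (all timestamps are hypothetical) and compares MTTR measured from incident start with MTTR measured from ticket assignment.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (started, ticket_assigned, restored).
incidents = [
    (datetime(2024, 1, 5, 2, 0), datetime(2024, 1, 5, 2, 40), datetime(2024, 1, 5, 3, 30)),
    (datetime(2024, 2, 11, 14, 10), datetime(2024, 2, 11, 14, 25), datetime(2024, 2, 11, 15, 10)),
]


def mean_minutes(durations: list[timedelta]) -> float:
    """Average a list of durations and return the result in minutes."""
    return sum(d.total_seconds() for d in durations) / len(durations) / 60


# MTTR measured from the moment the incident actually began.
mttr_from_start = mean_minutes([restored - started for started, _, restored in incidents])

# MTTR measured from ticket assignment, which hides the triage delay.
mttr_from_assignment = mean_minutes([restored - assigned for _, assigned, restored in incidents])

print(f"MTTR from incident start:    {mttr_from_start:.1f} min")
print(f"MTTR from ticket assignment: {mttr_from_assignment:.1f} min")
```

With this sample data the same two incidents yield 75.0 minutes from incident start but only 47.5 minutes from ticket assignment, so the reported MTTR depends heavily on which timestamp is treated as the beginning.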
2. Underneath Ground Zero: Finding the Real Problem
+--------------------------------------------------------------------------------------------------+
| 2. Underneath Ground Zero: Finding the Real Problem |
+--------------------------------------------------------------------------------------------------+
| - SLAs defined commercially, not operationally |
| - Monitoring tools not aligned with business impact |
| - No separation between infrastructure and application outage |
| - Root cause trends not tied to KPI analysis |
| - No MTBF tracking to identify systemic instability |
| |
| Hidden Structural Issues |
| • Frequent small incidents masking instability |
| • Reactive firefighting improves MTTR but not MTBF |
| • Patching windows inflating downtime metrics |
| • Global teams measuring differently |
| • Lack of service-level ownership |
| |
| Core Reality |
| You cannot improve what you don’t measure correctly. |
| And you cannot measure correctly without consistent definitions. |
+--------------------------------------------------------------------------------------------------+
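A small illustration of how frequent small incidents can hide behind a healthy uptime number: the two services below are hypothetical and share the same total unplanned downtime over a 30-day window, yet their MTBF differs by an order of magnitude. The helper names are illustrative only.

```python
# Two hypothetical services over the same 30-day window, with identical total
# unplanned downtime but a different number of failures.
WINDOW_HOURS = 30 * 24  # 720


def uptime_pct(downtime_hours: float) -> float:
    """Percentage of the window during which the service was up."""
    return 100 * (WINDOW_HOURS - downtime_hours) / WINDOW_HOURS


def mtbf_hours(downtime_hours: float, failures: int) -> float:
    """MTBF as total uptime divided by the number of unplanned failures."""
    return (WINDOW_HOURS - downtime_hours) / failures


services = {
    "service_a": {"downtime_hours": 2.0, "failures": 1},   # one 2-hour outage
    "service_b": {"downtime_hours": 2.0, "failures": 12},  # twelve 10-minute outages
}

for name, s in services.items():
    print(
        f"{name}: uptime {uptime_pct(s['downtime_hours']):.2f}%, "
        f"MTBF {mtbf_hours(s['downtime_hours'], s['failures']):.1f} h"
    )
```

Both services report the same uptime percentage, but service_b fails about every 60 hours while service_a runs for 718 hours between failures. Uptime alone would not reveal that difference.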
3. Working Upwards: From Understanding to Solution
+--------------------------------------------------------------------------------------------------+
| 3. Working Upwards: From Understanding to Solution |
+--------------------------------------------------------------------------------------------------+
| - Define precise KPI definitions |
| • Uptime = Service availability excluding approved maintenance |
| • MTTR = Time from incident detection to full service restoration |
| • MTBF = Total uptime / number of unplanned failures |
| |
| - Standardize incident classification & severity models |
| - Automate incident timestamp capture |
| - Separate performance degradation from full outages |
| - Track rolling 30/90/180 day KPI trends |
| - Correlate KPIs with change events (patches, deployments, upgrades) |
| - Benchmark across environments (Prod vs DR vs Non-Prod) |
| |
| Mature Oracle Service KPI Model |
| • Uptime tied to business service, not instance status |
| • MTTR trending downward through structured response |
| • MTBF increasing as architectural stability improves |
| • KPI dashboards shared with leadership |
| |
| CTO Outcome |
| • Data-driven reliability decisions |
| • Justified architecture investments |
| • Predictable SLA adherence |
| • Fewer executive surprises |
| |
| >> Uptime is a result. |
| MTTR is a capability. |
| MTBF is a maturity indicator. |
+--------------------------------------------------------------------------------------------------+
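The three KPI definitions above can be sketched in a few lines of Python. This is one possible reading, not a definitive implementation: it assumes approved maintenance is removed from the measurement window, that outages do not overlap and fall entirely inside the window, and the `Outage` class and `compute_kpis` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    detected: datetime
    restored: datetime
    approved_maintenance: bool = False  # True for planned, approved windows

    @property
    def duration(self) -> timedelta:
        return self.restored - self.detected


def compute_kpis(window_start: datetime, window_end: datetime, outages: list[Outage]) -> dict:
    """Compute uptime %, MTTR and MTBF for one measurement window.

    Assumes outages are non-overlapping and lie inside the window.
    """
    window = window_end - window_start
    maintenance = sum((o.duration for o in outages if o.approved_maintenance), timedelta())
    unplanned = [o for o in outages if not o.approved_maintenance]
    unplanned_downtime = sum((o.duration for o in unplanned), timedelta())

    scheduled = window - maintenance          # time the service was expected to be up
    uptime = scheduled - unplanned_downtime   # time it actually was up

    kpis = {"uptime_pct": 100 * uptime / scheduled}
    if unplanned:
        # MTTR: mean time from detection to full restoration.
        kpis["mttr_minutes"] = unplanned_downtime / len(unplanned) / timedelta(minutes=1)
        # MTBF: total uptime divided by the number of unplanned failures.
        kpis["mtbf_hours"] = uptime / len(unplanned) / timedelta(hours=1)
    return kpis


# Hypothetical 30-day window: one approved patch window and two unplanned outages.
outages = [
    Outage(datetime(2024, 1, 7, 1, 0), datetime(2024, 1, 7, 5, 0), approved_maintenance=True),
    Outage(datetime(2024, 1, 12, 10, 0), datetime(2024, 1, 12, 10, 30)),
    Outage(datetime(2024, 1, 20, 22, 0), datetime(2024, 1, 20, 23, 30)),
]
kpis = compute_kpis(datetime(2024, 1, 1), datetime(2024, 1, 31), outages)
print(kpis)
```

For this sample data the MTTR is 60 minutes and the MTBF is 357 hours. Running the same function over rolling 30, 90 and 180 day windows would give the trend lines mentioned above.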
How Nabhaas helps you
If you’ve made it this far, you already sense there’s a better way — in fact, you have a way ahead.
If you’d like Nabhaas to assist in your journey, remember — TAB is just one piece. Our Managed Delivery Service ensures your Oracle operations run smoothly between patch cycles, maintaining predictability and control across your environments.
TAB - Whitepaper: download here
Managed Delivery Services - Whitepaper: download here