
Murat Kayan

Posted on • Originally published at betweenthelayers-en.hashnode.dev

The Invisible Tax Every Oracle-Database Enterprise Pays

What it really costs when nobody can say exactly what happened inside your critical systems, and why that cost never appears as a line item.

Series · Article 1/?


First, try to picture a crisis in your head.

A 24/7 application is running slow; the service desk is filling up with tickets, or the on-call team just caught it themselves. On the APM tool's dashboards, the greens are turning amber one by one, the alert count is climbing, and it's painfully obvious that without an early intervention everything is about to go red...

Within fifteen minutes a war room forms: the service-desk lead, the application team, the database team, the middleware team... Each team shares its own findings in turn, on separate screens, dropping screenshots into the chat. The application team says, "Our side is clean, we didn't ship any code. Could the delay be in that other service?" So the owner of the service the buck got passed to is pulled into the room on suspicion. "Our side is clean too. Could it be the database?" The database team says, "Queries are running normally, it's not us." The infrastructure team says, "No packet loss on the network." Every team is right. And the problem is still happening.

Reading this, you probably didn't recall one specific night, because this isn't one night. It's an incident that recurs at different frequencies. In nearly every organisation running critical systems on Oracle Database, this pattern repeats a few times a year, sometimes a few times a month. End-of-day reconciliation at a bank, the claims pipeline at an insurer, the billing window at a telco... the names change, the choreography stays the same.

That's exactly where we come from. We've sat in plenty of these war rooms, and more often than not on the "accused" side of the table, hunting for shared evidence to prove whether or not the problem came from the application we were responsible for. And in the cases where it was on us, all those wasted hours got assigned squarely to us. What a special kind of stress that is.

Over the years we noticed one thing: the real cost of these war rooms isn't the downtime everyone talks about. The biggest cost is a tax nobody can invoice.

The line item nobody sees

When an outage ends and we look back, we usually talk about a single number: "The system was slow, or down, for forty minutes." But what was actually spent inside those forty minutes was far more expensive.

The three senior developers who sat in that room didn't do the next day's work. That team spent its time not on the fix but on the diagnosis; and in critical incidents, most of the time goes not into repairing but into "figuring out what happened." The fix itself is usually short. It's the three hours before it, spent working out which layer is guilty, that drain you.

On top of that time, a few things erode a little more with every repeat: trust between teams... and if you're sitting there as a turnkey project owner, the client's trust in the vendor erodes too. In an organisation where the application, middleware, database and network teams blame each other by default, the next incident gets diagnosed even more slowly, because now everyone has to defend themselves first and solve the problem second. None of this shows up on the balance sheet. But it's in the room at every war-room call.

And none of it collects into a single cost line. The senior engineers' hours go to one account, the SLA penalty to another, the customer churn somewhere else entirely. The explanation owed to the regulator gets written down nowhere, until someone finally says, "let's at least start keeping a record of this, or find a fix." That's why it's invisible: because it's distributed, no one sees the total; because no one sees the total, no one owns it; because no one owns it, the same price gets paid year after year. Organisations call this "the natural cost of running critical systems" and normalise it. In fact, because it happens constantly, many have already baked it into the budget, which is exactly how it becomes a "normalised chronic" problem: one that every team knows about and has tacitly shaken hands on never solving. It is a kind of agreement that no one should bother with it anymore, and that whatever palliative was put in place will simply keep being applied.

We never made peace with that normalising, that shrugging-off, that "it was a momentary glitch" rhetoric, or that handshake on staying unsolved.

Why it only happens now and then

The reason this pattern repeats isn't that the teams are incompetent. Most of the time the people in that room are the organisation's best. (There's a junior engineer at the table too, and oh, how delightful these problems are for them... let's make that a separate article, so it doesn't get lost here.) The problem is that everyone is looking only at their own slice of the truth.

Think about what happens to a single customer request: it lands on the application, the application runs a database query, the query travels over the network to Oracle, Oracle does something inside, a response comes back, a piece of data changes along the way, maybe another transaction is waiting on that data. Maybe, in the most invisible part of all, it traces back to a missing index on the reference table that a constraint on the relevant table depends on. (Let those technical details be another article too.) Each link in this chain is watched by a different team, with a different tool, against a different clock. The application team's timestamps don't line up exactly with the database team's. Because of datasource pooling, the database sees one pooled user while the customer hitting the problem in the application is a different identity altogether. The query one team calls "slow" looks "normal" on the other's screen, because the two are looking at the same event through two separate windows, with no shared point of reference.
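To make those two windows concrete, here is a minimal Python sketch with invented data: one request as the application log might record it, and the same request as the database log might record it. The field names, the two-second clock skew and the pooled account name `APP_POOL` are all assumptions for illustration, not taken from any real system.

```python
from datetime import datetime, timedelta

# Hypothetical application-side record: the real customer, on the app server's clock.
app_event = {
    "customer": "customer-1042",
    "started": datetime(2024, 1, 1, 23, 0, 5),
    "elapsed_s": 9.0,   # what the customer experienced end to end
}

# Hypothetical database-side record: the pooled account, on the database server's clock.
CLOCK_SKEW = timedelta(seconds=2)   # assumed drift between the two servers
db_event = {
    "db_user": "APP_POOL",
    "started": app_event["started"] + CLOCK_SKEW,
    "elapsed_s": 0.4,   # time spent executing the query itself
}


def same_event_naive(app: dict, db: dict) -> bool:
    """Match on identity and exact start time, the only keys both logs seem to share."""
    return app["customer"] == db["db_user"] and app["started"] == db["started"]


matched = same_event_naive(app_event, db_event)
unexplained_s = round(app_event["elapsed_s"] - db_event["elapsed_s"], 1)

print("Matched as the same event:", matched)
print("Seconds neither log accounts for:", unexplained_s)
```

Both records are accurate, yet nothing in them says they describe the same request: the identities differ, the timestamps differ, and "slow" on one side is "normal" on the other.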

The result: a room where no one is lying, and yet no one can see the whole. Picture a niche carved into a black wall: if it's huge, everyone spots it; if it's tiny, who can? That is why the war room drags on for hours. Every team waits for someone believed to be neutral to step up and say, "Found it! The niche is right here." We'll deep-dive into this structural cause in the second article and take it apart piece by piece. For now, this much: it's less a missing tool than the absence of a shared truth.

Calculate your own tax

The most reliable way to make this tax concrete isn't to reach for industry averages; most of those numbers won't fit your organisation. The truest figure comes out of your own table. A few questions will do it:

How many times a year does a war room form around a critical system just "to figure out what happened"? On average, how many senior engineers sit in it, and for how long? What share of that time goes to the fix, and what share to the diagnosis, that is, to "finding who's guilty"? In how many of those incidents was a customer-facing flow affected? In how many was an SLA threshold breached?

(Of course there are Incident Management tools, and everything's logged in there... we're after something else.)

Answer those five questions honestly, multiply them out, and the number that emerges surprises most organisations. And that's only the direct cost. The asymmetric ones we haven't even added yet: the reputational damage one long outage causes in a customer-facing, human-facing system, or a single regulator finding. Those items are rare; but when one does land, it eclipses years of the invisible tax in a single stroke.
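As a rough sketch of that multiplication, here is one way to do it in Python. Every number below is a made-up placeholder, and the hourly rate and per-breach penalty are assumed inputs of our own on top of the five questions; swap in the figures from your own table.

```python
from dataclasses import dataclass


@dataclass
class WarRoomYear:
    """Answers to the five questions for one critical system (all values hypothetical)."""
    war_rooms_per_year: int         # Q1: how often a war room forms
    engineers_per_war_room: float   # Q2: average senior engineers in the room...
    hours_per_war_room: float       # ...and for how long, in hours
    diagnosis_share: float          # Q3: fraction of time spent on diagnosis (0..1)
    customer_facing_incidents: int  # Q4: incidents that hit a customer-facing flow
    sla_breaches: int               # Q5: incidents that breached an SLA threshold


def direct_tax(year: WarRoomYear, hourly_rate: float, sla_penalty: float) -> dict:
    """Multiply the answers out into a direct yearly cost.

    hourly_rate and sla_penalty are assumptions you supply. Customer-facing
    incidents are passed through as a count, because their price differs
    from one organisation to the next.
    """
    engineer_hours = (year.war_rooms_per_year
                      * year.engineers_per_war_room
                      * year.hours_per_war_room)
    diagnosis_hours = engineer_hours * year.diagnosis_share
    labour_cost = engineer_hours * hourly_rate
    sla_cost = year.sla_breaches * sla_penalty
    return {
        "engineer_hours": engineer_hours,
        "diagnosis_hours": diagnosis_hours,
        "diagnosis_cost": diagnosis_hours * hourly_rate,
        "labour_cost": labour_cost,
        "sla_cost": sla_cost,
        "customer_facing_incidents": year.customer_facing_incidents,
        "direct_total": labour_cost + sla_cost,
    }


# Placeholder answers: 12 war rooms a year, 6 engineers, 4 hours each,
# 75% of the time on diagnosis, 5 customer-facing incidents, 2 SLA breaches.
example = WarRoomYear(12, 6, 4, 0.75, 5, 2)
result = direct_tax(example, hourly_rate=100, sla_penalty=10_000)
print(result["engineer_hours"], result["diagnosis_hours"], result["direct_total"])
```

With these placeholder inputs, 288 engineer-hours a year go into war rooms, 216 of them on diagnosis alone. The asymmetric items are deliberately left out of the sum.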

"Customer satisfaction" sat right at the top of that year's corporate goals, by the way, and every department's push toward that goal took a hit from a place that looks entirely unrelated. Now every department, already held back from its own goals, also has to clear its own name. And the extra effort that clearing demands is, once again, more of the same tax...

Is it really worth untangling this seemingly unrelated web of cross-department dependencies? (We think it is.)

Why Oracle-Database systems specifically

This tax exists in every system; but in enterprise systems carrying Oracle Database it runs especially high and especially expensive. The reason is simple: these are usually the organisation's most-feared-to-touch core systems. They carry the core ledger, the policy records, the subscription and billing backbone. New campaign? Data and table relations grow. A regulatory request? Data and table relations grow...

Over the years they've grown layer upon layer; often the last person who understood them end to end retired long ago. And these days, because everyone tries to understand nothing outside their own remit (or rather, because applications are run through micro-responsibilities), anyone who does understand them end to end gets looked at as if they're mad. "We never assigned them that work. Look at this one, went off and learned it on their own! Someone else plans your time; don't go learning on a whim, just do and learn what we tell you, that's plenty!" (Let's take this up in another article too. Yes, I'm aware I'm venting.)

It's precisely this criticality that makes the blindness most expensive. Not knowing why some unimportant internal service is slow is tolerable. Not knowing why the end-of-day reconciliation didn't balance, at which step a claims payment got stuck, why a patient is still waiting in a hospital for provisioning approval, or why someone got their car back from the service centre a day late: that is not tolerable. The more critical the system, the higher the price of "not knowing what happened", and all of these systems are critical.

Why this series exists

For now, here's what we'd like you to know: we're writing this not to sell a product, but to name a tax we paid for years. We sat at that war-room table, on both the accusing and the accused side. At some point we couldn't stop asking: is it truly inevitable that these crisis days unfold the way they do, or is it only because we've accepted it?

In this series we'll talk about the problem first; honestly, end to end. One by one, we'll take up why the blame war is structural, why problem tracking goes dark at precisely the most critical point, why the other half of the truth (the "who changed what" question) always sits in a separate silo, and why visibility itself has turned into such an expensive luxury. We'll also talk about which tools don't solve these problems, and why.

An organisation can pay this tax for years and never notice. For the ones who do notice, the first step is to see it, to make it visible. This whole series is an attempt to put that invisible line item on the table.

What's next

In the next article we'll open with the four sentences we hear most often (the same content, really, just spoken by different layers): "Not middleware." "Not the database." "Not the application." "Not the network." If all four can be true at once, who's guilty? We'll talk about why the blame war isn't a character flaw, but a direct consequence of the way we build our systems.


This is the first in a series of articles on the observability and change-tracking problems faced by organisations running critical systems on Oracle.
