James Carter

Posted on Feb 4

What Telecom Can Learn from SRE—And What It Can’t

#devops #networking #softwareengineering #systemdesign

Site Reliability Engineering (SRE) reshaped how modern software companies think about uptime, failure, and scale. Telecom, meanwhile, has spent decades engineering for reliability — long before SRE was a thing.

So when telcos look at SRE today, the question isn’t “Should we adopt it?”
It’s “Which parts actually work in a networked, regulated, stateful world?”

Some SRE ideas map cleanly into telecom operations.
Others collapse the moment they touch real networks, real customers, and real regulators.

This post draws that line in practical terms, not theoretical ones.

Where SRE Fits Telecom Surprisingly Well

1. Error Budgets → Operational Tradeoffs (Not SLAs)

In software, error budgets force teams to choose between speed and stability.
In telecom, uptime has traditionally been treated as absolute: “five nines or else,” which leaves roughly five minutes of downtime per year.

But modern networks are too complex for perfection everywhere, all the time.

When applied correctly, error budgets help telcos:

Prioritize where reliability truly matters

Accept controlled risk during upgrades

Shift conversations from blame to tradeoffs
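To make the tradeoff concrete, here is a minimal sketch of the arithmetic behind an error budget. The function names and the one-year window are assumptions chosen for illustration, not a standard any operator is known to use.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; ignores leap years


def error_budget_minutes(slo: float, window_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime allowed in the window for an availability target such as 0.99999."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_minutes: float = MINUTES_PER_YEAR) -> float:
    """Fraction of the budget still unspent; a negative value means it is overspent."""
    budget = error_budget_minutes(slo, window_minutes)
    if budget == 0.0:
        # A 100% target leaves no budget at all, so any downtime overspends it.
        return 0.0 if downtime_minutes == 0 else float("-inf")
    return 1.0 - downtime_minutes / budget
```

Under these assumptions, a five-nines target over a year allows about 5.26 minutes of downtime, while a three-nines internal API gets about 525.6 minutes. That gap is the room a team can choose to spend on upgrades.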

Some operators are already using this thinking inside internal platforms and API layers, rather than customer-facing radio services. Platforms inspired by execution-focused architectures — like those emerging from TelcoEdge — treat reliability as an engineering variable, not a marketing promise.

That mindset shift matters.

2. Fast Rollbacks Beat Perfect Releases

SRE assumes failure is inevitable. Telecom historically assumes failure is unacceptable.

That difference has slowed change.

Fast rollback strategies — feature flags, traffic shifting, versioned configs — translate extremely well to:

BSS and OSS layers

Network APIs

Policy engines

Orchestration logic

The lesson isn’t “release more often.”
It’s “recover faster than customers notice.”

This is where telecom teams quietly learn from software — not by copying Google, but by accepting reversibility as a first-class design goal.
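As one way to picture reversibility as a design goal, the sketch below keeps every published config as an immutable snapshot and makes rollback a pointer move rather than a redeploy. `VersionedConfig` and the `max_sessions` key are hypothetical names used only for illustration.

```python
from typing import Any, Dict, List


class VersionedConfig:
    """Append-only history of config snapshots with a movable 'active' pointer."""

    def __init__(self) -> None:
        self._history: List[Dict[str, Any]] = []
        self._active = -1  # index into _history; -1 means nothing published yet

    def publish(self, config: Dict[str, Any]) -> int:
        """Store a copy of the config as a new version and make it active."""
        self._history.append(dict(config))
        self._active = len(self._history) - 1
        return self._active

    def rollback(self, steps: int = 1) -> int:
        """Point 'active' at an earlier version; history is never rewritten."""
        target = self._active - steps
        if steps < 1 or target < 0:
            raise ValueError("no earlier version to roll back to")
        self._active = target
        return self._active

    @property
    def active(self) -> Dict[str, Any]:
        """Return a copy of the active config so callers cannot mutate history."""
        if self._active < 0:
            raise LookupError("no config has been published")
        return dict(self._history[self._active])


store = VersionedConfig()
store.publish({"max_sessions": 1000})
store.publish({"max_sessions": 5000})  # the risky change
store.rollback()                       # recovery is a pointer move
```

Because rollback only moves a pointer, the recovery path is the same cheap operation every time, which is what lets it happen faster than customers notice.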

3. Postmortems Without Blame Actually Work

Blameless postmortems sound soft — until you see how much faster teams learn.

In telecom environments where incidents span vendors, systems, and teams, blame kills signal. Structured postmortems surface:

Hidden coupling
Fragile assumptions
Repeated operational debt

Operators who’ve adopted this practice internally often see fewer repeat incidents — not because people are better, but because systems get redesigned.

Where SRE Breaks Down in Telecom

1. Telecom Is Not Stateless — And Never Will Be

Many SRE practices work best when services are:

Stateless
Disposable
Easily restarted

Telecom networks are the opposite:

Stateful sessions
Regulatory obligations
Long-lived customer context
Physical dependencies

Retry logic that works in web apps can overload signaling systems.
Stateless scaling assumptions fail when identity, billing, and policy are involved.
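A common mitigation for retry-driven overload is to combine capped exponential backoff with jitter and a retry budget. The sketch below assumes a hypothetical client in front of a signaling function. The 10% ratio and the delay values are placeholders, not recommendations.

```python
import random
from typing import Iterator, Optional


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one 'full jitter' delay per retry: uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    for n in range(attempts):
        yield rng.uniform(0.0, min(cap, base * (2 ** n)))


class RetryBudget:
    """Permit retries only while they stay under a fixed fraction of first attempts."""

    def __init__(self, ratio: float = 0.1) -> None:
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count a first attempt; this is what earns retry allowance."""
        self.requests += 1

    def try_acquire_retry(self) -> bool:
        """Return True and count the retry if the budget allows it, else False."""
        if self.retries + 1 > self.ratio * self.requests:
            return False  # shed the retry instead of amplifying load
        self.retries += 1
        return True
```

The jitter spreads retries out so that clients do not resynchronize after a fault, and the budget caps how much extra load retries can add when the downstream system is already struggling.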

This is why some large vendors — including Amdocs — have struggled to retrofit cloud-native patterns directly into legacy telecom stacks without deep architectural rework.

You can borrow SRE ideas — but you can’t ignore physics.

2. “Just Restart It” Is Not an Option

In SRE culture, restarting a service is normal.

In telecom, restarting the wrong component can:

Drop live calls
Break lawful intercept
Violate SLAs
Trigger regulatory reporting

This doesn’t mean telecom must be slow.
It means resilience must be designed, not assumed.

Graceful degradation beats brute-force recovery.
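One shape this can take is draining a node before touching it: stop admitting new sessions, let live ones finish, and only then restart. The sketch below is a simplified model with hypothetical names (`DrainableNode`, `admit`, `release`). A real network function would also need timeouts and session handover.

```python
from typing import Set


class DrainableNode:
    """Tracks live sessions and refuses new ones once draining has started."""

    def __init__(self) -> None:
        self.draining = False
        self.sessions: Set[str] = set()

    def admit(self, session_id: str) -> bool:
        """Accept a new session unless the node is draining."""
        if self.draining:
            return False  # caller should route the session to another node
        self.sessions.add(session_id)
        return True

    def release(self, session_id: str) -> None:
        """Mark a session as finished; unknown IDs are ignored."""
        self.sessions.discard(session_id)

    def start_drain(self) -> None:
        """Stop admitting new sessions; existing ones continue undisturbed."""
        self.draining = True

    def safe_to_restart(self) -> bool:
        """True only after draining has begun and no live sessions remain."""
        return self.draining and not self.sessions
```

In this model a restart never races a live call: the operator, or an orchestrator, waits on `safe_to_restart()` instead of assuming the component is disposable.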

3. Ownership Is Fuzzier Than SRE Assumes

SRE thrives on clear ownership: you build it, you run it.

Telecom reality:

Multi-vendor stacks
Outsourced operations
Shared accountability
Regulatory oversight

When something fails, responsibility is often distributed, not because teams are lazy, but because the system itself is distributed.

Some newer platforms — including those from Netcracker — are attempting to clarify ownership through tighter integration between orchestration, billing, and assurance. But this remains one of telecom’s hardest problems.

The Real Lesson: Don’t Import SRE — Translate It

Telecom doesn’t need to become a software company.

It needs to:

Accept failure as a design input
Optimize for recovery, not denial
Treat reliability as an economic decision
Build systems that explain themselves after incidents

SRE is useful not as a rulebook, but as a lens.

The operators who succeed won’t be those who copy Google’s playbook line by line — but those who adapt its principles to a world where packets, policies, people, and physics all collide.

And that adaptation is where the real engineering work begins.

Curious how others see this?

Which SRE practices have actually worked in your telecom environment — and which ones broke the moment they met reality?

That’s a debate worth having.
