Hossein Zolfi

Posted on Jul 2

Your "DevOps Team" Is the Silo DevOps Was Supposed to Kill

#sre #devops #cicd #platformengineering

If your org chart has a "DevOps team" sitting between the people who write the code and the people who run it, stop and look at what you actually built. You didn't add DevOps. You relabeled the old Ops silo with a trendier name and moved the same wall one desk over. The developers still throw work over it; it just has a nicer sign now.

That's not a nitpick about job titles. It's the single most common way organizations think they've adopted DevOps while structurally guaranteeing they haven't. So it's worth being precise about what the term actually means, what it doesn't, and why the mistakes cluster around the same handful of patterns.

The short answer

DevOps is the set of principles, practices, and cultural norms that treat building software and running it in production as one continuous responsibility, instead of two departments that hand work to each other and then argue about who broke what. The name is a portmanteau of Development and Operations, and the entire point of the term is to erase the seam between them.

That's the one-sentence version. Here's what it actually means day to day, why it works, and where it gets misapplied.

What DevOps is not

It helps to start with the myths, because they're where most organizations go wrong.

It's not a job title. Hiring a "DevOps engineer" and putting them between the developers and the infrastructure is usually just relabeling the old Ops silo with a trendier name. If one person or one team now owns "the DevOps stuff," you've reconstructed the wall you were trying to tear down. The goal isn't a new specialist role; it's that the people who build a thing also carry responsibility for how it behaves in production.

It's not a tool or a tool stack. Kubernetes, Terraform, Jenkins, Docker: none of these make an organization "do DevOps." Large-scale research comparing delivery performance across companies has found almost no correlation between the type of system (legacy mainframe vs. greenfield microservices) and how well teams deliver software. The newest, trendiest architecture is no guarantee of anything if the underlying coupling and process problems aren't solved. What predicts performance is whether a team can test and deploy its own service independently, without waiting on another team's calendar, not which vendor's logo is on the YAML file.

It's not the absence of change control. A common misreading is that DevOps means "move fast, skip the approvals." In practice, the highest-performing teams still gate every change: they just replace a slow, low-context external review board with fast, high-context peer review plus an automated pipeline that runs the same checks on every single change, every time. That combination has been shown to outperform traditional approval boards on both speed and stability. Removing a bottleneck isn't the same as removing a check.

It's not confined to engineering. The deepest version of the idea is that technology stopped being a department a business can delegate and forget. It's more like literacy or basic math, a competency every part of an organization now needs, because when the systems fail, the business fails with them, regardless of which org chart box the systems lived in.

The core idea, in one framework: flow, feedback, learning

Most of what gets called "DevOps practice" traces back to three underlying principles. They were originally articulated for factory production lines and manufacturing flow, then translated into knowledge work, and they hold up as a mental model no matter which specific tools or ceremonies an organization uses.

1. Optimize the whole flow, not the local parts. A team that maximizes its own throughput while creating a pile of unfinished, unintegrated work for the next team downstream hasn't made anything faster. It's just moved the waiting somewhere less visible. The unit of optimization has to be the entire path from an idea to a customer actually using it, not any one stage of that path. In practice this means making all work visible, keeping batch sizes small so nothing sits half-finished for weeks, and relentlessly finding whatever single step is truly constraining the whole system's throughput, because improving anything except that bottleneck does nothing for end-to-end throughput. If it takes three weeks to provision an environment, shipping code twice as fast doesn't help; the environment is the bottleneck, and it needs to be fixed first.

2. Make problems visible immediately, and stop to fix them. In a complex system, failure is not a possibility to be prevented once and forgotten: it's a constant, ambient condition to be managed. The healthy response, borrowed from manufacturing floors, is a literal or figurative cord anyone can pull: when something breaks, the team stops taking on new work and swarms the problem until it's understood and contained, rather than patching around it and moving on. This feels expensive in the moment but is cheap overall, because the alternative is letting small, easily diagnosed problems compound into large, hard-to-diagnose ones. A broken build that nobody fixes for a day doesn't stay a one-line bug; it becomes three unrelated changes tangled together, each hiding the others' effects.

3. Build a culture where learning beats blame. The first two principles only function if people are willing to surface problems honestly, and that requires a culture where reporting a mistake doesn't end a career. This isn't a soft, optional add-on: it's the load-bearing precondition for the other two. A team that hides failures to protect itself politically cannot make problems visible, and a team that gets punished for pulling the stop cord will stop pulling it. The counterintuitive part is how you get this culture: not by hanging a poster of values in the break room or running a workshop, but by changing what people do. Adopt blameless postmortems, protect time for fixing root causes instead of just symptoms, and the trust follows the behavior. You don't need to fix people's beliefs first.
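
The bottleneck claim in principle 1 is easy to check with a few lines of arithmetic. What follows is a minimal sketch, not a model of any real organization: the stage names and weekly capacities are made up for illustration, and it assumes a strictly serial flow where every change passes through every stage.

```python
# Hypothetical stage capacities, in changes per week. The numbers are
# invented for illustration; only their relative sizes matter.
stages = {
    "code": 40,
    "review": 30,
    "provision_environment": 5,
    "deploy": 25,
}


def throughput(capacities):
    """A serial flow can only move as fast as its slowest stage."""
    return min(capacities.values())


def bottleneck(capacities):
    """Name of the stage that currently constrains the whole flow."""
    return min(capacities, key=capacities.get)


baseline = throughput(stages)

# Doubling coding speed leaves end-to-end throughput unchanged.
faster_coding = {**stages, "code": stages["code"] * 2}

# Doubling the constrained stage doubles end-to-end throughput.
faster_env = {
    **stages,
    "provision_environment": stages["provision_environment"] * 2,
}

print(bottleneck(stages))         # provision_environment
print(baseline)                   # 5
print(throughput(faster_coding))  # 5
print(throughput(faster_env))     # 10
```

Real delivery flows have rework loops and parallel paths, so the "minimum of the stages" rule is a simplification. The direction of the result is the point: effort spent anywhere but the constraint doesn't show up at the end of the line.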

What this looks like in practice

The principles above cash out into a fairly concrete set of practices. None of them is DevOps by itself; together, they're what makes the principles real instead of aspirational.

Continuous delivery. The ability to get any change, such as a feature, a config tweak, or a one-line bug fix, into production safely and quickly enough that releasing becomes routine rather than a quarterly event everyone dreads. This rests on automated testing that runs on every change, developers integrating their work into a shared mainline at least daily instead of working in long-lived branches, and treating a red build as an emergency, not background noise.
Loosely coupled architecture. However a system is decomposed, whether services, modules, or a well-organized monolith, the property that actually matters is whether a team can change and ship its piece without needing permission or fine-grained coordination from every other team. This is consistently the single largest lever on delivery speed, larger than any specific automation tooling. Using "microservices" doesn't automatically buy this: services that all read and write the same shared data are just a monolith with extra network calls and extra latency.
Infrastructure as code. Defining servers, networks, and configuration as version-controlled, reviewable, testable code instead of manual console clicks or one-off scripts. This turns infrastructure changes into the same kind of reviewable, revertible, auditable artifact that application code already is.
Deployment pipelines. An automated, multi-stage path that every change travels from commit to production, running increasingly rigorous checks at each stage. This is where governance actually lives in a mature setup: not in a meeting where a committee eyeballs a spreadsheet, but in code that runs identically, every time, on every change.
Visible work and limited work-in-progress. Dashboards and boards that show what's actually in flight, combined with hard limits on how much can be in flight at once. Counterintuitively, limiting work-in-progress increases throughput, because queueing time grows without bound as a system approaches 100% utilization: with no slack, every new piece of work makes everything already in flight wait longer.
Security built in, not bolted on. Security review and automated scanning integrated into the same pipeline as everything else, rather than a separate gate that shows up right before launch with a list of findings nobody has time to fix. Teams that do this consistently show both faster delivery and fewer security incidents: it isn't a speed-versus-safety trade-off, it's the same "catch it early, catch it cheap" logic applied to a different category of defect.
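
The utilization point in the work-in-progress item can be made concrete with the textbook single-server queue. This is a sketch under a loud assumption: it uses the M/M/1 model (random arrivals, random service times, one server), which real teams only loosely resemble. The function name is hypothetical; the formula, mean wait equals rho / (1 - rho) times the mean service time, is the standard M/M/1 result.

```python
def queue_wait_multiplier(utilization):
    """Mean time spent waiting in queue, in multiples of the mean
    service time, for an M/M/1 queue: rho / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization)


# Wait time climbs slowly at first, then explodes near full utilization.
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    wait = queue_wait_multiplier(rho)
    print(f"{rho:.0%} busy -> wait is about {wait:.0f}x the work time")
```

At 50% utilization a task waits roughly as long as it takes to do; at 99% it waits roughly 99 times as long. That curve is the case for leaving slack in the system instead of filling every hour.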

It's measurable, and the measurements are counterintuitive

One of the things that separates this body of practice from a lot of management fashion is that it has been rigorously studied at scale, across thousands of organizations, using validated survey instruments rather than anecdote. Four metrics have proven to reliably distinguish high- from low-performing organizations:

Lead time: how long from a change being committed to it running in production.
Deployment frequency: how often changes reach production.
Time to restore service: how quickly the team recovers when something breaks.
Change failure rate: what fraction of changes cause a problem requiring a fix.

The first two describe tempo; the last two describe stability. The finding that keeps getting rediscovered and keeps surprising people is that these are not in tension. Organizations that are fast are also stable, and organizations that are slow are also unstable: there's no meaningful cluster of "fast but reckless" or "slow but rock-solid" teams. The old assumption that you must trade velocity for safety, and that a change-approval board and a slow release cadence buy you reliability, doesn't hold up against the data. What actually buys reliability is small changes, fast feedback, and automated verification, the same things that buy speed.
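
If you want to see where your own team sits, the four metrics can be computed from a plain log of deployments. Here is a minimal sketch, assuming you can export one record per deployment. The `Deployment` record, its field names, and the `four_metrics` function are all hypothetical, and time to restore is approximated here as "deploy to restore," which is an assumption; a real setup would more likely measure from when the incident was detected.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                    # caused a problem needing a fix
    restored_at: Optional[datetime] = None  # when service was restored


def four_metrics(deployments, window_days):
    """Compute the four delivery metrics over a window of deployments."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at
    ]
    return {
        "median_lead_time": median(lead_times),
        "deploys_per_day": len(deployments) / window_days,
        "median_time_to_restore": median(restore_times) if restore_times else None,
        "change_failure_rate": len(failures) / len(deployments),
    }


# Made-up sample data: four deployments in one week, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=4)),
    Deployment(
        t0 + timedelta(days=2),
        t0 + timedelta(days=2, hours=1),
        failed=True,
        restored_at=t0 + timedelta(days=2, hours=1, minutes=30),
    ),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=3)),
]

metrics = four_metrics(sample, window_days=7)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Medians are used instead of means so a single outlier release doesn't dominate the picture. That is a design choice in this sketch, not something the metric definitions require.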

A related finding worth internalizing: don't try to measure your way to a "mature" end state. A maturity model implies a fixed ladder everyone climbs in the same order to reach a finish line called "done." A capability model instead treats improvement as continuous and context-dependent, where different teams need different levers depending on where their constraint actually is, and the bar for what counts as "good" keeps rising as the whole industry improves. There is no finish line, and pretending there is one is itself a source of complacency.

How it relates to nearby ideas

A few terms get used interchangeably with DevOps but are really more specific applications of it, or close relatives, and are worth distinguishing:

Site Reliability Engineering is best understood as one particular, unusually rigorous implementation of the same underlying principles, with its own vocabulary. Its best-known concept, the error budget (the acceptable amount of unreliability over a time window, spent deliberately to buy release velocity), turns "how reliable should this be" from a political argument into arithmetic.
Platform engineering is the organizational answer to the question "what happens once DevOps needs to scale past a single team?" If every team owns everything end to end, every team eventually re-solves the same infrastructure problems badly. A platform team's job is to absorb that shared complexity behind a self-service interface, run as a product, with usability, a roadmap, and real users, not as a pile of internal tools nobody asked for. Done badly, a platform team becomes exactly the bottleneck DevOps was trying to remove; the discipline that keeps that from happening is building the smallest platform that actually helps, and no bigger.
DevSecOps extends the same "build it in, don't bolt it on" logic to a third party at the table besides Dev and Ops: security. Same argument, same evidence pattern, one more silo dissolved.
Agile is a close cousin but not a synonym. Agile's classic feedback loop mostly stops at "code complete": did we build the right thing, are we iterating on the backlog. DevOps picks up exactly where that loop stops and extends it through deployment and into production: does the thing actually work when real users touch it, and can we find out fast enough to matter.
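
The "political argument into arithmetic" claim about error budgets is meant literally. A minimal sketch, assuming an availability target measured in minutes of downtime over a rolling window; the function names, the 99.9% target, and the 30 minutes of downtime are illustrative assumptions, not anyone's official policy.

```python
def error_budget_minutes(slo_target, window_days):
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target, window_days, downtime_minutes):
    """Minutes of budget left after the downtime already incurred."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


# A 99.9% target over 30 days allows roughly 43 minutes of downtime.
budget = error_budget_minutes(0.999, 30)

# Hypothetical: 30 minutes of downtime so far in this window.
left = budget_remaining(0.999, 30, downtime_minutes=30)

# One possible policy: keep releasing while any budget remains.
can_release = left > 0

print(f"budget: {budget:.1f} min, remaining: {left:.1f} min, release: {can_release}")
```

Once the number exists, the conversation changes from "is this release too risky?" to "do we have budget left to spend on it?", and that is a question both Dev and Ops can answer from the same dashboard.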

Common failure modes

A few patterns show up often enough to be worth naming directly.

Renaming the silo instead of removing it. This is the one we opened with: creating a "DevOps team" that developers throw work over the wall to is the antipattern the whole movement exists to prevent, wearing the movement's own name as camouflage. It's the single most common way an organization can point at a team, a title, and a Slack channel and genuinely believe it has "done DevOps" while the actual handoff problem is untouched.

The rest of the failure modes tend to follow from the same root cause, treating DevOps as something you install rather than something you change:

Buying tools before fixing coupling. An organization with a tightly coupled architecture that adopts container orchestration and calls it done will find its deployment cadence barely moves, because the actual constraint, teams that can't ship without coordinating with three other teams, was never touched.
Treating the four metrics as a stick. The moment delivery metrics get used to rank individuals or teams punitively, people start gaming them, and the numbers stop reflecting reality. They're diagnostic instruments for finding where to invest, not a scoreboard for performance reviews.
Over-building the platform. Left unchecked, engineers building an internal platform will keep bolting on features nobody asked for, because building infrastructure scratches an itch that shipping features never quite does. The discipline is to build the thinnest thing that actually removes friction, and grow it only when a real, felt need shows up.
Skipping the culture work because it's the hardest to measure. It's tempting to adopt the tooling and the pipeline and skip the harder, fuzzier culture piece. But the pipeline only works if people trust it enough to let a red build actually stop them, and that trust is exactly what a blame-first culture destroys.

The steel-manned objection: isn't specialization just how orgs scale?

Here's the strongest version of the counter-argument, and it deserves a real answer rather than a strawman: don't organizations specialize as they grow precisely because generalists don't scale? Nobody expects every engineer to also be a security expert or a database administrator: those are dedicated teams, and nobody calls that a silo problem. Why should "runs the infrastructure" be any different? A dedicated platform or DevOps team could just be sane division of labor, not a regression.

That argument is genuinely right, under one specific condition. It wins when the specialist team operates as a self-service platform that other teams pull from on their own schedule, the same way a team pulls in a well-documented open-source library: no ticket, no queue, no waiting for someone else's sprint. It loses the moment that team becomes a gate other teams must line up behind to ship anything: a mandatory review, a provisioning request, a "file a ticket and we'll get to it." The difference isn't the org chart box the specialists sit in; it's whether using their work requires asking permission. A security team that ships a self-serve scanning library is fine. A security team that must personally approve every deploy is the silo again, just wearing a different department's name.

The underlying bet

Strip away the vocabulary and DevOps is a bet about where value gets created: not in the moment code is written, but in the moment it's running, serving real users, generating real feedback. Everything else, the pipelines, the metrics, the culture work, the org design, exists in service of shortening the distance between "we had an idea" and "we know whether the idea was any good," as many times a day as the organization can stand. Speed and safety turn out not to be opposite ends of a dial you have to choose a position on; they're both downstream of the same thing, which is how tight that feedback loop actually is.

So if you're looking at your own org chart right now: is your "DevOps team" a self-service platform other teams pull from freely, or is it a queue they wait in? That answer tells you more about whether you've actually adopted DevOps than any tool in your stack will.

Drop a comment with your answer. I especially want to hear the queue stories; those are the ones worth fixing.

DEV Community