Engineering Systems · Delivery Infrastructure
The tooling rarely fails you. What fails you is the slow, invisible accumulation of every organizational decision you were never quite ready to make explicitly, encoded instead as pipeline logic.
There is a particular kind of pain that arrives without incident. No post-mortem, no outage, no failed deploy that anyone can point to. Just a gradual, almost polite degradation in the thing the pipeline was supposed to guarantee: speed. And by the time it's undeniable, the culprit isn't a misconfigured job or a broken runner. The culprit is the organization itself—its history, its politics, and its unresolved questions about who owns what—crystallized into YAML.
This is not a failure of tooling. Jenkins, GitHub Actions, ArgoCD, Tekton—none of them are innocent bystanders, but none of them are fundamentally guilty either. The problem runs deeper than any particular orchestrator.
The early days feel like flying
When a team first stands up a CI/CD pipeline, it genuinely works the way the marketing materials promise. One repository. One service. A push triggers a build, tests run, an artifact lands in production. The whole thing fits in one engineer's head, start to finish. The cognitive overhead is negligible. The feedback loop is measured in minutes, sometimes seconds.
This is not a golden age to be nostalgic for—it's just a reflection of organizational simplicity. The pipeline is small because the company is small. One operational model. One deployment path. No contested ownership.
"Every organizational requirement eventually becomes pipeline logic. This is not a design decision. It's entropy."
Then growth happens. And growth, in a software organization, is rarely just more engineers writing more code. It's more teams with different risk tolerances. More services with different deployment cadences. More stakeholders—security, compliance, release management, operations—each with legitimate claims on the delivery process. Each with something to add.
The accumulation problem: how pipelines absorb institutional memory
What happens next is incremental enough that no single person decides to make it happen. A security scan becomes mandatory after an audit surfaces a dependency vulnerability. A manual approval gate gets added after a production incident—temporary, everyone agrees, just until we sort out the rollback story. An additional deployment stage materializes to support a new environment. A validation script runs to verify something that no longer needs verifying, because the system it was checking was decommissioned eighteen months ago. Nobody removes it because nobody is certain it's safe to remove.
Each of these additions is defensible in isolation. Some are legally mandated. Some reflect genuinely hard-won operational wisdom. But the pipeline doesn't distinguish between "we need this forever" and "we needed this in Q3 2022." It just runs.
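One way to make that distinction visible is to record, next to each stage, why it exists and when it should be looked at again. The sketch below is illustrative only: `StageNote`, `review_by`, and the stage names are hypothetical, not features of any particular CI system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class StageNote:
    """Hypothetical metadata kept alongside a pipeline stage."""
    name: str
    reason: str
    review_by: Optional[date]  # None means "reviewed and kept indefinitely"


def overdue_for_review(notes: list[StageNote], today: date) -> list[str]:
    """Return the names of stages whose review date has passed."""
    return [
        n.name
        for n in notes
        if n.review_by is not None and n.review_by < today
    ]


# Example data (invented for illustration).
notes = [
    StageNote("dependency-scan", "mandated after audit", None),
    StageNote("manual-approval", "temporary, pending rollback story", date(2022, 9, 30)),
    StageNote("legacy-validation", "checks a decommissioned system", date(2023, 1, 31)),
]

overdue = overdue_for_review(notes, today=date(2024, 6, 1))
print(overdue)
```

The design choice here is that "forever" has to be stated explicitly (`review_by=None`), so a temporary stage cannot become permanent simply because nobody revisited it.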
The pipeline becomes a fossil record. Sedimentary layers of organizational evolution, compressed into a workflow definition.
The insidious part is the pace. This doesn't happen in a sprint. It happens over years, in increments small enough that no single quarterly review flags it. The pipeline adds thirty seconds here, a manual approval there, and a new environment promotion stage somewhere else. By the time the accumulated friction becomes visible—delivery cycles stretching from hours to days and engineers spending more time shepherding deployments than writing code—the debt is enormous and the causes are diffuse.
A real pattern, no lab required
Ask any senior engineer who's been at a company through a growth phase: they can almost always name the specific incident that spawned a specific pipeline stage. "Oh, that approval gate? That was after the billing service incident in 2021." The stage outlives the incident by years. The person who added it has often left the company.
The pipeline remembers everything. The organization forgets.
Automation conceals what it cannot fix
One of the more unsettling properties of mature CI/CD systems is how healthy they appear. Deployments complete. Checks pass. The dashboard is green. Nothing in the observability stack is screaming. And so the assumption forms: the system is working.
What automation is actually doing, in many of these cases, is compensating. Quietly, continuously, it compensates for structural complexity that was never resolved, only automated around. Duplicated workflow logic in three different pipeline definitions that mostly agree with each other. Permission scopes that expanded to unblock an incident and were never tightened again. Rollback procedures that work differently depending on which team owns the service because there was never a platform-level standard that stuck. Ownership boundaries so blurred that when something breaks, the first fifteen minutes of the incident are spent figuring out whose problem it is.
The automation smooths over the surface. The fractures are underneath. And fractures underneath an automated system are harder to find than fractures in a manual one, because the automation keeps producing outputs that look like success right up until it doesn't.
"Complex organizations can produce highly automated bottlenecks. The sophistication of the tooling is no protection against the murkiness of the organization running it."
This is the failure mode that's hardest to argue against in a planning meeting, because the metrics look fine. Deployment frequency might still be high. The mean time to recovery might be acceptable. The DORA metrics haven't collapsed yet. What's eroding is something less legible: the organization's ability to understand its own system. The cognitive load required to operate it keeps rising, and the number of people who can actually explain, end to end, what the deployment workflow does and why keeps shrinking.
The cognitive load problem is the real velocity problem
Modern CI/CD systems at scale may touch container orchestration, infrastructure provisioning, secrets management, artifact signing, policy enforcement, deployment promotion logic, canary rollout configuration, runtime validation, and dependency sequencing—often across multiple platforms, often with different teams owning different slices. Individually, each of these layers is manageable. A platform engineer who lives in one layer can explain it clearly. But the pipeline as a whole? The pipeline as a system?
Nobody holds that model in their head anymore. What you get instead is a fragmented understanding: this team knows how their services deploy, this other team knows the infrastructure provisioning piece, the security team knows where their scans plug in. Nobody has the full picture. And that's fine, right up until something goes wrong in the space between the pieces.
Debugging becomes archaeology. An engineer follows a failed deployment back through its logs, through three different pipeline systems, through IAM roles and network policies and artifact store configurations, and eventually arrives at a stage added by someone who left two years ago, running a validation against a contract that no longer exists. This takes hours. Sometimes days. The fix takes twenty minutes.
Velocity doesn't decline because the pipeline is technically slow. It declines because comprehension becomes expensive.
Onboarding new engineers into a mature delivery system is its own separate cost—one that compounds as the system grows. The documentation, if it exists at all, trails reality by months. The institutional knowledge lives in Slack threads and the heads of senior engineers who are already stretched thin. New hires learn by watching and by making careful mistakes and by asking questions that reveal, gradually, how much undocumented behavior there is.
Approval layers: where organizational hierarchy becomes delivery latency
The approval gate is a fascinating artifact. It appears as a technical object—a stage in a pipeline, a configuration in a YAML file—but it is fundamentally a social object. It encodes a judgment that automation cannot or should not make. And it is rarely removed once added, because removing an approval gate requires someone to take responsibility for the risk that the gate was managing. Nobody wants that responsibility. The gate stays.
In many organizations, the deployment path looks something like this: code merges, automated checks run, a security approval is required, a release management sign-off is required, an operational review happens, compliance validation occurs, and then, finally, the artifact ships to production. Each of these approvals has a legitimate origin story. Some of them remain legitimate. Some of them have outlived their purpose. Which of them fall into which group is genuinely unclear.
What's unambiguous is the coordination cost. Delivery speed is now a function of organizational availability. Does the security approver have a meeting? Is the release manager on PTO? Did the compliance review queue get backlogged by three other teams trying to ship at the same time? The pipeline is fast. The humans in it are not and cannot be—they have other work. The bottleneck is no longer infrastructure. It is communication architecture.
The hidden cost that doesn't show up in deployment metrics
Approval latency is rarely measured. Teams track deployment frequency, lead time, and MTTR. They do not typically track how long a deployment sat waiting for a human to click a button. That gap is invisible to most dashboards, which means it accumulates invisibly.
One senior SRE I know ran the numbers at a previous company and found that roughly 40% of their lead time was approval waits. The pipeline itself was fast. The organizational choreography was not.
The standardization trap and the flexibility trap
Platform teams eventually arrive at a question with no clean answer: how much should the pipeline be standardized? Push hard toward standardization, and you get consistency, a reduced maintenance burden, and clear ownership, but teams with legitimate edge cases start building workarounds, and workarounds accumulate their own debt. Push toward flexibility, and every service has its own bespoke deployment story, which is operationally coherent to the team that built it and completely opaque to anyone else who needs to touch it at 2 a.m. during an incident.
Neither extreme works. This is not a solvable problem; it is a tension to be managed. The best platform teams I've seen treat the pipeline like an operating system: a stable, well-understood core with documented extension points. Standardized foundations. Localized extensibility. The key word is "documented"—not just the extension points, but the reasoning behind what's in the core and what isn't.
What tends to happen instead is that the standardization/flexibility question is never made explicit. Different teams make different implicit choices. The result is a pipeline ecosystem that is neither consistently standardized nor usefully flexible—just variegated in ways that are hard to reason about.
The pipeline as organizational mirror
Here is the uncomfortable claim, and I believe it's true: if you want to understand how a software organization actually operates—not how it says it operates, not the org chart, not the Confluence page about team responsibilities, but how decisions actually get made and who actually owns what—look at the deployment workflow.
A fragmented pipeline, one with inconsistent ownership and unclear responsibilities and stages nobody remembers adding, reflects a fragmented organization. The pipeline didn't create the fragmentation; it inherited it. Siloed decision-making produces siloed deployment logic. Unclear ownership of services produces unclear ownership of deployment paths. An organization that resolves ambiguity by adding process produces a pipeline that resolves ambiguity by adding stages.
Conversely, a clean pipeline—modular, well-owned, observable, fast—is almost always downstream of strong platform governance and genuinely shared engineering standards. Not imposed standards, where a platform team decrees a process and teams comply resentfully, but standards that emerged from actual alignment about how the organization wants to ship software. That alignment is hard to build. It's harder to maintain.
"The pipeline is rarely the root problem. It is the visible symptom of deeper organizational structure—and therefore the worst place to start trying to fix things."
Why adding automation makes it worse before it makes it better
The instinctive response to delivery friction is more automation. More orchestration. More intelligent policy engines. More script layers that abstract the complexity. And sometimes this is correct—there are genuinely manual processes that should be automated, genuinely tedious work that machines should absorb.
But automation cannot resolve unclear ownership. It cannot fix conflicting incentives between teams. It cannot substitute for the organizational conversation about who is responsible for what. What it can do is make those unresolved questions harder to see. The system becomes more sophisticated and less comprehensible simultaneously, which is a bad trade.
There's a version of this I've seen several times. An organization identifies slow deployments as a problem, invests heavily in pipeline automation, and achieves technically fast deployments. Then it discovers that approval-driven coordination overhead is the actual bottleneck, and that the new automation made it less visible, not more: the automated stages now complete in seconds, so the human stages look like anomalies rather than load-bearing structural elements.
What to actually change on Monday morning
Assuming you recognize your pipeline in some of this—the fossil record stages, the approval latency, the knowledge fragmentation, the green dashboards hiding structural fragility—what's worth doing?
Start by measuring what you're not measuring. Specifically: approval latency. For every human gate in your pipeline, measure how long deployments sit waiting. This number, which most teams don't track, is often the single most actionable finding. It tells you where organizational friction is actually located, which is frequently not where you assumed.
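As a rough sketch of what that measurement could look like, assume you can export one record per human gate with the time approval was requested and the time it was granted. The record shape, gate names, and timestamps below are all hypothetical; real CI systems expose this data in different ways, if at all.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical export: (deployment id, gate name, requested at, approved at).
events = [
    ("deploy-101", "security", datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 13, 0)),
    ("deploy-101", "release-mgmt", datetime(2024, 3, 4, 13, 5), datetime(2024, 3, 5, 10, 5)),
    ("deploy-102", "security", datetime(2024, 3, 6, 15, 0), datetime(2024, 3, 6, 16, 0)),
]


def wait_per_gate(events) -> dict[str, timedelta]:
    """Sum the time deployments spent waiting at each human gate."""
    totals: dict[str, timedelta] = defaultdict(timedelta)
    for _deploy_id, gate, requested, approved in events:
        totals[gate] += approved - requested
    return dict(totals)


def approval_share(total_wait: timedelta, lead_time: timedelta) -> float:
    """Fraction of a deployment's lead time spent waiting on approvals."""
    return total_wait / lead_time


waits = wait_per_gate(events)

# Slowest gates first: this is where the organizational friction sits.
for gate, wait in sorted(waits.items(), key=lambda kv: kv[1], reverse=True):
    print(gate, wait)
```

Summing per gate rather than per deployment is deliberate: it points at the queue, not at the individual release that happened to get stuck in it.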
Run a pipeline archaeology session. Get the people who know the most about your deployment system in a room, or on a call, and walk through every stage. For each one, ask: why does this exist? Who owns it? What breaks if we remove it? You will find stages that nobody can explain. You will find stages that were added to address problems that no longer exist. You will find stages whose purpose is clear but whose ownership is contested. All of this is useful information.
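The output of such a session is easier to act on if it is written down in a consistent shape. A minimal sketch, assuming one record per stage that captures the answers to the three questions; `StageAudit`, its fields, and the category labels are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StageAudit:
    """Answers collected for one pipeline stage during the session."""
    name: str
    purpose: Optional[str]  # None: nobody in the session could explain it
    owners: list[str] = field(default_factory=list)
    problem_still_exists: bool = True


def classify(stage: StageAudit) -> str:
    """Sort a stage into one of the finding types from the walkthrough."""
    if stage.purpose is None:
        return "unexplained"
    if not stage.problem_still_exists:
        return "obsolete"
    if len(stage.owners) != 1:
        return "ownership unclear"
    return "keep"


# Example records (hypothetical stage and team names).
stages = [
    StageAudit("contract-check", None),
    StageAudit("legacy-validation", "verify old billing API", ["platform"], False),
    StageAudit("prod-approval", "gate risky releases", ["release-mgmt", "sre"]),
    StageAudit("dependency-scan", "catch vulnerable packages", ["security"]),
]

findings = {s.name: classify(s) for s in stages}
print(findings)
```

The order of the checks mirrors the order of the questions: a stage with no known purpose is flagged before anyone argues about who owns it.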
Treat cognitive load as a first-class metric. Ask engineers who are not on the platform team to explain, without help, what happens when they merge code to main. How far can they get before they hit uncertainty? Where does their mental model of the system end? The gap between where their model ends and where the pipeline ends is operational risk.
Don't automate your way out of an ownership problem. If a stage exists because nobody has agreed on who's responsible for a decision, adding automation to that stage doesn't resolve the ownership question—it encodes the ambiguity. Have the conversation first. It's slower and harder than writing a script. It's also the thing that actually fixes the problem.
Consider the blast radius of pipeline failures. One failing workflow should not stall every service in the organization. If your pipeline is monolithic, with one shared workflow that everything runs through, you have a single point of failure that no uptime SLA can fully mitigate. Modular, owned-per-team pipeline components with a thin shared core are a resilience pattern, not just an organizational preference.
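Blast radius can be estimated with very little machinery once you know which workflow components each service's deployment passes through. The mapping below is a hypothetical example, not a description of any real system.

```python
# Hypothetical mapping: service -> workflow components its deployment depends on.
depends_on: dict[str, set[str]] = {
    "billing": {"shared-core", "team-payments-deploy"},
    "checkout": {"shared-core", "team-payments-deploy"},
    "search": {"shared-core", "team-search-deploy"},
    "docs-site": {"shared-core"},
}


def blast_radius(depends_on: dict[str, set[str]], failed_component: str) -> list[str]:
    """Return the services that cannot deploy while a component is broken."""
    return sorted(
        service
        for service, components in depends_on.items()
        if failed_component in components
    )


print(blast_radius(depends_on, "team-search-deploy"))
print(blast_radius(depends_on, "shared-core"))
```

In this example a failure in a team-owned component stalls only that team's services, while a failure in the shared core still stalls everything, which is the argument for keeping the core thin.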
The thing that's actually being measured
CI/CD pipelines do not become bottlenecks because engineers designed them poorly. They become bottlenecks because organizations grow in ways that are impossible to fully anticipate, and the pipeline absorbs the complexity that the organization hasn't yet resolved at the human level. It is one of the places where the gap between "how we say we work" and "how we actually work" becomes most legible.
The most effective engineering organizations understand something that sounds simple but is genuinely hard to operationalize: the speed of software delivery is bounded less by automation than by how much complexity humans can coordinate. You can have extraordinary tooling and still ship slowly if the organizational layer above the tooling is unclear, contested, or overloaded.
Healthy delivery systems are not just technically efficient. They are organizationally legible—comprehensible to the people who depend on them, owned clearly by the people who maintain them, and observable in ways that make dysfunction visible before it becomes catastrophic.
The pipeline tells you something true about your organization. The question is whether you're ready to hear it.