For a long time, internal was a boundary.
If traffic stayed inside the network perimeter, if services talked over RFC 1918 address space, if access required VPN tunnels or corporate SAML assertions—we treated it as lower risk. Not risk-free. Just understood. The kind of understood where you'd grant broader permissions, skip TLS mutual authentication, maybe log less aggressively because the threat model felt bounded.
That mental model no longer holds.
Modern systems rarely fail because someone accidentally exposed a PostgreSQL instance to 0.0.0.0/0. They fail because "internal" quietly stops describing how the system actually works, while everyone continues to act as if it does. The YAML still references private subnets. The runbooks still say "internal service." The architecture diagrams—if they exist—still show clean boundaries.
Nothing breaks. Nothing alerts.
The assumption just expires.
How "Internal" Quietly Erodes
No one wakes up and decides to dissolve trust boundaries. It happens incrementally, in pull requests that pass review, in Slack threads that conclude with "yeah, should be fine."
A CI/CD pipeline needs access to production APIs because manual deploys are untenable at scale.
A SaaS tool needs webhook access because the alternative is polling, and polling doesn't meet the SLA.
A vendor requires temporary credentials because their integration doesn't support federated auth yet.
A service originally built for one team becomes shared infrastructure because three other teams noticed it solved their problem too.
A "temporary" exception becomes permanent because no one wants to own the migration off it.
Each change is reasonable in isolation. Defensible, even. You can point to the ticket, the design review, the risk acceptance. Collectively, though, they redefine the system—without redefining the trust model that governs it.
The architecture evolves.
The assumptions don't.
You end up with services that authenticate to each other using credentials minted eighteen months ago, scoped for a workflow that no longer exists. You have IAM roles with s3:* because someone needed quick bucket access during an outage and the precision work of scoping it properly kept getting deprioritized. You have network ACLs that permit entire CIDR blocks because the original requestor left the company and no one remembers which three IPs actually mattered.
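If you want to see how much of this you're carrying, the audit isn't hard to start. Here's a minimal sketch, assuming AWS and boto3, that flags Allow statements with wildcard actions in the managed policies attached to a role; the role name is hypothetical:

```python
import boto3

iam = boto3.client("iam")

def wildcard_findings(role_name):
    """Return (policy name, wildcard actions) pairs for managed policies attached to a role."""
    findings = []
    attached = iam.list_attached_role_policies(RoleName=role_name)["AttachedPolicies"]
    for policy in attached:
        meta = iam.get_policy(PolicyArn=policy["PolicyArn"])["Policy"]
        document = iam.get_policy_version(
            PolicyArn=policy["PolicyArn"], VersionId=meta["DefaultVersionId"]
        )["PolicyVersion"]["Document"]
        statements = document["Statement"]
        if isinstance(statements, dict):  # single-statement policies aren't wrapped in a list
            statements = [statements]
        for stmt in statements:
            actions = stmt.get("Action", [])
            if isinstance(actions, str):
                actions = [actions]
            broad = [a for a in actions if a == "*" or a.endswith(":*")]
            if stmt.get("Effect") == "Allow" and broad:
                findings.append((policy["PolicyName"], broad))
    return findings

# Hypothetical role from the outage described above.
for name, actions in wildcard_findings("data-pipeline-writer"):
    print(f"{name}: wildcard actions {actions}")
```

It only covers managed policies, and it won't tell you why the wildcard is there. But it gives you a list to argue about.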
The Pipeline Problem No One Wants to Own
CI/CD is often the first place this collapse becomes visible.
Pipelines are treated as internal plumbing: trusted runners, broad credentials, implicit access to artifact stores and secrets managers and deployment APIs. The reasoning goes: these are our systems, running our code, orchestrated by our tooling. If you can't trust your own build infrastructure, what can you trust?
But pipelines now:
- Run on third-party infrastructure where "dedicated" often means "multitenant with namespace isolation"
- Pull code from forks, including ones opened 47 seconds ago by accounts with zero contribution history
- Execute unreviewed dependencies—that `npm install` pulls 843 transitive packages you've never audited
- Push artifacts consumed across environments, sometimes into production within minutes
At that point, "internal" is no longer a location.
It's a hope.
I've seen this play out in postmortems. A developer opens a pull request. The pipeline runs. One of the dependencies—six layers deep—executes a script that exfiltrates AWS credentials from the runner's metadata service. Those credentials have write access to S3 buckets that feed the production data pipeline. The breach isn't discovered until someone notices unexpected egress patterns three weeks later.
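There is a cheap partial mitigation for that particular path, assuming the runners are EC2 instances: require IMDSv2 and keep the response hop limit at 1, so code running inside a build container can't fetch the instance credentials through an extra network hop. A sketch with boto3; the instance ID is a placeholder:

```python
import boto3

ec2 = boto3.client("ec2")

# Require session-token-based IMDSv2 and cap the hop limit at 1 so containers
# behind the runner's bridge network can't reach the credentials endpoint.
ec2.modify_instance_metadata_options(
    InstanceId="i-0123456789abcdef0",  # hypothetical runner instance
    HttpTokens="required",
    HttpPutResponseHopLimit=1,
    HttpEndpoint="enabled",
)
```

It doesn't fix the overly broad credentials, it just closes the easiest way to steal them, and a review will still flag the rest.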
When security reviews flag this, the response is usually defensive:
"But it's only internal."
Internal to what, exactly?
The VPC? The GitHub organization? The SaaS vendor's runtime environment? A runner spun up on-demand in a region you didn't specify? That Kubernetes pod running in a cluster you share with twelve other teams, any of whom could potentially mount your service account tokens if the RBAC policy has drifted?
If you can't answer that clearly—if the answer requires qualifiers or depends on which subnet the traffic originated from—the boundary is already gone.
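You can at least make the last of those questions answerable. A minimal sketch, assuming Kubernetes and the official Python client, that asks the API server whether a neighboring team's service account can read secrets in your namespace; every name here is hypothetical:

```python
from kubernetes import client, config

config.load_kube_config()
authz = client.AuthorizationV1Api()

# Ask the API server: can team-checkout's default service account read secrets in payments?
review = client.V1SubjectAccessReview(
    spec=client.V1SubjectAccessReviewSpec(
        user="system:serviceaccount:team-checkout:default",  # hypothetical neighbor
        resource_attributes=client.V1ResourceAttributes(
            namespace="payments",  # hypothetical: the namespace you think is yours
            verb="get",
            resource="secrets",
        ),
    )
)
result = authz.create_subject_access_review(review)
print("allowed:", result.status.allowed)
```

If the answer is yes and nobody can say why, the RBAC drift isn't hypothetical anymore.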
Lateral Movement Without an Attacker
One of the most uncomfortable realizations in modern systems is this:
You don't need an external attacker to violate your trust boundaries.
Drift alone is enough.
A service gains access it didn't originally need because the IAM policy was broadened during an incident and never narrowed afterward. A role accumulates permissions because removing them is risky—what if something breaks?—and no one has time to instrument every call path to verify safe removal. A token's scope widens because refactoring the service to accept narrower grants would slow down the feature roadmap by two sprints.
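The accumulation is measurable. AWS, for example, records which services a role has actually authenticated to; here's a sketch with boto3 (the role ARN is made up) that lists everything the role is allowed to call but hasn't touched in the tracking window:

```python
import time
import boto3

iam = boto3.client("iam")

def unused_services(role_arn):
    """Services the role can call but has not authenticated to recently."""
    job_id = iam.generate_service_last_accessed_details(Arn=role_arn)["JobId"]
    while True:
        result = iam.get_service_last_accessed_details(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(1)  # the report is generated asynchronously
    return [
        svc["ServiceName"]
        for svc in result.get("ServicesLastAccessed", [])
        if svc.get("TotalAuthenticatedEntities", 0) == 0
    ]

# Hypothetical role broadened during an incident and never narrowed.
print(unused_services("arn:aws:iam::123456789012:role/incident-response-temp"))
```

The output isn't proof that removal is safe, but it turns "what if something breaks?" into a much shorter list to verify.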
Eventually, the system supports lateral movement by design.
No exploit required. No intrusion detected. Just normal operation under outdated assumptions about what can reach what and with which privileges.
This is why postmortems so often conclude: "Nothing failed technically."
They're right—and that's the problem.
The database faithfully executed a query issued by a credential with SELECT permissions. The S3 bucket served an object to a role with GetObject grants. The API returned data to a service account that presented a valid JWT. Every component behaved exactly as configured.
The failure wasn't in the execution. It was in the configuration continuing to reflect a system topology that no longer existed.
Why Reviews Fail After Systems Ship
Security and architecture reviews usually happen:
- Before launch, when everything is theoretical and the threat model is clean
- During major redesigns, when you're forced to redraw the boxes anyway
- After incidents, when the urgency justifies the disruption
Rarely during normal evolution.
So when a review finally does happen—maybe because you're pursuing SOC 2, maybe because a new security lead wants to understand the landscape—it evaluates the system as it exists today against assumptions made years ago.
That's when the gaps appear:
Undocumented dependencies. Not the ones in `package.json`, but the ones where Service A calls Service B, which queries Service C, which publishes to a queue that Service D consumes, and somewhere in that chain there's a credential with privileges none of the original designers intended.
Unclear ownership. The team that built the authentication service moved to a different part of the org. The oncall rotation still exists, but the runbooks reference a Confluence page that 404s, and the Slack channel is archived.
Overly broad trust zones. What started as "backend services can talk to each other" has become 47 services, some of which handle PII, some of which aggregate anonymized metrics, and all of which present the same network-level trust posture because they're in the same VPC.
"We thought this was internal" explanations. And it was, once. Then someone enabled a public endpoint for debugging. Then another team needed webhook callbacks. Then the mobile client started hitting it directly because proxying through the API gateway added 200ms. Now it's handling 30% of your traffic and still has authorization logic written for internal callers.
The failure feels sudden.
But the cause wasn't.
It was slow, quiet drift—changes that made sense individually but compounded into a trust model that diverged from the architecture, then diverged further, until the gap became undeniable.
"Internal" Is Not a Property—It's a Claim
In modern systems, internal is not something a service is.
It's something the system asserts:
Through identity—not just "this request came from 10.0.0.0/8" but "this request carries a cryptographically verifiable identity bound to a specific workload."
Through authentication—mutual TLS, not unencrypted HTTP because "it's internal anyway." Service tokens with bounded lifetimes, not static API keys rotated manually every compliance cycle.
Through authorization—policies that grant the minimum viable privilege, evaluated at request time, not broad roles granted at deploy time and never revisited.
Through continuous verification—health checks, audit logs, runtime attestation that the workload actually matches what you think you deployed.
If those mechanisms don't explicitly enforce the boundary, the boundary doesn't exist—no matter how the Visio diagram looks.
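Enforcing the boundary doesn't have to be exotic. Here's a minimal sketch of the mutual TLS point using nothing but Python's standard library: the server refuses any caller that can't present a certificate signed by your internal CA. The certificate paths are placeholders:

```python
import ssl
from http.server import HTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # By the time we get here, the TLS layer has already rejected callers
        # that didn't present a certificate signed by the internal CA.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")

context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain("server.crt", "server.key")  # this workload's own identity
context.load_verify_locations("internal-ca.pem")     # CA that signs workload certificates
context.verify_mode = ssl.CERT_REQUIRED              # no client certificate, no connection

server = HTTPServer(("0.0.0.0", 8443), Handler)
server.socket = context.wrap_socket(server.socket, server_side=True)
server.serve_forever()
```

A service mesh or SPIFFE-style workload identity can do this for you at scale; the point is that the check happens per connection, not per subnet.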
Networks don't define trust anymore. They define routing, latency, maybe blast radius if you've segmented carefully. But trust?
Ownership does. Intent does. Verification does.
Anything else is nostalgia for an era when you could stand up a firewall, draw a perimeter, and reason about everything inside versus outside. That worked when "inside" meant physical datacenters with badge access and "outside" meant the internet. It doesn't work when "inside" means seven cloud regions, four SaaS vendors, CI/CD runners in someone else's infrastructure, and developers deploying from coffee shops.
The Question Teams Avoid
Most teams don't struggle to design trust boundaries.
They struggle to answer this later:
Who is responsible for noticing when the boundary stops being true?
Not who built it. Not who approved it during the design review.
Who is accountable over time?
If the answer is "everyone"—which usually means the entire engineering org has collective responsibility—then it's really no one. Everyone assumes someone else is watching. The service keeps running. The credentials keep working. The drift accumulates.
If the answer is "no one"—no explicit owner, no recurring review, no instrumentation that would surface violations—then drift is guaranteed.
I've seen teams try to solve this with tooling: automated scanners that flag overly permissive IAM policies, network analyzers that map actual traffic patterns versus intended ones, secret sprawl detectors. These help. They catch the obvious stuff.
But they don't catch the subtle erosion: the service that technically still operates within its trust boundary but now talks to three times as many dependencies as it did at launch. The API that still requires authentication but accepts tokens from a pool so broad that "authenticated" barely means anything. The VPC peering connection established for a six-week migration that's now in its fourteenth month.
The tooling shows you what the system does. It doesn't tell you whether what it does still matches what you intended.
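Closing that gap means writing the intent down somewhere a machine can compare against it. A minimal sketch with made-up service names: declare what a service is supposed to talk to, then diff it against what flow logs or mesh telemetry actually observed:

```python
# Intent: the dependencies this service was designed to talk to.
DECLARED = {"orders-db", "payments-api", "audit-queue"}

def drift_report(observed_flows):
    """Compare observed outbound calls (from flow logs, a mesh, etc.) against the declared set."""
    observed = {flow["destination"] for flow in observed_flows}
    return {
        "new_dependencies": sorted(observed - DECLARED),     # talking to things nobody declared
        "unused_declarations": sorted(DECLARED - observed),  # declared but never exercised
    }

# Hypothetical week of observed traffic.
flows = [
    {"destination": "orders-db"},
    {"destination": "payments-api"},
    {"destination": "user-pii-store"},  # drift: not in the declared set
]
print(drift_report(flows))
```

The report isn't the interesting part. The interesting part is what happens when new_dependencies is non-empty and nobody remembers approving it.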
What This Means Going Forward
Treating "internal" as a permanent state is one of the most expensive assumptions modern teams make.
It leads to surprise failures in security reviews—the kind where you're two weeks from launch and suddenly need to retrofit authentication onto services that assumed trusted callers. To brittle CI/CD systems that work fine until someone introduces a new dependency and the entire pipeline becomes an unintended privilege escalation vector. To unclear blast radius when something does go wrong, because you genuinely don't know which services can reach which data stores with which permissions.
Risk that feels invisible until it's undeniable.
The fix isn't another tool, another scanner, another dashboard showing IAM policy violations. Those are necessary but insufficient.
The fix is treating trust boundaries as living architecture, not historical artifacts.
That means:
Explicit ownership. Not just who built it, but who watches it. Oncall rotations that include boundary validation. Quarterly reviews where you ask: does this trust model still reflect reality? If not, what changed? If you don't know, that's the problem.
Instrumentation that surfaces drift. Audit logs that show not just "this access was granted" but "this access was granted and here's why this principal has this permission in the first place." Dashboards that track permission creep over time. Alerts when a service starts talking to dependencies it didn't talk to last week.
Privilege as a first-class concept. Not something you bolt on during hardening sprints, but something you design for from the start. Credentials scoped to tasks, not teams. Tokens with lifetimes measured in hours, not years. Authorization policies that default to deny and require explicit grants—not the other way around (sketched below).
Recovery mechanisms. Because drift will happen anyway. You need the ability to revoke, to roll back, to narrow scope without taking down production. That requires redundancy, fallbacks, circuit breakers. It requires accepting that a service might temporarily lose access it shouldn't have had in the first place, and having that be survivable.
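To make the privilege point concrete, here's a minimal sketch using PyJWT and a hypothetical grant table: tokens that expire in an hour, and an authorization check that denies anything not explicitly allowed:

```python
import datetime
import jwt  # PyJWT

SIGNING_KEY = "replace-with-a-managed-secret"      # placeholder; pull from a secrets manager
ALLOWED = {("report-generator", "read:metrics")}   # explicit grants; everything else is denied

def issue_token(workload, scope):
    """Mint a token scoped to one task, expiring in one hour rather than never."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return jwt.encode(
        {"sub": workload, "scope": scope, "iat": now, "exp": now + datetime.timedelta(hours=1)},
        SIGNING_KEY,
        algorithm="HS256",
    )

def authorize(token, required_scope):
    """Default deny: expired tokens and ungranted (workload, scope) pairs are rejected."""
    try:
        claims = jwt.decode(token, SIGNING_KEY, algorithms=["HS256"])  # raises if expired
    except jwt.PyJWTError:
        return False
    return claims["scope"] == required_scope and (claims["sub"], claims["scope"]) in ALLOWED

token = issue_token("report-generator", "read:metrics")
print(authorize(token, "read:metrics"))   # True: explicitly granted
print(authorize(token, "write:metrics"))  # False: never granted, so denied by default
```

None of this is novel. The shift is treating it as the default shape of a credential, not the hardened version you retrofit later.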
And maybe most importantly: cultural acceptance that "internal" is not a security posture.
It's a description of intended topology. The topology will change. If the security model is coupled to topology—if safety depends on traffic staying in certain networks or credentials only being used by certain teams—then the security model has a shelf life. You just don't know when it expires.
I don't have a clean ending for this. The honest version is messier.
You'll fix some of it. You'll document the critical paths, tighten the most egregious permissions, add mutual TLS to the services that matter most. You'll probably leave some drift in place because the refactoring cost is genuinely too high and the risk genuinely seems low.
That's fine. That's realistic.
Just stop pretending "internal" means safe.