Bala Paranj

Posted on May 20

Google Engineers Can't Create Public Cloud Storage Buckets. Not Because They're Smarter. Because the Option Doesn't Exist.

#devops #cloud #architecture #security

Misconfiguration isn't a personnel failing. It's a structural property of platforms that PERMIT unsafe constructs. Google, Spotify, Netflix, and Shopify addressed it by removing the unsafe constructs from the developer surface entirely, with a reported 95-99% reduction in misconfiguration incidents. Most organizations can't build that platform. Here's the alternative, and why the two approaches are complementary, not competing.

Google's internal infrastructure doesn't have publicly exposed storage buckets. Not because Google engineers are more careful. Because the construct "public bucket" doesn't exist in their developer surface. A Google engineer deploying an internal service writes a one-line service declaration. The platform synthesizes everything — network policy, RBAC, TLS certificates, monitoring, secrets management. The developer never sees the configuration knobs that would produce a misconfiguration.

The misconfiguration doesn't happen because it CAN'T happen. The unsafe construct isn't guarded against. It's ABSENT.

This is the upstream approach to misconfiguration — and it's been independently adopted by Google, Spotify (Backstage + Golden Paths), Shopify (Polaris), and Netflix (Paved Road + Spinnaker). Each reports 95-99% reductions in misconfiguration incidents.

Most organizations can't build this. Here's why — and what to do instead.

The reframe: misconfiguration is structural, not personal

The industry frames misconfiguration as a KNOWLEDGE problem:

"The engineer didn't know the right configuration"
    → Fix: more training
    → Fix: better documentation  
    → Fix: security champions
    → Fix: mandatory reviews

Each fix addresses the engineer. Each assumes the PERSON is the variable. Train them better. Document more clearly. Review more thoroughly.

The structural reframe:

"The platform PERMITS the unsafe configuration"
    → Fix: remove the unsafe configuration from the platform's vocabulary

This fix addresses the PLATFORM, not the engineer. The engineer's knowledge doesn't matter because the unsafe construct doesn't exist in the surface they interact with. A developer who doesn't know that publicly exposed storage is dangerous can't create it — not because they learned it's dangerous, but because the option literally isn't available.

Three implications:

Personnel interventions don't work at scale. Training addresses one engineer at a time. The next hire resets to baseline. Turnover regenerates the problem. The structural property persists regardless of who's on the team.

Process interventions are insufficient. Code reviews catch SOME misconfigurations. The reviewer must notice the unsafe construct among hundreds of lines of IaC. The review is human-speed; deployments are machine-speed. The process can't keep pace.

Structural interventions work. Redesign the developer surface so unsafe constructs are unexpressible. The misconfigurations disappear because they have no expressive form. Not hard to create. IMPOSSIBLE to express.
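A minimal sketch of what "unexpressible" can look like in practice. The names here (Access, BucketRequest, parse_bucket_request) are hypothetical and do not describe any of the platforms named above. The idea is only that the schema defines no public access mode, so typed code has no value to pass, and raw input asking for one fails at the boundary.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    # The only access modes this hypothetical schema defines.
    # There is deliberately no PUBLIC member.
    PRIVATE = "private"
    INTERNAL = "internal"


@dataclass(frozen=True)
class BucketRequest:
    name: str
    access: Access


def parse_bucket_request(raw: dict) -> BucketRequest:
    """Parse developer input; anything outside the vocabulary is rejected."""
    unknown = set(raw) - {"name", "access"}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    # Enum lookup by value raises ValueError for values not defined in Access.
    return BucketRequest(name=raw["name"], access=Access(raw["access"]))
```

A request for "private" parses normally. A request for "public" raises ValueError, not because a rule forbids it but because the vocabulary has no such word.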

What the upstream platform looks like

A developer using the upstream platform:

Developer writes:     "Deploy service: order-processor, tier: production"

Platform synthesizes:
    ✓ Namespace with correct labels + Pod Security Admission
    ✓ NetworkPolicy: default-deny + exact required egress
    ✓ RBAC: least-privilege ServiceAccount derived from tier
    ✓ SPIFFE identity automatically issued and mounted
    ✓ Secrets from Vault with automatic rotation
    ✓ OpenTelemetry auto-injected
    ✓ Immutable root filesystem, read-only containers, drop-all capabilities

The developer's input: one line. The platform's output: a complete, production-ready, secure-by-construction service. The developer never sees NetworkPolicy YAML. Never writes RBAC rules. Never configures TLS. Never manages secrets.
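Synthesis from intent can be sketched as a function from a small declaration to a full configuration. This is an illustration under assumptions, not how any real platform is implemented: ServiceIntent, synthesize, the tier names, and the output keys are all made up for the example. What it shows is that the developer supplies two fields and every security-relevant value comes from fixed defaults.

```python
from dataclasses import dataclass

# Hypothetical tier names; a real platform would define its own.
TIERS = {"production", "staging"}


@dataclass(frozen=True)
class ServiceIntent:
    name: str
    tier: str


def synthesize(intent: ServiceIntent) -> dict:
    """Expand a one-line intent into a config built from fixed safe defaults."""
    if intent.tier not in TIERS:
        raise ValueError(f"unknown tier: {intent.tier}")
    return {
        "namespace": {"name": intent.name, "labels": {"tier": intent.tier}},
        # Default-deny; egress rules would be derived by the platform, not typed in.
        "network_policy": {"default": "deny", "egress": []},
        "service_account": f"{intent.name}-{intent.tier}",
        "security_context": {
            "read_only_root_filesystem": True,
            "drop_capabilities": ["ALL"],
        },
    }
```

Calling synthesize(ServiceIntent("order-processor", "production")) returns the whole configuration. The intent type has no field through which a developer could loosen the network policy or the security context.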

The platform's vocabulary is BOUNDED to pre-approved safe forms (golden templates). The developer can't request a public endpoint without going through an explicit, reviewed approval path. The unsafe construct isn't in the default vocabulary.

Four architectural properties:

Property	What it means
Synthesis from intent	Developer declares WHAT; platform produces HOW
Golden templates only	Platform vocabulary is bounded to pre-approved forms
Continuous audit	Templates are continuously updated as threats evolve
No bypass in the default path	The developer surface cannot express unsafe configurations

The fourth property is the most distinctive: on the default path, it is impossible to deploy publicly exposed storage or an overpermissive role because those constructs DON'T EXIST in the allowed schema. Exceptions go through the explicit, reviewed approval path, not through the developer surface.

Who has built this — and what they achieved

Organization	Platform	Result
Google (2015+)	Borg + internal IDP	Near-zero misconfiguration incidents in internal services
Spotify	Backstage + Golden Paths	High developer satisfaction + uniform safety properties
Shopify	Polaris	Standardized safe defaults across all services
Netflix	Paved Road + Spinnaker	Reduced incident rate; safety via template compliance

Four independent organizations. Same pattern. Results that point in the same direction. The convergence is empirical evidence the approach works.

Why most organizations can't do this

Despite the evidence, most organizations do NOT build upstream platforms:

Building an Internal Developer Platform requires:
    ✗ A dedicated platform team (5-20+ engineers)
    ✗ Multi-year staffing commitment
    ✗ Deep integration with cloud vendor APIs
    ✗ Continuous maintenance as cloud features evolve
    ✗ Organizational authority to mandate platform adoption
    ✗ Budget for a system that doesn't ship customer features

The investment pays off AFTER years of operation. Most organizations need safety NOW with the team they HAVE. The upstream approach is available but not accessible.

This creates a structural gap: the approach that WORKS (upstream platform) is inaccessible to the organizations that NEED it most (teams without platform-engineering capacity).

The downstream alternative

The downstream approach addresses the gap. Instead of preventing unsafe constructs from being EXPRESSED, it catches them before they reach PRODUCTION:

Upstream (Internal Developer Platform):
    Developer intent → Platform synthesizes safe config → Production
    Unsafe constructs never expressible

Downstream (Invariant evaluation):
    Developer writes IaC → Evaluation catches unsafe state → Block before production
    Unsafe constructs expressible but caught before deploy
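The downstream flow can be sketched as a set of named predicates run over a state snapshot, with a non-zero exit code blocking the deploy. This is a minimal sketch under assumptions: the snapshot shape, the two invariants, and the function names are hypothetical and are not the catalog format or API of any particular tool.

```python
from typing import Callable

Resource = dict
Invariant = Callable[[Resource], bool]


def bucket_not_public(r: Resource) -> bool:
    # Holds for every non-bucket, and for buckets that are not public.
    return r.get("type") != "bucket" or not r.get("public", False)


def role_has_no_wildcard(r: Resource) -> bool:
    # Holds for every non-role, and for roles without a "*" action.
    return r.get("type") != "role" or "*" not in r.get("actions", [])


# A tiny stand-in for an invariant catalog.
CATALOG: dict[str, Invariant] = {
    "bucket-not-public": bucket_not_public,
    "role-no-wildcard": role_has_no_wildcard,
}


def evaluate(snapshot: list[Resource]) -> list[tuple[str, str]]:
    """Return (resource id, invariant name) for every violated invariant."""
    return [
        (resource["id"], name)
        for resource in snapshot
        for name, holds in CATALOG.items()
        if not holds(resource)
    ]


def ci_exit_code(snapshot: list[Resource]) -> int:
    """Non-zero blocks the pipeline stage that would deploy."""
    return 1 if evaluate(snapshot) else 0
```

The unsafe bucket is still expressible in the IaC. The evaluation step is what stops it, which is also why coverage is bounded by what the catalog contains.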

Property	Upstream platform	Downstream evaluation
Where it operates	Authoring (before IaC exists)	Evaluation (after IaC, before deploy)
What it changes	The expression vocabulary	The state evaluation
Who absorbs complexity	Platform team (5-20+ engineers)	Catalog authors (1-3 engineers)
Adoption cost	Multi-year IDP build	Single binary integration in CI
Coverage	All configs through platform	All configs through CI/CD
Bypass risk	Very low (must bypass platform)	Moderate (can bypass CI/CD)
Safety level	95-99% reduction	Substantial reduction within catalog coverage
Time to adopt	Months to years	Hours to days

The upstream approach has HIGHER safety but HIGHER cost. The downstream approach has LOWER safety (bypassable, coverage-bounded) but DRAMATICALLY lower cost (adoptable by any team with CI/CD).

The choice matrix

Organization profile	Recommended approach
Has platform team, multi-year budget	Upstream IDP (Google-style)
No platform team, but mature CI/CD	Downstream (invariant evaluation)
Both available	Hybrid: upstream for new services, downstream for legacy
Highly regulated, zero bypass tolerance	Upstream (with explicit override paths)
High velocity, needs safety NOW	Downstream (adoptable in hours)

Most organizations are in the SECOND row: no platform team, but mature CI/CD. The downstream approach is their accessible path to safety properties the upstream approach would provide if they could build it.

The THIRD row (hybrid) is increasingly common. Large organizations build upstream platforms for new services AND use downstream evaluation for legacy services that don't yet flow through the platform. The two approaches are complementary — each covers what the other misses.

What the downstream approach catches that upstream can't

Upstream platforms are powerful but bounded:

What upstream platforms miss:
    ✗ Legacy services not on the platform (migration takes years)
    ✗ Emergency bypass / break-glass operations
    ✗ Cloud provider API changes that introduce new unsafe defaults
    ✗ Platform template bugs (the template itself is misconfigured)
    ✗ Cross-service compound risks (the template is safe per-service; the combination isn't)

The downstream approach can catch all of these, within the coverage of its catalog, because it evaluates ACTUAL STATE (snapshots) against invariants (catalog), regardless of how the state was produced. A legacy service that never touched the platform? Evaluated. A break-glass console change? Caught in the next post-deploy snapshot. A new unsafe provider default? It appears in the snapshot like any other state. A template bug? The invariant catches what the template missed. A compound risk across services? Chain controls evaluate cross-asset conditions.
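A cross-asset check can be sketched as a join over the snapshot. The example below is hypothetical throughout (field names, snapshot layout, and the function find_exposure_chains are assumptions made for illustration). It flags an internet-facing service whose role can read a sensitive bucket, a condition that no single-resource check would see.

```python
def find_exposure_chains(snapshot: dict) -> list[tuple[str, str]]:
    """Find internet-facing services whose role can read a sensitive bucket."""
    roles = {r["id"]: r for r in snapshot["roles"]}
    sensitive = {b["id"] for b in snapshot["buckets"] if b["sensitive"]}
    chains = []
    for svc in snapshot["services"]:
        if not svc["internet_facing"]:
            continue
        # Buckets this service can reach through its attached role.
        readable = set(roles[svc["role"]]["can_read"])
        for bucket_id in sorted(readable & sensitive):
            chains.append((svc["id"], bucket_id))
    return chains
```

Each asset here can pass its own per-resource checks: the bucket is private, the role is scoped, the service is meant to be public. Only the combination is the risk, so the check has to span all three.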

The downstream approach is the SAFETY NET under the upstream platform. Even organizations with full IDPs benefit from a downstream evaluation layer that catches what slips past the platform.

The convergence trajectory

The industry is moving toward upstream platforms:

Period	Dominant approach
2010-2015	Pure permissive: developers write everything
2015-2020	Scanners + reviews: catch misconfigurations after deployment
2020-2025	Policy-as-code: declared rules at PR time
2025-2030	Internal platforms: synthesize safe configs (selected organizations)
2030+	Ubiquitous platforms: most organizations adopt some form of IDP

As the trajectory progresses, the downstream approach's role evolves:

Today: Primary safety mechanism for organizations without platforms (most organizations).

2030+: Complementary safety mechanism for organizations WITH platforms — catching legacy, bypass, template bugs, and compound risks the platform doesn't cover.

The future role isn't diminished. It's SPECIALIZED. Even in a fully-platformed world, the downstream evaluation layer provides defense-in-depth that the platform alone can't.

The honest comparison

The downstream approach does NOT claim upstream-level safety:

Metric	Upstream (IDP)	Downstream (invariant evaluation)
Misconfiguration reduction	95-99% (reported by Google, Spotify, etc.)	Substantial, bounded by catalog coverage and CI/CD integration
Bypass resistance	Very high (must bypass the platform)	Moderate (can bypass CI/CD; caught by post-deploy snapshots)
Time to value	Months to years	Hours to days
Team required	Platform team (5-20+)	One person can start
Coverage of legacy	Low (legacy not on platform)	High (evaluates any state snapshot)
Cost	$1M-10M+/year in platform team	Open source + operator's existing CI

The downstream approach trades SAFETY CEILING for ACCESSIBILITY. The safety ceiling is lower (bypassable, coverage-bounded). The accessibility is incomparably higher (any team, any CI pipeline, any cloud provider, today).

For most organizations, the ones that will never build a Google-scale IDP, the accessible option is the only option. And the accessible option with 2,650 invariants evaluated before every deploy is DRAMATICALLY safer than the current state of no evaluation at all.

For your organization

If you have a platform team: Build the upstream IDP. The reported evidence supports a 95-99% reduction. Add downstream evaluation as defense-in-depth for legacy, bypass, and compound risks.

If you don't have a platform team: Adopt downstream evaluation. Single binary in CI. 2,650 invariants evaluated on every deploy. Achievable this week, not next year.

If you're building toward a platform: Start with downstream evaluation NOW while building the platform. The catalog you develop during downstream evaluation INFORMS the golden templates you'll build for the platform. The two investments compound.

Google engineers can't create publicly exposed storage buckets because the option doesn't exist in their surface. Your engineers can — because your surface permits it. The upstream fix removes the option. The downstream fix catches it before production. Both work. One takes years and a platform team. The other takes a binary and an afternoon. Start with what you can do today.

The downstream alternative — 2,650 invariants evaluated against actual cloud state, catching what upstream platforms miss, accessible to any team with CI/CD — is Stave, an open-source Risk Reasoner. Single binary. No platform team required. Defense-in-depth for teams building toward upstream, primary safety for teams that aren't. Try it: bash examples/demo-ai-security/run.sh
