<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jay French</title>
    <description>The latest articles on DEV Community by Jay French (@jayfrenchcloud).</description>
    <link>https://dev.to/jayfrenchcloud</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3828152%2F569540ce-74c0-4a49-9ada-4bba09fe7e68.png</url>
      <title>DEV Community: Jay French</title>
      <link>https://dev.to/jayfrenchcloud</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jayfrenchcloud"/>
    <language>en</language>
    <item>
      <title>How Kubernetes Drift Detection Saved Us From Infrastructure Chaos</title>
      <dc:creator>Jay French</dc:creator>
      <pubDate>Mon, 23 Mar 2026 13:29:11 +0000</pubDate>
      <link>https://dev.to/jayfrenchcloud/how-kubernetes-drift-detection-saved-us-from-infrastructure-chaos-14b</link>
      <guid>https://dev.to/jayfrenchcloud/how-kubernetes-drift-detection-saved-us-from-infrastructure-chaos-14b</guid>
      <description>&lt;p&gt;Three months into a production migration, we discovered that 14 of our 47 deployments had quietly drifted from their declared state. Not in a dramatic, pager-firing way. In the slow, invisible way that turns a Tuesday afternoon into a Friday incident.&lt;/p&gt;

&lt;p&gt;That's the thing about configuration drift. It doesn't announce itself. It accumulates.&lt;/p&gt;

&lt;p&gt;Here's what happened, what we built to fix it, and why I think most teams are one bad deploy away from the same problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;We were running a mid-sized Kubernetes cluster across three environments: dev, staging, and production. Standard GitOps workflow. ArgoCD handling deployments. Helm charts checked into Git. Everything was "declarative." Everything was "source-of-truth."&lt;/p&gt;

&lt;p&gt;Except it wasn't.&lt;/p&gt;

&lt;p&gt;Engineers were patching things manually under pressure. &lt;code&gt;kubectl edit&lt;/code&gt; became a habit. Resource limits got tweaked directly on pods. ConfigMaps were updated in-cluster without touching the repo. Nobody flagged it because nothing broke. The cluster kept humming. The dashboards stayed green.&lt;/p&gt;

&lt;p&gt;Then we started seeing weird behavior. A service that should have been running with a 512Mi memory limit was sitting at 2Gi. Another deployment had two replicas when the Helm chart clearly declared three. A sidecar container version was six weeks behind what we'd intended to ship.&lt;/p&gt;

&lt;p&gt;None of it was catastrophic. All of it was real. And we had no idea how long it had been that way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Point 1: GitOps Sync Status Isn't the Same as Drift Detection
&lt;/h2&gt;

&lt;p&gt;This is the part that trips people up. ArgoCD told us our apps were "Synced." And technically, they were, at the moment of last sync. But sync status is a snapshot, not a continuous assertion. If someone runs &lt;code&gt;kubectl edit&lt;/code&gt; after a sync, ArgoCD doesn't know. It's not watching for that.&lt;/p&gt;

&lt;p&gt;Drift detection means continuously comparing what's running in the cluster against what's declared in Git, and alerting when they diverge. That's a different problem than deployment sync. Most teams conflate them and pay for it later.&lt;/p&gt;

&lt;p&gt;We built a reconciliation loop using a combination of ArgoCD's resource tracking and a custom controller that scraped live cluster state on a 5-minute interval, diffed it against our Helm-rendered manifests, and pushed the deltas into a monitoring pipeline. Nothing fancy. About 400 lines of Go and a Prometheus exporter.&lt;/p&gt;
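
&lt;p&gt;Before writing any Go, you can approximate the same comparison with stock tooling. A minimal sketch, assuming a chart at &lt;code&gt;./charts/app&lt;/code&gt; released as &lt;code&gt;my-app&lt;/code&gt; into the &lt;code&gt;app&lt;/code&gt; namespace (all names illustrative):&lt;/p&gt;

```shell
# Render the chart the way the GitOps pipeline would, then ask the API
# server to diff it against live cluster state.
# Exit code 0: no drift. Exit code 1: drift found. Greater than 1: error.
helm template my-app ./charts/app --namespace app \
  | kubectl diff -f -
```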

&lt;p&gt;The first run returned 14 drifted resources. Four of them in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  Point 2: The Real Problem Is Toil and Pressure, Not Malicious Intent
&lt;/h2&gt;

&lt;p&gt;Every one of those manual edits had a story.&lt;/p&gt;

&lt;p&gt;A memory limit bumped because an OOMKill was happening at 2 AM and someone needed to stop the bleeding. A replica count changed because load spiked and autoscaling hadn't kicked in fast enough. A ConfigMap updated because a third-party API changed its endpoint and we needed 30 seconds to fix it, not 30 minutes to run a pipeline.&lt;/p&gt;

&lt;p&gt;These aren't reckless engineers. These are engineers solving real problems with the tools in front of them.&lt;/p&gt;

&lt;p&gt;The issue is the feedback loop, or the lack of one. Without drift detection, that 2 AM fix becomes permanent. Nobody goes back. The PR never gets opened. The Helm chart never gets updated. And six weeks later, someone deploys from Git and rolls back the fix that's been holding production together.&lt;/p&gt;

&lt;p&gt;Sound familiar?&lt;/p&gt;

&lt;p&gt;The fix isn't telling people to stop using &lt;code&gt;kubectl edit&lt;/code&gt;. It's making the correct path faster than the escape hatch, and making drift visible so it can't quietly accumulate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Point 3: Alerting on Drift Changes the Culture
&lt;/h2&gt;

&lt;p&gt;Once engineers could see a drift dashboard (broken down by namespace, by team, and by resource type), behavior shifted. Not because we mandated it. Because visibility creates accountability in a way that process documents never do.&lt;/p&gt;

&lt;p&gt;We tagged each drift event with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The last-known modifier (pulled from audit logs via the Kubernetes API)&lt;/li&gt;
&lt;li&gt;Time since divergence&lt;/li&gt;
&lt;li&gt;Severity of the delta&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A replica count change is low severity. A security context change is high. A resource limit change that's 4x the declared value gets a PagerDuty alert.&lt;/p&gt;
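
&lt;p&gt;As a concrete sketch of those tiers, here's how the classification might be encoded. The &lt;code&gt;classify_drift&lt;/code&gt; helper and its exact thresholds are illustrative, not our production code:&lt;/p&gt;

```shell
# Hypothetical severity tiers mirroring the rules above: replica drift is
# low, security context drift is high, and a memory limit at or above 4x
# the declared value pages someone. Values are plain integers (Mi).
classify_drift() {
  kind="$1"; declared="$2"; live="$3"
  case "$kind" in
    replicas) echo "low" ;;
    securityContext) echo "high" ;;
    memoryLimit)
      # Integer division: only a full 4x multiple (or more) pages.
      if [ $(( live / declared )) -ge 4 ]; then
        echo "page"
      else
        echo "warn"
      fi
      ;;
    *) echo "warn" ;;
  esac
}

classify_drift memoryLimit 512 2048   # the 512Mi -> 2Gi case: prints "page"
classify_drift replicas 3 2           # prints "low"
```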

&lt;p&gt;Within three weeks of launching the dashboard, the team had self-corrected 11 of the 14 original drifted resources without us asking. They just didn't want to see red in their namespace.&lt;/p&gt;

&lt;p&gt;We also made it a blocker on our weekly architecture review. Any service with unresolved drift older than 72 hours got a 5-minute explanation from the owning team. Not punitive, just a forcing function for documentation and communication.&lt;/p&gt;




&lt;h2&gt;
  
  
  Point 4: Drift Detection Has to Be Cheap to Maintain
&lt;/h2&gt;

&lt;p&gt;Here's where most homegrown solutions fall apart. You build the thing, it works, then it becomes another system someone has to babysit.&lt;/p&gt;

&lt;p&gt;We kept ours deliberately simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No custom UI. A Grafana dashboard pulling from Prometheus.&lt;/li&gt;
&lt;li&gt;The controller runs as a standard Kubernetes deployment with a ServiceAccount scoped to read-only cluster access.&lt;/li&gt;
&lt;li&gt;The diff logic uses server-side apply dry-runs, which Kubernetes gives you for free.&lt;/li&gt;
&lt;li&gt;Total compute overhead is negligible.&lt;/li&gt;
&lt;/ul&gt;
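
&lt;p&gt;For reference, the alerting side can be a single Prometheus rule. A sketch assuming the exporter publishes a hypothetical &lt;code&gt;drift_detector_resources_drifted&lt;/code&gt; gauge per namespace (metric and label names are mine, not from our actual exporter):&lt;/p&gt;

```yaml
groups:
  - name: drift
    rules:
      - alert: ResourceDriftDetected
        # Only fire if drift survives a few reconciliation intervals.
        expr: drift_detector_resources_drifted > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Drifted resources detected in {{ $labels.namespace }}"
```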

&lt;p&gt;We've been running it for eight months. It's needed exactly two bug fixes and one config update when we migrated Helm chart versions. That's it.&lt;/p&gt;

&lt;p&gt;Complexity is debt. Every additional feature you bolt on is another thing that can fail or get abandoned.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Drift detection isn't a Kubernetes problem. It's a systems problem. The cluster just happens to be where the drift lives.&lt;/p&gt;

&lt;p&gt;If you're running GitOps and you've never run a diff between your declared manifests and your live cluster state, you probably have drift. You just don't know what it looks like yet.&lt;/p&gt;

&lt;p&gt;The question worth sitting with: what decisions are you currently making based on cluster state that you think matches Git, but doesn't?&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>platformengineering</category>
      <category>gitops</category>
    </item>
    <item>
      <title>Building a Secure GCP Foundation From an AWS Engineer's Perspective</title>
      <dc:creator>Jay French</dc:creator>
      <pubDate>Fri, 20 Mar 2026 03:00:57 +0000</pubDate>
      <link>https://dev.to/jayfrenchcloud/building-a-secure-gcp-foundationfrom-an-aws-engineers-perspective-mmb</link>
      <guid>https://dev.to/jayfrenchcloud/building-a-secure-gcp-foundationfrom-an-aws-engineers-perspective-mmb</guid>
      <description>&lt;h2&gt;
  
  
  Building a Secure GCP Foundation: An AWS Engineer's First Lab
&lt;/h2&gt;

&lt;p&gt;I have two AWS certifications and essentially zero GCP experience. So I set a constraint for myself: build a security-first GCP environment from scratch, using only the console (ClickOps), in a single sitting. No tutorials. Just apply what I know about cloud security principles and see how GCP implements them.&lt;/p&gt;

&lt;p&gt;Here's exactly what I built, how it maps to AWS, and the security decisions I made at every step.&lt;/p&gt;




&lt;h2&gt;
  
  
  Starting Point: Project Isolation
&lt;/h2&gt;

&lt;p&gt;In AWS, the highest-level security boundary is the &lt;strong&gt;AWS Account&lt;/strong&gt;. In GCP, the equivalent is a &lt;strong&gt;Project&lt;/strong&gt;. Before touching anything else, I created a new project called &lt;code&gt;secure-app-foundation&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ava77g59mqj04ammfty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ava77g59mqj04ammfty.png" alt=" " width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I like starting with project isolation because the project boundary is a fundamental security and billing boundary in GCP. Every resource lives inside a project, and IAM policies, API enablement, and billing are all scoped to it."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the same reason you wouldn't deploy a production workload into your personal AWS account — isolation is the first control, not an afterthought.&lt;/p&gt;

&lt;p&gt;From there, I enabled only the APIs I needed. GCP requires you to explicitly enable services (Compute, Storage, etc.) — a nice security-by-default posture that AWS doesn't enforce at the same level.&lt;/p&gt;
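
&lt;p&gt;For anyone who wants to replay this without the console, those two steps map to &lt;code&gt;gcloud&lt;/code&gt; roughly like this (a sketch; project IDs must be globally unique, so yours will differ):&lt;/p&gt;

```shell
# Create the isolated project, then enable only the APIs this lab uses.
gcloud projects create secure-app-foundation
gcloud services enable compute.googleapis.com storage.googleapis.com \
  --project=secure-app-foundation
```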




&lt;h2&gt;
  
  
  Network: Custom VPC, Not the Default
&lt;/h2&gt;

&lt;p&gt;GCP creates a "default" VPC in every new project — just like AWS creates a default VPC in every new region. And just like in AWS, you should never use it for anything real.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I used a custom VPC instead of the default to avoid inherited permissive behavior and to make network intent explicit."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I created:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A VPC named &lt;code&gt;secure-vpc&lt;/code&gt; with &lt;strong&gt;custom mode&lt;/strong&gt; (manual subnet creation)&lt;/li&gt;
&lt;li&gt;A subnet &lt;code&gt;app-subnet&lt;/code&gt; in &lt;code&gt;us-central1&lt;/code&gt; with CIDR &lt;code&gt;10.10.1.0/24&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqj4j2rpcg2rr6b1x91t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqj4j2rpcg2rr6b1x91t.png" alt=" " width="800" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnu2kfqdob1gcp3w9nrcu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnu2kfqdob1gcp3w9nrcu.png" alt=" " width="800" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AWS equivalent: creating a custom VPC with a private subnet instead of using the default. In both clouds, the default network has overly permissive firewall/security-group defaults that you'd spend time undoing anyway.&lt;/p&gt;
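
&lt;p&gt;The equivalent &lt;code&gt;gcloud&lt;/code&gt; commands, as a sketch (names and CIDR taken from the walkthrough above):&lt;/p&gt;

```shell
# Custom-mode VPC: no auto-created subnets, no inherited defaults.
gcloud compute networks create secure-vpc --subnet-mode=custom

# One explicit subnet, so network intent lives in one visible place.
gcloud compute networks subnets create app-subnet \
  --network=secure-vpc --region=us-central1 --range=10.10.1.0/24
```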




&lt;h2&gt;
  
  
  Identity: Workload Service Account
&lt;/h2&gt;

&lt;p&gt;In AWS, you attach an IAM Role to an EC2 instance to give it permissions without embedding credentials. In GCP, the equivalent is a &lt;strong&gt;Service Account&lt;/strong&gt; attached to a VM.&lt;/p&gt;

&lt;p&gt;I created a service account named &lt;code&gt;app-runtime-sa&lt;/code&gt; — and critically, &lt;strong&gt;I did not grant it any permissions at creation time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fny52ew08fiklhl1f0hoi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fny52ew08fiklhl1f0hoi.png" alt=" " width="800" height="764"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I separated workload identity from human access and attached a service account to the VM rather than relying on user credentials. Permissions get added only when the workload has a specific, documented need."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is least privilege in practice. It's easy to add permissions later. It's very hard to audit and remove over-provisioned permissions after the fact.&lt;/p&gt;
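
&lt;p&gt;In &lt;code&gt;gcloud&lt;/code&gt; terms, that looks roughly like this. The commented-out binding is a hypothetical example of a later, deliberate grant, not something I ran:&lt;/p&gt;

```shell
# Create the workload identity with zero roles attached. Permissions
# are added later, one documented binding at a time.
gcloud iam service-accounts create app-runtime-sa \
  --display-name="App runtime (least privilege)"

# Hypothetical later grant, once the app has a documented need:
# gcloud projects add-iam-policy-binding PROJECT_ID \
#   --member="serviceAccount:app-runtime-sa@PROJECT_ID.iam.gserviceaccount.com" \
#   --role="roles/storage.objectCreator"
```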

&lt;h3&gt;
  
  
  AWS ↔ GCP Concept Mapping
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;AWS&lt;/th&gt;
&lt;th&gt;GCP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Workload Identity&lt;/td&gt;
&lt;td&gt;IAM Role → EC2 Instance Profile&lt;/td&gt;
&lt;td&gt;Service Account → VM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human Access&lt;/td&gt;
&lt;td&gt;IAM User / SSO Role&lt;/td&gt;
&lt;td&gt;Google Account / Workforce Identity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network Boundary&lt;/td&gt;
&lt;td&gt;VPC / Security Groups&lt;/td&gt;
&lt;td&gt;VPC / Firewall Rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Isolated Environment&lt;/td&gt;
&lt;td&gt;AWS Account&lt;/td&gt;
&lt;td&gt;GCP Project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Posture Monitoring&lt;/td&gt;
&lt;td&gt;AWS Security Hub&lt;/td&gt;
&lt;td&gt;Security Command Center&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Object Storage&lt;/td&gt;
&lt;td&gt;S3 + Bucket Policy&lt;/td&gt;
&lt;td&gt;GCS + Uniform Access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Compute: Hardened VM with No Public IP
&lt;/h2&gt;

&lt;p&gt;I deployed a VM (&lt;code&gt;app-vm-01&lt;/code&gt;) into &lt;code&gt;app-subnet&lt;/code&gt; and applied several hardening decisions immediately.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjgzx4qza9j5qzush5ps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjgzx4qza9j5qzush5ps.png" alt=" " width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. No External IP Address&lt;/strong&gt;&lt;br&gt;
Removed the default public IP entirely. The VM has no direct internet-facing interface. Connectivity is only possible through explicitly defined firewall rules and internal network paths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Disabled Project-Wide SSH Keys&lt;/strong&gt;&lt;br&gt;
GCP has a feature that propagates SSH keys across all VMs in a project. Convenient — and terrible for security. Disabled it so this VM only accepts keys explicitly assigned to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Attached the &lt;code&gt;app-runtime-sa&lt;/code&gt; Service Account&lt;/strong&gt;&lt;br&gt;
The VM runs as a dedicated service account with no permissions, rather than the default compute service account (which often has broader access than needed).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I intentionally removed the public IP to reduce the attack surface and treated connectivity as an explicitly controlled path rather than the default."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In AWS terms: this is like launching an EC2 instance in a private subnet with no Elastic IP, assigned to a minimal IAM role. You only reach it through a bastion, SSM, or VPN — never directly from the internet.&lt;/p&gt;
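
&lt;p&gt;All three hardening decisions fit in a single &lt;code&gt;gcloud&lt;/code&gt; command, sketched here with &lt;code&gt;PROJECT_ID&lt;/code&gt; as a placeholder:&lt;/p&gt;

```shell
# No external IP, project-wide SSH keys blocked, dedicated service
# account. With a zero-permission service account, the broad
# cloud-platform scope just defers authorization to IAM.
gcloud compute instances create app-vm-01 \
  --zone=us-central1-a \
  --subnet=app-subnet \
  --no-address \
  --metadata=block-project-ssh-keys=TRUE \
  --service-account=app-runtime-sa@PROJECT_ID.iam.gserviceaccount.com \
  --scopes=cloud-platform
```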




&lt;h2&gt;
  
  
  Firewall Rules: Explicit Allowlist, Nothing Else
&lt;/h2&gt;

&lt;p&gt;GCP firewall rules work differently from AWS Security Groups in one important way: they're defined at the &lt;strong&gt;VPC level&lt;/strong&gt; and applied via &lt;strong&gt;target tags or service accounts&lt;/strong&gt;, not attached directly to instances. This is more flexible, but requires more intentional design.&lt;/p&gt;

&lt;p&gt;I created two ingress rules:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;allow-ssh-admin&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
Allows TCP:22 from a specific home IP address only. Priority 1000. No broad &lt;code&gt;/0&lt;/code&gt; ranges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;allow-internal&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
Allows SSH from within the &lt;code&gt;10.10.1.0/24&lt;/code&gt; subnet range for internal workload communication.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw53tbzs6467rpnrbiid.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw53tbzs6467rpnrbiid.png" alt=" " width="800" height="92"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I constrained ingress to only what was required and avoided the common mistake of using overly broad allow rules like &lt;code&gt;0.0.0.0/0&lt;/code&gt; on sensitive ports."&lt;/p&gt;
&lt;/blockquote&gt;
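
&lt;p&gt;Sketched as &lt;code&gt;gcloud&lt;/code&gt; commands (&lt;code&gt;ADMIN_IP&lt;/code&gt; is a placeholder, and the &lt;code&gt;app&lt;/code&gt; target tag is my assumption about how the VM is labeled):&lt;/p&gt;

```shell
# Admin SSH: one /32 source, never 0.0.0.0/0 on a sensitive port.
gcloud compute firewall-rules create allow-ssh-admin \
  --network=secure-vpc --direction=INGRESS --action=ALLOW \
  --rules=tcp:22 --source-ranges=ADMIN_IP/32 --priority=1000 \
  --target-tags=app

# Internal SSH, scoped to the subnet range only.
gcloud compute firewall-rules create allow-internal \
  --network=secure-vpc --direction=INGRESS --action=ALLOW \
  --rules=tcp:22 --source-ranges=10.10.1.0/24
```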




&lt;h2&gt;
  
  
  Storage: Private Bucket with Uniform Access
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1to9jubafh20fi0qrkwa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1to9jubafh20fi0qrkwa.png" alt=" " width="800" height="95"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I created a Cloud Storage bucket (&lt;code&gt;secure-app-logs-657483678438&lt;/code&gt;) as the log sink for this environment. Two non-negotiable settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Uniform bucket-level access&lt;/strong&gt; — disables per-object ACLs so all access is controlled through IAM only. No accidental public objects through legacy ACLs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public access prevention: Enforced&lt;/strong&gt; — prevents any configuration from making this bucket publicly accessible, even if a future IAM binding would otherwise allow it.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;"For storage, I enforced uniform bucket-level access and public access prevention to reduce accidental exposure through object ACLs."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The S3 equivalent: enabling "Block all public access" at the bucket level and using bucket policies instead of ACLs. The principle is identical.&lt;/p&gt;
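
&lt;p&gt;As a &lt;code&gt;gcloud&lt;/code&gt; sketch (flag names per the current &lt;code&gt;gcloud storage&lt;/code&gt; surface; verify against your SDK version):&lt;/p&gt;

```shell
# Both non-negotiable settings applied at creation time, so there is
# no window where the bucket exists without them.
gcloud storage buckets create gs://secure-app-logs-657483678438 \
  --location=us-central1 \
  --uniform-bucket-level-access \
  --public-access-prevention
```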




&lt;h2&gt;
  
  
  Role Design: Operator, Auditor, Workload
&lt;/h2&gt;

&lt;p&gt;Even in a small lab environment, I structured IAM roles as if this were a real production system:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I would normally split roles by operator, auditor, and workload identity. For the lab, I modeled that pattern even if the environment was small — because the habit matters more than the scale."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operator&lt;/strong&gt; — Humans who manage infrastructure. Scoped, not project-owner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditor&lt;/strong&gt; — Read-only access to logs, IAM, and resource configs. Never write access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload Identity&lt;/strong&gt; — The &lt;code&gt;app-runtime-sa&lt;/code&gt;. Only gets permissions the application actually needs, added on demand.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What I Didn't Get To: Security Command Center
&lt;/h2&gt;

&lt;p&gt;I wanted to explore Security Command Center (GCP's equivalent to AWS Security Hub) for centralized findings and misconfiguration detection — but SCC requires an &lt;strong&gt;organization resource&lt;/strong&gt;, not just a standalone project.&lt;/p&gt;

&lt;p&gt;That's actually an important lesson: at enterprise scale, you're always working within an organization hierarchy, and tools like SCC, org policies, and VPC Service Controls only make sense in that context. In a real role, this is where most of the interesting work lives — building guardrails at the org level that automatically apply to every project, rather than configuring each one manually.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cloud security principles transfer.&lt;/strong&gt; Least privilege, explicit allowlists, no public exposure by default — these aren't AWS concepts or GCP concepts. They're cloud security fundamentals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The mental model maps cleanly.&lt;/strong&gt; Once you understand AWS deeply, GCP is largely a translation exercise. Projects ↔ Accounts, Service Accounts ↔ IAM Roles, Firewall Rules ↔ Security Groups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ClickOps first, IaC second.&lt;/strong&gt; Doing this manually first forces you to understand every decision. Converting it to Terraform afterward makes you a better IaC author because you know what each resource actually does.&lt;/p&gt;
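
&lt;p&gt;As a taste of that conversion, the VPC and subnet from this lab might look like this in Terraform (an illustrative sketch using the &lt;code&gt;google&lt;/code&gt; provider, not the finished codebase):&lt;/p&gt;

```hcl
# Custom-mode VPC: auto_create_subnetworks = false is the Terraform
# equivalent of choosing "custom" in the console.
resource "google_compute_network" "secure_vpc" {
  name                    = "secure-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "app_subnet" {
  name          = "app-subnet"
  network       = google_compute_network.secure_vpc.id
  region        = "us-central1"
  ip_cidr_range = "10.10.1.0/24"
}
```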

&lt;p&gt;&lt;strong&gt;Design for org scale from day one.&lt;/strong&gt; The patterns you use in a 1-project lab — role separation, custom VPCs, no default credentials — are exactly the patterns you need when managing 200 projects.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Convert the entire setup to &lt;strong&gt;Terraform&lt;/strong&gt; — codify every decision as Infrastructure as Code&lt;/li&gt;
&lt;li&gt;Set up a &lt;strong&gt;CI/CD pipeline&lt;/strong&gt; (GitHub Actions or Cloud Build) with security checks before any infra change is applied&lt;/li&gt;
&lt;li&gt;Explore &lt;strong&gt;VPC Service Controls&lt;/strong&gt; to build a data perimeter around sensitive resources&lt;/li&gt;
&lt;li&gt;Get access to an org account and dig into &lt;strong&gt;Security Command Center&lt;/strong&gt; and org-level policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fundamentals are the same across clouds. The vocabulary is different. If you understand &lt;em&gt;why&lt;/em&gt; security controls exist, picking up a new platform is mostly learning new names for familiar ideas.&lt;/p&gt;

</description>
      <category>gcp</category>
      <category>aws</category>
      <category>cloudsecurity</category>
      <category>devsecops</category>
    </item>
  </channel>
</rss>
