Nerav Doshi

Posted on Jul 1 • Edited on Jul 2 • Originally published at pipelineandprompts.com

Managed OpenShift, Lost State, and Daily Drift Checks

#terraform #openshift #rosa #aro

Byte Size Summary

Red Hat OpenShift Service on AWS (ROSA), Azure Red Hat OpenShift (ARO), and OpenShift Dedicated on GCP (OSD) are mature, SRE-operated platforms — same OpenShift surface, three clouds, control plane work you do not own. When prerequisites are met, installs complete in the documented window. Enterprise timelines stretch when governance approval tracks and Terraform state hygiene are treated as afterthoughts, not when the platform fails. This article separates the solid platform from the enterprise wrapper, then covers what actually prevents pain on the IaC side: remote state before the first resource, scheduled drift detection, and platform-specific recovery when partial applies leave residue behind.

The Story

I was standing up a ROSA Classic cluster in a private governed enterprise environment. The installation documentation says 45 minutes. That estimate assumes a clean AWS account, permissive IAM, and a single team with full control over their own infrastructure.

None of those conditions existed.

The environment had AWS Organizations Service Control Policy (SCP) restrictions, a shared VPC owned by a separate networking team, and a corporate cloud governance team managing a separate approval track for every permission category. The cross-account Security Token Service (STS) assumed-role setup required trust policies across three account boundaries simultaneously. I was also new to Terraform. I had forked someone else's configuration and was running it without fully understanding what it created.

The first apply failed on an SCP block. I fixed the permission — or thought I did — and ran it again. It failed again, at a different point, on a different permission. Each failure left resources behind. OpenID Connect (OIDC) providers, IAM roles, partial VPC associations. I did not know Terraform was blind to anything not in its state file. I thought starting fresh was the same as starting clean.

It is not.

By the time my AWS account admin flagged unusual IAM activity in my account, I had accumulated OIDC providers across multiple restart cycles and could not fully account for what the forked code had created. I had to dig into the code, get a colleague to walk me through what it was doing, and spend time manually hunting through the console for resources I had created but never tracked.

The governance approval tracks — marketplace SCP on a separate high-risk timeline, VPC policies, networking policies, EC2 instance creation permissions, IAM edit permissions — were each running independently with different reviewers and different response times. Two weeks was a typical cycle for a single approval. The marketplace SCP alone, classified higher-risk than the others, had its own queue.

What was scoped as a 45-minute installation took 4 weeks. Roughly 40% of that, about a week and a half, was avoidable operational chaos that better Terraform practice and a different understanding of the governance relationship would have prevented. The other 60% was a governance process that no automation shortens.

The customer's perception at the end: this is very complicated.

That perception was accurate for their environment, not for managed OpenShift in general. Self-managed OpenShift in the same SCP-constrained account would have hit identical approval cycles, with etcd backups, upgrade orchestration, and control plane incident response landing on the team as well. A significant portion of the complexity came from self-inflicted wrapper problems (lost state, restart cycles, Terraform before governance) around a platform that worked once permissions cleared.

This article covers both: why ROSA, ARO, and OSD are solid choices, and how to operate Terraform around them without creating drift, orphans, or a false "platform is broken" narrative.

Why ROSA, ARO, and OSD Are Solid

Not marketing. Not "zero incidents ever." Solid means:

Predictable control plane — API server, scheduler, etcd, and OpenShift operators are operated by Red Hat SREs with SLAs, not your on-call rotation.
Same OpenShift, three clouds — Routes, SCCs, Operators, GitOps patterns, and the console experience carry over from on-prem or self-managed.
Documented install paths that work — CLI, portal, and OpenShift Cluster Manager (OCM) flows complete in advertised timeframes when the account is ready.
Production track record — Regulated industries run workloads on managed OpenShift because the operational model holds up under audit.

The complexity split

Source of delay	Share	Owner
Governance approval tracks (SCP, marketplace, networking, IAM)	~60%	Enterprise process
Avoidable ops mistakes (local state, restart cycles, no pre-flight)	~40%	Team practice

Managed OpenShift does not shorten a two-week marketplace SCP queue. No cloud product does. It does eliminate operating the Kubernetes control plane while you wait.

What each platform delivers

ROSA (AWS) — Native STS, OIDC, PrivateLink, and BYO VPC patterns for enterprise AWS landing zones. Red Hat SREs own control plane upgrades, API availability, and platform operator lifecycle. Supported fast path:

rosa create cluster --cluster-name=my-cluster --region=us-east-1

HCP (Hosted Control Plane) further reduces customer-managed infrastructure for teams that want minimal AWS surface area.

ARO (Azure) — First-party Azure resource type, co-operated by Microsoft and Red Hat. Entra ID integration, ARM lifecycle via Portal or az aro create / az aro delete, and a managed resource group (MRG) pattern that bounds blast radius for auditors — intentional isolation, not accidental complexity.

OSD (GCP) — OCM-centric fleet management with Red Hat SRE operations. GCP IAM and Workload Identity integration for regulated orgs. Persistent disks that survive workload deletion are usually the result of a Retain reclaim policy, which is by design rather than a platform failure; production stateful services depend on that behavior.

Side-by-side

Dimension	ROSA (AWS)	ARO (Azure)	OSD (GCP)
Control plane ops	Red Hat SRE	Red Hat + Microsoft	Red Hat SRE
Primary install path	`rosa` CLI / OCM	Portal / `az aro create`	OCM subscription flow
Cloud-native identity	STS / IAM / OIDC	Entra ID / Azure RBAC	GCP IAM / WIF
Enterprise networking	PrivateLink, BYO VPC	Private ARO, custom VNet	Private cluster, org constraints
What you stop doing	etcd backup, CP upgrades, API SRE	Same	Same
What you still own	Workloads, app RBAC, landing zone	Workloads, app RBAC, landing zone	Workloads, app RBAC, landing zone

All three are the same OpenShift product surface with cloud-specific landing gear — not three different quality tiers.

Where complexity actually lives

Layer	Examples	Platform fault?
Enterprise governance	SCP approval, marketplace tracks, shared VPC change windows	No
Landing zone design	Cross-account trust, hub-spoke DNS, egress filtering	No
IaC practice	Local state, partial applies, destroy without inventory	No
Application operations	ImagePullBackOff, NetworkPolicy gaps, secret sprawl	No
Platform incidents	Control plane outage, failed platform upgrade	Yes — SRE-owned with SLAs

Managed OpenShift headlines almost always come from the first three rows. That is equally true for EKS, GKE Autopilot, and AKS in the same organization.

The Drift Problem (Wrapper, Not Platform)

Enterprise Terraform for managed services differs from tutorial workloads. Each platform creates resources across multiple ownership boundaries, integrates with cloud-native identity systems, and requires permissions that look broad to a governance team seeing them for the first time — the same class of vendor SRE access model as any managed Kubernetes on a hyperscaler.

When a Terraform apply fails partway through standing up a managed OpenShift cluster (and in governed enterprise environments, it will), it leaves significant, platform-specific residue at the cloud layer. That residue is a consequence of the partial apply, not a sign that the OpenShift control plane is unreliable:

ROSA: OIDC providers, operator roles, account roles
ARO: App registrations, managed resource group resources
OSD: Persistent disks, load balancers, IAM service accounts

Terraform does not clean up what it did not finish. Provider lifecycle gaps mean terraform destroy may not fully remove what a partial apply left behind — use supported deprovision paths (rosa delete, az aro delete, OCM cluster lifecycle) as source of truth.

The result is drift and orphans: resources running, billing, holding IAM permissions, and invisible to Terraform because state no longer tracks them. In a governed enterprise — where a cluster can cost hundreds of dollars a day and IAM is compliance-reviewed — that is a financial, security, and governance problem simultaneously.
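To make "invisible to Terraform" concrete, here is a minimal sketch of an orphan check. It assumes you have the JSON from `terraform show -json` and a separately gathered list of ARNs from the account (for example from `aws iam list-open-id-connect-providers`). The function names are hypothetical, and only resources that expose an `arn` attribute are compared.

```python
import json


def tracked_arns(state_json: str) -> set[str]:
    """Collect every `arn` attribute found in `terraform show -json` output."""
    doc = json.loads(state_json)
    found: set[str] = set()

    def walk(module: dict) -> None:
        # Resources live under each module; nested modules appear as child_modules.
        for res in module.get("resources", []):
            arn = (res.get("values") or {}).get("arn")
            if arn:
                found.add(arn)
        for child in module.get("child_modules", []):
            walk(child)

    walk((doc.get("values") or {}).get("root_module", {}))
    return found


def untracked(cloud_arns: list[str], state_json: str) -> list[str]:
    """ARNs that exist in the account but are absent from Terraform state."""
    return sorted(set(cloud_arns) - tracked_arns(state_json))
```

Anything this returns is a candidate orphan, not a confirmed one: another state file or another team may legitimately own it, so treat the list as an inventory to review.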

Why Existing Approaches Fall Short

Standard advice is correct as far as it goes: remote state, terraform plan before apply, separated environments. Most tutorials cover that well. They do not cover governed enterprise context.

Local state is the default — and the root cause of orphan drift. Engineers configure the backend last, if at all. When an apply fails, local state reflects partial reality. When a new run starts for a "clean slate," previous resources keep running untracked.

Destroy is not a complete cleanup tool for IaC-wrapped managed services. ARO app registrations and MRG resources, ROSA security groups with VPC dependencies, OSD disks with retain policies — cleanup is always automated plus manual inventory. Drift detection surfaces the gap; it does not automatically close it.

Governance is the critical path. Teams that treat it as a checkbox to pass — rather than a relationship to build before the first apply — spend weeks in approval cycles a better starting posture shortens. Remote state and drift detection are irrelevant if required permissions are never approved.

The Architecture

Three parallel state management architectures — one per managed OpenShift platform — converge on a common drift detection pipeline. Each platform has its own remote backend, its own cloud-layer residue profile if state breaks, and its own governance surface. The drift layer is platform-agnostic: scheduled terraform plan -detailed-exitcode runs against each environment, alerting on detected changes before the gap becomes too large to reconcile.

Key design decision: state isolation by platform and environment is not optional. A single state file spanning ROSA, ARO, and OSD is a single point of failure.

Platform comparison — operations view

Platform	Cloud	State backend	Drift signal sources	Governance surface
ROSA	AWS	S3 + DynamoDB	Console IAM edits, partial OIDC/role applies	AWS Organizations SCP, marketplace track
ARO	Azure	Azure Blob Storage	MRG changes, Entra app edits, portal drift	Azure Policy, subscription RBAC, Entra ID
OSD	GCP	GCS bucket	Disk/LB retention, WIF binding changes	GCP org constraints, Workload Identity approval

Cloud-layer residue from partial applies (OIDC accumulation on ROSA, provider destroy gaps on ARO, retained PVs on OSD) is a state and process problem — observed across enterprise IaC engagements, not an indictment of managed OpenShift control planes.

Implementation

Prerequisites

Before the first terraform apply — hard stops, not guidelines:

Governance relationship established. Schedule a meeting before writing a resource block. Ask: "How can we structure our Architecture Decision Records to make your review process as easy as possible?" Covered in Step 0.
Marketplace SCP approved (ROSA). Separate high-risk track with its own timeline. Prerequisite that unlocks everything else.
Instance type capacity confirmed in the mandated region. Verify actual regional capacity, not just quota limits.
Shared VPC / networking permissions validated with the networking team — not assumed from documentation.
STS assumed-role trust policy confirmed across all account boundaries (ROSA). For ARO, confirm the equivalent service principal and role assignments.
Remote state backend provisioned with versioning confirmed enabled.
terraform plan -out=tfplan reviewed before any apply — use the saved plan as a governance artifact.
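A saved plan is easier to review when it is condensed to one line per resource. The sketch below is one possible way to do that, assuming the plan has been exported with `terraform show -json tfplan > tfplan.json`. It reads the `resource_changes` list from that JSON; the function name and output layout are illustrative choices, not part of the original workflow.

```python
import json
from collections import Counter


def summarize_plan(plan_json: str) -> tuple[Counter, list[str]]:
    """Return action counts and one line per resource the plan would change."""
    plan = json.loads(plan_json)
    counts: Counter = Counter()
    lines: list[str] = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue  # nothing would be created, changed, or destroyed
        # A replacement is reported as two actions, e.g. ["delete", "create"].
        label = "+".join(actions)
        counts[label] += 1
        lines.append(f"{label:<14} {rc['address']}")
    return counts, sorted(lines)
```

Attaching the resulting lines to the review ticket gives the governance team a list they can verify item by item against the approved permissions.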

Step 0 — The Governance Team Is Your Primary User

No Terraform commands. The most important step.

In a governed enterprise, the governance team controls whether infrastructure reaches production. They are not a checkpoint to pass. They are your primary user.

Before technical work begins, schedule a meeting. Bring: "How can we structure our Architecture Decision Records to make your review process as easy as possible?"

Governance teams verify policy compliance, not code elegance. Map each resource to the policy it satisfies. Structure documents for verification, not explanation.
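One way to structure that mapping is a small lookup that pairs each Terraform resource type with the control it satisfies and flags anything unmapped before the review. This is only a sketch: the control identifiers below are hypothetical placeholders, and the real ones would come from your governance team's policy catalog.

```python
# Hypothetical control IDs; substitute the identifiers your governance team uses.
POLICY_MAP = {
    "aws_iam_role": "IAM-01 least-privilege roles",
    "aws_iam_openid_connect_provider": "IAM-04 federated identity",
    "aws_s3_bucket": "DATA-02 encrypted storage",
}


def review_rows(resource_types: list[str]) -> tuple[list[tuple[str, str]], list[str]]:
    """Split resource types into (type, control) rows and a list of unmapped types."""
    unique = sorted(set(resource_types))
    mapped = [(t, POLICY_MAP[t]) for t in unique if t in POLICY_MAP]
    unmapped = [t for t in unique if t not in POLICY_MAP]
    return mapped, unmapped
```

An empty `unmapped` list means every resource in the plan has a policy answer ready; a non-empty one is the list of questions to resolve before the meeting.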

Questions every governance team asks about managed OpenShift:

How does a third-party vendor access our private network and cloud account?
What is the precise scope of the IAM permissions being requested?
Who controls trust relationships between the managed service and our account?
What happens to those permissions when the cluster is decommissioned?

Answer with each platform's Shared Responsibility Matrix:

ROSA: Red Hat ROSA Shared Responsibility Matrix — Customer, Red Hat SRE, and AWS demarcation.
ARO: Microsoft/Red Hat ARO responsibility documentation — Entra ID, control plane access, MRG ownership.
OSD: Red Hat OSD Shared Responsibility Matrix — GCP project boundaries, Workload Identity scope, SRE access paths.

For the hardest permissions (EC2 + IAM editing on AWS, VM creation + role assignments on Azure, Compute Engine + service accounts on GCP), plan a sandbox demonstration and involve the vendor. Verification by the governance team's own assessors produces confidence; documentation tells them what to verify.

Build exception justification packages that survive reviewer rotation — self-contained documents that re-establish rationale without requiring prior relationships.

Step 1 — Remote State Before the First Resource

Configure the remote backend before writing a single resource block.

ROSA on AWS — S3 + DynamoDB:

# backend.tf
terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state"
    key            = "rosa/production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
    # kms_key_id = "arn:aws:kms:us-east-1:ACCOUNT:key/KEY-ID"  # required in regulated environments using CMK
  }
}

ARO on Azure — Azure Blob Storage:

# backend.tf
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "myorgterraformstate"
    container_name       = "tfstate"
    key                  = "aro/production/terraform.tfstate"
  }
}

OSD on GCP — GCS Bucket:

# backend.tf
terraform {
  backend "gcs" {
    bucket = "my-org-terraform-state"
    prefix = "osd/production"
  }
}

Bootstrap the state backend once, manually, before the first cluster deployment:

# bootstrap/main.tf — ROSA
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-org-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

Azure Blob — versioning and soft delete:

# bootstrap/azure/main.tf
resource "azurerm_resource_group" "terraform_state" {
  name     = "rg-terraform-state"
  location = "uksouth"
}

resource "azurerm_storage_account" "terraform_state" {
  name                     = "myorgterraformstate"
  resource_group_name      = azurerm_resource_group.terraform_state.name
  location                 = azurerm_resource_group.terraform_state.location
  account_tier             = "Standard"
  account_replication_type = "GRS"

  blob_properties {
    versioning_enabled = true
    delete_retention_policy {
      days = 30
    }
  }
}

resource "azurerm_storage_container" "tfstate" {
  name                  = "tfstate"
  storage_account_name  = azurerm_storage_account.terraform_state.name
  container_access_type = "private"
}

# Note: in private ARO environments with no public egress, configure a private
# endpoint for the storage account so the Terraform runner can reach it.

GCS — object versioning:

# bootstrap/gcp/main.tf
resource "google_storage_bucket" "terraform_state" {
  name          = "my-org-terraform-state"
  location      = "US"
  force_destroy = false

  versioning {
    enabled = true
  }

  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      num_newer_versions = 10
    }
  }
}

Versioning is not optional — it is the difference between recoverable state and unrecoverable state. Ask "Can you show me the bucket configuration?" before the first apply.
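The bucket question can also be answered mechanically. The sketch below interprets the output of `aws s3api get-bucket-versioning --bucket <name>`. It assumes the CLI prints nothing for a bucket where versioning was never configured and a JSON object with a `Status` field otherwise. The function name is hypothetical.

```python
import json


def versioning_enabled(cli_output: str) -> bool:
    """Interpret the output of `aws s3api get-bucket-versioning`."""
    text = cli_output.strip()
    if not text:
        # No output: versioning has never been configured on this bucket.
        return False
    # "Suspended" keeps old versions but stops creating new ones, so it does not count.
    return json.loads(text).get("Status") == "Enabled"
```

Running a check like this as a pre-flight step turns "versioning confirmed enabled" from a verbal assurance into something a pipeline can refuse to proceed without.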

In SCP-restricted production accounts, bootstrap the state backend in a less-restricted environment first and get production approval before it is needed under time pressure.

Step 2 — Drift Detection in CI

Remote state prevents the lost file problem. It does not prevent drift from console changes, partial applies, or governance-driven modifications. Run terraform plan on a schedule — not only on push.

# .github/workflows/drift-detection.yml
name: Managed OpenShift Terraform Drift Detection

on:
  schedule:
    - cron: '0 6 * * *'
  workflow_dispatch:

jobs:
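  detect-drift:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write  # needed when role-to-assume uses GitHub OIDC
    strategy:
      fail-fast: false  # drift in one environment should not cancel the other checks
      matrix:
        environment: [rosa-production, aro-production, osd-production]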

    steps:
      - uses: actions/checkout@v4

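      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "~1.7"
          terraform_wrapper: false  # call the terraform binary directly so the shell captures its real exit code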

      - name: Configure AWS credentials (ROSA)
        if: matrix.environment == 'rosa-production'
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_PLAN_ROLE_ARN }}
          aws-region: us-east-1

      - name: Configure Azure credentials (ARO)
        if: matrix.environment == 'aro-production'
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Configure GCP credentials (OSD)
        if: matrix.environment == 'osd-production'
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_CREDENTIALS }}

      - name: Terraform init
        working-directory: environments/${{ matrix.environment }}
        run: terraform init

      - name: Terraform plan — detect drift
        id: plan
        working-directory: environments/${{ matrix.environment }}
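        run: |
          set +e
          terraform plan \
            -detailed-exitcode \
            -input=false \
            -out=plan.tfplan \
            -no-color 2>&1 | tee plan.txt
          TF_EXIT=${PIPESTATUS[0]}
          echo "exitcode=${TF_EXIT}" >> "$GITHUB_OUTPUT"
          # Exit 2 means "changes present", not failure. Fail this step only on
          # a real error so the alert, upload, and fail steps below still run.
          if [ "${TF_EXIT}" -ne 0 ] && [ "${TF_EXIT}" -ne 2 ]; then
            exit "${TF_EXIT}"
          fi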

      - name: Alert on drift detected
        if: steps.plan.outputs.exitcode == '2'
        run: |
          pip install requests -q
          python scripts/alert.py \
            "DRIFT DETECTED in ${{ matrix.environment }} — review plan output"
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

      - name: Upload plan for review
        if: steps.plan.outputs.exitcode == '2'
        uses: actions/upload-artifact@v4
        with:
          name: drift-plan-${{ matrix.environment }}
          path: environments/${{ matrix.environment }}/plan.txt
          retention-days: 7

      - name: Fail if drift detected
        if: steps.plan.outputs.exitcode == '2'
        run: |
          echo "Drift detected in ${{ matrix.environment }}"
          echo "Review the uploaded plan artifact and remediate before next apply"
          exit 1

The GCS backend supports state locking natively, so it needs no separate lock resource, unlike the DynamoDB table used with S3.

For retry and alerting patterns, see Retry Logic and Tiered Alerting in GitHub Actions.
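The workflow calls scripts/alert.py, which is not shown here. A minimal sketch of what such a script might look like follows, assuming a Slack incoming webhook that accepts a JSON body with a `text` field. It uses `urllib` from the standard library rather than `requests` (which the workflow installs) only to stay self-contained, and the function names are hypothetical.

```python
import json
import os
import sys
import urllib.request


def build_payload(message: str) -> bytes:
    """Encode the message as the JSON body a Slack incoming webhook expects."""
    return json.dumps({"text": message}).encode("utf-8")


def send_alert(message: str, webhook_url: str, timeout: float = 10.0) -> int:
    """POST the message to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=build_payload(message),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Only act when a message argument is supplied, as in the workflow step.
if __name__ == "__main__" and len(sys.argv) > 1:
    url = os.environ.get("SLACK_WEBHOOK_URL")
    if not url:
        sys.exit("SLACK_WEBHOOK_URL is not set")
    print(f"alert sent, HTTP {send_alert(sys.argv[1], url)}")
```

Reading the webhook URL from the environment matches how the workflow passes `SLACK_WEBHOOK_URL` as a secret, so the URL never appears in the command line or logs.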

How drift detection works: -detailed-exitcode returns exit code 2 when the plan succeeds and contains changes; that is the drift signal. Without the flag, terraform plan returns 0 on any successful run, whether or not changes exist, and 1 only on error. Daily runs across ROSA, ARO, and OSD catch manual console changes within 24 hours, not weeks later when a pipeline fails or governance flags an untracked modification.
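The exit-code contract is small enough to sketch outside CI. The example below assumes `terraform` is on the PATH and the working directory has already been initialized. The names `classify` and `check_drift` are hypothetical.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
NO_CHANGES, ERROR, CHANGES_PRESENT = 0, 1, 2


def classify(exit_code: int) -> str:
    """Map a plan exit code to a drift status."""
    if exit_code == NO_CHANGES:
        return "clean"
    if exit_code == CHANGES_PRESENT:
        return "drift"
    return "error"  # exit code 1, or anything unexpected


def check_drift(workdir: str) -> str:
    """Run a plan in workdir and report clean, drift, or error."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify(result.returncode)
```

Exit code 2 only says the plan is non-empty. That covers both out-of-band changes and configuration merged but not yet applied, so a scheduled run is most meaningful against a branch that has already been applied.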

Drift detection surfaces problems. It does not resolve them. Remediation may still require networking coordination and governance tickets — but you find the gap early, while the platform control plane keeps running reliably underneath.

Step 3 — Recovering Orphaned Resources

Recovery reduces residue; it does not guarantee a perfectly clean account. Inventory first, use supported platform deprovision paths, then clean cloud-layer leftovers.

ROSA — OIDC providers, operator roles, account roles:

rosa list clusters --output json | \
  jq '.[] | {id: .id, name: .name, state: .state}'

aws iam list-open-id-connect-providers | \
  jq '.OpenIDConnectProviderList[].Arn'

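# Delete the cluster first and wait for the uninstall to finish
rosa delete cluster --cluster=my-orphaned-cluster --yes --watch

# Operator roles and the OIDC provider are removed after the cluster is gone.
# Once the cluster is deleted its name no longer resolves, so pass the cluster ID.
rosa delete operator-roles -c <cluster-id> --mode auto --yes
rosa delete oidc-provider -c <cluster-id> --mode auto --yes

# Account roles are shared: confirm no remaining cluster uses the prefix before deleting
rosa list clusters --output json | jq -r '.[].aws.sts.role_arn' | grep my-prefix
rosa delete account-roles --prefix my-prefix --mode auto --yes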

Check IAM directly for OIDC providers referencing clusters that no longer exist.
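That cross-check can be sketched as a comparison between the provider ARNs from `aws iam list-open-id-connect-providers` and the OIDC endpoints of clusters that still exist. The sketch assumes each cluster's endpoint is available in the `rosa list clusters --output json` output (assumed here to be an `oidc_endpoint_url` field next to `role_arn`; verify against your CLI version). It also assumes the caller supplies the set of issuer hosts that ROSA uses in the account, so unrelated providers are left alone. All function names are hypothetical.

```python
def strip_scheme(url: str) -> str:
    """Normalize an issuer URL to the host/path form used inside provider ARNs."""
    return url.removeprefix("https://").rstrip("/")


def orphaned_oidc_providers(
    provider_arns: list[str],
    active_endpoints: list[str],
    rosa_hosts: set[str],
) -> list[str]:
    """Return provider ARNs on ROSA issuer hosts that no active cluster references."""
    active = {strip_scheme(u) for u in active_endpoints}
    orphans = []
    for arn in provider_arns:
        # ARN form: arn:aws:iam::<account>:oidc-provider/<host>/<path>
        _, _, issuer = arn.partition(":oidc-provider/")
        if not issuer:
            continue
        host = issuer.split("/", 1)[0]
        if host not in rosa_hosts:
            continue  # e.g. a CI system's provider; not ours to judge
        if issuer.rstrip("/") not in active:
            orphans.append(arn)
    return sorted(orphans)
```

The host filter is a deliberate safety choice: without it, any OIDC provider in the account that is not tied to a ROSA cluster would be reported as an orphan.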

ARO — prefer native deprovision, then inventory:

az aro delete --name my-cluster --resource-group my-rg --yes

az ad app list \
  --filter "startswith(displayName,'aro-')" \
  --query "[].{name:displayName,id:appId}" \
  --output table

az group list --query "[?starts_with(name, 'aro-')]" --output table
az group delete --name <resource-group-name> --yes --no-wait

Confirm MRGs are genuinely orphaned before deletion — no running cluster references them.
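A sketch of that confirmation step: compare candidate resource group IDs against the managed resource group each existing cluster reports. It assumes the JSON from `az aro list --output json`, where each cluster carries its managed resource group under `clusterProfile.resourceGroupId`, and candidate IDs gathered with a query such as the `az group list` filter above. The function name is hypothetical.

```python
import json


def unreferenced_groups(aro_list_json: str, candidate_group_ids: list[str]) -> list[str]:
    """Return candidate resource group IDs that no existing ARO cluster references."""
    clusters = json.loads(aro_list_json)
    # Azure resource IDs are case-insensitive, so compare in lower case.
    in_use = {
        (c.get("clusterProfile") or {}).get("resourceGroupId", "").lower()
        for c in clusters
    }
    return sorted(g for g in candidate_group_ids if g.lower() not in in_use)
```

Only groups returned here are candidates for `az group delete`, and only within the subscriptions that were listed; a cluster in a subscription you did not query would not show up as a reference.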

OSD on GCP:

gcloud compute disks list \
  --filter="NOT users:*" \
  --format="table(name,zone,sizeGb,status)"

gcloud compute forwarding-rules list \
  --filter="description~<infra-id-prefix>" \
  --format="table(name,region,IPAddress)"

gcloud iam service-accounts list \
  --filter="email~osd" \
  --format="table(email,displayName,disabled)"

gcloud iam service-accounts disable <service-account-email>
gcloud iam service-accounts delete <service-account-email> --quiet
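For the disk inventory, the same filter can be applied to JSON output when you want totals rather than a table. This sketch assumes the output of `gcloud compute disks list --format=json`, where attached disks carry a non-empty `users` list, `zone` is a full resource URL, and `sizeGb` is a string. The function name is hypothetical.

```python
import json


def unattached_disks(disks_json: str) -> list[tuple[str, str, int]]:
    """Return (name, zone, size_gb) for disks with no attached instance."""
    rows = []
    for disk in json.loads(disks_json):
        if disk.get("users"):
            continue  # still attached to at least one instance
        zone = disk.get("zone", "").rsplit("/", 1)[-1]  # keep only the zone name
        rows.append((disk["name"], zone, int(disk.get("sizeGb", 0))))
    return sorted(rows)
```

Summing the third element gives the retained capacity still billing. An unattached disk may be retained on purpose by a Retain reclaim policy, so confirm ownership before deleting anything from this list.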

Import back into state when the resource should remain Terraform-managed:

terraform import \
  rhcs_cluster_rosa_classic.production \
  my-rosa-cluster-id

terraform import \
  azurerm_resource_group.aro_cluster \
  /subscriptions/<subscription-id>/resourceGroups/<resource-group-name>

Run terraform plan after import. Review diffs before applying.

Step 4 — Directory Structure That Prevents Blast Radius

04-terraform-managed-openshift-state/
├── bootstrap/
│   ├── aws/
│   ├── azure/
│   └── gcp/
├── environments/
│   ├── rosa-production/
│   ├── rosa-staging/
│   ├── aro-production/
│   ├── aro-staging/
│   ├── osd-production/
│   └── osd-staging/
├── modules/
│   ├── rosa-cluster/
│   ├── aro-cluster/
│   └── osd-cluster/
├── scripts/
│   └── recovery/
└── .github/workflows/
    └── drift-detection.yml

One state file per platform × environment. A broken rosa-staging state does not affect aro-production.
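The platform × environment rule can be written down as a naming function, so that no two environments can ever share a state location. The sketch below mirrors the keys used in the backend examples (`rosa/production/terraform.tfstate` for S3 and Azure Blob, the `osd/production` prefix for GCS). The function names are hypothetical.

```python
from itertools import product

PLATFORMS = ("rosa", "aro", "osd")
ENVIRONMENTS = ("production", "staging")


def state_location(platform: str, environment: str) -> str:
    """Return the backend key (S3, Azure Blob) or prefix (GCS) for one environment."""
    if platform not in PLATFORMS or environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {platform}-{environment}")
    if platform == "osd":
        return f"{platform}/{environment}"  # the GCS backend takes a prefix, not a file key
    return f"{platform}/{environment}/terraform.tfstate"


def all_locations() -> dict[str, str]:
    """Map every environment directory name to its state location."""
    return {f"{p}-{e}": state_location(p, e) for p, e in product(PLATFORMS, ENVIRONMENTS)}
```

Deriving the key from the directory name, rather than typing it into each backend.tf by hand, removes the copy-paste error where a staging directory silently points at production state.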

Security Considerations

Orphaned IAM resources are compliance exposure. OIDC providers, operator roles, and Workload Identity bindings sitting untracked in a governed environment become security findings. Automated scanning will flag them; cross-account remediation can take weeks.

Governance trust compounds. Orphan accumulation makes subsequent permission requests harder — felt as scrutiny, not always stated. Precise documentation on every follow-up request is the recovery path.

Exception fatigue across rotating rosters. Build self-contained justification packages that survive reviewer rotation without requiring institutional memory.

Permission scope must be mapped, not assumed. EC2 + IAM editing (AWS), VM + role assignment (Azure), Compute + service accounts (GCP) look like broad privilege to first-time reviewers. The Shared Responsibility Matrix plus sandbox verification closes the gap — the permissions are vendor SRE access models, scoped and standard for managed Kubernetes, not arbitrary admin keys.

Tradeoffs

Remote state and drift detection give you: Recoverable state versions. Drift visibility within 24 hours. Locking against concurrent apply corruption. Saved plans as governance artifacts.

They cost you: Bootstrap time and possibly its own approval cycle. Network path from the Terraform runner to S3, Azure Blob, or GCS in private environments.

Drift detection does not give you: Automatic remediation. It tells you state and reality diverged — fixing that may still be manual.

The honest limit: Governance relationship breaks first. Technical hygiene only matters after permissions and landing zone are stable. The platforms themselves — ROSA, ARO, OSD — are the stable layer underneath.

What I'd Do Differently

Prove the platform with the supported path before Terraform. rosa create cluster, az aro create, OCM wizard — then encode what worked.

Remote state before the first resource block. Bootstrap takes 15 minutes. Recovering lost state across three account boundaries takes days.

Treat every failed apply as a state review trigger. Audit what Terraform created and whether state reflects it before re-running. Thirty minutes saves hours of orphan hunting.

Never start install before governance is aligned. Parallel permission requests and applies guarantee partial state and billing residue during approval waits.

Build exception packages before the first governance meeting. Map permissions to policy, scope, and decommission behavior for reviewers who have never seen prior approvals.

Use terraform plan -out=tfplan as a governance artifact. "Here is precisely what this apply creates" beats a verbal description.

Run drift detection from day one of IaC, not after the first surprise console change.

GitHub Repo

agentic-devops/pipelineandprompts-labs — terraform-managed-openshift-state

Bootstrap for S3, Azure Blob, and GCS backends. Environment directories for ROSA, ARO, and OSD. Drift detection workflow. Recovery scripts and governance checklist.

What's Next

Next in Pipelines in the Wild: automated secrets rotation across multi-cloud managed OpenShift, one layer above state and drift, with the same governance surface. Foundation: Secrets Management Across Multi-Cloud Pipelines.

Red Hat docs: ROSA, ARO, OSD.

Working configurations are in the GitHub repo. Drift patterns, recovery notes, and platform-specific runbooks — repo issues are the right place to compare notes.
