Pipelines in the Wild #4
Byte Size Summary
Red Hat OpenShift Service on AWS (ROSA), Azure Red Hat OpenShift (ARO), and OpenShift Dedicated on GCP (OSD) are mature, SRE-operated platforms: same OpenShift surface, three clouds, control plane work you do not own. When prerequisites are met, installs complete in the documented window. Enterprise timelines stretch when governance approval tracks and Terraform state hygiene are treated as afterthoughts, not when the platform fails. This article separates the solid platform from the enterprise wrapper, then covers what actually prevents pain on the IaC side: remote state before the first resource, scheduled drift detection, and platform-specific recovery when partial applies leave residue behind.
The Story
I was standing up a ROSA Classic cluster in a private governed enterprise environment. The installation documentation says 45 minutes. That estimate assumes a clean AWS account, permissive IAM, and a single team with full control over their own infrastructure.
None of those conditions existed.
The environment had AWS Organizations Service Control Policy (SCP) restrictions, a shared VPC owned by a separate networking team, and a corporate cloud governance team managing a separate approval track for every permission category. The cross-account Security Token Service (STS) assumed-role setup required trust policies across three account boundaries simultaneously. I was also new to Terraform. I had forked someone else's configuration and was running it without fully understanding what it created.
The first apply failed on an SCP block. I fixed the permission (or thought I did) and ran it again. It failed again, at a different point, on a different permission. Each failure left resources behind: OpenID Connect (OIDC) providers, IAM roles, partial VPC associations. I did not know Terraform was blind to anything not in its state file. I thought starting fresh was the same as starting clean.
It is not.
By the time my AWS account admin flagged unusual IAM activity in my account, I had accumulated OIDC providers across multiple restart cycles and could not fully account for what the forked code had created. I had to dig into the code, get a colleague to walk me through what it was doing, and spend time manually hunting through the console for resources I had created but never tracked.
The governance approval tracks (marketplace SCP on a separate high-risk timeline, VPC policies, networking policies, EC2 instance creation permissions, IAM edit permissions) were each running independently with different reviewers and different response times. Two weeks was a typical cycle for a single approval. The marketplace SCP alone, classified higher-risk than the others, had its own queue.
What was scoped as a 45-minute installation took 4 weeks. Roughly 40% of that, the better part of two weeks, was avoidable operational chaos that better Terraform practice and a different understanding of the governance relationship would have prevented. The other 60% was a governance process that no automation shortens.
The customer's perception at the end: this is very complicated.
That perception was accurate for their environment, not for managed OpenShift in general. Self-managed OpenShift in the same SCP-constrained account would have hit identical approval cycles plus etcd backups, upgrade orchestration, and control plane incident response on the team. A significant portion of the complexity was self-inflicted wrapper problems (lost state, restart cycles, Terraform before governance) wrapped around a platform that worked once permissions cleared.
This article covers both: why ROSA, ARO, and OSD are solid choices, and how to operate Terraform around them without creating drift, orphans, or a false "platform is broken" narrative.
Why ROSA, ARO, and OSD Are Solid
Not marketing. Not "zero incidents ever." Solid means:
- Predictable control plane: API server, scheduler, etcd, and OpenShift operators are operated by Red Hat SREs with SLAs, not your on-call rotation.
- Same OpenShift, three clouds: Routes, SCCs, Operators, GitOps patterns, and the console experience carry over from on-prem or self-managed.
- Documented install paths that work: CLI, portal, and OpenShift Cluster Manager (OCM) flows complete in advertised timeframes when the account is ready.
- Production track record: regulated industries run workloads on managed OpenShift because the operational model holds up under audit.
The complexity split
| Source of delay | Share | Owner |
|---|---|---|
| Governance approval tracks (SCP, marketplace, networking, IAM) | ~60% | Enterprise process |
| Avoidable ops mistakes (local state, restart cycles, no pre-flight) | ~40% | Team practice |
Managed OpenShift does not shorten a two-week marketplace SCP queue. No cloud product does. It does eliminate operating the Kubernetes control plane while you wait.
What each platform delivers
ROSA (AWS): Native STS, OIDC, PrivateLink, and BYO VPC patterns for enterprise AWS landing zones. Red Hat SREs own control plane upgrades, API availability, and platform operator lifecycle. Supported fast path:
rosa create cluster --cluster-name=my-cluster --region=us-east-1
HCP (Hosted Control Plane) further reduces customer-managed infrastructure for teams that want minimal AWS surface area.
ARO (Azure): First-party Azure resource type, co-operated by Microsoft and Red Hat. Entra ID integration, ARM lifecycle via Portal or az aro create / az aro delete, and a managed resource group (MRG) pattern that bounds blast radius for auditors. That isolation is intentional, not accidental complexity.
OSD (GCP): OCM-centric fleet management with Red Hat SRE operations. GCP IAM and Workload Identity integration for regulated orgs. Persistent disks that survive workload deletion usually reflect a retain reclaim policy working as designed, not a platform failure; production stateful services depend on that behavior.
Side-by-side
| Dimension | ROSA (AWS) | ARO (Azure) | OSD (GCP) |
|---|---|---|---|
| Control plane ops | Red Hat SRE | Red Hat + Microsoft | Red Hat SRE |
| Primary install path | rosa CLI / OCM | Portal / az aro create | OCM subscription flow |
| Cloud-native identity | STS / IAM / OIDC | Entra ID / Azure RBAC | GCP IAM / WIF |
| Enterprise networking | PrivateLink, BYO VPC | Private ARO, custom VNet | Private cluster, org constraints |
| What you stop doing | etcd backup, CP upgrades, API SRE | Same | Same |
| What you still own | Workloads, app RBAC, landing zone | Workloads, app RBAC, landing zone | Workloads, app RBAC, landing zone |
All three are the same OpenShift product surface with cloud-specific landing gear, not three different quality tiers.
Where complexity actually lives
| Layer | Examples | Platform fault? |
|---|---|---|
| Enterprise governance | SCP approval, marketplace tracks, shared VPC change windows | No |
| Landing zone design | Cross-account trust, hub-spoke DNS, egress filtering | No |
| IaC practice | Local state, partial applies, destroy without inventory | No |
| Application operations | ImagePullBackOff, NetworkPolicy gaps, secret sprawl | No |
| Platform incidents | Control plane outage, failed platform upgrade | Yes, SRE-owned with SLAs |
Managed OpenShift headlines almost always come from the first three rows. That is equally true for EKS, GKE Autopilot, and AKS in the same organization.
Recommended posture
- Prove the platform first: use the supported install path (rosa, az aro create, OCM), validate networking and a representative workload, measure time-to-cluster after approvals.
- Add Terraform for repeatability: once governance and landing zone are stable, encode a proven path with remote state and drift detection.
- Operate workloads, not the control plane: invest in pipelines, GitOps, and app SLOs, not etcd runbooks.
The Drift Problem (Wrapper, Not Platform)
Enterprise Terraform for managed services differs from tutorial workloads. Each platform creates resources across multiple ownership boundaries, integrates with cloud-native identity systems, and requires permissions that look broad to a governance team seeing them for the first time. This is the same class of vendor SRE access model as any managed Kubernetes on a hyperscaler.
When a Terraform apply fails mid-way through standing up a managed OpenShift cluster (and in governed enterprise environments, it will), the residue is significant and platform-specific at the cloud layer, not because the OpenShift control plane is unreliable:
- ROSA: OIDC providers, operator roles, account roles
- ARO: App registrations, managed resource group resources
- OSD: Persistent disks, load balancers, IAM service accounts
Terraform does not clean up what it did not finish. Provider lifecycle gaps mean terraform destroy may not fully remove what a partial apply left behind, so use supported deprovision paths (rosa delete, az aro delete, OCM cluster lifecycle) as the source of truth.
The result is drift and orphans: resources running, billing, holding IAM permissions, and invisible to Terraform because state no longer tracks them. In a governed enterprise, where a cluster can cost hundreds of dollars a day and IAM is compliance-reviewed, that is a financial, security, and governance problem simultaneously.
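One way to make that invisibility concrete is to diff a cloud-side inventory against what state actually tracks. The sketch below is illustrative only: it assumes state has been exported with `terraform show -json` and that you have collected resource IDs from the cloud side yourself. `tracked_ids` and `find_orphans` are hypothetical helper names, not part of Terraform.

```python
def tracked_ids(state_json):
    """Collect the `id` attribute of every resource in `terraform show -json` output."""
    ids = set()

    def walk(module):
        for resource in module.get("resources", []):
            resource_id = (resource.get("values") or {}).get("id")
            if resource_id:
                ids.add(resource_id)
        # Resources declared inside modules live under child_modules.
        for child in module.get("child_modules", []):
            walk(child)

    walk((state_json.get("values") or {}).get("root_module", {}))
    return ids


def find_orphans(cloud_ids, state_json):
    """Return cloud resource IDs that no state entry tracks, sorted for stable output."""
    return sorted(set(cloud_ids) - tracked_ids(state_json))
```

Anything this returns still needs a human decision: import it, or delete it through the platform's supported path.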
Why Existing Approaches Fall Short
Standard advice is correct as far as it goes: remote state, terraform plan before apply, separated environments. Most tutorials cover that well. They do not cover governed enterprise context.
Local state is the default, and the root cause of orphan drift. Engineers configure the backend last, if at all. When an apply fails, local state reflects partial reality. When a new run starts for a "clean slate," previous resources keep running untracked.
Destroy is not a complete cleanup tool for IaC-wrapped managed services. ARO app registrations and MRG resources, ROSA security groups with VPC dependencies, OSD disks with retain policies: cleanup is always automated plus manual inventory. Drift detection surfaces the gap; it does not automatically close it.
Governance is the critical path. Teams that treat it as a checkbox to pass, rather than a relationship to build before the first apply, spend weeks in approval cycles that a better starting posture shortens. Remote state and drift detection are irrelevant if required permissions are never approved.
The Architecture
Three parallel state management architectures, one per managed OpenShift platform, converge on a common drift detection pipeline. Each platform has its own remote backend, its own cloud-layer residue profile if state breaks, and its own governance surface. The drift layer is platform-agnostic: scheduled terraform plan -detailed-exitcode runs against each environment, alerting on detected changes before the gap becomes too large to reconcile.
Key design decision: state isolation by platform and environment is not optional. A single state file spanning ROSA, ARO, and OSD is a single point of failure.
Platform comparison β operations view
| Platform | Cloud | State backend | Drift signal sources | Governance surface |
|---|---|---|---|---|
| ROSA | AWS | S3 + DynamoDB | Console IAM edits, partial OIDC/role applies | AWS Organizations SCP, marketplace track |
| ARO | Azure | Azure Blob Storage | MRG changes, Entra app edits, portal drift | Azure Policy, subscription RBAC, Entra ID |
| OSD | GCP | GCS bucket | Disk/LB retention, WIF binding changes | GCP org constraints, Workload Identity approval |
Cloud-layer residue from partial applies (OIDC accumulation on ROSA, provider destroy gaps on ARO, retained PVs on OSD) is a state and process problem, observed across enterprise IaC engagements, not an indictment of managed OpenShift control planes.
Implementation
Prerequisites
Before the first terraform apply, treat these as hard stops, not guidelines:
- Governance relationship established. Schedule a meeting before writing a resource block. Ask: "How can we structure our Architecture Decision Records to make your review process as easy as possible?" Covered in Step 0.
- Marketplace SCP approved (ROSA). Separate high-risk track with its own timeline. Prerequisite that unlocks everything else.
- Instance type capacity confirmed in the mandated region. Verify actual regional capacity, not just quota limits.
- Shared VPC / networking permissions validated with the networking team, not assumed from documentation.
- Cross-account identity trust confirmed across all account boundaries: the STS assumed-role trust policy for ROSA, and the equivalent identity and role assignments for ARO.
- Remote state backend provisioned with versioning confirmed enabled.
- terraform plan -out=tfplan reviewed before any apply; use the saved plan as a governance artifact.
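Treating the checklist above as hard stops can be made mechanical with a small gate that refuses to proceed while any item is unconfirmed. This is a minimal sketch: the prerequisite keys and function names are hypothetical, chosen to mirror the list, and how each item gets confirmed is left to your process.

```python
# Hypothetical keys mirroring the prerequisite checklist; rename to suit your process.
PREREQUISITES = (
    "governance_relationship",
    "marketplace_scp_approved",
    "instance_capacity_confirmed",
    "network_permissions_validated",
    "trust_policy_confirmed",
    "remote_state_versioned",
    "saved_plan_reviewed",
)


def unmet_prerequisites(status):
    """Return checklist items that are missing or not explicitly True."""
    return [item for item in PREREQUISITES if status.get(item) is not True]


def preflight_exit_code(status):
    """0 when every item is confirmed, 1 otherwise, so a pipeline can block the apply."""
    return 0 if not unmet_prerequisites(status) else 1
```

Requiring an explicit `True` means an unanswered item blocks the apply just like a failed one.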
Step 0 β The Governance Team Is Your Primary User
No Terraform commands. The most important step.
In a governed enterprise, the governance team controls whether infrastructure reaches production. They are not a checkpoint to pass. They are your primary user.
Before technical work begins, schedule a meeting. Bring: "How can we structure our Architecture Decision Records to make your review process as easy as possible?"
Governance teams verify policy compliance, not code elegance. Map each resource to the policy it satisfies. Structure documents for verification, not explanation.
Questions every governance team asks about managed OpenShift:
- How does a third-party vendor access our private network and cloud account?
- What is the precise scope of the IAM permissions being requested?
- Who controls trust relationships between the managed service and our account?
- What happens to those permissions when the cluster is decommissioned?
Answer with each platform's Shared Responsibility Matrix:
- ROSA: Red Hat ROSA Shared Responsibility Matrix, covering Customer, Red Hat SRE, and AWS demarcation.
- ARO: Microsoft/Red Hat ARO responsibility documentation, covering Entra ID, control plane access, and MRG ownership.
- OSD: Red Hat OSD Shared Responsibility Matrix, covering GCP project boundaries, Workload Identity scope, and SRE access paths.
For the hardest permissions, such as EC2 + IAM editing (AWS), VM creation + role assignments (Azure), and Compute Engine + service accounts (GCP), plan sandbox demonstration and vendor involvement. Verification by the governance team's own assessors produces confidence; documentation tells them what to verify.
Build exception justification packages that survive reviewer rotation: self-contained documents that re-establish rationale without requiring prior relationships.
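Mapping each resource to the policy it satisfies is easier to keep honest when gaps are found mechanically. The sketch below assumes a saved plan rendered with `terraform show -json tfplan` and a hand-maintained dictionary from resource type to policy reference. The function names and any policy identifiers are hypothetical.

```python
def planned_resource_types(plan_json):
    """Resource types the plan would create, update, or delete."""
    types = set()
    for change in plan_json.get("resource_changes", []):
        actions = (change.get("change") or {}).get("actions", [])
        # Skip entries the plan leaves untouched or only reads.
        if actions and actions not in (["no-op"], ["read"]):
            types.add(change["type"])
    return types


def unmapped_types(plan_json, policy_map):
    """Planned types with no policy reference yet, i.e. gaps in the review package."""
    return sorted(
        rtype for rtype in planned_resource_types(plan_json)
        if not policy_map.get(rtype)
    )
```

An empty result means every resource type in the plan has a policy reference a reviewer can verify.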
Step 1 β Remote State Before the First Resource
Configure the remote backend before writing a single resource block.
ROSA on AWS (S3 + DynamoDB):
# backend.tf
terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state"
    key            = "rosa/production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
    # kms_key_id   = "arn:aws:kms:us-east-1:ACCOUNT:key/KEY-ID" # required in regulated environments using CMK
  }
}
ARO on Azure (Azure Blob Storage):
# backend.tf
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "myorgterraformstate"
    container_name       = "tfstate"
    key                  = "aro/production/terraform.tfstate"
  }
}
OSD on GCP (GCS bucket):
# backend.tf
terraform {
  backend "gcs" {
    bucket = "my-org-terraform-state"
    prefix = "osd/production"
  }
}
Bootstrap the state backend once, manually, before the first cluster deployment:
# bootstrap/main.tf (ROSA)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-org-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
Azure Blob (versioning and soft delete):
# bootstrap/azure/main.tf
resource "azurerm_resource_group" "terraform_state" {
  name     = "rg-terraform-state"
  location = "uksouth"
}

resource "azurerm_storage_account" "terraform_state" {
  name                     = "myorgterraformstate"
  resource_group_name      = azurerm_resource_group.terraform_state.name
  location                 = azurerm_resource_group.terraform_state.location
  account_tier             = "Standard"
  account_replication_type = "GRS"

  blob_properties {
    versioning_enabled = true

    delete_retention_policy {
      days = 30
    }
  }
}

resource "azurerm_storage_container" "tfstate" {
  name                  = "tfstate"
  storage_account_name  = azurerm_storage_account.terraform_state.name
  container_access_type = "private"
}
# Note: in private ARO environments with no public egress, configure a private
# endpoint for the storage account so the Terraform runner can reach it.
GCS (object versioning):
# bootstrap/gcp/main.tf
resource "google_storage_bucket" "terraform_state" {
  name          = "my-org-terraform-state"
  location      = "US"
  force_destroy = false

  versioning {
    enabled = true
  }

  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      num_newer_versions = 10
    }
  }
}
Versioning is not optional: it is the difference between recoverable state and unrecoverable state. Ask "Can you show me the bucket configuration?" before the first apply.
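"Show me the bucket configuration" can also be checked by a script before the first apply. A minimal sketch for the S3 case, assuming you pass in the raw output of `aws s3api get-bucket-versioning --bucket <name>`, which prints a JSON object with a `Status` field, or nothing at all when versioning was never configured. The function name is hypothetical.

```python
import json


def versioning_enabled(raw_output):
    """True only when the get-bucket-versioning output reports Status "Enabled"."""
    raw_output = raw_output.strip()
    if not raw_output:
        # Empty output means versioning has never been configured on the bucket.
        return False
    return json.loads(raw_output).get("Status") == "Enabled"
```

A `Suspended` status is treated the same as never enabled: new state writes would not get a recoverable prior version.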
In SCP-restricted production accounts, bootstrap the state backend in a less-restricted environment first and get production approval before it is needed under time pressure.
Step 2 β Drift Detection in CI
Remote state prevents the lost file problem. It does not prevent drift from console changes, partial applies, or governance-driven modifications. Run terraform plan on a schedule, not only on push.
# .github/workflows/drift-detection.yml
name: Managed OpenShift Terraform Drift Detection
on:
  schedule:
    - cron: '0 6 * * *'
  workflow_dispatch:
# id-token: write is needed when the AWS plan role is assumed via GitHub OIDC
permissions:
  id-token: write
  contents: read
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      # Keep checking the remaining environments when one reports drift
      fail-fast: false
      matrix:
        environment: [rosa-production, aro-production, osd-production]
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "~1.7"
      - name: Configure AWS credentials (ROSA)
        if: matrix.environment == 'rosa-production'
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_PLAN_ROLE_ARN }}
          aws-region: us-east-1
      - name: Configure Azure credentials (ARO)
        if: matrix.environment == 'aro-production'
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Configure GCP credentials (OSD)
        if: matrix.environment == 'osd-production'
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_CREDENTIALS }}
      - name: Terraform init
        working-directory: environments/${{ matrix.environment }}
        run: terraform init
      - name: Terraform plan (detect drift)
        id: plan
        working-directory: environments/${{ matrix.environment }}
        run: |
          set +e
          terraform plan \
            -detailed-exitcode \
            -out=plan.tfplan \
            -no-color 2>&1 | tee plan.txt
          TF_EXIT=${PIPESTATUS[0]}
          echo "exitcode=${TF_EXIT}" >> "$GITHUB_OUTPUT"
          # Exit 2 (drift) must not fail this step, or the alert and upload
          # steps below are skipped; only a real plan error (exit 1) stops here.
          if [ "${TF_EXIT}" -eq 1 ]; then exit 1; fi
      - name: Alert on drift detected
        if: steps.plan.outputs.exitcode == '2'
        run: |
          pip install requests -q
          python scripts/alert.py \
            "DRIFT DETECTED in ${{ matrix.environment }} - review plan output"
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
      - name: Upload plan for review
        if: steps.plan.outputs.exitcode == '2'
        uses: actions/upload-artifact@v4
        with:
          name: drift-plan-${{ matrix.environment }}
          path: environments/${{ matrix.environment }}/plan.txt
          retention-days: 7
      - name: Fail if drift detected
        if: steps.plan.outputs.exitcode == '2'
        run: |
          echo "Drift detected in ${{ matrix.environment }}"
          echo "Review the uploaded plan artifact and remediate before next apply"
          exit 1
The GCS backend locks state natively, so it needs no separate lock resource, unlike the DynamoDB table used with S3.
For retry and alerting patterns, see Retry Logic and Tiered Alerting in GitHub Actions.
How drift detection works: -detailed-exitcode returns exit code 2 when changes are detected; that is the drift signal. Exit code 1 still means the plan itself failed. Without the flag, terraform plan returns 0 whether or not changes exist. Daily runs across ROSA, ARO, and OSD catch manual console changes within 24 hours, not weeks later when a pipeline fails or governance flags an untracked modification.
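The exit-code contract is small enough to pin down in code. The sketch below shows one way to wrap it outside of GitHub Actions, assuming `terraform` is on the PATH and the working directory is already initialised. `classify_plan_exit` and `check_environment` are hypothetical names.

```python
import subprocess


def classify_plan_exit(code):
    """Translate a `terraform plan -detailed-exitcode` exit status into a label."""
    if code == 0:
        return "clean"   # plan succeeded, no changes
    if code == 2:
        return "drift"   # plan succeeded, changes present
    return "error"       # 1 (or anything unexpected): the plan itself failed


def check_environment(workdir):
    """Run a read-only plan in workdir and return (label, plan output)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode), result.stdout
```

Keeping "drift" and "error" distinct matters for alerting: the first is a finding to review, the second usually means credentials or backend access broke.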
Drift detection surfaces problems. It does not resolve them. Remediation may still require networking coordination and governance tickets, but you find the gap early, while the platform control plane keeps running reliably underneath.
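The workflow calls scripts/alert.py, which is not shown in this article. The following is a guess at a minimal version, not the repo's actual script: it assumes SLACK_WEBHOOK_URL points at a Slack incoming webhook, which accepts a JSON body with a `text` field, and it uses only the standard library, so the `pip install requests` step would be unnecessary with this variant.

```python
import json
import os
import sys
import urllib.request


def build_payload(message):
    """Slack incoming webhooks accept a JSON object with a `text` field."""
    return {"text": message}


def send_alert(message, webhook_url, opener=urllib.request.urlopen):
    """POST the message to the webhook and return the HTTP status code."""
    body = json.dumps(build_payload(message)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener(request, timeout=10) as response:
        return response.status


# Only acts when a message argument is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    url = os.environ.get("SLACK_WEBHOOK_URL")
    if not url:
        sys.exit("SLACK_WEBHOOK_URL is not set")
    send_alert(sys.argv[1], url)
```

The `opener` parameter exists so the function can be exercised without a network call.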
Step 3 β Recovering Orphaned Resources
Recovery reduces residue; it does not guarantee a perfectly clean account. Inventory first, use supported platform deprovision paths, then clean cloud-layer leftovers.
ROSA (OIDC providers, operator roles, account roles):
# Inventory clusters known to OCM and the OIDC providers present in IAM
rosa list clusters --output json | \
jq '.[] | {id: .id, name: .name, state: .state}'
aws iam list-open-id-connect-providers | \
jq '.OpenIDConnectProviderList[].Arn'
# Deprovision through the supported path, then remove cluster-scoped IAM
rosa delete cluster --cluster=my-orphaned-cluster --yes
rosa delete oidc-provider -c my-orphaned-cluster --yes
rosa delete operator-roles -c my-orphaned-cluster --yes
# Delete account roles only if no remaining cluster still uses the prefix
rosa list clusters --output json | jq -r '.[].aws.sts.role_arn' | grep my-prefix
rosa delete account-roles --prefix my-prefix --yes
Check IAM directly for OIDC providers referencing clusters that no longer exist.
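That direct check can be scripted as a comparison between provider ARNs in IAM and the OIDC endpoints of clusters that still exist. The sketch below rests on assumptions: that each live cluster's endpoint URL is available (for example from a field such as `aws.sts.oidc_endpoint_url` in the cluster JSON, which you should confirm against your own output), and that you supply the set of hostnames ROSA uses for OIDC in your account. All names here are hypothetical.

```python
def _strip_scheme(url):
    """'https://host/path/' -> 'host/path', matching the form used in provider ARNs."""
    return url.split("://", 1)[-1].rstrip("/")


def stale_oidc_providers(provider_arns, live_endpoint_urls, rosa_hosts):
    """Return provider ARNs on ROSA OIDC hosts that no live cluster references."""
    live = {_strip_scheme(url) for url in live_endpoint_urls}
    stale = []
    for arn in provider_arns:
        if ":oidc-provider/" not in arn:
            continue
        path = arn.split(":oidc-provider/", 1)[1].rstrip("/")
        host = path.split("/", 1)[0]
        # Providers on other hosts (CI systems, other clusters) are left alone.
        if host in rosa_hosts and path not in live:
            stale.append(arn)
    return stale
```

Restricting the match to known hosts keeps the function from flagging unrelated OIDC providers that happen to live in the same account.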
ARO (prefer native deprovision, then inventory):
az aro delete --name my-cluster --resource-group my-rg --yes
az ad app list \
--filter "startswith(displayName,'aro-')" \
--query "[].{name:displayName,id:appId}" \
--output table
az group list --query "[?starts_with(name, 'aro-')]" --output table
az group delete --name <resource-group-name> --yes --no-wait
Confirm MRGs are genuinely orphaned before deletion: no running cluster should reference them.
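A hedged sketch of that confirmation step: compare the candidate resource groups against the managed resource group each existing cluster reports. It assumes JSON from `az group list` (objects with `name` and `id`) and from `az aro list`, where the managed resource group is expected under `clusterProfile.resourceGroupId`; verify that field against your own CLI output before relying on it. The function name is hypothetical.

```python
def orphaned_groups(groups, clusters, prefix="aro-"):
    """Names of prefix-matching resource groups that no listed cluster references."""
    referenced = {
        ((cluster.get("clusterProfile") or {}).get("resourceGroupId") or "").lower()
        for cluster in clusters
    }
    return [
        group["name"]
        for group in groups
        # Azure resource IDs are case-insensitive, so compare in lowercase.
        if group["name"].startswith(prefix) and group["id"].lower() not in referenced
    ]
```

Treat the result as a review list, not a delete list: a cluster in another subscription would not appear in the input.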
OSD on GCP:
gcloud compute disks list \
--filter="NOT users:*" \
--format="table(name,zone,sizeGb,status)"
gcloud compute forwarding-rules list \
--filter="description~<infra-id-prefix>" \
--format="table(name,region,IPAddress)"
gcloud iam service-accounts list \
--filter="email~osd" \
--format="table(email,displayName,disabled)"
gcloud iam service-accounts disable <service-account-email>
gcloud iam service-accounts delete <service-account-email> --quiet
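The disk inventory can also be post-processed when the listing is exported as JSON rather than a table. A minimal sketch, assuming the input is the parsed output of `gcloud compute disks list --format=json`, where attached disks carry a `users` list and `zone` is a full resource URL. The function name is hypothetical.

```python
def unattached_disks(disks):
    """Return (name, zone, size_gb) for every disk with no attached instance."""
    result = []
    for disk in disks:
        if disk.get("users"):
            continue  # still attached to at least one instance
        zone = disk.get("zone", "").rsplit("/", 1)[-1]
        result.append((disk["name"], zone, int(disk.get("sizeGb", 0))))
    return result
```

Unattached does not mean orphaned: a retained volume for a stateful service looks identical, so cross-check against the cluster's PersistentVolumes before deleting.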
Import back into state when the resource should remain Terraform-managed:
terraform import \
rhcs_cluster_rosa_classic.production \
my-rosa-cluster-id
terraform import \
azurerm_resource_group.aro_cluster \
/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>
Run terraform plan after import. Review diffs before applying.
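Reviewing diffs after an import is easier with a summary of what the plan would touch. The sketch assumes the plan was saved with `terraform plan -out=tfplan` and rendered with `terraform show -json tfplan`; the helper names are hypothetical.

```python
def pending_changes(plan_json):
    """Map resource address -> planned actions for everything that is not a no-op."""
    pending = {}
    for change in plan_json.get("resource_changes", []):
        actions = (change.get("change") or {}).get("actions", [])
        if actions and actions != ["no-op"]:
            pending[change["address"]] = actions
    return pending


def would_destroy(plan_json):
    """Addresses the plan would delete or replace; these need the closest review."""
    return sorted(
        address
        for address, actions in pending_changes(plan_json).items()
        if "delete" in actions
    )
```

After a clean import the imported address should be absent from `pending_changes`; if it shows up in `would_destroy`, the configuration does not match the real resource and applying would recreate it.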
Step 4 β Directory Structure That Prevents Blast Radius
04-terraform-managed-openshift-state/
├── bootstrap/
│   ├── aws/
│   ├── azure/
│   └── gcp/
├── environments/
│   ├── rosa-production/
│   ├── rosa-staging/
│   ├── aro-production/
│   ├── aro-staging/
│   ├── osd-production/
│   └── osd-staging/
├── modules/
│   ├── rosa-cluster/
│   ├── aro-cluster/
│   └── osd-cluster/
├── scripts/
│   └── recovery/
└── .github/workflows/
    └── drift-detection.yml
One state file per platform × environment. A broken rosa-staging state does not affect aro-production.
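The one-state-per-pair rule can be checked rather than trusted. The sketch below derives a state key for each environment directory, following the `rosa/production/terraform.tfstate` pattern from the S3 backend example, and confirms that no two directories share one. The function name is hypothetical, and the GCS backend takes the same path as a `prefix` without the file name.

```python
from itertools import product


def state_keys(platforms, environments):
    """Map each environment directory name to its own remote state key."""
    return {
        f"{platform}-{env}": f"{platform}/{env}/terraform.tfstate"
        for platform, env in product(platforms, environments)
    }


keys = state_keys(["rosa", "aro", "osd"], ["production", "staging"])
# Every directory must map to a distinct key, or two environments share a blast radius.
assert len(set(keys.values())) == len(keys)
```

Running a check like this in CI catches the copy-paste case where a new environment directory still points at another environment's key.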
Security Considerations
Orphaned IAM resources are compliance exposure. OIDC providers, operator roles, and Workload Identity bindings sitting untracked in a governed environment become security findings. Automated scanning will flag them; cross-account remediation can take weeks.
Governance trust compounds. Orphan accumulation makes subsequent permission requests harder, felt as scrutiny, not always stated. Precise documentation on every follow-up request is the recovery path.
Exception fatigue across rotating rosters. Build self-contained justification packages that survive reviewer rotation without requiring institutional memory.
Permission scope must be mapped, not assumed. EC2 + IAM editing (AWS), VM + role assignment (Azure), Compute + service accounts (GCP) look like broad privilege to first-time reviewers. The Shared Responsibility Matrix plus sandbox verification closes the gap: the permissions are vendor SRE access models, scoped and standard for managed Kubernetes, not arbitrary admin keys.
Tradeoffs
Remote state and drift detection give you: Recoverable state versions. Drift visibility within 24 hours. Locking against concurrent apply corruption. Saved plans as governance artifacts.
They cost you: Bootstrap time and possibly its own approval cycle. Network path from the Terraform runner to S3, Azure Blob, or GCS in private environments.
Drift detection does not give you: Automatic remediation. It tells you state and reality diverged; fixing that may still be manual.
The honest limit: The governance relationship breaks first. Technical hygiene only matters after permissions and landing zone are stable. The platforms themselves (ROSA, ARO, OSD) are the stable layer underneath.
What I'd Do Differently
Prove the platform with the supported path before Terraform. Use rosa create cluster, az aro create, or the OCM wizard, then encode what worked.
Remote state before the first resource block. Bootstrap takes 15 minutes. Recovering lost state across three account boundaries takes days.
Treat every failed apply as a state review trigger. Audit what Terraform created and whether state reflects it before re-running. Thirty minutes saves hours of orphan hunting.
Never start install before governance is aligned. Parallel permission requests and applies guarantee partial state and billing residue during approval waits.
Build exception packages before the first governance meeting. Map permissions to policy, scope, and decommission behavior for reviewers who have never seen prior approvals.
Use terraform plan -out=tfplan as a governance artifact. "Here is precisely what this apply creates" beats a verbal description.
Run drift detection from day one of IaC, not after the first surprise console change.
GitHub Repo
agentic-devops/pipelineandprompts-labs: terraform-managed-openshift-state
Bootstrap for S3, Azure Blob, and GCS backends. Environment directories for ROSA, ARO, and OSD. Drift detection workflow. Recovery scripts and governance checklist.
What's Next
Next in Pipelines in the Wild: secrets management rotation automation across multi-cloud managed OpenShift, one layer above state and drift, with the same governance surface. Foundation: Secrets Management Across Multi-Cloud Pipelines.
Working configurations are in the GitHub repo. For drift patterns, recovery notes, and platform-specific runbooks, repo issues are the right place to compare notes.
