We’ve all been there. You want to build an Internal Developer Platform (IDP). You start with good intentions: "Let's simplify infrastructure for our developers." Six months later, you have a sprawling Backstage instance that nobody likes, a fragile mountain of Terraform modules that take 40 minutes to apply, and developers who still just DM you to "fix the S3 bucket permissions."
We fell into this trap. We tried to abstract everything away until we realized we were just hiding complexity, not managing it.
This article details a different approach. We call it "Just Enough" Platform Engineering. Instead of building a portal that triggers a CI pipeline to run Terraform (the "ClickOps" anti-pattern), we moved the abstraction layer into the Kubernetes cluster itself.
Using AWS Kro (Kubernetes Resource Orchestrator) and ACK (AWS Controllers for Kubernetes), we built a self-service API that allows developers to spin up production-ready, compliant microservices in minutes. No Jenkins pipelines. No Terraform state locks. Just `kubectl apply`.
Here is how we solved the "Day 2" operations gap and cut provisioning time from days to minutes.
🎯 The Real Problem: The "Day 2" Gap
Most platforms nail "Day 0" (creating the hello-world app). They fail at "Day 2" (maintenance).
The Scenario: You use a Terraform module to provision an S3 bucket for a team.
The Problem:
- Drift: A developer manually changes the bucket policy in the AWS Console to debug something. Your Terraform state is now wrong.
- Versioning: You update the Terraform module to enforce encryption. You now have to run `terraform apply` across 50 different repositories to propagate the fix.
- Cognitive Load: Developers have to learn HCL just to add a queue.
We needed a solution that was actively reconciling (fixing drift automatically) and API-centric (versioned and manageable).
🛠️ The Architecture: Kubernetes as the Control Plane
We stopped treating Kubernetes as just a container scheduler and started treating it as a universal control plane.
- AWS Kro: Allows us to define custom APIs (CRDs) without writing Go code. It acts as the "glue" or the orchestrator.
- ACK (AWS Controllers for Kubernetes): Native Kubernetes controllers that talk to AWS APIs. They turn an S3 bucket into a Kubernetes object.
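To make that concrete, here is roughly what an ACK-managed bucket looks like as a plain Kubernetes manifest (the names and namespace here are illustrative, not from our cluster):

```yaml
# A standalone ACK manifest: the S3 bucket becomes a regular
# Kubernetes object that the ACK controller continuously reconciles
# against the real AWS API.
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: demo-data            # illustrative name
  namespace: default
spec:
  name: demo-data-bucket-12345   # must be globally unique in AWS
  encryption:
    rules:
      - applyServerSideEncryptionByDefault:
          sseAlgorithm: AES256
```

If someone deletes or edits the real bucket in the console, the controller notices the drift and reconciles it back. That property is what the rest of this article builds on.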
The Workflow:
1. Platform Team defines a `ResourceGraphDefinition` (RGD). This is the blueprint.
2. Kro converts that RGD into a custom Kubernetes API (e.g., `kind: SecureMLWorkspace`).
3. Developer applies a simple 5-line YAML file.
4. Kro + ACK automatically provision the Deployment, Service, IAM Role, and S3 Bucket, wiring them all together securely.
💻 Implementation: The "Secure ML Workspace" API
Let's build a real artifact. We want a Custom Resource called `MLWorkspace` that gives a data scientist:
- A Jupyter Notebook (Deployment + Service).
- A private S3 Bucket for datasets.
- An IAM Role that allows only that notebook to access only that bucket.
1. The Foundation (Terraform)
We use Terraform only for the static base (EKS cluster, OIDC provider, and installing the controllers). We don't use it for the dynamic app resources.
```hcl
# Install the ACK S3 Controller via Helm
resource "helm_release" "ack_s3" {
  name       = "ack-s3-controller"
  repository = "oci://public.ecr.aws/aws-controllers-k8s"
  chart      = "s3-chart"

  # Crucial: Map the K8s ServiceAccount to an AWS IAM Role (IRSA)
  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = aws_iam_role.ack_s3_controller.arn
  }
}
```
2. The Abstraction (Kro ResourceGraphDefinition)
This is the "secret sauce." Instead of writing a complex Go Operator, we define the relationship graph in YAML.
Note: We use the ResourceGraphDefinition kind (the current standard for Kro).
```yaml
apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: ml-workspace-api
spec:
  # The Interface: What the developer sees
  schema:
    apiVersion: v1alpha1
    kind: MLWorkspace
    spec:
      project: string
      gpu: boolean | default=false
    status:
      notebookUrl: "http://${notebookservice.metadata.name}.${schema.metadata.namespace}.svc.cluster.local:8888"
      storage: ${s3bucket.status.ackResourceMetadata.arn}
  # The Implementation: What gets created
  resources:
    # 1. The Private S3 Bucket (Managed by ACK)
    - id: s3bucket
      template:
        apiVersion: s3.services.k8s.aws/v1alpha1
        kind: Bucket
        metadata:
          name: ${schema.spec.project}-data
        spec:
          name: ${schema.spec.project}-data-${schema.metadata.uid} # Globally unique name
          encryption:
            rules:
              - applyServerSideEncryptionByDefault:
                  sseAlgorithm: AES256
    # 2. The IAM Policy for Bucket Access (Managed by ACK)
    - id: iampolicy
      readyWhen:
        - ${iampolicy.status.ackResourceMetadata.arn != null}
      template:
        apiVersion: iam.services.k8s.aws/v1alpha1
        kind: Policy
        metadata:
          name: ${schema.spec.project}-s3-policy
        spec:
          name: ${schema.spec.project}-s3-policy-${schema.metadata.uid}
          policyDocument: |
            {
              "Version": "2012-10-17",
              "Statement": [{
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                  "arn:aws:s3:::${schema.spec.project}-data-${schema.metadata.uid}",
                  "arn:aws:s3:::${schema.spec.project}-data-${schema.metadata.uid}/*"
                ]
              }]
            }
    # 3. The IAM Role for the K8s Service Account / IRSA (Managed by ACK)
    - id: iamrole
      readyWhen:
        - ${iamrole.status.ackResourceMetadata.arn != null}
      template:
        apiVersion: iam.services.k8s.aws/v1alpha1
        kind: Role
        metadata:
          name: ${schema.spec.project}-role
        spec:
          name: ${schema.spec.project}-role-${schema.metadata.uid}
          policies:
            - ${iampolicy.status.ackResourceMetadata.arn}
          assumeRolePolicyDocument: |
            {
              "Version": "2012-10-17",
              "Statement": [{
                "Effect": "Allow",
                "Principal": { "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/OIDC_URL" },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                  "StringEquals": {
                    "OIDC_URL:sub": "system:serviceaccount:${schema.metadata.namespace}:${schema.spec.project}-sa"
                  }
                }
              }]
            }
    # 4. The Kubernetes Service Account
    - id: serviceaccount
      template:
        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: ${schema.spec.project}-sa
          namespace: ${schema.metadata.namespace}
          annotations:
            eks.amazonaws.com/role-arn: ${iamrole.status.ackResourceMetadata.arn}
    # 5. The Notebook Service
    - id: notebookservice
      template:
        apiVersion: v1
        kind: Service
        metadata:
          name: ${schema.spec.project}-notebook
          namespace: ${schema.metadata.namespace}
        spec:
          selector:
            app: ${schema.spec.project}
          ports:
            - port: 8888
              targetPort: 8888
    # 6. The Notebook Deployment (ready once a replica is available)
    - id: notebook
      readyWhen:
        - ${notebook.status.availableReplicas == 1}
      template:
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: ${schema.spec.project}-notebook
          namespace: ${schema.metadata.namespace}
        spec:
          selector:
            matchLabels:
              app: ${schema.spec.project}
          template:
            metadata:
              labels:
                app: ${schema.spec.project}   # must match the selector above
            spec:
              serviceAccountName: ${schema.spec.project}-sa
              containers:
                - name: jupyter
                  image: "jupyter/scipy-notebook:latest"
                  env:
                    # AUTOMATIC WIRING: Inject the Bucket ARN directly
                    - name: DATA_BUCKET
                      value: ${s3bucket.status.ackResourceMetadata.arn}
                  resources:
                    limits:
                      # Conditional logic in CEL
                      nvidia.com/gpu: "${schema.spec.gpu ? '1' : '0'}"
```
3. The Developer Experience
The developer doesn't care about IAM policies, encryption rules, or Pod selectors. They just want a workspace:
```yaml
apiVersion: kro.run/v1alpha1
kind: MLWorkspace
metadata:
  name: fraud-detection-dev
spec:
  project: fraud-detection
  gpu: true
```
That's it. When they apply this, Kro creates the bucket, waits for the ARN to be generated by AWS, injects that ARN into the Pod's environment variables, and spins up the compute.
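From the application's point of view, all of that wiring collapses into one environment variable. A minimal sketch of how notebook code might consume it (the helper function and the fallback ARN are illustrative, not part of our platform):

```python
import os

def bucket_name_from_arn(arn: str) -> str:
    """Extract the bucket name from an S3 ARN like arn:aws:s3:::my-bucket."""
    # S3 bucket ARNs have no account/region segment, so everything
    # after ':::' is the bucket name.
    return arn.split(":::", 1)[1]

# Inside the Pod, Kro has already injected the real ARN; the default
# here is only so the sketch runs outside the cluster.
arn = os.environ.get("DATA_BUCKET", "arn:aws:s3:::fraud-detection-data-abc123")
print(bucket_name_from_arn(arn))
```

No SDK configuration, no hardcoded bucket names: the binding travels with the workspace.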
💰 The Cost Reality
Is running this expensive? We analyzed the costs of running the control plane (ACK + Kro) versus the operational savings.
The "Tax" (Infrastructure Cost):
- Kro Controller: Runs as a standard Pod on your existing EKS nodes. Costs nothing beyond the base EC2/Fargate compute required (which is negligible).
- ACK Controllers: Also run as Pods on your existing nodes. Minimal resource usage.
- Total "Platform Tax": Essentially $0 in additional licensing or managed service fees. You only pay for the standard EKS cluster and the underlying compute nodes you are already running.
The Savings (Operational):
- Drift Remediation: $0 (Automatic).
- Wait Time: Reduced from days (ticketing queue) to seconds.
- Security Audits: The RGD acts as a policy. You can verify that every `MLWorkspace` uses AES256 encryption just by checking the single RGD file.
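That audit can even sit in CI. A hypothetical sketch (plain substring matching instead of a YAML parser, to keep it dependency-free) that fails the build if the RGD ever drops the encryption rule:

```python
# Hypothetical policy check: fail fast if the RGD no longer
# enforces server-side encryption on the bucket template.
RGD_SNIPPET = """
encryption:
  rules:
    - applyServerSideEncryptionByDefault:
        sseAlgorithm: AES256
"""

def enforces_aes256(rgd_text: str) -> bool:
    # A real check would parse the YAML and walk the resource graph;
    # substring matching is enough to illustrate the idea.
    return "sseAlgorithm: AES256" in rgd_text

print(enforces_aes256(RGD_SNIPPET))
```

One file to check, instead of fifty Terraform repositories.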
🧠 My Individual Conclusion
After migrating our core data services to this model, here is my honest take:
1. The "Leaky Abstraction" Risk is Real
When an ACK resource fails (e.g., AWS rejects the bucket name because it's taken), the error bubbles up to the Kubernetes status. Your developers will need to know how to read `kubectl describe` output. You cannot hide the cloud entirely.
2. Portability vs. Integration
Kro creates a tight coupling with the underlying CRDs (ACK). If you move to Google Cloud, you have to rewrite your RGDs to use Config Connector (Google's equivalent). This is not a "write once, run anywhere" solution like pure Helm charts might claim to be, but the operational stability you gain on your primary cloud is worth the lock-in.
3. The Verdict
Use Kro if you are a platform team that wants to provide golden paths without building a massive software project. It sits perfectly in the sweet spot between "raw YAML" and "heavy enterprise portal."
📚 Resources & References
- Official Project: [kro (kubernetes-sigs/kro)](https://github.com/kubernetes-sigs/kro)
- Cloud Controllers: [ACK community documentation](https://aws-controllers-k8s.github.io/community/docs/community/overview/)
- Deep Dive: [CNCF Blog: Building platforms using kro for composition](https://www.cncf.io/blog/2025/12/15/building-platforms-using-kro-for-composition/)
- Syntax Guide: CEL (Common Expression Language) Introduction