We’ve all been there. You want to build an Internal Developer Platform (IDP). You start with good intentions: "Let's simplify infrastructure for our developers." Six months later, you have a sprawling Backstage instance that nobody likes, a fragile mountain of Terraform modules that take 40 minutes to apply, and developers who still just DM you to "fix the S3 bucket permissions."
We fell into this trap. We tried to abstract everything away until we realized we were just hiding complexity, not managing it.
This article details a different approach. We call it "Just Enough" Platform Engineering. Instead of building a portal that triggers a CI pipeline to run Terraform (the "ClickOps" anti-pattern), we moved the abstraction layer into the Kubernetes cluster itself.
Using AWS Kro (Kubernetes Resource Orchestrator) and ACK (AWS Controllers for Kubernetes), we built a self-service API that allows developers to spin up production-ready, compliant microservices in minutes. No Jenkins pipelines. No Terraform state locks. Just `kubectl apply`.
Here is how we solved the "Day 2" operations gap and cut provisioning time from days to minutes.
🎯 The Real Problem: The "Day 2" Gap
Most platforms nail "Day 0" (creating the hello-world app). They fail at "Day 2" (maintenance).
The Scenario: You use a Terraform module to provision an S3 bucket for a team.
The Problem:
- Drift: A developer manually changes the bucket policy in the AWS Console to debug something. Your Terraform state is now wrong.
- Versioning: You update the Terraform module to enforce encryption. You now have to run `terraform apply` across 50 different repositories to propagate the fix.
- Cognitive Load: Developers have to learn HCL just to add a queue.
We needed a solution that was actively reconciling (fixing drift automatically) and API-centric (versioned and manageable).
🛠️ The Architecture: Kubernetes as the Control Plane
We stopped treating Kubernetes as just a container scheduler and started treating it as a universal control plane.
- AWS Kro: Allows us to define custom APIs (CRDs) without writing Go code. It acts as the "glue" or the orchestrator.
- ACK (AWS Controllers for Kubernetes): Native Kubernetes controllers that talk to AWS APIs. They turn an S3 bucket into a Kubernetes object.
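To make that concrete, here is roughly what an ACK-managed bucket looks like as a plain Kubernetes manifest (the names and namespace here are illustrative, not from our cluster):

```yaml
# A standalone ACK manifest: the S3 bucket becomes a regular
# Kubernetes object that the ACK controller continuously reconciles
# against the real AWS API.
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: demo-data            # illustrative name
  namespace: default
spec:
  name: demo-data-bucket-12345   # must be globally unique in AWS
  encryption:
    rules:
      - applyServerSideEncryptionByDefault:
          sseAlgorithm: AES256
```

If someone deletes or edits the real bucket in the console, the controller notices the drift and reconciles it back. That property is what the rest of this article builds on.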
The Workflow:
1. Platform Team defines a `ResourceGraphDefinition` (RGD). This is the blueprint.
2. Kro converts that RGD into a custom Kubernetes API (e.g., `kind: SecureMLWorkspace`).
3. Developer applies a simple 5-line YAML file.
4. Kro + ACK automatically provision the Deployment, Service, IAM Role, and S3 Bucket, wiring them all together securely.
💻 Implementation: The "Secure ML Workspace" API
Let's build a real artifact. We want a Custom Resource called `MLWorkspace` that gives a data scientist:
- A Jupyter Notebook (Deployment + Service).
- A private S3 Bucket for datasets.
- An IAM Role that allows only that notebook to access only that bucket.
1. The Foundation (Terraform)
We use Terraform only for the static base (EKS cluster, OIDC provider, and installing the controllers). We don't use it for the dynamic app resources.
```hcl
# Install the ACK S3 Controller via Helm
resource "helm_release" "ack_s3" {
  name       = "ack-s3-controller"
  repository = "oci://public.ecr.aws/aws-controllers-k8s"
  chart      = "s3-chart"

  # Crucial: Map the K8s ServiceAccount to an AWS IAM Role (IRSA)
  set {
    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
    value = aws_iam_role.ack_s3_controller.arn
  }
}
```
2. The Abstraction (Kro ResourceGraphDefinition)
This is the "secret sauce." Instead of writing a complex Go Operator, we define the relationship graph in YAML.
Note: We use the ResourceGraphDefinition kind (the current standard for Kro).
```yaml
apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: ml-workspace-api
spec:
  # The Interface: What the developer sees
  schema:
    apiVersion: v1alpha1
    kind: MLWorkspace
    spec:
      project: string
      gpu: boolean | default=false
    status:
      notebookUrl: "http://${notebookservice.metadata.name}.${schema.metadata.namespace}.svc.cluster.local:8888"
      storage: ${s3bucket.status.ackResourceMetadata.arn}
  # The Implementation: What gets created
  resources:
    # 1. The Private S3 Bucket (Managed by ACK)
    - id: s3bucket
      template:
        apiVersion: s3.services.k8s.aws/v1alpha1
        kind: Bucket
        metadata:
          name: ${schema.spec.project}-data
        spec:
          name: ${schema.spec.project}-data-${schema.metadata.uid} # Globally unique name
          encryption:
            rules:
              - applyServerSideEncryptionByDefault:
                  sseAlgorithm: AES256
    # 2. The IAM Policy for Bucket Access (Managed by ACK)
    - id: iampolicy
      readyWhen:
        - ${iampolicy.status.ackResourceMetadata.arn != null}
      template:
        apiVersion: iam.services.k8s.aws/v1alpha1
        kind: Policy
        metadata:
          name: ${schema.spec.project}-s3-policy
        spec:
          name: ${schema.spec.project}-s3-policy-${schema.metadata.uid}
          policyDocument: |
            {
              "Version": "2012-10-17",
              "Statement": [{
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                  "arn:aws:s3:::${schema.spec.project}-data-${schema.metadata.uid}",
                  "arn:aws:s3:::${schema.spec.project}-data-${schema.metadata.uid}/*"
                ]
              }]
            }
    # 3. The IAM Role for the K8s Service Account / IRSA (Managed by ACK)
    - id: iamrole
      readyWhen:
        - ${iamrole.status.ackResourceMetadata.arn != null}
      template:
        apiVersion: iam.services.k8s.aws/v1alpha1
        kind: Role
        metadata:
          name: ${schema.spec.project}-role
        spec:
          name: ${schema.spec.project}-role-${schema.metadata.uid}
          policies:
            - ${iampolicy.status.ackResourceMetadata.arn}
          assumeRolePolicyDocument: |
            {
              "Version": "2012-10-17",
              "Statement": [{
                "Effect": "Allow",
                "Principal": { "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/OIDC_URL" },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                  "StringEquals": {
                    "OIDC_URL:sub": "system:serviceaccount:${schema.metadata.namespace}:${schema.spec.project}-sa"
                  }
                }
              }]
            }
    # 4. The Kubernetes Service Account
    - id: serviceaccount
      template:
        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: ${schema.spec.project}-sa
          namespace: ${schema.metadata.namespace}
          annotations:
            eks.amazonaws.com/role-arn: ${iamrole.status.ackResourceMetadata.arn}
    # 5. The Notebook Service
    - id: notebookservice
      template:
        apiVersion: v1
        kind: Service
        metadata:
          name: ${schema.spec.project}-notebook
          namespace: ${schema.metadata.namespace}
        spec:
          selector:
            app: ${schema.spec.project}
          ports:
            - port: 8888
              targetPort: 8888
    # 6. The Notebook Deployment (ready once a replica is available)
    - id: notebook
      readyWhen:
        - ${notebook.status.availableReplicas == 1}
      template:
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: ${schema.spec.project}-notebook
          namespace: ${schema.metadata.namespace}
        spec:
          selector:
            matchLabels:
              app: ${schema.spec.project}
          template:
            metadata:
              labels:
                app: ${schema.spec.project}   # must match the selector above
            spec:
              serviceAccountName: ${schema.spec.project}-sa
              containers:
                - name: jupyter
                  image: "jupyter/scipy-notebook:latest"
                  env:
                    # AUTOMATIC WIRING: Inject the Bucket ARN directly
                    - name: DATA_BUCKET
                      value: ${s3bucket.status.ackResourceMetadata.arn}
                  resources:
                    limits:
                      # Conditional logic in CEL
                      nvidia.com/gpu: "${schema.spec.gpu ? '1' : '0'}"
```
3. The Developer Experience
The developer doesn't care about IAM policies, encryption rules, or Pod selectors. They just want a workspace:
```yaml
apiVersion: kro.run/v1alpha1
kind: MLWorkspace
metadata:
  name: fraud-detection-dev
spec:
  project: fraud-detection
  gpu: true
```
That's it. When they apply this, Kro creates the bucket, waits for the ARN to be generated by AWS, injects that ARN into the Pod's environment variables, and spins up the compute.
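From the application's point of view, all of that wiring collapses into one environment variable. A minimal sketch of how notebook code might consume it (the helper function and the fallback ARN are illustrative, not part of our platform):

```python
import os

def bucket_name_from_arn(arn: str) -> str:
    """Extract the bucket name from an S3 ARN like arn:aws:s3:::my-bucket."""
    # S3 bucket ARNs have no account/region segment, so everything
    # after ':::' is the bucket name.
    return arn.split(":::", 1)[1]

# Inside the Pod, Kro has already injected the real ARN; the default
# here is only so the sketch runs outside the cluster.
arn = os.environ.get("DATA_BUCKET", "arn:aws:s3:::fraud-detection-data-abc123")
print(bucket_name_from_arn(arn))
```

No SDK configuration, no hardcoded bucket names: the binding travels with the workspace.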
💰 The Cost Reality
Is running this expensive? We analyzed the costs of running the control plane (ACK + Kro) versus the operational savings.
The "Tax" (Infrastructure Cost):
- Kro Controller: Runs as a standard Pod on your existing EKS nodes. Costs nothing beyond the base EC2/Fargate compute required (which is negligible).
- ACK Controllers: Also run as Pods on your existing nodes. Minimal resource usage.
- Total "Platform Tax": Essentially $0 in additional licensing or managed service fees. You only pay for the standard EKS cluster and the underlying compute nodes you are already running.
The Savings (Operational):
- Drift Remediation: $0 (Automatic).
- Wait Time: Reduced from days (ticketing queue) to seconds.
- Security Audits: The RGD acts as a policy. You can verify that every `MLWorkspace` uses AES256 encryption just by checking the single RGD file.
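That audit can even sit in CI. A hypothetical sketch (plain substring matching instead of a YAML parser, to keep it dependency-free) that fails the build if the RGD ever drops the encryption rule:

```python
# Hypothetical policy check: fail fast if the RGD no longer
# enforces server-side encryption on the bucket template.
RGD_SNIPPET = """
encryption:
  rules:
    - applyServerSideEncryptionByDefault:
        sseAlgorithm: AES256
"""

def enforces_aes256(rgd_text: str) -> bool:
    # A real check would parse the YAML and walk the resource graph;
    # substring matching is enough to illustrate the idea.
    return "sseAlgorithm: AES256" in rgd_text

print(enforces_aes256(RGD_SNIPPET))
```

One file to check, instead of fifty Terraform repositories.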
🧠 My Individual Conclusion
After migrating our core data services to this model, here is my honest take:
1. The "Leaky Abstraction" Risk is Real
When an ACK resource fails (e.g., AWS rejects the bucket name because it's taken), the error bubbles up to the Kubernetes status. Your developers will need to know how to read `kubectl describe` output. You cannot hide the cloud entirely.
2. Portability vs. Integration
Kro creates a tight coupling with the underlying CRDs (ACK). If you move to Google Cloud, you have to rewrite your RGDs to use Config Connector (Google's equivalent). This is not a "write once, run anywhere" solution like pure Helm charts might claim to be, but the operational stability you gain on your primary cloud is worth the lock-in.
3. The Verdict
Use Kro if you are a platform team that wants to provide golden paths without building a massive software project. It sits perfectly in the sweet spot between "raw YAML" and "heavy enterprise portal."
📚 Resources & References
- Official Project: [kro (kubernetes-sigs/kro)](https://github.com/kubernetes-sigs/kro)
- Cloud Controllers: [ACK community documentation](https://aws-controllers-k8s.github.io/community/docs/community/overview/)
- Deep Dive: [CNCF Blog: Building platforms using kro for composition](https://www.cncf.io/blog/2025/12/15/building-platforms-using-kro-for-composition/)
- Syntax Guide: CEL (Common Expression Language) Introduction