The goal was straightforward on paper: run roughly 100 isolated Drupal websites on a single Amazon EKS cluster. One cluster to rule them all, with each tenant completely isolated from the others — separate namespaces, separate databases, separate secrets, separate network policies. No shared anything that could become a blast radius.
We started with three sites to validate every fundamental before scaling. This article is the honest account of what we built, why we made each decision, and — most valuably — the eleven real problems we ran into and how we solved them.
The full platform repo is public. Everything described here — Terraform, Helm chart, Argo CD ApplicationSet, OTel config, Percona MySQL CRDs — is at github.com/YOUR_GITHUB_ORG/drupal-platform. Three placeholders (YOUR_AWS_ACCOUNT_ID, YOUR_GITHUB_ORG, YOUR_AWS_REGION) need replacing before you run anything; a terraform.tfvars.example covers what's needed.
The Architecture
Overview
This is the high-level shape of the system: traffic enters through AWS, lands in a single EKS cluster, and is routed to one of 100 isolated tenant namespaces. Shared platform services support all tenants, while node groups handle scheduling and scaling.
Key idea: one cluster, many namespaces, strong isolation boundaries.
Tenant Isolation
Each tenant lives in its own namespace and has its own Drupal deployment, database, and persistent storage. Isolation is enforced using Kubernetes-native controls rather than separate clusters.
Each namespace has its own database, its own persistent volume, its own IAM role (via IRSA), and network and resource boundaries. Blast radius is limited to a single tenant.
Platform & AWS Integration
Shared platform components handle GitOps deployments, secrets management, monitoring, and AWS integrations. These services support all tenants without breaking isolation.
Platform services deploy and manage tenants via GitOps, pull secrets securely from AWS Secrets Manager, provide centralised observability, and integrate with AWS IAM and EBS.
Every site lives in its own namespace (drupal-site-<name>), gets its own MySQL instance via the Percona Operator, pulls secrets from AWS Secrets Manager via External Secrets Operator, and is deployed by a single Argo CD ApplicationSet that auto-discovers new sites by watching a gitops/sites/ directory in the platform repo.
Why Not Multi-Cluster?
The first question anyone asks is "why not one cluster per client?" We considered it. The operational overhead of managing 100 clusters — EKS upgrades, IAM management, Argo CD fleet management, cost — made it a non-starter for Phase 1. A single cluster with strong namespace isolation gives us 95% of the isolation properties at a fraction of the ops burden.
Multi-cluster is explicitly in the backlog for clients who contractually require it, but the default answer is: one cluster, hard namespace isolation, Pod Security Standards enforced at restricted, default-deny NetworkPolicies, and separate IAM roles via IRSA.
The Two-Values Pattern: Platform vs. Client
The most important design decision we made early on is how values are split across two files per site. Every site has:
- platform.values.yaml — owned by the platform team, committed to the platform repo, contains infrastructure bindings
- client.values.yaml — owned by the client team, committed to their own repo (drupal-site-<name>), contains app config
# gitops/sites/site-alpha/platform.values.yaml (platform team owns)
site:
  name: site-alpha
  tier: medium
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/drupal-platform-site-site-alpha
db:
  secretRef: site-alpha-db
percona:
  enabled: true
  storage: 10Gi
image:
  appPath: /opt/drupal   # standard drupal:10.x image layout
  webRoot: /var/www/html # Apache DocumentRoot
ingress:
  ingressGroup: drupal-platform

# client.values.yaml (client team owns, auto-updated by CI)
image:
  repository: docker.io/library/drupal
  tag: "10.3-php8.3-apache"
ingress:
  host: alpha.example.com
replicaCount: 2
cron:
  enabled: true
  schedule: "*/15 * * * *"
The Argo CD ApplicationSet stitches these together using the multi-source feature (requires Argo CD >= 2.6):
sources:
  - repoURL: https://github.com/org/drupal-platform
    path: gitops/charts/drupal-site
    helm:
      releaseName: '{{ .path.basename }}'
      valueFiles:
        - $platform/gitops/sites/{{ .path.basename }}/platform.values.yaml
        - $client/client.values.yaml
  - repoURL: https://github.com/org/drupal-platform
    path: gitops/sites/{{ .path.basename }}
    ref: platform
  - repoURL: https://github.com/org/drupal-site-{{ .path.basename }}
    targetRevision: HEAD
    ref: client
Adding a new site is: create gitops/sites/<new-site>/platform.values.yaml in the platform repo, and Argo CD auto-discovers it. No ApplicationSet editing required.
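Because onboarding is just files in git, it can be scripted end to end. A minimal sketch, assuming a hypothetical new site called site-delta (the tier and values inside are illustrative, not a real site's config):

```shell
# Sketch: scaffold the platform values file for a hypothetical site
# "site-delta", following the gitops/sites/ convention described above.
set -eu
mkdir -p gitops/sites/site-delta
cat > gitops/sites/site-delta/platform.values.yaml <<'EOF'
site:
  name: site-delta
  tier: small
db:
  secretRef: site-delta-db
percona:
  enabled: true
  storage: 10Gi
EOF
# Commit and push; the ApplicationSet's git directory generator picks up
# the new folder on its next reconciliation -- no ApplicationSet edit needed.
echo "scaffolded: $(ls gitops/sites/site-delta)"
```

After the push, Argo CD renders the chart with this file plus the client repo's client.values.yaml and creates the namespace, database CR, and deployment.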
The rule is absolute: secrets never go in values files. Secrets flow from AWS Secrets Manager → External Secrets Operator → Kubernetes Secret → envFrom.secretRef in the pod.
Database: Percona Server for MySQL Operator
We picked the Percona Server for MySQL Operator (PSM) over vanilla MySQL operators for a few reasons: better Kubernetes-native lifecycle management, built-in backup hooks, and ProxySQL support for when we scale to primary+replica per site.
We explicitly avoided Percona XtraDB Cluster (PXC/Galera). Drupal's DDL operations (ALTER TABLE during drush updb) cause cluster-wide locking in Galera clusters, and the minimum 3-node requirement would cost us 3x the MySQL instances for zero benefit at this scale.
Each site gets a PerconaServerMySQL CR:
apiVersion: ps.percona.com/v1alpha1
kind: PerconaServerMySQL
metadata:
  name: site-alpha-mysql
  namespace: drupal-site-alpha
spec:
  crVersion: 0.9.0
  unsafeFlags:
    mysqlSize: true    # allows size=1 (single node)
    orchestrator: true # bypasses Orchestrator requirement
    proxy: true
  mysql:
    size: 1
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 1
        memory: 1Gi
    volumeSpec:
      persistentVolumeClaim:
        storageClassName: gp3
        resources:
          requests:
            storage: 10Gi
    configuration: |
      [mysqld]
      max_allowed_packet=256M
      innodb_buffer_pool_size=256M
      sql_mode=STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION
  secretsName: site-alpha-mysql-secrets
The unsafeFlags are essential. Without them, PSM 0.9.0 enforces size >= 2 and requires Orchestrator for async replication, neither of which we want for a single-node Phase 1 install.
Secrets: ESO and the Standard Contract
Every site's DB credentials live at drupal/<site-name>/db in AWS Secrets Manager with a fixed key schema: host, port, name, username, password, driver.
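For reference, the payload stored under that path is plain JSON with exactly those six keys. A sketch with dummy values (the aws CLI upload is shown as a comment only, since it needs real AWS credentials; the host and password here are made up):

```shell
# Sketch: the fixed key schema for drupal/<site>/db, with dummy values.
set -eu
cat > /tmp/site-alpha-db.json <<'EOF'
{
  "host": "site-alpha-mysql.drupal-site-alpha.svc",
  "port": "3306",
  "name": "drupal",
  "username": "operator",
  "password": "CHANGE_ME",
  "driver": "mysql"
}
EOF
# The real upload would be along the lines of (not run here):
#   aws secretsmanager create-secret --name drupal/site-alpha/db \
#     --secret-string file:///tmp/site-alpha-db.json
for key in host port name username password driver; do
  grep -q "\"$key\"" /tmp/site-alpha-db.json || { echo "missing $key"; exit 1; }
done
echo "schema ok"
```

Keeping the schema fixed is what lets one ExternalSecret template serve all 100 sites.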
The ExternalSecret CR syncs them into a Kubernetes secret with the DB_* env var convention Drupal's settings.php expects:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: site-alpha-db
  namespace: drupal-site-alpha
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-store
    kind: ClusterSecretStore
  target:
    name: site-alpha-db
  data:
    - secretKey: DB_HOST
      remoteRef:
        key: drupal/site-alpha/db
        property: host
    - secretKey: DB_USER
      remoteRef:
        key: drupal/site-alpha/db
        property: username
    - secretKey: DB_PASSWORD
      remoteRef:
        key: drupal/site-alpha/db
        property: password
    # ... DB_NAME, DB_PORT, DB_DRIVER
The secret is injected into the Drupal deployment via envFrom.secretRef. The settings.php (either mounted from a ConfigMap for standard images, or baked into the custom image) reads credentials with getenv('DB_HOST') etc.
Two Types of Drupal Sites
Standard Sites (Most Clients)
Standard sites use the official drupal:10.3-php8.3-apache image directly. The Helm chart generates a settings.php from a ConfigMap template and mounts it at /var/www/html/sites/default/settings.php. Drush is installed at runtime during the initial install Job via composer require drush/drush.
The flat image layout means appPath and webRoot are both /var/www/html — no subdirectory to worry about.
Custom Image Sites (Clients with Custom Modules)
When a client has their own module development, they get a separate Git repo (drupal-site-<name>) containing their full Drupal codebase. CI builds a custom Docker image and pushes it to ECR. The key differences:
- drupal/recommended-project layout: web/ is the subdirectory DocumentRoot
- appPath: /var/www/html, webRoot: /var/www/html/web
- settings.php is baked into the image (not mounted from ConfigMap)
- Drush is already in vendor/bin/drush — no runtime composer install
- CI updates image.tag in client.values.yaml and commits; Argo CD picks up the change
The Dockerfile uses a multi-stage build:
# Stage 1: Composer build (Alpine-based composer:2.7 image)
FROM composer:2.7 AS composer-build
WORKDIR /app
COPY composer.json composer.lock ./
RUN composer install \
    --no-dev \
    --no-interaction \
    --prefer-dist \
    --optimize-autoloader \
    --ignore-platform-reqs   # <-- see Problem #2 below
COPY . .

# Stage 2: Runtime
FROM drupal:10.3-php8.3-apache
RUN rm -rf /var/www/html/*
COPY --from=composer-build --chown=www-data:www-data /app /var/www/html
The Helm chart templates detect which type of site they're dealing with using this flag:
{{- $isCustomImage := ne (.Values.image.appPath | default "/opt/drupal") "/opt/drupal" }}
When $isCustomImage is true, the chart skips: the ConfigMap settings.php volume, the volumeMount for settings.php, and the composer require drush step in the install Job.
Resource Isolation
Every namespace gets a ResourceQuota and LimitRange. We defined small/medium/large tiers — site.tier in platform.values.yaml controls which tier applies:
small: 2 CPU / 4Gi RAM total; 500m/1Gi per container
medium: 4 CPU / 8Gi RAM total; 1/2Gi per container
large: 8 CPU / 16Gi RAM total; 2/4Gi per container
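The tier-to-quota mapping is a straight lookup; a small shell sketch mirroring the table above (the tier_quota helper is ours for illustration, not part of the chart — in the real chart this lives in Helm template logic keyed on site.tier):

```shell
# Sketch: map site.tier to ResourceQuota totals (cpu / memory),
# mirroring the small/medium/large table above.
tier_quota() {
  case "$1" in
    small)  echo "cpu=2 memory=4Gi" ;;
    medium) echo "cpu=4 memory=8Gi" ;;
    large)  echo "cpu=8 memory=16Gi" ;;
    *)      echo "unknown tier: $1" >&2; return 1 ;;
  esac
}
tier_quota medium   # -> cpu=4 memory=8Gi
```

The platform team changes a site's tier by editing one line in platform.values.yaml; the quota and limit range follow on the next sync.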
NetworkPolicies use a default-deny pattern with explicit allows:
# default-deny all
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
# allow ingress from ALB controller only
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: kube-system
        podSelector:
          matchLabels:
            app.kubernetes.io/name: aws-load-balancer-controller
---
# allow egress to own MySQL pod only
egress:
  - to:
      - podSelector:
          matchLabels:
            app.kubernetes.io/instance: site-alpha-mysql
    ports:
      - port: 3306
---
# allow egress to DNS
egress:
  - to:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: kube-system
        podSelector:
          matchLabels:
            k8s-app: kube-dns
    ports:
      - port: 53
        protocol: UDP
The install Job and CronJob pods get an extra NetworkPolicy allowing port 443 egress, because the standard site install Job needs to reach packagist.org and github.com for composer require drush/drush.
Pod Security Standards are set at the namespace level with restricted enforced. Every container in the platform must declare:
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop: [ALL]
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
Node placement is controlled by placement.dedicated in platform.values.yaml. Default (false) lands pods on the shared node group with taint dedicated=shared:NoSchedule. Setting it to true targets a dedicated node group with taint dedicated=client-<name>:NoSchedule — used for clients who require physical compute isolation.
Node Groups: Shared vs Dedicated
Not all 100 clients have the same isolation requirements. Most are fine sharing compute with other tenants — namespace isolation, NetworkPolicies, and ResourceQuotas are sufficient. A small number need contractual physical compute isolation (regulated industries, security-conscious clients).
We handle this with two node group types and Kubernetes taints:
Shared node group (dedicated=shared:NoSchedule) — default for all sites. Cost-efficient, bin-packed. All standard tenants land here unless overridden.
Dedicated node groups (dedicated=client-<name>:NoSchedule) — one per client who needs it. The node group has a unique taint so only that client's pods can tolerate it. No other tenant's pods will ever be scheduled there.
Node placement is controlled by a single flag in platform.values.yaml:
# Shared (default — most clients)
placement:
  dedicated: false

# Dedicated node group for this client
placement:
  dedicated: true
The Helm chart translates this into the correct nodeSelector and tolerations automatically. The platform team controls placement — clients never touch it.
# What the chart generates for a dedicated placement
nodeSelector:
  role: client-acme
tolerations:
  - key: dedicated
    operator: Equal
    value: client-acme
    effect: NoSchedule

# What the chart generates for shared placement
nodeSelector:
  role: shared
tolerations:
  - key: dedicated
    operator: Equal
    value: shared
    effect: NoSchedule
The taint on shared nodes (dedicated=shared:NoSchedule) is intentional — it prevents pods without an explicit toleration from landing there. Every pod in the platform declares its toleration. No pod accidentally ends up on the wrong node group.
At 100 sites, the economics roughly work out as: 90+ clients on the shared group (2–3 nodes, well bin-packed), 5–10 clients with dedicated nodes (1 node each). The dedicated nodes are slightly under-utilised by design — that's the cost of the isolation guarantee.
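A quick back-of-envelope check of that sizing, under assumptions that are ours rather than measured (8 pods per site, an assumed ceiling of ~250 schedulable pods per node with prefix delegation and a raised max-pods; real sizing is usually CPU/memory bound, not pod-count bound):

```shell
# Sanity-check the shared node group estimate. All inputs here are
# illustrative assumptions, not measurements from the cluster.
sites=90            # clients on the shared group
pods_per_site=8
pods_per_node=250   # assumed schedulable ceiling per node
pods=$((sites * pods_per_site))
nodes=$(( (pods + pods_per_node - 1) / pods_per_node ))   # ceiling division
echo "shared pods: $pods"    # 720
echo "nodes needed: $nodes"  # 3
```

That lands in the same 2–3 node range the article quotes; the dedicated nodes sit outside this arithmetic entirely.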
Observability
The monitoring namespace runs kube-prometheus-stack (Prometheus + Alertmanager + Grafana), Loki, and Tempo. The OpenTelemetry Operator manages all collector deployments via CRDs.
We run two collector tiers:
OTel Agent (DaemonSet) — one per node, collects:
- Container logs via the filelog receiver from /var/log/pods
- Host metrics via the hostmetrics receiver
- Traces via the OTLP receiver (pods send traces to http://$(NODE_IP):4318)
OTel Gateway (Deployment) — fan-out to:
- Loki (logs via OTLP)
- Tempo (traces via OTLP)
- Prometheus (metrics via remote write)
Every Drupal pod is annotated for Apache HTTPD auto-instrumentation. The OTel Operator injects the apache-httpd module which instruments at the HTTP request level — method, URL, status code, and latency — without requiring any PHP SDK changes:
annotations:
  instrumentation.opentelemetry.io/inject-apache-httpd: "monitoring/drupal-instrumentation"
Each site namespace also gets a mysqld-exporter Deployment with a ServiceMonitor for per-site MySQL metrics in Prometheus.
The OTel operator injects several fields (spec.args, spec.configVersions, spec.ipFamilyPolicy, etc.) that aren't in git, which causes perpetual sync churn in Argo CD. We handle this with ignoreDifferences on the Argo CD Application for the collectors:
ignoreDifferences:
  - group: opentelemetry.io
    kind: OpenTelemetryCollector
    jqPathExpressions:
      - .spec.args
      - .spec.configVersions
      - .spec.ipFamilyPolicy
      - .spec.managementState
Drupal Lifecycle via Kubernetes Jobs
Site install runs as a Kubernetes Job annotated as an Argo CD PostSync hook:
annotations:
  argocd.argoproj.io/hook: PostSync
  argocd.argoproj.io/hook-delete-policy: HookSucceeded
The Job copies the Drupal codebase to a temporary directory (working around the read-only root filesystem and file permission issues — more on this in the problems section), then runs drush site:install.
Regular Drupal maintenance (drush updb, drush cim, drush cr) runs as on-demand Jobs via GitOps. Drupal cron is a Kubernetes CronJob per site. The schedule is client-configurable via client.values.yaml.
The 11 Real Problems We Hit
This is the section you actually came here for.
Problem 1: StatefulSet Pod Stuck on Old Taint Toleration
We updated the taint toleration in a PerconaServerMySQL CR (changing from a dedicated node taint to the shared node taint). The operator updated the StatefulSet spec, but the existing pod kept running on the old node with the old toleration. Kubernetes does not evict StatefulSet pods when the pod template changes.
Fix: kubectl delete pod site-alpha-mysql-0 -n drupal-site-alpha
The operator recreates the pod with the new spec. This is expected StatefulSet behavior — worth knowing before you spend 20 minutes wondering why your operator "isn't working."
Problem 2: composer:2.7 Alpine Missing ext-gd
drupal/core-recommended declares a requirement for ext-gd. The composer:2.7 image is Alpine-based and does not have ext-gd (or ext-imagick, ext-zip, etc.). Running composer install without flags fails immediately.
The fix that doesn't work: installing php-gd into the Alpine build stage. It's possible but fragile and bloats the build.
Fix: --ignore-platform-reqs
RUN composer install \
    --no-dev \
    --prefer-dist \
    --optimize-autoloader \
    --ignore-platform-reqs
The runtime stage is drupal:10.3-php8.3-apache which ships with all the PHP extensions Drupal needs. Composer's platform requirement check is irrelevant for the build stage — we only care whether the runtime has the extensions, and it does.
Problem 3: Apache RewriteBase in VirtualHost Crashes Apache
We had a custom Apache VirtualHost config for the web/ subdirectory DocumentRoot. It included:
<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteBase /
  RewriteRule ^index\.php$ - [L]
  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteCond %{REQUEST_FILENAME} !-d
  RewriteRule . /index.php [L]
</IfModule>
Apache crashed on startup. The pod went into CrashLoopBackOff. kubectl logs showed: RewriteBase: only valid in per-directory config.
RewriteBase is a .htaccess directive — it's only valid in <Directory> context, not directly inside a <VirtualHost>. Putting it inside <IfModule mod_rewrite.c> at VirtualHost scope doesn't change that.
Fix: Remove the entire <IfModule mod_rewrite.c> block from the VirtualHost. Set AllowOverride All on the <Directory> block and let Drupal's own .htaccess handle rewrites:
<VirtualHost *:80>
  DocumentRoot /var/www/html/web
  <Directory /var/www/html/web>
    Options Indexes FollowSymLinks
    AllowOverride All
    Require all granted
  </Directory>
</VirtualHost>
Drupal ships a .htaccess in its web/ directory that handles the clean URL rewrites correctly. No need to duplicate it in the VirtualHost.
Problem 4: settings.php Served as Plain Text
We created settings.php inside a Dockerfile using shell echo commands. The pod deployed, but when we checked the site, everything came back as plain text — literally the PHP code rendered as a string in the browser.
The file started with a bare echo of PHP code, but the very first character wasn't <?php. PHP's Apache module passes non-PHP content straight through to the response without processing it.
Fix: Always start with printf '<?php\n':
RUN printf '<?php\n' \
      > /var/www/html/web/sites/default/settings.php && \
    echo "\$settings['trusted_host_patterns'] = ['^.*$'];" \
      >> /var/www/html/web/sites/default/settings.php && \
    echo "if (file_exists(\$app_root . '/' . \$site_path . '/settings.platform.php')) {" \
      >> /var/www/html/web/sites/default/settings.php && \
    echo "  include \$app_root . '/' . \$site_path . '/settings.platform.php';" \
      >> /var/www/html/web/sites/default/settings.php && \
    echo "}" >> /var/www/html/web/sites/default/settings.php && \
    chown www-data:www-data /var/www/html/web/sites/default/settings.php
Use printf for the first line (reliable \n handling), echo for subsequent appends. And always chown at the end of the same RUN layer — more on why in Problem 6.
Problem 5: Wrong PVC Mount Path for Custom Image
Standard drupal:10.x-apache images have a flat layout: /var/www/html/sites/default/files is the public files directory. For the custom image using drupal/recommended-project, the DocumentRoot is /var/www/html/web/, so the correct path is /var/www/html/web/sites/default/files.
We initially mounted the PVC at the standard path. Drupal started, but file uploads silently failed because the actual writable directory the app was looking at (web/sites/default/files) was inside the container image — not on the PVC.
Fix: Add image.webRoot as an explicit Helm value, and use it as the PVC mount path:
# platform.values.yaml for custom image sites
image:
  appPath: /var/www/html
  webRoot: /var/www/html/web   # PVC is mounted at <webRoot>/sites/default/files
In the Helm deployment template:
{{- $webRoot := .Values.image.webRoot | default "/var/www/html" }}
# ...
volumeMounts:
  - name: drupal-files
    mountPath: {{ $webRoot }}/sites/default/files
Problem 6: settings.php Not Writable During drush site:install
Dockerfile RUN instructions run as root (unless you've already set USER). We COPY'd the codebase with --chown=www-data:www-data but then created settings.php in a separate RUN step which ran as root, leaving settings.php owned by root.
The container runs as www-data (UID 33). When drush site:install tried to write database configuration to settings.php, it got permission denied.
There were two things to fix:
Fix 1: chown settings.php at the end of the same RUN block that creates it:
RUN printf '<?php\n' > settings.php && \
# ... other echo commands ... \
chown www-data:www-data settings.php
Fix 2: Even with correct ownership, the Pod Security restricted profile causes issues with drush trying to write to the image filesystem. We work around this in the install Job by copying the codebase to a writable temp directory:
WORK=$(mktemp -d)
cp -r /var/www/html/web ${WORK}/web
cp -r /var/www/html/vendor ${WORK}/vendor
# The copy is owned by the current user (www-data), so it's writable
${WORK}/vendor/bin/drush --root=${WORK}/web site:install standard --yes
This pattern also plays nicely with readOnlyRootFilesystem: false — we can't fully enforce a read-only root because Drupal writes Twig cache to the filesystem, but we limit writable surface area to what's intentional.
Problem 7: MySQL super_read_only Blocks drush site:install
This one took the longest to debug. We had Percona MySQL running, the PerconaServerMySQL CR was healthy, the ExternalSecret had synced credentials, and the drush install Job started — but it immediately failed with:
SQLSTATE[HY000]: General error: 1290 The MySQL server is running with the
--read-only option so it cannot execute this statement
Digging into the PSM operator behavior: PSM sets super_read_only=ON in node.cnf. On a properly configured Percona cluster, Group Replication would detect the primary and release super_read_only. But on a single-node setup without group_replication_group_name injected, Group Replication is OFFLINE. Nothing ever turns off super_read_only.
We tried creating a custom drupal user with GRANT ALL ON drupal.* TO 'drupal'@'%'. That user still can't write when super_read_only=ON, because GRANT ALL on a specific database doesn't include CONNECTION_ADMIN or SUPER which would bypass read_only.
Fix — Part 1: Use the operator user that PSM creates automatically. It has CONNECTION_ADMIN privilege, which bypasses read_only=ON. We store the operator user's password (from the PSM-created secret site-alpha-mysql-secrets) as the DB_PASSWORD in AWS Secrets Manager.
Fix — Part 2: Before running drush site:install on a fresh single-node install, manually disable super_read_only:
kubectl exec -it site-alpha-mysql-0 -n drupal-site-alpha -- \
  mysql -u root -p"$ROOT_PASSWORD" \
  -e "SET GLOBAL super_read_only=0;"
The root password is in the PSM-created secret:
kubectl get secret site-alpha-mysql-secrets \
  -n drupal-site-alpha \
  -o jsonpath='{.data.root}' | base64 -d
Fix — Part 3 (Permanent): The manual override only lasts until the next MySQL pod restart — node.cnf re-applies super_read_only=ON every time. To make the fix permanent, add the flags directly to the configuration block in the PerconaServerMySQL CR (via the Helm chart's perconamysql.yaml template):
configuration: |
  [mysqld]
  max_allowed_packet=256M
  innodb_buffer_pool_size=256M
  sql_mode=STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION
  # PSM sets super_read_only=ON at startup. On a single-node cluster Group
  # Replication is OFFLINE so the node is never promoted to primary and the
  # flag is never cleared automatically. Override here so Drupal can write.
  super_read_only=0
  read_only=0
This lives in git alongside the rest of the chart — no manual intervention needed after restarts.
Problem 8: kubectl run Blocked by Restricted Pod Security Standard
Early in debugging the MySQL connection, we tried to spin up a quick throwaway pod:
kubectl run -it --rm debug --image=mysql:8.0 \
  -n drupal-site-alpha -- bash
The pod was immediately rejected by the PodSecurity admission controller:
Error from server (Forbidden): pods "debug" is forbidden:
violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false
(container "debug" must set securityContext.allowPrivilegeEscalation=false),
unrestricted capabilities ...
kubectl run does not set any security context by default. The restricted PSS blocks any pod that doesn't explicitly declare the full security context.
Fix Option 1 (preferred): Use kubectl exec on an existing running pod in the namespace:
kubectl exec -it deployment/drupal -n drupal-site-alpha -- bash
Fix Option 2: Pass the full overrides with kubectl run:
kubectl run -it --rm debug \
--image=mysql:8.0 \
-n drupal-site-alpha \
--overrides='{"spec":{"securityContext":{"runAsNonRoot":true,"runAsUser":1000,"seccompProfile":{"type":"RuntimeDefault"}},"containers":[{"name":"debug","image":"mysql:8.0","securityContext":{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]}}}]}}' \
-- bash
In practice, we always use kubectl exec on an existing pod now. Option 2 works but the JSON override is cumbersome.
Problem 9: Git Push Rejected After CI Commits
Our CI pipeline (GitHub Actions with GitHub OIDC — no static AWS keys) builds the custom image, pushes to ECR, then commits the new image.tag back to the client's client.values.yaml in their repo. After the CI commit, the next time a developer tries to push their own branch, they get:
! [rejected] main -> main (fetch first)
error: failed to push some refs to 'github.com:...'
hint: Updates were rejected because the remote contains work that you do not have locally.
The CI commit is ahead of the developer's local HEAD.
Fix: Always git pull --rebase && git push. Add it to your .gitconfig as the default:
git config pull.rebase true
Or make the CI pipeline create commits on a separate branch and open a PR, then merge. We went with the git pull --rebase convention for Phase 1 since the client repos are small teams.
Problem 10: Percona Operator Telemetry Loop in Logs
After deploying everything and getting all three sites healthy, we noticed the Percona operator was flooding its logs with repeated failed connection attempts:
time="..." level=error msg="Failed to get component versions from Percona"
url="https://check.percona.com/versions/v1/ps-operator/..."
error="Get https://check.percona.com/...: context deadline exceeded"
This is the Percona telemetry/version-check service. The operator is trying to reach check.percona.com on port 443, but our default-deny NetworkPolicy in the percona-operator namespace blocks all egress that isn't explicitly allowed.
Fix: strictly speaking, none needed. This is cosmetic: the operator reconciliation loop for your actual MySQL instances works perfectly, and the error only affects Percona's version-check telemetry. The loop continues indefinitely but has no functional impact.
If you want to silence it, add an egress NetworkPolicy in the percona-operator namespace allowing TCP 443 to check.percona.com. Or disable telemetry if the operator version supports it via --disable-telemetry in the operator Deployment args. We left it logging for now — it's on the backlog, not the critical path.
Problem 11: Running Out of Pod IPs — VPC CNI Custom Networking
Once the cluster had two or three sites running we did the IP arithmetic for 100 sites. Each site has 2 Drupal replicas + 1 MySQL pod + 1 cron pod + occasional Job pods. Call it ~8 pods per site across 100 sites: ~800 pods plus system and platform pods. Add the prefix delegation pre-allocation (WARM_PREFIX_TARGET=1, which reserves a /28 = 16 IPs per node), and our three /24 node subnets (~250 IPs each, 750 total) would be exhausted well before we hit 100 sites.
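The arithmetic is worth writing down explicitly; a sketch using the estimates above (AWS reserves 5 addresses per subnet, so a /24 yields 251 usable IPs, and prefix-delegation warm pools make the effective demand per node higher still):

```shell
# Reproduce the capacity estimate: ~8 pods/site at 100 sites versus
# three /24 node subnets (AWS reserves 5 addresses per subnet).
sites=100
pods_per_site=8
demand=$((sites * pods_per_site))   # 800 pod IPs needed
per_subnet=$((256 - 5))             # 251 usable IPs per /24
supply=$((3 * per_subnet))          # 753 across three AZs
echo "demand=$demand supply=$supply"
[ "$demand" -gt "$supply" ] && echo "exhausted before 100 sites"
```

And that is before counting system pods, platform pods, or the /28 prefixes each node pre-reserves.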
The instinctive response is "switch to Calico." We considered it and decided against it. Replacing aws-cni means losing ENI-based security groups, IRSA at the pod level becomes trickier to reason about, and the migration risk on a running cluster is real. The cheaper fix is to keep aws-cni and just give pod ENIs their own, much larger subnets.
The fix: VPC CNI custom networking with dedicated /19 pod subnets.
When AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true is set on the vpc-cni addon, IPAMD reads ENIConfig CRDs (one per AZ) and attaches pod ENIs to those subnets rather than the node's primary subnet. Node IPs stay on the original /24s. Pod IPs move to /19s.
/24 node subnets (10.0.1.0/24 through 10.0.3.0/24) → node ENIs stay here
/19 pod subnets (10.0.32.0/19, 10.0.64.0/19, 10.0.128.0/19) → pod ENIs moved here
A /19 spans 8192 addresses; AWS reserves five per subnet, leaving 8187 usable. Across three AZs that's roughly 24,500 pod IPs. Problem solved.
The Terraform to create the subnets and ENIConfig objects per AZ:
# Dedicated pod subnets — one per AZ
resource "aws_subnet" "pod" {
  count             = length(var.azs)
  vpc_id            = module.vpc.vpc_id
  cidr_block        = var.pod_subnets[count.index] # ["10.0.32.0/19", "10.0.64.0/19", "10.0.128.0/19"]
  availability_zone = var.azs[count.index]
}

resource "aws_route_table_association" "pod" {
  count          = length(var.azs)
  subnet_id      = aws_subnet.pod[count.index].id
  route_table_id = module.vpc.private_route_table_ids[0]
}

# ENIConfig CRDs — one per AZ, named after the AZ so the label
# topology.kubernetes.io/zone matches automatically
resource "null_resource" "eni_config" {
  count = length(var.azs)

  provisioner "local-exec" {
    command = <<-EOF
      kubectl apply -f - <<'YAML'
      apiVersion: crd.k8s.amazonaws.com/v1alpha1
      kind: ENIConfig
      metadata:
        name: ${var.azs[count.index]}
      spec:
        subnet: ${aws_subnet.pod[count.index].id}
        securityGroups:
          - ${module.eks.cluster_primary_security_group_id}
      YAML
    EOF
  }

  depends_on = [module.eks, aws_subnet.pod]
}
Four non-obvious pitfalls we hit:
Pitfall 1 — CIDR alignment. /19 boundaries must fall on multiples of 32 in the third octet: 10.0.0.0/19, 10.0.32.0/19, 10.0.64.0/19, 10.0.96.0/19, etc. We initially tried 10.0.16.0/19 — AWS rejected it with a CIDR block overlap error. The fix is mechanical: pick third-octet values that are multiples of 32.
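The alignment rule can be checked mechanically before touching AWS; a small sketch (the aligned_19 helper is ours for illustration, pure arithmetic, no AWS calls):

```shell
# A /19 covers 2^(32-19) = 8192 addresses, i.e. 32 consecutive values of
# the third octet, so a /19 network address's third octet must be a
# multiple of 32.
aligned_19() {
  third=$(echo "$1" | cut -d. -f3)
  [ $((third % 32)) -eq 0 ]
}
aligned_19 10.0.32.0 && echo "10.0.32.0/19 aligned"
aligned_19 10.0.16.0 || echo "10.0.16.0/19 misaligned"
```

Running the check on candidate CIDRs in CI would have caught our 10.0.16.0/19 mistake before Terraform ever called the AWS API.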
Pitfall 2 — Apply ENIConfig objects BEFORE enabling custom networking. If you enable AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true on the addon first, IPAMD on each node immediately looks for its AZ's ENIConfig object. If it doesn't exist, IPAMD crashes into CrashLoopBackOff and the addon update times out after 20 minutes. The right sequence is: apply ENIConfig objects → then update the addon. With null_resource Terraform handles this via depends_on, but if you're doing it manually, use kubectl apply for the ENIConfigs first.
Pitfall 3 — Wrong CRD API group. The ENIConfig CRD API group is crd.k8s.amazonaws.com/v1alpha1, not cni.aws.com/v1alpha1 or cni.k8s.amazonaws.com/v1alpha1. If you get no matches for kind ENIConfig from kubectl, verify the exact group with:
kubectl get crd | grep -i eni
# → eniconfigs.crd.k8s.amazonaws.com
Pitfall 4 — kubernetes_manifest fails at plan time. The kubernetes_manifest Terraform resource validates the CRD schema against the live cluster API at terraform plan. ENIConfig is only registered after the VPC CNI addon finishes initialising custom networking — which hasn't happened yet on the first apply. The resource fails at plan time on a fresh cluster. Solution: use null_resource + local-exec kubectl instead (shown above), which has no plan-time API validation.
After terraform apply, existing nodes must be recycled (drain + terminate) for the new networking to take effect. New node replacements launch with the updated CNI config and pods get IPs from the /19 subnets. Argo CD re-syncs and everything comes back up cleanly.
What's Next
Phase 1 validated every fundamental. Here's the Phase 2 roadmap:
Autoscaling. HPA is already supported by the Helm chart's replicaCount. Karpenter for node autoscaling replaces managed node groups and reduces over-provisioning at scale.
S3 for files. PVC-backed files work for Phase 1, but S3 + s3fs-fuse or a Drupal S3 module (with IAM via IRSA per site) is the right answer at 100 sites. Network egress policies will need an S3 gateway endpoint entry.
Vault. External Secrets Operator + AWS Secrets Manager works well, but at 100 sites the secret path conventions get unwieldy. HashiCorp Vault with the Vault Operator gives us dynamic credentials, lease rotation, and a better audit trail.
Primary+replica MySQL. PSM makes scaling from size: 1 to size: 2 (primary + replica) trivial — just change a single value. ProxySQL (bundled with PSM) handles read/write splitting transparently.
WAF. AWS WAF on the shared ALB with OWASP managed rules and per-site rate limiting. It's configured but not enforced in Phase 1.
Restore testing. Velero backs up PVCs cluster-wide. Restore testing is mandatory before Phase 2 goes live — we have runbooks but haven't done a full site restore drill yet.
Key Takeaways
The architecture is sound. The Helm chart, two-values split, ApplicationSet, and ESO/Secrets Manager pattern work exactly as designed. Every problem we hit was operational — MySQL permission model subtleties, container image build quirks, Apache config syntax, security context requirements.
The two things that saved the most time were: (1) reading the Percona operator source to understand the operator user's CONNECTION_ADMIN privilege, and (2) understanding that RewriteBase is a per-directory directive. Neither of these is in the Kubernetes docs.
If you're building something similar, the PSM unsafeFlags + super_read_only combination is the most non-obvious pitfall. Save yourself the debugging time: use the operator user, and manually run SET GLOBAL super_read_only=0 before your first drush site:install on a single-node PSM instance.
The full platform repo is public: github.com/YOUR_GITHUB_ORG/drupal-platform. Every file referenced in this article — the Helm chart templates, Argo CD ApplicationSet, Terraform modules, Percona MySQL CRDs, OTel collector configs — is there in working form.
To use it yourself, clone the repo and do a global search-and-replace on three placeholders before running anything:
YOUR_AWS_ACCOUNT_ID → your 12-digit AWS account number
YOUR_GITHUB_ORG → your GitHub organisation or username
YOUR_AWS_REGION → your target region (e.g. eu-west-1, us-east-1)
Copy terraform/terraform.tfvars.example to terraform/terraform.tfvars (which is gitignored), fill in the values, and you're ready to terraform init && terraform apply. Start with two or three sites, validate every fundamental, and from there the path to 100 is just a mkdir and a git push.
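The placeholder replacement itself is one sed pass; a sketch against a throwaway file (the substituted account ID, org, and region below are made-up example values, not anything real):

```shell
# Sketch: replace the three repo placeholders in a scratch copy.
# 123456789012 / example-org / eu-west-1 are illustrative values only.
set -eu
mkdir -p /tmp/platform-demo
cat > /tmp/platform-demo/sample.tf <<'EOF'
account = "YOUR_AWS_ACCOUNT_ID"
org     = "YOUR_GITHUB_ORG"
region  = "YOUR_AWS_REGION"
EOF
cd /tmp/platform-demo
for f in $(grep -rlE 'YOUR_AWS_ACCOUNT_ID|YOUR_GITHUB_ORG|YOUR_AWS_REGION' .); do
  sed -i \
    -e 's/YOUR_AWS_ACCOUNT_ID/123456789012/g' \
    -e 's/YOUR_GITHUB_ORG/example-org/g' \
    -e 's/YOUR_AWS_REGION/eu-west-1/g' "$f"
done
cat sample.tf
```

Run the same loop from the repo root of your clone with your own values substituted.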


