The Problem We Were Actually Solving
Our real goal wasn't AI novelty; it was user retention spikes during live events. We needed the hunt engine to stay up while thousands of users simultaneously solved the same clue set. The vendor's Terraform module defaulted to CPU limits of 200m, which was fine for the LLM warm-up burst but starved the Python Flask worker that served the static assets. Every cold start of the LLM pod dropped the first clue request into a 500ms–1500ms latency window. Users on 4G clicked, nothing rendered, and we lost 12% of sign-ups within the first 90 seconds.
What We Tried First (And Why It Failed)
Our first fix was the obvious one: bump the CPU request to 500m and memory to 512Mi. The pods stopped crashing, but latency percentiles worsened. P95 jumped from 450ms to 1.2s because the LLM container could no longer burst its Hugging Face model into GPU memory. We had forgotten that our g4dn.xlarge nodes only had one GPU slice, and the autoscaler was scheduling two pods per node. The scheduler spread the pain, but now every clue API call had to serialize through a cold-model retry path.
Next, we tried a thread-local model cache inside the Flask worker. Simple in code, disastrous in prod: cache invalidation on model updates caused every worker to reload the 4GB model into RAM. The OOM killer evicted the pod when two simultaneous model updates happened. Our pager screamed again.
Finally, we tried GPU sharing via MIG slices (NVIDIA MIG 2g.10gb). The setup required kernel 5.19 and a custom runtimeClass named nvidia-mig-2g. Rolling this out via Argo CD took three evenings and a kernel upgrade on our EKS worker nodes. We discovered only after the fact that the MIG device plugin reported memory in bytes while the Kubernetes device plugin expected MiB, so the admitted memory was off by 1024x. The HPA fired a scale-up every time the model dumped its KV cache.
The Architecture Decision
We ripped out the vendor module entirely and rewrote the hunt backend as a multi-stage stateless service:
- Stage 1 – Static asset server (nginx + gzip) on port 8080, no GPU.
- Stage 2 – Lightweight API gateway (envoy) that routes /clue/{id} to either: a) a cached clue from Redis if < 60s old, or b) a transient LLM worker pod.
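The Stage 2 routing rule can be sketched in a few lines of Python. This is a minimal illustration, not our gateway config: the `ClueRouter` class, the `generate` callback standing in for the LLM worker, and the in-memory dict standing in for Redis are all hypothetical names made up for the example. Only the 60-second TTL comes from the setup above.

```python
import time
from typing import Callable, Dict, Tuple

CLUE_TTL_SECONDS = 60.0  # cached clues are served only while younger than this


class ClueRouter:
    """Serve a clue from cache while it is fresh, otherwise ask the LLM worker."""

    def __init__(
        self,
        generate: Callable[[str], str],  # hypothetical stand-in for the LLM worker call
        ttl: float = CLUE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._generate = generate
        self._ttl = ttl
        self._clock = clock
        # In-memory stand-in for Redis: clue_id -> (stored_at, clue_text)
        self._cache: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get_clue(self, clue_id: str) -> str:
        now = self._clock()
        entry = self._cache.get(clue_id)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1  # path (a): cached clue younger than the TTL
            return entry[1]
        self.misses += 1  # path (b): cache miss or stale entry, go to the worker
        text = self._generate(clue_id)
        self._cache[clue_id] = (now, text)
        return text
```

Injecting the clock keeps the TTL logic testable without sleeping, and the `hits` and `misses` counters are the raw material for a cache hit ratio.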
LLM workers are now ephemeral pods scheduled only when Redis misses. We set a hard limit of 1 worker per GPU slice to avoid the sharing nightmare. Each worker mounts an emptyDir volume backed by local NVMe (gp3 500MB) for model weights, just enough to keep the 4GB model resident without thrashing the node's memory.
We added a custom readiness probe that scrapes /healthz on the worker and exits non-zero if GPU memory usage > 80% for 3 seconds. The probe saved us during a batch of warm-up spikes when users spammed the same clue repeatedly.
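The "over 80% for 3 seconds" rule is a sustained-threshold check, which is easy to get subtly wrong. Here is a rough sketch of that logic, assuming the caller samples GPU memory usage as a fraction and supplies a timestamp. The `GpuMemoryGate` name and its interface are hypothetical, and how the fraction is actually read from the GPU is left out.

```python
from typing import Optional


class GpuMemoryGate:
    """Report not-ready once GPU memory stays above a threshold for a hold period."""

    def __init__(self, threshold: float = 0.80, hold_seconds: float = 3.0) -> None:
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self._over_since: Optional[float] = None  # when the current breach started

    def is_ready(self, used_fraction: float, now: float) -> bool:
        if used_fraction <= self.threshold:
            # Back under the threshold: forget any breach in progress.
            self._over_since = None
            return True
        if self._over_since is None:
            self._over_since = now  # first sample of a new breach
        # Still ready until the breach has lasted the full hold period.
        return (now - self._over_since) < self.hold_seconds
```

A /healthz handler could call `is_ready` on each scrape and return a failing status when it is false. Resetting the timer on any sample under the threshold means short spikes never trip the probe.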
Resource requests are locked:
- LLM worker: requests.memory=2Gi, limits.memory=3Gi, requests.nvidia.com/gpu=1, limits.nvidia.com/gpu-memory=10GiB.
- Static server: requests.memory=128Mi, limits.memory=256Mi, no GPU.
We moved the model weights out of the container image into a separate EFS volume so we could push updates without rebuilding images. The update pipeline now runs a canary rollout with 5% traffic for 60 seconds, watching P95 latency. If latency > 600ms or error rate > 1%, the rollout aborts automatically. We found this threshold after one incident where a model update introduced a new tokenizer that spiked token generation time from 80ms to 220ms.
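The abort rule boils down to two comparisons over the canary window. The sketch below is an illustration of that decision, not our pipeline code: `p95` and `should_abort` are hypothetical helpers, and the nearest-rank percentile method is an assumption, since the post does not say how P95 is computed. The 600ms and 1% thresholds are the ones stated above.

```python
from typing import Sequence

P95_LIMIT_MS = 600.0     # abort if canary P95 latency exceeds this
ERROR_RATE_LIMIT = 0.01  # abort if canary error rate exceeds 1%


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile (assumed method for this sketch)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) in integer arithmetic to avoid float rounding surprises
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_abort(latencies_ms: Sequence[float], error_count: int, request_count: int) -> bool:
    """Return True if the canary window breaches either threshold."""
    if request_count <= 0:
        raise ValueError("canary window saw no requests")
    error_rate = error_count / request_count
    return p95(latencies_ms) > P95_LIMIT_MS or error_rate > ERROR_RATE_LIMIT
```

For example, a window where 6 of 100 requests take 700ms trips the latency check, while a window with exactly 1 error in 100 requests sits on the threshold and passes.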
What The Numbers Said After
After six weeks:
- P95 clue latency dropped from 1.2s to 320ms.
- Error rate under peak load (12k concurrent users) stayed below 0.28%.
- GPU utilization on g4dn nodes never exceeded 65%, leaving headroom for other services.
- Cost per 1000 hints solved went from $0.14 to $0.09 because we stopped over-provisioning pods waiting for cold LLM starts.
The most surprising metric was cache hit ratio after 30 seconds of event start: 78%. The Redis TTL of 60s was exactly long enough to absorb the initial rush without staleness.
What I Would Do Differently
I would never again let a vendor module dictate resource topology. The module's defaults assumed stateless containers and gave us no knobs for GPU partitioning or cache TTL.
I would also standardize on PodDisruptionBudgets that reserve one GPU per AZ before any maintenance window. We learned that lesson the hard way when a rolling node replacement killed all workers in one AZ and forced users onto the remaining AZ, spiking latency to 2.8s.
Finally, I would rename the endpoint from /hunt/clue/{id} to /clue/{id}/hunt. REST semantics matter when your API is being scraped by bots. One bot kept hitting the clue endpoint with a 500ms head-of-line delay, and it turned out the upstream CDN cache key was wrong. Fixing the route cleaned up 30% of our tail latency without touching the model.