Building a Practical Lambda Capacity Provider Platform: Lessons Learned from Warm Pools, Version Hygiene, and CI/CD Reality
There is a big difference between a slide-deck architecture and a system you can trust in operation on a Monday morning.
This implementation captures that difference well. On paper, the idea is simple: create a shared AWS Lambda Managed Instances capacity provider, run latency-sensitive workloads on ARM64, keep the pool warm with EventBridge, prune old Lambda versions before they become operational debt, and wrap the whole thing in a GitHub Actions plus CodeBuild delivery model. In practice, each of those choices changes how you think about performance, cost, blast radius, and developer discipline.
What follows is not a generic cloud post. It is the kind of write-up you produce after actually building and living with the system.
The Real Problem We Were Solving
Traditional Lambda is excellent when you want abstraction and convenience. It becomes less elegant when your workload is sensitive to startup time, carries heavier dependencies, or needs more predictable execution behavior under bursty load.
That is where a Lambda capacity provider changes the discussion.
In this implementation, the platform is built around a shared aws_lambda_capacity_provider that uses ARM64 Graviton instances and auto scaling. The core idea is straightforward: instead of leaving execution placement entirely to the default Lambda fleet, we deliberately provide a managed compute pool that multiple functions can share. That gives us more control over cost-performance characteristics and lets us design around cold-start pain rather than merely complain about it.
The choice is visible in the Terraform:
- The provider runs on arm64
- Allowed instance types are constrained to m6g.large, m6g.xlarge, m7g.large, and m7g.xlarge
- Scaling is set to Auto
- The maximum pool ceiling is set to 64 vCPU
- The capacity provider is placed in the default VPC, with unsupported Availability Zones filtered out
That last point matters more than it first appears. The code explicitly excludes unsupported AZs such as us-east-1e, which is a good example of operational maturity: the happy path is not enough when the service itself has placement constraints.
How We Actually Created the Capacity Provider
One thing I wanted this platform to avoid was "concept architecture" with no implementation backbone. So the capacity provider here is not described abstractly. It is provisioned directly in Terraform and wired into the Lambda lifecycle in a fairly intentional way.
The build starts in terraform_file/agent_core_sync_cp.tf.
First, the capacity provider itself is created with aws_lambda_capacity_provider. The naming pattern ties it to the service and environment, which is the right instinct for multi-environment operation. The provider is tagged as shared compute for agent workloads, which matters later for discoverability and platform governance.
Second, the provider is placed inside the default VPC, but not blindly. In terraform_file/data.tf, the code:
- discovers the default VPC
- fetches the default subnets
- inspects subnet Availability Zones one by one
- excludes unsupported zones such as us-east-1e
- optionally caps how many subnets are used
This is a subtle but important design choice. Lambda Managed Instances often create one placement footprint per subnet or AZ. If you do not control subnet spread, you can end up creating more infrastructure surface area than you intended.
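The subnet-selection rules above are easy to state but worth pinning down precisely. Here is an illustrative Python sketch of the same filtering logic that the Terraform data sources express, written as a pure function so the rules can be tested in isolation. The function name and the dict shape are assumptions for illustration, not the repo's actual code.

```python
# Illustrative sketch of the data.tf subnet-filtering rules:
# drop subnets in unsupported AZs, sort for determinism, optionally cap the count.

def filter_supported_subnets(subnets, excluded_azs, max_subnets=None):
    """`subnets` is a list of dicts like {"id": "subnet-...", "az": "us-east-1a"}."""
    supported = [s for s in subnets if s["az"] not in excluded_azs]
    # Sort by AZ so repeated runs select the same subnets in the same order.
    supported.sort(key=lambda s: s["az"])
    if max_subnets is not None:
        supported = supported[:max_subnets]
    return supported


subnets = [
    {"id": "subnet-a", "az": "us-east-1a"},
    {"id": "subnet-e", "az": "us-east-1e"},  # unsupported for managed instances
    {"id": "subnet-b", "az": "us-east-1b"},
]
print(filter_supported_subnets(subnets, excluded_azs={"us-east-1e"}, max_subnets=2))
```

Keeping the exclusion list and the subnet cap in one place, as the Terraform does, means the placement footprint of the pool is a deliberate number rather than whatever the default VPC happens to contain.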
Third, the provider uses a dedicated security group rather than inheriting something vague and accidental. The current implementation keeps outbound traffic fully open and allows inbound HTTPS. That is permissive, but it is at least explicit and repeatable. Early-stage platforms benefit from that kind of clarity.
Fourth, the capacity provider gets its own operator role through AWSLambdaManagedEC2ResourceOperator. That is a critical detail. Capacity providers are not just Lambda resources; they need AWS to manage the EC2-backed execution infrastructure on your behalf. If you miss that role, the platform does not really exist, no matter how nice your Terraform looks.
Fifth, the instance requirements are opinionated. The code forces arm64 and narrows the fleet to supported Graviton M-family instance types. That is one of the better engineering decisions in this implementation because it converts an architectural preference into an enforceable runtime rule.
Finally, the Lambda function is attached to the capacity provider in terraform_file/lambda_clm_router_agent.tf through capacity_provider_config. That is where the abstraction becomes real. We are not just provisioning a pool and hoping someone uses it later. We are explicitly binding a published Lambda to that pool and tuning:
- memory GiB per vCPU
- max concurrency per execution environment
- ARM64 runtime alignment
- published versioning through Lambda aliases
That is the full loop: provision shared compute, constrain placement, grant AWS the operator role it needs, attach live functions to the pool, and then manage the resulting version sprawl with automation. That is what makes this feel like a platform artifact rather than a loose Terraform experiment.
Lesson 1: A Capacity Provider Is Not a Tuning Knob. It Is an Operating Model.
Teams often talk about capacity providers as if they are just a performance optimization. That framing is too shallow.
The moment you move Lambda onto managed instances, you are no longer only buying faster startup. You are adopting a new operating model with very clear implications:
- You now care about instance family compatibility
- You need to think about subnet strategy and AZ support
- You have to reason about pool scaling ceilings, concurrency, and memory per vCPU
- You are effectively blending serverless ergonomics with infrastructure accountability
This implementation shows that transition clearly. The CLM router Lambda is not just declared with a runtime and handler. It is attached to the shared capacity provider and explicitly tuned with:
- execution_environment_memory_gib_per_vcpu
- per_execution_environment_max_concurrency
- publish = true
- architectures = ["arm64"]
That is the tell. Once we start specifying how execution environments should behave, we are no longer simply "deploying a Lambda." We are shaping compute economics.
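To make "shaping compute economics" concrete, a back-of-envelope calculation helps. The 64 vCPU ceiling and the per-environment concurrency of 4 come from the implementation; the 2-vCPU-per-environment sizing below is an assumed illustrative value, since the real figure follows from the memory-per-vCPU tuning.

```python
# Rough concurrency ceiling implied by the pool knobs (illustrative arithmetic).

def pool_concurrency_ceiling(pool_vcpu_max, vcpu_per_env, max_concurrency_per_env):
    envs = pool_vcpu_max // vcpu_per_env      # execution environments the pool can host
    return envs * max_concurrency_per_env     # concurrent invocations at the cap

# 64 vCPU ceiling, assumed 2 vCPU per execution environment, 4 invocations each
print(pool_concurrency_ceiling(64, 2, 4))  # → 128
```

The point is not the exact number; it is that once these knobs exist, someone owns this arithmetic, and it should be written down rather than rediscovered during a load incident.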
The practical lesson here is simple: if you adopt Lambda Managed Instances, treat it like platform engineering, not like a runtime checkbox.
Lesson 2: ARM64 Delivers Real Value, but Only if You Respect Service Constraints
One of the strongest decisions in this implementation is the bias toward Graviton. For Python-heavy agent workloads, ARM64 is usually the right default. The economics are better, and the performance-per-dollar story is often compelling.
But there is an important nuance that the Terraform comments correctly capture: not every EC2 family you might expect is supported in the way you assume. This implementation explicitly avoids unsupported combinations and narrows the fleet to supported M-family Graviton instances.
That is a good lesson in cloud architecture generally: cloud products market flexibility, but production systems survive on constraint management.
The teams that do well with modern AWS services are not the ones that assume every SKU works. They are the ones that encode the service's real boundaries in Terraform so no one has to rediscover them during an incident window.
Lesson 3: Warmup Is Not a Hack. It Is a Deliberate Control Loop.
There is a tendency in engineering circles to treat "warming" as a slightly embarrassing workaround. I think that is the wrong mindset.
This implementation schedules the CLM router Lambda every five minutes through EventBridge. The handler itself is intentionally lightweight and effectively acts as a keep-alive mechanism. That is not laziness. It is an explicit decision to keep the shared pool alive for latency-sensitive traffic.
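A lightweight warmer handler can short-circuit scheduled invocations before any real work happens. The sketch below is a minimal, hedged illustration, assuming the EventBridge rule delivers the default scheduled-event payload (whose "source" field is "aws.events"); the routing stub is a placeholder, not the repo's actual handler.

```python
# Minimal keep-alive handler sketch: scheduled warmup events return immediately,
# everything else falls through to the (stubbed) real routing logic.

def handler(event, context=None):
    # EventBridge scheduled events carry "source": "aws.events" and no work to do.
    if event.get("source") == "aws.events":
        return {"warmup": True}
    # ... real request handling would go here ...
    return {"warmup": False, "routed": True}

print(handler({"source": "aws.events", "detail-type": "Scheduled Event"}))  # → {'warmup': True}
```

The early return matters: the warmer should keep execution environments alive without importing heavy dependencies lazily or touching downstream services on every tick.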
More specifically, the warmer exists to reduce the probability that the capacity provider has to spin up fresh managed instance capacity for a new invocation path after a quiet period. That is the practical point of the EventBridge rule in terraform_file/eventbridge_cp_arm.tf. By invoking the Lambda on a steady rate(5 minutes) schedule, the platform keeps the execution path warm enough that the shared capacity provider is less likely to fall all the way back to a cold, scale-from-zero posture right before a real request arrives.
The important insight is this: once you care about cold-start predictability, you need a control loop.
That control loop can be:
- Provisioned concurrency
- Scheduled warmers
- Request shaping
- A shared managed instance pool
In this design, the team chose scheduled warm invocation plus a shared capacity provider. That is a sensible middle ground. It is cheaper and simpler than overcommitting always-on infrastructure, while still materially reducing the first-hit penalty.
In plain English: the EventBridge warmer is being used here so the capacity provider does not need to spin up a brand-new server footprint every time traffic reappears after idle time. For interactive or latency-sensitive agent workloads, that is a very practical optimization.
The strategic lesson is that warmup should be measured against business latency, not ideological purity. If a five-minute EventBridge schedule protects user experience and keeps cost acceptable, it is doing its job.
Lesson 4: Shared Pools Create Efficiency, but They Also Create Coupling
The capacity provider here is intentionally shared across platform agents and automation services. That is the right move early in a platform journey because it improves utilization and prevents every Lambda from inventing its own isolated infrastructure story.
But shared pools always introduce two forms of coupling:
- Technical coupling, because multiple workloads compete for the same execution substrate
- Organizational coupling, because one team's deployment patterns can affect another team's cost and performance envelope
That is why the concurrency controls here matter. The CLM router function uses a per-execution-environment concurrency setting, and the environment-specific .tfvars files pin that concurrency to 4. That is more than a performance number. It is a fairness policy.
If I were advising a platform team scaling this pattern, I would say this clearly: shared capacity providers are excellent, but they need quota thinking from day one. Otherwise the first successful workload becomes the first noisy neighbor.
Lesson 5: If You Publish Versions Aggressively, You Need Lifecycle Hygiene on Day One
This implementation makes another good call: the Lambda functions are published, aliased, and then cleaned up with an automated version pruner.
That matters because version sprawl is one of those quiet operational problems that teams ignore until it becomes annoying enough to disrupt deployments. Published versions accumulate quickly when CI/CD is active. If you do not manage them, you eventually pay in clutter, confusion, or hard service limits.
The lambda_version_pruner implementation is stronger than a simplistic cleanup script because it preserves what actually matters:
- It scans all Lambda functions
- It filters only functions associated with the target capacity provider
- It lists all aliases and protects aliased versions
- It keeps the latest N published versions
- It deletes everything older that is neither current nor aliased
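The selection rule at the heart of that pruner is worth separating from the boto3 plumbing so the policy itself is testable. The sketch below implements the steps listed above with illustrative names; it is not the repo's actual code, just the same keep/delete logic as a pure function.

```python
# Pruner selection policy sketch: keep $LATEST, keep aliased versions,
# keep the newest N published versions, and mark everything else for deletion.

def versions_to_delete(published_versions, aliased_versions, keep_latest_n):
    """`published_versions` are numeric version strings, oldest first."""
    protected = set(aliased_versions)
    # Guard against keep_latest_n == 0, where [-0:] would keep everything.
    keep_newest = set(published_versions[-keep_latest_n:]) if keep_latest_n else set()
    return [
        v for v in published_versions
        if v not in protected and v not in keep_newest and v != "$LATEST"
    ]

versions = ["1", "2", "3", "4", "5", "6"]   # oldest to newest
aliased = {"2"}                              # e.g. a rollback alias pinned to v2
print(versions_to_delete(versions, aliased, keep_latest_n=2))  # → ['1', '3', '4']
```

Versions 5 and 6 survive as the newest two, version 2 survives because an alias points at it, and the rest are safe to delete. That ordering of protections is exactly what keeps rollback intent intact.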
This is exactly the kind of automation mature teams invest in. Not glamorous. Very valuable.
There is also an understated platform principle here: rollback is not just about keeping artifacts. It is about keeping the right artifacts. By preserving aliased versions, the pruner respects deployment intent rather than blindly optimizing for tidiness.
There is also a more practical capacity-provider reason for doing this, and it deserves to be stated directly.
When you run a shared Lambda Managed Instances pool, you want the platform to spend its effort on the versions that are actually serving traffic, warming correctly, or remaining available for safe rollback. If old published versions keep accumulating forever, three unhealthy things tend to happen:
- operators lose clarity on which versions are still meaningful
- rollback and alias management become noisier than they should be
- the shared platform carries more deployment residue than useful runtime intent
Strictly speaking, deleting old Lambda versions does not magically increase CPU on the capacity provider. What it does do is improve platform hygiene around the shared pool. It ensures that the versions attached to aliases, warmup patterns, and deployment workflows remain deliberate and limited. In other words, it improves capacity-provider utilization indirectly by reducing version sprawl around the workloads that consume that shared capacity.
That matters in real operations. The healthier the deployment surface is, the easier it is to reason about what is warming, what is active, what can be rolled back, and what should no longer influence the platform at all.
So the version pruner is not just a cleanup utility. It is part of making the shared capacity provider operationally efficient. Not by adding raw compute, but by reducing noise, protecting the versions that matter, and keeping the platform focused on live execution paths instead of historical leftovers.
Lesson 6: GitHub Actions Should Orchestrate. CodeBuild Should Execute.
Architecturally, the CI/CD model here is sensible.
GitHub Actions is used as the control plane:
- branch-based triggering
- security scanning
- environment selection
- AWS credential injection
- build orchestration
AWS CodeBuild is used as the execution plane:
- Terraform install
- terraform init
- terraform validate
- terraform plan
- terraform apply
I like this split. It keeps GitHub Actions lightweight and makes AWS the place where the actual infrastructure mutation happens. That usually gives better access control, cleaner auditability, and fewer surprises around long-running plan or apply steps.
The buildspecs pin Terraform 1.12.2, install the CLI explicitly, and then execute plan/apply flows with environment-specific variable files. That is exactly the kind of boring repeatability you want in infrastructure delivery.
This is one of the most practical lessons from the implementation: do not force GitHub Actions to be your full deployment runtime if AWS-native execution gives you better control.
Lesson 7: CI/CD Maturity Is Not About Having a Pipeline. It Is About Where the Gates Actually Are.
The implementation also reveals a harder truth: CI/CD design is won or lost not by YAML volume, but by trigger discipline.
There are some good instincts here:
- Dev deployment is chained off a successful security workflow
- Security scanning runs on push and PR for dev
- PR security review is scoped only to actual code and infrastructure changes
- Environment-specific secrets are used for AWS access
That said, the current implementation also shows the kinds of issues every fast-moving team encounters:
- The dev deploy workflow is triggered by Security Checks (Push), not by a broader quality gate such as tests plus security plus static analysis
- The QA workflow is currently triggered on pull_request to qa, yet it also includes an apply stage, which is a risky combination
- The sanity workflow references a different CodeBuild project naming pattern, which looks like copy-forward drift from another implementation
- One dev apply step mixes generic and environment-specific secrets in a way that deserves tightening
This is not a criticism of the team. It is actually the most authentic part of the system.
Real pipelines evolve through reuse, renaming, urgency, and partial migration. The useful engineering habit is not pretending they are pristine. It is recognizing that pipeline drift is itself a production concern.
My blunt lesson here is this: CI/CD is software. It needs the same review rigor as application code.
Lesson 8: Documentation Drift Is a Reliability Signal
The README here is ambitious and useful, but parts of it clearly describe a broader or earlier architecture than the exact files currently present. That mismatch is more important than most teams realize.
When documentation and implementation diverge, three things happen:
- new engineers learn the wrong system
- reviewers approve changes with outdated mental models
- incidents take longer to resolve because operators trust stale diagrams
One of the best engineering habits is to treat documentation drift as an operational bug, not as a cosmetic issue.
This implementation makes that case well. The code is the source of truth. The docs are directionally strong, but some names, workflow descriptions, and file references have clearly moved over time. That is normal. What matters is catching it before the next engineer builds decisions on old assumptions.
Lesson 9: The Default VPC Is Fine for Speed, but It Should Be a Conscious Temporary Convenience
The Terraform intentionally uses the default VPC and default subnets, then layers in filtering and a custom security group. For early velocity, that is an acceptable choice. It removes friction and makes the first deployment much easier.
But teams should be honest about the tradeoff.
Using the default VPC accelerates setup. It does not provide the same clarity, segmentation, or policy hygiene that a dedicated workload VPC eventually should. The inbound HTTPS rule from 0.0.0.0/0 is another example of where a practical early-stage decision should later be revisited with a more opinionated security posture.
My view is simple: default VPC usage is fine when it is a speed decision. It becomes dangerous when it silently hardens into architecture.
Lesson 10: Least Privilege Usually Loses the First Battle. Do Not Let It Lose the War.
The Lambda IAM policy for the router function is broad. Very broad.
That is common when a platform team is trying to unblock integration work quickly across S3, SQS, SNS, DynamoDB, Bedrock, AppSync, logs, X-Ray, and secrets. The version pruner is noticeably tighter, which is encouraging. But the broader pattern remains familiar: the first version of a system usually over-grants.
The lesson is not "never do that." The lesson is "know when you are doing it, and schedule the hardening work while the platform is still comprehensible."
Security debt compounds. The longer a wide-open policy survives, the more invisible it becomes.
What This Repo Gets Right
If I strip away the drift and focus on the platform instincts, this implementation gets a lot right:
- It treats capacity provider infrastructure as shared platform capability, not one-off function plumbing
- It optimizes for ARM64 economics instead of defaulting to x86 out of habit
- It acknowledges cold starts as a business problem and addresses them operationally
- It preserves rollback safety with aliases while still pruning version sprawl
- It separates orchestration from execution in CI/CD
- It encodes AWS service constraints in Terraform comments and defaults, which reduces tribal knowledge
That is a strong foundation.
What I Would Improve Next
If I were turning this into the next version of a production-grade internal platform, I would prioritize the following:
- Tighten naming consistency across the implementation. The capacity provider name appears in slightly different forms across resources. That is how automation misses its target. Shared naming locals should eliminate this class of error.
- Make QA and production promotion rules stricter. The PR-triggered apply path should be removed. Plan on PR, apply on protected branch or approved environment gate is the cleaner model.
- Run Terraform from a single explicit working directory. The current layout places Terraform under terraform_file/, while some buildspec commands read like root-level execution. That ambiguity should be eliminated.
- Move from broad IAM toward intent-based policies. Especially for the router Lambda, policy scope should narrow as the workload stabilizes.
- Revisit networking posture. The default VPC is fine for speed; a dedicated VPC model is better for longevity, auditability, and controlled ingress.
- Add stronger deployment quality gates. Security review is useful, but infrastructure promotion should also hang off validation, tests, linting, and explicit approval where appropriate.
- Add platform observability as code. CloudWatch alarms, dashboarding, and cost visibility for the capacity provider should be treated as first-class Terraform resources, not follow-up tasks.
The Bigger Technical Lesson
The biggest takeaway from this implementation is not about Lambda specifically.
It is about how modern platform teams should build.
We should absolutely chase better cost-performance curves. We should use managed primitives aggressively. We should automate the boring work. But we also need the discipline to encode what we learn while the system is still small enough to reason about.
What makes this useful is that it shows both halves of real engineering:
- the architectural intent
- the implementation scars
That combination is where credible engineering judgment comes from.
Anyone can present a clean target state. The harder and more useful skill is building systems that survive contact with deployment friction, service constraints, naming drift, and operational reality.
That is what this implementation is doing. And that is why the lessons here matter.
Closing Thought
Capacity providers, warmers, version pruning, and GitHub-driven delivery are not separate topics. They are all answers to the same technical question:
How do we make cloud systems faster, cheaper, safer, and more repeatable without turning every application team into a specialized infrastructure group?
In this implementation, the answer was to centralize the hard platform decisions, automate the hygiene, keep the runtime warm where it matters, and stay honest about the places where the system still needs tightening.
That is not just good infrastructure work.
That is good engineering practice.