From Vertical Scaling to True Elasticity
At a pivotal point in my career as a Consulting Cloud Solutions / Technical Product Engineer, I was handed a mandate that sounded simple on paper but was far from trivial in execution:
"Enable horizontal autoscaling for our B2B customers."
At the time, customers could only scale vertically, resizing their virtual machines whenever they needed more compute. It worked, but only up to a point. Beyond that, the limitations became obvious.
What made it even more interesting was the environment: this wasn't on any of the well-known hyperscalers. It was within a startup cloud compute company, where elasticity wasn't a built-in convenience; it had to be intentionally engineered.
Customers couldn't:
Automatically scale out based on load
Scale back in when utilization dropped
Distribute traffic intelligently
Optimize cost through elasticity
Maintain performance during unpredictable spikes
For production B2B workloads, that model was limiting.
We needed real elasticity.
The Real Problem: Vertical Scaling Isn't Elastic
Vertical scaling solves capacity constraints temporarily. But it introduces structural issues:
Downtime during resize operations
Hard upper limits on instance sizes
Paying peak cost even during off-peak hours
No automated response to traffic bursts
For APIs and transactional B2B systems, that's fragile.
What customers needed wasn't "bigger machines." They needed adaptive infrastructure.
The Architecture I Designed
The solution introduced:
Autoscaling groups
Metric-driven scaling policies (CPU + RAM)
Application-aware health monitoring
Layer 7 traffic distribution
Conditional TLS termination
Fully parameterized Infrastructure-as-Code
The core of the design was an autoscaling group resource that dynamically created and managed instances based on scaling signals.
But more importantly, it was built to be reusable and controlled.
Design Philosophy: Parameterization Over Hardcoding
Every meaningful variable was externalized:
Image
Instance size
Network and subnet
Security rules
Scaling thresholds
Cooldowns
Load balancing algorithm
TLS configuration
This allowed the platform team to provide a blueprint, while tenants retained safe tuning control.
Guardrails prevented runaway scaling. Flexibility enabled elasticity.
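As a minimal sketch of that balance between guardrails and tenant control: the idea can be expressed as a validated parameter object, where tenants tune values freely within platform-enforced limits. All names and defaults here are illustrative, not the actual template parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingConfig:
    """Tenant-tunable autoscaling parameters with platform guardrails."""
    min_size: int = 1
    max_size: int = 10
    cpu_high: float = 80.0   # % CPU that triggers scale-out
    cpu_low: float = 20.0    # % CPU that triggers scale-in
    ram_high: float = 85.0   # % RAM that triggers scale-out
    cooldown_s: int = 300    # seconds between scaling actions

    # Hard platform limit that tenant input can never exceed
    PLATFORM_MAX = 50

    def __post_init__(self):
        if not (1 <= self.min_size <= self.max_size <= self.PLATFORM_MAX):
            raise ValueError("group size outside platform guardrails")
        if self.cpu_low >= self.cpu_high:
            raise ValueError("scale-in threshold must be below scale-out threshold")
```

Invalid combinations fail at validation time, before any infrastructure is touched, which is exactly where you want a runaway-scaling mistake to die.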
Dual-Metric Scaling: CPU + RAM
Most autoscaling implementations rely solely on CPU.
That's incomplete.
CPU-Based Scaling
CPU scaling tracked compute pressure relative to the instance size:
High CPU → scale out
Sustained low CPU → scale in
RAM-Based Scaling
Why memory?
Because CPU-only logic fails for:
Memory-heavy APIs
Cache-intensive services
Background workers
Stateful processes
Memory pressure doesn't always spike CPU. But it degrades performance quietly.
By combining both signals, scaling decisions became more accurate and less reactive.
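The combination rule can be sketched in a few lines. The asymmetry is the important part: either signal alone justifies scaling out, but scaling in requires both to be low, so a memory-heavy workload with an idle CPU is never starved. Thresholds and the `ram_low` parameter here are illustrative assumptions, not production values.

```python
def scaling_decision(cpu_pct, ram_pct,
                     cpu_high=80, cpu_low=20,
                     ram_high=85, ram_low=30):
    """Combine CPU and RAM signals into one scaling verdict.

    Scale out if EITHER resource is under pressure; scale in only
    when BOTH are comfortably low.
    """
    if cpu_pct >= cpu_high or ram_pct >= ram_high:
        return "scale_out"
    if cpu_pct <= cpu_low and ram_pct <= ram_low:
        return "scale_in"
    return "hold"
```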
Controlled Scaling Policies
Scaling adjustments were incremental:
Scale out by one instance at a time
Scale in by one instance at a time
Cooldown period enforced between adjustments
Cooldowns were non-negotiable.
Without them, you get:
Oscillation
Thrashing
Cost spikes
Instability
Elasticity must be disciplined.
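A cooldown gate is simple to state in code: after any scaling action, further signals are dropped until the window expires. This is a sketch with an injectable clock (the real implementation lived in the orchestration layer, not application code):

```python
import time


class CooldownGate:
    """Suppress scaling actions during the cooldown window, preventing
    oscillation (a scale-out immediately followed by a scale-in)."""

    def __init__(self, cooldown_s=300, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._last_action = None

    def try_fire(self):
        now = self._clock()
        if self._last_action is not None and now - self._last_action < self.cooldown_s:
            return False  # still cooling down; drop this scaling signal
        self._last_action = now
        return True
```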
Intelligent Traffic Distribution
Horizontal scaling without intelligent routing is incomplete.
Each instance was automatically registered behind a managed Layer 7 load balancer configured with:
Listener configuration
Traffic pools
Algorithm selection (round robin, least connections, source affinity)
Health checks
This ensured:
Even traffic spread
Automatic removal of unhealthy instances
Seamless scale-out integration
New nodes joined the pool automatically. Terminated nodes exited cleanly. No manual intervention.
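To make one of those algorithms concrete, here is a least-connections selection sketch that also shows why health state and routing belong together: unhealthy members are excluded before the algorithm ever sees them. The pool shape is a simplifying assumption for illustration.

```python
def least_connections(pool):
    """Pick the healthy pool member with the fewest active connections.

    `pool` maps instance address -> (healthy, active_connections).
    Unhealthy members are removed from consideration entirely.
    """
    healthy = {addr: conns for addr, (ok, conns) in pool.items() if ok}
    if not healthy:
        raise RuntimeError("no healthy members in pool")
    return min(healthy, key=healthy.get)
```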
Conditional TLS Termination
Security couldn't be an afterthought.
The template supported conditional TLS termination at the load balancer:
Secure edge termination when required
Certificate injection via parameter
HTTP communication retained internally
This allowed secure external exposure without unnecessary internal overhead.
One template. Multiple behaviors. Clean logic.
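The conditional logic is roughly this shape: one function, two listener behaviors, with the certificate supplied as a parameter. The protocol names and field layout are illustrative (borrowed loosely from common load-balancer APIs), not the actual template syntax.

```python
def build_listener(tls_enabled, cert_ref=None, backend_port=8080):
    """Describe a load-balancer listener.

    With TLS enabled, terminate at the edge (port 443) and keep plain
    HTTP between the load balancer and the pool. Without TLS, listen
    on plain HTTP end to end.
    """
    if tls_enabled:
        if cert_ref is None:
            raise ValueError("TLS enabled but no certificate reference supplied")
        return {
            "protocol": "TERMINATED_HTTPS",
            "port": 443,
            "certificate": cert_ref,
            "pool_protocol": "HTTP",
            "pool_port": backend_port,
        }
    return {
        "protocol": "HTTP",
        "port": 80,
        "pool_protocol": "HTTP",
        "pool_port": backend_port,
    }
```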
Application-Aware Health Monitoring
Metrics alone are not enough.
Health checks were configurable:
Protocol type (HTTP, HTTPS, TCP, PING)
Custom endpoint paths
Expected response codes
Retry logic
Scaling decisions were therefore not just metric-driven. They were application-aware.
If an instance failed health validation, it was removed from rotation, even if CPU looked fine.
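The retry-plus-expected-codes logic sketches out like this. The probe is injected as a callable so the same logic covers HTTP, HTTPS, and TCP checks; treating connection errors as failed attempts is the assumption that makes the check network-failure-aware.

```python
def is_healthy(probe, expected_codes=(200,), retries=3):
    """Run the health probe up to `retries` times.

    An instance counts as healthy as soon as one attempt returns an
    expected status code. `probe` is a callable returning an HTTP-style
    status code, or raising OSError on connection failure.
    """
    for _ in range(retries):
        try:
            if probe() in expected_codes:
                return True
        except OSError:
            pass  # connection refused / timeout counts as a failed attempt
    return False
```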
Clean Separation of Concerns
Compute provisioning lived in a nested template.
Each instance:
Booted with the required configuration
Attached to the correct network
Applied security policies
Registered automatically as a pool member
This separation improved:
Maintainability
Debuggability
Reusability
Scaling logic stayed isolated from compute logic.
Clean architecture scales better than tangled configuration.
Observability by Design
The stack exposed:
Load balancer endpoint
Active instance IPs
Current autoscaling group size
Scaling thresholds
TLS state
This made validation straightforward and troubleshooting faster.
Elasticity without visibility is chaos.
How the Feature Was Exposed to Customers
Internally, the autoscaling logic was implemented as infrastructure orchestration.
But for customers, it had to feel simple.
The orchestration layer was triggered by a backend service written in Python. Customer-facing APIs were exposed via a Node.js frontend layer.
This separation allowed:
Clean API contracts for B2B customers
Secure backend orchestration execution
Controlled feature consumption
Clear abstraction of infrastructure complexity
Customers interacted with an API. Behind the scenes, the platform translated that into autoscaling behavior.
No platform details exposed. Just capability.
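The translation step in the Python backend can be pictured as a small mapping function: take the customer's API payload, reject anything outside the contract, and fill in platform defaults before handing off to orchestration. Field names and defaults here are hypothetical, not the real API contract.

```python
def to_stack_parameters(api_payload):
    """Map a customer API payload onto orchestration template parameters,
    applying platform defaults for anything the tenant omits and
    rejecting fields outside the published contract."""
    defaults = {
        "min_size": 1,
        "max_size": 5,
        "cooldown_s": 300,
        "tls_enabled": False,
    }
    unknown = set(api_payload) - set(defaults)
    if unknown:
        raise ValueError(f"unsupported fields: {sorted(unknown)}")
    return {**defaults, **api_payload}
```

Keeping this mapping explicit is what lets the API contract stay stable even when the underlying orchestration templates change.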
What This Enabled for B2B Customers
After deployment, customers could:
Automatically scale from x → x + n instances
Scale down during low traffic
Avoid manual resizing
Maintain SLA during demand spikes
Secure endpoints with TLS
Optimize infrastructure cost
Infrastructure shifted from static capacity to adaptive elasticity.
Key Lessons
- Autoscaling is not just CPU-based. Memory pressure breaks systems quietly.
- Cooldowns prevent instability. Elastic does not mean erratic.
- Parameterization is platform maturity. Reusable blueprints reduce operational risk.
- Health checks are mandatory. Metrics without application validation are incomplete.
Final Thoughts
This wasn't just about writing an Infrastructure-as-Code template. It was about:
Platform enablement
Elastic workload design
Cost optimization
SLA resilience
Production-grade automation
Horizontal autoscaling isn't a feature toggle.
It's an operational philosophy.
And once B2B customers experience real elasticity, going back to vertical-only scaling isn't an option anymore.