Originally published on 10 May 2026 on the All Quiet Tech Blog.
At All Quiet we build incident management: alerting, on-call rotations, escalation, status pages, and integrations with the monitoring stacks teams already run. A meaningful slice of my job is keeping the boring edges boring, and ingress is the edge people notice first when something breaks.
In parts of our stack we run load balancers ourselves on EC2 instead of putting every path behind an AWS-managed balancer. We do that in part to avoid leaning too hard on higher-level AWS abstractions for those tiers: we still rely on EC2 for reliable virtual machines, and we keep the design close to portable building blocks so we could run the same pattern in another data center or provider without a ground-up redesign. Once we made that choice, we still had a plain high availability (HA) problem for the active-passive pair: keep the public edge redundant.
For those tiers we use a small pattern: a stable Elastic IP (EIP), the address we publish in DNS and the stand-in for a floating IP on a traditional network; Keepalived running Virtual Router Redundancy Protocol (VRRP) between peers; and the EC2 API, mainly AssignPrivateIpAddresses and AssociateAddress, to move that EIP when mastership changes. We wire this with Ansible and the AWS Cloud Development Kit (CDK) in our infrastructure repo.
The problem in AWS terms
I grew up with patterns where a “floating IP” moves at layer 2 (L2) with gratuitous Address Resolution Protocol (ARP). Amazon Virtual Private Cloud (VPC) doesn’t work like your favorite rack fabric: public routing for Elastic IPs is enforced by AWS’s control plane, tied to a specific Elastic Network Interface (ENI) and private address on an instance.
So we split responsibilities deliberately:
- Between our servers, we use Keepalived / VRRP, almost always unicast, to decide which node is primary.
- Against AWS, we run a script on notify_master that calls the command-line interface (CLI) or API so the EIP actually attaches to the winner.
If we did only VRRP virtual-address tricks without AssociateAddress, we would not fix customer-visible public routing for that EIP. If we did only API moves without Keepalived, we’d lack a clean distributed agreement story on the pair. We need both layers.
Architecture at a glance
   Elastic IP (stable in DNS)
                │
                ▼
┌───────────────────────────────┐
│   EC2: EIP associated here    │
│   (AssociateAddress, etc.)    │
└───────────────────────────────┘
                │
Our LB tier (e.g. HAProxy / nginx)
                │
            backends
On both nodes we run Keepalived with unicast peers, priorities, and a vrrp_script that reflects whether our LB process is actually alive (systemctl, curl to localhost, or whatever probe matches reality). When a node becomes MASTER, notify_master runs our failover shell script: ensure a secondary private IP, then associate the allocation.
Implementation sketch
- Kernel: we often enable ip_forward / ip_nonlocal_bind where our HAProxy or nginx layout needs it. We validate per role, not globally (a host-level sketch follows this list).
- Security groups: protocol 112 (Virtual Router Redundancy Protocol, VRRP) allowed between peers, not the open internet.
- Keepalived: notify_master logs to a rotated file; credentials via an Identity and Access Management (IAM) instance profile where we can.
- Instance identity: in production we fetch instance metadata using Instance Metadata Service version 2 (IMDSv2).
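Example host-level sketch for the first two items (the sysctl file name and security group ID are illustrative placeholders, not our actual values):
# Kernel knobs via a sysctl drop-in, only on roles that need them
cat <<'EOF' | sudo tee /etc/sysctl.d/99-keepalived-lb.conf
net.ipv4.ip_forward = 1
net.ipv4.ip_nonlocal_bind = 1
EOF
sudo sysctl --system

# Allow VRRP (IP protocol 112) only between the peers' shared security group
aws ec2 authorize-security-group-ingress \
  --group-id "sg-REPLACE_ME" \
  --ip-permissions 'IpProtocol=112,UserIdGroupPairs=[{GroupId=sg-REPLACE_ME}]'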
Example Keepalived skeleton (placeholders only, not a drop-in):
global_defs {
  # Required so keepalived will run notify/check scripts; they run as root
  # because the failover script calls ip(8) and the AWS CLI.
  enable_script_security
  script_user root
  vrrp_startup_delay 1
}
vrrp_script check_service {
  # Probe whatever proves the LB is actually serving; systemctl here,
  # a curl against localhost works just as well.
  script "/usr/bin/systemctl is-active --quiet nginx"
  interval 2
  weight 2
}
vrrp_instance VI_1 {
  # The peer gets state BACKUP, swapped unicast IPs, and a lower priority.
  state MASTER
  interface eth0
  unicast_src_ip 10.0.0.10
  unicast_peer {
    10.0.0.11
  }
  virtual_router_id 51
  priority 200
  advert_int 1
  track_script {
    check_service
  }
  # Fires on transition to MASTER; this is where the EIP actually moves.
  notify_master "/etc/keepalived/aws-failover.sh >> /var/log/keepalived/aws-failover.log"
}
Example failover script shape (replace IDs and IPs; use IMDSv2 in production):
#!/usr/bin/env bash
set -euo pipefail

# Placeholders: the EIP allocation, the secondary private IP we float,
# and the interface the LB listens on.
ALLOCATION_ID="eipalloc-REPLACE_ME"
PRIVATE_IP_SECONDARY="10.0.0.50"
INTERFACE="eth0"

# Who am I? (Plain IMDS call for brevity; IMDSv2 variant shown below.)
INSTANCE_ID="$(curl -sf http://169.254.169.254/latest/meta-data/instance-id)"

# Make the kernel answer for the floating private IP locally.
ip addr add "${PRIVATE_IP_SECONDARY}/32" dev "${INTERFACE}" || true

# Find this instance's primary ENI (assumes a single-ENI layout).
NI_ID="$(aws ec2 describe-instances \
  --instance-ids "${INSTANCE_ID}" \
  --query 'Reservations[0].Instances[0].NetworkInterfaces[0].NetworkInterfaceId' \
  --output text)"

# Pull the secondary private IP onto this ENI; --allow-reassignment lets it
# move even while it is still assigned to the peer's ENI.
aws ec2 assign-private-ip-addresses \
  --network-interface-id "${NI_ID}" \
  --private-ip-addresses "${PRIVATE_IP_SECONDARY}" \
  --allow-reassignment

# Point the EIP at that private IP on this instance.
aws ec2 associate-address \
  --allocation-id "${ALLOCATION_ID}" \
  --instance-id "${INSTANCE_ID}" \
  --private-ip-address "${PRIVATE_IP_SECONDARY}"
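The same instance-id lookup with IMDSv2, the shape we use in production (the token TTL here is illustrative):
# IMDSv2: fetch a session token, then present it on metadata requests.
TOKEN="$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")"
INSTANCE_ID="$(curl -sf \
  -H "X-aws-ec2-metadata-token: ${TOKEN}" \
  http://169.254.169.254/latest/meta-data/instance-id)"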
Tradeoffs of managing the edge ourselves
When we self-manage load balancer tiers instead of defaulting to AWS-managed front doors, we still need to evaluate the usual architectures: application or network load balancers, DNS failover, Kubernetes ingress, or the Elastic IP + Keepalived pattern this post describes.
Application Load Balancer (ALB)
Pros / cons (for anyone choosing ALB):
- Upside: AWS-managed HA, health checks, Transport Layer Security (TLS) with AWS Certificate Manager (ACM), AWS Web Application Firewall (WAF), and a clear scaling story for HTTP.
- Downside: cost at scale, less hands-on control over every packet and knob than raw EC2.
- At All Quiet: we rely on managed load balancing for paths where we want AWS to own HA end-to-end, including customer-facing HTTP. We treat ALB-class tooling as the default when we do not want to operate the edge ourselves.
Network Load Balancer (NLB)
Pros / cons (for anyone choosing NLB):
- Upside: TCP/UDP transparency, static IPs per Availability Zone (AZ), low listener overhead compared to full layer 7 (L7).
- Downside: fewer HTTP-specific features than ALB; still another billable and operated AWS component.
- At All Quiet: when we need AWS-managed HA but not full L7 termination at the edge, NLB fits better than ALB; we don’t replace every self-managed tier with NLB, but it sits on the same “managed edge” side of the spectrum as ALB.
DNS failover (Route 53)
Pros / cons (for anyone using DNS failover):
- Upside: no instance-side EIP choreography; health-checked routing policies let AWS steer names.
- Downside: DNS time to live (TTL) and caching stretch failover and failback; client stacks behave inconsistently; not a drop-in substitute for “one stable IP, instant swing.”
- At All Quiet: DNS steering can complement other designs; we don’t rely on it alone when our mental model is exactly one Elastic IP jumping between two known EC2 nodes. That is what Keepalived plus the API covers.
Kubernetes / gateways (e.g. Amazon EKS)
Pros / cons (for anyone on Kubernetes):
- Upside: HA via Services, Ingress / Gateway API, and cloud LB integration, which gives different primitives than a bare-metal pair.
- Downside: cluster operational tax; not every workload belongs there.
- At All Quiet: this article describes a pair pattern centered on virtual machines (VMs) because we still run meaningful tiers that way; where we use Amazon Elastic Kubernetes Service (EKS) or similar, ingress HA follows Kubernetes, not Keepalived on two fixed hosts.
Elastic IP + Keepalived + EC2 API (this post)
Pros / cons (for anyone building like this):
- Upside: one stable public address in DNS; relatively few moving AWS objects; full control over timers and failover scripts.
- Downside: you own IAM, idempotent scripts, logging, and monitoring; ambiguous states deserve runbooks and checks such as DescribeAddresses (sketched below). Compared with a managed load balancer, cutover is not instantaneous on abrupt failure: traffic follows wherever the EIP is still associated until VRRP agrees on a new master, your health logic runs, and notify_master finishes calling AWS. The gap depends on advert_int, vrrp_script intervals, preempt settings, and API behavior. Those knobs trade sensitivity against stability; sub-second failover is not what this pattern promises.
- At All Quiet: this is what we actually implemented for specific self-managed load balancer tiers: Ansible-deployed Keepalived, scripts on notify_master, AWS Cloud Development Kit (CDK) and infrastructure as code (IaC) for the EIP and IAM. That is the same stack this article walks through at a pattern level.
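A minimal sketch of the DescribeAddresses-style check mentioned above, meant to run on whichever node believes it is MASTER (allocation ID is a placeholder; the instance-id lookup follows the IMDSv2 shape shown earlier):
#!/usr/bin/env bash
set -euo pipefail

ALLOCATION_ID="eipalloc-REPLACE_ME"

# Which instance does AWS say currently holds the EIP?
HOLDER="$(aws ec2 describe-addresses \
  --allocation-ids "${ALLOCATION_ID}" \
  --query 'Addresses[0].InstanceId' \
  --output text)"

# Compare with this node's own identity (IMDSv2).
TOKEN="$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")"
SELF="$(curl -sf -H "X-aws-ec2-metadata-token: ${TOKEN}" \
  http://169.254.169.254/latest/meta-data/instance-id)"

if [ "${HOLDER}" != "${SELF}" ]; then
  echo "EIP ${ALLOCATION_ID} is on ${HOLDER}, not this node (${SELF})" >&2
  exit 1
fi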
How we pick among them
Internally we ask: does this path’s service level objective (SLO) and budget justify a managed LB? Do we need layer 7 (L7) features only ALB gives us? Does scripted EIP failover fit this path’s resilience expectations (see the EIP downside above)? If not, we promote the tier (ALB/NLB or another design); we don’t stretch EIP+Keepalived past where it fits.
Takeaways
- We run self-managed LBs in some slices of our infra; we needed explicit public HA there, and EIP + Keepalived + EC2 API is our compact answer.
- VRRP decides who leads; AssociateAddress decides where the EIP points.
- Managed ALB/NLB remain strong defaults when we want AWS to own HA at that layer.
Closing
We touch incident paths every day; when ingress misbehaves, people notice fast. If you operate similar edges, use this framing to decide when the pattern fits and when to promote that tier to managed load balancing instead.