Why Valtix's Policy Engine Exploded at 1,800 Routes (And How We Fixed It)

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

At Valtix we sold a multi-tenant cloud firewall and service insertion platform. Each tenant got a dedicated AWS VPC with its own route table. Our policy engine compiled tenant-specific firewall rules into AWS Network Firewall stateful rules and pushed them via the Valtix control plane.

We designed the system for 250 tenants, each with ≈ 200 routes (mostly CIDR aggregates). That came to 50 k routes in the control plane's in-memory graph. The AWS Network Firewall backend handled that easily.

Then one enterprise customer decided to migrate 1,200 legacy /24s into a single VPC. Their IPAM folks had never used CIDR aggregation; every /24 had its own active route. Overnight the tenant's route table jumped to 1,200 entries, and our control-plane memory graph exploded to 1.4 million nodes because the policy engine built a full forwarding tree for every leaf prefix. At 1,800 routes the control plane's Go runtime hit the GC pause ceiling (≈ 800 ms) and the gRPC push to the firewall endpoint timed out.

The worst part: the Valtix CLI's show limits command only reported AWS Network Firewall limits (6 k rules), not the internal control-plane limits. We didn't even know we had an internal limit until prod broke.

What We Tried First (And Why It Failed)

First, we tweaked the garbage collector. We set GOGC=50 and tuned GOMEMLIMIT to 2 GB. That shaved the pause to 600 ms but didn't fix the underlying algorithm: the policy engine was still building a full radix tree for every /32 injected by kube-proxy.
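For readers who want to experiment with the same two knobs, here is a minimal sketch of setting them from inside a Go process instead of through environment variables. It assumes Go 1.19 or later (where debug.SetMemoryLimit exists) and reads "2 GB" as 2 GiB; applyGCTuning is a hypothetical helper name, not part of the Valtix codebase.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// applyGCTuning sets the GC target percentage and the soft memory limit,
// returning the previous values so a caller can restore them later.
// (Hypothetical helper, shown for illustration only.)
func applyGCTuning(gcPercent int, memLimitBytes int64) (prevPercent int, prevLimit int64) {
	prevPercent = debug.SetGCPercent(gcPercent)
	prevLimit = debug.SetMemoryLimit(memLimitBytes)
	return prevPercent, prevLimit
}

func main() {
	// Roughly equivalent to starting the process with GOGC=50 GOMEMLIMIT=2GiB.
	prevPercent, prevLimit := applyGCTuning(50, 2<<30)
	fmt.Printf("previous GOGC=%d, previous memory limit=%d bytes\n", prevPercent, prevLimit)
}
```

Tuning like this only changes how often the collector runs; it does nothing about how many objects the engine allocates, which is why the pause shrank but the problem stayed.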

Next, we tried to trim the route table with kubectl patch to force CIDR aggregation. The customer's cluster had 1,184 /24s and 422 /32s from kube-proxy. We wrote a simple aggregation script that collapsed any /24 whose /16 parent was 80 % empty. That ran in 12 minutes and reduced the route count to 312. The policy engine GC pauses dropped to 40 ms, and the 503s vanished.
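To make the idea of collapsing routes concrete, the sketch below shows the lossless form of CIDR aggregation: two sibling prefixes merge into their parent, and any prefix already covered by a shorter one is dropped. This is not the script described above (which used a lossy 80 % heuristic); the function names are hypothetical and the example only relies on Go's standard net/netip package.

```go
package main

import (
	"fmt"
	"net/netip"
	"sort"
)

// parentOf returns the prefix one bit shorter that contains p.
// Callers must ensure p.Bits() > 0.
func parentOf(p netip.Prefix) netip.Prefix {
	return netip.PrefixFrom(p.Addr(), p.Bits()-1).Masked()
}

// aggregate merges sibling prefixes into their parent until nothing more
// merges, then drops prefixes covered by a shorter one. The result covers
// exactly the same addresses as the input.
func aggregate(in []netip.Prefix) []netip.Prefix {
	set := make(map[netip.Prefix]struct{}, len(in))
	for _, p := range in {
		if p.IsValid() {
			set[p.Masked()] = struct{}{}
		}
	}
	for merged := true; merged; {
		merged = false
		children := make(map[netip.Prefix][]netip.Prefix)
		for p := range set {
			if p.Bits() > 0 {
				par := parentOf(p)
				children[par] = append(children[par], p)
			}
		}
		for par, kids := range children {
			if len(kids) == 2 { // both halves present: replace with the parent
				delete(set, kids[0])
				delete(set, kids[1])
				set[par] = struct{}{}
				merged = true
			}
		}
	}
	out := make([]netip.Prefix, 0, len(set))
	for p := range set {
		covered := false
		for q := p; q.Bits() > 0 && !covered; {
			q = parentOf(q)
			_, covered = set[q]
		}
		if !covered {
			out = append(out, p)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if c := out[i].Addr().Compare(out[j].Addr()); c != 0 {
			return c < 0
		}
		return out[i].Bits() < out[j].Bits()
	})
	return out
}

// aggregateCIDRs is a string-based convenience wrapper around aggregate.
func aggregateCIDRs(cidrs ...string) ([]string, error) {
	prefixes := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, err
		}
		prefixes = append(prefixes, p)
	}
	var out []string
	for _, p := range aggregate(prefixes) {
		out = append(out, p.String())
	}
	return out, nil
}

func main() {
	// Four contiguous /24s plus a kube-proxy style /32 inside the first one.
	got, err := aggregateCIDRs("10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24", "10.0.0.5/32")
	fmt.Println(got, err)
}
```

Lossless merging only helps when the /24s are contiguous and aligned; a sparse table needs a lossy rule such as the 80 % heuristic to shrink further.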

But the script needed root on every worker node, and the customer's security policy forbade that. When we tried to run it via a privileged DaemonSet, the kubelet's seccomp profile rejected the iptables-restore calls we needed to load the updated ruleset.

Our last resort was to bypass the control plane completely and push the aggregated route table directly to the AWS Network Firewall endpoint using the AWS SDK. We built a thin shim in Rust that read the new aggregated table and called CreateNetworkFirewallRuleGroup with a single stateful 1,500-rule group. The shim ran in a Lambda behind an SQS queue; the queue depth grew to 128 messages during the cutover, but the Lambda concurrency limit of 1,000 kept us from melting the queue.

The shim worked, in the sense that the 503s stopped, but it introduced a new inconsistency: the control plane's view of the route table lagged the firewall by up to 30 seconds. We started seeing tenant logs where traffic that should have hit Deny rules slipped through because the control plane hadn't yet pushed the updated rule.

The Architecture Decision

We decided to change the fundamental assumption in the policy engine: stop building a full forwarding tree in memory. Instead, we switched to a two-layer architecture.

Layer 1: A lightweight in-memory cache that stores only the netip.Prefix and the tenant ID. That cache is built from AWS Route53 Resolver query logs and Kubernetes EndpointSlice delta watches, not from the route table itself.
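A minimal sketch of what such a Layer 1 cache could look like follows. The type and method names are hypothetical; the only things taken from the description above are that entries consist of a netip.Prefix plus a tenant ID and that updates arrive as deltas (adds and removes).

```go
package main

import (
	"fmt"
	"net/netip"
	"sync"
)

// prefixCache stores nothing but the set of prefixes seen per tenant.
// (Hypothetical type, shown for illustration only.)
type prefixCache struct {
	mu       sync.RWMutex
	byTenant map[string]map[netip.Prefix]struct{}
}

func newPrefixCache() *prefixCache {
	return &prefixCache{byTenant: make(map[string]map[netip.Prefix]struct{})}
}

// Add records a prefix for a tenant; host bits are masked off so that
// 10.0.0.7/24 and 10.0.0.0/24 count as the same entry.
func (c *prefixCache) Add(tenantID, cidr string) error {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return fmt.Errorf("tenant %s: %w", tenantID, err)
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	set := c.byTenant[tenantID]
	if set == nil {
		set = make(map[netip.Prefix]struct{})
		c.byTenant[tenantID] = set
	}
	set[p.Masked()] = struct{}{}
	return nil
}

// Remove deletes a prefix for a tenant, for example on an endpoint delete delta.
func (c *prefixCache) Remove(tenantID, cidr string) error {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return fmt.Errorf("tenant %s: %w", tenantID, err)
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.byTenant[tenantID], p.Masked())
	return nil
}

// Invalidate drops everything cached for a tenant.
func (c *prefixCache) Invalidate(tenantID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.byTenant, tenantID)
}

// Snapshot returns a copy of the tenant's prefixes in unspecified order.
func (c *prefixCache) Snapshot(tenantID string) []netip.Prefix {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make([]netip.Prefix, 0, len(c.byTenant[tenantID]))
	for p := range c.byTenant[tenantID] {
		out = append(out, p)
	}
	return out
}

func main() {
	c := newPrefixCache()
	_ = c.Add("tenant-a", "10.0.0.7/24")
	_ = c.Add("tenant-a", "10.0.0.0/24") // same network, deduplicated
	_ = c.Add("tenant-a", "10.0.9.1/32")
	fmt.Println(len(c.Snapshot("tenant-a")), "prefixes cached for tenant-a")
}
```

Because the cache is a flat set rather than a tree, its memory use grows linearly with the number of distinct prefixes.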

Layer 2: A background aggregator that runs every 30 seconds (adjustable) and compiles the cache into CIDR aggregates using a greedy algorithm. The aggregator outputs a single JSON file with no more than 250 prefixes per tenant. It writes the file to S3, then triggers a Lambda that calls CreateNetworkFirewallRuleGroup with the aggregated set.
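The exact greedy rule is not spelled out above, so the following is one plausible sketch rather than the aggregator itself: while a tenant has more prefixes than the cap, replace the adjacent pair whose common supernet is longest (that is, the merge that adds the least extra address space). All function names are hypothetical, and the cap is passed in as a parameter.

```go
package main

import (
	"fmt"
	"net/netip"
	"sort"
)

// commonSupernet returns the longest prefix containing both a and b,
// or the zero Prefix if they belong to different address families.
func commonSupernet(a, b netip.Prefix) netip.Prefix {
	bits := a.Bits()
	if b.Bits() < bits {
		bits = b.Bits()
	}
	for ; bits >= 0; bits-- {
		s := netip.PrefixFrom(a.Addr(), bits).Masked()
		if s.Contains(b.Addr()) {
			return s
		}
	}
	return netip.Prefix{}
}

// covers reports whether outer contains all of inner.
func covers(outer, inner netip.Prefix) bool {
	return outer.Bits() <= inner.Bits() && outer.Contains(inner.Addr())
}

func sortPrefixes(ps []netip.Prefix) {
	sort.Slice(ps, func(i, j int) bool {
		if c := ps[i].Addr().Compare(ps[j].Addr()); c != 0 {
			return c < 0
		}
		return ps[i].Bits() < ps[j].Bits()
	})
}

// capPrefixes greedily widens prefixes until at most max remain.
// The result always covers the input but may cover extra addresses.
func capPrefixes(in []netip.Prefix, max int) []netip.Prefix {
	seen := make(map[netip.Prefix]struct{}, len(in))
	var out []netip.Prefix
	for _, p := range in {
		p = p.Masked()
		if _, dup := seen[p]; p.IsValid() && !dup {
			seen[p] = struct{}{}
			out = append(out, p)
		}
	}
	sortPrefixes(out)
	for len(out) > max {
		var best netip.Prefix // zero value is invalid and reports Bits() == -1
		for i := 0; i+1 < len(out); i++ {
			if s := commonSupernet(out[i], out[i+1]); s.IsValid() && s.Bits() > best.Bits() {
				best = s
			}
		}
		if !best.IsValid() {
			break // nothing left that can be merged
		}
		next := []netip.Prefix{best}
		for _, p := range out {
			if !covers(best, p) {
				next = append(next, p)
			}
		}
		sortPrefixes(next)
		out = next
	}
	return out
}

// capCIDRs is a string-based convenience wrapper around capPrefixes.
func capCIDRs(max int, cidrs ...string) ([]string, error) {
	prefixes := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, err
		}
		prefixes = append(prefixes, p)
	}
	var out []string
	for _, p := range capPrefixes(prefixes, max) {
		out = append(out, p.String())
	}
	return out, nil
}

func main() {
	got, err := capCIDRs(2, "10.0.0.0/24", "10.0.1.0/24", "10.0.8.0/24", "192.168.0.0/24")
	fmt.Println(got, err)
}
```

A sketch like this will happily widen all the way to /0 if the cap demands it. For firewall rules that is rarely acceptable, so a real aggregator would also need a lower bound on supernet length and a policy for what happens when the cap cannot be met safely.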

We also introduced a new metric: policy_engine_route_aggregation_ratio. It's the quotient of the raw route count over the aggregated prefix count. When the ratio exceeds 5, we alert the tenant with a link to the aggregation dashboard. At 1,800 raw routes the ratio was 7.2, which immediately triggered the alert.
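The arithmetic behind that figure is 1,800 raw routes divided by the 250-prefix cap, which gives 7.2. A small sketch of the ratio and the alert check, with hypothetical function names and the threshold of 5 taken from the text:

```go
package main

import "fmt"

// alertThreshold is the ratio above which the tenant is alerted.
const alertThreshold = 5.0

// aggregationRatio is raw route count divided by aggregated prefix count.
// It returns 0 when there are no aggregated prefixes to avoid dividing by zero.
func aggregationRatio(rawRoutes, aggregatedPrefixes int) float64 {
	if aggregatedPrefixes == 0 {
		return 0
	}
	return float64(rawRoutes) / float64(aggregatedPrefixes)
}

// shouldAlert reports whether the ratio exceeds the alert threshold.
func shouldAlert(ratio float64) bool {
	return ratio > alertThreshold
}

func main() {
	r := aggregationRatio(1800, 250)
	fmt.Printf("ratio=%.1f alert=%t\n", r, shouldAlert(r))
}
```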

To keep the control plane consistent with the firewall, we adopted a two-phase commit:

Phase 1: Lambda pushes the new aggregated rule group to AWS Network Firewall.
Phase 2: A control-plane worker listens to CloudWatch Events for the NETWORK_FIREWALL_UPDATE_COMPLETE event, then invalidates the tenant's in-memory cache and rebuilds it from the new rule group.
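The two phases can be sketched with an in-memory stand-in for the firewall and a channel standing in for the CloudWatch event. Everything here (the firewall interface, fakeFirewall, pushAggregates, controlPlane) is hypothetical and makes no AWS calls; it only illustrates the ordering: push first, then rebuild the cache from what the firewall actually holds.

```go
package main

import (
	"fmt"
	"sync"
)

// firewall is a hypothetical stand-in for the rule group API.
type firewall interface {
	PutRuleGroup(tenantID string, prefixes []string) error
	GetRuleGroup(tenantID string) ([]string, error)
}

// fakeFirewall keeps rule groups in memory so the sketch runs without AWS.
type fakeFirewall struct {
	mu     sync.Mutex
	groups map[string][]string
}

func newFakeFirewall() *fakeFirewall {
	return &fakeFirewall{groups: make(map[string][]string)}
}

func (f *fakeFirewall) PutRuleGroup(tenantID string, prefixes []string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.groups[tenantID] = append([]string(nil), prefixes...)
	return nil
}

func (f *fakeFirewall) GetRuleGroup(tenantID string) ([]string, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	g, ok := f.groups[tenantID]
	if !ok {
		return nil, fmt.Errorf("no rule group for tenant %s", tenantID)
	}
	return append([]string(nil), g...), nil
}

// pushAggregates is phase 1: write the rule group, then announce completion.
func pushAggregates(fw firewall, done chan<- string, tenantID string, prefixes []string) error {
	if err := fw.PutRuleGroup(tenantID, prefixes); err != nil {
		return err
	}
	done <- tenantID
	return nil
}

// controlPlane holds the per-tenant view that must match the firewall.
type controlPlane struct {
	fw    firewall
	mu    sync.RWMutex
	cache map[string][]string
}

func newControlPlane(fw firewall) *controlPlane {
	return &controlPlane{fw: fw, cache: make(map[string][]string)}
}

// onUpdateComplete is phase 2: rebuild the tenant's cache from the firewall.
// On error the old cache entry is left in place.
func (c *controlPlane) onUpdateComplete(tenantID string) error {
	prefixes, err := c.fw.GetRuleGroup(tenantID)
	if err != nil {
		return fmt.Errorf("rebuild cache: %w", err)
	}
	c.mu.Lock()
	c.cache[tenantID] = prefixes // swap in one step so readers never see an empty cache
	c.mu.Unlock()
	return nil
}

// Prefixes returns a copy of the cached prefixes for a tenant.
func (c *controlPlane) Prefixes(tenantID string) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return append([]string(nil), c.cache[tenantID]...)
}

func main() {
	fw := newFakeFirewall()
	cp := newControlPlane(fw)
	events := make(chan string, 1)

	if err := pushAggregates(fw, events, "tenant-a", []string{"10.0.0.0/22"}); err != nil {
		panic(err)
	}
	if err := cp.onUpdateComplete(<-events); err != nil {
		panic(err)
	}
	fmt.Println(cp.Prefixes("tenant-a"))
}
```

The sketch replaces the cache entry in a single assignment rather than clearing it and refilling it, which is one way to avoid a window in which the control plane holds no prefixes for the tenant.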

The control plane now never sees more than 250 prefixes per tenant, so its memory graph stays under 62 k nodes. The policy engine's heap never exceeds 200 MB, and the GC pause is consistently < 10 ms. The Lambda runtime is billed per 1 ms, so the entire aggregation job costs ≈ $0.004 per tenant per day.

We also rewrote the Valtix CLI's show limits to include three new rows: