If you work in High-Frequency Trading (HFT), AI cluster training, or 5G URLLC, you know that average latency is a vanity metric. What actually matters is determinism: bounding your P99 tail latency so your systems don't experience random, application-breaking jitter.
Recently, my team was tasked with building a zero-trust gateway and rate limiter (Aegis NanoPAM). Traditional user-space L7 proxies (like NGINX or Envoy) or standard Linux iptables were taking milliseconds to process rules under volumetric load. That wasn't going to cut it.
We wanted to see how fast we could push network security if we dropped it entirely into the Linux kernel using eBPF and XDP (eXpress Data Path).
The result? A Token Bucket rate limiter that executes in 629 nanoseconds with virtually zero OS jitter. Here is exactly how we built it.
The Problem: The Linux Network Stack is "Too Slow"
When a packet hits a standard Linux server, it triggers an interrupt, allocates a Socket Buffer (sk_buff), traverses the Netfilter hooks (iptables/nftables), routes through the TCP/IP stack, and finally context-switches into user space.
Under heavy load, this pipeline creates massive jitter. A packet that normally takes 50 microseconds might suddenly take 5 milliseconds because the OS scheduler decided to run a background task.
The Solution: eBPF and XDP
To bypass this, we wrote our gateway in restricted C and compiled it to eBPF. We then attached this program directly to the Network Interface Card (NIC) driver using XDP Native (DRV) Mode.
By hooking in at the XDP layer, our code runs in the driver's receive path, before the kernel allocates an sk_buff for the packet.
The Architecture:
Transparent Stealth Mode: The gateway operates transparently at Layer 2 and has no IP address of its own, which makes it invisible to port scanners (it still parses L3/L4 headers to make policy decisions).
O(1) BPF Hash Maps: We use a composite key (IP + Port), so each authorization decision is a single constant-time hash lookup.
Per-CPU Token Bucket: We built a lockless rate limiter that preserves nanosecond remainders to prevent micro-burst starvation.
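BPF hash maps hash and compare keys as raw bytes. Two identical (IP, port) pairs therefore match only if every byte of the key is identical, padding included. The article only says the key is "IP + Port", so the `auth_key` layout below is a hypothetical sketch: an IPv4 source address plus a destination port, with an explicit `pad` field. It is written as plain user-space C so it can be compiled and checked on its own.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical key layout: the article only specifies "IP + Port". */
struct auth_key {
    uint32_t saddr; /* IPv4 source address, network byte order */
    uint16_t dport; /* TCP/UDP destination port, network byte order */
    uint16_t pad;   /* explicit padding so all 8 bytes are under our control */
};

/* Build a key with every byte initialized, including pad. */
static struct auth_key make_key(uint32_t saddr, uint16_t dport)
{
    struct auth_key k;
    memset(&k, 0, sizeof(k));
    k.saddr = saddr;
    k.dport = dport;
    return k;
}

/* Mirrors the byte-wise key comparison a BPF hash map performs. */
static int keys_equal(const struct auth_key *a, const struct auth_key *b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}
```

In an XDP program the same pattern applies:

1. Declare the key on the stack.
2. Zero it.
3. Fill it from the parsed IP and TCP/UDP headers.
4. Pass its address to `bpf_map_lookup_elem`.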
Here is a simplified snippet of our in-kernel rate limit check:
```c
// Inside aegis_xdp.c (simplified: key, rate_limits, BUCKET_SIZE and
// NANOS_PER_TOKEN are defined elsewhere in the program)
__u64 now = bpf_ktime_get_ns();
struct bucket_state *state = bpf_map_lookup_elem(&rate_limits, &key);
if (!state) {
    // Initialize new connection: full bucket minus the token this packet spends
    struct bucket_state new_state = { .last_time = now, .tokens = BUCKET_SIZE - 1 };
    bpf_map_update_elem(&rate_limits, &key, &new_state, BPF_ANY);
} else {
    __u64 elapsed = now - state->last_time;
    // Keep this in 64 bits: a long idle gap must not wrap a __u32 token count
    __u64 new_tokens = elapsed / NANOS_PER_TOKEN;
    if (new_tokens > 0) {
        __u32 space = BUCKET_SIZE - state->tokens;
        state->tokens += (new_tokens > space) ? space : (__u32)new_tokens;
        // Advance by whole tokens only, preserving the nanosecond remainder
        state->last_time += new_tokens * NANOS_PER_TOKEN;
    }
    if (state->tokens < 1) return XDP_DROP; // Rate limit exceeded
    state->tokens--;
}
```
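The remainder handling is easiest to see outside the kernel. Below is a user-space port of the same refill logic, with these substitutions:

- `BUCKET_SIZE = 4` and `NANOS_PER_TOKEN = 1000` are hypothetical values chosen for illustration.
- The `in_use` flag stands in for "map entry exists".
- The timestamp is passed in explicitly instead of coming from `bpf_ktime_get_ns()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values for illustration; the real Aegis constants are not given. */
#define BUCKET_SIZE     4u
#define NANOS_PER_TOKEN 1000u /* one token per microsecond */

struct bucket_state {
    uint64_t last_time; /* ns timestamp up to which tokens have been credited */
    uint32_t tokens;    /* tokens currently available */
    bool in_use;        /* stands in for "map entry exists" */
};

/* Returns true to pass the packet, false where the XDP program returns XDP_DROP. */
static bool bucket_allow(struct bucket_state *state, uint64_t now)
{
    if (!state->in_use) {
        /* First packet: full bucket minus the token this packet spends. */
        state->last_time = now;
        state->tokens = BUCKET_SIZE - 1;
        state->in_use = true;
        return true;
    }
    uint64_t new_tokens = (now - state->last_time) / NANOS_PER_TOKEN;
    if (new_tokens > 0) {
        uint32_t space = BUCKET_SIZE - state->tokens;
        state->tokens += (new_tokens > space) ? space : (uint32_t)new_tokens;
        /* Advance by whole tokens only, so the sub-token remainder carries over. */
        state->last_time += new_tokens * NANOS_PER_TOKEN;
    }
    assert(state->tokens <= BUCKET_SIZE); /* refill never overfills */
    if (state->tokens < 1)
        return false;
    state->tokens--;
    return true;
}
```

Here is one trace through the function:

- Five packets at t = 0 ns: the first four pass and the fifth is dropped.
- A packet at 1500 ns passes and leaves `last_time` at 1000, so the 500 ns remainder carries forward.
- A packet at 1999 ns is dropped.
- A packet at 2000 ns is admitted.

If the refill set `last_time = now` instead, the packet at 2000 ns would be dropped, because the 500 ns already elapsed at 1500 ns would be thrown away.

Computing `new_tokens` in 64 bits also matters after long idle gaps. A gap of exactly 2^32 token periods would wrap a 32-bit count to zero and refill nothing.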
The Secret Sauce: Hardware Pinning
Writing fast eBPF code wasn't enough. To achieve true determinism, we had to stop the operating system from interfering with our packet processing.
We took a 5.1 GHz CPU core (Core 10) and completely isolated it from the Linux OS scheduler. We routed all incoming NIC IRQs for our target interface directly to this isolated core.
By dedicating hardware exclusively to the XDP fast path, we eliminated context switching and CPU cache thrashing.
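The article doesn't say which tools were used for the isolation. One common approach has three parts:

- Boot with `isolcpus=10`, often alongside `nohz_full=10`.
- Write a CPU mask to `/proc/irq/<N>/smp_affinity` for each of the NIC's IRQs.
- Stop `irqbalance` so it doesn't move them back.

A minimal sketch of the affinity step follows. The helper names are hypothetical, and the IRQ numbers have to come from `/proc/interrupts` on your machine.

```c
#include <stdio.h>

/* Format the hex bitmask that /proc/irq/<N>/smp_affinity expects for a single
 * CPU. Kept to CPUs 0-31 so the mask fits in one 32-bit word. */
static int cpu_mask_hex(unsigned cpu, char *buf, size_t len)
{
    if (cpu > 31)
        return -1;
    int n = snprintf(buf, len, "%x", 1u << cpu);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Route one IRQ to one CPU. Needs root; returns 0 on success. */
static int pin_irq_to_cpu(unsigned irq, unsigned cpu)
{
    char path[64], mask[16];
    if (cpu_mask_hex(cpu, mask, sizeof(mask)) != 0)
        return -1;
    snprintf(path, sizeof(path), "/proc/irq/%u/smp_affinity", irq);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int written = fprintf(f, "%s\n", mask);
    int closed = fclose(f); /* write errors on procfs surface at close */
    return (written > 0 && closed == 0) ? 0 : -1;
}
```

For Core 10 the mask is `400` (only bit 10 set). Writing `10` to the neighboring `smp_affinity_list` file is equivalent and avoids building the mask by hand.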
The Results: Bounding the Tail
We ran a volumetric stress test using nping to blast the interface with TCP SYN packets and monitored the in-kernel execution time using bpf_trace_printk.
P50 Latency (Median): 629 ns
P99 Latency (Tail): 645 ns
Total Peak-to-Peak Jitter: < 41 ns
Compared to standard Netfilter configurations under load, our XDP datapath was operating roughly 8,000x faster, with a narrow, tightly clustered latency distribution: only 16 ns separates the median from the P99.
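The article reports P50, P99 and peak-to-peak figures but doesn't spell out the estimator, and the repository's Python utility may compute them differently. As a sketch, assume the per-packet nanosecond durations have already been pulled out of the trace output into an array. The nearest-rank method then gives:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for unsigned 64-bit nanosecond samples. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile (p in 1..100, n > 0). Sorts samples in place. */
static uint64_t percentile_ns(uint64_t *samples, size_t n, unsigned p)
{
    qsort(samples, n, sizeof(*samples), cmp_u64);
    size_t rank = ((size_t)p * n + 99) / 100; /* ceil(p * n / 100) */
    return samples[rank - 1];
}

/* Peak-to-peak jitter: spread between the fastest and slowest sample. */
static uint64_t peak_to_peak_ns(const uint64_t *samples, size_t n)
{
    uint64_t lo = samples[0], hi = samples[0];
    for (size_t i = 1; i < n; i++) {
        if (samples[i] < lo) lo = samples[i];
        if (samples[i] > hi) hi = samples[i];
    }
    return hi - lo;
}
```

With fewer than 100 samples, the nearest-rank P99 is simply the maximum. A meaningful tail figure therefore needs substantially more than 100 samples per run.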
The "Proof of Life" Benchmark
Seeing is believing. Here is an uncut terminal recording of the Aegis NanoPAM datapath logging deterministic 629 ns execution times under active load:
https://www.loom.com/share/d91cf570882840d5be81275bb6b7608d
Try it yourself
We have open-sourced the aegis_xdp.c kernel code, the Bash testing harness, and the Python P99 histogram graphing utility so you can reproduce these metrics on your own subnet.
Check out the Aegis-XDP (NanoPAM) Repo on GitHub
Have you experimented with eBPF/XDP for performance-critical infrastructure? I'd love to hear how you are optimizing your map lookups or handling state in the comments below!