Nalin Banga

Posted on Jun 28

Building a 5G UPF That Actually Saturates a 10G Link: VPP + DPDK + Open5GS in Production

#networking #linux #performance #5g

Most writing about DPDK and VPP stays comfortably abstract. Libraries and drivers for fast packet processing. Kernel bypass. Zero-copy I/O. Poll-mode drivers. The concepts are well-documented. What's harder to find is an account of what happens when you actually build something with them — the architectural decisions, the dead ends, and what the profiler tells you when you think you're done.

This post is that account.

We needed a User Plane Function (UPF) for a 5G core built on Open5GS. The baseline implementation used a socket-based forwarding path. Under sustained load on a 10G link, it peaked at around 850 Mbps — roughly 8.5% of line rate. The hardware was not the constraint. The path the packets took through the kernel was.

The goal was to get as close to line rate as the hardware would allow, in software, on commodity x86, integrated with Open5GS's SMF via PFCP. No FPGAs. No SmartNICs. No proprietary ASICs.

We got to 8.5–9 Gbps.

What follows is a detailed walkthrough of how we built the VPP-based packet pipeline, integrated DPDK's poll-mode drivers for zero-copy I/O, designed the GTP-U encap/decap graph nodes, and wired it all to Open5GS's session management layer — along with what we learned when the bottleneck stopped being software and started being the PCIe bus.

Why socket-based forwarding hits a ceiling

The original UPF used standard Linux sockets to receive packets from the GTP-U tunnel, apply session rules, and forward them to the correct destination. On paper this works. In practice, every packet crossing from NIC to userspace carries a fixed cost: a kernel interrupt or poll cycle, a memory copy from kernel buffer to userspace buffer, a context switch if the thread was sleeping.

At low traffic rates, none of this matters. At 10G line rate, the overhead compounds fast. With minimum-size 64-byte frames, that is roughly 14.88 million packets per second, or about 67 ns per packet. Every unnecessary memory copy, every syscall, every cache miss is multiplied by that rate. The socket path wasn't slow because of bad code; it was slow because the design requires the kernel to be in the critical path for every single packet.
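The packet-rate figure falls out of simple arithmetic. A quick sketch, assuming standard Ethernet framing, where each frame also occupies 20 bytes on the wire for the preamble, start-of-frame delimiter, and inter-frame gap (the function names are made up for illustration):

```python
# Bytes each frame occupies on the wire beyond the frame itself:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 7 + 1 + 12


def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Maximum frames per second at line rate for a given frame size."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def per_packet_budget_ns(link_bps: float, frame_bytes: int) -> float:
    """Time available to handle one packet if the link is kept full."""
    return 1e9 / packets_per_second(link_bps, frame_bytes)


pps = packets_per_second(10e9, 64)
budget = per_packet_budget_ns(10e9, 64)
print(f"{pps / 1e6:.2f} Mpps, {budget:.1f} ns per packet")
# prints "14.88 Mpps, 67.2 ns per packet"
```

At roughly 67 ns per packet, a single syscall or an extra buffer copy can plausibly consume most or all of the budget on its own.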

The fix isn't to write better socket code. It's to remove the kernel from the path entirely.

Choosing the right foundation

The decision to build on VPP, the open-source, production-grade dataplane framework originally developed at Cisco and now maintained under the FD.io project, alongside DPDK was not arbitrary.

VPP's graph-node architecture is purpose-built for exactly this problem: it processes packets in vectors rather than one at a time, amortising the fixed overhead of cache misses and branch mispredictions across large batches. Pair that with DPDK's kernel bypass and poll-mode drivers, and you have a foundation capable of matching — and in many cases exceeding — dedicated hardware forwarding.

The integration target was Open5GS, a widely used open-source 5G core implementation. The SMF (Session Management Function) in Open5GS controls session setup and teardown via PFCP — a control-plane protocol that instructs the UPF on how to handle each subscriber's packets. Our goal was a UPF that could receive those instructions from Open5GS SMF and act on them at full line rate.

Designing the graph node pipeline

VPP structures all packet processing as a directed graph of nodes. Each node receives a vector of packets, applies a specific function, and passes packets to the next node via labelled edges. This maps cleanly onto what a UPF needs to do:

Receive raw packets from the NIC (DPDK PMD node)
Parse and validate GTP-U headers (custom decap node)
Apply per-session forwarding rules from PFCP (session lookup node)
Route and forward to the data network for uplink traffic
Re-encapsulate and transmit toward the RAN for downlink traffic (custom encap node)

We decomposed the UPF forwarding path (GTP-U tunnel switching) into discrete graph nodes chained together. The decomposition mirrors how PFCP session rules define forwarding behaviour: each PDR (Packet Detection Rule) and FAR (Forwarding Action Rule) pair becomes a node traversal decision.

The key design choice was keeping node functions narrow. A node that does one thing well processes its vector fast and hands off. Nodes that try to do too much break the cache locality that makes VPP fast.
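To make the vector-per-node idea concrete, here is a toy dispatcher in Python. It is only a model of the pattern, not VPP's API: real nodes are written in C and registered with VPP, and the node names, the `run_graph` helper, and the dict-based packets below are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Packet = dict  # hypothetical stand-in for a packet buffer plus metadata
# A node takes a whole vector and returns (next_node_name, packet) pairs.
NodeFn = Callable[[List[Packet]], List[Tuple[str, Packet]]]


def run_graph(nodes: Dict[str, NodeFn], start: str,
              vector: List[Packet]) -> Dict[str, List[Packet]]:
    """Dispatch vectors node by node; names without a node act as sinks."""
    pending = {start: list(vector)}
    sinks: Dict[str, List[Packet]] = {}
    while pending:
        name, batch = pending.popitem()
        if name not in nodes:
            sinks.setdefault(name, []).extend(batch)
            continue
        # The node runs once per vector, not once per packet.
        for next_name, pkt in nodes[name](batch):
            pending.setdefault(next_name, []).append(pkt)
    return sinks


def classify(batch):
    # One narrow job: tunnelled packets go to decap, the rest are dropped.
    return [("gtpu-decap" if p["gtpu"] else "drop", p) for p in batch]


def gtpu_decap(batch):
    # Mark the packet as decapsulated and hand it to the lookup node.
    return [("session-lookup", {**p, "gtpu": False}) for p in batch]


def session_lookup(batch):
    known_teids = {0x1001}  # illustrative session table
    return [("ip-forward" if p["teid"] in known_teids else "drop", p)
            for p in batch]


NODES = {"classify": classify, "gtpu-decap": gtpu_decap,
         "session-lookup": session_lookup}
packets = [{"teid": 0x1001, "gtpu": True},
           {"teid": 0x2002, "gtpu": True},
           {"teid": 0, "gtpu": False}]
out = run_graph(NODES, "classify", packets)
```

Each node makes one decision and labels the outgoing edge, which is the property that keeps the per-node working set small.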

GTP-U encapsulation: the custom node

GTP-U (GPRS Tunnelling Protocol — User Plane) is the tunnel protocol that carries subscriber traffic between the RAN and the UPF. Every packet entering from the gNodeB is encapsulated in GTP-U. Every packet going back toward the RAN needs to be re-encapsulated with the correct TEID (Tunnel Endpoint Identifier) for that subscriber session.

We implemented this as two custom VPP graph nodes: a decap node for the uplink path and an encap node for the downlink path.

The decap node:

Strips the outer UDP/IP headers and the GTP-U header
Extracts the TEID and looks up the corresponding PFCP session entry
Passes the inner packet and session context to the forwarding node
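The parsing step of the decap node can be sketched in a few lines. This is a Python illustration of the GTP-U header layout, not the C node itself. It assumes the outer IP/UDP headers have already been stripped, and the function name and return shape are invented for the example.

```python
import struct

GTPU_MANDATORY_LEN = 8  # flags, message type, length, TEID
GTPU_MSG_GPDU = 0xFF    # G-PDU: the message type that carries user packets


def gtpu_decap(udp_payload: bytes):
    """Return (teid, inner_packet) for a valid G-PDU, else None."""
    if len(udp_payload) < GTPU_MANDATORY_LEN:
        return None
    flags, msg_type, length, teid = struct.unpack_from("!BBHI", udp_payload, 0)
    version = flags >> 5
    protocol_type = (flags >> 4) & 1
    if version != 1 or protocol_type != 1 or msg_type != GTPU_MSG_GPDU:
        return None
    # The length field counts everything after the 8 mandatory bytes.
    end = GTPU_MANDATORY_LEN + length
    if end > len(udp_payload):
        return None
    offset = GTPU_MANDATORY_LEN
    if flags & 0x07:  # any of E, S, PN set: 4 optional bytes follow
        if offset + 4 > end:
            return None
        # The next-extension-type byte only counts when the E flag is set.
        next_ext = udp_payload[offset + 3] if flags & 0x04 else 0
        offset += 4
        while next_ext != 0:  # walk the extension header chain
            if offset >= end:
                return None
            ext_len = udp_payload[offset] * 4  # length is in 4-byte units
            if ext_len == 0 or offset + ext_len > end:
                return None
            next_ext = udp_payload[offset + ext_len - 1]
            offset += ext_len
    return teid, udp_payload[offset:end]


plain = struct.pack("!BBHI", 0x30, 0xFF, 5, 0x1001) + b"inner"
with_ext = (struct.pack("!BBHI", 0x34, 0xFF, 13, 0x1001)
            + bytes([0, 0, 0, 0x85])           # seq, N-PDU, next ext type
            + bytes([0x01, 0x10, 0x09, 0x00])  # one 4-byte extension header
            + b"inner")
```

The extension-header walk matters in 5G because packets on N3 commonly carry a PDU Session Container extension, so a parser that assumes a fixed 8-byte header would misread the inner packet.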

The encap node:

Takes inner IP packets from the session lookup node
Prepends the GTP-U header with the correct TEID from the session table
Prepends the outer UDP/IP headers
Passes to the DPDK TX node
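The encap steps can be sketched the same way. The Python below builds the GTP-U, UDP, and outer IPv4 headers around an inner packet. It is a simplified illustration, not the production node: there are no extension headers, the UDP checksum is left at zero (permitted for UDP over IPv4), and the function names and addresses are hypothetical.

```python
import socket
import struct

GTPU_PORT = 2152  # registered UDP port for GTP-U


def ipv4_checksum(header: bytes) -> int:
    """Ones'-complement sum over the header's 16-bit words."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def gtpu_encap(inner: bytes, teid: int, src_ip: str, dst_ip: str) -> bytes:
    """Wrap an inner IP packet in GTP-U, UDP, and outer IPv4 headers."""
    # Flags 0x30: version 1, protocol type GTP, no optional fields.
    gtpu = struct.pack("!BBHI", 0x30, 0xFF, len(inner), teid)
    udp_len = 8 + len(gtpu) + len(inner)
    udp = struct.pack("!HHHH", GTPU_PORT, GTPU_PORT, udp_len, 0)
    total_len = 20 + udp_len
    ip = struct.pack("!BBHHHBBH4s4s",
                     0x45, 0, total_len,      # version/IHL, TOS, total length
                     0, 0,                    # identification, flags/fragment
                     64, socket.IPPROTO_UDP,  # TTL, protocol
                     0,                       # checksum placeholder
                     socket.inet_aton(src_ip), socket.inet_aton(dst_ip))
    # Patch the computed checksum into bytes 10-11 of the header.
    ip = ip[:10] + struct.pack("!H", ipv4_checksum(ip)) + ip[12:]
    return ip + udp + gtpu + inner


pkt = gtpu_encap(b"inner", 0x2002, "10.0.0.1", "10.0.0.2")
```

In a real dataplane the outer header is usually precomputed per session and only the length fields and checksum are patched per packet, which is why the TEID lives in the session table.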

Both nodes operate on packet vectors. The session table lookup in the decap node is the hot path: we use a hash table keyed on TEID and prefetch the next packet's entry while processing the current one, which keeps lookup latency from stalling the pipeline.

Integration with Open5GS: the control plane side

The control plane integration is handled via PFCP. Open5GS's SMF sends PFCP Session Establishment Requests to the UPF when a subscriber session comes up. Each request contains a set of PDRs and FARs that describe how to handle that subscriber's packets.
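As a rough picture of what arrives in a Session Establishment Request, the sketch below models PDRs and FARs as plain Python dataclasses and turns them into two lookup tables. It is heavily simplified: real PFCP rules carry far more (precedence, source interface, SDF filters, QERs, URRs), and every class and field name here is hypothetical rather than taken from Open5GS.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FAR:
    far_id: int
    action: str                       # "forward" or "drop" (simplified)
    outer_teid: Optional[int] = None  # set when encapsulating toward the RAN
    peer_ip: Optional[str] = None     # gNodeB address for the outer header


@dataclass
class PDR:
    pdr_id: int
    far_id: int                       # which FAR applies on a match
    local_teid: Optional[int] = None  # uplink match: TEID on incoming GTP-U
    ue_ip: Optional[str] = None       # downlink match: destination UE address


@dataclass
class SessionTables:
    by_teid: Dict[int, FAR] = field(default_factory=dict)
    by_ue_ip: Dict[str, FAR] = field(default_factory=dict)

    def establish(self, pdrs: List[PDR], fars: List[FAR]) -> None:
        """Resolve each PDR to its FAR and index it by its match key."""
        far_by_id = {f.far_id: f for f in fars}
        for pdr in pdrs:
            far = far_by_id[pdr.far_id]
            if pdr.local_teid is not None:
                self.by_teid[pdr.local_teid] = far
            if pdr.ue_ip is not None:
                self.by_ue_ip[pdr.ue_ip] = far


tables = SessionTables()
tables.establish(
    pdrs=[PDR(pdr_id=1, far_id=1, local_teid=0x1001),
          PDR(pdr_id=2, far_id=2, ue_ip="10.45.0.2")],
    fars=[FAR(far_id=1, action="forward"),
          FAR(far_id=2, action="forward", outer_teid=0x2002,
              peer_ip="192.168.0.10")],
)
```

The uplink PDR matches on the local TEID and the downlink PDR on the UE address, which is why the data path ends up with one table per direction.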

We implemented a PFCP message handler that translates SMF instructions into VPP runtime configuration changes — adding and removing graph-node state, updating tunnel endpoints, adjusting QoS marking — without disrupting in-flight packets. The control and data planes remain strictly separated.

The key implementation detail: VPP's thread model keeps the main thread (where control-plane operations happen) separate from the worker threads (where the packet graph runs). Session table updates go through a per-worker message queue and are applied at a safe point in the graph execution cycle. This means an SMF instruction to tear down a session never corrupts a packet that's mid-flight through the pipeline.
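The queue-and-safe-point pattern can be modelled in a few lines of Python. This illustrates the idea described above, not VPP's actual mechanism or API. The `Worker` class and its methods are hypothetical, and the safe point is taken to be the boundary between two packet vectors.

```python
import queue


class Worker:
    """One dataplane worker with a private session table."""

    def __init__(self):
        self.sessions = {}                  # TEID -> session context
        self.updates = queue.SimpleQueue()  # filled by the control thread

    def post(self, op: str, teid: int, ctx=None) -> None:
        """Called from the control-plane thread; never touches sessions."""
        self.updates.put((op, teid, ctx))

    def _apply_updates(self) -> None:
        # Drain everything queued so far; runs only on the worker itself.
        while True:
            try:
                op, teid, ctx = self.updates.get_nowait()
            except queue.Empty:
                return
            if op == "add":
                self.sessions[teid] = ctx
            elif op == "del":
                self.sessions.pop(teid, None)

    def process_vector(self, teids):
        """Apply pending updates, then process the vector on a stable table."""
        self._apply_updates()
        return [self.sessions.get(t) for t in teids]


w = Worker()
w.post("add", 0x1001, "session-a")
before_safe_point = dict(w.sessions)  # still empty: nothing applied yet
first = w.process_vector([0x1001, 0x2002])
w.post("del", 0x1001)
second = w.process_vector([0x1001])
```

Because only the worker ever mutates its own table, and only between vectors, no packet observes a half-applied update.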

What the numbers taught us

Reaching 8.5–9 Gbps on a 10G link was validating, but the more useful result was understanding where the remaining headroom went.

At full load, the limiting factor shifted from software overhead to NIC descriptor ring contention and PCIe bandwidth — a hardware constraint, not a code quality issue. That distinction matters: the software had cleared the way and the physical medium was now the actual bottleneck. You can't optimise your way past PCIe bandwidth with better code.

The profiling breakdown was roughly:

~80% of CPU cycles in the DPDK PMD poll loop at full load (expected — this is the poll-mode driver doing its job, burning cycles instead of sleeping)
~12% in the GTP-U decap/encap nodes
~6% in PFCP session table lookups
~2% in everything else (IP routing, checksum offload coordination, TX completion)

The session table lookup being only 6% was a result of the prefetch strategy working. Without prefetching, it was closer to 18% and the throughput ceiling was around 6.5 Gbps.

Where the remaining headroom went

At 8.5–9 Gbps, we were hitting the NIC descriptor ring limit before we hit any CPU core ceiling. The Intel NIC we were using has a fixed descriptor ring size, and at 10G line rate with small packets, the ring fills faster than it can be drained by a single DPDK poll-mode thread.
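A back-of-the-envelope calculation shows how little slack a descriptor ring provides at this rate. The ring sizes below are illustrative assumptions, not the configuration of the NIC used here, and the function name is made up.

```python
# Preamble + start-of-frame delimiter + inter-frame gap, in bytes.
WIRE_OVERHEAD_BYTES = 20


def ring_fill_time_us(ring_entries: int, link_bps: float,
                      frame_bytes: int) -> float:
    """Microseconds for line-rate traffic to fill an undrained RX ring."""
    pps = link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)
    return ring_entries / pps * 1e6


for entries in (512, 1024, 4096):  # hypothetical ring sizes
    print(f"{entries} descriptors: "
          f"{ring_fill_time_us(entries, 10e9, 64):.1f} us")
```

With 64-byte frames, a 512-entry ring fills in about 34 microseconds, so any stall of the polling thread longer than that results in drops at the NIC.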

The options at this point are:

Multi-queue with RSS (Receive Side Scaling) to spread load across multiple DPDK threads — the right approach for scaling beyond one core
A 25G NIC, which shifts the hardware ceiling up and gives more headroom for the software path
SmartNIC offload for GTP-U decap, moving that work off the CPU entirely

For our use case, the single-thread result was sufficient. But the architecture is designed to scale horizontally — each VPP worker thread handles its own set of NIC queues and its own session table shard, with no shared state between workers.
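A minimal sketch of the shard-per-worker layout, in Python for brevity. It assumes sessions are assigned to workers by TEID and that NIC queue steering delivers each tunnel's packets to the worker that owns its shard. That second point is an assumption: standard RSS hashes the outer IP/UDP fields, so steering and sharding have to be made to agree. The class and method names are hypothetical.

```python
from typing import Dict, List, Optional


class ShardedSessionTable:
    """One private session dict per worker; no cross-worker sharing."""

    def __init__(self, num_workers: int):
        self.num_workers = num_workers
        self.shards: List[Dict[int, str]] = [{} for _ in range(num_workers)]

    def owner(self, teid: int) -> int:
        # Illustrative placement rule; a real system would match NIC steering.
        return teid % self.num_workers

    def add(self, teid: int, ctx: str) -> None:
        self.shards[self.owner(teid)][teid] = ctx

    def lookup(self, worker: int, teid: int) -> Optional[str]:
        """A worker only ever consults its own shard."""
        return self.shards[worker].get(teid)


table = ShardedSessionTable(num_workers=4)
for teid in range(0x1000, 0x1010):
    table.add(teid, f"session-{teid:#x}")
```

Since each worker reads and writes only its own shard, the hot path needs no locks, and adding workers adds capacity without adding contention.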

Takeaways

A few things that are obvious in hindsight but weren't at the start:

Profile before you optimise. The first two weeks of optimisation work targeted the wrong thing — we were tuning the GTP-U header parsing when the real cost was in the session table hash lookups. The profiler told a different story than the intuition.

VPP's vector model only pays off if your nodes stay narrow. The first version of the decap node tried to do session lookup and QoS marking in a single node. It was fast at low packet counts and slow at high ones, because the larger working set was killing cache locality. Splitting it into two nodes and keeping each one cache-friendly recovered about 800 Mbps.

PFCP control-plane separation is non-negotiable. The original design had session table updates happening inline with packet processing. It worked fine in testing. Under real load with concurrent session setup and teardown, it produced intermittent corruption. Moving updates to the per-worker message queue cost two days of refactoring and eliminated the problem entirely.

The kernel isn't always the enemy. DPDK bypasses the kernel for the data path, but we still use the kernel for PFCP socket handling, management interfaces, and logging. The lesson isn't "avoid the kernel" — it's "don't put the kernel in the fast path."

Try it yourself

The test environment configuration — Docker Compose setup, VPP startup.conf, DPDK NIC binding scripts, and Open5GS SMF integration config — is on GitHub.

If you're building something similar, the PFCP session handler and the GTP-U decap node are the most likely starting points to adapt for your use case.

About the author

I'm a systems software engineer working on 5G core infrastructure — UPF, AMF, VPP, DPDK, Open5GS, Linux networking. Based in Delhi/Gurgaon.

Open to senior engineering roles and consulting in 5G core, dataplane engineering, and high-performance packet processing.
Connect on LinkedIn: https://www.linkedin.com/in/jarvis8

Canonical URL: https://www.linkedin.com/pulse/from-850-mbps-9-gbps-what-actually-takes-build-fast-upf-nalin-banga-ms7qc

Originally published on LinkedIn Pulse.
