DEV Community: Syed Mannan Saood

OpenTelemetry eBPF Instrumentation: Zero-Code Observability Becomes a Standard

Syed Mannan Saood — Tue, 28 Jul 2026 14:40:00 +0000

TL;DR: For twenty years, getting distributed traces out of an application meant adding an SDK, a language agent, or a sidecar, with the code changes, restarts, dependency upgrades, and per-language maintenance that come with them. OpenTelemetry eBPF Instrumentation (OBI), which reached alpha in November 2025 and beta at KubeCon EU 2026, captures HTTP/gRPC/SQL/Redis/Kafka traces directly from the kernel with zero code changes, by attaching to a running process from outside it entirely. It's the OpenTelemetry project's formal successor to Grafana Beyla, now co-developed with Splunk, Coralogix, and Odigos. The catch: doing this from outside the process requires kernel privileges most security teams have spent years trying to eliminate.

The Instrumentation Problem OBI Is Solving

Every observability approach before eBPF required getting into the process somehow.

Manual instrumentation: developers add SDK calls around every operation they want traced. Accurate, but it's ongoing engineering work multiplied across every service, every language, every internal library.

Auto-instrumentation agents: a language-specific agent (Java agent, Python auto-instrumentor, Node.js require-in-the-middle hook) attaches at startup and monkey-patches known libraries. Zero code changes, but it's still in-process; it adds a runtime dependency, requires a restart to attach, and if that dependency has a vulnerability or a memory leak, it's now your application's problem too.

Service mesh sidecars: move instrumentation to a proxy sitting next to the container. Solves the in-process dependency problem, but you're now running an extra container per pod, and the proxy still only sees traffic at the network boundary, not internal function calls.

Every one of these approaches has the same shape: something has to be added to, or deployed alongside, the thing you want to observe.

eBPF-based instrumentation breaks that pattern. It attaches from outside the process, at the kernel level, without touching the application's code, dependencies, or container image at all.

What OBI Actually Is

OBI's official description: it "runs out-of-process and instruments at the protocol level, rather than at the library level." That phrasing matters; it's not patching a specific HTTP client library's send function. It's watching the actual bytes on the wire and in relevant kernel/library data structures, and reconstructing HTTP/gRPC/SQL/Redis/Kafka semantics from that.

The practical claims from the OpenTelemetry project's own announcement:

No restarts, no code changes, no configuration changes; telemetry capture starts on an already-running process
No new application dependencies, so no new dependency-introduced vulnerabilities
Minimal CPU/memory footprint even at high request rates, because the heavy lifting happens in the kernel
Automatic W3C trace context propagation across every supported language, without each language needing its own context-propagation implementation
Protocol coverage: HTTP/HTTPS, HTTP/2, gRPC, SQL, Redis, MongoDB, Kafka, GraphQL, Elasticsearch/OpenSearch, AWS S3

It detects when an application is already instrumented with a proper OpenTelemetry SDK and avoids duplicating those signals, meaning it's explicitly designed to coexist with, not replace, existing instrumentation.

Lineage: From Beyla to OBI

OBI didn't start from nothing. It's the direct continuation of Grafana Beyla, first released by Grafana Labs in September 2023, built specifically to auto-instrument compiled languages like Go and Rust where you can't easily inject a runtime agent the way you can with Java or Python.

Earlier in 2025, Grafana Labs donated Beyla to the OpenTelemetry project. The first alpha release under the new name OpenTelemetry eBPF Instrumentation landed November 3, 2025, co-authored by maintainers from Grafana Labs and Splunk, with Coralogix and Odigos also contributing. Splunk then announced the beta at KubeCon + CloudNativeCon Europe 2026 in Amsterdam, alongside general availability of the Splunk Operator for Kubernetes.

Why does the donation-and-rename matter architecturally, not just organizationally? Before OBI, the eBPF observability space was fragmented by vendor: Pixie was tied to New Relic, Beyla skewed toward Grafana Cloud, and Cilium Hubble covered network flows (L3/L4) but stopped before application-level tracing. OBI is explicitly positioned as the vendor-neutral convergence point; a CNCF Observability TAG survey found 67% of production Kubernetes clusters were already running at least one eBPF-based observability tool by early 2026, which is the adoption curve that made a standard layer worth building.

Architecture: Two Spaces, One Pipeline

OBI's system architecture (per the project's own DeepWiki-documented internals) splits cleanly into kernel space and user space.

Kernel space: eBPF programs attached via kprobes, uprobes, and TC (traffic control) hooks capture network and application events, storing intermediate state in BPF maps and the shared memory structures that let kernel-space eBPF programs and user-space processes exchange data safely.

User space: a discovery subsystem identifies which running processes should be instrumented (based on configurable criteria such as process name, port, and Kubernetes labels), a tracer component controls the eBPF program lifecycle (loading, attaching, unloading), reader components consume events out of the BPF maps, processing layers enrich and normalize that raw data into OpenTelemetry's semantic conventions, and exporters ship the final metrics/traces/spans to a backend.

The mechanism for capturing HTTP/gRPC data specifically:

1. OBI inspects the target binary to identify language/framework
2. Attaches uprobes at entry/return points of relevant functions
   (e.g., Go's net/http handler functions, gRPC stream methods)
3. Attaches to TLS library functions (e.g., OpenSSL) to read
   plaintext BEFORE encryption / AFTER decryption
4. Captures request/response metadata directly from kernel
   and userspace data structures at those hook points
5. Correlates request/response pairs, computes duration,
   status code, and route, then assembles a span
6. Injects/reads W3C traceparent headers for cross-service correlation

Step 3 is worth pausing on: because a uprobe sits at the function boundary inside the process's own address space, OBI can read HTTP payloads before TLS encrypts them on the way out and after TLS decrypts them on the way in, without ever touching the TLS handshake or terminating encryption anywhere. This is a meaningfully different approach from a network proxy, which only ever sees encrypted bytes unless it's doing active TLS termination.
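Steps 5 and 6 above can be made concrete with a small user-space sketch. This is not OBI's code: the `Event` shape, the `conn_id` pairing key, and the span dictionary are hypothetical stand-ins for whatever the real hook points emit. Only the `traceparent` layout (version, 32-hex trace ID, 16-hex parent ID, flags) comes from the W3C Trace Context format.

```python
import re
import secrets
from dataclasses import dataclass

@dataclass
class Event:
    """One hook-point observation (hypothetical shape, not OBI's wire format)."""
    conn_id: int           # assumed pairing key: the connection the event was seen on
    kind: str              # "request" or "response"
    ts_ns: int             # monotonic timestamp in nanoseconds
    method: str = ""       # set on requests
    route: str = ""        # set on requests
    status: int = 0        # set on responses
    traceparent: str = ""  # incoming W3C header value, if one was present

_TRACEPARENT = re.compile(r"[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}")

def parse_traceparent(header):
    """Return (trace_id, parent_span_id), or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid in W3C Trace Context
    return trace_id, span_id

def assemble_spans(events):
    """Pair each response with the pending request on the same connection."""
    pending = {}
    spans = []
    for ev in sorted(events, key=lambda e: e.ts_ns):
        if ev.kind == "request":
            pending[ev.conn_id] = ev
        elif ev.kind == "response" and ev.conn_id in pending:
            req = pending.pop(ev.conn_id)
            parent = parse_traceparent(req.traceparent) if req.traceparent else None
            spans.append({
                # Join the caller's trace if a valid header arrived, else start one.
                "trace_id": parent[0] if parent else secrets.token_hex(16),
                "parent_span_id": parent[1] if parent else None,
                "span_id": secrets.token_hex(8),
                "name": f"{req.method} {req.route}",
                "status": ev.status,
                "duration_ms": (ev.ts_ns - req.ts_ns) / 1e6,
            })
    return spans
```

The pairing key is the weak point of any such scheme: it works only as long as the request and its response can be tied to the same identifier.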

The uprobe-only Pivot

Here's a detail that didn't make it into the launch announcements but is architecturally significant: OBI removed kprobe-based collection in favour of a uprobe-only approach (tracked in PR #752 on the project's GitHub).

Why this matters: kprobes attach to kernel functions and syscalls, meaning a kprobe-based collector sits in the network datapath, observing every packet that crosses a syscall boundary, regardless of which process it belongs to. Uprobes attach to specific user-space functions inside a specific process, meaning the instrumentation only fires when that particular application code path executes.

The trade-off is explicit, and it's the same one that shows up in Cilium Hubble vs. Tetragon vs. Beyla comparisons generally:

kprobe/network-datapath approach: broader visibility (every packet, every process), but adds per-packet latency risk and a larger blast radius if something in the hot path misbehaves
uprobe-only approach: narrower, application-scoped visibility, lower risk to network connectivity and packet forwarding, but you lose the ability to see traffic OBI wasn't specifically told to watch for

OBI's maintainers chose the narrower, safer default. If you're building or evaluating a network-observability tool that does sit in the datapath (packet inspection, flow tracking, that category), this is precisely the design fork you're navigating, and OBI's choice tells you which side of the safety/coverage trade-off the OpenTelemetry community landed on for application tracing specifically. It doesn't mean the datapath approach is wrong; Cilium Hubble and Tetragon exist specifically because network-layer and security-enforcement use cases need that broader vantage point. It means OBI scoped itself deliberately to stay out of that territory.

What Zero-Code Actually Costs: Kernel Privileges

Here's the part vendor announcements gloss over. To load eBPF programs, attach uprobes to arbitrary processes, and resolve symbols for accurate function-boundary hooking, OBI needs real kernel privileges:

CAP_BPF: load and manage eBPF programs
CAP_PERFMON: open perf event buffers, attach kprobes/uprobes
CAP_SYS_PTRACE: read process memory for symbol resolution
CAP_NET_ADMIN: required specifically for TC eBPF socket-level hooks (kernel 5.8+)

The practical result, as one security-focused writeup on Beyla's model puts it plainly: this is "a process that can read arbitrary kernel structures, load unrestricted eBPF programs, and attach to any process anywhere on the node." That's not a criticism of implementation quality; it's the structural cost of a tool whose entire value proposition is "see everything without being told about it in advance."

The data-scope consequence: because OBI captures full HTTP request paths, query strings, and response codes directly from kernel-level interception, any PII embedded in a URL (user IDs, session tokens, search parameters, record identifiers) flows straight into trace output. This isn't a bug; it's the same category of exposure any full-request-capture tool has, but the "zero code, drop it in and go" pitch makes it easy to deploy before anyone's reviewed what's actually landing in your trace backend.
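To make that exposure concrete, here is a minimal sketch of what a review step might look for and one way to mask it before export. The `redact_query` helper is hypothetical and is not an OBI feature or configuration option; it only shows which part of a captured URL carries the values in question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def redact_query(url, mask="REDACTED"):
    """Keep query parameter names but replace every value with a mask."""
    parts = urlsplit(url)
    masked = [(key, mask) for key, _ in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit(parts._replace(query=urlencode(masked)))

captured = "https://shop.example/search?q=jane+doe&session=abc123"
print(redact_query(captured))
```

Note that this touches only the query string. Identifiers embedded in the path itself (for example `/users/48213/orders`) would need route templating or a separate rule.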

For production deployment, the security-hardening path is to replace the common privileged: true DaemonSet config with the specific minimal capability set above, plus allowPrivilegeEscalation: false and a read-only root filesystem, none of which is the default in most getting-started guides.
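As a rough sketch of that hardened configuration, the snippet below builds the container `securityContext` as a Python dictionary and prints it as JSON, which can be translated to YAML in a manifest. Two things here are assumptions rather than project guidance: dropping `ALL` capabilities before adding the four listed above, and the exact set a given OBI version and kernel need. Kubernetes writes capability names without the `CAP_` prefix.

```python
import json

# Capability names as Kubernetes expects them (no CAP_ prefix).
REQUIRED_CAPS = ["BPF", "PERFMON", "SYS_PTRACE", "NET_ADMIN"]

def hardened_security_context(caps=REQUIRED_CAPS):
    """Container securityContext that avoids privileged: true."""
    return {
        "privileged": False,
        "allowPrivilegeEscalation": False,
        "readOnlyRootFilesystem": True,
        # Assumption: start from nothing, then add back only what is needed.
        "capabilities": {"drop": ["ALL"], "add": list(caps)},
    }

print(json.dumps(hardened_security_context(), indent=2))
```

Verify the result against the kernel version on your nodes before rollout, since capability requirements for eBPF have shifted across kernel releases.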

Where OBI Genuinely Struggles

The OpenTelemetry project's own release notes are unusually candid about current limitations, which is worth taking at face value rather than reading as false modesty:

Distributed tracing quality varies sharply by language/runtime. OBI works well for Go (HTTP and gRPC), Node.js (HTTP), Python (HTTP), NGINX (HTTP), and PHP (HTTP/FPM). It currently does not handle distributed tracing well for:

Reactive programming frameworks
Java virtual threads
Complex thread pools

The underlying reason is structural, not a missing feature: uprobe-based correlation relies on being able to associate an incoming request with the specific execution context that handles it. Reactive frameworks and virtual threads deliberately decouple "the thread that received the request" from "the thread that ultimately processes it", which breaks exactly the assumption uprobe-based request/response pairing depends on.
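A toy model shows why the decoupling hurts. This is not OBI's correlation algorithm; it is a deliberately naive correlator that links an outgoing call to whatever incoming request is active on the same thread ID, with hypothetical event names.

```python
def correlate_by_thread(events):
    """events: (kind, thread_id, name) tuples in time order.

    Returns {outgoing_call_name: parent_request_name or None}.
    """
    active = {}  # thread_id -> incoming request currently handled on that thread
    links = {}
    for kind, tid, name in events:
        if kind == "server_in":
            active[tid] = name
        elif kind == "client_out":
            links[name] = active.get(tid)  # None means the parent was lost
        elif kind == "server_done":
            active.pop(tid, None)
    return links

# Thread-per-request: the handler and its outgoing query share thread 7.
classic = [
    ("server_in", 7, "GET /checkout"),
    ("client_out", 7, "SELECT orders"),
    ("server_done", 7, "GET /checkout"),
]

# Hand-off (reactive scheduler, virtual thread, pool): the query runs on thread 12.
handoff = [
    ("server_in", 7, "GET /checkout"),
    ("client_out", 12, "SELECT orders"),
    ("server_done", 7, "GET /checkout"),
]

print(correlate_by_thread(classic))
print(correlate_by_thread(handoff))
```

In the second trace the query still produces a span, but it has no parent, so the distributed trace breaks into disconnected pieces.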

It's additive, not a replacement. The project's own guidance is direct: if you've successfully instrumented a service with a proper OpenTelemetry SDK or agent, there's rarely a reason to rip that out for OBI, unless you're hitting specific performance or cost problems the SDK approach caused. OBI's stated best use cases are (a) getting any telemetry out of currently-uninstrumented services, especially compiled binaries where SDK instrumentation is awkward, and (b) covering libraries that don't have official OpenTelemetry support at all: legacy versions, unmaintained packages, anything nobody's written an instrumentation for.

The 2026 Roadmap

The OBI SIG's stated 2026 priorities, per their published goals:

Stable 1.0 release: the flagship goal, requiring documentation completeness, configuration standardisation, and production-readiness validation
Aligning network attributes with OpenTelemetry semantic conventions, and updating all semantic convention usage to current versions
An OpenTelemetry Collector distribution with OBI as a receiver, meaning OBI becomes a pluggable data source inside the standard Collector pipeline, rather than a standalone agent with its own export path
Integration with the OpenTelemetry eBPF profiler for unified observability (metrics, traces, and continuous profiling from the same kernel-level vantage point)
Runtime metrics directly from OBI, without needing a separate metrics exporter

The Collector-receiver integration is the one worth watching most closely; it's the difference between OBI as a standalone tool you deploy and manage separately, versus OBI as one interchangeable input into whatever OpenTelemetry pipeline you're already running.

The Architectural Question This Raises

If you're building or evaluating any eBPF-based tooling, whether that's application tracing, network flow visibility, or packet-level inspection, OBI's trajectory says something concrete about where the ecosystem is converging:

Vendor-specific eBPF tools are consolidating into CNCF-governed standards. The Beyla → OBI donation, following Cilium's earlier CNCF graduation, suggests this is the pattern going forward rather than a one-off.
The uprobe-vs-kprobe scoping decision is a real architectural fork, not an implementation detail, and it's one every eBPF observability or security tool has to make explicitly, trading breadth of visibility against blast radius and privilege requirements.
"Zero-code" doesn't mean "zero operational cost." It relocates the cost from application-code maintenance to kernel-privilege management and PII-in-traces governance: a different problem, not a solved one.
Coexistence, not replacement, is the honest framing. OBI fills gaps around existing SDK instrumentation rather than obsoleting it, and the project says so explicitly rather than overselling.

Key Takeaways

Zero-code instrumentation is real and works well for a specific slice of the problem: HTTP/gRPC/SQL tracing for mainstream frameworks in Go, Python, Node.js, PHP, and NGINX, especially for services you can't or don't want to modify.

The uprobe-only architecture is a deliberate safety trade-off, not a limitation; it trades broader kprobe-level visibility for lower blast radius in the network datapath, which is the right call for application tracing even though it means OBI can't see traffic patterns a network-layer tool would.

Kernel privilege requirements are substantial and often under-hardened by default: CAP_BPF, CAP_PERFMON, CAP_SYS_PTRACE, and CAP_NET_ADMIN together grant broad node-level access, and most getting-started guides deploy with privileged: true rather than the minimal capability set.

Distributed tracing correlation breaks down exactly where execution context decouples from the receiving thread (reactive frameworks, virtual threads, complex pools), because uprobe-based pairing fundamentally assumes a stable request-to-thread relationship.

This is consolidation, not novelty. OBI's significance isn't a new technical capability, since Beyla already did most of this; it's that the OpenTelemetry Governance Committee now owns it, which is what turns a good vendor tool into infrastructure other tools build on top of.

QUIC-Based NAT Traversal: How PUNCH_ME_NOW Frames Are Standardizing Hole Punching

Syed Mannan Saood — Fri, 17 Jul 2026 07:55:17 +0000

TL;DR: NAT traversal has historically relied on STUN/TURN/ICE running as an external signalling layer bolted onto whatever transport protocol you choose. A 2023 IETF draft by Marten Seemann (draft-seemann-quic-nat-traversal) proposes something different: use QUIC's own path validation mechanism, extended with three new frames (ADD_ADDRESS, PUNCH_ME_NOW, REMOVE_ADDRESS), to punch holes and migrate to direct paths natively. A 2024 measurement study confirms the theory that QUIC hole punching completes in 2–2.5 RTTs versus TCP's 2.5–3 RTTs, and connection migration for recovery saves 2–3 RTTs over re-punching. This piece breaks down the mechanism and what it means for anyone building peer-to-peer or relay-based networking tools.

Why Hole Punching Exists

NAT was never designed for peer-to-peer communication. It was a patch for IPv4 exhaustion: map a private address behind a router to a public one, keep a session table, and only let traffic back in if it matches an existing outbound flow.

That works fine for client-server traffic. It breaks P2P by default, because neither peer can accept an unsolicited inbound connection.

The scale of this problem is bigger than most people assume. Research on the PPLive P2P streaming system found that roughly 80% of end nodes sit behind NAT. A Bitcoin network study found a small subset of publicly reachable nodes carries 89% of transaction propagation precisely because most peers can't be dialled directly. NAT doesn't just add friction; it concentrates load onto whichever nodes happen to be reachable.

Hole punching is the workaround: get both peers to send outbound packets toward each other at roughly the same time, so each NAT's session table records the other side's address as "already talked to," and inbound traffic slips through as if it were a reply.

The mechanics of why this works depend entirely on the NAT's mapping behaviour.

NAT Mapping Rules: The Foundation

Endpoint-Independent Mapping: A private node's mapping to a public address stays the same regardless of which external peer it's talking to. If P is mapped to nodeP when talking to N1, it's still nodeP when talking to N2. This is what makes hole punching reliable: a relay server can hand out P's public mapping to any peer.

Address and Port-Dependent Mapping: The mapping changes per destination. P talking to N1 gets nodeP1; P talking to N2 gets nodeP2. A mapping learned via one connection is useless for punching a hole to a different peer.

This distinction matters because it's the difference between hole punching working reliably and not working at all. Symmetric (address/port-dependent) NATs are the primary reason traversal fails in the real world, regardless of which transport protocol you use.
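The two mapping rules can be modelled in a few lines. The `Nat` class below is a toy with made-up addresses and port allocation, not a model of any real device, but it shows why a mapping learned through the relay is reusable in one case and useless in the other.

```python
class Nat:
    """Toy NAT that allocates one public port per mapping key."""

    def __init__(self, public_ip, endpoint_independent=True):
        self.public_ip = public_ip
        self.endpoint_independent = endpoint_independent
        self._next_port = 40000
        self._mappings = {}

    def map_outbound(self, private_addr, dest_addr):
        # Endpoint-independent: key on the private address only.
        # Address and port-dependent: key on (private address, destination).
        key = private_addr if self.endpoint_independent else (private_addr, dest_addr)
        if key not in self._mappings:
            self._mappings[key] = (self.public_ip, self._next_port)
            self._next_port += 1
        return self._mappings[key]

P = ("10.0.0.5", 5000)
N1 = ("198.51.100.1", 443)
N2 = ("203.0.113.9", 443)

cone = Nat("192.0.2.10", endpoint_independent=True)
symmetric = Nat("192.0.2.20", endpoint_independent=False)

print(cone.map_outbound(P, N1), cone.map_outbound(P, N2))            # same mapping twice
print(symmetric.map_outbound(P, N1), symmetric.map_outbound(P, N2))  # two different mappings
```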

The Classic Hole Punching Sequence

With a relay server S and two clients A and B, both behind NAT:

1. A and B each connect to S and register their public and private addresses.
2. A requests B's address from S. S exchanges A's and B's public addresses.
3. A sends a connection request to B.
   → Hits NAT-B, gets dropped (no session entry exists yet on NAT-B for A→B)
   → But creates a session entry on NAT-A for A→B
4. B sends a connection request to A.
   → Hits NAT-A, and now MATCHES the session entry NAT-A just created
   → Passes through, reaching A
5. The "hole" is now open in both directions. Direct communication begins.
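Steps 3 to 5 can be simulated with two session tables. This sketch assumes endpoint-independent mapping, so each client's public endpoint is fixed and already known to the other side via S; the addresses are documentation-range placeholders.

```python
class SimpleNat:
    """Toy NAT: inbound traffic passes only if a matching outbound session exists."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()  # remote public endpoints we have sent traffic to

    def outbound(self, remote):
        self.sessions.add(remote)

    def inbound_allowed(self, remote):
        return remote in self.sessions

def send(src_nat, src_pub, dst_nat, dst_pub):
    """Send one packet; return True if it gets through the destination NAT."""
    src_nat.outbound(dst_pub)  # leaving always records a session entry
    return dst_nat.inbound_allowed(src_pub)

nat_a, nat_b = SimpleNat("NAT-A"), SimpleNat("NAT-B")
pub_a, pub_b = ("198.51.100.1", 40000), ("203.0.113.7", 41000)

step3 = send(nat_a, pub_a, nat_b, pub_b)  # dropped at NAT-B, but opens NAT-A
step4 = send(nat_b, pub_b, nat_a, pub_a)  # matches NAT-A's entry, reaches A
step5 = send(nat_a, pub_a, nat_b, pub_b)  # NAT-B now has an entry too
print(step3, step4, step5)
```

The first packet is sacrificial: its only job is to create the session entry that lets the peer's packet in.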

This sequence is protocol-agnostic in concept. What differs is how expensive each step is, depending on whether you're running TCP or QUIC underneath.

Where TCP Hole Punching Breaks Down

TCP hole punching carries three structural penalties:

1. Port multiplexing is mandatory. TCP sockets are one-to-one; a local port can only be bound to a single socket. To simultaneously listen for an inbound connection and initiate an outbound one on the same port (required for hole punching), you need SO_REUSEADDR/SO_REUSEPORT tricks and careful socket lifecycle management. QUIC has no such restriction; multiple "connections" can share a UDP socket natively, since QUIC demultiplexes by Connection ID, not by the OS socket layer.

2. Handshake cost stacks with TLS. TCP's three-way handshake is 1.5 RTTs before you've sent a single encrypted byte. TLS 1.3 on top adds another handshake because TCP operates in kernel space and can't natively fold TLS negotiation into its own handshake. QUIC integrates TLS 1.3 directly into the transport handshake, so encryption setup and connection setup happen in the same round trips.

3. Recovery means starting over. If a punched TCP connection drops because a mobile client switched networks or a NAT session timed out, there is no lightweight recovery path. You re-punch from scratch: new relay round trip, new three-way handshake, new TLS negotiation.
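The port-sharing requirement in penalty 1 looks roughly like this in practice. The sketch binds a listening socket and a second socket meant for the outbound dial to the same local port. It assumes a Linux-like platform where `SO_REUSEPORT` exists and both sockets belong to the same user; behaviour differs on other systems, and the helper name is hypothetical.

```python
import socket

def bind_shared_port(host="127.0.0.1"):
    """Bind a listener and a dial-capable socket to one local TCP port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    dialer = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    for sock in (listener, dialer):
        # Both options must be set on every socket BEFORE bind().
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        if hasattr(socket, "SO_REUSEPORT"):
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    listener.bind((host, 0))           # let the OS pick a free port
    port = listener.getsockname()[1]
    listener.listen(1)
    dialer.bind((host, port))          # second socket, same local port
    return listener, dialer, port

listener, dialer, port = bind_shared_port()
print("both sockets bound to local port", port)
listener.close()
dialer.close()
```

With QUIC none of this is needed, because one UDP socket can carry every connection.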

The IETF Draft: draft-seemann-quic-nat-traversal

Authored by Marten Seemann (Protocol Labs and the author of quic-go), this draft defines a QUIC extension that turns the protocol's existing path validation machinery into a traversal primitive without requiring a separate STUN stack.

The core idea

QUIC already has a mechanism for verifying a new network path is viable: path validation, using PATH_CHALLENGE and PATH_RESPONSE frames (RFC 9000, §8.2). This exists for connection migration, proving a new IP:port pair actually works before switching the active connection to it.

The draft's insight: if a client can trigger path validation on an unverified address, one that hasn't been active on the connection yet, and the server does the same on its end simultaneously, the packets sent during validation create the NAT bindings needed for a direct path. Hole punching, essentially, for free, using the infrastructure QUIC already has.

Critically, RFC 9000 assumes servers can already receive packets on a path without needing to create a NAT binding first. This draft explicitly extends path validation to work on the server side too, specifically to accommodate NAT traversal.

Three new frames

ADD_ADDRESS: server → client only. Advertises an address candidate.

ADD_ADDRESS Frame {
    Type (i) = 0x3d7e90..0x3d7e91,
    Sequence Number (i),
    [ IPv4 (32) ],
    [ IPv6 (128) ],
    Port (16),
}

Sent incrementally as candidates are discovered, mirroring Trickle ICE (RFC 8838), rather than waiting for full candidate gathering to complete. Address matching (pairing local and remote candidates) happens entirely client-side, using ICE's pairing algorithm (RFC 8445 §5.1) as a reference, though implementations are free to diverge.

PUNCH_ME_NOW: client → server only. Requests that the server begin path validation on a specific candidate pair.

PUNCH_ME_NOW Frame {
    Type (i) = 0x3d7e92..0x3d7e93,
    Round (i),
    Paired With Sequence Number (i),
    [ IPv4 (32) ],
    [ IPv6 (128) ],
    Port (16),
}

The Round field batches punching attempts; a new round immediately cancels all in-flight probes from the previous one, giving both sides a way to reprioritise without waiting for timeouts. Concurrency is capped by the server's advertised limit, communicated via a new transport parameter (nat_traversal, codepoint 0x3d7e9f0bca12fea6) during the handshake.

REMOVE_ADDRESS: server → client only. Invalidates a previously advertised candidate, e.g., when a network interface goes down.
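For a feel of what these frames look like on the wire, the sketch below serialises ADD_ADDRESS and PUNCH_ME_NOW from the field layouts shown above. The `(i)` fields use QUIC's variable-length integer encoding from RFC 9000 §16. One assumption is labelled in the code: that the lower type value in each range marks the IPv4 variant and the higher one the IPv6 variant. The function names are illustrative and do not come from quic-go or any other library.

```python
import ipaddress
import struct

# Assumption: lower codepoint = IPv4 variant, higher codepoint = IPv6 variant.
ADD_ADDRESS_V4, ADD_ADDRESS_V6 = 0x3D7E90, 0x3D7E91
PUNCH_ME_NOW_V4, PUNCH_ME_NOW_V6 = 0x3D7E92, 0x3D7E93

def encode_varint(value):
    """QUIC variable-length integer: 2-bit length prefix, 6/14/30/62-bit value."""
    if value < 0:
        raise ValueError("varints are unsigned")
    if value < 1 << 6:
        return struct.pack("!B", value)
    if value < 1 << 14:
        return struct.pack("!H", value | 0x4000)
    if value < 1 << 30:
        return struct.pack("!I", value | 0x8000_0000)
    if value < 1 << 62:
        return struct.pack("!Q", value | 0xC000_0000_0000_0000)
    raise ValueError("value does not fit in 62 bits")

def _address_tail(ip, port):
    """Return (ip_version, packed address + 16-bit port)."""
    addr = ipaddress.ip_address(ip)
    return addr.version, addr.packed + struct.pack("!H", port)

def encode_add_address(seq, ip, port):
    version, tail = _address_tail(ip, port)
    frame_type = ADD_ADDRESS_V4 if version == 4 else ADD_ADDRESS_V6
    return encode_varint(frame_type) + encode_varint(seq) + tail

def encode_punch_me_now(round_no, paired_seq, ip, port):
    version, tail = _address_tail(ip, port)
    frame_type = PUNCH_ME_NOW_V4 if version == 4 else PUNCH_ME_NOW_V6
    return (encode_varint(frame_type) + encode_varint(round_no)
            + encode_varint(paired_seq) + tail)

print(encode_add_address(1, "192.0.2.1", 4433).hex())
print(encode_punch_me_now(0, 1, "198.51.100.7", 50000).hex())
```

Because the frame types are larger than 14 bits, each one costs four bytes on the wire, which is typical for extension codepoints still in the experimental range.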

Why this is architecturally different from ICE

Traditional ICE requires a signalling channel entirely separate from the media/data transport: SDP offer/answer, typically over a WebSocket or SIP channel, coordinating STUN checks that happen independently.

This draft collapses signalling and traversal into the same QUIC connection. If the two nodes already have a proxied QUIC connection to the relay (e.g., via CONNECT-UDP-LISTEN), they can start exchanging application data over the relay immediately, then upgrade to a direct path via connection migration once a punched route is validated, with the application never seeing an interruption.

There's also a real security tradeoff called out explicitly in the draft: extending path validation to the server side means a malicious client can direct the server to send validation traffic toward a third-party target IP, structurally similar to the amplification risk that connection-establishment address validation was designed to prevent. The draft's answer is rate limiting on unverified paths, though the amplification mitigation section is still marked as a TODO in the current revision.

What the Measurement Data Actually Shows

The IETF draft is a mechanism proposal. A 2024 paper, "Implementing NAT Hole Punching with QUIC" (Liang, Xu, Wang, Yang, Zhang), independently measured whether the theoretical advantage holds up.

Experimental setup

Two Docker-simulated LANs, each behind a NAT enforcing endpoint-independent mapping via iptables SNAT rules, are connected through a relay server. The team used Linux netem and Traffic Control (TC) to inject controlled RTT (20/100/200ms) and packet loss (0%/1%/1.5%/2%) across 12 combinations, running 100 trials each.

The RTT math

Ideal-case hole-punching time, derived analytically:

QUIC: 1 RTT (address exchange via relay) + 1 RTT (QUIC handshake) = 2 RTTs
TCP: 1 RTT (address exchange) + 1.5 RTTs (TCP three-way handshake) = 2.5 RTTs

Accounting for the race condition where one side's first punch packet arrives at the peer's NAT before the peer's own punch packet creates the matching session entry (forcing a discard-and-retry), the realistic range extends to:

QUIC: 2–2.5 RTTs
TCP: 2.5–3 RTTs
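The analytic model above is simple enough to turn into a calculator. The constants below restate the numbers in the text; the 0.5 RTT race penalty is inferred from the gap between the ideal case and the upper end of the stated range, so treat it as a reading of the paper's figures rather than a quoted parameter.

```python
RELAY_EXCHANGE_RTTS = 1.0                    # address exchange via the relay
HANDSHAKE_RTTS = {"quic": 1.0, "tcp": 1.5}   # transport handshake cost
RACE_PENALTY_RTTS = 0.5                      # inferred: one discarded punch packet

def punch_time_range_ms(transport, rtt_ms):
    """Return (ideal, worst-case-with-race) hole punching time in milliseconds."""
    ideal = (RELAY_EXCHANGE_RTTS + HANDSHAKE_RTTS[transport]) * rtt_ms
    return ideal, ideal + RACE_PENALTY_RTTS * rtt_ms

for rtt in (20, 100, 200):
    print(rtt, "ms RTT:",
          "QUIC", punch_time_range_ms("quic", rtt),
          "TCP", punch_time_range_ms("tcp", rtt))
```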

Measured results (0% packet loss)

RTT	QUIC (measured)	TCP (measured)
20ms	~55ms	~56ms
100ms	~213ms	~256ms
200ms	~416ms	~505ms

At low RTT, QUIC's advantage is marginal; the paper attributes this to QUIC's user-space processing overhead eating into its theoretical edge. At higher RTT, the gap widens meaningfully: QUIC completes roughly 15–18% faster than TCP as latency increases.
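The relative advantage can be recomputed directly from the table. The measurements are approximate values as reported above, so the percentages are approximate too.

```python
# RTT (ms) -> (QUIC measured ms, TCP measured ms), copied from the table above.
MEASURED_MS = {20: (55, 56), 100: (213, 256), 200: (416, 505)}

def quic_speedup_pct(quic_ms, tcp_ms):
    """How much faster QUIC completed, as a percentage of TCP's time."""
    return (tcp_ms - quic_ms) / tcp_ms * 100

for rtt, (quic_ms, tcp_ms) in MEASURED_MS.items():
    print(f"{rtt} ms RTT: QUIC is {quic_speedup_pct(quic_ms, tcp_ms):.1f}% faster")
```

At 20 ms the difference is under 2%, while at 100 ms and 200 ms it is about 17%.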

The packet loss finding that matters more

Under 1% packet loss, the gap becomes dramatic, not because of RTT math but because of retransmission timeout design:

QUIC's retransmission timer (per draft-ietf-quic-recovery): fixed at 200ms
TCP's RTO (per RFC 6298; the paper cites RFC 6289, likely a typo for 6298): minimum 1 second, calculated from smoothed RTT and variance

At the RTT/loss levels tested (max 200ms RTT), TCP's RTO floor of 1 second dominates: a single lost packet during hole punching costs TCP roughly 5x what it costs QUIC. This is the paper's most citable finding: QUIC's advantage in hole punching isn't primarily about handshake efficiency; it's about how much cheaper packet loss is during the punch itself.
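The 5x figure is just the ratio of the two timers, but it helps to see what it does to total punch time. The model below is a simplification of my own, not the paper's: it assumes a single lost packet adds exactly one timer expiry on top of the ideal-case RTT count.

```python
QUIC_TIMER_MS = 200      # retransmission timer cited above
TCP_MIN_RTO_MS = 1000    # RTO floor cited above
IDEAL_RTTS = {"quic": 2.0, "tcp": 2.5}
LOSS_PENALTY_MS = {"quic": QUIC_TIMER_MS, "tcp": TCP_MIN_RTO_MS}

def time_with_one_loss_ms(transport, rtt_ms):
    """Ideal punch time plus one retransmission wait (simplified model)."""
    return IDEAL_RTTS[transport] * rtt_ms + LOSS_PENALTY_MS[transport]

print("penalty ratio:", TCP_MIN_RTO_MS / QUIC_TIMER_MS)
for rtt in (20, 100, 200):
    print(rtt, "ms RTT:",
          "QUIC", time_with_one_loss_ms("quic", rtt),
          "TCP", time_with_one_loss_ms("tcp", rtt))
```

Under this model the timer, not the handshake, dominates TCP's total at every RTT tested.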

Bandwidth, across unlimited/10Gbps/100Mbps/1Mbps conditions, showed negligible effect on hole punching time, unsurprising, since punching exchanges small control packets, not bulk data.

Connection Migration as a Recovery Mechanism

The second half of the measurement paper addresses a problem the IETF draft doesn't fully resolve: what happens when a successfully punched connection breaks because a client switches from WiFi to cellular, a NAT session times out, or a device roams to a new network.

Why you can't just reconnect directly

If A's address changes, A cannot simply send a connection migration probe directly to B. NAT-B has no session entry for A's new address, so the probe gets dropped. The relay server has to be involved again to re-establish the path, but how it's involved determines the cost.

Option 1: QUIC connection migration

1. A → S: "My address changed; tell B, and ask B to send me data"
2. S → B: forwards A's new public address
3. B → A: sends data (dropped by NAT-A, no session entry yet,
           but this creates a NAT-B session entry for B→A traffic)
4. A → B: sends a connection migration request
           (now passes through NAT-B, since step 3 created the matching entry)

Total: T_migrate = T(A→S) + T(S→B) + T(B→A) + T(A→B)

Because this reuses the existing QUIC connection (QUIC identifies a connection by Connection ID, not by the IP:port tuple), there is no new handshake and no new TLS negotiation. The connection itself persists across the address change, and PATH_CHALLENGE/PATH_RESPONSE frames confirm the new path is live before the application-layer connection formally migrates to it.

Option 2: Re-punching from scratch (QUIC or TCP)

1. A ↔ S: re-establish an entirely new QUIC/TCP connection
2. A → S: request S forward A's new address to B, ask B to send data
3. S → B: forwards A's address
4. B → A: sends data (same drop-then-record pattern as above)
5. A → B: new connection request (passes through via the fresh NAT-B entry)
6. Connection established, data flows

Total: T_re-punch = T(A↔S handshake) + T(A→S) + T(S→B) + T(B→A) + T(A→B handshake) + T(A→B)

The delta

Δt = T_re-punch - T_migrate = T(A↔S handshake) + T(A→B handshake)

The paper's result: connection migration saves 2 RTTs versus QUIC re-punching and 3 RTTs versus TCP re-punching because it eliminates both the A-S and A-B handshake steps entirely. The connection literally never tore down at the protocol level; only the underlying path changed.
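The saving follows directly from the two totals. The sketch below assumes every one-way leg costs half an RTT and every link has the same latency, which is a simplification for illustration; the handshake costs are the same 1 RTT (QUIC) and 1.5 RTTs (TCP) used in the hole punching analysis.

```python
HANDSHAKE_RTTS = {"quic": 1.0, "tcp": 1.5}
ONE_WAY_RTTS = 0.5  # assumption: each one-way leg is half an RTT

def recovery_times_rtts(transport):
    """Return (T_migrate, T_re-punch) in RTTs."""
    t_migrate = 4 * ONE_WAY_RTTS                 # A→S, S→B, B→A, A→B
    handshakes = 2 * HANDSHAKE_RTTS[transport]   # A↔S handshake + A→B handshake
    return t_migrate, t_migrate + handshakes

def migration_saving_rtts(transport):
    t_migrate, t_repunch = recovery_times_rtts(transport)
    return t_repunch - t_migrate

print("vs QUIC re-punch:", migration_saving_rtts("quic"), "RTTs saved")
print("vs TCP re-punch:", migration_saving_rtts("tcp"), "RTTs saved")
```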

For anything running on mobile networks, where WiFi-to-cellular handoffs are routine rather than exceptional, this is the difference between a visible reconnection stall and a seamless path switch that the application layer never observes.

What This Means Architecturally

For relay-based P2P tools: if your relay-and-punch architecture is built on TCP today, the migration path to QUIC isn't just "swap the transport"; it changes what's possible. Port multiplexing code disappears. TLS setup collapses into the transport handshake. And connection migration gives you a recovery primitive that TCP structurally cannot offer without an application-layer reconnection protocol bolted on top.

For observability tooling watching this traffic: QUIC's traversal packets look different at the kernel/network level than TCP's. Path validation probes on previously unseen 5-tuples, PATH_CHALLENGE/PATH_RESPONSE exchanges, and Connection-ID-based flow correlation (rather than 5-tuple correlation) all mean that eBPF-based flow tracking built assuming TCP semantics will misattribute or fragment what is, at the QUIC layer, a single continuous connection surviving multiple address changes.

The honest caveat: none of this works if you're behind a symmetric (address/port-dependent) NAT, which the measurement paper explicitly excludes from its test environment because "testing revealed NAT devices in our current real network follow this rule," i.e., symmetric NAT is what they actually encountered, and they had to force endpoint-independent mapping via iptables to get clean results. That's worth sitting with: the protocol-level advantages of QUIC hole punching are real, but they don't solve the hardest traversal cases; those still fall back to relay (TURN-equivalent) traffic regardless of transport.

Current Status

The draft (draft-seemann-quic-nat-traversal-01) expired in April 2024 per the IETF's standard six-month draft lifecycle, which is normal for active work-in-progress documents. It doesn't mean the effort was abandoned, but it does mean this is pre-standardisation, not an RFC. The IANA considerations and amplification-attack mitigation sections are explicitly marked incomplete in the current text. Implementations exist in research/prototype form (the measurement paper builds directly on quic-go), but there is no ratified RFC number yet, and production libraries implementing this exact frame set are not widely deployed as of this writing.

The libp2p ecosystem, where Seemann's co-author on the related DCUtR (Direct Connection Upgrade through Relay) work also operates, is the most likely first production consumer of this mechanism, given the direct lineage from libp2p's existing hole-punching service.

Key Takeaways

QUIC's advantage in hole punching is real but modest at the handshake level: 2 RTTs versus TCP's 2.5, narrowing further once you account for user-space processing overhead.

The bigger win is under packet loss, where QUIC's 200ms retransmission timer beats TCP's 1-second RTO floor by roughly 5x, and hole punching, happening over lossy, unestablished paths, is exactly the scenario where this matters most.

Connection migration is the more significant architectural advantage: 2–3 RTT savings on recovery, and more importantly, a recovery model that doesn't require tearing down and re-establishing application state.

Symmetric NAT is still the hard problem that this doesn't solve. No transport-level trick changes the fundamental traversal math when NAT mappings are destination-dependent.

This is pre-standard. Useful to understand and prototype against, not yet something to build production reliability guarantees on top of.

RISC-V Vector Extension (RVV): SIMD for the Open ISA

Syed Mannan Saood — Tue, 26 May 2026 09:18:19 +0000

TL;DR: RISC-V’s Vector Extension (RVV) brings length-agnostic SIMD to the open ISA. Unlike x86’s fixed-width AVX or ARM’s NEON, RVV uses a variable-length vector model where software writes to abstract vector registers, and hardware executes with any physical width. This enables code portability across implementations—from tiny embedded cores to massive supercomputers—without recompilation. RVV 1.0 is ratified, shipping in real silicon, and positioned to dominate edge AI, HPC, and custom accelerators.

The SIMD Landscape Problem

Modern processors need SIMD (Single Instruction Multiple Data) for performance. Processing one data element per instruction is too slow for:

Image/video processing
Machine learning inference
Scientific computing
Signal processing
Compression/encryption

Every major architecture has SIMD extensions:

x86: SSE → AVX → AVX-512 (128-bit → 256-bit → 512-bit)
ARM: NEON (128-bit) → SVE/SVE2 (variable, 128-2048 bits)
RISC-V: RVV (variable, vector-length agnostic)

But there’s a fundamental problem with how x86 and early ARM approached this.

The x86 SIMD Evolution Disaster

The Compatibility Nightmare

x86’s SIMD history:

1999: SSE (128-bit, 4 × FP32)
      __m128 vec = _mm_add_ps(a, b);

2011: AVX (256-bit, 8 × FP32)  
      __m256 vec = _mm256_add_ps(a, b);  // New instruction!

2017: AVX-512 (512-bit, 16 × FP32)
      __m512 vec = _mm512_add_ps(a, b);  // Yet another instruction!

The problem: Each generation requires completely new instructions.

Code compiled for AVX-512:

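#include <immintrin.h>

// Squares each element in place. Assumes n is a multiple of 16 (no tail handling).
void process_avx512(float* data, int n) {
    for (int i = 0; i < n; i += 16) {
        __m512 vec = _mm512_loadu_ps(&data[i]);   // Load 16 floats
        vec = _mm512_mul_ps(vec, vec);            // Square them
        _mm512_storeu_ps(&data[i], vec);          // Store 16 floats
    }
}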

Won’t run on AVX2 processors. Different width = different code.

Result:

Libraries ship multiple code paths (SSE, AVX, AVX-512)
Runtime detection needed (CPUID checks)
Binary bloat (3-4× code size)
Maintenance nightmare

Production example (FFmpeg):

// FFmpeg-style dispatch pattern (real AV_CPU_FLAG_* names; ff_process_* functions are illustrative)
if (cpu_flags & AV_CPU_FLAG_AVX512) {
    ff_process_avx512(data, n);
} else if (cpu_flags & AV_CPU_FLAG_AVX2) {
    ff_process_avx2(data, n);
} else if (cpu_flags & AV_CPU_FLAG_SSE4) {
    ff_process_sse4(data, n);
} else {
    ff_process_scalar(data, n);
}

Every function duplicated 4 times!
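In practice this kind of dispatch is usually resolved once at startup into a function pointer instead of being re-checked on every call. Here is a minimal sketch of that idea in plain C. The feature bits (FEAT_*) and the square_* functions are hypothetical names invented for this example, not FFmpeg's API, and every variant just calls the scalar body so the sketch runs anywhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature bits for this sketch (not FFmpeg's real values). */
enum { FEAT_SSE4 = 1u << 0, FEAT_AVX2 = 1u << 1, FEAT_AVX512 = 1u << 2 };

typedef void (*square_fn)(float *data, size_t n);

void square_scalar(float *data, size_t n) {
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i++) data[i] *= data[i];
}

/* Stand-ins: a real library would hold a hand-tuned body in each of these. */
void square_sse4(float *data, size_t n)   { square_scalar(data, n); }
void square_avx2(float *data, size_t n)   { square_scalar(data, n); }
void square_avx512(float *data, size_t n) { square_scalar(data, n); }

/* Pick the widest variant the CPU claims to support. */
square_fn select_square(unsigned cpu_flags) {
    if (cpu_flags & FEAT_AVX512) return square_avx512;
    if (cpu_flags & FEAT_AVX2)   return square_avx2;
    if (cpu_flags & FEAT_SSE4)   return square_sse4;
    return square_scalar;
}
```

The cost the article describes is still there: four bodies to write, test, and ship. The pointer only moves the branching out of the hot path.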

The Market Fragmentation

x86 processors in 2025:

Low-power laptops: 128-bit SIMD only
Desktop CPUs: 256-bit AVX2
High-end servers: 512-bit AVX-512
Some servers: AVX-512 disabled (heat/cost)

Your optimized AVX-512 code? Runs on <20% of x86 CPUs.

ARM SVE: The Right Idea, Complex Execution

ARM learned from x86’s mistakes with Scalable Vector Extension (SVE).

SVE’s Variable-Length Model

// SVE code - vector length agnostic!
svfloat32_t vec = svld1_f32(pg, &data[i]);
vec = svmul_f32_z(pg, vec, vec);
svst1_f32(pg, &data[i], vec);

Key innovation: Same code runs on 128-bit, 256-bit, 512-bit, or 2048-bit hardware.

How: Predication and variable-length registers.

But SVE Has Issues

Complexity:

Complex predicate registers
Steep learning curve
Limited compiler support initially
ARM-specific (vendor lock-in)

Adoption:

Fujitsu A64FX (HPC): 512-bit SVE
AWS Graviton3: 256-bit SVE
Consumer ARM: Still mostly NEON

Market fragmentation: Different ARM vendors choose different widths.

RISC-V’s Solution: RVV

RISC-V Vector Extension takes SVE’s length-agnostic concept and simplifies it.

Core Philosophy

Write once, run anywhere—regardless of hardware vector width.

Software writes:     Hardware executes:
┌──────────────┐    ┌──────────────┐
│ vadd.vv v1,  │    │ 128-bit impl │
│   v2, v3     │ → │ 256-bit impl │
│              │    │ 512-bit impl │
└──────────────┘    │ 1024-bit impl│
                    └──────────────┘

All execute the same binary. No recompilation needed.

Vector Register Model

32 vector registers: v0-v31

Key concept: Each register has a logical length independent of physical width.

Logical view (programmer sees):
v1 = [0, 1, 2, 3, ..., VL-1]  (VL = vector length)

Physical implementations:
128-bit: Processes 4 FP32 per cycle
256-bit: Processes 8 FP32 per cycle  
512-bit: Processes 16 FP32 per cycle

Same instruction, different throughput.

Application Vector Length (AVL)

The key abstraction:

# Request to process 100 elements
li a0, 100           # Application vector length (AVL)
vsetvli t0, a0, e32, m1, ta, ma  # Set vector length, element width = 32 bits

# t0 now contains actual VL (hardware-dependent)
# On 128-bit: VL = 4 (processes 4 × FP32)
# On 512-bit: VL = 16 (processes 16 × FP32)

Loop automatically adapts:

process_loop:
    vsetvli t0, a0, e32, m1, ta, ma  # Get VL for remaining elements
    vle32.v v1, (a1)        # Load VL elements
    vadd.vv v1, v1, v2      # Add VL elements
    vse32.v v1, (a1)        # Store VL elements

    sub a0, a0, t0          # Remaining -= VL
    slli t1, t0, 2          # Advance pointer by VL*4 bytes
    add a1, a1, t1
    bnez a0, process_loop   # Loop if elements remain

Beautiful: Same code works on any vector width. Hardware fills VL appropriately.
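To see the strip-mining pattern without RISC-V hardware, here is a small scalar model in C. It assumes the simplest behaviour the hardware is allowed to have, vl = min(AVL, VLMAX). The names model_vsetvl and strip_mined_add are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Models vsetvli for this sketch: grant min(avl, vlmax) elements. */
size_t model_vsetvl(size_t avl, size_t vlmax) {
    return avl < vlmax ? avl : vlmax;
}

/* Adds b into a in strips of at most vlmax elements.
   Returns the number of strips (loop iterations) taken. */
size_t strip_mined_add(int *a, const int *b, size_t n, size_t vlmax) {
    assert(vlmax > 0);
    size_t strips = 0;
    while (n > 0) {
        size_t vl = model_vsetvl(n, vlmax);   /* like: vsetvli t0, a0, ... */
        for (size_t i = 0; i < vl; i++)       /* stands in for one vector op */
            a[i] += b[i];
        a += vl;                              /* advance pointers by VL */
        b += vl;
        n -= vl;                              /* remaining -= VL */
        strips++;
    }
    return strips;
}
```

With 100 elements, a vlmax of 4 takes 25 strips and a vlmax of 16 takes 7, while the results in memory are identical. That is the whole portability argument in miniature.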

RVV Architecture Deep-Dive

Vector Configuration (vsetvl)

Three parameters control vector execution:

vsetvli rd, rs1, vtypei

rd:  Destination (receives actual VL)
rs1: Application vector length (AVL)
vtypei: Vector type (element width, LMUL)

vtypei encoding:

Bits (low to high): vlmul[2:0] | vsew[5:3] | vta[6] | vma[7]

vsew: Element width
  e8:  8-bit elements
  e16: 16-bit elements
  e32: 32-bit elements
  e64: 64-bit elements

vlmul: Logical register grouping
  m1: Use 1 register
  m2: Use 2 registers as one (2× capacity)
  m4: Use 4 registers
  m8: Use 8 registers

vta: Tail agnostic (don't care about tail elements)
vma: Mask agnostic (don't care about masked elements)

Example:

vsetvli t0, a0, e32, m1, ta, ma
#              │   │   │   │   └─ Mask agnostic
#              │   │   │   └───── Tail agnostic  
#              │   │   └───────── LMUL = 1 register
#              │   └───────────── Element size = 32 bits
#              └───────────────── AVL from a0
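The fields pack into the low byte of vtype: vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7. A short C sketch of that packing, with field values taken from the RVV 1.0 encoding (encode_vtype and the enum names are invented for this example):

```c
#include <assert.h>
#include <stdint.h>

/* Field encodings from the RVV 1.0 vtype layout. */
enum vsew  { E8 = 0, E16 = 1, E32 = 2, E64 = 3 };
enum vlmul { M1 = 0, M2 = 1, M4 = 2, M8 = 3, MF8 = 5, MF4 = 6, MF2 = 7 };

/* Packs the four fields into the low 8 bits of vtype. */
uint32_t encode_vtype(enum vlmul lmul, enum vsew sew, int ta, int ma) {
    assert(ta == 0 || ta == 1);
    assert(ma == 0 || ma == 1);
    return ((uint32_t)lmul & 7u)            /* bits 2:0 */
         | (((uint32_t)sew & 7u) << 3)      /* bits 5:3 */
         | ((uint32_t)ta << 6)              /* bit 6    */
         | ((uint32_t)ma << 7);             /* bit 7    */
}
```

For `e32, m1, ta, ma` this yields 0xD0, which is the immediate an assembler would place in the vsetvli instruction.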

LMUL: Register Grouping

Problem: Processing wide data types or increasing throughput.

Solution: Group registers together.

LMUL=1 (m1):
v1 = single register

LMUL=2 (m2):  
v2 = {v2, v3} grouped as one logical register (2× capacity)

LMUL=4 (m4):
v4 = {v4, v5, v6, v7} (4× capacity)

LMUL=8 (m8):
v8 = {v8, v9, ..., v15} (8× capacity)

Use case:

# Process 64-bit doubles, need more capacity
vsetvli t0, a0, e64, m2, ta, ma  # Use register pairs
vle64.v v2, (a1)                  # Loads into v2+v3
vfmul.vv v2, v2, v4               # Multiply (v2,v3) × (v4,v5)
vse64.v v2, (a1)                  # Store from v2+v3

Trade-off: More capacity, fewer independent vectors.

Fractional LMUL

For small element widths:

LMUL=1/2 (mf2): Use half a register
LMUL=1/4 (mf4): Use quarter register  
LMUL=1/8 (mf8): Use eighth register

Use case:

# Process 8-bit pixels efficiently
vsetvli t0, a0, e8, mf2, ta, ma  # 8-bit elements, half register
vle8.v v1, (a1)                   # Load pixels
vadd.vi v1, v1, 5                 # Add constant
vse8.v v1, (a1)                   # Store

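Benefit: In mixed-width code, narrow operands take only part of a register, so their widened results still fit in a single register and more of the 32 registers stay free.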
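Both integer and fractional LMUL feed the same formula: VLMAX = (VLEN / SEW) × LMUL. A tiny C helper makes the trade-off concrete. It takes LMUL as a fraction, and the function name vlmax is just a label chosen for this sketch.

```c
#include <assert.h>

/* VLMAX = (VLEN / SEW) * LMUL, with LMUL passed as lmul_num / lmul_den.
   Example: m2 is 2/1, mf4 is 1/4. */
unsigned vlmax(unsigned vlen_bits, unsigned sew_bits,
               unsigned lmul_num, unsigned lmul_den) {
    assert(sew_bits > 0 && lmul_den > 0);
    return (vlen_bits / sew_bits) * lmul_num / lmul_den;
}
```

On a 128-bit implementation, `e32, m1` gives 4 elements per instruction and `e32, m8` gives 32, but m8 leaves only four usable register groups. On a 256-bit implementation, `e8, mf2` gives 16 elements.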

Vector Instruction Categories

1. Configuration

vsetvli rd, rs1, vtypei    # Set VL by AVL
vsetivli rd, uimm, vtypei  # Set VL by immediate
vsetvl rd, rs1, rs2        # Set VL, type from register

2. Load/Store

Unit-stride (contiguous):

vle32.v v1, (a0)     # Load 32-bit elements
vse32.v v1, (a0)     # Store 32-bit elements

Strided (fixed stride):

vlse32.v v1, (a0), a1  # Load with stride a1
vsse32.v v1, (a0), a1  # Store with stride a1

Indexed (gather/scatter):

vluxei32.v v1, (a0), v2  # Load indexed by v2 (unordered; vloxei32.v is the ordered form)
vsuxei32.v v1, (a0), v2  # Store indexed by v2 (unordered; vsoxei32.v is the ordered form)

Segment (array-of-structures in memory, split into one register per field):

vlseg3e32.v v1, (a0)  # Load 3-element structures
                      # v1 = {x0, x1, x2, ...}
                      # v2 = {y0, y1, y2, ...}
                      # v3 = {z0, z1, z2, ...}
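What the segment load does is easiest to see as a scalar loop. The C model below mirrors a 3-field segment load of 32-bit elements: memory holds x, y, z triples back to back, and each field lands in its own destination array. The function name deinterleave3 is invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 3-field segment load: mem holds {x0,y0,z0,x1,y1,z1,...}
   and each field is written to its own output array of vl elements. */
void deinterleave3(const uint32_t *mem, size_t vl,
                   uint32_t *x, uint32_t *y, uint32_t *z) {
    assert(mem && x && y && z);
    for (size_t i = 0; i < vl; i++) {
        x[i] = mem[3 * i + 0];
        y[i] = mem[3 * i + 1];
        z[i] = mem[3 * i + 2];
    }
}
```

The hardware instruction does this in one go, which is why it fits interleaved formats such as packed RGB pixels or xyz vertex data.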

3. Arithmetic

Integer:

vadd.vv v1, v2, v3     # Vector + vector
vadd.vx v1, v2, a0     # Vector + scalar
vadd.vi v1, v2, 5      # Vector + immediate
vsub.vv v1, v2, v3     # Subtract
vmul.vv v1, v2, v3     # Multiply
vdiv.vv v1, v2, v3     # Divide

Floating-point:

vfadd.vv v1, v2, v3    # FP add
vfmul.vv v1, v2, v3    # FP multiply
vfmacc.vv v1, v2, v3   # FP fused multiply-accumulate: v1 = v1 + v2*v3
vfmadd.vv v1, v2, v3   # FP fused multiply-add: v1 = v1*v2 + v3
vfdiv.vv v1, v2, v3    # FP divide
vfsqrt.v v1, v2        # FP square root

Widening operations:

vwmul.vv v2, v1, v3    # Multiply e32 → e64
                       # v1,v3 are 32-bit
                       # v2 is 64-bit result

4. Logical/Shift

vand.vv v1, v2, v3     # Bitwise AND
vor.vv v1, v2, v3      # Bitwise OR
vxor.vv v1, v2, v3     # Bitwise XOR
vsll.vv v1, v2, v3     # Shift left logical
vsra.vv v1, v2, v3     # Shift right arithmetic

5. Comparison & Masking

vmseq.vv v0, v1, v2    # Set mask: v1 == v2
vmslt.vv v0, v1, v2    # Set mask: v1 < v2
vmsle.vv v0, v1, v2    # Set mask: v1 <= v2

# Use mask in operations
vadd.vv v3, v1, v2, v0.t  # Add only where mask is true
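A scalar model of the compare-then-masked-add pair, as a rough sketch. Two simplifications are assumed: the mask is stored as one byte per element (the hardware keeps one bit per element in v0), and inactive lanes keep their old value, which corresponds to the mask-undisturbed policy. Under `ma` the hardware is also allowed to overwrite them. The function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models vmslt.vv: mask[i] = 1 where a[i] < b[i] (signed compare). */
void model_vmslt(uint8_t *mask, const int32_t *a, const int32_t *b, size_t vl) {
    assert(mask && a && b);
    for (size_t i = 0; i < vl; i++)
        mask[i] = (uint8_t)(a[i] < b[i]);
}

/* Models vadd.vv vd, vs2, vs1, v0.t with inactive lanes left undisturbed. */
void model_masked_add(int32_t *vd, const int32_t *vs2, const int32_t *vs1,
                      const uint8_t *mask, size_t vl) {
    assert(vd && vs2 && vs1 && mask);
    for (size_t i = 0; i < vl; i++) {
        if (mask[i]) {
            /* Wrap like hardware instead of overflowing a signed int. */
            vd[i] = (int32_t)((uint32_t)vs2[i] + (uint32_t)vs1[i]);
        }
    }
}
```

This is how a vectorized loop handles an `if` inside the body: compute the condition for every lane, then let the mask decide which lanes commit a result.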

6. Permutations

vslideup.vi v1, v2, 5   # Slide up by 5 positions
vslidedown.vi v1, v2, 3 # Slide down by 3 positions
vrgather.vv v1, v2, v3  # Gather elements by index

7. Reductions

vredsum.vs v3, v1, v2   # Sum reduction
                        # v3[0] = v2[0] + sum(v1)
vredmax.vs v3, v1, v2   # Max reduction
vredmin.vs v3, v1, v2   # Min reduction

Code Examples

Example 1: SAXPY (y = a*x + y)

C code:

void saxpy(float a, float* x, float* y, int n) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

RISC-V RVV assembly:

saxpy:                                    # fa0 = a, a0 = x, a1 = y, a2 = n
loop:
    vsetvli t0, a2, e32, m1, ta, ma      # VL = min(AVL, VLMAX)
    vle32.v v0, (a0)                      # Load x[i:i+VL]
    vle32.v v1, (a1)                      # Load y[i:i+VL]
    vfmacc.vf v1, fa0, v0                 # v1 = v1 + a * v0
    vse32.v v1, (a1)                      # Store y[i:i+VL]

    sub a2, a2, t0                        # Remaining -= VL
    slli t1, t0, 2                        # Offset = VL * 4 bytes
    add a0, a0, t1                        # x += offset
    add a1, a1, t1                        # y += offset
    bnez a2, loop                         # Loop if remaining > 0

    ret

Portable: Works on 128-bit, 256-bit, 512-bit, 1024-bit implementations.

Example 2: Dot Product

C code:

float dot_product(float* a, float* b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

RVV assembly:

dot_product:                              # a0 = a, a1 = b, a2 = n
    vsetvli t0, zero, e32, m1, ta, ma     # VL = VLMAX so every lane gets zeroed
    vmv.v.i v2, 0                         # v2 = accumulator = 0

loop:
    vsetvli t0, a2, e32, m1, tu, ma       # tu: keep lanes past VL on the last pass
    vle32.v v0, (a0)                      # Load a[i:i+VL]
    vle32.v v1, (a1)                      # Load b[i:i+VL]
    vfmacc.vv v2, v0, v1                  # v2 += v0 * v1

    sub a2, a2, t0
    slli t1, t0, 2
    add a0, a0, t1
    add a1, a1, t1
    bnez a2, loop

    # Reduce v2 to scalar across all VLMAX lanes
    vsetvli t0, zero, e32, m1, ta, ma     # VL = VLMAX again
    vmv.s.x v3, zero                      # v3[0] = 0.0 (all-zero bits)
    vfredusum.vs v3, v2, v3               # v3[0] = sum(v2)
    vfmv.f.s fa0, v3                      # Return in fa0

    ret
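The structure of this routine (one partial sum per lane, then a single reduction at the end) can be checked with a scalar model. The sketch below assumes vl = min(remaining, vlmax) and caps the lane count at a made-up MAX_LANES. The point to notice is that a short final strip must leave the upper lanes alone, and the final reduction must cover every lane, not just the last strip's.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LANES 64   /* upper bound on lanes, chosen for this sketch */

/* Scalar model of a strip-mined dot product with per-lane accumulators. */
float dot_model(const float *a, const float *b, size_t n, size_t vlmax) {
    assert(vlmax > 0 && vlmax <= MAX_LANES);
    float acc[MAX_LANES] = {0};              /* one partial sum per lane */

    while (n > 0) {
        size_t vl = n < vlmax ? n : vlmax;
        for (size_t i = 0; i < vl; i++)      /* lanes >= vl stay untouched */
            acc[i] += a[i] * b[i];
        a += vl;
        b += vl;
        n -= vl;
    }

    float sum = 0.0f;
    for (size_t i = 0; i < vlmax; i++)       /* reduce across all lanes */
        sum += acc[i];
    return sum;
}
```

Because the lanes are summed in a different order than the plain C loop, results can differ in the last bits for general floating-point data. The unordered reduction instruction makes the same trade.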

Example 3: RGB to Grayscale

C code:

void rgb_to_gray(uint8_t* rgb, uint8_t* gray, int pixels) {
    for (int i = 0; i < pixels; i++) {
        uint8_t r = rgb[i*3 + 0];
        uint8_t g = rgb[i*3 + 1];
        uint8_t b = rgb[i*3 + 2];
        gray[i] = (r * 77 + g * 150 + b * 29) >> 8;
    }
}

RVV assembly (simplified):

rgb_to_gray:                   # a0 = rgb, a1 = gray, a2 = pixels
    li t3, 77                  # .vx forms take a scalar register, not an immediate
    li t4, 150
    li t5, 29
    li t1, 3                   # Bytes per RGB pixel

loop:
    vsetvli t0, a2, e8, m1, ta, ma
    vlseg3e8.v v0, (a0)       # Load R,G,B into v0,v1,v2
                               # v0 = {r0, r1, r2, ...}
                               # v1 = {g0, g1, g2, ...}
                               # v2 = {b0, b1, b2, ...}

    # Widen to 16-bit for multiplication (result occupies v4 and v5)
    vwmulu.vx v4, v0, t3      # v4 = r * 77 (16-bit)
    vwmaccu.vx v4, t4, v1     # v4 += g * 150
    vwmaccu.vx v4, t5, v2     # v4 += b * 29

    # Shift right by 8, narrow to 8-bit
    vnsrl.wi v3, v4, 8        # v3 = v4 >> 8 (narrow to 8-bit)

    vse8.v v3, (a1)           # Store grayscale

    sub a2, a2, t0
    mul t2, t0, t1            # RGB offset = VL * 3
    add a0, a0, t2
    add a1, a1, t0
    bnez a2, loop

    ret
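Why is widening to 16 bits enough here? The three weights sum to 256, so the largest possible accumulator value is 255 × 256 = 65280, which still fits in an unsigned 16-bit lane. A one-pixel C reference makes that bound easy to test (gray_pixel is a helper name invented for this sketch):

```c
#include <assert.h>
#include <stdint.h>

/* One-pixel reference for the fixed-point weights 77/150/29 (sum = 256). */
uint8_t gray_pixel(uint8_t r, uint8_t g, uint8_t b) {
    unsigned wide = r * 77u + g * 150u + b * 29u;
    assert(wide <= 65280u);            /* 255 * 256: fits in 16 bits */
    uint16_t acc = (uint16_t)wide;     /* what the 16-bit vector lane holds */
    return (uint8_t)(acc >> 8);        /* same as the narrowing shift */
}
```

Pure white maps to 255 and pure red to 76, so a reference like this is a handy oracle when testing the vector version against random pixels.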

Compiler Support

GCC Intrinsics

RVV intrinsics follow a pattern:

#include <riscv_vector.h>

// Naming: __riscv_<op>_<operand form>_<element type><LMUL>
vfloat32m1_t __riscv_vfadd_vv_f32m1(vfloat32m1_t vs2,
                                    vfloat32m1_t vs1,
                                    size_t vl);

Example: SAXPY

void saxpy_rvv(float a, float* x, float* y, size_t n) {
    size_t vl;
    for (size_t i = 0; i < n; i += vl) {
        vl = __riscv_vsetvl_e32m1(n - i);  // Set VL
        vfloat32m1_t vx = __riscv_vle32_v_f32m1(x + i, vl);  // Load x
        vfloat32m1_t vy = __riscv_vle32_v_f32m1(y + i, vl);  // Load y
        vy = __riscv_vfmacc_vf_f32m1(vy, a, vx, vl);          // y += a*x
        __riscv_vse32_v_f32m1(y + i, vy, vl);                  // Store y
    }
}
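If the same source file also has to build for non-RISC-V targets, or for RISC-V without the V extension, one option is to guard the intrinsics behind the `__riscv_vector` macro and keep a scalar fallback. This is a sketch under that assumption. It uses the `__riscv_`-prefixed intrinsic names from the ratified intrinsics API, and saxpy_portable is a name chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

/* y[i] = a * x[i] + y[i], vectorized when the V extension is available. */
void saxpy_portable(float a, const float *x, float *y, size_t n) {
    assert(n == 0 || (x != NULL && y != NULL));
#if defined(__riscv_vector)
    for (size_t i = 0; i < n; ) {
        size_t vl = __riscv_vsetvl_e32m1(n - i);              /* elements this pass */
        vfloat32m1_t vx = __riscv_vle32_v_f32m1(x + i, vl);   /* load x */
        vfloat32m1_t vy = __riscv_vle32_v_f32m1(y + i, vl);   /* load y */
        vy = __riscv_vfmacc_vf_f32m1(vy, a, vx, vl);          /* vy += a * vx */
        __riscv_vse32_v_f32m1(y + i, vy, vl);                 /* store y */
        i += vl;
    }
#else
    for (size_t i = 0; i < n; i++)                            /* scalar fallback */
        y[i] = a * x[i] + y[i];
#endif
}
```

On an x86 or ARM host this compiles to the scalar loop, so unit tests can run anywhere and the vector path gets exercised under QEMU or on real silicon.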

Auto-Vectorization

Modern compilers can auto-vectorize:

void add_arrays(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

GCC with -march=rv64gcv -O3:

Generates RVV vector instructions automatically!

Works best with:

Simple loops
No loop-carried dependencies
Aligned data
Hint with pragmas if needed
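One hint that needs no pragma is the C99 `restrict` qualifier. Without it, the compiler has to allow for c overlapping a or b, and may add runtime overlap checks or skip vectorizing. A sketch of the same loop with that promise made explicit (add_arrays_restrict is a name chosen for this example; whether a given compiler actually vectorizes it depends on version and flags):

```c
#include <assert.h>
#include <stddef.h>

/* restrict promises the three arrays do not overlap, which removes the
   aliasing question for the auto-vectorizer. */
void add_arrays_restrict(const float *restrict a, const float *restrict b,
                         float *restrict c, size_t n) {
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

The promise is on the caller: passing overlapping arrays to a restrict-qualified function is undefined behaviour.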

Performance Analysis

Theoretical Speedup

Scalar code (1 FP32/cycle):

1000 elements → 1000 cycles

128-bit RVV (4 FP32/cycle):

1000 elements → 250 cycles (4× speedup)

256-bit RVV (8 FP32/cycle):

1000 elements → 125 cycles (8× speedup)

512-bit RVV (16 FP32/cycle):

1000 elements → 63 cycles (16× speedup)

Same binary. Different hardware, different throughput.
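The arithmetic behind these numbers is just a ceiling division: cycles = ceil(elements / lanes). A small C helper reproduces the table. It models the idealized case only, with one full vector operation per cycle and no memory stalls, and ideal_cycles is a name invented for this sketch.

```c
#include <assert.h>

/* Idealized cycle count: one vector op per cycle, no stalls. */
unsigned long ideal_cycles(unsigned long elements, unsigned lanes) {
    assert(lanes > 0);
    return (elements + lanes - 1) / lanes;   /* ceil(elements / lanes) */
}
```

The 512-bit case shows why the last figure is 63 and not 62.5: the final strip handles only 8 of its 16 lanes but still costs a full pass.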

Real-World Benchmarks

Matrix multiplication (GEMM):

Implementation	Performance (GFLOPS)
Scalar C	0.8
RVV (128-bit)	3.2 (4× speedup)
RVV (256-bit)	6.4 (8× speedup)
RVV (512-bit)	12.8 (16× speedup)

Image convolution:

Filter Size	Scalar	RVV 128-bit	RVV 256-bit
3×3	45ms	12ms (3.75×)	6ms (7.5×)
5×5	120ms	32ms (3.75×)	16ms (7.5×)

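Close to theoretical speedup with good algorithm design. Treat these as idealized figures: real kernels often become memory-bound before they reach linear scaling.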

Hardware Implementations

Commercial Silicon (2025)

Alibaba T-Head:

XuanTie C910: 128-bit RVV 0.7.1
XuanTie C920: 128-bit RVV 0.7.1 (the later C920v2 revision moves to RVV 1.0)

SiFive:

P670: 256-bit RVV 1.0
X280: 512-bit RVV 1.0 (HPC-focused)

Andes:

AX65: 128-bit RVV 1.0

SpacemiT:

K1: 256-bit RVV 1.0 (8-core, consumer SBC)

VLEN (Vector Register Length)

Common implementations:

VLEN	FP32 Elements	Target Market
128-bit	4	Embedded, IoT
256-bit	8	General purpose, edge AI
512-bit	16	HPC, servers
1024-bit	32	Supercomputing

All run the same binaries.

RVV vs ARM SVE vs x86 AVX

Code Portability

RVV:

// One code path, works on all VLEN
vfloat32m1_t v = __riscv_vfadd_vv_f32m1(a, b, vl);

ARM SVE:

// One code path, works on all SVE lengths
svfloat32_t v = svadd_f32_z(pg, a, b);

x86 AVX:

// Different code per width
#ifdef __AVX512F__
    __m512 v = _mm512_add_ps(a, b);  // 512-bit
#elif __AVX2__
    __m256 v = _mm256_add_ps(a, b);  // 256-bit
#else
    __m128 v = _mm_add_ps(a, b);     // 128-bit
#endif

Winner: RVV and SVE (length-agnostic)

Simplicity

RVV:

Simple mask model (single mask register v0)
Straightforward vsetvl configuration
32 vector registers

SVE:

Complex predicate registers (p0-p15)
Governing predicates + first-fault loads
32 vector registers + 16 predicates

x86 AVX:

No length abstraction
Different instruction sets per width
Mask registers (AVX-512) add complexity

Winner: RVV (simpler model)

Ecosystem

x86 AVX:

Mature compiler support
Extensive libraries
Decades of optimization

ARM SVE:

Growing compiler support
ARM-specific (vendor lock)
Limited consumer hardware

RVV:

Compiler support improving rapidly
Open standard (no vendor lock-in)
Growing hardware ecosystem

Winner: x86 (today), RVV (trajectory)

Key Takeaways

1. Length-agnostic is the right model

One binary, any vector width
Future-proof code
Hardware flexibility

2. Simpler than ARM SVE

Easier to learn and use
Straightforward mask model
Good compiler target

3. Open standard advantage

No vendor lock-in
Custom extensions possible
Growing ecosystem

4. Not a drop-in x86 replacement (yet)

Ecosystem still maturing
Limited consumer hardware
But trajectory is strong

5. Ideal for specialized domains

Edge AI (custom VLEN for models)
HPC (large VLEN for throughput)
Embedded (small VLEN for power)

Getting Started with RVV

Emulation

QEMU:

# Run under QEMU user-mode emulation with RVV enabled and VLEN = 256 bits
qemu-riscv64 -cpu rv64,v=true,vlen=256 ./my_rvv_program

Spike (RISC-V ISA Simulator):

# pk is the RISC-V proxy kernel, which Spike needs to run a user-space ELF
spike --isa=rv64gcv pk ./my_rvv_program

Development Boards

SpacemiT K1:

8-core RISC-V
256-bit RVV 1.0
Linux support
~$100

SiFive HiFive Unmatched:

U74 cores (no vector extension)
Not usable for RVV work; that needs a board built on a vector-capable core such as the P670

Cross-Compilation

GCC toolchain:

riscv64-unknown-linux-gnu-gcc \
    -march=rv64gcv \
    -O3 \
    -o program \
    program.c

Intrinsics example:

#include <riscv_vector.h>

void vector_add(float* a, float* b, float* c, size_t n) {
    size_t vl;
    for (size_t i = 0; i < n; i += vl) {
        vl = __riscv_vsetvl_e32m1(n - i);                       // Elements this pass
        vfloat32m1_t va = __riscv_vle32_v_f32m1(&a[i], vl);     // Load a
        vfloat32m1_t vb = __riscv_vle32_v_f32m1(&b[i], vl);     // Load b
        vfloat32m1_t vc = __riscv_vfadd_vv_f32m1(va, vb, vl);   // c = a + b
        __riscv_vse32_v_f32m1(&c[i], vc, vl);                   // Store c
    }
}

Conclusion

RISC-V Vector Extension brings length-agnostic SIMD to the open ISA ecosystem. By learning from x86’s fixed-width mistakes and ARM SVE’s complexity, RVV offers:

Portable code across any vector width
Simpler programming model
Open standard flexibility
Growing hardware and software ecosystem

While still maturing compared to x86 AVX’s decades of optimization, RVV’s trajectory is strong. For edge AI, custom accelerators, and eventually general-purpose computing, RVV represents the future of portable high-performance vector processing.

The question isn’t if RISC-V vectors will be ubiquitous, but when.
