<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Blake Pelton</title>
    <description>The latest articles on DEV Community by Blake Pelton (@dangling_pointers_0bfce7ce6993).</description>
    <link>https://dev.to/dangling_pointers_0bfce7ce6993</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3529991%2F2f02bea6-42b4-43ee-9e0a-efa6195b45b7.png</url>
      <title>DEV Community: Blake Pelton</title>
      <link>https://dev.to/dangling_pointers_0bfce7ce6993</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dangling_pointers_0bfce7ce6993"/>
    <language>en</language>
    <item>
      <title>Gigaflow: Pipeline-Aware Sub-Traversal Caching for Modern SmartNICs</title>
      <dc:creator>Blake Pelton</dc:creator>
      <pubDate>Wed, 15 Oct 2025 12:01:25 +0000</pubDate>
      <link>https://dev.to/dangling_pointers_0bfce7ce6993/gigaflow-pipeline-aware-sub-traversal-caching-for-modern-smartnics-3al6</link>
      <guid>https://dev.to/dangling_pointers_0bfce7ce6993/gigaflow-pipeline-aware-sub-traversal-caching-for-modern-smartnics-3al6</guid>
      <description>&lt;p&gt;&lt;em&gt;This was originally posted on &lt;a href="//danglingpointers.substack.com"&gt;Dangling Pointers&lt;/a&gt;.  My goal is to help busy people stay current with recent academic developments.  Head there to subscribe for regular summaries of computer science research.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dl.acm.org/doi/10.1145/3676641.3716000" rel="noopener noreferrer"&gt;Gigaflow: Pipeline-Aware Sub-Traversal Caching for Modern SmartNICs&lt;/a&gt; Annus Zulfiqar, Ali Imran, Venkat Kunaparaju, Ben Pfaff, Gianni Antichi, and Muhammad Shahbaz &lt;em&gt;ASPLOS'25&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Virtual Switch
&lt;/h2&gt;

&lt;p&gt;A &lt;em&gt;virtual switch&lt;/em&gt; (vSwitch) routes network traffic to and from virtual machines. Section 2.1 of the paper describes the historical development of vSwitch technology, ending with a pipeline of &lt;em&gt;match-action tables&lt;/em&gt; (MATs). A match-action table is a data-driven way to configure a vSwitch, comprising matching rules and associated actions to take when a matching packet is encountered. When a packet arrives at the vSwitch, it traverses the full pipeline of match-action tables. At each pipeline stage, header fields from the packet are used to perform a lookup into a match-action table. If a match is found, then the packet is modified according to the actions found in the table.&lt;/p&gt;
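&lt;p&gt;To make the lookup process concrete, here is a toy Python sketch (my own, not from the paper; all field names and actions are invented) of a packet traversing a pipeline of match-action tables:&lt;/p&gt;

```python
# Toy model of a match-action pipeline. Each stage matches on a subset of
# header fields and, on a hit, applies the rule's action to the packet.

def make_stage(fields, rules):
    """One match-action table: match on `fields`, apply the matching action."""
    def stage(packet):
        key = tuple(packet.get(f) for f in fields)
        action = rules.get(key)
        if action is not None:
            action(packet)
        return packet
    return stage

# Stage 1 matches on the destination IP; stage 2 matches on the VLAN tag.
pipeline = [
    make_stage(("dst_ip",), {("10.0.0.2",): lambda p: p.update(out_port=2)}),
    make_stage(("vlan",), {(100,): lambda p: p.update(vlan=200)}),
]

packet = {"dst_ip": "10.0.0.2", "vlan": 100}
for stage in pipeline:  # every packet traverses the full pipeline
    packet = stage(packet)
```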

&lt;h2&gt;
  
  
  Megaflow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-pfaff.pdf" rel="noopener noreferrer"&gt;Megaflow&lt;/a&gt; is prior work which memoizes the full pipeline of MATs. The memoization data structure is treated like a cache. When a packet arrives, a cache lookup occurs. On a miss, the regular vSwitch implementation is called to transform the packet. Subsequent packets which hit in the cache avoid executing the vSwitch code entirely. Megaflow supports keys with wildcards, to allow one cache entry to serve multiple flows.&lt;/p&gt;

&lt;p&gt;A problem with Megaflow is that even with wildcards, a large cache is needed to achieve a high hit rate. For throughput reasons, one may wish to place a Megaflow cache in on-chip memory on a SmartNIC. However, if the SmartNIC does not have enough on-chip memory to achieve a high hit rate, then throughput suffers. See &lt;a href="https://danglingpointers.substack.com/p/scaling-ip-lookup-to-large-databases" rel="noopener noreferrer"&gt;this post&lt;/a&gt;, and &lt;a href="https://danglingpointers.substack.com/p/enabling-portable-and-high-performance" rel="noopener noreferrer"&gt;this post&lt;/a&gt; for a description of SmartNIC architectures and their on-chip memories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gigaflow
&lt;/h2&gt;

&lt;p&gt;This paper introduces a different memoization scheme (Gigaflow) to make better use of SmartNIC memory. Rather than memoizing the entire vSwitch pipeline for a packet, Gigaflow divides the vSwitch pipeline into multiple smaller pipelines, and memoizes each one separately. Fig. 1 illustrates this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk084b7znmuun5e8iknlk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk084b7znmuun5e8iknlk.png" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3676641.3716000" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3676641.3716000&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Gigaflow takes advantage of SmartNICs’ ability to perform many table lookups per packet while maintaining high throughput. The total working set in a typical workload is reduced, because many flows can share some table entries (rather than all-or-nothing sharing).&lt;/p&gt;

&lt;p&gt;Another way to think about this is that Megaflow combines all of the MATs in a vSwitch pipeline into one very large table, whereas Gigaflow partitions the vSwitch MAT pipeline into a handful of sub-pipelines, and combines each sub-pipeline into a medium-sized table.&lt;/p&gt;
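&lt;p&gt;A toy Python sketch of the contrast (invented names, not the paper's code): each sub-pipeline gets its own cache keyed only on the fields it matches, so two flows that diverge in one stage can still share cache entries in the others:&lt;/p&gt;

```python
# Toy Gigaflow-style memoization: one small cache per sub-pipeline. Each
# cache maps the fields its sub-pipeline matches on to the resulting actions.

def gigaflow_process(pkt, sub_pipelines, caches):
    hits = 0
    for sub, cache in zip(sub_pipelines, caches):
        key = sub["key"](pkt)            # fields this sub-pipeline matches on
        if key in cache:
            hits += 1
        else:
            cache[key] = sub["run"](pkt)  # memoize this sub-traversal's actions
        pkt = {**pkt, **cache[key]}       # apply the cached actions
    return pkt, hits

# Sub-pipeline 1 depends only on dst; sub-pipeline 2 only on vlan.
subs = [
    {"key": lambda p: p["dst"],  "run": lambda p: {"port": 1}},
    {"key": lambda p: p["vlan"], "run": lambda p: {"tag": p["vlan"] + 1}},
]
caches = [{}, {}]

gigaflow_process({"dst": "A", "vlan": 10}, subs, caches)
# A second flow with the same dst but a new vlan reuses the first cache entry
# (partial sharing), where Megaflow's single all-or-nothing entry would miss.
_, hits = gigaflow_process({"dst": "A", "vlan": 20}, subs, caches)
```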

&lt;p&gt;Sections 4.1.1 and 4.2.2 of the paper have the nitty-gritty details of how Gigaflow correctly assigns subsets of the vSwitch MAT pipeline to a set of tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Fig. 9 shows cache misses for Megaflow and Gigaflow for a variety of benchmarks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplx80je6gy34wnow9pbv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplx80je6gy34wnow9pbv.png" width="533" height="247"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3676641.3716000" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3676641.3716000&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dangling Pointers
&lt;/h2&gt;

&lt;p&gt;Memoization is useful in settings outside of networking. It would be interesting to see if the idea of separable memoization could be applied to other applications.&lt;/p&gt;

&lt;p&gt;Like I mentioned &lt;a href="https://danglingpointers.substack.com/p/scaling-ip-lookup-to-large-databases" rel="noopener noreferrer"&gt;here&lt;/a&gt;, hardware support for memoization in general purpose CPUs seems compelling.&lt;/p&gt;

</description>
      <category>networking</category>
    </item>
    <item>
      <title>Optimizing Datalog for the GPU</title>
      <dc:creator>Blake Pelton</dc:creator>
      <pubDate>Mon, 13 Oct 2025 12:02:25 +0000</pubDate>
      <link>https://dev.to/dangling_pointers_0bfce7ce6993/optimizing-datalog-for-the-gpu-15p1</link>
      <guid>https://dev.to/dangling_pointers_0bfce7ce6993/optimizing-datalog-for-the-gpu-15p1</guid>
      <description>&lt;p&gt;&lt;em&gt;This was originally posted on &lt;a href="//danglingpointers.substack.com"&gt;Dangling Pointers&lt;/a&gt;.  My goal is to help busy people stay current with recent academic developments.  Head there to subscribe for regular summaries of computer science research.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dl.acm.org/doi/10.1145/3669940.3707274" rel="noopener noreferrer"&gt;Optimizing Datalog for the GPU&lt;/a&gt; Yihao Sun, Ahmedur Rahman Shovon, Thomas Gilray, Sidharth Kumar, and Kristopher Micinski &lt;em&gt;ASPLOS'25&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Datalog Primer
&lt;/h2&gt;

&lt;p&gt;Datalog source code comprises a set of relations, and a set of rules.&lt;/p&gt;

&lt;p&gt;A relation can be explicitly defined with a set of tuples. A running example in the paper is to define a graph with a relation named &lt;code&gt;Edge&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Edge(0, 1)
Edge(1, 3)
Edge(0, 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A relation can also be implicitly defined with a set of rules. The paper uses the &lt;code&gt;Same Generation (SG)&lt;/code&gt; relation as an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1: SG(x, y) &amp;lt;- Edge(p, x), Edge(p, y), x != y
2: SG(x, y) &amp;lt;- Edge(a, x), SG(a, b), Edge(b, y), x != y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rule 1 states that two vertices (&lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt;) are part of the same generation if they both share a common ancestor (&lt;code&gt;p&lt;/code&gt;), and they are not actually the same vertex (&lt;code&gt;x != y&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Rule 2 states that two vertices (&lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt;) are part of the same generation if they have ancestors (&lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;) from the same generation.&lt;/p&gt;

&lt;p&gt;“Running a Datalog program” entails evaluating all rules until a fixed point is reached (no more tuples are added).&lt;/p&gt;

&lt;h2&gt;
  
  
  Semi-naïve Evaluation
&lt;/h2&gt;

&lt;p&gt;One key idea to internalize is that evaluating a Datalog rule is equivalent to performing a SQL join. For example, rule 1 is equivalent to joining the &lt;code&gt;Edge&lt;/code&gt; relation with itself, using &lt;code&gt;p&lt;/code&gt; as the join key, and &lt;code&gt;(x != y)&lt;/code&gt; as a filter.&lt;/p&gt;
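&lt;p&gt;For example, here is rule 1 written as a self-join in Python, over the three &lt;code&gt;Edge&lt;/code&gt; tuples from the primer above:&lt;/p&gt;

```python
# Rule 1 as a join: Edge(p, x) joined with Edge(p, y) on p, filtered by x != y.
edges = [(0, 1), (1, 3), (0, 2)]

sg = {(x, y)
      for (p1, x) in edges
      for (p2, y) in edges
      if p1 == p2 and x != y}
# Vertices 1 and 2 share the common ancestor 0, so SG holds (1, 2) and (2, 1).
```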

&lt;p&gt;&lt;em&gt;Semi-naïve Evaluation&lt;/em&gt; is an algorithm for performing these joins until convergence, while not wasting too much effort on redundant work. The tuples in a relation are put into three buckets:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;new&lt;/code&gt;: holds tuples that were discovered on the current iteration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;delta&lt;/code&gt;: holds tuples which were added in the previous iteration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;full&lt;/code&gt;: holds all tuples that have been found in any iteration&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For a join involving two relations (&lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt;), &lt;code&gt;new&lt;/code&gt; is computed as the union of the result of 3 joins:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;delta(A)&lt;/code&gt; joined with &lt;code&gt;full(B)&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;full(A)&lt;/code&gt; joined with &lt;code&gt;delta(B)&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;delta(A)&lt;/code&gt; joined with &lt;code&gt;delta(B)&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key fact for performance is that &lt;code&gt;full(A)&lt;/code&gt; is never joined with &lt;code&gt;full(B)&lt;/code&gt;.&lt;/p&gt;
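&lt;p&gt;Here is a toy Python sketch of the delta/full bookkeeping. Because only one relation (&lt;code&gt;SG&lt;/code&gt;) is recursive in the running example, the three-join decomposition degenerates there, so this sketch uses plain transitive closure (&lt;code&gt;Path(x, z) &amp;lt;- Path(x, y), Edge(y, z)&lt;/code&gt;) instead; the point to notice is that &lt;code&gt;full&lt;/code&gt; is never re-joined:&lt;/p&gt;

```python
# Semi-naive evaluation of transitive closure over a chain 0 -> 1 -> 2 -> 3.
# Each round joins only the previous round's delta with Edge; `full` is never
# joined with anything, which is the efficiency win.

edges = {(0, 1), (1, 2), (2, 3)}

full = set(edges)   # all tuples found in any iteration
delta = set(edges)  # tuples added in the previous iteration
while delta:
    # Join delta(Path) with Edge on the shared middle vertex.
    new = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
    delta = new - full  # keep only genuinely new tuples
    full |= delta       # fixed point reached when delta is empty
```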

&lt;p&gt;More details on Semi-naïve Evaluation can be found in &lt;a href="https://pages.cs.wisc.edu/~paris/cs838-s16/lecture-notes/lecture8.pdf" rel="noopener noreferrer"&gt;these notes&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;Hash-Indexed Sorted Array&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;This paper introduces the &lt;em&gt;hash-indexed sorted array&lt;/em&gt; for storing relations while executing Semi-naïve Evaluation on a GPU. It seems to me like this data structure would work well on other chips too. Fig. 2 illustrates the data structure (join keys are in red):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmcfzhzt451leq7n31kss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmcfzhzt451leq7n31kss.png" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3669940.3707274" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3669940.3707274&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;data array&lt;/em&gt; holds the actual tuple data. It is densely packed in row-major order.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;sorted index array&lt;/em&gt; holds pointers into the data array (one pointer per tuple). These pointers are lexicographically sorted (join keys take higher priority in the sort).&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;hash table&lt;/em&gt; is an open-addressed hash table which maps a hash of the join keys to the first element in the sorted index array that contains those join keys.&lt;/p&gt;

&lt;p&gt;A join of relations &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt; can be implemented with the following pseudo-code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for each tuple 'a' in the sorted index array of A:

  lookup (hash table) the first tuple in B which has matching join keys to 'a'

  iterate over all tuples in the sorted index array of B with matching keys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Memory accesses when probing through the sorted index array are coherent. Memory accesses when accessing the data array are coherent up to the number of elements in a tuple.&lt;/p&gt;
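&lt;p&gt;Here is a simplified Python sketch of the data structure and join (my own reconstruction, not the paper's code; a Python dict stands in for the open-addressed hash table, and the join key is the first column of each tuple):&lt;/p&gt;

```python
# Simplified hash-indexed sorted array: a data array, a sorted index array of
# positions into it, and a hash table from join key to the first slot in the
# sorted index array holding that key.

def build(tuples):
    data = list(tuples)                                    # data array
    idx = sorted(range(len(data)), key=lambda i: data[i])  # sorted index array
    first = {}                                             # key -> first slot
    for pos, i in enumerate(idx):
        first.setdefault(data[i][0], pos)
    return data, idx, first

def join(a, b):
    """For each tuple of A, emit all B tuples with a matching join key."""
    a_data, a_idx, _ = a
    b_data, b_idx, b_first = b
    out = []
    for i in a_idx:                   # coherent scan of A's sorted index
        key = a_data[i][0]
        pos = b_first.get(key)        # hash lookup: first matching slot in B
        if pos is None:
            continue
        while pos < len(b_idx) and b_data[b_idx[pos]][0] == key:
            out.append((a_data[i], b_data[b_idx[pos]]))
            pos += 1                  # coherent probe of adjacent matches
    return out

A = build([(1, "a"), (2, "b")])
B = build([(1, "x"), (1, "y"), (3, "z")])
pairs = join(A, B)  # (1, "a") matches (1, "x") and (1, "y")
```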

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Table 3 compares the results from this paper (GPULog) against a state-of-the-art CPU implementation (Soufflé). HIP represents GPULog ported to AMD’s &lt;a href="https://github.com/ROCm/rocm-systems/tree/develop/projects/hip" rel="noopener noreferrer"&gt;HIP&lt;/a&gt; runtime and then run on the same Nvidia GPU.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hwhwbdccjuzxftleb9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hwhwbdccjuzxftleb9t.png" width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3669940.3707274" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3669940.3707274&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dangling Pointers
&lt;/h2&gt;

&lt;p&gt;The data structure and algorithms described by this paper seem generic; it would be interesting to see them run on other chips (FPGA, DPU, CPU, HPC cluster).&lt;/p&gt;

&lt;p&gt;I would guess most of GPULog is bound by memory bandwidth, not compute. I wonder if there are Datalog-specific algorithms to reduce the bandwidth/compute ratio.&lt;/p&gt;

</description>
      <category>database</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>No Cap, This Memory Slaps: Breaking Through the Memory Wall of Transactional Database Systems with Processing-in-Memory</title>
      <dc:creator>Blake Pelton</dc:creator>
      <pubDate>Wed, 08 Oct 2025 12:00:59 +0000</pubDate>
      <link>https://dev.to/dangling_pointers_0bfce7ce6993/no-cap-this-memory-slaps-breaking-through-the-memory-wall-of-transactional-database-systems-with-3fh4</link>
      <guid>https://dev.to/dangling_pointers_0bfce7ce6993/no-cap-this-memory-slaps-breaking-through-the-memory-wall-of-transactional-database-systems-with-3fh4</guid>
      <description>&lt;p&gt;&lt;em&gt;This was originally posted on &lt;a href="//danglingpointers.substack.com"&gt;Dangling Pointers&lt;/a&gt;.  My goal is to help busy people stay current with recent academic developments.  Head there to subscribe for regular summaries of computer science research.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://vldb.org/pvldb/volumes/18/paper/No%20Cap%2C%20This%20Memory%20Slaps%3A%20Breaking%20Through%20the%20Memory%20Wall%20of%20Transactional%20Database%20Systems%20with%20Processing-in-Memory" rel="noopener noreferrer"&gt;No Cap, This Memory Slaps: Breaking Through the Memory Wall of Transactional Database Systems with Processing-in-Memory&lt;/a&gt; Hyoungjoo Kim, Yiwei Zhao, Andrew Pavlo, Phillip B. Gibbons &lt;em&gt;VLDB'25&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This paper describes how processing-in-memory (PIM) hardware can be used to improve OLTP performance. &lt;a href="https://danglingpointers.substack.com/p/spid-join" rel="noopener noreferrer"&gt;Here&lt;/a&gt; is a prior paper summary from me on a similar topic, but that one is focused on OLAP rather than OLTP.&lt;/p&gt;

&lt;h2&gt;
  
  
  UPMEM
&lt;/h2&gt;

&lt;p&gt;UPMEM is a specific PIM product (also used in the &lt;a href="https://danglingpointers.substack.com/p/spid-join" rel="noopener noreferrer"&gt;prior paper&lt;/a&gt; on this blog). A UPMEM DIMM is like a DRAM DIMM, but each DRAM bank is extended with a simple processor which can run user code. That processor has access to a small local memory and the DRAM associated with the bank. This paper calls each processor a &lt;em&gt;PIM Module&lt;/em&gt;. There is no direct communication between PIM modules.&lt;/p&gt;

&lt;p&gt;Fig. 2 illustrates the system architecture used by this paper. A traditional CPU is connected to a set of boring old DRAM DIMMs and is also connected to a set of UPMEM DIMMs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04mtxznbe62fapmmiogs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04mtxznbe62fapmmiogs.png" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;a href="https://vldb.org/pvldb/volumes/18/paper/No%20Cap%2C%20This%20Memory%20Slaps%3A%20Breaking%20Through%20the%20Memory%20Wall%20of%20Transactional%20Database%20Systems%20with%20Processing-in-Memory" rel="noopener noreferrer"&gt;https://vldb.org/pvldb/volumes/18/paper/No%20Cap%2C%20This%20Memory%20Slaps%3A%20Breaking%20Through%20the%20Memory%20Wall%20of%20Transactional%20Database%20Systems%20with%20Processing-in-Memory&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Four Challenges
&lt;/h2&gt;

&lt;p&gt;The paper identifies the following difficulties associated with using UPMEM to accelerate an OLTP workload:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;PIM modules can only access their local memory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PIM modules do not have typical niceties associated with x64 CPUs (high clock frequency, caches, SIMD)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is a non-trivial cost for the CPU to send data to UPMEM DIMMs (similar to the CPU writing data to regular DRAM)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;OLTP workloads have tight latency constraints&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Near Memory Affinity
&lt;/h2&gt;

&lt;p&gt;The authors arrived at a solution that both provides a good speedup and doesn’t require boiling the ocean. The database code and architecture remain largely unchanged. Much of the data remains in standard DRAM DIMMs, and the database operates on it as it always has.&lt;/p&gt;

&lt;p&gt;In section 3.2 the authors identify a handful of data structures and operations with &lt;em&gt;near-memory affinity&lt;/em&gt; which are offloaded. These data structures are stored in UPMEM DIMMs, and the algorithms which access them are offloaded to the PIM modules.&lt;/p&gt;

&lt;p&gt;The key feature that these algorithms have in common is &lt;em&gt;pointer chasing&lt;/em&gt;. The sweet spots the authors identify involve a small number of parameters sent from the CPU to a PIM module, then the PIM module performing multiple roundtrips to its local DRAM bank, followed by the CPU reading back a small amount of response data. The roundtrips to PIM-local DRAM have lower latency than accesses from a traditional CPU core.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hash-Partitioned Index
&lt;/h2&gt;

&lt;p&gt;One operation which involves a lot of pointer chasing is &lt;a href="https://en.wikipedia.org/wiki/B%2B_tree" rel="noopener noreferrer"&gt;B+ tree&lt;/a&gt; traversal. Thus, the system described in this paper moves B+ tree indexes into UPMEM DIMMs and uses PIM modules to search for values in an index. Note that the actual tuples that hold row data stay in plain-old DRAM.&lt;/p&gt;

&lt;p&gt;The tricky part is handling range queries while distributing an index across many banks. The solution described in this paper is to partition the set of keys into 2&lt;sup&gt;R&lt;/sup&gt; partitions (the lower &lt;code&gt;R&lt;/code&gt; bits of a key define the index of the partition which holds that key). Each partition is thus responsible for a contiguous range of keys. For a range query, the lower &lt;code&gt;R&lt;/code&gt; bits of the lower and upper bounds of the range can be used to determine which partitions must be searched. Each PIM module is responsible for multiple partitions, and a hash function is used to convert a partition index into a PIM module index.&lt;/p&gt;
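&lt;p&gt;A toy Python sketch of this addressing scheme (parameter values and names invented):&lt;/p&gt;

```python
# Toy partition addressing: the lower R bits of a key pick one of 2**R
# partitions, and a hash of the partition index picks the owning PIM module.

R = 4
NUM_PARTITIONS = 1 << R
NUM_PIM_MODULES = 8

def partition_of(key):
    return key & (NUM_PARTITIONS - 1)  # lower R bits of the key

def pim_module_of(partition):
    # Many partitions map to each PIM module via a hash function.
    return hash(partition) % NUM_PIM_MODULES

def partitions_for_range(lo, hi):
    """Partitions that a range query [lo, hi] must search."""
    if hi - lo + 1 >= NUM_PARTITIONS:
        return list(range(NUM_PARTITIONS))  # range spans every partition
    p_lo, p_hi = partition_of(lo), partition_of(hi)
    if p_lo <= p_hi:
        return list(range(p_lo, p_hi + 1))
    # The range wraps past a multiple of 2**R.
    return list(range(p_lo, NUM_PARTITIONS)) + list(range(0, p_hi + 1))

parts = partitions_for_range(5, 9)  # a small range touches partitions 5..9
```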

&lt;h2&gt;
  
  
  MVCC Chain Traversal
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Multiversion_concurrency_control" rel="noopener noreferrer"&gt;MVCC&lt;/a&gt; is a concurrency control method which requires the database to keep around old versions of a given row (to allow older in-flight queries to access them). The set of versions associated with a row are typically stored in a linked list (yet another pointer traversal). Again, the actual tuple contents are stored in regular DRAM, but the list &lt;em&gt;links&lt;/em&gt; are stored in UPMEM DIMMs, with the PIM modules traversing the links. Section 4.3 has more information about how old versions are eventually reclaimed with garbage collection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Fig. 7 has the headline results. &lt;code&gt;MosaicDB&lt;/code&gt; is the baseline, &lt;code&gt;OLTPim&lt;/code&gt; is the work described by this paper. It is interesting that &lt;code&gt;OLTPim&lt;/code&gt; only beats &lt;code&gt;MosaicDB&lt;/code&gt; on &lt;code&gt;TPC-C&lt;/code&gt; for read-only workloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firlbebmsopyha114jj0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firlbebmsopyha114jj0d.png" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://vldb.org/pvldb/volumes/18/paper/No%20Cap%2C%20This%20Memory%20Slaps%3A%20Breaking%20Through%20the%20Memory%20Wall%20of%20Transactional%20Database%20Systems%20with%20Processing-in-Memory" rel="noopener noreferrer"&gt;https://vldb.org/pvldb/volumes/18/paper/No%20Cap%2C%20This%20Memory%20Slaps%3A%20Breaking%20Through%20the%20Memory%20Wall%20of%20Transactional%20Database%20Systems%20with%20Processing-in-Memory&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dangling Pointers
&lt;/h2&gt;

&lt;p&gt;Processing-in-memory can help with memory bandwidth and memory latency. It seems like this work is primarily focused on memory latency. I suppose this indicates that OLTP workloads are fundamentally latency-bound, because there is not enough potential concurrency between transactions to hide that latency. Is there no way to structure a database such that OLTP workloads are not bound by memory latency?&lt;/p&gt;

&lt;p&gt;It would be interesting to see if these tricks could work in a distributed system, where the PIM modules are replaced by separate nodes in the system.&lt;/p&gt;

</description>
      <category>database</category>
    </item>
    <item>
      <title>Parendi: Thousand-Way Parallel RTL Simulation</title>
      <dc:creator>Blake Pelton</dc:creator>
      <pubDate>Mon, 06 Oct 2025 11:02:09 +0000</pubDate>
      <link>https://dev.to/dangling_pointers_0bfce7ce6993/parendi-thousand-way-parallel-rtl-simulation-f7f</link>
      <guid>https://dev.to/dangling_pointers_0bfce7ce6993/parendi-thousand-way-parallel-rtl-simulation-f7f</guid>
      <description>&lt;p&gt;&lt;em&gt;This was originally posted on &lt;a href="//danglingpointers.substack.com"&gt;Dangling Pointers&lt;/a&gt;.  My goal is to help busy people stay current with recent academic developments.  Head there to subscribe for regular summaries of computer science research.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dl.acm.org/doi/10.1145/3676641.3716010" rel="noopener noreferrer"&gt;Parendi: Thousand-Way Parallel RTL Simulation&lt;/a&gt; Mahyar Emami, Thomas Bourgeat, and James R. Larus &lt;em&gt;ASPLOS'25&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This paper describes an RTL simulator running on (one or more) &lt;a href="https://www.graphcore.ai/" rel="noopener noreferrer"&gt;Graphcore&lt;/a&gt; IPUs. One nice side benefit of this paper is the quantitative comparisons of IPU synchronization performance vs traditional CPUs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://danglingpointers.substack.com/p/dont-repeat-yourself-coarse-grained" rel="noopener noreferrer"&gt;Here&lt;/a&gt; is another paper summary which describes some challenges with RTL simulation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Graphcore IPU
&lt;/h2&gt;

&lt;p&gt;The Graphcore IPU used in this paper is a chip with 1472 cores, operating with a MIMD architecture. A 1U server can contain 4 IPUs. It is interesting to see a chip that was designed for DNN workloads adapted to the domain of RTL simulation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partitioning and Scheduling
&lt;/h2&gt;

&lt;p&gt;Similar to other papers on RTL simulation, a fundamental step of the Parendi simulator is partitioning the circuit to be simulated. Parendi partitions the circuit into &lt;em&gt;fibers&lt;/em&gt;. A fiber comprises a single (word-wide) register, and all of the combinational logic which feeds it. Note that some combinational logic may be present in multiple fibers. Fig. 3 contains an example: node &lt;code&gt;a3&lt;/code&gt; is present in multiple fibers. As far as I can tell, Parendi does not try to deduplicate this work (extra computation to save synchronization).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsxlyfgm6s3rkvq8cr11s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsxlyfgm6s3rkvq8cr11s.png" width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3676641.3716010" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3676641.3716010&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The driving factor in the design of this fiber-specific partitioning system is &lt;em&gt;scalability&lt;/em&gt;. Each register has storage to hold the value of the register at the beginning and end of the current clock cycle (i.e., the &lt;code&gt;current&lt;/code&gt; and &lt;code&gt;next&lt;/code&gt; values).&lt;/p&gt;

&lt;p&gt;I think of the logic to simulate a single clock cycle with the following pseudo-code (&lt;code&gt;f.root&lt;/code&gt; is the register rooted at fiber &lt;code&gt;f&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;parallel for each fiber : f
  f.root.next = evaluate f

barrier

parallel for each fiber : f
    f.root.current = f.root.next

barrier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scalability comes from the fact that there are only two barriers per simulated clock cycle. This is an instance of the &lt;em&gt;bulk synchronous parallel&lt;/em&gt; (BSP) model.&lt;/p&gt;
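&lt;p&gt;Here is a runnable Python sketch of the two-barrier BSP cycle (my own toy, not Parendi's code), with threads standing in for cores, one fiber per thread, and a 3-register ring shifter as the circuit:&lt;/p&gt;

```python
import threading

# Each fiber owns one register. Phase 1 evaluates the combinational logic
# (here: register i copies register i-1, a ring shift); a barrier; phase 2
# commits next -> current; a second barrier before the next cycle begins.
N = 3
current = [1, 0, 0]
nxt = [0] * N
barrier = threading.Barrier(N)

def fiber(i, cycles):
    for _ in range(cycles):
        nxt[i] = current[(i - 1) % N]  # phase 1: evaluate into `next`
        barrier.wait()                 # all evaluations done
        current[i] = nxt[i]            # phase 2: commit next -> current
        barrier.wait()                 # all commits done; start next cycle

threads = [threading.Thread(target=fiber, args=(i, 3)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# After 3 cycles, the single 1 has shifted all the way around the ring.
```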

&lt;h2&gt;
  
  
  Partitioning
&lt;/h2&gt;

&lt;p&gt;In many cases, there are more fibers than CPU/IPU cores. Parendi addresses this by distributing the simulation across chips and scheduling multiple fibers to run on the same core.&lt;/p&gt;

&lt;p&gt;If the simulation is distributed across multiple chips, then a min-cut algorithm is used to partition the fibers across chips while minimizing communication.&lt;/p&gt;

&lt;p&gt;The Parendi compiler statically groups multiple fibers together into a single &lt;em&gt;process&lt;/em&gt;. A core simulates all fibers within a process. The merging pass primarily seeks to minimize inter-core communication. First, a special-case merging algorithm merges fibers which reference the same large array, to avoid communicating the contents of such an array across cores. I imagine this is primarily for simulation of on-chip memories. Second, a general-purpose merging algorithm merges fibers which each have low compute cost, and high data sharing with each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Fig. 7 compares Parendi vs Verilator simulation. &lt;code&gt;x64_ix3&lt;/code&gt; is a 2-socket server with 28 Intel cores per socket. &lt;code&gt;x64_ae4&lt;/code&gt; is a 2-socket server with 64 AMD cores per socket:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqxrxffe2yhtfo970q0pl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqxrxffe2yhtfo970q0pl.png" width="800" height="152"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3676641.3716010" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3676641.3716010&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Section 6.4 claims a roughly 2x improvement in cost per simulation using cloud pricing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dangling Pointers
&lt;/h2&gt;

&lt;p&gt;As far as I can tell, this system doesn’t have optimizations for the case where some or all of a fiber’s inputs do not change between clock cycles. It seems tricky to optimize for this case while maintaining a static assignment of fibers to cores.&lt;/p&gt;

&lt;p&gt;Fig. 4 has a fascinating comparison of synchronization costs between an IPU and a traditional x64 CPU. This microbenchmark loads up the system with simple fibers (roughly 6 instructions per fiber). Note that the curves represent different fiber counts (e.g., the red dotted line represents 7 fibers on the IPU graph, vs 736 fibers on the x64 graph). The paper claims that a barrier between 56 x64 threads implemented with atomic memory accesses consumes thousands of cycles, whereas the IPU has dedicated hardware barrier support.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s9l6pdrniv2w9p2h5eg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s9l6pdrniv2w9p2h5eg.png" width="577" height="276"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3676641.3716010" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3676641.3716010&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
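&lt;p&gt;For reference, the x64 side of that comparison boils down to a centralized software barrier like the sketch below. This is illustrative only: real implementations spin on atomics rather than a Python condition variable, but the single contended counter is the point, and it is exactly what the IPU replaces with dedicated hardware:&lt;/p&gt;

```python
# Sketch of a centralized sense-reversing barrier: every arriving thread
# contends on one shared counter/generation, which is why software
# barriers get slower as thread count grows.
import threading

class CentralBarrier:
    def __init__(self, n):
        self.n = n
        self.cv = threading.Condition()
        self.count = 0
        self.generation = 0

    def wait(self):
        with self.cv:
            gen = self.generation
            self.count += 1
            if self.count == self.n:     # last arrival flips the generation
                self.count = 0
                self.generation += 1
                self.cv.notify_all()
            else:                        # everyone else blocks on the flip
                self.cv.wait_for(lambda: self.generation != gen)
```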

&lt;p&gt;This seems to be one of many examples of how generic multi-core CPUs do not perform well with fine-grained multi-threading. We’ve seen it with pipeline parallelism, and now with the BSP model. Interestingly, both cases seem to work better with specialized multi-core chips (pipeline parallelism works with CPU-based SmartNICs, BSP works with IPUs). I’m not convinced this is a fundamental hardware problem that cannot be addressed with better software.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>performance</category>
    </item>
    <item>
      <title>Skia: Exposing Shadow Branches</title>
      <dc:creator>Blake Pelton</dc:creator>
      <pubDate>Fri, 03 Oct 2025 12:02:09 +0000</pubDate>
      <link>https://dev.to/dangling_pointers_0bfce7ce6993/skia-exposing-shadow-branches-3n0j</link>
      <guid>https://dev.to/dangling_pointers_0bfce7ce6993/skia-exposing-shadow-branches-3n0j</guid>
      <description>&lt;p&gt;&lt;em&gt;This was originally posted on &lt;a href="//danglingpointers.substack.com"&gt;Dangling Pointers&lt;/a&gt;.  My goal is to help busy people stay current with recent academic developments.  Head there to subscribe for regular summaries of computer science research.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dl.acm.org/doi/10.1145/3676641.3716273" rel="noopener noreferrer"&gt;Skia: Exposing Shadow Branches&lt;/a&gt; Chrysanthos Pepi, Bhargav Reddy Godala, Krishnam Tibrewala, Gino A. Chacon, Paul V. Gratz, Daniel A. Jiménez, Gilles A. Pokam, and David I. August &lt;em&gt;ASPLOS'25&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This paper starts with your yearly reminder of the high cost of the &lt;a href="https://www.doc.ic.ac.uk/~phjk/AdvancedCompArchitecture/Lectures/pdfs/Ch01-part4-TuringTaxDiscussion.pdf" rel="noopener noreferrer"&gt;Turing Tax&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Recent works demonstrate that the front-end is a considerable source of performance loss [16], with upwards of 53% of performance [23] bounded by the front-end.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Fetch Directed Instruction Prefetching&lt;/h2&gt;

&lt;p&gt;Everyone knows that the front-end runs ahead of the back-end of a processor. If you want to think of it in AI terms, imagine a model that is told about the current value of and recent history of the program counter, and asked to predict future values of the program counter. The accuracy of these predictions determines how utilized the processor pipeline is.&lt;/p&gt;

&lt;p&gt;What I did not know is that in a modern processor, the front-end &lt;em&gt;itself&lt;/em&gt; is divided into two decoupled components, one of which runs ahead of the other. Fig. 4 illustrates this &lt;em&gt;Fetch Directed Instruction Prefetching&lt;/em&gt; (FDIP) microarchitecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo51462211q5ky8wbhhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo51462211q5ky8wbhhh.png" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3676641.3716273" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3676641.3716273&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;Instruction Address Generator&lt;/em&gt; (IAG) runs the furthest ahead and uses tables (e.g., the &lt;em&gt;Branch Target Buffer&lt;/em&gt; (BTB)) in the &lt;em&gt;Branch Prediction Unit&lt;/em&gt; (BPU) to predict the sequence of basic blocks which will be executed. Information about each predicted basic block is stored in the &lt;em&gt;Fetch Target Queue&lt;/em&gt; (FTQ).&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;Instruction Fetch Unit&lt;/em&gt; (IFU) uses the control flow predictions from the FTQ to actually read instructions from the instruction cache. Some mispredictions can be detected after an instruction has been read and decoded. These result in an &lt;em&gt;early re-steer&lt;/em&gt; (i.e., informing the IAG about the misprediction immediately after decode).&lt;/p&gt;

&lt;p&gt;When a basic block is placed into the FTQ, the associated instructions are prefetched into the IFU (to reduce the impact of instruction cache misses).&lt;/p&gt;
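&lt;p&gt;Putting the pieces together, a toy model of the decoupled loop might look like this. The names follow Fig. 4, but every structure here is drastically simplified and my own invention:&lt;/p&gt;

```python
# Toy FDIP model: the IAG runs ahead, using BTB predictions to enqueue
# basic blocks into the FTQ; the IFU later consumes the FTQ, issuing an
# instruction-cache prefetch for each predicted block.
from collections import deque

def iag_run_ahead(btb, start_pc, ftq_capacity):
    """Predict a sequence of basic blocks, stopping when the FTQ is full
    or the BTB has no prediction (a BTB miss stalls run-ahead)."""
    ftq = deque()
    pc = start_pc
    while len(ftq) != ftq_capacity and pc in btb:
        block_end, target = btb[pc]    # predicted taken branch at block_end
        ftq.append((pc, block_end))    # one FTQ entry per basic block
        pc = target
    return ftq

def ifu_fetch(ftq, icache_prefetch):
    """Consume FTQ entries, issuing a prefetch for each predicted block."""
    fetched = []
    while ftq:
        start, end = ftq.popleft()
        icache_prefetch(start)         # hide instruction-cache miss latency
        fetched.append((start, end))
    return fetched
```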

&lt;h2&gt;Shadow Branches&lt;/h2&gt;

&lt;p&gt;This paper introduces the term “shadow branch”. A shadow branch is a (static) branch instruction which is currently stored in the instruction cache but is not present in any BPU tables.&lt;/p&gt;

&lt;p&gt;The top of fig. 5 illustrates a &lt;em&gt;head&lt;/em&gt; shadow branch. A branch instruction caused execution to jump to byte 24 and execute the non-shaded instructions. This pulled an entire cache line into the instruction cache, including the branch instruction starting at byte 19.&lt;/p&gt;

&lt;p&gt;The bottom of fig. 5 shows a &lt;em&gt;tail&lt;/em&gt; shadow branch. In this case, the instruction at byte 12 jumped away from the cache line, causing the red branch instruction at byte 16 to not be executed (even though it is present in the instruction cache).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wyl4vxqqmsbarb7hi89.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wyl4vxqqmsbarb7hi89.png" width="674" height="431"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3676641.3716273" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3676641.3716273&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
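&lt;p&gt;The head/tail classification can be sketched in a few lines, assuming branch byte offsets within the cache line have already been recovered (which is the job of the decoder described below; the code is my illustration, not the paper’s logic):&lt;/p&gt;

```python
# Sketch: classify shadow branches in one cache line. Branches before the
# entry point are "head" shadows; branches after the exiting branch are
# "tail" shadows; the exiting branch itself is executed, hence not a shadow.
def classify_shadow_branches(branch_offsets, entry_offset, exit_offset):
    """branch_offsets: byte offsets of all branches in the line.
    entry_offset: byte where execution entered the line.
    exit_offset: offset of the branch that jumped away (None = fell through)."""
    shadows = {"head": [], "tail": []}
    for off in branch_offsets:
        if entry_offset > off:
            shadows["head"].append(off)   # entered past it: never executed
        elif exit_offset is not None and off > exit_offset:
            shadows["tail"].append(off)   # jumped away before reaching it
    return shadows
```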

&lt;h2&gt;Skia&lt;/h2&gt;

&lt;p&gt;The proposed design (Skia) allows the IAG to make accurate predictions for a subset of shadow branches, thus improving pipeline utilization and reducing instruction cache misses. The types of shadow branches which Skia supports are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Direct unconditional branches (target PC can be determined without looking at backend state)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Function calls&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Returns&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As shown in Fig. 6, these three categories of branches (purple, red, orange) account for a significant fraction of all BTB misses:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw1z2yf3k5alug34jibre.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw1z2yf3k5alug34jibre.png" width="800" height="577"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3676641.3716273" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3676641.3716273&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When a cache line enters the instruction cache, the &lt;em&gt;Shadow Branch Decoder&lt;/em&gt; (SBD) decodes just enough information to locate shadow branches in the cache line and determine the target PC (for direct unconditional branches and function calls). Metadata from the SBD is placed into two new branch prediction tables in the BPU:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The U-SBB holds information about direct unconditional branches and function calls&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The R-SBB holds information about returns&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the BPU encounters a BTB miss, it can fall back to the U-SBB or R-SBB for a prediction.&lt;/p&gt;
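&lt;p&gt;As a sketch, the lookup order might look like this (tables modeled as plain dicts and sets, return targets taken from a return-address stack; all simplifications are mine):&lt;/p&gt;

```python
# Sketch of the BPU lookup order with the shadow-branch fallback:
# BTB first, then U-SBB (direct jumps and calls), then R-SBB (returns,
# whose target comes from the return-address stack, not a stored target).
def predict_target(pc, btb, u_sbb, r_sbb, ras):
    if pc in btb:
        return btb[pc]       # normal hit: first-class branch
    if pc in u_sbb:
        return u_sbb[pc]     # shadow direct jump / call target
    if pc in r_sbb and ras:
        return ras[-1]       # shadow return: top of return-address stack
    return None              # true miss: front end falls back to next-PC
```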

&lt;p&gt;Fig. 11 illustrates the microarchitectural changes proposed by Skia:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ko8148h68m9ljqav18f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ko8148h68m9ljqav18f.png" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3676641.3716273" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3676641.3716273&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Section 4 goes into more detail about these structures, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Replacement policy&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How a shadow branch is upgraded into a first-class branch in the BTB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handling variable length instructions&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Results&lt;/h2&gt;

&lt;p&gt;Fig. 14 has (simulated) IPC improvements across a variety of benchmarks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3450ef5ww446ax1e5dp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3450ef5ww446ax1e5dp.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3676641.3716273" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3676641.3716273&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Dangling Pointers&lt;/h2&gt;

&lt;p&gt;A common problem that HW and SW architects must solve is getting teams out of a local minimum caused by fixed interfaces. The failure mode: two groups of engineers agree on a static interface, and then each optimizes its component as best it can without changing the interface.&lt;/p&gt;

&lt;p&gt;In this paper, the interface is the ISA, and Skia is a clever optimization inside of the CPU front-end. Skia shows that there is fruit to be picked here. It would be interesting to examine potential performance gains from architectural (i.e., ISA) changes to pick the same fruit.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>performance</category>
    </item>
    <item>
      <title>Accelerate Distributed Joins with Predicate Transfer</title>
      <dc:creator>Blake Pelton</dc:creator>
      <pubDate>Mon, 29 Sep 2025 12:02:31 +0000</pubDate>
      <link>https://dev.to/dangling_pointers_0bfce7ce6993/accelerate-distributed-joins-with-predicate-transfer-3fn7</link>
      <guid>https://dev.to/dangling_pointers_0bfce7ce6993/accelerate-distributed-joins-with-predicate-transfer-3fn7</guid>
      <description>&lt;p&gt;&lt;em&gt;This was originally posted on &lt;a href="//danglingpointers.substack.com"&gt;Dangling Pointers&lt;/a&gt;.  My goal is to help busy people stay current with recent academic developments.  Head there to subscribe for regular summaries of computer science research.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dl.acm.org/doi/10.1145/3725259" rel="noopener noreferrer"&gt;Accelerate Distributed Joins with Predicate Transfer&lt;/a&gt; Yifei Yang and Xiangyao Yu &lt;em&gt;SIGMOD'25&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Thanks to the reader who “dereferenced” this dangling pointer from the &lt;a href="https://danglingpointers.substack.com/p/predicate-transfer-efficient-pre" rel="noopener noreferrer"&gt;prior post&lt;/a&gt; on predicate transfer. This paper extends prior work on predicate transfer to apply to distributed joins.&lt;/p&gt;

&lt;h2&gt;Predicate Transfer Refresh&lt;/h2&gt;

&lt;p&gt;If you have time, check out my post on &lt;a href="https://danglingpointers.substack.com/p/predicate-transfer-efficient-pre" rel="noopener noreferrer"&gt;predicate transfer&lt;/a&gt;. If not, Fig. 1 doubles as an executive summary:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cu1s05csp8xx87jv7fl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cu1s05csp8xx87jv7fl.png" width="800" height="505"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3725259" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3725259&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The idea is to pre-filter the tables involved in a query, so as to reduce total query time by joining smaller tables. Fig. 1(a) shows two tables which will be joined during query execution: &lt;code&gt;R&lt;/code&gt; and &lt;code&gt;S&lt;/code&gt;. &lt;code&gt;S'&lt;/code&gt; is the pre-filtered version of &lt;code&gt;S&lt;/code&gt;. &lt;code&gt;S'&lt;/code&gt; is constructed with the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Iterate through all join keys in &lt;code&gt;R&lt;/code&gt;, inserting each key into a bloom filter (&lt;code&gt;BF.R&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Iterate through all rows in &lt;code&gt;S&lt;/code&gt;, probing &lt;code&gt;BF.R&lt;/code&gt; for each row, insert rows that pass the bloom filter into &lt;code&gt;S'&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
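&lt;p&gt;The two steps above can be sketched with a toy Bloom filter (two hash functions over one bit array; real systems tune the number of hash functions and bits):&lt;/p&gt;

```python
# Toy Bloom-filter predicate transfer: build BF.R from R's join keys,
# then keep only the rows of S that might match (false positives are
# possible; false negatives are not).
import hashlib

M = 1024  # number of bits in the filter

def _hashes(key):
    digest = hashlib.sha256(repr(key).encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return (h1 % M, h2 % M)

def build_filter(keys):
    bits = bytearray(M)
    for key in keys:
        for h in _hashes(key):
            bits[h] = 1
    return bits

def prefilter(rows, key_of, bf):
    """Keep rows whose join key passes every hash probe of bf."""
    return [row for row in rows if all(bf[h] for h in _hashes(key_of(row)))]
```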

&lt;p&gt;Now that &lt;code&gt;S'&lt;/code&gt; is constructed, the algorithm takes another step in the join graph (illustrated in Fig. 1(b)). In this next step, the &lt;code&gt;S'&lt;/code&gt; computed in a previous iteration performs the job of &lt;code&gt;R&lt;/code&gt; (a different join key is used in this step). The algorithm starts at tables with pushed-down filters, propagates predicate information forward through the join graph, and then reverses and propagates predicate information backward.&lt;/p&gt;

&lt;p&gt;Now that you remember the basics of predicate transfer, it’s time to deal with distributed joins. In such an environment, each node in the system holds a subset of each table (e.g., &lt;code&gt;R&lt;/code&gt; and &lt;code&gt;S&lt;/code&gt;).&lt;/p&gt;

&lt;h2&gt;Broadcast&lt;/h2&gt;

&lt;p&gt;If &lt;code&gt;R&lt;/code&gt; is small relative to &lt;code&gt;S&lt;/code&gt;, then it makes sense to broadcast &lt;code&gt;BF.R&lt;/code&gt; to each node. Fig. 3 illustrates three ways to do this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6w15g7zqmjiecllnl5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6w15g7zqmjiecllnl5d.png" width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3725259" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3725259&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Design 1 (Fig. 3(a)) is the simplest, so let’s start there. It computes the pre-filtered version of &lt;code&gt;S&lt;/code&gt; (i.e., &lt;code&gt;S'&lt;/code&gt;) in two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Node &lt;code&gt;i&lt;/code&gt; iterates through all rows in its local subset of &lt;code&gt;R&lt;/code&gt;, and inserts each join key into a local bloom filter &lt;code&gt;BF.Ri&lt;/code&gt;. Each of these small bloom filters is broadcast to every other node (which isn’t too expensive because &lt;code&gt;R&lt;/code&gt; is assumed to be small). &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Node &lt;code&gt;i&lt;/code&gt; iterates through all rows in its local subset of &lt;code&gt;S&lt;/code&gt;, and probes all of the small bloom filters (&lt;code&gt;BF.R1&lt;/code&gt;, &lt;code&gt;BF.R2&lt;/code&gt;, …). If any probe operation results in a hit, then the row is inserted into the local subset of &lt;code&gt;S'&lt;/code&gt;. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Design 2 merges all of the small bloom filters together to avoid multiple probes, and design 3 parallelizes the merging process.&lt;/p&gt;
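&lt;p&gt;The merging in designs 2 and 3 works because Bloom filters with the same bit layout and hash functions compose under bitwise OR, so one probe of the merged filter replaces &lt;code&gt;N&lt;/code&gt; probes. A minimal sketch (filters as byte arrays of 0/1 bits):&lt;/p&gt;

```python
# OR-merge same-shaped Bloom filters: a key present in any input filter
# is also reported present by the merged filter.
def merge_filters(filters):
    merged = bytearray(len(filters[0]))
    for bf in filters:
        for i, bit in enumerate(bf):
            merged[i] |= bit
    return merged
```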

&lt;h2&gt;Shuffle&lt;/h2&gt;

&lt;p&gt;If both tables are roughly the same size, then shuffling is likely more efficient. Shuffling is based on the following property of relational algebra (this is the 3rd post with this same formula): &lt;code&gt;R ⋈ S = (R1 ⋈ S1) ∪ (R2 ⋈ S2)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In English: partition &lt;code&gt;R&lt;/code&gt; and &lt;code&gt;S&lt;/code&gt; into two partitions (based on hashing the join key) and then perform partition-wise joins.&lt;/p&gt;

&lt;p&gt;In the distributed setting, the number of partitions can equal the number of nodes.&lt;/p&gt;
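&lt;p&gt;The identity is easy to check on toy tables. In this sketch, Python’s builtin &lt;code&gt;hash&lt;/code&gt; stands in for a real partitioning hash:&lt;/p&gt;

```python
# Hash-partitioning both sides by join key and joining partition-wise
# gives the same rows as the full join.
def hash_join(r, s):
    by_key = {}
    for key, payload in r:
        by_key.setdefault(key, []).append(payload)
    return [(key, a, b) for key, b in s for a in by_key.get(key, [])]

def partitioned_join(r, s, n):
    out = []
    for i in range(n):
        ri = [row for row in r if hash(row[0]) % n == i]  # partition of R
        si = [row for row in s if hash(row[0]) % n == i]  # partition of S
        out.extend(hash_join(ri, si))                     # partition-wise join
    return out
```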

&lt;p&gt;Fig. 4(b) illustrates shuffle-based predicate transfer with (&lt;code&gt;N&lt;/code&gt;=2) nodes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx4wjeetm64iffnrr134.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx4wjeetm64iffnrr134.png" width="545" height="665"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3725259" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3725259&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Node &lt;code&gt;i&lt;/code&gt; partitions its local subset of &lt;code&gt;R&lt;/code&gt; and &lt;code&gt;S&lt;/code&gt; into &lt;code&gt;N&lt;/code&gt; partitions (&lt;code&gt;Ri.JK1&lt;/code&gt;, &lt;code&gt;Ri.JK2&lt;/code&gt;, …), using a hash of the join key to assign each row to a partition. Partitions of &lt;code&gt;R&lt;/code&gt; are only used to compute local bloom filters (one per partition). The resulting bloom filters and partitions of S are sent across the network. For example, &lt;code&gt;Si.JK2&lt;/code&gt; is sent to node 2, and similarly the bloom filter derived from &lt;code&gt;Ri.JK2&lt;/code&gt; is sent to node 2.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each node iterates through all join keys in the partition of &lt;code&gt;S&lt;/code&gt; that it just received, probing all bloom filters. If there is a hit in any bloom filter, then the join key is inserted into one of &lt;code&gt;N&lt;/code&gt; bloom filters (the bloom filter index depends on &lt;strong&gt;which node the row originally came from&lt;/strong&gt;). These bloom filters are sent back to the associated nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each node iterates through its local subset of &lt;code&gt;S&lt;/code&gt;. For each row, the join key is used to determine which node computed the corresponding bloom filter. That bloom filter is used to check to see if the row should be inserted into the local subset of &lt;code&gt;S'&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In step 2, each node acts like an RPC server: it handles requests and sends responses. The request payload is a subset of &lt;code&gt;S&lt;/code&gt;. The response payload is a bloom filter which represents the subset of that subset which should be included in &lt;code&gt;S'&lt;/code&gt;.&lt;/p&gt;
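&lt;p&gt;The three steps above can be condensed into a toy simulation. Here exact key sets stand in for Bloom filters, so there are no false positives; a real system ships compact filters instead:&lt;/p&gt;

```python
# Toy shuffle-based predicate transfer over n in-memory "nodes".
def shuffle_prefilter(r_by_node, s_by_node, n):
    # Step 1: partition local R keys and S rows by hash of the join key.
    r_keys_at = [set() for _ in range(n)]   # merged R-key "filters", per owner
    s_rows_at = [[] for _ in range(n)]      # shuffled S rows as (row, origin)
    for node in range(n):
        for key in r_by_node[node]:
            r_keys_at[hash(key) % n].add(key)
        for row in s_by_node[node]:
            s_rows_at[hash(row[0]) % n].append((row, node))
    # Step 2: each owner probes its filter, building one response "filter"
    # per origin node, then sends it back.
    response = [[set() for _ in range(n)] for _ in range(n)]  # [owner][origin]
    for owner in range(n):
        for row, origin in s_rows_at[owner]:
            if row[0] in r_keys_at[owner]:
                response[owner][origin].add(row[0])
    # Step 3: each node keeps the local S rows whose key survived at its owner.
    return [
        [row for row in s_by_node[node]
         if row[0] in response[hash(row[0]) % n][node]]
        for node in range(n)
    ]
```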

&lt;h2&gt;Results&lt;/h2&gt;

&lt;p&gt;Fig. 6 has results for both end-to-end time, and the amount of data sent over the network. &lt;code&gt;NoPT&lt;/code&gt; is the vanilla baseline, &lt;code&gt;QS&lt;/code&gt; is prior work that tries to achieve a similar goal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4h9t6k0fouessg96qcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4h9t6k0fouessg96qcd.png" width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://dl.acm.org/doi/10.1145/3725259" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3725259&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Dangling Pointers&lt;/h2&gt;

&lt;p&gt;I’ve added the &lt;code&gt;SlowRandomAccess&lt;/code&gt; tag to this one: Bloom filter insertion and probe operations require a small amount of compute, and then at least one random read/write. It would be amazing if there were another &lt;a href="https://dl.acm.org/doi/10.1145/800133.804332" rel="noopener noreferrer"&gt;approximate membership testing&lt;/a&gt; algorithm that was more friendly to the memory hierarchy. In this paper, this could be a weak point for scalability: at most steps there are multiple bloom filters at play, so the total working set for all bloom filters accessed by a single node in a single step is large.&lt;/p&gt;

&lt;p&gt;In the shuffling case, bloom filter representations of subsets of &lt;code&gt;R&lt;/code&gt; are sent across the network (nice for reducing networking bandwidth), but the actual contents of &lt;code&gt;S&lt;/code&gt; must be sent. I believe this is because there is no efficient way to compute the intersection of two sets represented by two bloom filters.&lt;/p&gt;

</description>
      <category>database</category>
    </item>
    <item>
      <title>To PRI or Not To PRI, That's the question</title>
      <dc:creator>Blake Pelton</dc:creator>
      <pubDate>Fri, 26 Sep 2025 12:02:41 +0000</pubDate>
      <link>https://dev.to/dangling_pointers_0bfce7ce6993/to-pri-or-not-to-pri-thats-the-question-4834</link>
      <guid>https://dev.to/dangling_pointers_0bfce7ce6993/to-pri-or-not-to-pri-thats-the-question-4834</guid>
      <description>&lt;p&gt;&lt;em&gt;This was originally posted on &lt;a href="//danglingpointers.substack.com"&gt;Dangling Pointers&lt;/a&gt;.  My goal is to help busy people stay current with recent academic developments.  Head there to subscribe for regular summaries of computer science research.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.usenix.org/conference/osdi25/presentation/wang-yun" rel="noopener noreferrer"&gt;To PRI or Not To PRI, That's the question&lt;/a&gt; Yun Wang, Liang Chen, Jie Ji, Xianting Tian, and Ben Luo, Zhixiang Wei, Zhibai Huang, and Kailiang Xu, Kaihuan Peng, Kaijie Guo, Ning Luo, Guangjian Wang, Shengdong Dai, Yibin Shen, Jiesheng Wu, and Zhengwei Qi &lt;em&gt;OSDI'25&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Fast IO and Oversubscription&lt;/h2&gt;

&lt;p&gt;The problem this paper addresses comes from the tension between two requirements in cloud environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Fast, virtualized IO&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DRAM oversubscription&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PCIe has bells and whistles to enable fast, virtualized IO. With &lt;em&gt;Single Root I/O Virtualization&lt;/em&gt; (SR-IOV), a device (e.g., a NIC) can advertise many virtual functions (VFs). Each virtual function can be mapped directly into a VM. Each VF appears to a VM as a dedicated NIC which the VM can directly access. For example, a VM can send network packets without a costly hypervisor switch on each packet.&lt;/p&gt;

&lt;p&gt;Oversubscription allows more VMs to be packed onto a single server, taking advantage of the fact that it is rare that all VMs actually need all of the memory they have been allocated. The hypervisor can sneakily move rarely used pages out to disk, even though the guest OS still thinks those pages are resident.&lt;/p&gt;

&lt;p&gt;The trouble with putting these two (SR-IOV and DRAM oversubscription) together is that devices typically require that any page they access in host memory be pinned. In other words, the device cannot handle a page fault when doing a DMA read or write. This stops the hypervisor from paging out any page which may be accessed by an I/O device.&lt;/p&gt;

&lt;p&gt;This paper describes the &lt;em&gt;Page Request Interface&lt;/em&gt; (PRI) of PCIe, which enables devices to handle page faults during DMA. The trouble with this interface is that the end-to-end latency of handling a page fault is high:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Mellanox [20] and VPRI focus on optimizing the latency of the IOPF process, claiming that the entire IOPF handling cycle introduces a latency of a few hundred milliseconds.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A NIC cannot hide this latency, and thus a page fault causes packets to be dropped. Additionally, PRI is relatively new and does not have OS support in many VMs that are running in the cloud today.&lt;/p&gt;

&lt;p&gt;Fig. 1 shows data from a production environment which indicates that 30 additional VMs could be packed into the environment if SR-IOV could be made to not prevent DRAM oversubscription:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g3664wdir53o579pl9c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g3664wdir53o579pl9c.png" width="766" height="779"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.usenix.org/conference/osdi25/presentation/wang-yun" rel="noopener noreferrer"&gt;https://www.usenix.org/conference/osdi25/presentation/wang-yun&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;VM Categorization&lt;/h2&gt;

&lt;p&gt;Fig. 4 argues for a two-pronged approach (“there are two types of VMs in this world”):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xgk2geyrunsn2e4k8wa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xgk2geyrunsn2e4k8wa.png" width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.usenix.org/conference/osdi25/presentation/wang-yun" rel="noopener noreferrer"&gt;https://www.usenix.org/conference/osdi25/presentation/wang-yun&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The key observation is that most VMs don’t do high frequency IO. The paper proposes to dynamically classify VMs into two IO frequency buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For VMs with low frequency IO, the hypervisor is in the loop for each IO operation and moves pages between disk/DRAM to allow oversubscription. The paper calls this mode &lt;em&gt;IOPA-Snoop&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For VMs with high frequency IO, the hypervisor ensures that all pages stay resident, and the hypervisor gets out of the way as much as possible. The paper calls this mode &lt;em&gt;Passthrough&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At any point in time, most VMs are in IOPA-Snoop mode, and the hypervisor benefits from DRAM oversubscription for these VMs.&lt;/p&gt;

&lt;p&gt;The engineering marvel here is that this system works without any changes to the guest OS. The fine print on that point: the guest OS must use &lt;a href="https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html" rel="noopener noreferrer"&gt;VirtIO&lt;/a&gt; drivers.&lt;/p&gt;

&lt;h2&gt;Implementation&lt;/h2&gt;

&lt;p&gt;The system described in this paper leverages &lt;a href="https://en.wikipedia.org/wiki/Second_Level_Address_Translation" rel="noopener noreferrer"&gt;nested paging&lt;/a&gt;. Intel’s implementation is called &lt;em&gt;Extended Page Table&lt;/em&gt; (EPT). Each ring buffer used to communicate with the device (i.e., &lt;em&gt;Native Ring&lt;/em&gt;) has a &lt;em&gt;Shadow Ring&lt;/em&gt; buffer associated with it. EPT is used to atomically switch the guest VM between the two rings. Similarly, the &lt;em&gt;I/O page table&lt;/em&gt; (IOPT) in the IOMMU is used to atomically switch the device between the two rings.&lt;/p&gt;

&lt;p&gt;When operating in IOPA-Snoop mode, the guest OS writes packet descriptors into the shadow ring. A module in the hypervisor detects a change to the shadow ring, moves pages to/from disk as necessary, and then updates the native ring, thus triggering the device to perform the IO. When operating in elastic passthrough mode, the guest OS writes packet descriptors directly into the native ring, and the hardware processes them immediately.&lt;/p&gt;
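&lt;p&gt;A toy model of that snoop-mode data path (all names and structures here are my own simplification, not the paper’s implementation):&lt;/p&gt;

```python
# Toy IOPA-Snoop flow: the unmodified guest posts descriptors to a shadow
# ring; a hypervisor snooper makes every referenced page resident, then
# forwards the descriptor to the native ring the device actually consumes.
class SnoopedRings:
    def __init__(self, resident_pages):
        self.shadow_ring = []           # guest-visible ring
        self.native_ring = []           # device-visible ring
        self.resident = set(resident_pages)

    def guest_post(self, descriptor):
        self.shadow_ring.append(descriptor)   # guest OS needs no changes

    def hypervisor_snoop(self):
        while self.shadow_ring:
            desc = self.shadow_ring.pop(0)
            for page in desc["pages"]:
                if page not in self.resident:
                    self.resident.add(page)   # page in from disk before DMA
            self.native_ring.append(desc)     # now safe for the device
```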

&lt;p&gt;Before transitioning from snoop to passthrough mode, the hypervisor disables DRAM oversubscription (and pages in all pages the device could possibly access). Section 3.3 has more details on how the transition is implemented.&lt;/p&gt;

&lt;h2&gt;Results&lt;/h2&gt;

&lt;p&gt;Fig. 11 shows throughput for IOPA-Snoop, Passthrough, and hardware-based page fault handling (i.e., VPRI). Results are normalized to passthrough throughput (i.e., 100 is the speed at which passthrough mode operates). The right-hand side shows the significant cost of hardware page fault handling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyawvi2uw98a88g4q4u8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyawvi2uw98a88g4q4u8.png" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.usenix.org/conference/osdi25/presentation/wang-yun" rel="noopener noreferrer"&gt;https://www.usenix.org/conference/osdi25/presentation/wang-yun&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dangling Pointers
&lt;/h2&gt;

&lt;p&gt;I wonder how much there is to be gained by a NIC-specific vertical solution. &lt;a href="https://danglingpointers.substack.com/p/disentangling-the-dual-role-of-nic" rel="noopener noreferrer"&gt;Disentangling the Dual Role of NIC Receive Rings&lt;/a&gt; indicates that there could be significant performance to be gained from cooperation between the NIC, hypervisor, and guest OS. For example, a portion of host memory could be dedicated to holding packets received by the NIC, and that memory could be dynamically shared among all guest VMs.&lt;/p&gt;

</description>
      <category>networking</category>
    </item>
    <item>
      <title>Pushing the Limits of In-Network Caching for Key-Value Stores</title>
      <dc:creator>Blake Pelton</dc:creator>
      <pubDate>Wed, 24 Sep 2025 15:01:26 +0000</pubDate>
      <link>https://dev.to/dangling_pointers_0bfce7ce6993/pushing-the-limits-of-in-network-caching-for-key-value-stores-2ak5</link>
      <guid>https://dev.to/dangling_pointers_0bfce7ce6993/pushing-the-limits-of-in-network-caching-for-key-value-stores-2ak5</guid>
      <description>&lt;p&gt;&lt;em&gt;This was originally posted on &lt;a href="//danglingpointers.substack.com"&gt;Dangling Pointers&lt;/a&gt;.  My goal is to help busy people stay current with recent academic developments.  Head there to subscribe for regular summaries of computer science research.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.usenix.org/conference/nsdi25/presentation/kim" rel="noopener noreferrer"&gt;Pushing the Limits of In-Network Caching for Key-Value Stores&lt;/a&gt; Gyuyeong Kim &lt;em&gt;NSDI'25&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Load Balancing Cache
&lt;/h2&gt;

&lt;p&gt;I generally think of a cache as serving the purpose of load reduction, (e.g., reducing the total number of DRAM accesses or backend server requests). The purpose here is different: &lt;strong&gt;load balancing&lt;/strong&gt;. This paper builds on &lt;a href="https://dl.acm.org/doi/10.1145/2038916.2038939" rel="noopener noreferrer"&gt;prior work&lt;/a&gt; which defined the &lt;em&gt;small cache effect&lt;/em&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;small cache effect: we can balance loads for &lt;em&gt;N&lt;/em&gt; servers (or partitions) by caching the O(&lt;em&gt;N&lt;/em&gt; log &lt;em&gt;N&lt;/em&gt;) hottest items, regardless of the number of items&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Imagine a key-value store with keys sharded across multiple backend servers. By configuring the network switches (which are already present in the network) correctly, read requests for hot keys can be handled entirely by the switch, which balances the load across the backend servers.&lt;/p&gt;
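&lt;p&gt;To get a feel for how small the cache can be, here is the bound evaluated in a few lines of Python (the constant factor is my assumption, not from the paper):&lt;/p&gt;

```python
import math

def small_cache_size(n_servers, c=1.0):
    """Hot-item count sufficient to balance load across n_servers
    partitions, per the O(N log N) small-cache-effect bound.
    The constant factor c is assumed, not taken from the paper."""
    return math.ceil(c * n_servers * math.log2(n_servers))

# The bound is independent of the total number of items: whether the
# store holds thousands or billions of keys, 128 partitions need only
# on the order of a thousand cached hot items.
assert small_cache_size(128) == 896
assert small_cache_size(1024) == 10240
```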

&lt;h2&gt;
  
  
  Everything Looks Like a Nail
&lt;/h2&gt;

&lt;p&gt;The core of this paper is a creative approach to configuring a &lt;em&gt;reconfigurable match table&lt;/em&gt; (RMT) switch to act as a load balancing cache. The RMT architecture contains a pipeline of stages, where each stage has access to dedicated SRAM and TCAM memories. The specific switch used in this paper is an &lt;a href="https://www.intel.com/content/www/us/en/products/details/network-io/intelligent-fabric-processors/tofino.html" rel="noopener noreferrer"&gt;Intel Tofino&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The natural thing to try is to store the hot items in the SRAM/TCAM associated with each stage. When a cache read request packet arrives at the switch, the packet can flow through the RMT pipeline searching for matches (TCAM should be very helpful). That is &lt;strong&gt;not&lt;/strong&gt; the approach taken in this work. Section 2 of the paper goes into depth about the drawbacks of this approach (e.g., handling variable length keys and values).&lt;/p&gt;

&lt;p&gt;Here is the punchline: &lt;em&gt;store read requests in switch memory&lt;/em&gt;. When a read request arrives at the switch, the switch determines if the request is for a hot item. If so, it searches for an empty spot in on-chip memory and stores the request there. Hot items are not stored in switch memory; rather, &lt;em&gt;they are continuously recycled through the switch&lt;/em&gt;. It is as if another component in the network were continuously sending the switch packets representing the hot (key, value) pairs. No such component is necessary, however, because switches have the ability to recycle packets back through themselves.&lt;/p&gt;

&lt;p&gt;This cache is called &lt;em&gt;OrbitCache&lt;/em&gt;, because you can think of the recycled packets as moons orbiting a planet (the switch).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;When a new (key, value) pair is cached, a (fixed length) hash of the key is stored in the &lt;em&gt;lookup table&lt;/em&gt; (in switch memory), and the (key, value) pair is continuously recycled through the switch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When a read request arrives at the switch, a hash of the key is used to check the lookup table for a match. If there is a match (cache hit), then the read request is stored in switch memory. If there is no match (cache miss), then the read request is forwarded to the appropriate backend server.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When a cached (key, value) pair is recycled through the switch, a hash of the key is used to check if there are pending read requests in cache memory. If so, for each pending read request, a response packet is generated and sent back to the client.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Variable length keys are handled by hashing all key bytes down to a fixed length. Hash collisions are detected and handled by the client (this assumes that collisions are rare).&lt;/p&gt;

&lt;p&gt;Variable length values are naturally handled by all of the networking protocol support for variable length packets (up to an MTU).&lt;/p&gt;
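&lt;p&gt;Here is an illustrative simulation of the control flow in the steps above (all names are my inventions; the real implementation is P4 match-action tables on the switch, not Python):&lt;/p&gt;

```python
# Toy model of OrbitCache: hot (key, value) pairs live in recycled
# packets, and the switch stores only a lookup table of key hashes plus
# parked read requests.
import hashlib

def h(key: bytes) -> int:
    # Fixed-length hash of a variable-length key; collision handling is
    # left to clients, as in the paper.
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")

class Switch:
    def __init__(self):
        self.lookup = set()   # hashes of currently cached ("orbiting") keys
        self.pending = {}     # key hash to list of parked read requests

    def admit(self, key):
        self.lookup.add(h(key))          # key is now in orbit

    def on_read_request(self, key, client):
        kh = h(key)
        if kh in self.lookup:            # hit: park the request in SRAM
            self.pending.setdefault(kh, []).append(client)
            return None
        return ("to_backend", key)       # miss: forward to a backend server

    def on_recycled_packet(self, key, value):
        # Each orbit pass drains any parked requests for this key.
        return [(c, value) for c in self.pending.pop(h(key), [])]

sw = Switch()
sw.admit(b"hot-key")
assert sw.on_read_request(b"hot-key", "client-1") is None
assert sw.on_read_request(b"cold-key", "client-2") == ("to_backend", b"cold-key")
assert sw.on_recycled_packet(b"hot-key", b"value") == [("client-1", b"value")]
```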

&lt;p&gt;Section 3 of the paper goes into more details (e.g., how cache coherence is maintained).&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Fig. 13 has results for Twitter workloads. NetCache is prior work that stores cached data in switch memory. NetCache doesn’t perform as well because it cannot cache all items due to key/value size limits.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6adm8zi2mxijqodksk1h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6adm8zi2mxijqodksk1h.png" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.usenix.org/system/files/nsdi25-kim.pdf" rel="noopener noreferrer"&gt;https://www.usenix.org/system/files/nsdi25-kim.pdf&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dangling Pointers
&lt;/h2&gt;

&lt;p&gt;It seems like &lt;a href="https://www.intel.com/content/www/us/en/products/details/network-io/intelligent-fabric-processors/tofino.html" rel="noopener noreferrer"&gt;Tofino&lt;/a&gt; is a great networking research platform. I imagine many academics were saddened when Intel &lt;a href="https://www.tomshardware.com/news/intel-sunsets-network-switch-biz-kills-risc-v-pathfinder-program" rel="noopener noreferrer"&gt;exited the network switch business&lt;/a&gt;. The world could use a research platform for switches à la &lt;a href="https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-158.html" rel="noopener noreferrer"&gt;RAMP&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Switches can contain DRAM; I wonder how well it could be used as a larger (and slower) cache.&lt;/p&gt;

</description>
      <category>networking</category>
    </item>
    <item>
      <title>State-Compute Replication: Parallelizing High-Speed Stateful Packet Processing</title>
      <dc:creator>Blake Pelton</dc:creator>
      <pubDate>Mon, 22 Sep 2025 15:01:09 +0000</pubDate>
      <link>https://dev.to/dangling_pointers_0bfce7ce6993/state-compute-replication-parallelizing-high-speed-stateful-packet-processing-26b3</link>
      <guid>https://dev.to/dangling_pointers_0bfce7ce6993/state-compute-replication-parallelizing-high-speed-stateful-packet-processing-26b3</guid>
      <description>&lt;p&gt;&lt;em&gt;This was originally posted on &lt;a href="//danglingpointers.substack.com"&gt;Dangling Pointers&lt;/a&gt;.  My goal is to help busy people stay current with recent academic developments.  Head there to subscribe for regular summaries of computer science research.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.usenix.org/conference/nsdi25/presentation/xu-qiongwen" rel="noopener noreferrer"&gt;State-Compute Replication: Parallelizing High-Speed Stateful Packet Processing&lt;/a&gt; Qiongwen Xu et al. &lt;em&gt;NSDI'25&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It is difficult for me to say if this idea is brilliant or crazy. I suspect it will force you to change some intuitions.&lt;/p&gt;

&lt;h2&gt;
  
  
  State, Skew, Parallelism
&lt;/h2&gt;

&lt;p&gt;The goal is laudable: a framework to support generic packet processing on CPUs with the following properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Shared state between packets (this state can be read and written when processing each packet)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;High performance with skewed distributions of flows (some flows may dominate the total amount of work, i.e., receive side scaling won’t work)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Throughput increases with the number of CPU cores used &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As I see it, there are two options for parallel processing of network packets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Distribute packets across cores&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pipeline parallelism&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both pose different problems for handling shared state. This paper advocates for distributing packets among cores and has a fascinating technique for solving the shared state problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fast Forward State Replication
&lt;/h2&gt;

&lt;p&gt;In this model, each core has a private replica of shared state. Incoming packets are sent to each core in a round-robin fashion. At this point you may be totally confused: how can this ever work? Enter &lt;em&gt;the sequencer&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The sequencer is a hardware component (it could be a feature of the NIC or switch) which processes all incoming packets sequentially and appends a “packet history” to each packet immediately before it is sent to be processed by a core.&lt;/p&gt;

&lt;p&gt;Say that four cores are used to implement a DDoS mitigator, and the crux of the DDoS mitigator state update depends on IP addresses from incoming packets. Each core receives a full copy of every fourth packet. Along with every fourth packet, each core also receives the IP addresses of the previous three packets (the packets that were sent to the other cores).&lt;/p&gt;

&lt;p&gt;The per-core DDoS mitigator first updates its local state replica using the IP addresses of the previous three packets from the packet history. The paper calls this &lt;em&gt;fast-forwarding&lt;/em&gt; the local state replica. At this point, the core has a fresh view of the state and can process the incoming packet.&lt;/p&gt;
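&lt;p&gt;A toy sketch of the sequencer and the fast-forward step, using a per-IP counter as the shared state (the details are my inventions; the real sequencer is a hardware component):&lt;/p&gt;

```python
# State-compute replication with 4 cores: the sequencer attaches to each
# packet the state-update fields (here, source IPs) of the packets that
# were dispatched to other cores since this core's previous packet.
from collections import Counter

def sequencer(packets, n_cores):
    last = [0] * n_cores            # index just past each core's last packet
    for i, pkt in enumerate(packets):
        core = i % n_cores          # round-robin dispatch
        history = [p["src_ip"] for p in packets[last[core]:i]]
        last[core] = i + 1
        yield core, pkt, history

class Core:
    def __init__(self):
        self.counts = Counter()     # private replica of the shared state

    def process(self, pkt, history):
        for ip in history:          # fast-forward the replica first...
            self.counts[ip] += 1
        self.counts[pkt["src_ip"]] += 1   # ...then process this packet
        # (the expensive per-packet work, e.g. building and sending a
        #  response, would happen here, in parallel across cores)

pkts = [{"src_ip": ip} for ip in "abacabac"]
cores = [Core() for _ in range(4)]
for core_id, pkt, hist in sequencer(pkts, 4):
    cores[core_id].process(pkt, hist)

# The core that handled the final packet has a fully up-to-date replica:
assert cores[3].counts == Counter({"a": 4, "b": 2, "c": 2})
```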

&lt;p&gt;Another way of thinking about this is that network functions can be decomposed into two steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shared state update&lt;/strong&gt; , which is relatively inexpensive, and only depends on a few packet fields&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Packet processing&lt;/strong&gt; , which is relatively expensive, and consumes both the shared state and a full packet as input&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;The shared state update computation is performed redundantly by all cores; there is no parallel speedup associated with it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The authors argue that many network functions can be decomposed in such a manner and also argue that packet processing is fundamentally expensive because it involves the CPU overhead of programming a NIC to send an outgoing packet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Fig. 6 has results. The baseline (orange) implementation suffers in many cases because of cross-core synchronization. The sharding implementations suffer in cases of flow skew (a few flows dominating the total cost).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijwj9fjgoixw1cczqr79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijwj9fjgoixw1cczqr79.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.usenix.org/system/files/nsdi25-xu-qiongwen.pdf" rel="noopener noreferrer"&gt;https://www.usenix.org/system/files/nsdi25-xu-qiongwen.pdf&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dangling Pointers
&lt;/h2&gt;

&lt;p&gt;There are heterodox and orthodox assumptions underpinning this paper.&lt;/p&gt;

&lt;p&gt;An orthodox assumption is that the packet sequencer must run in hardware because it must process packets in order, and that is the sort of thing that dedicated hardware is much better at than a multi-core CPU.&lt;/p&gt;

&lt;p&gt;A heterodox assumption is that many network functions can be expressed with the fast-forward idiom.&lt;/p&gt;

&lt;p&gt;I wonder about the orthodox assumption. Hardware isn’t magic: is there a fundamental reason a multi-core CPU cannot handle 100 Gbps network processing with only pipeline parallelism?&lt;/p&gt;

</description>
      <category>networking</category>
    </item>
    <item>
      <title>Scaling IP Lookup to Large Databases using the CRAM Lens</title>
      <dc:creator>Blake Pelton</dc:creator>
      <pubDate>Wed, 17 Sep 2025 14:03:10 +0000</pubDate>
      <link>https://dev.to/dangling_pointers_0bfce7ce6993/scaling-ip-lookup-to-large-databases-using-the-cram-lens-an0</link>
      <guid>https://dev.to/dangling_pointers_0bfce7ce6993/scaling-ip-lookup-to-large-databases-using-the-cram-lens-an0</guid>
      <description>&lt;p&gt;&lt;em&gt;This was originally posted on &lt;a href="//danglingpointers.substack.com"&gt;Dangling Pointers&lt;/a&gt;.  My goal is to help busy people stay current with recent academic developments.  Head there to subscribe for regular summaries of computer science research.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.usenix.org/conference/nsdi25/presentation/chang" rel="noopener noreferrer"&gt;Scaling IP Lookup to Large Databases using the CRAM Lens&lt;/a&gt; Robert Chang, Pradeep Dogga, Andy Fingerhut, Victor Rios, and George Varghese &lt;em&gt;NSDI'25&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  IP Lookup
&lt;/h1&gt;

&lt;p&gt;This paper introduces a new computational model (&lt;em&gt;CRAM&lt;/em&gt;) and then applies that model to the problem of &lt;em&gt;IP lookup&lt;/em&gt;. IP lookup is simply a mapping of an IP address to arbitrary data (e.g., the address of the next hop that a packet should be routed to). The mapping is described by (relatively static) routing tables.&lt;/p&gt;

&lt;p&gt;The tricky part is that the keys in a routing table can contain wildcards. Table 1 contains a simple example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe80cxiqrw62o1o6yt9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe80cxiqrw62o1o6yt9p.png" width="800" height="658"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.usenix.org/conference/nsdi25/presentation/chang" rel="noopener noreferrer"&gt;https://www.usenix.org/conference/nsdi25/presentation/chang&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Content Addressable Memory
&lt;/h1&gt;

&lt;p&gt;Here is a refresher if you’ve forgotten the difference between SRAM, CAM, and TCAM. All three are types of on-chip memories.&lt;/p&gt;

&lt;p&gt;SRAM behaves like an array: it maps integer keys to values. SRAM is dense: if the key width is 10 bits, then the SRAM contains 1024 entries.&lt;/p&gt;

&lt;p&gt;CAM behaves like a hash table: it maps integer keys to values, but its storage is sparse. For example, a CAM could have an input key width of 10 bits but only have storage for 64 entries.&lt;/p&gt;

&lt;p&gt;A TCAM is like a CAM but supports wildcards in the key bits. For example, a TCAM could have 10-bit input keys and contain a key→value mapping like (&lt;code&gt;1010****11&lt;/code&gt; → 35). The input key &lt;code&gt;1010000011&lt;/code&gt; would produce a value of 35, and so would the input key &lt;code&gt;1010110111&lt;/code&gt;.&lt;/p&gt;
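&lt;p&gt;A TCAM entry can be modeled as a pattern with wildcard positions. Here is a minimal sketch using the 10-bit example above (the string representation is just for illustration; real TCAMs store value/mask bit pairs and match all entries in parallel):&lt;/p&gt;

```python
# Minimal TCAM model: '*' in a pattern matches either key bit; the
# first matching entry wins (entry order stands in for priority).
def tcam_match(entries, key):
    for pattern, value in entries:
        if all(p == "*" or p == k for p, k in zip(pattern, key)):
            return value
    return None   # no entry matched

entries = [("1010****11", 35)]
assert tcam_match(entries, "1010000011") == 35
assert tcam_match(entries, "1010110111") == 35
assert tcam_match(entries, "0010000011") is None
```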

&lt;h1&gt;
  
  
  RMT and dRMT
&lt;/h1&gt;

&lt;p&gt;The CRAM model introduced by this paper is a simplified computational model for hardware which works well for this type of application. The elevator pitch for this model is that it is simple, and yet accurately predicts performance for two widely used network processing hardware architectures: &lt;em&gt;RMT&lt;/em&gt; and &lt;em&gt;dRMT&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The idea is that a hardware architect or network engineer could use the CRAM model to do performance analysis before diving into the weeds.&lt;/p&gt;

&lt;p&gt;Fig. 11 illustrates the RMT and dRMT hardware architectures:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2vf1zi6jiqvjyndzxin.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2vf1zi6jiqvjyndzxin.png" width="677" height="571"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.usenix.org/conference/nsdi25/presentation/chang" rel="noopener noreferrer"&gt;https://www.usenix.org/conference/nsdi25/presentation/chang&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the context of IP lookup, the four hardware blocks illustrated in Fig. 11 perform the following tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Match compares packet addresses against addresses in the routing table (key comparison)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Action forwards packets based on the data read from the routing table&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CAM represents CAM or TCAM memories&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RAM represents SRAM&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RMT is a more rigid pipeline, whereas dRMT is more similar to a multi-core processor where each core has access to a shared pool of CAM and RAM.&lt;/p&gt;

&lt;h1&gt;
  
  
  CRAM Model
&lt;/h1&gt;

&lt;p&gt;The core of the CRAM model is a DAG. Each vertex represents a &lt;em&gt;step&lt;/em&gt; of the computation. Data flows between steps in a fixed-size array of fixed-sized registers.&lt;/p&gt;

&lt;p&gt;A step comprises an optional table lookup, followed by a sequence of statements.&lt;/p&gt;

&lt;p&gt;The lookup table can be SRAM, CAM or TCAM. Table lookup keys come from registers, and table lookup results are stored in registers. The lookup table itself is coupled with the step. There is no way for step &lt;code&gt;N&lt;/code&gt; to access the lookup table associated with step &lt;code&gt;M&lt;/code&gt;. Note that in the strict CRAM model, tables cannot be stored off-chip (e.g., DRAM).&lt;/p&gt;

&lt;p&gt;Each statement looks more or less like a typical RISC instruction (e.g., &lt;code&gt;R1 = R2 ^ R3&lt;/code&gt;). All statements within a step run in parallel (after the table lookup).&lt;/p&gt;
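&lt;p&gt;Here is a hypothetical encoding of a single CRAM step as data (my own construction, not the paper’s notation). The optional table lookup runs first; every statement then reads the post-lookup registers and writes its result, which models the parallel-statement semantics:&lt;/p&gt;

```python
# One CRAM step: optional table lookup, then statements that all read
# the same (post-lookup) register values, so they run "in parallel".
def run_step(regs, table=None, key_reg=None, dst_reg=None, stmts=()):
    regs = dict(regs)
    if table is not None:                 # table lookup happens first
        regs[dst_reg] = table.get(regs[key_reg])
    new = dict(regs)
    for dst, fn in stmts:                 # every statement reads `regs`,
        new[dst] = fn(regs)               # the pre-statement snapshot
    return new

regs = {"R1": 0, "R2": 3, "R3": 5}
out = run_step(
    regs,
    table={3: 7}, key_reg="R2", dst_reg="R1",    # R1 = table[R2]
    stmts=[("R2", lambda r: r["R1"] ^ r["R3"]),  # sees post-lookup R1 (7)
           ("R3", lambda r: r["R2"] + 1)],       # sees pre-statement R2 (3)
)
assert out == {"R1": 7, "R2": 2, "R3": 4}
```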

&lt;h1&gt;
  
  
  Design Space Exploration
&lt;/h1&gt;

&lt;p&gt;Section 2.2 lists eight idioms which can be used to explore the design space of CRAM graphs (I think of them as CRAM programs). Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Converting a TCAM to an SRAM by duplicating entries, or vice versa&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using a large SRAM to handle common (and simple) cases, and a small TCAM to handle uncommon (and complex) cases&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Breaking a single large lookup table into multiple small lookup tables, with each small lookup table handling a subset of the key bits &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;There are two types of results in this paper:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;New algorithms discovered by the authors using design-space exploration with the CRAM model&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The predictive accuracy of the CRAM model vs the real world&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  RESAIL
&lt;/h2&gt;

&lt;p&gt;The paper describes three new algorithms; &lt;em&gt;RESAIL&lt;/em&gt; is one of them. RESAIL is an incremental improvement to &lt;a href="https://doi.org/10.1145/2740070.2626297" rel="noopener noreferrer"&gt;SAIL&lt;/a&gt; (a state-of-the-art IPv4 lookup algorithm which requires DRAM). RESAIL is based on the fact that most IPv4 addresses (keys) in real-world routing tables have a suffix of wildcards that is at least 8 bits wide (e.g., 54.135.54.*). The &lt;em&gt;prefix length&lt;/em&gt; of an address in the routing table is the number of bits before the wildcard suffix.&lt;/p&gt;

&lt;p&gt;RESAIL uses a small (~3KiB) TCAM to handle routing table entries with a prefix length greater than 24. The set of routing table entries with long prefix lengths is a good fit for TCAM because it is sparse.&lt;/p&gt;

&lt;p&gt;For narrower prefix lengths, RESAIL uses two data structures in SRAM. The first is a set of 24 bitmaps, where bitmap &lt;code&gt;i&lt;/code&gt; has length &lt;code&gt;2^i&lt;/code&gt;. The first &lt;code&gt;i&lt;/code&gt; bits of an IPv4 address index into bitmap &lt;code&gt;i&lt;/code&gt; to check whether that prefix is present in the routing table. All bitmaps can be checked in parallel, and if there are multiple matches, the longest prefix is chosen.&lt;/p&gt;

&lt;p&gt;These bitmaps are also sparse (a whole lot of zeros), but it is still better to use SRAM vs TCAM because each entry is only a single bit wide.&lt;/p&gt;

&lt;p&gt;If the bitmap lookup results in a hit, then a hash table (also stored in SRAM) is used to look up the value associated with the IPv4 address. The prefix length from the bitmap lookup is used when constructing the key for the hash table lookup; this is how wildcard support is added to a traditional hash table. The paper mentions that a hash table with a low probability of collisions is used but does not go into more detail about how collisions are handled.&lt;/p&gt;
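&lt;p&gt;Here is my reconstruction of the RESAIL lookup path for prefix lengths up to 24 (Python sets stand in for the SRAM bitmaps and a dict for the hash table; the TCAM path for longer prefixes is omitted):&lt;/p&gt;

```python
# Illustrative RESAIL-style longest-prefix match: per-length "bitmaps"
# detect a prefix hit, then a hash table keyed by (length, prefix)
# returns the next hop.
class Resail:
    def __init__(self):
        self.bitmaps = {i: set() for i in range(1, 25)}  # sparse bitmaps
        self.values = {}                  # (prefix_len, prefix) to next hop

    def insert(self, addr, plen, next_hop):
        prefix = addr >> (32 - plen)
        self.bitmaps[plen].add(prefix)
        self.values[(plen, prefix)] = next_hop

    def lookup(self, addr):
        # In hardware all 24 bitmaps are probed in parallel; here we
        # just scan from the longest prefix length to the shortest.
        for plen in range(24, 0, -1):
            prefix = addr >> (32 - plen)
            if prefix in self.bitmaps[plen]:
                return self.values[(plen, prefix)]
        return None

def ip(s):
    a, b, c, d = map(int, s.split("."))
    return ((a * 256 + b) * 256 + c) * 256 + d

r = Resail()
r.insert(ip("54.135.54.0"), 24, "hop-A")   # 54.135.54.*
r.insert(ip("54.135.0.0"), 16, "hop-B")    # 54.135.*.*
assert r.lookup(ip("54.135.54.9")) == "hop-A"   # longest prefix wins
assert r.lookup(ip("54.135.99.1")) == "hop-B"
assert r.lookup(ip("10.0.0.1")) is None
```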

&lt;p&gt;Appendix A.5 contains pseudocode for RESAIL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcktbfxspwqm312cx0nq6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcktbfxspwqm312cx0nq6.png" width="727" height="556"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.usenix.org/conference/nsdi25/presentation/chang" rel="noopener noreferrer"&gt;https://www.usenix.org/conference/nsdi25/presentation/chang&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Predictive Accuracy
&lt;/h2&gt;

&lt;p&gt;Table 10 compares TCAM and SRAM usage predicted by the CRAM model against an ideal and an actual RMT architecture. The unspoken assumption underlying all of this is that networking hardware is dominated by on-chip memory; logic isn’t significant.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdirm8r4pc9puw9qskmpl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdirm8r4pc9puw9qskmpl.png" width="726" height="311"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.usenix.org/conference/nsdi25/presentation/chang" rel="noopener noreferrer"&gt;https://www.usenix.org/conference/nsdi25/presentation/chang&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Dangling Pointers
&lt;/h1&gt;

&lt;p&gt;It is a bit odd that CAM and TCAM are so important for networking hardware, but not other chips. Maps/dictionaries are extremely common in general software, so why don’t general-purpose processors have dedicated support? There is probably a lot of work required to design an architecture that allows for virtualization and composition of libraries.&lt;/p&gt;

&lt;p&gt;In some sense, a CPU cache acts a bit like a CAM. I wonder if the same hardware could be reused to support an architecture which had explicit CAM support.&lt;/p&gt;

</description>
      <category>networking</category>
    </item>
    <item>
      <title>Enabling Silent Telemetry Data Transmission with InvisiFlow</title>
      <dc:creator>Blake Pelton</dc:creator>
      <pubDate>Mon, 15 Sep 2025 15:02:41 +0000</pubDate>
      <link>https://dev.to/dangling_pointers_0bfce7ce6993/enabling-silent-telemetry-data-transmission-with-invisiflow-10pk</link>
      <guid>https://dev.to/dangling_pointers_0bfce7ce6993/enabling-silent-telemetry-data-transmission-with-invisiflow-10pk</guid>
      <description>&lt;p&gt;&lt;em&gt;This was originally posted on &lt;a href="//danglingpointers.substack.com"&gt;Dangling Pointers&lt;/a&gt;.  My goal is to help busy people stay current with recent academic developments.  Head there to subscribe for regular summaries of computer science research.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.usenix.org/conference/nsdi25/presentation/zhang-yinda" rel="noopener noreferrer"&gt;Enabling Silent Telemetry Data Transmission with InvisiFlow&lt;/a&gt; Yinda Zhang, Liangcheng Yu, Gianni Antichi, Ran Ben Basat, and Vincent Liu &lt;em&gt;NSDI'25&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Watch out for symptoms of techno-Eeyore syndrome (&lt;em&gt;TES&lt;/em&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You frequently feel like all of the low-hanging fruit has been picked&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Someone has recently asked you: “if this is such a good idea, why aren’t other companies doing it?”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You believe that the only remaining path for innovation requires spending a significant fraction of global GDP on new datacenters&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are experiencing TES, then this paper will snap you out of it. Herein lies a wonderfully elegant solution to a real-world problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Schrödinger's Network Telemetry
&lt;/h2&gt;

&lt;p&gt;The task at hand is collecting comprehensive network telemetry without disrupting the network itself. Services, operating systems, NICs, and switches are all great sources of telemetry. A global view of network telemetry could provide valuable insight into the behavior of a network.&lt;/p&gt;

&lt;p&gt;Collecting information about network activity requires sending telemetry data over the network in question (unless one wants to build a separate network just for telemetry, which would be expensive). The problem this paper addresses is: how to collect such telemetry without altering the behavior of the network itself? The paper cites prior work which claims that enabling network telemetry can degrade application-level network throughput by &lt;strong&gt;20%&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Telemetry Flows Like Water
&lt;/h2&gt;

&lt;p&gt;Here is the refreshingly elegant solution. Designate one or more servers in the network as telemetry &lt;em&gt;collector sinks&lt;/em&gt;. These sinks are the ultimate destination for any packet containing telemetry data. Any device which produces telemetry data is called a &lt;em&gt;source&lt;/em&gt;. Sources produce network packets which contain telemetry information, and those packets make their way through the network until they reach a sink which consumes them.&lt;/p&gt;

&lt;p&gt;The magic of this system is that when a source produces a telemetry packet, the address of the sink &lt;strong&gt;is not known&lt;/strong&gt;. The packet meanders through the network (on uncongested links) until it arrives at a sink. The paper makes a great analogy: telemetry packets flowing through the network are like drops of water flowing down a mountain toward the ocean.&lt;/p&gt;

&lt;p&gt;Each component in the network does its part via the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Maintaining a dedicated buffer (i.e., a &lt;em&gt;telemetry buffer&lt;/em&gt;) for telemetry packets which are waiting to continue their journey through the network&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Periodically sending &lt;em&gt;pull requests&lt;/em&gt; to neighbors (networking equipment physically connected to the component in question). This is somewhat like PFC, where a NIC can send flow-control data to the switch it is connected to (and vice versa). This kind of pull request has nothing to do with a Git pull request. A pull request informs neighboring network components how full the telemetry buffer is. In the hydrodynamic analogy: &lt;strong&gt;the fullness of a telemetry buffer represents how high a networking component is above sea level.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Responding to pull requests from neighbors. When a network component receives a pull request, it checks to see if its own telemetry buffer is more or less full than the neighbor’s buffer. If the neighbor’s buffer is less full, then telemetry packets are moved to the neighbor. Telemetry packets are only sent in windows of time where there are no application-level packets to send on a link (i.e., telemetry packets fill in the empty space on each link).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Telemetry sinks send pull requests when they are ready to ingest more telemetry data. When a telemetry packet arrives at a sink, the sink processes the packet, and the packet disappears from the network.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The authors call this “gradient-based transmission”: the differences in telemetry buffer utilization create a gradient, and packets descend that gradient. If there are many paths between a particular switch and a sink, this gradient information will cause the switch to choose the least-congested path.&lt;/p&gt;

&lt;p&gt;The only hiccup here occurs when most telemetry buffers are empty most of the time. In this case, the gradient is near zero everywhere, and telemetry packets can oscillate rather than make forward progress to a sink. The fix for this is to determine the distance (in network hops) between each network component and the nearest sink. This distance is used to bias the gradient toward sinks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;The authors implemented this algorithm in P4 on Wedge100BF-32X switches.&lt;/p&gt;

&lt;p&gt;Fig. 7 contains some great results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnampqcgznrqlcf6qo588.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnampqcgznrqlcf6qo588.png" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.usenix.org/conference/nsdi25/presentation/zhang-yinda" rel="noopener noreferrer"&gt;https://www.usenix.org/conference/nsdi25/presentation/zhang-yinda&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dangling Pointers
&lt;/h2&gt;

&lt;p&gt;It seems like this could be generalized beyond telemetry. Whenever a network packet could be sent to any one of a set of destinations, it may be beneficial for the sender not to pick a destination at all, and instead to organize the behavior of each network component so that an emergent property of the network carries packets along an uncongested path toward one of the destinations.&lt;/p&gt;

</description>
      <category>networking</category>
    </item>
    <item>
      <title>Enabling Portable and High-Performance SmartNIC Programs with Alkali</title>
      <dc:creator>Blake Pelton</dc:creator>
      <pubDate>Fri, 12 Sep 2025 15:02:21 +0000</pubDate>
      <link>https://dev.to/dangling_pointers_0bfce7ce6993/enabling-portable-and-high-performance-smartnic-programs-with-alkali-4ieh</link>
      <guid>https://dev.to/dangling_pointers_0bfce7ce6993/enabling-portable-and-high-performance-smartnic-programs-with-alkali-4ieh</guid>
      <description>&lt;p&gt;&lt;em&gt;This was originally posted on &lt;a href="//danglingpointers.substack.com"&gt;Dangling Pointers&lt;/a&gt;.  My goal is to help busy people stay current with recent academic developments.  Head there to subscribe for regular summaries of computer science research.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.usenix.org/conference/nsdi25/presentation/lin-jiaxin" rel="noopener noreferrer"&gt;Enabling Portable and High-Performance SmartNIC Programs with Alkali&lt;/a&gt; Jiaxin Lin, Zhiyuan Guo, Mihir Shah, Tao Ji, Yiying Zhang, Daehyeok Kim and Aditya Akella &lt;em&gt;NSDI'25&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  SmartNIC HodgePodge
&lt;/h2&gt;

&lt;p&gt;Fig. 1 illustrates four modern SmartNIC hardware architectures. Note the diversity of implementation options:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62i58cjx9ajert1gaj3r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62i58cjx9ajert1gaj3r.png" width="800" height="222"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.usenix.org/conference/nsdi25/presentation/lin-jiaxin" rel="noopener noreferrer"&gt;https://www.usenix.org/conference/nsdi25/presentation/lin-jiaxin&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Woe to the engineer who must write high-performance SmartNIC programs that are portable across these architectures. &lt;em&gt;Alkali&lt;/em&gt; is a SmartNIC programming framework which enables high-performance SmartNIC programs to be written in a concise, portable way.&lt;/p&gt;

&lt;p&gt;This paper classifies a SmartNIC based on the presence and configuration of three types of resources:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Compute&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reconfigurable match-action pipelines&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The authors view #3 as a solved problem, and thus focus primarily on the first two resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handler Graph
&lt;/h2&gt;

&lt;p&gt;The Alkali framework is based on a compiler. It represents a SmartNIC program as a &lt;em&gt;handler graph&lt;/em&gt;, where each handler (a vertex in the graph) runs inside a single compute unit. A (single-threaded) user-written program is initially transformed into a simple handler graph; compiler optimization passes then transform the graph to improve performance.&lt;/p&gt;

&lt;p&gt;There are two primary ways that a handler graph can be optimized: replication and pipelining.&lt;/p&gt;

&lt;h2&gt;
  
  
  Replication
&lt;/h2&gt;

&lt;p&gt;Replication allows multiple packets to be processed in parallel (across multiple compute units) by replicating handlers. An SMT solver is used to determine how many replicas of each handler should exist and assigns handlers to specific compute units. One input to the SMT solver is a performance model which is used to predict the amount of time a handler will spend processing each packet.&lt;/p&gt;
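&lt;p&gt;The paper uses an SMT solver for this; as a back-of-the-envelope illustration of the constraint the solver must respect, a handler that spends &lt;em&gt;t&lt;/em&gt; time units per packet at an arrival rate of &lt;em&gt;r&lt;/em&gt; packets per time unit needs at least ⌈&lt;em&gt;r·t&lt;/em&gt;⌉ replicas to keep up. The handler names and timings below are made up:&lt;/p&gt;

```python
import math

# Hypothetical per-packet processing times (microseconds) as predicted by
# a performance model, plus a target rate in packets per microsecond.
# Alkali hands numbers like these to an SMT solver; here we just compute
# the lower bound on replica count that any valid assignment must satisfy.

def min_replicas(arrival_rate_pkts_per_us, service_time_us):
    return max(1, math.ceil(arrival_rate_pkts_per_us * service_time_us))

handlers = {"parse": 0.4, "lookup": 2.5, "rewrite": 0.9}  # assumed values
plan = {h: min_replicas(10.0, t) for h, t in handlers.items()}
```

The SMT formulation additionally has to place those replicas on concrete compute units without exceeding per-unit memory and compute budgets, which is where a solver earns its keep over this simple bound.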

&lt;p&gt;This transformation converts a single-threaded input program into a parallel one with shared-memory multi-threading. So, the question you should be asking yourself is: how can this be done correctly? Parallel access to shared data structures is only supported in two cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Accesses to read-only tables can be parallelized&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Accesses to shardable tables can be parallelized (i.e., a table can be sharded across compute units)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Receive side scaling is an example of #2. Fields from a packet header are hashed together to produce an index, and that index is used to access a table. The table can be spread across compute units (e.g., compute unit 0 accesses half of the table while compute unit 1 accesses the other half). Packets are directed to the appropriate handler based on the computed index.&lt;/p&gt;
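&lt;p&gt;A small sketch of that sharding pattern (names and shard count are illustrative, not from the paper): hash the flow’s 5-tuple to an index, and use that index both to pick the table shard and to steer the packet to the replica that owns the shard.&lt;/p&gt;

```python
# Receive-side-scaling-style sharding sketch.  Each compute unit owns one
# shard of the table, and the hash of the packet's 5-tuple decides which
# shard (and therefore which handler replica) a packet is steered to.

NUM_SHARDS = 2  # assumed: one shard per compute unit

def shard_index(src_ip, dst_ip, src_port, dst_port, proto):
    return hash((src_ip, dst_ip, src_port, dst_port, proto)) % NUM_SHARDS

shards = [dict() for _ in range(NUM_SHARDS)]  # one table shard per unit

def lookup(pkt):
    idx = shard_index(*pkt)
    # Invariant: only compute unit idx ever touches shards[idx], so no
    # locking is needed despite the parallelism.
    return shards[idx].get(pkt)
```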

&lt;h2&gt;
  
  
  Pipelining
&lt;/h2&gt;

&lt;p&gt;After replication, the compiler has an estimate of which handler is the bottleneck in the system. The compiler then attempts to split the handler into two, thus adding pipeline parallelism. A min-cut algorithm is used, which attempts to find a cut point that minimizes the amount of data that must flow between pipeline stages.&lt;/p&gt;

&lt;p&gt;Just like replication, the pipelining process must be careful to not introduce correctness problems (i.e., concurrent access to a shared data structure from multiple pipeline stages). Before the min-cut algorithm is run, the compiler inserts synthetic vertices and edges into the graph corresponding to accesses to shared tables. For example, if a handler both reads and writes a table, then a vertex is added to the graph corresponding to the table, with bidirectional edges connecting the synthetic vertex to handlers which access the table. The edges have infinite weights, which prevents them from being cut.&lt;/p&gt;
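&lt;p&gt;A toy version of that cut, under assumed handler names and weights: handlers A → B → C, where B both reads and writes table T, so a synthetic vertex for T is tied to B with infinite-weight edges. On a graph this small we can brute-force the minimum cut rather than run a real min-cut algorithm:&lt;/p&gt;

```python
from itertools import combinations

# Toy pipelining cut.  Edge weights model bytes flowing between handlers;
# the infinite-weight edges between B and its table T guarantee that no
# cut separates a table from a handler that mutates it.
INF = float("inf")
edges = {("A", "B"): 8, ("B", "C"): 4, ("B", "T"): INF, ("T", "B"): INF}
vertices = {"A", "B", "C", "T"}

def cut_weight(stage1):
    # Weight of all edges crossing the stage1 / stage2 boundary.
    return sum(w for (u, v), w in edges.items()
               if (u in stage1) != (v in stage1))

best = None
others = vertices - {"A", "C"}          # "A" starts stage 1, "C" ends stage 2
for r in range(len(others) + 1):
    for extra in combinations(sorted(others), r):
        stage1 = {"A"} | set(extra)
        w = cut_weight(stage1)
        if best is None or best[0] > w:
            best = (w, stage1)
```

Any partition that splits B from T costs infinity, so the cheapest cut (weight 4) lands after B with T co-located in the first stage, exactly the behavior the synthetic vertices are meant to force.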

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Fig. 11 has a performance comparison against an open-source implementation of a network transport. The pitch here is that Alkali offers abstraction and portability without sacrificing performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3vexnx1wxve79v560ww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3vexnx1wxve79v560ww.png" width="540" height="228"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.usenix.org/conference/nsdi25/presentation/lin-jiaxin" rel="noopener noreferrer"&gt;https://www.usenix.org/conference/nsdi25/presentation/lin-jiaxin&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dangling Pointers
&lt;/h2&gt;

&lt;p&gt;The replication and pipelining process assumes that performance is predictable at compile time. Support for dynamic loops would invalidate this assumption, requiring profile-guided optimization or dynamic compilation.&lt;/p&gt;

&lt;p&gt;The two supported shared-memory access patterns seem restrictive; it would be worth understanding what other memory access patterns can be parallelized.&lt;/p&gt;

&lt;p&gt;Section 9 of the paper mentions both of these limitations.&lt;/p&gt;

&lt;p&gt;It is interesting that this framework is cast as SmartNIC-specific. Is there something special about networking that makes pipeline parallelism so attractive? I would think other domains could benefit from a similar framework.&lt;/p&gt;

&lt;p&gt;Rather than a technical solution to SmartNIC programming, there might be an economic one. The early days of 3D graphics were a bit of a zoo, with separate programming frameworks for each hardware vendor. One thing that enabled Direct3D and OpenGL to succeed was industry consolidation. Over time, the market figured out which hardware approaches were best, and the industry converged. Once that convergence reached a certain point, a portable platform became much easier to create. I wonder if SmartNIC programming simply needs time for the market to do its thing and create an ecosystem which is ripe for standardization.&lt;/p&gt;

</description>
      <category>networking</category>
    </item>
  </channel>
</rss>
