Blake Pelton

Efficiently Processing Joins and Grouped Aggregations on GPUs

This was originally posted on Dangling Pointers. My goal is to help busy people stay current with recent academic developments. Head there to subscribe for regular summaries of computer science research.

Efficiently Processing Joins and Grouped Aggregations on GPUs. Bowen Wu, Dimitrios Koutsoukos, and Gustavo Alonso. SIGMOD'25.

The paper describes multiple optimizations; I’ve selected one to cover here. Many of the concepts also apply to the other optimizations described in the paper.

Partitioned Join Refresher

Those of you who read the summary of SPID-Join may remember:

Here, both A and B have been partitioned into smaller relations. In other words, to partition A into N partitions, iterate over each tuple in A and compute a log2(N)-bit-wide hash of the join key. That hash value is the index of the target partition for the current tuple.

This describes a two-step partitioned join. In step one, both input relations are partitioned into N partitions according to the join key(s). In step two, N partition-wise joins are executed. Both steps can be parallelized.
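
To make the partitioning step concrete, here is a minimal NumPy sketch. The function names and the multiplicative hash are illustrative choices of mine, not taken from the paper:

```python
import numpy as np

def bucket_of(keys, num_partitions):
    """Map each join key to a partition index: hash the key, then keep its
    low log2(num_partitions) bits (num_partitions must be a power of two)."""
    h = keys.astype(np.uint64) * np.uint64(0x9E3779B97F4A7C15)  # toy hash
    return (h & np.uint64(num_partitions - 1)).astype(np.int64)

def partition(keys, rows, num_partitions):
    """Split a relation (join keys plus a carried column, here row indices)
    into num_partitions smaller relations."""
    b = bucket_of(keys, num_partitions)
    return [(keys[b == p], rows[b == p]) for p in range(num_partitions)]
```

With N a power of two, masking the hash with N - 1 is the same as keeping its log2(N) low-order bits, which is exactly the bit-width argument above.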

Prior Art

Fig. 2 shows the approach used in prior published work for executing joins on a GPU:


Source: https://doi.org/10.1145/3709689

First, new relations (R' and S') are derived from the input relations. R' contains two columns: the first holds the join key, and the second holds the index of the row. S' is similarly derived from S.

Next, R' and S' are partitioned according to the join key. In Fig. 2 there are only two partitions, and the least-significant bit of the join key determines which partition a row is placed in.

Now that the data has been divided, it is time to conquer. Each partition of R' is joined with the corresponding partition of S'. This works great because there are many partitions which can be processed in parallel, and the working set required to join a single partition is small enough to fit into (on-chip) shared memory.

Finally, the row index values are converted back to raw input values with a gather operation (one gather per output payload column).
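
To make the whole dataflow concrete, here is a minimal CPU-side NumPy sketch of the baseline, reusing the partition helper from the refresher above. The function name, the dict-based hash table, and the assumption that R's keys are unique are illustrative choices of mine, not the paper's implementation:

```python
import numpy as np

def baseline_join(r_keys, r_payload_cols, s_keys, s_payload_cols, num_partitions=4):
    """Fig. 2 dataflow: derive (key, row index) relations, partition them,
    join partition-wise, then gather payload columns through row indices."""
    r_parts = partition(r_keys, np.arange(len(r_keys)), num_partitions)
    s_parts = partition(s_keys, np.arange(len(s_keys)), num_partitions)

    r_idx, s_idx = [], []
    for (rk, rr), (sk, sr) in zip(r_parts, s_parts):
        ht = dict(zip(rk.tolist(), rr.tolist()))      # build side (unique keys)
        for k, i in zip(sk.tolist(), sr.tolist()):    # probe side
            if k in ht:
                r_idx.append(ht[k])
                s_idx.append(i)

    # The gathers: one per payload column. r_idx and s_idx are not sorted,
    # so each element is a small random read from a large array.
    r_idx = np.asarray(r_idx, dtype=np.int64)
    s_idx = np.asarray(s_idx, dtype=np.int64)
    return ([col[r_idx] for col in r_payload_cols],
            [col[s_idx] for col in s_payload_cols])
```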

All steps but the last are a good fit for the GPU architecture and are not too embarrassing for any engineers involved. The gathers, however, will turn your cheeks red. Each one involves very little compute and a whole bunch of small random reads (e.g., 4 or 8 bytes wide) from DRAM/HBM. As the number of payload columns rises, so does the awkwardness.

You can see the source of the randomness in the map argument of the gather calls in Fig. 2. The partitioning process all but guarantees that the row index values in a partition will be non-sequential.

The Solution

The fix is both straightforward and counterintuitive: partition the payload values along with the join keys in the partitioning step. This is counterintuitive because it seems to add more work to the whole algorithm. The amount of data read and written by the partitioning step increases linearly with the number of payload columns. However, this cost is more than paid for by the fact that the gather step now operates on sequential indices.
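
Here is the same sketch adjusted to the optimized scheme (again an illustrative sketch reusing bucket_of from above, not the paper's code): the payload columns travel through the partitioning step with the keys, so the per-partition gather maps are local and monotonically increasing rather than random global row indices.

```python
import numpy as np

def optimized_join(r_keys, r_payload_cols, s_keys, s_payload_cols, num_partitions=4):
    """Partition keys *and* payload columns together, then materialize each
    partition's output from its own small, nearly-sequentially-read arrays."""
    br = bucket_of(r_keys, num_partitions)
    bs = bucket_of(s_keys, num_partitions)

    r_out = [[] for _ in r_payload_cols]
    s_out = [[] for _ in s_payload_cols]
    for p in range(num_partitions):
        rk = r_keys[br == p]
        rp = [col[br == p] for col in r_payload_cols]   # partitioned payload
        sk = s_keys[bs == p]
        sp = [col[bs == p] for col in s_payload_cols]

        ht = {k: i for i, k in enumerate(rk.tolist())}  # local build-side indices
        s_map = np.asarray([i for i, k in enumerate(sk.tolist()) if k in ht],
                           dtype=np.int64)              # increasing by construction
        r_map = np.asarray([ht[k] for k in sk.tolist() if k in ht], dtype=np.int64)

        for j, col in enumerate(sp):
            s_out[j].append(col[s_map])
        for j, col in enumerate(rp):
            r_out[j].append(col[r_map])

    return ([np.concatenate(c) for c in r_out],
            [np.concatenate(c) for c in s_out])
```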

Fig. 4 illustrates the process:


Source: https://doi.org/10.1145/3709689

The examples above have a match ratio of 100%: each foreign key value in S matches a primary key in R. When the optimized scheme is used with a lower match ratio, the map argument of the gather operations will still be monotonically increasing, but it will be sparse (rows will be skipped).
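
A tiny illustration of that access pattern, using toy data with a 50% match ratio (the data and names are mine, not from the paper):

```python
import numpy as np

# Only the even foreign keys in S find a partner in R (50% match ratio).
r_keys = np.arange(0, 16, 2)                       # primary keys 0, 2, ..., 14
s_keys = np.arange(8)                              # foreign keys 0..7
s_pay = np.array(["s%d" % i for i in range(8)])    # a payload column of S

matched = np.isin(s_keys, r_keys)                  # which probe rows survive
gather_map = np.flatnonzero(matched)               # map over partitioned payload
print(gather_map)                                  # [0 2 4 6]: increasing, sparse
print(s_pay[gather_map])                           # ['s0' 's2' 's4' 's6']
```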

Results

Fig. 7 shows a nice improvement:


Source: https://doi.org/10.1145/3709689

The relevant bars are ST (baseline) and PT (optimized). The “Materialization” rectangles represent the time spent in the gather operations.

Dangling Pointers

This is yet another example of optimizing a system to work well with memory that claims to support “random access” but requires a decent amount of access coherence to achieve high goodput.

In the baseline algorithm with many payload columns, it seems like changing the interface between the memory controller and DRAM could improve the situation. For example, the base address of each payload column could be programmed once, and then each map element could be sent across the DRAM pins once. Each map element sent to DRAM could result in multiple payload elements (one per column) coming back. Maybe this is a good use case for Processing-in-Memory.
