1. Introduction
What if outbreak simulations that take hours could run in minutes? This work explores that question by harnessing the parallel power of modern GPUs to accelerate epidemic simulations. Using NVIDIA's CUDA platform, we model disease spread over real social networks, in our experiments the SNAP Facebook friendship graph. By comparing a CPU baseline against five progressively optimized GPU implementations, we show that restructuring the algorithm around the GPU's warp execution model matters far more than raw core count: the final warp-level "pull" version completes the full simulation in under a millisecond, roughly 5× faster than the CPU baseline and almost 7× faster than a naive GPU port, while preserving the stochastic SIR dynamics. Faster simulations let public health analysts explore far more outbreak scenarios in the same amount of time.
This paper explores the acceleration of epidemic spread simulations on large-scale graphs using NVIDIA CUDA. The epidemic is modeled using the classical SIR framework, where infection transmission occurs between graph vertices.
2. Theoretical Model
2.1 SIR Model
The classical SIR model defines three states:
- S(t) — number of susceptible individuals
- I(t) — number of infected individuals
- R(t) — number of recovered individuals
The continuous-time model is governed by the differential equations

dS/dt = -β·S·I / N
dI/dt = β·S·I / N - γ·I
dR/dt = γ·I

where β is the infection rate, γ is the recovery rate, and N = S + I + R is the (constant) population size.
In the discrete probabilistic version used for graph simulations, each infected node independently infects each of its susceptible neighbors with probability p = β, and becomes recovered after one step.
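Since each infected neighbor transmits independently with probability p, a susceptible node with k infected neighbors becomes infected during one step with probability

P(infected in one step) = 1 - (1 - p)^k

This is a direct consequence of the rule above, not an extra model assumption, and it is handy for sanity-checking simulation output on small graphs.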
3. Population as a Graph
3.1 Population Graph Basics
The population is modeled as a graph G=(V,E), where |V|=N represents individuals and |E|=M represents contacts.
Graph types:
- Erdős–Rényi random graph — each edge exists independently with probability p;
- Scale-free (Barabási–Albert) — better matches real social networks.
The graph is stored in CSR (Compressed Sparse Row) format:
- offsets[i] — starting index of the neighbors of vertex i;
- neighbors[] — flattened list of all neighbors.
3.1.1 CSR (Compressed Sparse Row) Format Visualization
The CSR (Compressed Sparse Row) format stores a sparse graph efficiently by flattening adjacency lists into a single array. It consists of two main arrays:
- offsets[i] — index in the neighbors array where the adjacency list of vertex i starts.
- neighbors[] — concatenated adjacency lists of all vertices.
Example graph:
Vertex 0: 1, 2
Vertex 1: 0
Vertex 2: 0, 3
Vertex 3: 2
CSR representation:
offsets = [0, 2, 3, 5, 6]
neighbors = [1, 2, 0, 0, 3, 2]
Explanation:
- Vertex 0 neighbors: indices `offsets[0] = 0` to `offsets[1]-1 = 1` → neighbors `[1, 2]`
- Vertex 1 neighbors: indices `offsets[1] = 2` to `offsets[2]-1 = 2` → neighbors `[0]`
- Vertex 2 neighbors: indices `offsets[2] = 3` to `offsets[3]-1 = 4` → neighbors `[0, 3]`
- Vertex 3 neighbors: indices `offsets[3] = 5` to `offsets[4]-1 = 5` → neighbors `[2]`
This format allows fast sequential access to neighbors and coalesced memory access on GPUs, which is crucial for parallel graph algorithms.
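To make the CSR layout concrete, here is a minimal host-side sketch that builds the offsets/neighbors arrays for the example graph above and writes them in a raw 32-bit-integer binary format. The write_bin_file helper is a hypothetical companion to the read_bin_file helper the simulation programs later assume; file names and the on-disk format are assumptions, not part of the original article.

#include <fstream>
#include <string>
#include <vector>

// Build CSR arrays from an adjacency-list representation.
static void build_csr(const std::vector<std::vector<int>> &adj,
                      std::vector<int> &offsets, std::vector<int> &neighbors) {
  offsets.assign(adj.size() + 1, 0);
  for (size_t u = 0; u < adj.size(); ++u)
    offsets[u + 1] = offsets[u] + static_cast<int>(adj[u].size());
  neighbors.clear();
  neighbors.reserve(offsets.back());
  for (const auto &list : adj)
    neighbors.insert(neighbors.end(), list.begin(), list.end());
}

// Hypothetical helper: dump a vector of 32-bit ints as raw bytes.
static void write_bin_file(const std::string &path, const std::vector<int> &v) {
  std::ofstream out(path, std::ios::binary);
  out.write(reinterpret_cast<const char *>(v.data()), v.size() * sizeof(int));
}

int main() {
  // The 4-vertex example graph from above.
  std::vector<std::vector<int>> adj = {{1, 2}, {0}, {0, 3}, {2}};
  std::vector<int> offsets, neighbors;
  build_csr(adj, offsets, neighbors);  // offsets = [0,2,3,5,6], neighbors = [1,2,0,0,3,2]
  write_bin_file("offsets.bin", offsets);
  write_bin_file("neighbors.bin", neighbors);
  return 0;
}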
4. Implementation
4.1 CPU Implementation
#include <cstdlib>
#include <vector>
using std::vector;

#define SUSCEPTIBLE 0
#define INFECTED 1
#define RECOVERED 2

void epidemic_cpu(const vector<vector<int>> &graph, vector<int> &state,
                  float p, int steps) {
  const int N = graph.size();
  for (int t = 0; t < steps; ++t) {
    vector<int> next_state = state;          // double buffer: read current, write next
    for (int u = 0; u < N; ++u) {
      if (state[u] == INFECTED) {
        for (int v : graph[u]) {             // try to infect each susceptible neighbor
          if (state[v] == SUSCEPTIBLE && ((float)rand() / RAND_MAX) < p)
            next_state[v] = INFECTED;
        }
        next_state[u] = RECOVERED;           // infection lasts exactly one step
      }
    }
    state.swap(next_state);
  }
}
The CPU implementation of the SIR model simulates the epidemic dynamics step by step using two state arrays: state for the current step and next_state for the following step. During each iteration, every infected node attempts to infect its susceptible neighbors with probability p, using a random number generator to model stochastic transmission. After processing its neighbors, an infected node always transitions to the recovered state, reflecting the discrete SIR assumption that infection lasts exactly one time step. The algorithm is simple and easy to understand, but it becomes computationally expensive for large-scale graphs because all nodes are processed sequentially on a single CPU core.
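A minimal driver for the function above might look like this; the tiny contact graph and the parameter values are purely illustrative.

#include <iostream>
#include <vector>

int main() {
  // Tiny illustrative contact graph (adjacency lists), patient zero at vertex 0.
  std::vector<std::vector<int>> graph = {{1, 2}, {0, 2}, {0, 1, 3}, {2}};
  std::vector<int> state(graph.size(), SUSCEPTIBLE);
  state[0] = INFECTED;

  epidemic_cpu(graph, state, /*p=*/0.5f, /*steps=*/10);

  int s = 0, i = 0, r = 0;
  for (int x : state) {
    if (x == SUSCEPTIBLE) ++s;
    else if (x == INFECTED) ++i;
    else ++r;
  }
  std::cout << "S=" << s << " I=" << i << " R=" << r << std::endl;
  return 0;
}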
4.2 GPU Implementation
4.2.1 Implementation #1: Naive Approach
This section presents a naive GPU implementation: each thread handles one vertex and iterates over that vertex's neighbors in the order they appear in the CSR arrays. The code is shown below, followed by an analysis of how this approach can be improved.
#include <curand_kernel.h>
// One thread per vertex; the kernel name matches the call in main() below.
__global__ void epidemic_step_naive(int N, const int *offsets, const int *neighbors,
const int *state_curr, int *state_next, float p,
curandState *randStates) {
int u = blockIdx.x * blockDim.x + threadIdx.x;
if (u >= N) return;
curandState localState = randStates[u];
int state = state_curr[u];
if (state == INFECTED) {
for (int i = offsets[u]; i < offsets[u + 1]; ++i) {
int v = neighbors[i];
if (state_curr[v] == SUSCEPTIBLE) {
float r = curand_uniform(&localState);
if (r < p) state_next[v] = INFECTED;
}
}
state_next[u] = RECOVERED;
} else {
state_next[u] = state_curr[u];
}
randStates[u] = localState;
}
Identified Performance Issues:
- Race Conditions in Neighbor State Updates
if (state_curr[v] == SUSCEPTIBLE) {
float r = curand_uniform(&localState);
if (r < p) state_next[v] = INFECTED;
}
The thread processing vertex u may write to state_next[v], while the thread processing v itself also writes to state_next[v] in its else branch.
Problem: if two infected nodes u1 and u2 share a susceptible neighbor v, both of their threads may write INFECTED to state_next[v]; at the same time, v's own thread copies state_next[v] = state_curr[v] (still SUSCEPTIBLE) at an unspecified moment relative to those writes.
Because the writes are non-atomic and unordered, an infection can be overwritten by the plain copy and silently lost, and results become non-deterministic from run to run.
- Warp Divergence
if (state == INFECTED) { ... } else { ... }
Within a warp, threads may have different states (SUSCEPTIBLE, INFECTED, RECOVERED).
Threads that are not infected only execute state_next[u] = state_curr[u];
Threads that are infected execute the full neighbor loop.
This causes warp divergence, reducing SIMT efficiency and decreasing GPU performance.
- Suboptimal Memory Access Patterns (Global Memory Access)
Access to state_curr[v] and writes to state_next[v] occur randomly, especially in scale-free graphs.
Poor memory coalescence may occur, particularly when neighbors of a single node are scattered in memory.
This reduces GPU memory bandwidth.
- Random Number Generation (curand_uniform)
Random number generation occurs inside the neighbor loop.
The thread uses a local copy of curandState (localState), which avoids races on the RNG state itself, but generating a random number inside the neighbor loop is still computationally expensive, especially for vertices with many neighbors.
- Potential Inefficiency on Hubs (High-Degree Vertices)
If a vertex has many neighbors (a hub), a single thread processes all of its edges sequentially.
The other threads in the hub's warp finish their shorter loops early and sit idle while the hub thread keeps iterating → low warp/SIMT efficiency.
- Use of state_next Without Atomics
If two threads write INFECTED to the same state_next[v] element, the race condition might not be critical (the result remains INFECTED), but it is non-deterministic.
For some models, precise determinism may be important, especially for debugging or experiment reproducibility.
Main GPU Loop
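The main programs in this article call a helper, read_bin_file, that is never shown (the init_rand_states kernel used below is listed in Section 4.2.2). A minimal sketch of read_bin_file, assuming the files store raw 32-bit integers as produced by the write_bin_file helper sketched earlier:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Reads a whole binary file of raw 32-bit ints into a vector.
static std::vector<int> read_bin_file(const std::string &path) {
  std::ifstream in(path, std::ios::binary | std::ios::ate);
  if (!in) {
    std::cerr << "Cannot open " << path << std::endl;
    std::exit(1);
  }
  std::streamsize bytes = in.tellg();
  in.seekg(0, std::ios::beg);
  std::vector<int> data(static_cast<size_t>(bytes) / sizeof(int));
  in.read(reinterpret_cast<char *>(data.data()),
          static_cast<std::streamsize>(data.size() * sizeof(int)));
  return data;
}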
int main(int argc, char **argv) {
if (argc < 3) {
std::cerr << "Usage: " << argv[0] << " <offsets.bin> <neighbors.bin>" << std::endl;
return 1;
}
std::string offsets_file = argv[1];
std::string neighbors_file = argv[2];
std::vector<int> h_offsets = read_bin_file(offsets_file);
std::vector<int> h_neighbors = read_bin_file(neighbors_file);
int N = h_offsets.size() - 1;
int M = h_neighbors.size();
std::cout << "Offsets size: " << N << " Neighbors size: " << M << std::endl;
std::vector<int> state(N, SUSCEPTIBLE);
state[0] = INFECTED; // initial infected node
int *d_offsets, *d_neighbors, *d_state_curr, *d_state_next;
curandState *d_randStates;
cudaMalloc(&d_offsets, h_offsets.size() * sizeof(int));
cudaMalloc(&d_neighbors, h_neighbors.size() * sizeof(int));
cudaMalloc(&d_state_curr, N * sizeof(int));
cudaMalloc(&d_state_next, N * sizeof(int));
cudaMalloc(&d_randStates, N * sizeof(curandState));
cudaMemcpy(d_offsets, h_offsets.data(), h_offsets.size() * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_neighbors, h_neighbors.data(), h_neighbors.size() * sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(d_state_curr, state.data(), N * sizeof(int), cudaMemcpyHostToDevice);
int threads = 256;
int blocks_nodes = (N + threads - 1) / threads;
// -------------------------------------------------------------------------
// Initialize RNG states
// -------------------------------------------------------------------------
init_rand_states<<<blocks_nodes, threads>>>(d_randStates, N, time(NULL));
cudaDeviceSynchronize();
float p = 0.05f; // infection probability
int steps = 50;
// -------------------------------------------------------------------------
// CUDA events for timing
// -------------------------------------------------------------------------
cudaEvent_t start_total, stop_total;
cudaEvent_t start_kernel, stop_kernel;
cudaEventCreate(&start_total);
cudaEventCreate(&stop_total);
cudaEventCreate(&start_kernel);
cudaEventCreate(&stop_kernel);
// -------------------------------------------------------------------------
// Main simulation loop
// -------------------------------------------------------------------------
cudaEventRecord(start_total, 0); // start total timing
for (int t = 0; t < steps; ++t) {
cudaEventRecord(start_kernel, 0); // start kernel timing
epidemic_step_naive<<<blocks_nodes, threads>>>(N, d_offsets, d_neighbors, d_state_curr,
d_state_next, p, d_randStates);
cudaEventRecord(stop_kernel, 0); // stop kernel timing
cudaEventSynchronize(stop_kernel);
float ms_kernel = 0.0f;
cudaEventElapsedTime(&ms_kernel, start_kernel, stop_kernel);
std::cout << "Step " << t << " kernel time: " << ms_kernel << " ms" << std::endl;
cudaDeviceSynchronize();
std::swap(d_state_curr, d_state_next);
}
cudaEventRecord(stop_total, 0);
cudaEventSynchronize(stop_total);
float ms_total = 0.0f;
cudaEventElapsedTime(&ms_total, start_total, stop_total);
std::cout << "Total simulation time: " << ms_total << " ms" << std::endl;
// -------------------------------------------------------------------------
// Copy results and print summary
// -------------------------------------------------------------------------
cudaMemcpy(state.data(), d_state_curr, N * sizeof(int), cudaMemcpyDeviceToHost);
int infected = 0, recovered = 0, susceptible = 0;
for (int s : state) {
if (s == INFECTED)
infected++;
else if (s == RECOVERED)
recovered++;
else
susceptible++;
}
std::cout << "S=" << susceptible << " I=" << infected << " R=" << recovered << std::endl;
// -------------------------------------------------------------------------
// Cleanup
// -------------------------------------------------------------------------
cudaFree(d_offsets);
cudaFree(d_neighbors);
cudaFree(d_state_curr);
cudaFree(d_state_next);
cudaFree(d_randStates);
cudaEventDestroy(start_total);
cudaEventDestroy(stop_total);
cudaEventDestroy(start_kernel);
cudaEventDestroy(stop_kernel);
cudaDeviceReset();
return 0;
}
4.2.2 Implementation #2: Edge-Based Approach
While many of the issues in the first implementation could be patched individually, there is a more interesting solution that eliminates several of them from the start.
Instead of having each thread process a vertex and all of its neighbors, we assign one thread per edge.
This gives us the following benefits:
- No loops over neighbors
In Implementation #1 each thread did:
for i in neighbors[u]:
...
Now it is one thread = one edge → no per-thread loops → far less divergence.
- Load becomes balanced
Some vertices have thousands of neighbors, others only 3.
In the naive approach some threads were idle while others were overloaded.
In the edge-based model:
- all threads have the same amount of work
- GPU throughput increases dramatically
- Better memory access
Now all threads access neighbors using continuous indices (coalesced access).
- Fewer idle threads
Before: most threads were just copying the state (if the vertex was not INFECTED).
Now: each thread processes an edge → almost every thread has real work to do.
So the implementation idea is:
- One thread per edge rather than per vertex → each edge is examined exactly once and the per-vertex neighbor loop disappears.
- Two infected sources can still share a susceptible target, so atomic operations are used only where necessary (for the transition to INFECTED).
Edge-Based Kernel
__global__ void epidemic_step_edge_based(int N, const int *offsets, const int *neighbors,
const int *state_curr, int *state_next, float p,
curandState *randStates) {
int eid = blockIdx.x * blockDim.x + threadIdx.x;
int M = offsets[N];
if (eid >= M) return;
int u = find_src_for_edge(offsets, N, eid);
int v = neighbors[eid];
int state_u = state_curr[u];
int state_v = state_curr[v];
  // Per-edge random draw using the RNG state of the source vertex u.
  // Note: edges sharing the same source are handled by different threads, so they
  // read and later write randStates[u] concurrently; this is a known simplification
  // that Implementation #3 removes by pre-generating one random number per edge.
  curandState local = randStates[u];
if (state_u == INFECTED && state_v == SUSCEPTIBLE) {
float r = curand_uniform(&local);
if (r < p) {
atomicExch(&state_next[v], INFECTED);
}
}
randStates[u] = local;
}
The find_src_for_edge function:
__device__ inline int find_src_for_edge(const int *offsets, int N, int eid) {
// binary search: find u such that offsets[u] <= eid < offsets[u+1]
int left = 0, right = N;
while (left < right) {
int mid = (left + right) >> 1;
if (offsets[mid] <= eid)
left = mid + 1;
else
right = mid;
}
return left - 1;
}
Binary search on offsets is used to find the source vertex u - very fast since the graph is static.
Race conditions on v are eliminated using atomic operations.
Many threads (equal to the number of edges) provide ideal load distribution for scale-free graphs.
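As a quick sanity check of the binary search, here is a small host-side analogue run against the 4-vertex CSR example from Section 3.1.1; the function body mirrors the device code above.

#include <cassert>
#include <vector>

// Host-side mirror of find_src_for_edge: largest u with offsets[u] <= eid.
static int find_src_for_edge_host(const std::vector<int> &offsets, int N, int eid) {
  int left = 0, right = N;
  while (left < right) {
    int mid = (left + right) >> 1;
    if (offsets[mid] <= eid) left = mid + 1;
    else right = mid;
  }
  return left - 1;
}

int main() {
  std::vector<int> offsets = {0, 2, 3, 5, 6};           // example graph from Section 3.1.1
  // neighbors = [1, 2, 0, 0, 3, 2]
  assert(find_src_for_edge_host(offsets, 4, 0) == 0);   // edge 0: 0 -> 1
  assert(find_src_for_edge_host(offsets, 4, 2) == 1);   // edge 2: 1 -> 0
  assert(find_src_for_edge_host(offsets, 4, 3) == 2);   // edge 3: 2 -> 0
  assert(find_src_for_edge_host(offsets, 4, 5) == 3);   // edge 5: 3 -> 2
  return 0;
}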
Vertex Processing (Mark INFECTED → RECOVERED)
__global__ void mark_recovered(int N, int *state_curr, int *state_next) {
int u = blockIdx.x * blockDim.x + threadIdx.x;
if (u >= N) return;
if (state_curr[u] == INFECTED) {
state_next[u] = RECOVERED;
}
}
Minimal branching, works excellently on GPU.
Can be executed after the edge-based kernel.
Parallel initialization of curand states
__global__ void init_rand_states(curandState *randStates, int M, unsigned long seed) {
int edge_id = blockIdx.x * blockDim.x + threadIdx.x;
if (edge_id >= M) return;
curand_init(seed, edge_id, 0, &randStates[edge_id]);
}
Main GPU Loop
int main(int argc, char **argv) {
if (argc < 3) {
std::cerr << "Usage: " << argv[0] << " <offsets.bin> <neighbors.bin>" << std::endl;
return 1;
}
std::string offsets_file = argv[1];
std::string neighbors_file = argv[2];
std::vector<int> h_offsets = read_bin_file(offsets_file);
std::vector<int> h_neighbors = read_bin_file(neighbors_file);
int N = h_offsets.size() - 1;
int M = h_neighbors.size();
std::cout << "Offsets size: " << N << " Neighbors size: " << M << std::endl;
std::vector<int> state(N, SUSCEPTIBLE);
state[0] = INFECTED; // initial infected node
int *d_offsets, *d_neighbors, *d_state_curr, *d_state_next;
curandState *d_randStates;
cudaMalloc(&d_offsets, h_offsets.size() * sizeof(int));
cudaMalloc(&d_neighbors, h_neighbors.size() * sizeof(int));
cudaMalloc(&d_state_curr, N * sizeof(int));
cudaMalloc(&d_state_next, N * sizeof(int));
cudaMalloc(&d_randStates, N * sizeof(curandState));
cudaMemcpy(d_offsets, h_offsets.data(), h_offsets.size() * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_neighbors, h_neighbors.data(), h_neighbors.size() * sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(d_state_curr, state.data(), N * sizeof(int), cudaMemcpyHostToDevice);
int threads = 256;
int blocks_edges = (M + threads - 1) / threads;
int blocks_nodes = (N + threads - 1) / threads;
// -------------------------------------------------------------------------
// Initialize RNG states
// -------------------------------------------------------------------------
init_rand_states<<<blocks_nodes, threads>>>(d_randStates, N, time(NULL));
cudaDeviceSynchronize();
float p = 0.05f; // infection probability
int steps = 50;
// -------------------------------------------------------------------------
// CUDA events for timing
// -------------------------------------------------------------------------
cudaEvent_t start_total, stop_total;
cudaEvent_t start_kernel, stop_kernel;
cudaEvent_t start_mark, stop_mark;
cudaEventCreate(&start_total);
cudaEventCreate(&stop_total);
cudaEventCreate(&start_kernel);
cudaEventCreate(&stop_kernel);
cudaEventCreate(&start_mark);
cudaEventCreate(&stop_mark);
// -------------------------------------------------------------------------
// Main simulation loop
// -------------------------------------------------------------------------
cudaEventRecord(start_total, 0); // start total timing
for (int t = 0; t < steps; ++t) {
    // Step 0: carry the current state into the next buffer so vertices that do not
    // change this step (still susceptible or already recovered) keep their state.
    cudaMemcpy(d_state_next, d_state_curr, N * sizeof(int), cudaMemcpyDeviceToDevice);
    // Step 1: propagate infections along edges
    cudaEventRecord(start_kernel, 0); // start kernel timing
epidemic_step_edge_based<<<blocks_edges, threads>>>(N, d_offsets, d_neighbors, d_state_curr,
d_state_next, p, d_randStates);
cudaEventRecord(stop_kernel, 0); // stop kernel timing
cudaEventSynchronize(stop_kernel);
float ms_kernel = 0.0f;
cudaEventElapsedTime(&ms_kernel, start_kernel, stop_kernel);
std::cout << "Step " << t << " kernel time: " << ms_kernel << " ms" << std::endl;
// Step 2: mark INFECTED -> RECOVERED
cudaEventRecord(start_mark, 0);
mark_recovered<<<blocks_nodes, threads>>>(N, d_state_curr, d_state_next);
cudaEventRecord(stop_mark, 0);
cudaEventSynchronize(stop_mark);
float ms_mark = 0.0f;
cudaEventElapsedTime(&ms_mark, start_mark, stop_mark);
std::cout << "Step " << t << " mark time: " << ms_mark << " ms" << std::endl;
cudaDeviceSynchronize();
// Swap states
std::swap(d_state_curr, d_state_next);
}
cudaEventRecord(stop_total, 0);
cudaEventSynchronize(stop_total);
float ms_total = 0.0f;
cudaEventElapsedTime(&ms_total, start_total, stop_total);
std::cout << "Total simulation time: " << ms_total << " ms" << std::endl;
// -------------------------------------------------------------------------
// Copy results and print summary
// -------------------------------------------------------------------------
cudaMemcpy(state.data(), d_state_curr, N * sizeof(int), cudaMemcpyDeviceToHost);
int infected = 0, recovered = 0, susceptible = 0;
for (int s : state) {
if (s == INFECTED)
infected++;
else if (s == RECOVERED)
recovered++;
else
susceptible++;
}
std::cout << "S=" << susceptible << " I=" << infected << " R=" << recovered << std::endl;
// -------------------------------------------------------------------------
// Cleanup
// -------------------------------------------------------------------------
cudaFree(d_offsets);
cudaFree(d_neighbors);
cudaFree(d_state_curr);
cudaFree(d_state_next);
cudaFree(d_randStates);
cudaEventDestroy(start_total);
cudaEventDestroy(stop_total);
cudaEventDestroy(start_kernel);
cudaEventDestroy(stop_kernel);
cudaEventDestroy(start_mark);
cudaEventDestroy(stop_mark);
cudaDeviceReset();
return 0;
}
4.2.3 Implementation #3: Atomic-Free Optimization
The previous implementation provided a good performance improvement, but it is still not the best we can achieve.
We used atomic operations, which are relatively expensive and should be minimized. In our case we can eliminate them entirely by pre-generating the random numbers; in addition, we introduce warp-level optimization.
Main idea:
- Edge-based kernel without atomics.
- Pre-generate random numbers for every edge and every step, so each thread simply looks up whether its edge transmits the infection.
- The outcome no longer depends on write ordering: concurrent writes to the same vertex all store the same value (INFECTED), and the decision itself comes from pre-generated numbers, so results are reproducible.
- Warp-level optimization: distribute processing of large vertices (hubs) across the threads of one warp to avoid imbalance (developed fully in Implementation #4).
- Batch generation of random numbers for all edges into a separate array.
Pre-generation of Random Numbers
__global__ void generate_edge_randoms_per_edge(int M, int steps, float *edge_rands,
curandState *randStates) {
int eid = blockIdx.x * blockDim.x + threadIdx.x;
if (eid >= M) return;
curandState local = randStates[eid];
// write randoms for all steps in contiguous blocks: edge_rands[step*M + eid]
for (int t = 0; t < steps; ++t) {
edge_rands[(size_t)t * (size_t)M + (size_t)eid] = curand_uniform(&local);
}
randStates[eid] = local;
}
Here edge_rands is an array of size steps × M, laid out as edge_rands[step*M + eid], holding one pre-generated uniform random number for every edge at every simulation step. For the graph used in Section 5 (176,468 directed edges, 50 steps) this costs about 35 MB of device memory.
Atomic-Free Edge-Based Kernel
__global__ void epidemic_step_edge_free(int N, const int *offsets, const int *neighbors,
const int *state_curr, int *state_next,
const float *edge_rands, float p) {
int eid = blockIdx.x * blockDim.x + threadIdx.x;
int M = offsets[N];
if (eid >= M) return;
// Binary search: find source vertex u
int u = find_src_for_edge(offsets, N, eid);
int v = neighbors[eid];
int state_u = state_curr[u];
int state_v = state_curr[v];
// Only propagate infection if u is infected and v is susceptible
if (state_u == INFECTED && state_v == SUSCEPTIBLE) {
if (edge_rands[eid] < p) {
state_next[v] = INFECTED;
}
}
}
Features:
- No atomics: concurrent writes to the same state_next[v] are still possible, but every writer stores the same value (INFECTED) and the decision comes from the pre-generated numbers, so the result is well defined and reproducible.
- Each thread works independently.
Main GPU Loop
int main(int argc, char **argv) {
if (argc < 3) {
std::cerr << "Usage: " << argv[0] << " <offsets.bin> <neighbors.bin>" << std::endl;
return 1;
}
std::string offsets_file = argv[1];
std::string neighbors_file = argv[2];
std::vector<int> h_offsets = read_bin_file(offsets_file);
std::vector<int> h_neighbors = read_bin_file(neighbors_file);
int N = h_offsets.size() - 1;
int M = h_neighbors.size();
std::cout << "Offsets size: " << N << " Neighbors size: " << M << std::endl;
std::vector<int> state(N, SUSCEPTIBLE);
state[0] = INFECTED; // initial infected node
int *d_offsets, *d_neighbors, *d_state_curr, *d_state_next;
float *d_edge_rands;
curandState *d_randStates;
cudaMalloc(&d_offsets, h_offsets.size() * sizeof(int));
cudaMalloc(&d_neighbors, h_neighbors.size() * sizeof(int));
cudaMalloc(&d_state_curr, N * sizeof(int));
cudaMalloc(&d_state_next, N * sizeof(int));
cudaMalloc(&d_randStates, M * sizeof(curandState)); // enough for M edges
cudaMemcpy(d_offsets, h_offsets.data(), h_offsets.size() * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_neighbors, h_neighbors.data(), h_neighbors.size() * sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(d_state_curr, state.data(), N * sizeof(int), cudaMemcpyHostToDevice);
int threads = 256;
int blocks_edges = (M + threads - 1) / threads;
int blocks_nodes = (N + threads - 1) / threads;
// -------------------------------------------------------------------------
// Initialize RNG states
// -------------------------------------------------------------------------
init_rand_states<<<blocks_edges, threads>>>(d_randStates, M, time(NULL));
cudaDeviceSynchronize();
float p = 0.05f; // infection probability
int steps = 50;
size_t total_randoms = (size_t)M * steps; // size_t avoids overflow on very large graphs
  // Allocate memory for all random numbers for all steps
  cudaMalloc(&d_edge_rands, total_randoms * sizeof(float));
// -------------------------------------------------------------------------
// CUDA events for timing
// -------------------------------------------------------------------------
cudaEvent_t start_total, stop_total;
cudaEvent_t start_rand_gen, stop_rand_gen;
cudaEvent_t start_kernel, stop_kernel;
cudaEvent_t start_mark, stop_mark;
cudaEventCreate(&start_total);
cudaEventCreate(&stop_total);
cudaEventCreate(&start_rand_gen);
cudaEventCreate(&stop_rand_gen);
cudaEventCreate(&start_kernel);
cudaEventCreate(&stop_kernel);
cudaEventCreate(&start_mark);
cudaEventCreate(&stop_mark);
// -------------------------------------------------------------------------
// Pre-generate all random numbers once
// -------------------------------------------------------------------------
cudaEventRecord(start_rand_gen, 0);
generate_edge_randoms_per_edge<<<blocks_edges, threads>>>(M, steps, d_edge_rands, d_randStates);
cudaDeviceSynchronize();
cudaEventRecord(stop_rand_gen, 0);
cudaEventSynchronize(stop_rand_gen);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
std::cerr << "generate_edge_randoms_per_edge failed: " << cudaGetErrorString(err)
<< std::endl;
return 1;
}
float ms_rand_gen = 0.0f;
cudaEventElapsedTime(&ms_rand_gen, start_rand_gen, stop_rand_gen);
std::cout << "Pre-generation time: " << ms_rand_gen << " ms" << std::endl;
// -------------------------------------------------------------------------
// Main simulation loop
// -------------------------------------------------------------------------
cudaEventRecord(start_total, 0);
for (int t = 0; t < steps; ++t) {
    const float *edge_rands_step = d_edge_rands + (size_t)t * (size_t)M;
    // Carry the current state into the next buffer so unchanged vertices keep their state.
    cudaMemcpy(d_state_next, d_state_curr, N * sizeof(int), cudaMemcpyDeviceToDevice);
    // Step 1: propagate infection
cudaEventRecord(start_kernel, 0);
epidemic_step_edge_free<<<blocks_edges, threads>>>(N, d_offsets, d_neighbors, d_state_curr,
d_state_next, edge_rands_step, p);
cudaEventRecord(stop_kernel, 0);
cudaEventSynchronize(stop_kernel);
float ms_kernel = 0.0f;
cudaEventElapsedTime(&ms_kernel, start_kernel, stop_kernel);
std::cout << "Step " << t << " kernel time: " << ms_kernel << " ms" << std::endl;
// Step 2: mark INFECTED -> RECOVERED
cudaEventRecord(start_mark, 0);
mark_recovered<<<blocks_nodes, threads>>>(N, d_state_curr, d_state_next);
cudaEventRecord(stop_mark, 0);
cudaEventSynchronize(stop_mark);
float ms_mark = 0.0f;
cudaEventElapsedTime(&ms_mark, start_mark, stop_mark);
std::cout << "Step " << t << " mark time: " << ms_mark << " ms" << std::endl;
cudaDeviceSynchronize();
// Swap state buffers
std::swap(d_state_curr, d_state_next);
}
cudaEventRecord(stop_total, 0);
cudaEventSynchronize(stop_total);
float ms_total = 0.0f;
cudaEventElapsedTime(&ms_total, start_total, stop_total);
std::cout << "Total simulation time: " << ms_total << " ms" << std::endl;
// -------------------------------------------------------------------------
// Copy results and print summary
// -------------------------------------------------------------------------
cudaMemcpy(state.data(), d_state_curr, N * sizeof(int), cudaMemcpyDeviceToHost);
int infected = 0, recovered = 0, susceptible = 0;
for (int s : state) {
if (s == INFECTED)
infected++;
else if (s == RECOVERED)
recovered++;
else
susceptible++;
}
std::cout << "S=" << susceptible << " I=" << infected << " R=" << recovered << std::endl;
// -------------------------------------------------------------------------
// Cleanup
// -------------------------------------------------------------------------
cudaFree(d_offsets);
cudaFree(d_neighbors);
cudaFree(d_state_curr);
cudaFree(d_state_next);
cudaFree(d_edge_rands);
cudaFree(d_randStates);
cudaEventDestroy(start_total);
cudaEventDestroy(stop_total);
cudaEventDestroy(start_rand_gen);
cudaEventDestroy(stop_rand_gen);
cudaEventDestroy(start_kernel);
cudaEventDestroy(stop_kernel);
cudaEventDestroy(start_mark);
cudaEventDestroy(stop_mark);
cudaDeviceReset();
return 0;
}
Edge-Based Warp-Level GPU Kernel Processing Scheme
Let's examine the distribution of threads, vertices, and edges, as well as the computation flow using the last implementation as an example.
Assume:
- Graph with vertices V0 … V5
- Edges between them:
V0 → V1, V0 → V2
V1 → V2, V1 → V3
V2 → V3
V3 → V4
V4 → V5
- Each GPU thread processes one edge
Vertices: V0 V1 V2 V3 V4 V5
States: S I S S S S (S=0, I=1, R=2)
CSR:
offsets = [0, 2, 4, 5, 6, 7, 7]
neighbors = [V1,V2, V2,V3, V3, V4, V5]
Edges (flattened, one per thread):
E0: V0→V1
E1: V0→V2
E2: V1→V2
E3: V1→V3
E4: V2→V3
E5: V3→V4
E6: V4→V5
GPU Threads:
Thread0 -> E0
Thread1 -> E1
Thread2 -> E2
Thread3 -> E3
Thread4 -> E4
Thread5 -> E5
Thread6 -> E6
- Edge-based kernel:
Thread0: V0→V1 state[V0]=S, state[V1]=I -> no infection
Thread1: V0→V2 state[V0]=S, state[V2]=S -> no infection
Thread2: V1→V2 state[V1]=I, state[V2]=S -> edge_rands[2] < p? Yes -> state_next[V2]=I
Thread3: V1→V3 state[V1]=I, state[V3]=S -> edge_rands[3] < p? No -> state_next[V3]=S
Thread4: V2→V3 state[V2]=S, state[V3]=S -> no infection
Thread5: V3→V4 state[V3]=S, state[V4]=S -> no infection
Thread6: V4→V5 state[V4]=S, state[V5]=S -> no infection
- Each thread is independent
- No race conditions since state_next[v] is updated only based on pre-generated random numbers
- Warp-level distribution:
- On GPU, one warp = 32 threads
- If a vertex has many neighbors (hub), edge-based approach distributes edges across all warp threads
Example: Vertex Vh (hub) with 40 neighbors
Warp0: Thread0..31 -> edges E0..E31
Warp1: Thread0..8 -> edges E32..E39
- Advantage: no thread gets stuck on a hub
- Even load distribution across all threads, nearly perfect load balance
- Final representation:
+---------------------------+ +------------------+
| Vertices | | GPU Threads |
| V0 V1 V2 V3 V4 V5 ... | ---> | T0 T1 T2 ... TM |
+---------------------------+ +------------------+
1) Edge-based propagation:
Each thread processes one edge independently:
T0: V0->V1
T1: V0->V2
...
TM: Vh->Vh+neighbors
2) Mark INFECTED -> RECOVERED:
Each thread processes one vertex independently:
T0: V0
T1: V1
...
3) Swap states for next timestep
The edge-based kernel combined with pre-generated random numbers removes any dependence on write ordering, scales well, and distributes the load evenly even for hubs, making it suitable for millions of vertices and tens of millions of edges.
4.2.4 Implementation #4: Warp-Level Batch Processing
Implementation #3 is already suitable for graphs with millions of vertices. However, there is another technique that optimizes it further for extremely large hubs: warp-level batch processing.
On a GPU, a warp is a group of 32 threads executing the same instruction simultaneously. In our problem, if a vertex (hub) has many neighbors, a single thread would have to walk all of its edges sequentially. Warp-level batch processing instead distributes the neighbors of a single vertex across the threads of one warp, so that all 32 threads work on that vertex in parallel. Consider the following example:
Example:
- Hub Vh has 64 neighbors
- Warp = 32 threads
- Split neighbors into batches of 32 edges:
Warp Thread IDs: 0 1 2 ... 31
Batch 0 (edges 0..31):
T0->edge0, T1->edge1, ..., T31->edge31
Batch 1 (edges 32..63):
T0->edge32, T1->edge33, ..., T31->edge63
- Each thread processes its own edge
- Result: one warp is fully loaded, not stalled on a hub
Advantages for SIR simulation:
- Load balance: large vertices don't slow down the warp
- SIMT efficiency: all warp threads execute identical instructions
- Minimal divergence: all threads process similar operations (infection check + state update)
- Suitable for scale-free graphs with hubs
Code implementation (brief):
- Determine the edge range for each vertex from CSR: offsets[u] .. offsets[u+1]-1.
- Each thread computes its lane within the warp, lane = threadIdx.x % warpSize, and processes edge = offsets[u] + lane + k*warpSize for batch k = 0, 1, 2, ...
- Batches continue until all edges of the vertex are processed.
Let's examine Warp-level Batch Processing for our edge-based SIR on GPU to show how a warp distributes processing of neighbors for a single vertex (hub).
Prerequisites:
- Warp = 32 threads (T0…T31)
- Hub Vh has 70 neighbors (E0…E69)
- Edges are processed in batches, each batch = warp size (32 edges)
- Edge distribution across batches:
Hub Vh neighbors: [E0, E1, E2, ..., E69]
Batch 0 (edges 0..31) -> Warp Threads T0..T31
Batch 1 (edges 32..63) -> Warp Threads T0..T31
Batch 2 (edges 64..69) -> Warp Threads T0..T5
- Scheme: warp processing batches:
Warp Thread IDs
T0 T1 T2 ... T31
Batch 0 (edges 0..31)
E0 E1 E2 ... E31
Batch 1 (edges 32..63)
E32 E33 E34 ... E63
Batch 2 (edges 64..69)
E64 E65 E66 E67 E68 E69 - threads T6..T31 idle
Each warp thread is responsible for one edge within a batch.
After processing one batch, the warp moves on to the next.
Idle threads appear only on the last, incomplete batch, but this is minimal.
Visualization with threads and batches:
Warp0: T0 T1 T2 ... T31
Batch0: E0 E1 E2 ... E31
Batch1: E32 E33 E34 ... E63
Batch2: E64 E65 E66 ... E69
- All batches of one hub are processed by a single warp
- Warp threads execute identical instructions (SIMT) → minimal divergence
- Scales to millions of vertices and tens of millions of edges
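The batch layout above can be reproduced with a few lines of plain C++; the warp size and hub degree here are taken from the example rather than from a device query.

#include <cstdio>

int main() {
  const int WARP_SIZE = 32;
  const int start = 0, degree = 70;            // hub Vh with 70 neighbors, edges E0..E69
  for (int lane = 0; lane < WARP_SIZE; ++lane) {
    std::printf("T%-2d:", lane);
    // Same indexing as the kernel: edge = start + batch*WARP_SIZE + lane
    for (int batch = 0; batch * WARP_SIZE + lane < degree; ++batch)
      std::printf(" E%d", start + batch * WARP_SIZE + lane);
    std::printf("\n");                         // lanes 6..31 print only two edges
  }
  return 0;
}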
CUDA Kernel: Warp-Level Batch Processing
Finally, here is the implementation of the CUDA kernel with warp-level batch processing for edge-based SIR on graphs. The main idea:
- Each warp processes one vertex u, whose neighbors occupy offsets[u] .. offsets[u+1]-1 in CSR.
- Warp threads distribute edges across batches: edge_index = offsets[u] + warpThreadId + batch*warpSize, where warpThreadId = threadIdx.x % warpSize.
- Batches continue until all edges of the vertex are processed.
Warp Batch Based Kernel
#define WARP_SIZE 32  // assumed constant here; cudaDeviceProp::warpSize reports it at runtime
__inline__ __device__ int warp_id() { return threadIdx.x / WARP_SIZE; }
__inline__ __device__ int lane_id() { return threadIdx.x % WARP_SIZE; }
__global__ void epidemic_step_warp_batch(const int N, const int *offsets, const int *neighbors,
const int *state_curr, int *state_next,
const float *edge_rands, const float p) {
int warpIdGlobal = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;
int lane = lane_id();
if (warpIdGlobal >= N) return;
// Each warp handles vertex u = warpIdGlobal
int u = warpIdGlobal;
int state_u = state_curr[u];
  // Copy state_curr[u] to state_next initially.
  // Note: another warp may concurrently write INFECTED into this same entry for a
  // susceptible neighbor; the order of that write and this copy is undefined, so an
  // infection can occasionally be lost. The pull model in Implementation #5 avoids
  // this by letting only the warp that owns a vertex write its state.
  state_next[u] = state_u;
  if (state_u != INFECTED) return; // Only propagate if vertex is infected
int start = offsets[u];
int end = offsets[u + 1];
int degree = end - start;
// Process edges in batches of WARP_SIZE
for (int batch = 0; batch * WARP_SIZE + lane < degree; ++batch) {
int edge_idx = start + batch * WARP_SIZE + lane;
if (edge_idx >= end) continue;
int v = neighbors[edge_idx];
int state_v = state_curr[v];
// Infection propagation
if (state_v == SUSCEPTIBLE && edge_rands[edge_idx] < p) {
state_next[v] = INFECTED; // no atomics needed
}
}
// After processing edges, mark u as recovered
state_next[u] = RECOVERED;
}
Main GPU Loop
int main(int argc, char **argv) {
if (argc < 3) {
std::cerr << "Usage: " << argv[0] << " <offsets.bin> <neighbors.bin>" << std::endl;
return 1;
}
std::string offsets_file = argv[1];
std::string neighbors_file = argv[2];
std::vector<int> h_offsets = read_bin_file(offsets_file);
std::vector<int> h_neighbors = read_bin_file(neighbors_file);
int N = h_offsets.size() - 1;
int M = h_neighbors.size();
std::cout << "Offsets size: " << N << " Neighbors size: " << M << std::endl;
std::vector<int> state(N, SUSCEPTIBLE);
state[0] = INFECTED; // initial infected node
int *d_offsets, *d_neighbors, *d_state_curr, *d_state_next;
float *d_edge_rands;
curandState *d_randStates;
cudaMalloc(&d_offsets, h_offsets.size() * sizeof(int));
cudaMalloc(&d_neighbors, h_neighbors.size() * sizeof(int));
cudaMalloc(&d_state_curr, N * sizeof(int));
cudaMalloc(&d_state_next, N * sizeof(int));
cudaMalloc(&d_randStates, M * sizeof(curandState)); // enough for M edges
cudaMemcpy(d_offsets, h_offsets.data(), h_offsets.size() * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_neighbors, h_neighbors.data(), h_neighbors.size() * sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(d_state_curr, state.data(), N * sizeof(int), cudaMemcpyHostToDevice);
int threads = 256;
int blocks_edges = (M + threads - 1) / threads;
int threads_per_block = 32 * 4; // 4 warps per block
int blocks_vertices =
(N + (threads_per_block / WARP_SIZE) - 1) / (threads_per_block / WARP_SIZE);
// -------------------------------------------------------------------------
// Initialize RNG states
// -------------------------------------------------------------------------
init_rand_states<<<blocks_edges, threads>>>(d_randStates, M, time(NULL));
cudaDeviceSynchronize();
float p = 0.05f; // infection probability
int steps = 50;
size_t total_randoms = (size_t)M * steps; // size_t avoids overflow on very large graphs
  // Allocate memory for all random numbers for all steps
  cudaMalloc(&d_edge_rands, total_randoms * sizeof(float));
// -------------------------------------------------------------------------
// CUDA events for timing
// -------------------------------------------------------------------------
cudaEvent_t start_total, stop_total;
cudaEvent_t start_rand_gen, stop_rand_gen;
cudaEvent_t start_kernel, stop_kernel;
cudaEventCreate(&start_total);
cudaEventCreate(&stop_total);
cudaEventCreate(&start_rand_gen);
cudaEventCreate(&stop_rand_gen);
cudaEventCreate(&start_kernel);
cudaEventCreate(&stop_kernel);
// -------------------------------------------------------------------------
// Pre-generate all random numbers once
// -------------------------------------------------------------------------
cudaEventRecord(start_rand_gen, 0);
generate_edge_randoms_per_edge<<<blocks_edges, threads>>>(M, steps, d_edge_rands, d_randStates);
cudaDeviceSynchronize();
cudaEventRecord(stop_rand_gen, 0);
cudaEventSynchronize(stop_rand_gen);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
std::cerr << "generate_edge_randoms_per_edge failed: " << cudaGetErrorString(err)
<< std::endl;
return 1;
}
float ms_rand_gen = 0.0f;
cudaEventElapsedTime(&ms_rand_gen, start_rand_gen, stop_rand_gen);
std::cout << "Pre-generation time: " << ms_rand_gen << " ms" << std::endl;
// -------------------------------------------------------------------------
// Main simulation loop
// -------------------------------------------------------------------------
cudaEventRecord(start_total, 0);
for (int t = 0; t < steps; ++t) {
const float *edge_rands_step = d_edge_rands + (size_t)t * (size_t)M;
// Step 1: warp-level batch kernel
cudaEventRecord(start_kernel, 0);
epidemic_step_warp_batch<<<blocks_vertices, threads_per_block>>>(
N, d_offsets, d_neighbors, d_state_curr, d_state_next, edge_rands_step, p);
cudaEventRecord(stop_kernel, 0);
cudaEventSynchronize(stop_kernel);
// error check
cudaError_t kerr = cudaGetLastError();
if (kerr != cudaSuccess) {
std::cerr << "epidemic_step_warp_batch failed at step " << t << ": "
<< cudaGetErrorString(kerr) << std::endl;
return 1;
}
float ms_kernel = 0.0f;
cudaEventElapsedTime(&ms_kernel, start_kernel, stop_kernel);
std::cout << "Step " << t << " kernel time: " << ms_kernel << " ms" << std::endl;
// cudaDeviceSynchronize();
// Swap state buffers
std::swap(d_state_curr, d_state_next);
}
cudaEventRecord(stop_total, 0);
cudaEventSynchronize(stop_total);
float ms_total = 0.0f;
cudaEventElapsedTime(&ms_total, start_total, stop_total);
std::cout << "Total simulation time: " << ms_total << " ms" << std::endl;
// -------------------------------------------------------------------------
// Copy results and print summary
// -------------------------------------------------------------------------
cudaMemcpy(state.data(), d_state_curr, N * sizeof(int), cudaMemcpyDeviceToHost);
int infected = 0, recovered = 0, susceptible = 0;
for (int s : state) {
if (s == INFECTED)
infected++;
else if (s == RECOVERED)
recovered++;
else
susceptible++;
}
std::cout << "S=" << susceptible << " I=" << infected << " R=" << recovered << std::endl;
// -------------------------------------------------------------------------
// Cleanup
// -------------------------------------------------------------------------
cudaFree(d_offsets);
cudaFree(d_neighbors);
cudaFree(d_state_curr);
cudaFree(d_state_next);
cudaFree(d_edge_rands);
cudaFree(d_randStates);
cudaEventDestroy(start_total);
cudaEventDestroy(stop_total);
cudaEventDestroy(start_rand_gen);
cudaEventDestroy(stop_rand_gen);
cudaEventDestroy(start_kernel);
cudaEventDestroy(stop_kernel);
cudaDeviceReset();
return 0;
}
Now each warp processes one vertex completely.
Warp threads distribute edges across batches:
edge_idx = offsets[u] + lane_id + batch*warpSize
Batch continues until all edges are processed. Even if a hub has 100+ neighbors, the warp processes them in batches of 32 edges. All warp threads are occupied, idle threads appear only in the last incomplete batch. All warp threads execute identical instructions (SIMT):
- Take an edge
- Check infection
- Update state_next
Divergence between threads is minimal, and scaling to millions of vertices with tens of millions of edges is significantly improved.
4.2.5 Implementation #5: Warp-Level Batch Processing (Pull Model)
4.2.5.1 Roofline Analysis of the Previous Approach
The previous GPU implementations followed a classical push-based infection model: each infected node iterated over its neighbors and attempted to infect them.
From an algorithmic viewpoint this is natural; however, from the GPU-architecture perspective, it leads to poor arithmetic intensity and thus a memory-bound regime.
Memory accesses per neighbor (previous approach):

| Access | Size (bytes) | Notes |
|---|---|---|
| `state_curr[u]` (read) | 4 | node state |
| `state_next[u]` (write) | 4 | copy initial state |
| `offsets[u]` (read) | 4 | neighbor list start |
| `offsets[u+1]` (read) | 4 | neighbor list end |
| `neighbors[edge_idx]` (read) | 4 | neighbor index |
| `state_curr[v]` (read) | 4 | neighbor state |
| `state_next[v]` (write) | 4 | update neighbor if infected |
| `state_next[u]` (write) | 4 | mark as recovered |
| **Total per neighbor** | **32 bytes** | simple sequential reads and writes |
Characteristics:
- Memory accesses are mostly coalesced, sequential, and predictable.
- Minimal warp divergence.
- No warp-wide synchronization or additional kernel launches.
- GPU is mostly memory-bound but benefits from simple, direct access patterns. This behavior is typical for graph algorithms, where memory-access irregularity dominates computational cost.
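To express this in roofline terms (using the Tesla T4's 320 GB/s memory bandwidth quoted in Section 5, and its roughly 8 TFLOP/s FP32 peak as an assumption):

arithmetic intensity ≈ a few comparisons per 32 bytes ≈ 0.1 FLOP/byte or less
T4 ridge point ≈ 8,000 GFLOP/s ÷ 320 GB/s ≈ 25 FLOP/byte

Since 0.1 is far below 25, the kernel sits deep in the memory-bound region of the roofline: only reducing or restructuring memory traffic, not adding compute, can improve its throughput.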
4.2.5.2 How the New Approach Improves the Situation
The key change is a switch from a push model to a pull model:
“I am a node. Let me check all my neighbors and see if any of them can infect me.”
These changes:
✔ Reduce global memory writes
✔ Make neighbor iterations more warp-friendly
✔ Increase data reuse within a warp
✔ Flatten control flow (less divergence)
Even though the arithmetic intensity remains low, the number of global memory transactions per useful operation decreases, improving real throughput.
Warp-Batch Pull-Based Kernel
// ---------- Infection Kernel (warp-batch) ----------
// reads states_prev, writes states_next (one warp per node)
__global__ void epidemic_infection_kernel(int N,
const int * __restrict__ offsets,
const int * __restrict__ neighbors,
const float * __restrict__ edge_rands,
const unsigned int * __restrict__ states_prev,
unsigned int * __restrict__ states_next,
int * __restrict__ infected_steps,
int step,
int M)
{
int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / d_warpSize;
int lane = threadIdx.x % d_warpSize;
int node = warpId;
if (node >= N) return;
// copy previous state by default
unsigned int prev_state = states_prev[node];
states_next[node] = prev_state; // single warp writes this entry -> no race
// only SUSCEPTIBLE nodes can be infected
if (prev_state != SUSCEPTIBLE) return;
int start = offsets[node];
int end = offsets[node + 1];
bool infected_flag = false;
// each lane checks a strided subset of edges
for (int e = start + lane; e < end; e += d_warpSize) {
int neigh = neighbors[e];
if (states_prev[neigh] == INFECTED) { // read from snapshot
float r = edge_rands[step * M + e];
if (r < INFECTION_PROB_THRESHOLD) {
infected_flag = true;
}
}
}
// warp-wide OR
infected_flag = __any_sync(0xffffffff, infected_flag);
// only lane 0 commits the change for this node
if (lane == 0 && infected_flag) {
states_next[node] = INFECTED;
// newly infected: reset counter (safe because prev_state was SUSCEPTIBLE)
infected_steps[node] = 0;
}
}
// ---------- Recovery Kernel (local) ----------
// reads states_prev to decide who was infected at start of step,
// increments infected_steps for those and writes RECOVERED into states_next if condition met.
__global__ void epidemic_recovery_kernel(int N,
const unsigned int * __restrict__ states_prev,
unsigned int * __restrict__ states_next,
int * __restrict__ infected_steps)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= N) return;
// Only nodes that were INFECTED at the beginning of timestep are eligible to progress toward recovery this step.
if (states_prev[idx] == INFECTED) {
// increment counter - single thread writes to infected_steps[idx] here (no concurrent writes to same index)
int v = infected_steps[idx] + 1;
infected_steps[idx] = v;
if (v >= RECOVERY_TIME) {
states_next[idx] = RECOVERED;
}
}
}
Let us examine memory traffic per node.
The process is split into two kernels: infection and recovery. Each warp still handles a single node. Infection is computed only for susceptible nodes. A warp-wide reduction (__any_sync) decides whether the node should become infected. Only lane 0 writes the result. Recovery updates infection counters and marks nodes as recovered if they reach the threshold.
Memory accesses per neighbor (approximate, combining both kernels):

| Access | Size (bytes) | Notes |
|---|---|---|
| `states_prev[node]` (read) | 4 | node state |
| `states_next[node]` (write) | 4 | copy previous state |
| `offsets[node]` & `offsets[node+1]` (read) | 8 | neighbor list |
| `neighbors[e]` (read) | 4 | neighbor index |
| `states_prev[neigh]` (read) | 4 | neighbor state |
| `edge_rands[...]` (read) | 4 | infection probability |
| `states_next[node]` (write, lane 0) | <4 amortized | single commit per warp |
| `infected_steps[node]` (read/write) | 8 | infection counter update |
| `states_next[idx]` (write, recovery) | 4 | mark recovered |
| **Total per neighbor** | **≈44–48 bytes** | includes both kernels and extra counters |
Characteristics:
- Additional memory accesses and two kernels increase the raw byte count.
- The warp-wide reduction reduces the number of concurrent writes to states_next.
- The recovery kernel serializes counter updates efficiently, avoiding write conflicts.
- Access patterns are more structured, reducing memory contention despite the higher byte counts.
Main GPU Loop
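The listing below relies on a few helpers and constants that are not shown in the article: checkCuda, count_states_kernel, the __constant__ d_warpSize, and the MAX_STEPS / INFECTION_PROB_THRESHOLD / RECOVERY_TIME constants. A minimal sketch of what they might look like follows; the values and signatures are assumptions inferred from how they are used, and SUSCEPTIBLE / INFECTED / RECOVERED are the 0/1/2 constants from Section 4.1.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// --- constants assumed by the kernels above (values are illustrative) ---
#define INFECTION_PROB_THRESHOLD 0.05f  // infection probability p
#define RECOVERY_TIME 1                 // steps a node stays infected
#define MAX_STEPS 50                    // number of simulation steps

__constant__ int d_warpSize;            // filled from cudaDeviceProp::warpSize in main()

// Abort with a message if a CUDA call failed.
static void checkCuda(cudaError_t err, const char *what) {
  if (err != cudaSuccess) {
    std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    std::exit(1);
  }
}

// Count S/I/R nodes into counts[0..2] using atomics.
__global__ void count_states_kernel(int N, const unsigned int *states,
                                    unsigned int *counts) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= N) return;
  atomicAdd(&counts[states[idx]], 1u);
}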
int main(int argc, char **argv) {
if (argc != 3) { std::cerr << "Usage: " << argv[0] << " offsets.bin neighbors.bin\n"; return 1; }
auto offsets = read_bin_file(argv[1]);
auto neighbors = read_bin_file(argv[2]);
int N = offsets.size() - 1;
int M = neighbors.size();
float avg_deg = static_cast<float>(M) / N;
std::cout << "Nodes: " << N << ", Edges: " << M
<< ", Avg connections per node: " << avg_deg << "\n";
// device warp size
int currentDevice;
checkCuda(cudaGetDevice(¤tDevice), "cudaGetDevice");
cudaDeviceProp deviceProp;
checkCuda(cudaGetDeviceProperties(&deviceProp, currentDevice), "cudaGetDeviceProperties");
int warp_size = deviceProp.warpSize;
checkCuda(cudaMemcpyToSymbol(d_warpSize, &warp_size, sizeof(int)), "cudaMemcpyToSymbol(d_warpSize)");
std::cout << "GPU Warp size: " << warp_size << std::endl;
// Allocate device memory
int *d_offsets = nullptr, *d_neighbors = nullptr;
unsigned int *d_states_A = nullptr, *d_states_B = nullptr; // double buffer
float *d_edge_rands = nullptr;
int *d_infected_steps = nullptr;
curandState *d_randStates = nullptr;
unsigned int *d_counts = nullptr;
checkCuda(cudaMalloc(&d_offsets, offsets.size() * sizeof(int)), "cudaMalloc d_offsets");
checkCuda(cudaMalloc(&d_neighbors, neighbors.size() * sizeof(int)), "cudaMalloc d_neighbors");
checkCuda(cudaMalloc(&d_states_A, N * sizeof(unsigned int)), "cudaMalloc d_states_A");
checkCuda(cudaMalloc(&d_states_B, N * sizeof(unsigned int)), "cudaMalloc d_states_B");
checkCuda(cudaMalloc(&d_infected_steps, N * sizeof(int)), "cudaMalloc d_infected_steps");
checkCuda(cudaMalloc(&d_edge_rands, (size_t)MAX_STEPS * M * sizeof(float)), "cudaMalloc d_edge_rands");
checkCuda(cudaMalloc(&d_randStates, M * sizeof(curandState)), "cudaMalloc d_randStates");
checkCuda(cudaMalloc(&d_counts, 3 * sizeof(unsigned int)), "cudaMalloc d_counts");
checkCuda(cudaMemcpy(d_offsets, offsets.data(), offsets.size() * sizeof(int), cudaMemcpyHostToDevice), "memcpy offsets");
checkCuda(cudaMemcpy(d_neighbors, neighbors.data(), neighbors.size() * sizeof(int), cudaMemcpyHostToDevice), "memcpy neighbors");
// Initialize host states: mark a few initial infected for testing
std::vector<unsigned int> h_states(N, SUSCEPTIBLE);
if (N > 0) h_states[0] = INFECTED;
if (N > 1) h_states[1] = INFECTED;
if (N > 2) h_states[2] = INFECTED;
// infected_steps: -1 means never infected yet; for already infected set 0
std::vector<int> h_infected_steps(N, -1);
for (int i = 0; i < N; ++i) if (h_states[i] == INFECTED) h_infected_steps[i] = 0;
// copy initial states into both buffers (so states_prev can be derived trivially by swapping)
checkCuda(cudaMemcpy(d_states_A, h_states.data(), N * sizeof(unsigned int), cudaMemcpyHostToDevice), "memcpy states_A");
checkCuda(cudaMemcpy(d_states_B, h_states.data(), N * sizeof(unsigned int), cudaMemcpyHostToDevice), "memcpy states_B");
checkCuda(cudaMemcpy(d_infected_steps, h_infected_steps.data(), N * sizeof(int), cudaMemcpyHostToDevice), "memcpy infected_steps");
// prepare kernel launch params
int threadsPerBlock = 128;
int warpsPerBlock = threadsPerBlock / warp_size;
int numWarps = N; // 1 warp per node
int blocksNodes = (numWarps + warpsPerBlock - 1) / warpsPerBlock;
int blocksEdges = (M + threadsPerBlock - 1) / threadsPerBlock;
int blocksForCount = (N + threadsPerBlock - 1) / threadsPerBlock;
// ---------- RNG pre-generation timing ----------
cudaEvent_t ev_gen_start, ev_gen_stop;
cudaEventCreate(&ev_gen_start);
cudaEventCreate(&ev_gen_stop);
cudaEventRecord(ev_gen_start);
init_rand_states<<<blocksEdges, threadsPerBlock>>>(d_randStates, M, 1234567UL);
generate_edge_randoms_per_edge<<<blocksEdges, threadsPerBlock>>>(M, MAX_STEPS, d_edge_rands, d_randStates);
cudaEventRecord(ev_gen_stop);
cudaEventSynchronize(ev_gen_stop);
float gen_ms = 0.f;
cudaEventElapsedTime(&gen_ms, ev_gen_start, ev_gen_stop);
std::cout << "[Timing] RNG pre-generation: " << gen_ms << " ms\n";
// ---------- Events reused for per-step timing ----------
cudaEvent_t ev_step_start, ev_step_mid, ev_step_stop;
cudaEventCreate(&ev_step_start);
cudaEventCreate(&ev_step_mid);
cudaEventCreate(&ev_step_stop);
// double-buffer pointers: states_prev = A, states_next = B (will swap each step)
unsigned int *states_prev = d_states_A;
unsigned int *states_next = d_states_B;
bool print_per_step = true;
// ---------- Per-step loop (with per-step timing) ----------
for (int step = 0; step < MAX_STEPS; ++step) {
// record start
cudaEventRecord(ev_step_start);
// Infection (reads states_prev, writes states_next)
epidemic_infection_kernel<<<blocksNodes, threadsPerBlock>>>(
N, d_offsets, d_neighbors, d_edge_rands, states_prev, states_next, d_infected_steps, step, M);
// mark mid event
cudaEventRecord(ev_step_mid);
// Recovery (reads states_prev, may write RECOVERED into states_next, increments infected_steps for previous infected)
epidemic_recovery_kernel<<<blocksNodes, threadsPerBlock>>>(
N, states_prev, states_next, d_infected_steps);
// record stop and synchronize
cudaEventRecord(ev_step_stop);
cudaEventSynchronize(ev_step_stop);
float t_start_mid = 0.f, t_mid_stop = 0.f;
cudaEventElapsedTime(&t_start_mid, ev_step_start, ev_step_mid); // infection time
cudaEventElapsedTime(&t_mid_stop, ev_step_mid, ev_step_stop); // recovery time
//t_total = t_start_mid + t_mid_stop;
std::printf("[Timing] Step %2d S->I: %.3f ms, I->R: %.3f ms", step, t_start_mid, t_mid_stop);
// optionally count states after the step (costly - do only if needed)
if (print_per_step) {
checkCuda(cudaMemset(d_counts, 0, 3 * sizeof(unsigned int)), "memset counts");
count_states_kernel<<<blocksForCount, threadsPerBlock>>>(N, states_next, d_counts);
unsigned int h_counts[3] = {0,0,0};
checkCuda(cudaMemcpy(h_counts, d_counts, 3 * sizeof(unsigned int), cudaMemcpyDeviceToHost), "memcpy counts to host");
std::printf(" | after step %2d : S=%u I=%u R=%u\n", step, h_counts[SUSCEPTIBLE], h_counts[INFECTED], h_counts[RECOVERED]);
}
// swap buffers for next iteration
std::swap(states_prev, states_next);
}
// ---------- Total-only timing loop (same logic but timed as a whole) ----------
// Reset states back to initial (we kept d_states_A/d_states_B modified) — reload initial host state
checkCuda(cudaMemcpy(d_states_A, h_states.data(), N * sizeof(unsigned int), cudaMemcpyHostToDevice), "reset states_A");
checkCuda(cudaMemcpy(d_states_B, h_states.data(), N * sizeof(unsigned int), cudaMemcpyHostToDevice), "reset states_B");
// reset infected_steps
checkCuda(cudaMemcpy(d_infected_steps, h_infected_steps.data(), N * sizeof(int), cudaMemcpyHostToDevice), "reset infected_steps");
// reuse states_prev/states_next variables
states_prev = d_states_A;
states_next = d_states_B;
cudaEvent_t ev_total_start, ev_total_stop;
cudaEventCreate(&ev_total_start);
cudaEventCreate(&ev_total_stop);
cudaEventRecord(ev_total_start);
for (int step = 0; step < MAX_STEPS; ++step) {
epidemic_infection_kernel<<<blocksNodes, threadsPerBlock>>>(
N, d_offsets, d_neighbors, d_edge_rands, states_prev, states_next, d_infected_steps, step, M);
epidemic_recovery_kernel<<<blocksNodes, threadsPerBlock>>>(
N, states_prev, states_next, d_infected_steps);
std::swap(states_prev, states_next);
}
cudaEventRecord(ev_total_stop);
cudaEventSynchronize(ev_total_stop);
float total_ms = 0.f;
cudaEventElapsedTime(&total_ms, ev_total_start, ev_total_stop);
std::cout << "[Timing] Total epidemic loop: " << total_ms << " ms\n";
// ---------- Final counts (from states_prev because we swapped after last step) ----------
checkCuda(cudaMemset(d_counts, 0, 3 * sizeof(unsigned int)), "memset counts final");
count_states_kernel<<<blocksForCount, threadsPerBlock>>>(N, states_prev, d_counts);
unsigned int h_counts[3] = {0,0,0};
checkCuda(cudaMemcpy(h_counts, d_counts, 3 * sizeof(unsigned int), cudaMemcpyDeviceToHost), "memcpy final counts");
std::printf("Final states: S=%u I=%u R=%u\n", h_counts[SUSCEPTIBLE], h_counts[INFECTED], h_counts[RECOVERED]);
// Cleanup
cudaFree(d_offsets);
cudaFree(d_neighbors);
cudaFree(d_states_A);
cudaFree(d_states_B);
cudaFree(d_edge_rands);
cudaFree(d_infected_steps);
cudaFree(d_randStates);
cudaFree(d_counts);
cudaEventDestroy(ev_gen_start);
cudaEventDestroy(ev_gen_stop);
cudaEventDestroy(ev_step_start);
cudaEventDestroy(ev_step_mid);
cudaEventDestroy(ev_step_stop);
cudaEventDestroy(ev_total_start);
cudaEventDestroy(ev_total_stop);
return 0;
}
5. Results and Experiments
The research utilized real-world social network data from the SNAP database, specifically the Facebook social graph with 4,039 users and 88,234 friendship ties (176,468 directed edges once each friendship is stored in both directions in CSR). This dataset, available at https://snap.stanford.edu/data/facebook_combined.txt.gz, represents authentic human social interactions and provides a realistic foundation for testing our epidemic models after conversion to CSR format.
To evaluate the performance of the different implementations, a series of experiments was conducted.
Experimental Setup:
- Graph: the Facebook graph described above, with 4,039 vertices and 176,468 directed edges.
- Model parameters: infection probability p = 0.05, number of simulation steps steps = 50.
- Hardware configuration:
  - GPU: NVIDIA Tesla T4 with 2,560 CUDA cores, 15,360 MB GDDR6 memory, Compute Capability 7.5
  - CPU: Intel Xeon CPU @ 2.00GHz, 2 cores/4 threads, 38.5 MB L3 cache
  - Software: CUDA 12.5, Driver Version 550.54.15
Execution Time Comparison:
The table below presents the average total execution time for each implementation.
| Version | Algorithm Characteristics | Total simulation time | Additional costs | Notes |
|---|---|---|---|---|
| CPU (Naive) | One thread handles the vertices sequentially | 4.11 ms | Simple implementation | Baseline |
| GPU #1 — Naive (vertex-based) | One thread per node, per-node loop over neighbors, racy non-atomic writes, strong divergence | 5.58 ms | Minor init overhead | Slowest version; severe divergence and scattered memory accesses |
| GPU #2 — Edge-Based | One thread per edge, two kernels per step (infection, mark_recovered) | 5.45 ms | Many small kernel launches | Good load balance; graph too small to saturate GPU |
| GPU #3 — Atomic-Free | Pre-generated randomness, no atomics, two kernels | 5.28 ms | 7.4 ms RNG pregen | Per-step faster, but pre-generation dominates total time |
| GPU #4 — Warp-Batch | One warp processes a node, neighbors processed in warp batches | 2.47 ms | 7.4 ms RNG pregen | First major speedup; ~2.2× faster than the naive GPU version |
| GPU #5 — Warp-Pull (two kernels) | Warp-pull model with separate infection and recovery kernels | 0.821 ms | 7.44 ms RNG pregen | Fastest version; ~3× faster than warp-batch, ~6.8× faster than the naive GPU version |
Hardware-Specific Performance Analysis:
CPU Context: The Intel Xeon processor operates at 2.0GHz with only 2 physical cores, representing a modest server-grade CPU. The naive CPU implementation serves as a realistic baseline for entry-level cloud computing environments.
GPU #1 (Naive): The Tesla T4, despite its 2,560 CUDA cores, shows worse performance than CPU due to inefficient utilization. This demonstrates that simply porting CPU algorithms to GPU without architectural optimization can yield poor results, especially on datacenter-grade GPUs like T4.
GPU #2 & #3 (Intermediate): These implementations begin to leverage GPU parallelism but are limited by memory access patterns and synchronization overhead. The T4's memory bandwidth (320 GB/s) is underutilized due to uncoalesced accesses.
GPU #4 (Warp-Level Batch Model): This implementation marks the first major architectural shift that begins to fully exploit the Tesla T4's internal parallelism. Instead of assigning a single thread to each vertex, GPU #4 assigns an entire warp (32 threads) to process one node cooperatively.
Architectural Advantages:
- Warp-per-node parallelism: every warp scans a node's adjacency list in a coordinated fashion, significantly reducing intra-warp divergence compared to prior versions.
- Coalesced memory access: neighbor lists are consumed in warp-aligned batches, improving effective memory-bandwidth utilization within the constraints of the graph's irregular degree distribution.
- No atomics: because each warp exclusively "owns" a vertex, there is no need for atomic operations.
- Reduced synchronization overhead: the algorithm relies purely on natural warp-synchronous behavior; no explicit warp intrinsics were needed in this version.
Outcome: GPU #4 demonstrates a ~2.2× speedup relative to earlier approaches, even though the graph is extremely small by GPU standards (4k nodes). This clearly shows that restructuring computation to respect warp execution patterns is significantly more impactful than raw FLOPs or core count.
Energy Efficiency Considerations: the Tesla T4's 70W TDP combined with the highly parallel warp-batch design means the GPU completes the simulation faster and returns to its low-power idle state sooner, reducing energy consumption in real-world server deployments.
GPU #5 (Warp-Pull Model, Two-Kernel Architecture): The fifth implementation represents the culmination of all optimizations, combining a warp-synchronous infection model with a clean separation of infection (S→I) and recovery (I→R) transitions. This design transforms the algorithm into a memory-efficient, low-branching pipeline.
Architectural Advantages:
- Warp-per-node "pull" traversal: instead of infected nodes pushing infection outward, susceptible nodes pull infection status from their neighbors. This aligns perfectly with warp processing: all 32 lanes of the warp inspect different neighbors of the same node.
- Coalesced memory access: adjacency lists are consumed in tightly strided access patterns, minimizing cache misses and maximizing throughput on the T4's GDDR6 memory subsystem.
- Zero atomics and conflict-free writes: only lane 0 of each warp commits the infection result for its node, eliminating write contention.
- Two-kernel design reduces branching: the infection kernel handles only susceptible nodes and the recovery kernel handles only infected nodes, which eliminates the large conditional branches that previous versions suffered from.
- Deterministic and cache-friendly double-buffering: using states_prev → states_next eliminates read/write hazards and ensures stable data for each phase.
Outcome: GPU #5 achieves the fastest total simulation time (0.82 ms), roughly a 6.8× improvement over the naive GPU version and 3× over the first warp-level approach. Despite the graph being relatively small, GPU #5 achieves near-optimal memory behavior for a Turing GPU executing an irregular graph algorithm.
Energy and Datacenter Context: due to extremely short kernel runtimes and reduced divergence, GPU #5 lets the Tesla T4 finish its work and drop back to idle power almost immediately. In large-scale simulations typical for epidemiology (millions of nodes), this architectural pattern should scale nearly linearly and provide even greater energy and cost savings in datacenter environments.
6. Conclusion
In this work, we set out to understand how to run epidemic simulations efficiently on the GPU — and we quickly discovered that raw GPU power means little without the right algorithmic structure.
Our initial versions were intentionally naive, and they taught us an important lesson: a GPU with thousands of cores can easily be slower than a small CPU if memory access is irregular. The early implementations were heavily memory-bound and suffered from warp divergence, so the hardware simply couldn’t show its strengths.
The turning point came when we reorganized the simulation around warp-level cooperation. Instead of assigning work to individual threads, we let an entire warp process one node together. This single architectural shift dramatically improved memory coherence, reduced divergence, and aligned the algorithm with how GPUs naturally execute code.
The final “warp-pull” version pushed the idea further. By splitting infection and recovery into two simple kernels and letting susceptible nodes pull information from neighbors, we removed synchronisation hazards and eliminated unnecessary atomics. As a result, the GPU finally worked at full efficiency — and the simulation time dropped to under a millisecond.
Overall, we learned that the biggest speedups don’t come from increasing the number of threads, but from restructuring the algorithm so the GPU can use its memory system and warps effectively. Once we respected the hardware’s execution model, performance followed almost automatically.
7. References
- Kermack W.O., McKendrick A.G. (1927). A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society A.
- NVIDIA. CUDA C Programming Guide. (2024).
- Pastor-Satorras R., Vespignani A. (2001). Epidemic spreading in scale-free networks. Physical Review Letters.
