1. Introduction
What if outbreak simulations that take hours could run in minutes? This work explores that question by harnessing the parallel power of modern GPUs to accelerate epidemic simulations. Using NVIDIA's CUDA platform, we model disease spread over real social networks, in our experiments the SNAP Facebook friendship graph. By comparing a CPU baseline against five progressively optimized GPU implementations, we show that restructuring the algorithm around the GPU's warp execution model matters far more than raw core count: the final warp-level "pull" version completes the full simulation in under a millisecond, roughly 5× faster than the CPU baseline and almost 7× faster than a naive GPU port, while preserving the stochastic SIR dynamics. Faster simulations let public health analysts explore far more outbreak scenarios in the same amount of time.
This paper explores the acceleration of epidemic spread simulations on large-scale graphs using NVIDIA CUDA. The epidemic is modeled using the classical SIR framework, where infection transmission occurs between graph vertices.
2. Theoretical Model
2.1 SIR Model
The classical SIR model defines three states:
- S(t) — number of susceptible individuals
- I(t) — number of infected individuals
- R(t) — number of recovered individuals
The continuous-time model is governed by the differential equations

dS/dt = -β·S·I / N
dI/dt = β·S·I / N - γ·I
dR/dt = γ·I

where β is the infection rate, γ is the recovery rate, and N = S + I + R is the (constant) population size.
In the discrete probabilistic version used for graph simulations, each infected node independently infects each of its susceptible neighbors with probability p = β, and becomes recovered after one step.
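Since each infected neighbor transmits independently with probability p, a susceptible node with k infected neighbors becomes infected during one step with probability

P(infected in one step) = 1 - (1 - p)^k

This is a direct consequence of the rule above, not an extra model assumption, and it is handy for sanity-checking simulation output on small graphs.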
3. Population as a Graph
3.1 Population Graph Basics
The population is modeled as a graph G=(V,E), where |V|=N represents individuals and |E|=M represents contacts.
Graph types:
- Erdős–Rényi random graph — each edge exists independently with probability p;
- Scale-free (Barabási–Albert) — better matches real social networks.
The graph is stored in CSR (Compressed Sparse Row) format:
- offsets[i] — starting index of the neighbors of vertex i;
- neighbors[] — flattened list of all neighbors.
3.1.1 CSR (Compressed Sparse Row) Format Visualization
The CSR (Compressed Sparse Row) format stores a sparse graph efficiently by flattening adjacency lists into a single array. It consists of two main arrays:
- offsets[i] — index in the neighbors array where the adjacency list of vertex i starts.
- neighbors[] — concatenated adjacency lists of all vertices.
Example graph:
Vertex 0: 1, 2
Vertex 1: 0
Vertex 2: 0, 3
Vertex 3: 2
CSR representation:
offsets = [0, 2, 3, 5, 6]
neighbors = [1, 2, 0, 0, 3, 2]
Explanation:
- Vertex 0 neighbors: indices `offsets[0] = 0` to `offsets[1]-1 = 1` → neighbors `[1, 2]`
- Vertex 1 neighbors: indices `offsets[1] = 2` to `offsets[2]-1 = 2` → neighbors `[0]`
- Vertex 2 neighbors: indices `offsets[2] = 3` to `offsets[3]-1 = 4` → neighbors `[0, 3]`
- Vertex 3 neighbors: indices `offsets[3] = 5` to `offsets[4]-1 = 5` → neighbors `[2]`
This format allows fast sequential access to neighbors and coalesced memory access on GPUs, which is crucial for parallel graph algorithms.
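To make the CSR layout concrete, here is a minimal host-side sketch that builds the offsets/neighbors arrays for the example graph above and writes them in a raw 32-bit-integer binary format. The write_bin_file helper is a hypothetical companion to the read_bin_file helper the simulation programs later assume; file names and the on-disk format are assumptions, not part of the original article.

#include <fstream>
#include <string>
#include <vector>

// Build CSR arrays from an adjacency-list representation.
static void build_csr(const std::vector<std::vector<int>> &adj,
                      std::vector<int> &offsets, std::vector<int> &neighbors) {
  offsets.assign(adj.size() + 1, 0);
  for (size_t u = 0; u < adj.size(); ++u)
    offsets[u + 1] = offsets[u] + static_cast<int>(adj[u].size());
  neighbors.clear();
  neighbors.reserve(offsets.back());
  for (const auto &list : adj)
    neighbors.insert(neighbors.end(), list.begin(), list.end());
}

// Hypothetical helper: dump a vector of 32-bit ints as raw bytes.
static void write_bin_file(const std::string &path, const std::vector<int> &v) {
  std::ofstream out(path, std::ios::binary);
  out.write(reinterpret_cast<const char *>(v.data()), v.size() * sizeof(int));
}

int main() {
  // The 4-vertex example graph from above.
  std::vector<std::vector<int>> adj = {{1, 2}, {0}, {0, 3}, {2}};
  std::vector<int> offsets, neighbors;
  build_csr(adj, offsets, neighbors);  // offsets = [0,2,3,5,6], neighbors = [1,2,0,0,3,2]
  write_bin_file("offsets.bin", offsets);
  write_bin_file("neighbors.bin", neighbors);
  return 0;
}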
4. Implementation
4.1 CPU Implementation
#include <cstdlib>
#include <vector>
using std::vector;

#define SUSCEPTIBLE 0
#define INFECTED 1
#define RECOVERED 2

void epidemic_cpu(const vector<vector<int>> &graph, vector<int> &state,
                  float p, int steps) {
  const int N = graph.size();
  for (int t = 0; t < steps; ++t) {
    vector<int> next_state = state;          // double buffer: read current, write next
    for (int u = 0; u < N; ++u) {
      if (state[u] == INFECTED) {
        for (int v : graph[u]) {             // try to infect each susceptible neighbor
          if (state[v] == SUSCEPTIBLE && ((float)rand() / RAND_MAX) < p)
            next_state[v] = INFECTED;
        }
        next_state[u] = RECOVERED;           // infection lasts exactly one step
      }
    }
    state.swap(next_state);
  }
}
The CPU implementation of the SIR model simulates the epidemic dynamics step by step using two state arrays: state for the current step and next_state for the following step. During each iteration, every infected node attempts to infect its susceptible neighbors with probability p, using a random number generator to model stochastic transmission. After processing its neighbors, an infected node always transitions to the recovered state, reflecting the discrete SIR assumption that infection lasts exactly one time step. The algorithm is simple and easy to understand, but it becomes computationally expensive for large-scale graphs because all nodes are processed sequentially on a single CPU core.
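A minimal driver for the function above might look like this; the tiny contact graph and the parameter values are purely illustrative.

#include <iostream>
#include <vector>

int main() {
  // Tiny illustrative contact graph (adjacency lists), patient zero at vertex 0.
  std::vector<std::vector<int>> graph = {{1, 2}, {0, 2}, {0, 1, 3}, {2}};
  std::vector<int> state(graph.size(), SUSCEPTIBLE);
  state[0] = INFECTED;

  epidemic_cpu(graph, state, /*p=*/0.5f, /*steps=*/10);

  int s = 0, i = 0, r = 0;
  for (int x : state) {
    if (x == SUSCEPTIBLE) ++s;
    else if (x == INFECTED) ++i;
    else ++r;
  }
  std::cout << "S=" << s << " I=" << i << " R=" << r << std::endl;
  return 0;
}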
4.2 GPU Implementation
4.2.1 Implementation #1: Naive Approach
This section presents a naive GPU implementation: each thread handles one vertex and iterates over that vertex's neighbors in the order they appear in the CSR arrays. The code is shown below, followed by an analysis of how this approach can be improved.
#include <curand_kernel.h>
// One thread per vertex; the kernel name matches the call in main() below.
__global__ void epidemic_step_naive(int N, const int *offsets, const int *neighbors,
const int *state_curr, int *state_next, float p,
curandState *randStates) {
int u = blockIdx.x * blockDim.x + threadIdx.x;
if (u >= N) return;
curandState localState = randStates[u];
int state = state_curr[u];
if (state == INFECTED) {
for (int i = offsets[u]; i < offsets[u + 1]; ++i) {
int v = neighbors[i];
if (state_curr[v] == SUSCEPTIBLE) {
float r = curand_uniform(&localState);
if (r < p) state_next[v] = INFECTED;
}
}
state_next[u] = RECOVERED;
} else {
state_next[u] = state_curr[u];
}
randStates[u] = localState;
}
Identified Performance Issues:
- Race Conditions in Neighbor State Updates
if (state_curr[v] == SUSCEPTIBLE) {
float r = curand_uniform(&localState);
if (r < p) state_next[v] = INFECTED;
}
The thread processing vertex u may write to state_next[v], while the thread processing v itself also writes to state_next[v] in its else branch.
Problem: if two infected nodes u1 and u2 share a susceptible neighbor v, both of their threads may write INFECTED to state_next[v]; at the same time, v's own thread copies state_next[v] = state_curr[v] (still SUSCEPTIBLE) at an unspecified moment relative to those writes.
Because the writes are non-atomic and unordered, an infection can be overwritten by the plain copy and silently lost, and results become non-deterministic from run to run.
- Warp Divergence
if (state == INFECTED) { ... } else { ... }
Within a warp, threads may have different states (SUSCEPTIBLE, INFECTED, RECOVERED).
Threads that are not infected only execute state_next[u] = state_curr[u];
Threads that are infected execute the full neighbor loop.
This causes warp divergence, reducing SIMT efficiency and decreasing GPU performance.
- Suboptimal Memory Access Patterns (Global Memory Access)
Access to state_curr[v] and writes to state_next[v] occur randomly, especially in scale-free graphs.
Poor memory coalescence may occur, particularly when neighbors of a single node are scattered in memory.
This reduces GPU memory bandwidth.
- Random Number Generation (curand_uniform)
Random number generation occurs inside the neighbor loop.
The thread uses a local copy of curandState (localState), which avoids races on the RNG state itself, but generating a random number inside the neighbor loop is still computationally expensive, especially for vertices with many neighbors.
- Potential Inefficiency on Hubs (High-Degree Vertices)
If a vertex has many neighbors (a hub), a single thread processes all of its edges sequentially.
The other threads in the hub's warp finish their shorter loops early and sit idle while the hub thread keeps iterating → low warp/SIMT efficiency.
- Use of state_next Without Atomics
If two threads write INFECTED to the same state_next[v] element, the race condition might not be critical (the result remains INFECTED), but it is non-deterministic.
For some models, precise determinism may be important, especially for debugging or experiment reproducibility.
Main GPU Loop
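The main programs in this article call a helper, read_bin_file, that is never shown (the init_rand_states kernel used below is listed in Section 4.2.2). A minimal sketch of read_bin_file, assuming the files store raw 32-bit integers as produced by the write_bin_file helper sketched earlier:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Reads a whole binary file of raw 32-bit ints into a vector.
static std::vector<int> read_bin_file(const std::string &path) {
  std::ifstream in(path, std::ios::binary | std::ios::ate);
  if (!in) {
    std::cerr << "Cannot open " << path << std::endl;
    std::exit(1);
  }
  std::streamsize bytes = in.tellg();
  in.seekg(0, std::ios::beg);
  std::vector<int> data(static_cast<size_t>(bytes) / sizeof(int));
  in.read(reinterpret_cast<char *>(data.data()),
          static_cast<std::streamsize>(data.size() * sizeof(int)));
  return data;
}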
int main(int argc, char **argv) {
if (argc < 3) {
std::cerr << "Usage: " << argv[0] << " <offsets.bin> <neighbors.bin>" << std::endl;
return 1;
}
std::string offsets_file = argv[1];
std::string neighbors_file = argv[2];
std::vector<int> h_offsets = read_bin_file(offsets_file);
std::vector<int> h_neighbors = read_bin_file(neighbors_file);
int N = h_offsets.size() - 1;
int M = h_neighbors.size();
std::cout << "Offsets size: " << N << " Neighbors size: " << M << std::endl;
std::vector<int> state(N, SUSCEPTIBLE);
state[0] = INFECTED; // initial infected node
int *d_offsets, *d_neighbors, *d_state_curr, *d_state_next;
curandState *d_randStates;
cudaMalloc(&d_offsets, h_offsets.size() * sizeof(int));
cudaMalloc(&d_neighbors, h_neighbors.size() * sizeof(int));
cudaMalloc(&d_state_curr, N * sizeof(int));
cudaMalloc(&d_state_next, N * sizeof(int));
cudaMalloc(&d_randStates, N * sizeof(curandState));
cudaMemcpy(d_offsets, h_offsets.data(), h_offsets.size() * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_neighbors, h_neighbors.data(), h_neighbors.size() * sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(d_state_curr, state.data(), N * sizeof(int), cudaMemcpyHostToDevice);
int threads = 256;
int blocks_nodes = (N + threads - 1) / threads;
// -------------------------------------------------------------------------
// Initialize RNG states
// -------------------------------------------------------------------------
init_rand_states<<<blocks_nodes, threads>>>(d_randStates, N, time(NULL));
cudaDeviceSynchronize();
float p = 0.05f; // infection probability
int steps = 50;
// -------------------------------------------------------------------------
// CUDA events for timing
// -------------------------------------------------------------------------
cudaEvent_t start_total, stop_total;
cudaEvent_t start_kernel, stop_kernel;
cudaEventCreate(&start_total);
cudaEventCreate(&stop_total);
cudaEventCreate(&start_kernel);
cudaEventCreate(&stop_kernel);
// -------------------------------------------------------------------------
// Main simulation loop
// -------------------------------------------------------------------------
cudaEventRecord(start_total, 0); // start total timing
for (int t = 0; t < steps; ++t) {
cudaEventRecord(start_kernel, 0); // start kernel timing
epidemic_step_naive<<<blocks_nodes, threads>>>(N, d_offsets, d_neighbors, d_state_curr,
d_state_next, p, d_randStates);
cudaEventRecord(stop_kernel, 0); // stop kernel timing
cudaEventSynchronize(stop_kernel);
float ms_kernel = 0.0f;
cudaEventElapsedTime(&ms_kernel, start_kernel, stop_kernel);
std::cout << "Step " << t << " kernel time: " << ms_kernel << " ms" << std::endl;
cudaDeviceSynchronize();
std::swap(d_state_curr, d_state_next);
}
cudaEventRecord(stop_total, 0);
cudaEventSynchronize(stop_total);
float ms_total = 0.0f;
cudaEventElapsedTime(&ms_total, start_total, stop_total);
std::cout << "Total simulation time: " << ms_total << " ms" << std::endl;
// -------------------------------------------------------------------------
// Copy results and print summary
// -------------------------------------------------------------------------
cudaMemcpy(state.data(), d_state_curr, N * sizeof(int), cudaMemcpyDeviceToHost);
int infected = 0, recovered = 0, susceptible = 0;
for (int s : state) {
if (s == INFECTED)
infected++;
else if (s == RECOVERED)
recovered++;
else
susceptible++;
}
std::cout << "S=" << susceptible << " I=" << infected << " R=" << recovered << std::endl;
// -------------------------------------------------------------------------
// Cleanup
// -------------------------------------------------------------------------
cudaFree(d_offsets);
cudaFree(d_neighbors);
cudaFree(d_state_curr);
cudaFree(d_state_next);
cudaFree(d_randStates);
cudaEventDestroy(start_total);
cudaEventDestroy(stop_total);
cudaEventDestroy(start_kernel);
cudaEventDestroy(stop_kernel);
cudaDeviceReset();
return 0;
}
4.2.2 Implementation #2: Edge-Based Approach
While many of the issues in the first implementation could be patched individually, there is a more interesting solution that eliminates several of them from the start.
Instead of having each thread process a vertex and all of its neighbors, we assign one thread per edge.
This gives us the following benefits:
- No loops over neighbors
In Implementation #1 each thread did:
for i in neighbors[u]:
...
Now it is one thread = one edge → no per-thread loops → far less divergence.
- Load becomes balanced
Some vertices have thousands of neighbors, others only 3.
In the naive approach some threads were idle while others were overloaded.
In the edge-based model:
- all threads have the same amount of work
- GPU throughput increases dramatically
- Better memory access
Now all threads access neighbors using continuous indices (coalesced access).
- Fewer idle threads
Before: most threads were just copying the state (if the vertex was not INFECTED).
Now: each thread processes an edge → almost every thread has real work to do.
So the implementation idea is:
- One thread per edge rather than per vertex → each edge is examined exactly once and the per-vertex neighbor loop disappears.
- Two infected sources can still share a susceptible target, so atomic operations are used only where necessary (for the transition to INFECTED).
Edge-Based Kernel
__global__ void epidemic_step_edge_based(int N, const int *offsets, const int *neighbors,
const int *state_curr, int *state_next, float p,
curandState *randStates) {
int eid = blockIdx.x * blockDim.x + threadIdx.x;
int M = offsets[N];
if (eid >= M) return;
int u = find_src_for_edge(offsets, N, eid);
int v = neighbors[eid];
int state_u = state_curr[u];
int state_v = state_curr[v];
  // Per-edge random draw using the RNG state of the source vertex u.
  // Note: edges sharing the same source are handled by different threads, so they
  // read and later write randStates[u] concurrently; this is a known simplification
  // that Implementation #3 removes by pre-generating one random number per edge.
  curandState local = randStates[u];
if (state_u == INFECTED && state_v == SUSCEPTIBLE) {
float r = curand_uniform(&local);
if (r < p) {
atomicExch(&state_next[v], INFECTED);
}
}
randStates[u] = local;
}
The find_src_for_edge function:
__device__ inline int find_src_for_edge(const int *offsets, int N, int eid) {
// binary search: find u such that offsets[u] <= eid < offsets[u+1]
int left = 0, right = N;
while (left < right) {
int mid = (left + right) >> 1;
if (offsets[mid] <= eid)
left = mid + 1;
else
right = mid;
}
return left - 1;
}
Binary search on offsets is used to find the source vertex u - very fast since the graph is static.
Race conditions on v are eliminated using atomic operations.
Many threads (equal to the number of edges) provide ideal load distribution for scale-free graphs.
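As a quick sanity check of the binary search, here is a small host-side analogue run against the 4-vertex CSR example from Section 3.1.1; the function body mirrors the device code above.

#include <cassert>
#include <vector>

// Host-side mirror of find_src_for_edge: largest u with offsets[u] <= eid.
static int find_src_for_edge_host(const std::vector<int> &offsets, int N, int eid) {
  int left = 0, right = N;
  while (left < right) {
    int mid = (left + right) >> 1;
    if (offsets[mid] <= eid) left = mid + 1;
    else right = mid;
  }
  return left - 1;
}

int main() {
  std::vector<int> offsets = {0, 2, 3, 5, 6};           // example graph from Section 3.1.1
  // neighbors = [1, 2, 0, 0, 3, 2]
  assert(find_src_for_edge_host(offsets, 4, 0) == 0);   // edge 0: 0 -> 1
  assert(find_src_for_edge_host(offsets, 4, 2) == 1);   // edge 2: 1 -> 0
  assert(find_src_for_edge_host(offsets, 4, 3) == 2);   // edge 3: 2 -> 0
  assert(find_src_for_edge_host(offsets, 4, 5) == 3);   // edge 5: 3 -> 2
  return 0;
}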
Vertex Processing (Mark INFECTED → RECOVERED)
__global__ void mark_recovered(int N, int *state_curr, int *state_next) {
int u = blockIdx.x * blockDim.x + threadIdx.x;
if (u >= N) return;
if (state_curr[u] == INFECTED) {
state_next[u] = RECOVERED;
}
}
Minimal branching, works excellently on GPU.
Can be executed after the edge-based kernel.
Parallel initialization of curand states
__global__ void init_rand_states(curandState *randStates, int M, unsigned long seed) {
int edge_id = blockIdx.x * blockDim.x + threadIdx.x;
if (edge_id >= M) return;
curand_init(seed, edge_id, 0, &randStates[edge_id]);
}
Main GPU Loop
int main(int argc, char **argv) {
if (argc < 3) {
std::cerr << "Usage: " << argv[0] << " <offsets.bin> <neighbors.bin>" << std::endl;
return 1;
}
std::string offsets_file = argv[1];
std::string neighbors_file = argv[2];
std::vector<int> h_offsets = read_bin_file(offsets_file);
std::vector<int> h_neighbors = read_bin_file(neighbors_file);
int N = h_offsets.size() - 1;
int M = h_neighbors.size();
std::cout << "Offsets size: " << N << " Neighbors size: " << M << std::endl;
std::vector<int> state(N, SUSCEPTIBLE);
state[0] = INFECTED; // initial infected node
int *d_offsets, *d_neighbors, *d_state_curr, *d_state_next;
curandState *d_randStates;
cudaMalloc(&d_offsets, h_offsets.size() * sizeof(int));
cudaMalloc(&d_neighbors, h_neighbors.size() * sizeof(int));
cudaMalloc(&d_state_curr, N * sizeof(int));
cudaMalloc(&d_state_next, N * sizeof(int));
cudaMalloc(&d_randStates, N * sizeof(curandState));
cudaMemcpy(d_offsets, h_offsets.data(), h_offsets.size() * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_neighbors, h_neighbors.data(), h_neighbors.size() * sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(d_state_curr, state.data(), N * sizeof(int), cudaMemcpyHostToDevice);
int threads = 256;
int blocks_edges = (M + threads - 1) / threads;
int blocks_nodes = (N + threads - 1) / threads;
// -------------------------------------------------------------------------
// Initialize RNG states
// -------------------------------------------------------------------------
init_rand_states<<<blocks_nodes, threads>>>(d_randStates, N, time(NULL));
cudaDeviceSynchronize();
float p = 0.05f; // infection probability
int steps = 50;
// -------------------------------------------------------------------------
// CUDA events for timing
// -------------------------------------------------------------------------
cudaEvent_t start_total, stop_total;
cudaEvent_t start_kernel, stop_kernel;
cudaEvent_t start_mark, stop_mark;
cudaEventCreate(&start_total);
cudaEventCreate(&stop_total);
cudaEventCreate(&start_kernel);
cudaEventCreate(&stop_kernel);
cudaEventCreate(&start_mark);
cudaEventCreate(&stop_mark);
// -------------------------------------------------------------------------
// Main simulation loop
// -------------------------------------------------------------------------
cudaEventRecord(start_total, 0); // start total timing
for (int t = 0; t < steps; ++t) {
    // Step 0: carry the current state into the next buffer so vertices that do not
    // change this step (still susceptible or already recovered) keep their state.
    cudaMemcpy(d_state_next, d_state_curr, N * sizeof(int), cudaMemcpyDeviceToDevice);
    // Step 1: propagate infections along edges
    cudaEventRecord(start_kernel, 0); // start kernel timing
epidemic_step_edge_based<<<blocks_edges, threads>>>(N, d_offsets, d_neighbors, d_state_curr,
d_state_next, p, d_randStates);
cudaEventRecord(stop_kernel, 0); // stop kernel timing
cudaEventSynchronize(stop_kernel);
float ms_kernel = 0.0f;
cudaEventElapsedTime(&ms_kernel, start_kernel, stop_kernel);
std::cout << "Step " << t << " kernel time: " << ms_kernel << " ms" << std::endl;
// Step 2: mark INFECTED -> RECOVERED
cudaEventRecord(start_mark, 0);
mark_recovered<<<blocks_nodes, threads>>>(N, d_state_curr, d_state_next);
cudaEventRecord(stop_mark, 0);
cudaEventSynchronize(stop_mark);
float ms_mark = 0.0f;
cudaEventElapsedTime(&ms_mark, start_mark, stop_mark);
std::cout << "Step " << t << " mark time: " << ms_mark << " ms" << std::endl;
cudaDeviceSynchronize();
// Swap states
std::swap(d_state_curr, d_state_next);
}
cudaEventRecord(stop_total, 0);
cudaEventSynchronize(stop_total);
float ms_total = 0.0f;
cudaEventElapsedTime(&ms_total, start_total, stop_total);
std::cout << "Total simulation time: " << ms_total << " ms" << std::endl;
// -------------------------------------------------------------------------
// Copy results and print summary
// -------------------------------------------------------------------------
cudaMemcpy(state.data(), d_state_curr, N * sizeof(int), cudaMemcpyDeviceToHost);
int infected = 0, recovered = 0, susceptible = 0;
for (int s : state) {
if (s == INFECTED)
infected++;
else if (s == RECOVERED)
recovered++;
else
susceptible++;
}
std::cout << "S=" << susceptible << " I=" << infected << " R=" << recovered << std::endl;
// -------------------------------------------------------------------------
// Cleanup
// -------------------------------------------------------------------------
cudaFree(d_offsets);
cudaFree(d_neighbors);
cudaFree(d_state_curr);
cudaFree(d_state_next);
cudaFree(d_randStates);
cudaEventDestroy(start_total);
cudaEventDestroy(stop_total);
cudaEventDestroy(start_kernel);
cudaEventDestroy(stop_kernel);
cudaEventDestroy(start_mark);
cudaEventDestroy(stop_mark);
cudaDeviceReset();
return 0;
}
4.2.3 Implementation #3: Atomic-Free Optimization
The previous implementation provided a good performance improvement, but it is still not the best we can achieve.
We used atomic operations, which are relatively expensive and should be minimized. In our case we can eliminate them entirely by pre-generating the random numbers; in addition, we introduce warp-level optimization.
Main idea:
- Edge-based kernel without atomics.
- Pre-generate random numbers for every edge and every step, so each thread simply looks up whether its edge transmits the infection.
- The outcome no longer depends on write ordering: concurrent writes to the same vertex all store the same value (INFECTED), and the decision itself comes from pre-generated numbers, so results are reproducible.
- Warp-level optimization: distribute processing of large vertices (hubs) across the threads of one warp to avoid imbalance (developed fully in Implementation #4).
- Batch generation of random numbers for all edges into a separate array.
Pre-generation of Random Numbers
__global__ void generate_edge_randoms_per_edge(int M, int steps, float *edge_rands,
curandState *randStates) {
int eid = blockIdx.x * blockDim.x + threadIdx.x;
if (eid >= M) return;
curandState local = randStates[eid];
// write randoms for all steps in contiguous blocks: edge_rands[step*M + eid]
for (int t = 0; t < steps; ++t) {
edge_rands[(size_t)t * (size_t)M + (size_t)eid] = curand_uniform(&local);
}
randStates[eid] = local;
}
Here edge_rands is an array of size steps × M, laid out as edge_rands[step*M + eid], holding one pre-generated uniform random number for every edge at every simulation step. For the graph used in Section 5 (176,468 directed edges, 50 steps) this costs about 35 MB of device memory.
Atomic-Free Edge-Based Kernel
__global__ void epidemic_step_edge_free(int N, const int *offsets, const int *neighbors,
const int *state_curr, int *state_next,
const float *edge_rands, float p) {
int eid = blockIdx.x * blockDim.x + threadIdx.x;
int M = offsets[N];
if (eid >= M) return;
// Binary search: find source vertex u
int u = find_src_for_edge(offsets, N, eid);
int v = neighbors[eid];
int state_u = state_curr[u];
int state_v = state_curr[v];
// Only propagate infection if u is infected and v is susceptible
if (state_u == INFECTED && state_v == SUSCEPTIBLE) {
if (edge_rands[eid] < p) {
state_next[v] = INFECTED;
}
}
}
Features:
- No atomics: concurrent writes to the same state_next[v] are still possible, but every writer stores the same value (INFECTED) and the decision comes from the pre-generated numbers, so the result is well defined and reproducible.
- Each thread works independently.
Main GPU Loop
int main(int argc, char **argv) {
if (argc < 3) {
std::cerr << "Usage: " << argv[0] << " <offsets.bin> <neighbors.bin>" << std::endl;
return 1;
}
std::string offsets_file = argv[1];
std::string neighbors_file = argv[2];
std::vector<int> h_offsets = read_bin_file(offsets_file);
std::vector<int> h_neighbors = read_bin_file(neighbors_file);
int N = h_offsets.size() - 1;
int M = h_neighbors.size();
std::cout << "Offsets size: " << N << " Neighbors size: " << M << std::endl;
std::vector<int> state(N, SUSCEPTIBLE);
state[0] = INFECTED; // initial infected node
int *d_offsets, *d_neighbors, *d_state_curr, *d_state_next;
float *d_edge_rands;
curandState *d_randStates;
cudaMalloc(&d_offsets, h_offsets.size() * sizeof(int));
cudaMalloc(&d_neighbors, h_neighbors.size() * sizeof(int));
cudaMalloc(&d_state_curr, N * sizeof(int));
cudaMalloc(&d_state_next, N * sizeof(int));
cudaMalloc(&d_randStates, M * sizeof(curandState)); // enough for M edges
cudaMemcpy(d_offsets, h_offsets.data(), h_offsets.size() * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_neighbors, h_neighbors.data(), h_neighbors.size() * sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(d_state_curr, state.data(), N * sizeof(int), cudaMemcpyHostToDevice);
int threads = 256;
int blocks_edges = (M + threads - 1) / threads;
int blocks_nodes = (N + threads - 1) / threads;
// -------------------------------------------------------------------------
// Initialize RNG states
// -------------------------------------------------------------------------
init_rand_states<<<blocks_edges, threads>>>(d_randStates, M, time(NULL));
cudaDeviceSynchronize();
float p = 0.05f; // infection probability
int steps = 50;
size_t total_randoms = (size_t)M * steps; // size_t avoids overflow on very large graphs
  // Allocate memory for all random numbers for all steps
  cudaMalloc(&d_edge_rands, total_randoms * sizeof(float));
// -------------------------------------------------------------------------
// CUDA events for timing
// -------------------------------------------------------------------------
cudaEvent_t start_total, stop_total;
cudaEvent_t start_rand_gen, stop_rand_gen;
cudaEvent_t start_kernel, stop_kernel;
cudaEvent_t start_mark, stop_mark;
cudaEventCreate(&start_total);
cudaEventCreate(&stop_total);
cudaEventCreate(&start_rand_gen);
cudaEventCreate(&stop_rand_gen);
cudaEventCreate(&start_kernel);
cudaEventCreate(&stop_kernel);
cudaEventCreate(&start_mark);
cudaEventCreate(&stop_mark);
// -------------------------------------------------------------------------
// Pre-generate all random numbers once
// -------------------------------------------------------------------------
cudaEventRecord(start_rand_gen, 0);
generate_edge_randoms_per_edge<<<blocks_edges, threads>>>(M, steps, d_edge_rands, d_randStates);
cudaDeviceSynchronize();
cudaEventRecord(stop_rand_gen, 0);
cudaEventSynchronize(stop_rand_gen);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
std::cerr << "generate_edge_randoms_per_edge failed: " << cudaGetErrorString(err)
<< std::endl;
return 1;
}
float ms_rand_gen = 0.0f;
cudaEventElapsedTime(&ms_rand_gen, start_rand_gen, stop_rand_gen);
std::cout << "Pre-generation time: " << ms_rand_gen << " ms" << std::endl;
// -------------------------------------------------------------------------
// Main simulation loop
// -------------------------------------------------------------------------
cudaEventRecord(start_total, 0);
for (int t = 0; t < steps; ++t) {
    const float *edge_rands_step = d_edge_rands + (size_t)t * (size_t)M;
    // Carry the current state into the next buffer so unchanged vertices keep their state.
    cudaMemcpy(d_state_next, d_state_curr, N * sizeof(int), cudaMemcpyDeviceToDevice);
    // Step 1: propagate infection
cudaEventRecord(start_kernel, 0);
epidemic_step_edge_free<<<blocks_edges, threads>>>(N, d_offsets, d_neighbors, d_state_curr,
d_state_next, edge_rands_step, p);
cudaEventRecord(stop_kernel, 0);
cudaEventSynchronize(stop_kernel);
float ms_kernel = 0.0f;
cudaEventElapsedTime(&ms_kernel, start_kernel, stop_kernel);
std::cout << "Step " << t << " kernel time: " << ms_kernel << " ms" << std::endl;
// Step 2: mark INFECTED -> RECOVERED
cudaEventRecord(start_mark, 0);
mark_recovered<<<blocks_nodes, threads>>>(N, d_state_curr, d_state_next);
cudaEventRecord(stop_mark, 0);
cudaEventSynchronize(stop_mark);
float ms_mark = 0.0f;
cudaEventElapsedTime(&ms_mark, start_mark, stop_mark);
std::cout << "Step " << t << " mark time: " << ms_mark << " ms" << std::endl;
cudaDeviceSynchronize();
// Swap state buffers
std::swap(d_state_curr, d_state_next);
}
cudaEventRecord(stop_total, 0);
cudaEventSynchronize(stop_total);
float ms_total = 0.0f;
cudaEventElapsedTime(&ms_total, start_total, stop_total);
std::cout << "Total simulation time: " << ms_total << " ms" << std::endl;
// -------------------------------------------------------------------------
// Copy results and print summary
// -------------------------------------------------------------------------
cudaMemcpy(state.data(), d_state_curr, N * sizeof(int), cudaMemcpyDeviceToHost);
int infected = 0, recovered = 0, susceptible = 0;
for (int s : state) {
if (s == INFECTED)
infected++;
else if (s == RECOVERED)
recovered++;
else
susceptible++;
}
std::cout << "S=" << susceptible << " I=" << infected << " R=" << recovered << std::endl;
// -------------------------------------------------------------------------
// Cleanup
// -------------------------------------------------------------------------
cudaFree(d_offsets);
cudaFree(d_neighbors);
cudaFree(d_state_curr);
cudaFree(d_state_next);
cudaFree(d_edge_rands);
cudaFree(d_randStates);
cudaEventDestroy(start_total);
cudaEventDestroy(stop_total);
cudaEventDestroy(start_rand_gen);
cudaEventDestroy(stop_rand_gen);
cudaEventDestroy(start_kernel);
cudaEventDestroy(stop_kernel);
cudaEventDestroy(start_mark);
cudaEventDestroy(stop_mark);
cudaDeviceReset();
return 0;
}
Edge-Based Warp-Level GPU Kernel Processing Scheme
Let's examine the distribution of threads, vertices, and edges, as well as the computation flow using the last implementation as an example.
Assume:
- Graph with vertices V0 … V5
- Edges between them:
V0 → V1, V0 → V2
V1 → V2, V1 → V3
V2 → V3
V3 → V4
V4 → V5
- Each GPU thread processes one edge
Vertices: V0 V1 V2 V3 V4 V5
States: S I S S S S (S=0, I=1, R=2)
CSR:
offsets = [0, 2, 4, 5, 6, 7, 7]
neighbors = [V1,V2, V2,V3, V3, V4, V5]
Edges (flattened, one per thread):
E0: V0→V1
E1: V0→V2
E2: V1→V2
E3: V1→V3
E4: V2→V3
E5: V3→V4
E6: V4→V5
GPU Threads:
Thread0 -> E0
Thread1 -> E1
Thread2 -> E2
Thread3 -> E3
Thread4 -> E4
Thread5 -> E5
Thread6 -> E6
- Edge-based kernel:
Thread0: V0→V1 state[V0]=S, state[V1]=I -> no infection
Thread1: V0→V2 state[V0]=S, state[V2]=S -> no infection
Thread2: V1→V2 state[V1]=I, state[V2]=S -> edge_rands[2] < p? Yes -> state_next[V2]=I
Thread3: V1→V3 state[V1]=I, state[V3]=S -> edge_rands[3] < p? No -> state_next[V3]=S
Thread4: V2→V3 state[V2]=S, state[V3]=S -> no infection
Thread5: V3→V4 state[V3]=S, state[V4]=S -> no infection
Thread6: V4→V5 state[V4]=S, state[V5]=S -> no infection
- Each thread is independent
- No race conditions since state_next[v] is updated only based on pre-generated random numbers
- Warp-level distribution:
- On GPU, one warp = 32 threads
- If a vertex has many neighbors (hub), edge-based approach distributes edges across all warp threads
Example: Vertex Vh (hub) with 40 neighbors
Warp0: Thread0..31 -> edges E0..E31
Warp1: Thread0..8 -> edges E32..E39
- Advantage: no thread gets stuck on a hub
- Even load distribution across all threads, nearly perfect load balance
- Final representation:
+---------------------------+ +------------------+
| Vertices | | GPU Threads |
| V0 V1 V2 V3 V4 V5 ... | ---> | T0 T1 T2 ... TM |
+---------------------------+ +------------------+
1) Edge-based propagation:
Each thread processes one edge independently:
T0: V0->V1
T1: V0->V2
...
TM: Vh->Vh+neighbors
2) Mark INFECTED -> RECOVERED:
Each thread processes one vertex independently:
T0: V0
T1: V1
...
3) Swap states for next timestep
The edge-based kernel combined with pre-generated random numbers removes any dependence on write ordering, scales well, and distributes the load evenly even for hubs, making it suitable for millions of vertices and tens of millions of edges.
4.2.4 Implementation #4: Warp-Level Batch Processing
Implementation #3 is already suitable for graphs with millions of vertices. However, there is another technique that optimizes it further for extremely large hubs: warp-level batch processing.
On a GPU, a warp is a group of 32 threads executing the same instruction simultaneously. In our problem, if a vertex (hub) has many neighbors, a single thread would have to walk all of its edges sequentially. Warp-level batch processing instead distributes the neighbors of a single vertex across the threads of one warp, so that all 32 threads work on that vertex in parallel. Consider the following example:
Example:
- Hub Vh has 64 neighbors
- Warp = 32 threads
- Split neighbors into batches of 32 edges:
Warp Thread IDs: 0 1 2 ... 31
Batch 0 (edges 0..31):
T0->edge0, T1->edge1, ..., T31->edge31
Batch 1 (edges 32..63):
T0->edge32, T1->edge33, ..., T31->edge63
- Each thread processes its own edge
- Result: one warp is fully loaded, not stalled on a hub
Advantages for SIR simulation:
- Load balance: large vertices don't slow down the warp
- SIMT efficiency: all warp threads execute identical instructions
- Minimal divergence: all threads process similar operations (infection check + state update)
- Suitable for scale-free graphs with hubs
Code implementation (brief):
- Determine the edge range for each vertex from CSR: offsets[u] .. offsets[u+1]-1.
- Each thread computes its lane within the warp, lane = threadIdx.x % warpSize, and processes edge = offsets[u] + lane + k*warpSize for batch k = 0, 1, 2, ...
- Batches continue until all edges of the vertex are processed.
Let's examine Warp-level Batch Processing for our edge-based SIR on GPU to show how a warp distributes processing of neighbors for a single vertex (hub).
Prerequisites:
- Warp = 32 threads (T0…T31)
- Hub Vh has 70 neighbors (E0…E69)
- Edges are processed in batches, each batch = warp size (32 edges)
- Edge distribution across batches:
Hub Vh neighbors: [E0, E1, E2, ..., E69]
Batch 0 (edges 0..31) -> Warp Threads T0..T31
Batch 1 (edges 32..63) -> Warp Threads T0..T31
Batch 2 (edges 64..69) -> Warp Threads T0..T5
- Scheme: warp processing batches:
Warp Thread IDs
T0 T1 T2 ... T31
Batch 0 (edges 0..31)
E0 E1 E2 ... E31
Batch 1 (edges 32..63)
E32 E33 E34 ... E63
Batch 2 (edges 64..69)
E64 E65 E66 E67 E68 E69 - threads T6..T31 idle
Each warp thread is responsible for one edge within a batch.
After processing one batch, the warp moves on to the next.
Idle threads appear only on the last, incomplete batch, but this is minimal.
Visualization with threads and batches:
Warp0: T0 T1 T2 ... T31
Batch0: E0 E1 E2 ... E31
Batch1: E32 E33 E34 ... E63
Batch2: E64 E65 E66 ... E69
- All batches of one hub are processed by a single warp
- Warp threads execute identical instructions (SIMT) → minimal divergence
- Scales to millions of vertices and tens of millions of edges
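The batch layout above can be reproduced with a few lines of plain C++; the warp size and hub degree here are taken from the example rather than from a device query.

#include <cstdio>

int main() {
  const int WARP_SIZE = 32;
  const int start = 0, degree = 70;            // hub Vh with 70 neighbors, edges E0..E69
  for (int lane = 0; lane < WARP_SIZE; ++lane) {
    std::printf("T%-2d:", lane);
    // Same indexing as the kernel: edge = start + batch*WARP_SIZE + lane
    for (int batch = 0; batch * WARP_SIZE + lane < degree; ++batch)
      std::printf(" E%d", start + batch * WARP_SIZE + lane);
    std::printf("\n");                         // lanes 6..31 print only two edges
  }
  return 0;
}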
CUDA Kernel: Warp-Level Batch Processing
Finally, here is the implementation of the CUDA kernel with warp-level batch processing for edge-based SIR on graphs. The main idea:
- Each warp processes one vertex u, whose neighbors occupy offsets[u] .. offsets[u+1]-1 in CSR.
- Warp threads distribute edges across batches: edge_index = offsets[u] + warpThreadId + batch*warpSize, where warpThreadId = threadIdx.x % warpSize.
- Batches continue until all edges of the vertex are processed.
Warp Batch Based Kernel
#define WARP_SIZE 32  // assumed constant here; cudaDeviceProp::warpSize reports it at runtime
__inline__ __device__ int warp_id() { return threadIdx.x / WARP_SIZE; }
__inline__ __device__ int lane_id() { return threadIdx.x % WARP_SIZE; }
__global__ void epidemic_step_warp_batch(const int N, const int *offsets, const int *neighbors,
const int *state_curr, int *state_next,
const float *edge_rands, const float p) {
int warpIdGlobal = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;
int lane = lane_id();
if (warpIdGlobal >= N) return;
// Each warp handles vertex u = warpIdGlobal
int u = warpIdGlobal;
int state_u = state_curr[u];
  // Copy state_curr[u] to state_next initially.
  // Note: another warp may concurrently write INFECTED into this same entry for a
  // susceptible neighbor; the order of that write and this copy is undefined, so an
  // infection can occasionally be lost. The pull model in Implementation #5 avoids
  // this by letting only the warp that owns a vertex write its state.
  state_next[u] = state_u;
  if (state_u != INFECTED) return; // Only propagate if vertex is infected
int start = offsets[u];
int end = offsets[u + 1];
int degree = end - start;
// Process edges in batches of WARP_SIZE
for (int batch = 0; batch * WARP_SIZE + lane < degree; ++batch) {
int edge_idx = start + batch * WARP_SIZE + lane;
if (edge_idx >= end) continue;
int v = neighbors[edge_idx];
int state_v = state_curr[v];
// Infection propagation
if (state_v == SUSCEPTIBLE && edge_rands[edge_idx] < p) {
state_next[v] = INFECTED; // no atomics needed
}
}
// After processing edges, mark u as recovered
state_next[u] = RECOVERED;
}
Main GPU Loop
int main(int argc, char **argv) {
if (argc < 3) {
std::cerr << "Usage: " << argv[0] << " <offsets.bin> <neighbors.bin>" << std::endl;
return 1;
}
std::string offsets_file = argv[1];
std::string neighbors_file = argv[2];
std::vector<int> h_offsets = read_bin_file(offsets_file);
std::vector<int> h_neighbors = read_bin_file(neighbors_file);
int N = h_offsets.size() - 1;
int M = h_neighbors.size();
std::cout << "Offsets size: " << N << " Neighbors size: " << M << std::endl;
std::vector<int> state(N, SUSCEPTIBLE);
state[0] = INFECTED; // initial infected node
int *d_offsets, *d_neighbors, *d_state_curr, *d_state_next;
float *d_edge_rands;
curandState *d_randStates;
cudaMalloc(&d_offsets, h_offsets.size() * sizeof(int));
cudaMalloc(&d_neighbors, h_neighbors.size() * sizeof(int));
cudaMalloc(&d_state_curr, N * sizeof(int));
cudaMalloc(&d_state_next, N * sizeof(int));
cudaMalloc(&d_randStates, M * sizeof(curandState)); // enough for M edges
cudaMemcpy(d_offsets, h_offsets.data(), h_offsets.size() * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_neighbors, h_neighbors.data(), h_neighbors.size() * sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(d_state_curr, state.data(), N * sizeof(int), cudaMemcpyHostToDevice);
int threads = 256;
int blocks_edges = (M + threads - 1) / threads;
int threads_per_block = 32 * 4; // 4 warps per block
int blocks_vertices =
(N + (threads_per_block / WARP_SIZE) - 1) / (threads_per_block / WARP_SIZE);
// -------------------------------------------------------------------------
// Initialize RNG states
// -------------------------------------------------------------------------
init_rand_states<<<blocks_edges, threads>>>(d_randStates, M, time(NULL));
cudaDeviceSynchronize();
float p = 0.05f; // infection probability
int steps = 50;
size_t total_randoms = (size_t)M * steps; // size_t avoids overflow on very large graphs
  // Allocate memory for all random numbers for all steps
  cudaMalloc(&d_edge_rands, total_randoms * sizeof(float));
// -------------------------------------------------------------------------
// CUDA events for timing
// -------------------------------------------------------------------------
cudaEvent_t start_total, stop_total;
cudaEvent_t start_rand_gen, stop_rand_gen;
cudaEvent_t start_kernel, stop_kernel;
cudaEventCreate(&start_total);
cudaEventCreate(&stop_total);
cudaEventCreate(&start_rand_gen);
cudaEventCreate(&stop_rand_gen);
cudaEventCreate(&start_kernel);
cudaEventCreate(&stop_kernel);
// -------------------------------------------------------------------------
// Pre-generate all random numbers once
// -------------------------------------------------------------------------
cudaEventRecord(start_rand_gen, 0);
generate_edge_randoms_per_edge<<<blocks_edges, threads>>>(M, steps, d_edge_rands, d_randStates);
cudaDeviceSynchronize();
cudaEventRecord(stop_rand_gen, 0);
cudaEventSynchronize(stop_rand_gen);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
std::cerr << "generate_edge_randoms_per_edge failed: " << cudaGetErrorString(err)
<< std::endl;
return 1;
}
float ms_rand_gen = 0.0f;
cudaEventElapsedTime(&ms_rand_gen, start_rand_gen, stop_rand_gen);
std::cout << "Pre-generation time: " << ms_rand_gen << " ms" << std::endl;
// -------------------------------------------------------------------------
// Main simulation loop
// -------------------------------------------------------------------------
cudaEventRecord(start_total, 0);
for (int t = 0; t < steps; ++t) {
const float *edge_rands_step = d_edge_rands + (size_t)t * (size_t)M;
// Step 1: warp-level batch kernel
cudaEventRecord(start_kernel, 0);
epidemic_step_warp_batch<<<blocks_vertices, threads_per_block>>>(
N, d_offsets, d_neighbors, d_state_curr, d_state_next, edge_rands_step, p);
cudaEventRecord(stop_kernel, 0);
cudaEventSynchronize(stop_kernel);
// error check
cudaError_t kerr = cudaGetLastError();
if (kerr != cudaSuccess) {
std::cerr << "epidemic_step_warp_batch failed at step " << t << ": "
<< cudaGetErrorString(kerr) << std::endl;
return 1;
}
float ms_kernel = 0.0f;
cudaEventElapsedTime(&ms_kernel, start_kernel, stop_kernel);
std::cout << "Step " << t << " kernel time: " << ms_kernel << " ms" << std::endl;
// cudaDeviceSynchronize();
// Swap state buffers
std::swap(d_state_curr, d_state_next);
}
cudaEventRecord(stop_total, 0);
cudaEventSynchronize(stop_total);
float ms_total = 0.0f;
cudaEventElapsedTime(&ms_total, start_total, stop_total);
std::cout << "Total simulation time: " << ms_total << " ms" << std::endl;
// -------------------------------------------------------------------------
// Copy results and print summary
// -------------------------------------------------------------------------
cudaMemcpy(state.data(), d_state_curr, N * sizeof(int), cudaMemcpyDeviceToHost);
int infected = 0, recovered = 0, susceptible = 0;
for (int s : state) {
if (s == INFECTED)
infected++;
else if (s == RECOVERED)
recovered++;
else
susceptible++;
}
std::cout << "S=" << susceptible << " I=" << infected << " R=" << recovered << std::endl;
// -------------------------------------------------------------------------
// Cleanup
// -------------------------------------------------------------------------
cudaFree(d_offsets);
cudaFree(d_neighbors);
cudaFree(d_state_curr);
cudaFree(d_state_next);
cudaFree(d_edge_rands);
cudaFree(d_randStates);
cudaEventDestroy(start_total);
cudaEventDestroy(stop_total);
cudaEventDestroy(start_rand_gen);
cudaEventDestroy(stop_rand_gen);
cudaEventDestroy(start_kernel);
cudaEventDestroy(stop_kernel);
cudaDeviceReset();
return 0;
}
Now each warp processes one vertex completely.
Warp threads distribute edges across batches:
edge_idx = offsets[u] + lane_id + batch*warpSize
Batch continues until all edges are processed. Even if a hub has 100+ neighbors, the warp processes them in batches of 32 edges. All warp threads are occupied, idle threads appear only in the last incomplete batch. All warp threads execute identical instructions (SIMT):
- Take an edge
- Check infection
- Update state_next
Divergence between threads is minimal, and scaling to millions of vertices with tens of millions of edges is significantly improved.
4.2.5 Implementation #5: Warp-Level Batch Processing (Pull Model)
4.2.5.1 Roofline Analysis of the Previous Approach
The previous GPU implementations followed a classical push-based infection model: each infected node iterated over its neighbors and attempted to infect them.
From an algorithmic viewpoint this is natural; however, from the GPU-architecture perspective, it leads to poor arithmetic intensity and thus a memory-bound regime.
Memory accesses per neighbor (previous approach):

| Access | Size (bytes) | Notes |
|---|---|---|
| `state_curr[u]` (read) | 4 | node state |
| `state_next[u]` (write) | 4 | copy initial state |
| `offsets[u]` (read) | 4 | neighbor list start |
| `offsets[u+1]` (read) | 4 | neighbor list end |
| `neighbors[edge_idx]` (read) | 4 | neighbor index |
| `state_curr[v]` (read) | 4 | neighbor state |
| `state_next[v]` (write) | 4 | update neighbor if infected |
| `state_next[u]` (write) | 4 | mark as recovered |
| **Total per neighbor** | **32 bytes** | simple sequential reads and writes |
Characteristics:
- Memory accesses are mostly coalesced, sequential, and predictable.
- Minimal warp divergence.
- No warp-wide synchronization or additional kernel launches.
- GPU is mostly memory-bound but benefits from simple, direct access patterns. This behavior is typical for graph algorithms, where memory-access irregularity dominates computational cost.
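To express this in roofline terms (using the Tesla T4's 320 GB/s memory bandwidth quoted in Section 5, and its roughly 8 TFLOP/s FP32 peak as an assumption):

arithmetic intensity ≈ a few comparisons per 32 bytes ≈ 0.1 FLOP/byte or less
T4 ridge point ≈ 8,000 GFLOP/s ÷ 320 GB/s ≈ 25 FLOP/byte

Since 0.1 is far below 25, the kernel sits deep in the memory-bound region of the roofline: only reducing or restructuring memory traffic, not adding compute, can improve its throughput.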
4.2.5.2 How the New Approach Improves the Situation
The key change is a switch from a push model to a pull model:
“I am a node. Let me check all my neighbors and see if any of them can infect me.”
These changes:
✔ Reduce global memory writes
✔ Make neighbor iterations more warp-friendly
✔ Increase data reuse within a warp
✔ Flatten control flow (less divergence)
Even though the arithmetic intensity remains low, the number of global memory transactions per useful operation decreases, improving real throughput.
Warp-Batch Pull-Based Kernel
// ---------- Infection Kernel (warp-batch) ----------
// reads states_prev, writes states_next (one warp per node)
__global__ void epidemic_infection_kernel(int N,
const int * __restrict__ offsets,
const int * __restrict__ neighbors,
const float * __restrict__ edge_rands,
const unsigned int * __restrict__ states_prev,
unsigned int * __restrict__ states_next,
int * __restrict__ infected_steps,
int step,
int M)
{
int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / d_warpSize;
int lane = threadIdx.x % d_warpSize;
int node = warpId;
if (node >= N) return;
// copy previous state by default
unsigned int prev_state = states_prev[node];
states_next[node] = prev_state; // single warp writes this entry -> no race
// only SUSCEPTIBLE nodes can be infected
if (prev_state != SUSCEPTIBLE) return;
int start = offsets[node];
int end = offsets[node + 1];
bool infected_flag = false;
// each lane checks a strided subset of edges
for (int e = start + lane; e < end; e += d_warpSize) {
int neigh = neighbors[e];
if (states_prev[neigh] == INFECTED) { // read from snapshot
float r = edge_rands[step * M + e];
if (r < INFECTION_PROB_THRESHOLD) {
infected_flag = true;
}
}
}
// warp-wide OR
infected_flag = __any_sync(0xffffffff, infected_flag);
// only lane 0 commits the change for this node
if (lane == 0 && infected_flag) {
states_next[node] = INFECTED;
// newly infected: reset counter (safe because prev_state was SUSCEPTIBLE)
infected_steps[node] = 0;
}
}
// ---------- Recovery Kernel (local) ----------
// reads states_prev to decide who was infected at start of step,
// increments infected_steps for those and writes RECOVERED into states_next if condition met.
__global__ void epidemic_recovery_kernel(int N,
const unsigned int * __restrict__ states_prev,
unsigned int * __restrict__ states_next,
int * __restrict__ infected_steps)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx >= N) return;
// Only nodes that were INFECTED at the beginning of timestep are eligible to progress toward recovery this step.
if (states_prev[idx] == INFECTED) {
// increment counter - single thread writes to infected_steps[idx] here (no concurrent writes to same index)
int v = infected_steps[idx] + 1;
infected_steps[idx] = v;
if (v >= RECOVERY_TIME) {
states_next[idx] = RECOVERED;
}
}
}
Let us examine memory traffic per node.
The process is split into two kernels: infection and recovery. Each warp still handles a single node. Infection is computed only for susceptible nodes. A warp-wide reduction (__any_sync) decides whether the node should become infected. Only lane 0 writes the result. Recovery updates infection counters and marks nodes as recovered if they reach the threshold.
Memory accesses per neighbor (approximate, combining both kernels):

| Access | Size (bytes) | Notes |
|---|---|---|
| `states_prev[node]` (read) | 4 | node state |
| `states_next[node]` (write) | 4 | copy previous state |
| `offsets[node]` & `offsets[node+1]` (read) | 8 | neighbor list |
| `neighbors[e]` (read) | 4 | neighbor index |
| `states_prev[neigh]` (read) | 4 | neighbor state |
| `edge_rands[...]` (read) | 4 | infection probability |
| `states_next[node]` (write, lane 0) | <4 amortized | single commit per warp |
| `infected_steps[node]` (read/write) | 8 | infection counter update |
| `states_next[idx]` (write, recovery) | 4 | mark recovered |
| **Total per neighbor** | **≈44–48 bytes** | includes both kernels and extra counters |
Characteristics:
- Additional memory accesses and two kernels increase the raw byte count.
- The warp-wide reduction reduces the number of concurrent writes to states_next.
- The recovery kernel serializes counter updates efficiently, avoiding write conflicts.
- Access patterns are more structured, reducing memory contention despite the higher byte counts.
Main GPU Loop
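The listing below relies on a few helpers and constants that are not shown in the article: checkCuda, count_states_kernel, the __constant__ d_warpSize, and the MAX_STEPS / INFECTION_PROB_THRESHOLD / RECOVERY_TIME constants. A minimal sketch of what they might look like follows; the values and signatures are assumptions inferred from how they are used, and SUSCEPTIBLE / INFECTED / RECOVERED are the 0/1/2 constants from Section 4.1.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// --- constants assumed by the kernels above (values are illustrative) ---
#define INFECTION_PROB_THRESHOLD 0.05f  // infection probability p
#define RECOVERY_TIME 1                 // steps a node stays infected
#define MAX_STEPS 50                    // number of simulation steps

__constant__ int d_warpSize;            // filled from cudaDeviceProp::warpSize in main()

// Abort with a message if a CUDA call failed.
static void checkCuda(cudaError_t err, const char *what) {
  if (err != cudaSuccess) {
    std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    std::exit(1);
  }
}

// Count S/I/R nodes into counts[0..2] using atomics.
__global__ void count_states_kernel(int N, const unsigned int *states,
                                    unsigned int *counts) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= N) return;
  atomicAdd(&counts[states[idx]], 1u);
}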
int main(int argc, char **argv) {
if (argc != 3) { std::cerr << "Usage: " << argv[0] << " offsets.bin neighbors.bin\n"; return 1; }
auto offsets = read_bin_file(argv[1]);
auto neighbors = read_bin_file(argv[2]);
int N = offsets.size() - 1;
int M = neighbors.size();
float avg_deg = static_cast<float>(M) / N;
std::cout << "Nodes: " << N << ", Edges: " << M
<< ", Avg connections per node: " << avg_deg << "\n";
// device warp size
int currentDevice;
checkCuda(cudaGetDevice(¤tDevice), "cudaGetDevice");
cudaDeviceProp deviceProp;
checkCuda(cudaGetDeviceProperties(&deviceProp, currentDevice), "cudaGetDeviceProperties");
int warp_size = deviceProp.warpSize;
checkCuda(cudaMemcpyToSymbol(d_warpSize, &warp_size, sizeof(int)), "cudaMemcpyToSymbol(d_warpSize)");
std::cout << "GPU Warp size: " << warp_size << std::endl;
// Allocate device memory
int *d_offsets = nullptr, *d_neighbors = nullptr;
unsigned int *d_states_A = nullptr, *d_states_B = nullptr; // double buffer
float *d_edge_rands = nullptr;
int *d_infected_steps = nullptr;
curandState *d_randStates = nullptr;
unsigned int *d_counts = nullptr;
checkCuda(cudaMalloc(&d_offsets, offsets.size() * sizeof(int)), "cudaMalloc d_offsets");
checkCuda(cudaMalloc(&d_neighbors, neighbors.size() * sizeof(int)), "cudaMalloc d_neighbors");
checkCuda(cudaMalloc(&d_states_A, N * sizeof(unsigned int)), "cudaMalloc d_states_A");
checkCuda(cudaMalloc(&d_states_B, N * sizeof(unsigned int)), "cudaMalloc d_states_B");
checkCuda(cudaMalloc(&d_infected_steps, N * sizeof(int)), "cudaMalloc d_infected_steps");
checkCuda(cudaMalloc(&d_edge_rands, (size_t)MAX_STEPS * M * sizeof(float)), "cudaMalloc d_edge_rands");
checkCuda(cudaMalloc(&d_randStates, M * sizeof(curandState)), "cudaMalloc d_randStates");
checkCuda(cudaMalloc(&d_counts, 3 * sizeof(unsigned int)), "cudaMalloc d_counts");
checkCuda(cudaMemcpy(d_offsets, offsets.data(), offsets.size() * sizeof(int), cudaMemcpyHostToDevice), "memcpy offsets");
checkCuda(cudaMemcpy(d_neighbors, neighbors.data(), neighbors.size() * sizeof(int), cudaMemcpyHostToDevice), "memcpy neighbors");
// Initialize host states: mark a few initial infected for testing
std::vector<unsigned int> h_states(N, SUSCEPTIBLE);
if (N > 0) h_states[0] = INFECTED;
if (N > 1) h_states[1] = INFECTED;
if (N > 2) h_states[2] = INFECTED;
// infected_steps: -1 means never infected yet; for already infected set 0
std::vector<int> h_infected_steps(N, -1);
for (int i = 0; i < N; ++i) if (h_states[i] == INFECTED) h_infected_steps[i] = 0;
// copy initial states into both buffers (so states_prev can be derived trivially by swapping)
checkCuda(cudaMemcpy(d_states_A, h_states.data(), N * sizeof(unsigned int), cudaMemcpyHostToDevice), "memcpy states_A");
checkCuda(cudaMemcpy(d_states_B, h_states.data(), N * sizeof(unsigned int), cudaMemcpyHostToDevice), "memcpy states_B");
checkCuda(cudaMemcpy(d_infected_steps, h_infected_steps.data(), N * sizeof(int), cudaMemcpyHostToDevice), "memcpy infected_steps");
// prepare kernel launch params
int threadsPerBlock = 128;
int warpsPerBlock = threadsPerBlock / warp_size;
int numWarps = N; // 1 warp per node
int blocksNodes = (numWarps + warpsPerBlock - 1) / warpsPerBlock;
int blocksEdges = (M + threadsPerBlock - 1) / threadsPerBlock;
int blocksForCount = (N + threadsPerBlock - 1) / threadsPerBlock;
// ---------- RNG pre-generation timing ----------
cudaEvent_t ev_gen_start, ev_gen_stop;
cudaEventCreate(&ev_gen_start);
cudaEventCreate(&ev_gen_stop);
cudaEventRecord(ev_gen_start);
init_rand_states<<<blocksEdges, threadsPerBlock>>>(d_randStates, M, 1234567UL);
generate_edge_randoms_per_edge<<<blocksEdges, threadsPerBlock>>>(M, MAX_STEPS, d_edge_rands, d_randStates);
cudaEventRecord(ev_gen_stop);
cudaEventSynchronize(ev_gen_stop);
float gen_ms = 0.f;
cudaEventElapsedTime(&gen_ms, ev_gen_start, ev_gen_stop);
std::cout << "[Timing] RNG pre-generation: " << gen_ms << " ms\n";
// ---------- Events reused for per-step timing ----------
cudaEvent_t ev_step_start, ev_step_mid, ev_step_stop;
cudaEventCreate(&ev_step_start);
cudaEventCreate(&ev_step_mid);
cudaEventCreate(&ev_step_stop);
// double-buffer pointers: states_prev = A, states_next = B (will swap each step)
unsigned int *states_prev = d_states_A;
unsigned int *states_next = d_states_B;
bool print_per_step = true;
// ---------- Per-step loop (with per-step timing) ----------
for (int step = 0; step < MAX_STEPS; ++step) {
// record start
cudaEventRecord(ev_step_start);
// Infection (reads states_prev, writes states_next)
epidemic_infection_kernel<<<blocksNodes, threadsPerBlock>>>(
N, d_offsets, d_neighbors, d_edge_rands, states_prev, states_next, d_infected_steps, step, M);
// mark mid event
cudaEventRecord(ev_step_mid);
// Recovery (reads states_prev, may write RECOVERED into states_next, increments infected_steps for previous infected)
epidemic_recovery_kernel<<<blocksNodes, threadsPerBlock>>>(
N, states_prev, states_next, d_infected_steps);
// record stop and synchronize
cudaEventRecord(ev_step_stop);
cudaEventSynchronize(ev_step_stop);
float t_start_mid = 0.f, t_mid_stop = 0.f;
cudaEventElapsedTime(&t_start_mid, ev_step_start, ev_step_mid); // infection time
cudaEventElapsedTime(&t_mid_stop, ev_step_mid, ev_step_stop); // recovery time
//t_total = t_start_mid + t_mid_stop;
std::printf("[Timing] Step %2d S->I: %.3f ms, I->R: %.3f ms", step, t_start_mid, t_mid_stop);
// optionally count states after the step (costly - do only if needed)
if (print_per_step) {
checkCuda(cudaMemset(d_counts, 0, 3 * sizeof(unsigned int)), "memset counts");
count_states_kernel<<<blocksForCount, threadsPerBlock>>>(N, states_next, d_counts);
unsigned int h_counts[3] = {0,0,0};
checkCuda(cudaMemcpy(h_counts, d_counts, 3 * sizeof(unsigned int), cudaMemcpyDeviceToHost), "memcpy counts to host");
std::printf(" | after step %2d : S=%u I=%u R=%u\n", step, h_counts[SUSCEPTIBLE], h_counts[INFECTED], h_counts[RECOVERED]);
}
// swap buffers for next iteration
std::swap(states_prev, states_next);
}
// ---------- Total-only timing loop (same logic but timed as a whole) ----------
// Reset states back to initial (we kept d_states_A/d_states_B modified) — reload initial host state
checkCuda(cudaMemcpy(d_states_A, h_states.data(), N * sizeof(unsigned int), cudaMemcpyHostToDevice), "reset states_A");
checkCuda(cudaMemcpy(d_states_B, h_states.data(), N * sizeof(unsigned int), cudaMemcpyHostToDevice), "reset states_B");
// reset infected_steps
checkCuda(cudaMemcpy(d_infected_steps, h_infected_steps.data(), N * sizeof(int), cudaMemcpyHostToDevice), "reset infected_steps");
// reuse states_prev/states_next variables
states_prev = d_states_A;
states_next = d_states_B;
cudaEvent_t ev_total_start, ev_total_stop;
cudaEventCreate(&ev_total_start);
cudaEventCreate(&ev_total_stop);
cudaEventRecord(ev_total_start);
for (int step = 0; step < MAX_STEPS; ++step) {
epidemic_infection_kernel<<<blocksNodes, threadsPerBlock>>>(
N, d_offsets, d_neighbors, d_edge_rands, states_prev, states_next, d_infected_steps, step, M);
epidemic_recovery_kernel<<<blocksNodes, threadsPerBlock>>>(
N, states_prev, states_next, d_infected_steps);
std::swap(states_prev, states_next);
}
cudaEventRecord(ev_total_stop);
cudaEventSynchronize(ev_total_stop);
float total_ms = 0.f;
cudaEventElapsedTime(&total_ms, ev_total_start, ev_total_stop);
std::cout << "[Timing] Total epidemic loop: " << total_ms << " ms\n";
// ---------- Final counts (from states_prev because we swapped after last step) ----------
checkCuda(cudaMemset(d_counts, 0, 3 * sizeof(unsigned int)), "memset counts final");
count_states_kernel<<<blocksForCount, threadsPerBlock>>>(N, states_prev, d_counts);
unsigned int h_counts[3] = {0,0,0};
checkCuda(cudaMemcpy(h_counts, d_counts, 3 * sizeof(unsigned int), cudaMemcpyDeviceToHost), "memcpy final counts");
std::printf("Final states: S=%u I=%u R=%u\n", h_counts[SUSCEPTIBLE], h_counts[INFECTED], h_counts[RECOVERED]);
// Cleanup
cudaFree(d_offsets);
cudaFree(d_neighbors);
cudaFree(d_states_A);
cudaFree(d_states_B);
cudaFree(d_edge_rands);
cudaFree(d_infected_steps);
cudaFree(d_randStates);
cudaFree(d_counts);
cudaEventDestroy(ev_gen_start);
cudaEventDestroy(ev_gen_stop);
cudaEventDestroy(ev_step_start);
cudaEventDestroy(ev_step_mid);
cudaEventDestroy(ev_step_stop);
cudaEventDestroy(ev_total_start);
cudaEventDestroy(ev_total_stop);
return 0;
}
5. Results and Experiments
The research utilized real-world social network data from the SNAP database, specifically the Facebook social graph with 4,039 users and 88,234 friendship ties (176,468 directed edges once each friendship is stored in both directions in CSR). This dataset, available at https://snap.stanford.edu/data/facebook_combined.txt.gz, represents authentic human social interactions and provides a realistic foundation for testing our epidemic models after conversion to CSR format.
To evaluate the performance of the different implementations, a series of experiments was conducted.
Experimental Setup:
- Graph: the Facebook graph described above, with 4,039 vertices and 176,468 directed edges.
- Model parameters: infection probability p = 0.05, number of simulation steps steps = 50.
- Hardware configuration:
  - GPU: NVIDIA Tesla T4 with 2,560 CUDA cores, 15,360 MB GDDR6 memory, Compute Capability 7.5
  - CPU: Intel Xeon CPU @ 2.00GHz, 2 cores/4 threads, 38.5 MB L3 cache
  - Software: CUDA 12.5, Driver Version 550.54.15
Execution Time Comparison:
The table below presents the average total execution time for each implementation.
| Version | Algorithm Characteristics | Total simulation time | Additional costs | Notes |
|---|---|---|---|---|
| CPU (Naive) | One thread handles the vertices sequentially | 4.11 ms | Simple implementation | Baseline |
| GPU #1 — Naive (vertex-based) | One thread per node, per-node loop over neighbors, racy non-atomic writes, strong divergence | 5.58 ms | Minor init overhead | Slowest version; severe divergence and scattered memory accesses |
| GPU #2 — Edge-Based | One thread per edge, two kernels per step (infection, mark_recovered) | 5.45 ms | Many small kernel launches | Good load balance; graph too small to saturate GPU |
| GPU #3 — Atomic-Free | Pre-generated randomness, no atomics, two kernels | 5.28 ms | 7.4 ms RNG pregen | Per-step faster, but pre-generation dominates total time |
| GPU #4 — Warp-Batch | One warp processes a node, neighbors processed in warp batches | 2.47 ms | 7.4 ms RNG pregen | First major speedup; ~2.2× faster than the naive GPU version |
| GPU #5 — Warp-Pull (two kernels) | Warp-pull model with separate infection and recovery kernels | 0.821 ms | 7.44 ms RNG pregen | Fastest version; ~3× faster than warp-batch, ~6.8× faster than the naive GPU version |
Hardware-Specific Performance Analysis:
CPU Context: The Intel Xeon processor operates at 2.0GHz with only 2 physical cores, representing a modest server-grade CPU. The naive CPU implementation serves as a realistic baseline for entry-level cloud computing environments.
GPU #1 (Naive): The Tesla T4, despite its 2,560 CUDA cores, shows worse performance than CPU due to inefficient utilization. This demonstrates that simply porting CPU algorithms to GPU without architectural optimization can yield poor results, especially on datacenter-grade GPUs like T4.
GPU #2 & #3 (Intermediate): These implementations begin to leverage GPU parallelism but are limited by memory access patterns and synchronization overhead. The T4's memory bandwidth (320 GB/s) is underutilized due to uncoalesced accesses.
GPU #4 (Warp-Level Batch Model): This implementation marks the first major architectural shift that begins to fully exploit the Tesla T4's internal parallelism. Instead of assigning a single thread to each vertex, GPU #4 assigns an entire warp (32 threads) to process one node cooperatively.
Architectural Advantages:
- Warp-per-node parallelism: every warp scans a node's adjacency list in a coordinated fashion, significantly reducing intra-warp divergence compared to prior versions.
- Coalesced memory access: neighbor lists are consumed in warp-aligned batches, improving effective memory-bandwidth utilization within the constraints of the graph's irregular degree distribution.
- No atomics: because each warp exclusively "owns" a vertex, there is no need for atomic operations.
- Reduced synchronization overhead: the algorithm relies purely on natural warp-synchronous behavior; no explicit warp intrinsics were needed in this version.
Outcome: GPU #4 demonstrates a ~2.2× speedup relative to earlier approaches, even though the graph is extremely small by GPU standards (4k nodes). This clearly shows that restructuring computation to respect warp execution patterns is significantly more impactful than raw FLOPs or core count.
Energy Efficiency Considerations: the Tesla T4's 70W TDP combined with the highly parallel warp-batch design means the GPU completes the simulation faster and returns to its low-power idle state sooner, reducing energy consumption in real-world server deployments.
GPU #5 (Warp-Pull Model, Two-Kernel Architecture): The fifth implementation represents the culmination of all optimizations, combining a warp-synchronous infection model with a clean separation of infection (S→I) and recovery (I→R) transitions. This design transforms the algorithm into a memory-efficient, low-branching pipeline.
Architectural Advantages:
- Warp-per-node "pull" traversal: instead of infected nodes pushing infection outward, susceptible nodes pull infection status from their neighbors. This aligns perfectly with warp processing: all 32 lanes of the warp inspect different neighbors of the same node.
- Coalesced memory access: adjacency lists are consumed in tightly strided access patterns, minimizing cache misses and maximizing throughput on the T4's GDDR6 memory subsystem.
- Zero atomics and conflict-free writes: only lane 0 of each warp commits the infection result for its node, eliminating write contention.
- Two-kernel design reduces branching: the infection kernel handles only susceptible nodes and the recovery kernel handles only infected nodes, which eliminates the large conditional branches that previous versions suffered from.
- Deterministic and cache-friendly double-buffering: using states_prev → states_next eliminates read/write hazards and ensures stable data for each phase.
Outcome: GPU #5 achieves the fastest total simulation time (0.82 ms), roughly a 6.8× improvement over the naive GPU version and 3× over the first warp-level approach. Despite the graph being relatively small, GPU #5 achieves near-optimal memory behavior for a Turing GPU executing an irregular graph algorithm.
Energy and Datacenter Context: due to extremely short kernel runtimes and reduced divergence, GPU #5 lets the Tesla T4 finish its work and drop back to idle power almost immediately. In large-scale simulations typical for epidemiology (millions of nodes), this architectural pattern should scale nearly linearly and provide even greater energy and cost savings in datacenter environments.
6. Conclusion
In this work, we set out to understand how to run epidemic simulations efficiently on the GPU — and we quickly discovered that raw GPU power means little without the right algorithmic structure.
Our initial versions were intentionally naive, and they taught us an important lesson: a GPU with thousands of cores can easily be slower than a small CPU if memory access is irregular. The early implementations were heavily memory-bound and suffered from warp divergence, so the hardware simply couldn’t show its strengths.
The turning point came when we reorganized the simulation around warp-level cooperation. Instead of assigning work to individual threads, we let an entire warp process one node together. This single architectural shift dramatically improved memory coherence, reduced divergence, and aligned the algorithm with how GPUs naturally execute code.
The final “warp-pull” version pushed the idea further. By splitting infection and recovery into two simple kernels and letting susceptible nodes pull information from neighbors, we removed synchronisation hazards and eliminated unnecessary atomics. As a result, the GPU finally worked at full efficiency — and the simulation time dropped to under a millisecond.
Overall, we learned that the biggest speedups don’t come from increasing the number of threads, but from restructuring the algorithm so the GPU can use its memory system and warps effectively. Once we respected the hardware’s execution model, performance followed almost automatically.
7. References
- Kermack W.O., McKendrick A.G. (1927). A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society A.
- NVIDIA. CUDA C Programming Guide. (2024).
- Pastor-Satorras R., Vespignani A. (2001). Epidemic spreading in scale-free networks. Physical Review Letters.
