Custom CUDA Kernels Outperforming cuBLAS: Deep Dive into GPU Memory Optimization for Small-Batch ML Workloads

shreshth kapai — Fri, 25 Jul 2025 20:09:22 +0000

Developed specialized CUDA kernels for financial ML inference that achieve 93,563 operations/second with 0.011ms median latency on consumer GTX 1650 hardware, demonstrating 7.3× performance improvement over PyTorch's cuBLAS-backed implementations through targeted memory hierarchy exploitation and vectorization techniques.

Architecture-Specific Optimization Philosophy
Memory Hierarchy Exploitation Techniques
Vectorization and Alignment Strategies
Thread Mapping and Occupancy Analysis
Performance Analysis and Bottleneck Identification
Architectural Constraints and Modern GPU Limitations
Comparative Analysis: Specialized vs General-Purpose Libraries

Architecture-Specific Optimization Philosophy

Most GPU acceleration libraries target large-scale deep learning workloads with massive batch sizes (512-4096) and high-dimensional operations. Financial ML inference presents fundamentally different characteristics:

Batch sizes: 8-128 samples (real-time inference constraints)
Feature dimensions: 16-128 elements (factor models, risk metrics)
Latency requirements: Sub-millisecond response times
Memory patterns: Frequent small operations vs. infrequent large operations

This mismatch creates optimization opportunities that general-purpose libraries cannot exploit due to their broader target scope.

GTX 1650 Hardware Constraints Analysis

Hardware: GTX 1650 (Turing TU117)
Compute Capability: 7.5
CUDA Cores: 896 @ 1485MHz base, 1665MHz boost  
Memory: 4GB GDDR6, 128-bit bus, 192 GB/s bandwidth
SMs: 14 Streaming Multiprocessors (64 cores/SM)
L2 Cache: 1MB unified
Shared Memory: 64KB per SM (configurable with L1)
Register File: 65,536 32-bit registers per SM

The limited memory bandwidth (192 GB/s) and moderate core count necessitate aggressive memory access optimization and careful resource utilization strategies.

Memory Hierarchy Exploitation Techniques

Shared Memory Staging Architecture

The primary optimization revolves around using shared memory as a staging area for vectorized global memory access patterns:

__global__ void batched_gemv_kernel(
    const float* __restrict__ weights,
    const float* __restrict__ inputs,
    float* __restrict__ outputs,
    int batch_size, int input_dim, int output_dim
) {
    extern __shared__ float shared_input[];

    // Vectorized memory access with runtime alignment checking
    if ((input_dim & 3) == 0 && ((uintptr_t)input_ptr & 15) == 0) {
        float4* shared_input4 = (float4*)shared_input;
        const float4* input_ptr4 = (const float4*)input_ptr;

        // 4× bandwidth utilization through vectorization
        for (int i = tid; i * 4 < input_dim; i += num_threads) {
            shared_input4[i] = __ldg(&input_ptr4[i]);
        }
    } else {
        // Fallback to scalar loads with read-only cache utilization
        for (int i = tid; i < input_dim; i += num_threads) {
            shared_input[i] = __ldg(&input_ptr[i]);
        }
    }
    __syncthreads();
}

Technical Analysis of Memory Access Patterns

Alignment Verification Logic:

(input_dim & 3) == 0: Ensures dimension divisible by 4 for float4 operations
((uintptr_t)input_ptr & 15) == 0: Verifies 16-byte alignment for 128-bit loads
Runtime branching overhead: Minimal due to uniform branching within warps

Memory Coalescing Optimization:

Consecutive threads access consecutive float4 elements
128-byte cache line utilization: 32 float elements per cache line
Shared memory banking: Stride-1 access eliminates bank conflicts

Read-Only Data Cache Exploitation:
The __ldg() intrinsic bypasses L1 cache, utilizing read-only texture cache for streaming access patterns, crucial for memory-bandwidth-bound operations on GTX 1650.

Vectorization and Alignment Strategies

Float4 Vectorization Implementation

// Dynamic shared memory allocation with alignment optimization
int shared_mem = ((input_dim * sizeof(float) + 127) & ~127);

// Launch configuration optimized for GTX 1650 occupancy
dim3 grid(batch_size);
dim3 block(min(output_dim, 1024));

Alignment Calculation Breakdown:

input_dim * sizeof(float): Raw memory requirement
+ 127: Maximum padding for 128-byte alignment
& ~127: Bitwise AND with 128-byte mask (128 = 0x80, ~127 = 0xFF80)

This ensures shared memory allocations align with cache line boundaries, optimizing memory controller efficiency.

Instruction-Level Parallelism Through Manual Unrolling

// 8-way manual loop unrolling for multiply-accumulate operations
int i = 0;
for (; i <= input_dim - 8; i += 8) {
    result += shared_input[i] * weight_row[i * output_dim] +
              shared_input[i+1] * weight_row[(i+1) * output_dim] +
              shared_input[i+2] * weight_row[(i+2) * output_dim] +
              shared_input[i+3] * weight_row[(i+3) * output_dim] +
              shared_input[i+4] * weight_row[(i+4) * output_dim] +
              shared_input[i+5] * weight_row[(i+5) * output_dim] +
              shared_input[i+6] * weight_row[(i+6) * output_dim] +
              shared_input[i+7] * weight_row[(i+7) * output_dim];
}

Performance Impact Analysis:

Instruction overhead reduction: 87.5% (8 operations per loop iteration vs 1)
Register pressure management: Compiler can schedule 8 multiply-accumulate operations across available ALUs
Pipeline utilization: Multiple outstanding memory operations mask arithmetic latency
Branch prediction: Reduced branch misprediction penalty

Thread Mapping and Occupancy Analysis

Thread-Per-Output-Element Strategy

int batch_idx = blockIdx.x;    // Grid-stride batch processing
int output_idx = threadIdx.x;  // Thread-per-output mapping

const float* weight_row = weights + batch_idx * input_dim * output_dim + output_idx;

Design Rationale:

Load balancing: Each thread computes exactly one output element
Memory access pattern: Enables coalesced weight matrix access
Warp utilization: Output dimensions typically multiples of 32 (warp size)
Divergence minimization: Uniform computation across thread block

Occupancy Optimization for GTX 1650

// Conservative block sizing to prevent resource exhaustion
dim3 block(min(output_dim, 1024));

// Shared memory calculation accounting for 64KB SM limitation  
int shared_mem = ((input_dim * sizeof(float) + 127) & ~127);

Resource Utilization Analysis:

Theoretical occupancy: 14 SMs × 2048 threads/SM = 28,672 concurrent threads
Practical occupancy: Limited by shared memory usage (64KB/SM)
Register pressure: 65,536 registers/SM ÷ threads/block = register allocation per thread

Performance Analysis and Bottleneck Identification

Comprehensive Benchmarking Results

Hardware: GTX 1650 (Turing TU117, 896 CUDA cores, 192 GB/s bandwidth)
Methodology: 1000 trials, 50 warmup iterations, CUDA hardware timers

GEMV Operations (Batch=32, Input=64, Output=32):
├── Throughput: 93,563 ops/sec
├── Median Latency: 0.011ms  
├── P95 Latency: 0.076ms
├── Standard Deviation: ±0.032ms
└── Speedup vs PyTorch: 7.3× (629.5% improvement)

GEMV Operations (Batch=32, Input=64, Output=64):
├── Throughput: 82,672 ops/sec
├── Median Latency: 0.012ms
├── P95 Latency: 0.147ms  
├── Standard Deviation: ±0.049ms
└── Speedup vs PyTorch: 5.2× (424.2% improvement)

Softmax Normalization (Batch=32, Dimension=64):
├── Throughput: 24,357 ops/sec
├── Median Latency: 0.041ms
├── P95 Latency: 0.178ms
├── Standard Deviation: ±0.042ms
└── Speedup vs PyTorch: 1.3× (29.7% improvement)

Memory Bandwidth Utilization Analysis

Theoretical Peak Performance:

GTX 1650 Memory Bandwidth: 192 GB/s
GEMV Memory Access Pattern: (batch_size × input_dim + batch_size × input_dim × output_dim) × sizeof(float)
For b32_i64_o32: (32×64 + 32×64×32) × 4 bytes = 270,336 bytes per operation

Achieved Bandwidth Utilization:

93,563 ops/sec × 270,336 bytes = 25.3 GB/s effective bandwidth
Utilization: 25.3 GB/s ÷ 192 GB/s = 13.2% of theoretical peak

This relatively low utilization indicates compute-bound rather than memory-bound performance characteristics, suggesting successful cache utilization and vectorization effectiveness.

Architectural Constraints and Modern GPU Limitations

Tensor Core Utilization Barriers

// Current FP32 implementation for numerical stability
float result = 0.0f;
result += shared_input[i] * weight_row[i * output_dim];

// Tensor Core requirements (unavailable on GTX 1650):
// - Compute Capability 7.0+ (GTX 1650 = 7.5, but lacks Tensor Cores)
// - Mixed precision: __half storage, float accumulation  
// - Matrix dimensions: 16×16×16 WMMA fragments
// - Memory layout: Row-major A, Column-major B matrices

Tensor Core Integration Challenges:

Precision constraints: Financial calculations require FP32 accuracy for regulatory compliance
Matrix dimension requirements: 16×16 tile size may not align with typical ML layer dimensions
Memory layout conversion: Row-major to column-major transformation overhead
Hardware availability: GTX 1650 lacks Tensor Core units despite compute capability 7.5

Advanced Memory Optimization Limitations

// Current shared memory constraint (GTX 1650: 64KB per SM)
extern __shared__ float shared_input[]; // Limited to ~16K float elements

// Professional GPU capabilities (A100: 164KB per SM):
// - Larger tile sizes for matrix blocking
// - Multi-level shared memory hierarchies  
// - Advanced prefetching strategies

Shared Memory Scaling Analysis:

GTX 1650: 64KB ÷ 4 bytes = 16,384 float elements maximum
A100: 164KB ÷ 4 bytes = 42,496 float elements maximum
Blocking potential: A100 enables 2.6× larger tile sizes for cache blocking algorithms

Warp-Level Primitive Opportunities

// Current reduction implementation (tree-based)
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride) {
        sdata[tid] = fmaxf(sdata[tid], sdata[tid + stride]);
    }
    __syncthreads();
}

// Warp shuffle alternative (not implemented):
// float val = __shfl_down_sync(0xffffffff, local_max, 16);
// val = fmaxf(val, local_max);
// Eliminates shared memory usage and synchronization overhead

Comparative Analysis: Specialized vs General-Purpose Libraries

cuBLAS Performance Characteristics

cuBLAS Optimization Focus:

Large matrix operations (M, N, K > 1024)
High arithmetic intensity workloads
Batch sizes optimized for maximum throughput
Matrix-matrix operations (GEMM) over matrix-vector (GEMV)

Small-Batch Workload Disadvantages:

Kernel launch overhead amortization requires larger operations
Memory access patterns optimized for large stride operations
Thread block configurations target high occupancy over low latency
Algorithm selection favors throughput over response time

Specialized Kernel Advantages

Cache Locality Exploitation:

Input vectors fit entirely in shared memory (64KB)
Weight matrix rows accessed with spatial locality
Reduced global memory transactions per operation

Launch Overhead Reduction:

Single kernel launch per batch vs multiple cuBLAS calls
Simplified memory layout requirements
Direct device memory manipulation without library abstraction

Resource Utilization Optimization:

Thread mapping optimized for specific dimension ranges
Shared memory allocation tailored to workload characteristics
Register allocation aligned with computational requirements

Scaling Projections and Professional GPU Potential

A100/H100 Architecture Enhancements

NVIDIA A100 (Ampere GA100):
├── Memory: 40/80GB HBM2e, 1,555-2,039 GB/s bandwidth
├── SMs: 108 Streaming Multiprocessors  
├── Tensor Cores: 3rd generation, mixed-precision acceleration
├── Shared Memory: 164KB per SM
└── Compute Capability: 8.0

NVIDIA H100 (Hopper GH100):  
├── Memory: 80GB HBM3, 3,350 GB/s bandwidth
├── SMs: 132 Streaming Multiprocessors
├── Tensor Cores: 4th generation with Transformer Engine
├── Shared Memory: 228KB per SM  
└── Compute Capability: 9.0

Conservative Performance Scaling Estimates:

Memory bandwidth scaling: 1,555 GB/s ÷ 192 GB/s = 8.1× theoretical improvement
SM parallelism scaling: 108 SMs ÷ 14 SMs = 7.7× parallel processing capability
Realistic performance scaling: 4-6× improvement accounting for memory controller contention

Projected A100 Performance:

GEMV operations: 375,000-560,000 ops/sec (4-6× current performance)
Median latency: 0.002-0.003ms (3-5× latency reduction)

Mixed-Precision Acceleration Potential

// Tensor Core integration possibility (A100/H100)
#include <mma.h>
using namespace nvcuda;

__global__ void batched_gemv_tensor_core(
    const __half* __restrict__ weights,    // FP16 storage
    const __half* __restrict__ inputs,     // FP16 storage  
    float* __restrict__ outputs,           // FP32 accumulation
    int batch_size, int input_dim, int output_dim
) {
    // 16×16×16 matrix fragments for WMMA operations
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;  
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    // Warp-level matrix multiply-accumulate
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
}

Tensor Core Theoretical Performance:

A100: 312 TOPS mixed-precision throughput
H100: 989 TOPS mixed-precision throughput
Realistic acceleration: 4-16× improvement for compatible workloads

Technical Insights and Optimization Principles

Domain-Specific Optimization Philosophy

The work demonstrates several key principles for specialized GPU kernel development:

Workload Characterization Priority: Understanding specific memory access patterns, computational intensity, and resource requirements enables targeted optimization strategies impossible in general-purpose libraries.
Architecture-Constraint-Driven Design: GTX 1650's memory bandwidth limitations forced aggressive shared memory utilization and vectorization strategies that proved highly effective.
Algorithm-Architecture Co-design: Thread mapping strategies (thread-per-output-element) align algorithmic structure with hardware execution characteristics for optimal resource utilization.

Memory Access Pattern Analysis

// Optimal access pattern for small-batch GEMV
const float* weight_row = weights + batch_idx * input_dim * output_dim + output_idx;

// Memory layout: [batch][input_dim][output_dim]
// Access stride: output_dim (non-unit, but predictable)
// Cache behavior: Spatial locality within weight rows
// Bandwidth utilization: Coalesced across thread block

Memory Access Efficiency Factors:

Coalescing effectiveness: Consecutive threads access elements separated by output_dim stride
Cache line utilization: 32-element cache lines partially utilized due to stride pattern
Prefetching potential: Predictable access pattern enables hardware prefetching
Bank conflict analysis: Shared memory accesses use unit stride, eliminating conflicts

Benchmarking Methodology and Statistical Rigor

Measurement Infrastructure

// Hardware-based timing for sub-microsecond precision
cudaEvent_t start_event, end_event;
cudaEventCreate(&start_event);
cudaEventCreate(&end_event);

// Synchronous measurement protocol
cudaEventRecord(start_event, stream);
kernel_function<<<grid, block, shared_mem, stream>>>(args...);
cudaEventRecord(end_event, stream);
cudaEventSynchronize(end_event);

float milliseconds;
cudaEventElapsedTime(&milliseconds, start_event, end_event);

Statistical Analysis Protocol

Experimental Design:

Warmup iterations: 50 trials for thermal stabilization and cache population
Measurement trials: 1000 iterations for statistical significance
Memory layout consistency: Identical tensor formats across all comparisons
Environmental controls: Fixed GPU frequency, isolated measurement process

Statistical Metrics:

Central tendency: Median latency (robust to outliers)
Variability: Standard deviation for performance consistency assessment
Tail behavior: P95/P99 percentiles for SLA compliance analysis
Comparative analysis: Relative speedup calculations with confidence intervals

Future Enhancement Opportunities

Multi-GPU Scaling Architecture

// NCCL-based distribution for professional deployments
#include <nccl.h>

class DistributedGPUScaler {
    ncclComm_t* comms;
    cudaStream_t* streams;
    int num_gpus;

public:
    void distributed_batch_processing(
        float** device_weights, float** device_inputs, float** device_outputs,
        int total_batch_size
    ) {
        int batch_per_gpu = total_batch_size / num_gpus;

        // Parallel kernel launches across GPUs
        for (int gpu = 0; gpu < num_gpus; gpu++) {
            cudaSetDevice(gpu);
            launch_batched_gemv(
                device_weights[gpu], device_inputs[gpu], device_outputs[gpu],
                batch_per_gpu, input_dim, output_dim, streams[gpu]
            );
        }

        // All-reduce for global operations if required
        for (int gpu = 0; gpu < num_gpus; gpu++) {
            ncclAllReduce(device_outputs[gpu], device_outputs[gpu],
                         batch_per_gpu * output_dim, ncclFloat, ncclSum,
                         comms[gpu], streams[gpu]);
        }
    }
};

Kernel Fusion Optimization

// Vertical fusion: GEMV + Softmax + Processing pipeline
__global__ void fused_ml_pipeline(
    const float* weights, const float* inputs,
    float* features, int batch_size, int input_dim, int output_dim
) {
    // Stage 1: GEMV computation with shared memory staging
    // Stage 2: In-place softmax normalization  
    // Stage 3: Feature extraction and output writing
    // Eliminates intermediate global memory transactions
}

Conclusion

This exploration demonstrates how architectural constraints can drive innovation in specialized GPU computing. The GTX 1650's memory bandwidth limitations necessitated aggressive optimization strategies that achieved performance characteristics competitive with professional hardware.

Key technical contributions include:

Vectorized shared memory staging for memory-bandwidth-constrained architectures
Manual loop unrolling strategies that outperform compiler optimization for specific workload patterns
Thread mapping optimization aligned with small-batch ML inference characteristics
Comprehensive benchmarking methodology ensuring statistical validity and reproducible results

The work validates the principle that domain-specific optimization can outperform general-purpose libraries when workload characteristics differ significantly from the target optimization profile.

Discussion Questions

Algorithmic specialization: What other computational domains could benefit from moving beyond general-purpose library implementations toward workload-specific kernel optimization?
Architecture evolution: How will emerging GPU architectures (RDNA, Intel Xe, ARM Mali) influence optimization strategies for specialized workloads?
Precision trade-offs: What methodologies can balance numerical stability requirements with mixed-precision acceleration opportunities in financial computing applications?

Repository: GitHub - CUDA Financial ML Kernels

Technical Analysis: Medium - Sub-millisecond GPU Task Queue

If you found this technical deep-dive valuable, consider following for more GPU optimization content and performance engineering insights.

DEV Community: shreshth kapai