Ahmed Raza Idrisi

Posted on Jun 2

Why 1 Million Goroutines Don't Use Your GPU: Understanding CPUs, GPUs, Go's GMP Scheduler, and GPU Computing

#programming #go #architecture #webdev

Article Goal:
Write a deep technical article for software engineers explaining why millions of Go goroutines still execute on CPU cores instead of GPU cores, even though both ultimately process binary instructions. The article should start from first principles and progressively build understanding of CPU architecture, GPU architecture, Go's GMP scheduler, operating system threads, concurrency vs parallelism, and GPU computing.

Sections to Cover

1. Introduction

Discuss a common misconception:

"If a GPU has thousands of cores and Go can create millions of goroutines, why doesn't Go automatically run goroutines on the GPU?"

Introduce the core idea:

Goroutines run on OS threads, which the kernel schedules onto CPU cores.
GPUs are specialized compute devices.
Concurrency is different from GPU parallelism.

2. Understanding CPU Architecture

Explain:

Modern CPUs (Intel Core i5/i7, AMD Ryzen).
Small number of powerful cores.
Branch prediction.
Large caches.
Out-of-order execution.
Context switching.

Include diagram:

CPU
├── Core 1
├── Core 2
├── Core 3
└── Core 4

Discuss why CPUs excel at:

APIs
Databases
Filesystems
Operating system tasks
Business logic

3. Understanding GPU Architecture

Explain:

NVIDIA RTX 4090
CUDA cores
Streaming Multiprocessors
Massive parallelism
Throughput-oriented design

Diagram:

GPU
├── SM 1
│   └── 100s of threads
├── SM 2
│   └── 100s of threads
...
└── SM N

Discuss why GPUs excel at:

Matrix multiplication
AI
Simulations
Image processing
Scientific computing
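The workloads above share one shape: the same small function is applied independently to every element of a large array. The sketch below imitates that shape in Go, on CPU threads only. The helper name `launch` and the SAXPY example are illustrative assumptions, not a GPU API; a real GPU runs the per-index function on hardware threads inside each SM rather than on goroutines.

```go
package main

import (
	"fmt"
	"sync"
)

// launch (hypothetical helper) runs kernel(i) for every i in [0, n),
// splitting the index range into contiguous chunks, one per worker goroutine.
// It only imitates the shape of a GPU kernel launch; nothing here touches a GPU.
func launch(n, workers int, kernel func(i int)) {
	var wg sync.WaitGroup
	chunk := (n + workers - 1) / workers // ceiling division
	for start := 0; start < n; start += chunk {
		end := start + chunk
		if end > n {
			end = n
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				kernel(i)
			}
		}(start, end)
	}
	wg.Wait()
}

func main() {
	x := []float32{1, 2, 3, 4}
	y := []float32{10, 20, 30, 40}
	const a = 2
	// SAXPY: every element is updated by the same instructions,
	// and no element depends on any other.
	launch(len(x), 2, func(i int) { y[i] = a*x[i] + y[i] })
	fmt.Println(y) // prints [12 24 36 48]
}
```

The important property is the independence of each index, not the goroutines. That independence is what lets a throughput-oriented device spread the work across thousands of threads.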

4. Go's GMP Scheduler Explained

Explain:

G = Goroutine

M = Machine (OS Thread)

P = Processor (logical scheduling context; the number of Ps is set by GOMAXPROCS)

Diagram:

1,000,000 Goroutines
        ↓
 Go Scheduler
        ↓
     P Pool
        ↓
    M Threads
        ↓
    CPU Cores

Clarify:

P is NOT a GPU processor.
P is a scheduling context inside the Go runtime.
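To make the point about P concrete, here is a minimal sketch that reads the number of Ps from a running program. It uses only the standard `runtime` package; the helper name `schedulerInfo` is made up for illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerInfo (hypothetical helper) reports the number of logical CPUs
// visible to the process and the current number of Ps.
func schedulerInfo() (cpus, ps int) {
	cpus = runtime.NumCPU()
	// GOMAXPROCS(0) queries the current value without changing it.
	ps = runtime.GOMAXPROCS(0)
	return cpus, ps
}

func main() {
	cpus, ps := schedulerInfo()
	fmt.Printf("logical CPUs: %d, Ps (GOMAXPROCS): %d\n", cpus, ps)
}
```

On a typical machine the two numbers match. Both describe CPU resources; neither value is related to any GPU installed in the system.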

5. What Happens When You Launch One Million Goroutines

Example:

for i := 0; i < 1000000; i++ {
    go worker(i)
}

Explain:

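Go does not create 1 million OS threads.
Goroutines are multiplexed onto a small pool of OS threads, with at most GOMAXPROCS of them running Go code at once.
The scheduler balances load by work stealing: an idle P takes goroutines from another P's run queue.
Goroutine stacks start small and grow on demand.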

Memory comparison:

OS thread ≈ 1–8 MB default stack reservation (platform dependent)
Goroutine ≈ 2 KB initial stack (grows as needed)

6. Why API Calls Don't Benefit From GPUs

Example:

go fetchMarketData()

Break down:

DNS
TCP
TLS
Socket operations
Waiting for remote server

Timeline:

CPU Work: 0.5ms
Network Wait: 99.5ms

Explain why GPU acceleration provides no benefit.
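The timeline can be turned into a quick calculation in the spirit of Amdahl's law: only the compute portion of a request can be accelerated, so the waiting time sets the ceiling. The function name `maxSpeedup` and the 1000x acceleration factor below are assumptions chosen for illustration.

```go
package main

import "fmt"

// maxSpeedup returns the end-to-end speedup when only the compute portion of
// a request is accelerated by the given factor. cpuMs and waitMs are the
// compute time and network wait in milliseconds.
func maxSpeedup(cpuMs, waitMs, factor float64) float64 {
	before := cpuMs + waitMs
	after := cpuMs/factor + waitMs
	return before / after
}

func main() {
	// 0.5 ms of CPU work and 99.5 ms of network wait, as in the timeline.
	// The 1000x factor is an arbitrary, generous assumption.
	fmt.Printf("%.4fx\n", maxSpeedup(0.5, 99.5, 1000))
}
```

Even with the compute part made a thousand times faster, the request finishes only about half a percent sooner, and that is before counting the cost of moving data to and from the GPU.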

7. When GPUs Actually Shine

Examples:

Black-Scholes calculations
Greeks calculations
Monte Carlo simulations
Option chain analytics
Risk analysis
AI trading models

Diagram:

1 Million Option Contracts
        ↓
GPU Threads
        ↓
Delta
Gamma
Theta
Vega

8. Go + CUDA Architecture

Show architecture:

Go
├── Market Data
├── APIs
├── WebSockets
├── Risk Engine
└── GPU Dispatcher
          ↓
      CUDA
          ↓
        RTX 4090

Explain that Go orchestrates work while CUDA executes calculations.
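One way the GPU Dispatcher box could look on the Go side is sketched below. It collects incoming values into batches and hands each batch to a backend. The `Backend` interface, `cpuBackend`, and `dispatch` are all hypothetical names; `cpuBackend` is a stand-in that squares numbers on the CPU so the sketch runs without a GPU. A real implementation would call into CUDA behind the same interface (for example through cgo), and that binding is not shown.

```go
package main

import "fmt"

// Backend is a hypothetical interface for whatever executes a batch.
type Backend interface {
	Compute(batch []float64) []float64
}

// cpuBackend is a stand-in so the sketch runs without a GPU.
type cpuBackend struct{}

func (cpuBackend) Compute(batch []float64) []float64 {
	out := make([]float64, len(batch))
	for i, v := range batch {
		out[i] = v * v
	}
	return out
}

// dispatch groups incoming values into batches of up to batchSize, sends each
// batch to the backend, and emits the results in input order.
func dispatch(in <-chan float64, out chan<- float64, b Backend, batchSize int) {
	defer close(out)
	batch := make([]float64, 0, batchSize)
	flush := func() {
		for _, r := range b.Compute(batch) {
			out <- r
		}
		batch = batch[:0]
	}
	for v := range in {
		batch = append(batch, v)
		if len(batch) == batchSize {
			flush()
		}
	}
	if len(batch) > 0 {
		flush() // send the final partial batch
	}
}

func main() {
	in := make(chan float64)
	out := make(chan float64)
	go dispatch(in, out, cpuBackend{}, 4)
	go func() {
		defer close(in)
		for i := 1; i <= 10; i++ {
			in <- float64(i)
		}
	}()
	for r := range out {
		fmt.Println(r)
	}
}
```

Batching is the design choice that matters here: each hand-off to a GPU carries transfer and launch overhead, so sending many items at once usually pays off better than sending them one by one.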

9. Python vs Go vs Rust for GPU Workloads

Compare:

Language   CPU Performance   GPU Integration   Typical Use
Python     Low               Excellent         AI/ML
Go         High              Limited           Services
Rust       Very High         Growing           Systems
C++        Maximum           Native            HPC

Explain that most Python GPU workloads ultimately execute C++/CUDA kernels.

10. Real Trading System Architecture

Design a production-grade option screener:

Market Feed
      ↓
Go Services
      ↓
Kafka
      ↓
GPU Cluster
      ↓
Risk Engine
      ↓
Trade Signals
      ↓
Execution Engine

Discuss:

Scalability
Latency
GPU utilization
Cost considerations

11. Key Takeaways

Summarize:

Goroutines run on CPUs.
GMP's P is not a GPU processor.
GPUs are not replacements for CPUs.
GPUs accelerate mathematical workloads.
Go is excellent for orchestration.
CUDA, C++, and Rust are used for the heaviest numerical workloads.
Modern trading systems combine CPU concurrency and GPU acceleration.
