Ahmed Raza Idrisi


Why 1 Million GoRoutines Don't Use Your GPU: Understanding CPUs, GPUs, Go's GMP Scheduler, and GPU Computing

Article Goal:
Write a deep technical article for software engineers explaining why millions of Go goroutines still execute on CPU cores instead of GPU cores, even though both ultimately process binary instructions. The article should start from first principles and progressively build understanding of CPU architecture, GPU architecture, Go's GMP scheduler, operating system threads, concurrency vs parallelism, and GPU computing.


Sections to Cover

1. Introduction

Discuss a common misconception:

"If a GPU has thousands of cores and Go can create millions of goroutines, why doesn't Go automatically run goroutines on the GPU?"

Introduce the core idea:

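  • Goroutines run on OS threads, which the operating system schedules onto CPU cores.
  • GPUs are specialized compute devices that run explicitly launched kernels, not general-purpose threads.
  • Goroutine concurrency (many independent, often blocking tasks) is different from GPU data parallelism (one operation applied across many data elements).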

2. Understanding CPU Architecture

Explain:

  • Modern CPUs (Intel Core i5/i7, AMD Ryzen).
  • Small number of powerful cores.
  • Branch prediction.
  • Large caches.
  • Out-of-order execution.
  • Context switching.

Include diagram:

CPU
├── Core 1
├── Core 2
├── Core 3
└── Core 4

Discuss why CPUs excel at:

  • APIs
  • Databases
  • Filesystems
  • Operating system tasks
  • Business logic

3. Understanding GPU Architecture

Explain:

  • NVIDIA RTX 4090
  • CUDA cores
  • Streaming Multiprocessors
  • Massive parallelism
  • Throughput-oriented design

Diagram:

GPU
├── SM 1
│   └── 100s of threads
├── SM 2
│   └── 100s of threads
...
└── SM N
    └── 100s of threads

Discuss why GPUs excel at:

  • Matrix multiplication
  • AI
  • Simulations
  • Image processing
  • Scientific computing

4. Go's GMP Scheduler Explained

Explain:

G = Goroutine

M = Machine (OS Thread)

P = Processor (logical scheduling context; the number of Ps is set by GOMAXPROCS)

Diagram:

1,000,000 Goroutines
        ↓
 Go Scheduler
        ↓
     P Pool
        ↓
    M Threads
        ↓
    CPU Cores

Clarify:

  • P is NOT a GPU processor.
  • P is a scheduling context inside the Go runtime; an M must hold a P to run Go code.
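
The G, M, and P counts above can be inspected from inside a program. The following is a minimal sketch using only the standard `runtime` package; the `schedulerSnapshot` type and `snapshot` function are names invented here for illustration.

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerSnapshot groups the counts the Go runtime exposes about CPUs, Ps and Gs.
// The type name is hypothetical, not part of the runtime.
type schedulerSnapshot struct {
	LogicalCPUs int // logical CPUs visible to this process
	Ps          int // number of Ps (the current GOMAXPROCS value)
	Goroutines  int // goroutines (Gs) that currently exist
}

// snapshot reads the current values without changing any scheduler setting.
func snapshot() schedulerSnapshot {
	return schedulerSnapshot{
		LogicalCPUs: runtime.NumCPU(),
		Ps:          runtime.GOMAXPROCS(0), // an argument of 0 only queries the value
		Goroutines:  runtime.NumGoroutine(),
	}
}

func main() {
	s := snapshot()
	fmt.Printf("logical CPUs=%d Ps=%d goroutines=%d\n", s.LogicalCPUs, s.Ps, s.Goroutines)
}
```

Nothing in this output refers to a GPU: every count describes CPU-side scheduling state.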

5. What Happens When You Launch One Million Goroutines

Example:

for i := 0; i < 1000000; i++ {
    go worker(i)
}

Explain:

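  • Go does not create 1 million OS threads.
  • Goroutines are multiplexed onto a small number of OS threads.
  • The scheduler balances load by work stealing: an idle P takes runnable goroutines from another P's queue.
  • Each goroutine starts with a small stack that grows on demand.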
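
The multiplexing claim can be checked empirically. The sketch below parks many goroutines on a channel and compares the goroutine count with the number of OS threads the runtime has created, as reported by the standard `threadcreate` profile. The helper name `launchParked` is an assumption made for this example, and the exact thread count will vary by machine and Go version.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
)

// launchParked starts n goroutines that block on a channel, then reports how
// many goroutines exist and how many OS threads the runtime has created so far.
func launchParked(n int) (goroutines, threads int) {
	release := make(chan struct{})
	var started, done sync.WaitGroup
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			started.Done()
			<-release // park here until the measurement is taken
			done.Done()
		}()
	}
	started.Wait() // every goroutine has started and cannot finish yet
	goroutines = runtime.NumGoroutine()
	threads = pprof.Lookup("threadcreate").Count()
	close(release)
	done.Wait()
	return goroutines, threads
}

func main() {
	g, t := launchParked(100000)
	fmt.Printf("goroutines=%d, OS threads created=%d\n", g, t)
}
```

The goroutine count should be at least the number requested, while the thread count is expected to stay far smaller, because parked goroutines do not hold a thread.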

Memory comparison:

OS Thread ≈ 1MB–8MB default stack (platform dependent)
Goroutine ≈ 2KB initial stack (grows on demand)
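
To make the comparison concrete, the arithmetic for one million execution contexts can be worked out directly. This is a back-of-the-envelope sketch: `stackBytes` is a helper invented here, and the thread figures describe default stack reservations, which are largely virtual address space rather than committed RAM.

```go
package main

import "fmt"

const (
	kib = 1 << 10
	mib = 1 << 20
	gib = 1 << 30
)

// stackBytes returns the total initial stack space for n execution contexts
// that each start with perStack bytes.
func stackBytes(n, perStack int64) int64 {
	return n * perStack
}

func main() {
	const n = 1_000_000
	fmt.Printf("goroutines at 2 KiB each:   %.1f GiB\n", float64(stackBytes(n, 2*kib))/gib)
	fmt.Printf("OS threads at 1 MiB each:   %.1f GiB\n", float64(stackBytes(n, 1*mib))/gib)
	fmt.Printf("OS threads at 8 MiB each:   %.1f GiB\n", float64(stackBytes(n, 8*mib))/gib)
}
```

Under these assumptions, a million goroutines need roughly 2 GB of initial stack, while a million thread stacks would reserve on the order of 1 TB or more.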

6. Why API Calls Don't Benefit From GPUs

Example:

go fetchMarketData()

Break down:

  • DNS
  • TCP
  • TLS
  • Socket operations
  • Waiting for remote server

Timeline:

CPU Work: 0.5ms
Network Wait: 99.5ms

Explain why GPU acceleration provides no benefit.
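
The timeline above already bounds the possible gain. A small sketch of that bound, using Amdahl's law with the 0.5 ms and 99.5 ms figures from the timeline (the function name `maxSpeedup` and the acceleration factor of 1000 are illustrative assumptions):

```go
package main

import "fmt"

// maxSpeedup returns the end-to-end speedup when only the compute portion is
// accelerated by the factor accel and the waiting time is unchanged.
func maxSpeedup(computeMs, waitMs, accel float64) float64 {
	before := computeMs + waitMs
	after := computeMs/accel + waitMs
	return before / after
}

func main() {
	// Even a 1000x faster compute step barely changes the total request time.
	fmt.Printf("speedup: %.4fx\n", maxSpeedup(0.5, 99.5, 1000))
}
```

With 99.5% of the time spent waiting on the network, the total can improve by at most about half a percent, and this estimate ignores the cost of copying data to and from the GPU, which would reduce the gain further.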


7. When GPUs Actually Shine

Examples:

  • Black-Scholes calculations
  • Greeks calculations
  • Monte Carlo simulations
  • Option chain analytics
  • Risk analysis
  • AI trading models

Diagram:

1 Million Option Contracts
        ↓
GPU Threads
        ↓
Delta
Gamma
Theta
Vega

8. Go + CUDA Architecture

Show architecture:

Go
├── Market Data
├── APIs
├── WebSockets
├── Risk Engine
└── GPU Dispatcher
          ↓
      CUDA
          ↓
        RTX 4090

Explain that Go orchestrates work while CUDA executes calculations.
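
One way to picture the "GPU Dispatcher" box is as a single goroutine that owns the device and receives work over a channel. The sketch below is an assumption about how such a component might look, not a real CUDA binding: `Backend`, `cpuBackend`, and `Dispatcher` are hypothetical names, and the CPU backend stands in for an implementation that would call CUDA code through cgo.

```go
package main

import "fmt"

// Backend is a hypothetical abstraction over wherever the math runs.
// A CUDA-backed implementation would wrap cgo calls; none is shown here.
type Backend interface {
	Run(batch []float64) ([]float64, error)
}

// cpuBackend is a stand-in that squares each input on the CPU.
type cpuBackend struct{}

func (cpuBackend) Run(batch []float64) ([]float64, error) {
	out := make([]float64, len(batch))
	for i, v := range batch {
		out[i] = v * v
	}
	return out, nil
}

type result struct {
	out []float64
	err error
}

type job struct {
	in    []float64
	reply chan<- result
}

// Dispatcher serializes access to one backend through a single goroutine.
type Dispatcher struct {
	jobs chan job
	done chan struct{}
}

func NewDispatcher(b Backend) *Dispatcher {
	d := &Dispatcher{jobs: make(chan job), done: make(chan struct{})}
	go func() {
		defer close(d.done)
		for j := range d.jobs {
			out, err := b.Run(j.in)
			j.reply <- result{out: out, err: err}
		}
	}()
	return d
}

// Submit sends one batch and blocks until the backend returns.
func (d *Dispatcher) Submit(in []float64) ([]float64, error) {
	reply := make(chan result, 1)
	d.jobs <- job{in: in, reply: reply}
	r := <-reply
	return r.out, r.err
}

// Close stops the dispatcher after in-flight work completes.
func (d *Dispatcher) Close() {
	close(d.jobs)
	<-d.done
}

func main() {
	d := NewDispatcher(cpuBackend{})
	defer d.Close()
	out, err := d.Submit([]float64{1, 2, 3})
	fmt.Println(out, err)
}
```

Keeping the backend behind an interface lets the rest of the Go services stay unchanged whether the calculation runs on the CPU or is handed to a GPU.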


9. Python vs Go vs Rust for GPU Workloads

Compare:

Language   CPU Performance   GPU Integration   Typical Use
Python     Low               Excellent         AI/ML
Go         High              Limited           Services
Rust       Very High         Growing           Systems
C++        Maximum           Native            HPC

Explain that most Python GPU workloads ultimately execute C++/CUDA kernels.


10. Real Trading System Architecture

Design a production-grade option screener:

Market Feed
      ↓
Go Services
      ↓
Kafka
      ↓
GPU Cluster
      ↓
Risk Engine
      ↓
Trade Signals
      ↓
Execution Engine

Discuss:

  • Scalability
  • Latency
  • GPU utilization
  • Cost considerations

11. Key Takeaways

Summarize:

  • Goroutines run on CPUs.
  • GMP's P is not a GPU processor.
  • GPUs are not replacements for CPUs.
  • GPUs accelerate mathematical workloads.
  • Go is excellent for orchestration.
  • CUDA/Rust/C++ are used for extreme numerical workloads.
  • Modern trading systems combine CPU concurrency and GPU acceleration.
