I've come across many blogs on this topic, but most of them explain it in such a complicated way that it's hard to follow.
So I decided to break it down in simple, easy-to-understand language that actually helps you diagnose slowdowns and fix them.
Here are the 5 reasons why your GPU starts fast but gradually becomes slower during training:
1️⃣ Your workload becomes memory-bound instead of compute-bound
2️⃣ Your workload becomes less parallelizable
3️⃣ Your tensor shapes stop aligning with GPU-friendly sizes
4️⃣ Thermal throttling: the GPU heats up and automatically slows down to protect itself
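
To make reasons 1 and 3 a bit more concrete, here is a minimal Python sketch. It is only an illustration, not code from the book: the function names are my own, the peak FLOP/s and bandwidth numbers are hypothetical placeholders (substitute your GPU's datasheet values), and the "multiple of 8" default reflects a commonly cited guideline for FP16 Tensor Core kernels rather than a rule for every GPU. It estimates the arithmetic intensity of a matrix multiply under the simplifying assumption that each matrix is moved through memory exactly once, compares it to the hardware's FLOPs-per-byte ratio, and rounds a dimension up to a GPU-friendly size.

```python
def matmul_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Simplifying assumption: A and B are each read once and C is written once.
    bytes_per_element=2 corresponds to FP16/BF16.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def is_memory_bound(intensity, peak_flops, peak_bandwidth):
    """Roofline-style check: below the ridge point, memory traffic is the limit."""
    ridge_point = peak_flops / peak_bandwidth  # FLOPs the GPU can do per byte moved
    return intensity < ridge_point


def round_up_to_multiple(size, multiple=8):
    """Pad a dimension (e.g. vocabulary size) up to the next multiple."""
    return ((size + multiple - 1) // multiple) * multiple


# Hypothetical hardware numbers, for illustration only.
PEAK_FLOPS = 100e12      # 100 TFLOP/s
PEAK_BANDWIDTH = 1e12    # 1 TB/s

big = matmul_arithmetic_intensity(4096, 4096, 4096)   # large square matmul
skinny = matmul_arithmetic_intensity(1, 4096, 4096)   # a single row times a matrix

print(f"large matmul:  {big:.1f} FLOPs/byte, memory-bound: "
      f"{is_memory_bound(big, PEAK_FLOPS, PEAK_BANDWIDTH)}")
print(f"skinny matmul: {skinny:.2f} FLOPs/byte, memory-bound: "
      f"{is_memory_bound(skinny, PEAK_FLOPS, PEAK_BANDWIDTH)}")
print("padded size:", round_up_to_multiple(50257))
```

With these placeholder numbers, the large square matmul lands on the compute-bound side and the skinny one on the memory-bound side, which is the kind of shift reason 1 describes. Real kernels reuse and re-read data in more complicated ways, so treat the result as a first estimate and confirm it with a profiler.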
These excerpts are taken from my book "Building a Small Language Model from Scratch: A Practical Guide". If you'd like to dive deeper into the topic, feel free to check out the book.
Gumroad: https://plakhera.gumroad.com/l/BuildingASmallLanguageModelfromScratch
Amazon: https://www.amazon.com/dp/B0G64SQ4F8/
Leanpub: https://leanpub.com/buildingasmalllanguagemodelfromscratch/
Blog link: https://www.linkedin.com/pulse/why-your-gpu-gets-slower-during-training-even-though-nothing-lakhera-wblsc/