I've come across many blogs on this topic, but most of them explain it in such a complicated way that it's almost impossible to follow.
So I decided to break it down in simple, easy-to-understand language that actually helps you diagnose slowdowns and fix them.
Here are the four reasons why your GPU starts fast but gradually becomes slower during training:
1️⃣ Your workload becomes memory-bound instead of compute-bound
2️⃣ Your workload becomes less parallelizable
3️⃣ Your tensor shapes stop aligning with GPU-friendly sizes
4️⃣ Thermal throttling: GPU heats up and automatically slows down to protect itself
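To make reason 3️⃣ concrete, here is a minimal, framework-agnostic sketch of padding a tensor dimension up to a GPU-friendly multiple. The specific values are illustrative assumptions (not from the book): NVIDIA's Tensor Cores generally run fastest when matrix dimensions are multiples of 8 (FP16) or larger powers of two, which is why some GPT-2 implementations pad the 50,257-token vocabulary to 50,304.

```python
def pad_to_multiple(dim: int, multiple: int = 8) -> int:
    """Round a tensor dimension up to the nearest multiple of `multiple`.

    Keeping dimensions on multiples of 8 (FP16) or 64 helps the GPU
    dispatch efficient Tensor Core kernels instead of slower fallbacks.
    """
    return ((dim + multiple - 1) // multiple) * multiple

# Illustrative example: GPT-2's vocabulary size is 50,257, which is not
# aligned to any Tensor Core-friendly boundary.
vocab_size = 50_257
print(pad_to_multiple(vocab_size, 8))   # -> 50264
print(pad_to_multiple(vocab_size, 64))  # -> 50304
```

The extra embedding rows waste a little memory, but a single aligned matrix multiply is usually faster than an unaligned one, so the trade-off tends to pay off.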
These excerpts are taken from my book "Building a Small Language Model from Scratch: A Practical Guide". If you'd like to dive deeper into the topic, feel free to check out the book.
✅ Gumroad: https://plakhera.gumroad.com/l/BuildingASmallLanguageModelfromScratch
✅ Amazon: https://www.amazon.com/dp/B0G64SQ4F8/
✅ Leanpub: https://leanpub.com/buildingasmalllanguagemodelfromscratch/
🔗 Blog link: https://www.linkedin.com/pulse/why-your-gpu-gets-slower-during-training-even-though-nothing-lakhera-wblsc/
