Fine-tuning LLM on a laptop: VRAM - Shared Memory - GPU Load - Performance

#ai #machinelearning #genai #programming

I have been playing with Supervised Fine-Tuning (SFT) and LoRA on my laptop with an NVIDIA RTX 4060 8GB. The subject of SFT is vast, picking the right training hyperparameters is more magic than science, and there's a good deal of experimentation...

Still, let me share one small finding: how GPU utilization and shared memory affect training speed.

I used the Stable LM 2 1.6B base model and turned it into a chat model using 4400 samples from the OASST2 dataset. Here is the training file.

Below is a screenshot from W&B showing system metrics for 2 runs. The only difference was the batch size, all other params were the same:

1) Batch size 1 (brown line) - 16.5 minutes per epoch
2) Batch size 2 (the other color) - 25.0 minutes per epoch

Pay attention to GPU Power Usage: the brown line fluctuates around 82W, while the 2nd run averages 65W. Apparently the GPU was loaded better during the 1st run, which resulted in roughly 50% higher throughput (16.5 vs 25.0 minutes per epoch), even though a larger batch size is supposed to speed up training.
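To put the two epoch times side by side, here is a small arithmetic sketch. It uses only the numbers quoted above and assumes all 4400 samples are seen once per epoch; the function names are made up for illustration.

```python
# Numbers quoted in the post; assumes every sample is seen once per epoch.
SAMPLES_PER_EPOCH = 4400

def samples_per_minute(samples: int, minutes: float) -> float:
    """Training throughput for one epoch."""
    return samples / minutes

def extra_time_pct(fast_minutes: float, slow_minutes: float) -> float:
    """How much longer the slow run takes, as a percentage of the fast run."""
    return (slow_minutes - fast_minutes) / fast_minutes * 100

bs1 = samples_per_minute(SAMPLES_PER_EPOCH, 16.5)  # batch size 1
bs2 = samples_per_minute(SAMPLES_PER_EPOCH, 25.0)  # batch size 2

print(f"batch size 1: {bs1:.0f} samples/min")
print(f"batch size 2: {bs2:.0f} samples/min")
print(f"batch size 2 takes {extra_time_pct(16.5, 25.0):.1f}% longer per epoch")
```

Comparing samples per minute rather than steps per minute keeps the comparison fair, since a step with batch size 2 processes twice as many samples.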

The reason is the use of system RAM instead of VRAM: when the GPU runs out of fast memory, it will happily spill its data over into slower system memory (shown as Shared GPU Memory in Windows Task Manager). An excess of just 0.9GB was enough to make each epoch take about 50% longer. The larger the portion of data in system RAM, the slower the training.
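One way to catch the spill-over from inside a training script is to compare what PyTorch has reserved with the card's physical VRAM. This is a minimal sketch, not taken from the training file. It assumes PyTorch with CUDA, and it assumes that with the Windows driver's system-memory fallback the reserved figure can exceed the physical total. The helper names `vram_overflow_gb` and `check_gpu_memory` are hypothetical.

```python
def vram_overflow_gb(reserved_bytes: int, total_bytes: int) -> float:
    """GiB reserved beyond physical VRAM (0.0 if everything fits)."""
    return max(0, reserved_bytes - total_bytes) / 1024**3

def check_gpu_memory(device: int = 0) -> None:
    """Print a warning if PyTorch has reserved more than the card physically has."""
    import torch  # imported lazily so the helper above works without PyTorch

    if not torch.cuda.is_available():
        print("CUDA not available, nothing to check")
        return
    total = torch.cuda.get_device_properties(device).total_memory
    reserved = torch.cuda.memory_reserved(device)
    overflow = vram_overflow_gb(reserved, total)
    if overflow > 0:
        print(f"WARNING: ~{overflow:.1f} GiB is likely sitting in shared system memory")
    else:
        print(f"OK: {reserved / 1024**3:.1f} of {total / 1024**3:.1f} GiB VRAM reserved")

# Example with made-up numbers: 8.9 GiB reserved on an 8 GiB card
print(f"{vram_overflow_gb(int(8.9 * 1024**3), 8 * 1024**3):.1f} GiB over")
```

Calling `check_gpu_memory()` every few hundred steps would be enough. Other processes using the GPU are not counted here, so Task Manager remains the more complete view.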

The takeaway: keep an eye on GPU load and watch out for the training job spilling outside of VRAM. It happens silently, without any warnings, and can extend your training into another day :) If you are slightly short of VRAM, play with quantization or batch size and see whether you can fit all the data into VRAM, with nothing left in system RAM.
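To get a feel for what quantization buys, here is back-of-the-envelope arithmetic for the weights alone. It assumes a nominal 1.6 billion parameters and ignores activations, gradients, optimizer state and quantization overhead, so the real footprint during training is higher.

```python
# Nominal parameter count for a "1.6B" model; the exact count differs slightly.
N_PARAMS = 1.6e9

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}

def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights only, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9

for name, size in BYTES_PER_PARAM.items():
    print(f"{name:>9}: {weights_gb(N_PARAMS, size):.1f} GB")
```

On an 8GB card, the gap between 16-bit and 4-bit weights is more than the 0.9GB excess mentioned above, which is why quantization is worth trying before giving up on a larger batch size.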

P.S.>

GPU load is not the only factor, nor the most important one, in determining total execution time. E.g. I could see 100W usage when trying out the GaLore PEFT method, yet it was way slower with the same dataset and similar params.

The gap/straight line in the chart is due to a missing internet connection during that period.

Besides, I could see a 50% speed-up with LoRA by simply enabling Flash Attention 2, though I had to use WSL2 and run the job there since Windows is not supported yet. Later I switched to SDPA (scaled dot product attention).
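For reference, in Hugging Face Transformers the attention backend can be selected with the `attn_implementation` argument of `from_pretrained`. The sketch below is an assumption about the setup, not a copy of the training file: the model ID `stabilityai/stablelm-2-1_6b` and the helper names are illustrative, and it falls back to SDPA when the `flash_attn` package is not installed.

```python
import importlib.util

def pick_attn_implementation() -> str:
    """Prefer Flash Attention 2 if the flash_attn package is installed, else SDPA."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

def load_model(model_id: str = "stabilityai/stablelm-2-1_6b"):
    """Load a causal LM with the chosen attention backend (downloads the weights)."""
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # Flash Attention 2 needs fp16 or bf16
        attn_implementation=pick_attn_implementation(),
    )

print(pick_attn_implementation())
```

Picking the backend at runtime keeps one script usable both under WSL2 with `flash_attn` installed and on plain Windows, where it would use SDPA.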

P.P.S.>

A life hack to quickly see if the GPU is fully utilized: check the GPU temperature in Windows Task Manager. If it is just a few degrees above idle (e.g. 50-60°C), the GPU is underutilized. If you hit 80°C, you are good. E.g. the RTX 4060 has a throttling temperature of 87°C, and reaching it means 100% utilization. The advice won't hold for desktops: there's typically extra cooling capacity, so the GPU won't be thermally throttled.
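The same check can be scripted instead of watching Task Manager. The sketch below assumes `nvidia-smi` is on the PATH and queries temperature, power draw and utilization. The 80°C threshold just mirrors the rule of thumb above, the function names are made up for illustration, and some laptops report power draw as `[N/A]`, which this simple parser does not handle.

```python
import subprocess

QUERY = "temperature.gpu,power.draw,utilization.gpu"

def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line such as '78, 81.52, 97' from nvidia-smi."""
    temp, power, util = (field.strip() for field in line.split(","))
    return {"temp_c": float(temp), "power_w": float(power), "util_pct": float(util)}

def looks_busy(stats: dict, temp_threshold_c: float = 80.0) -> bool:
    """Rule of thumb from the text: a laptop GPU at around 80°C is well loaded."""
    return stats["temp_c"] >= temp_threshold_c

def read_gpu_stats() -> dict:
    """Query the first GPU via nvidia-smi (must be installed and on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_line(out.strip().splitlines()[0])

# Example with a made-up reading, so it runs without a GPU
sample = parse_gpu_line("78, 81.52, 97")
print(sample, looks_busy(sample))
```

On a machine with an NVIDIA GPU, `looks_busy(read_gpu_stats())` gives the live answer.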
