Summary
- Faster and more reliable Hugging Face downloads with aria2 and GNU Parallel.
- Use aria2 to download Hugging Face models and datasets in parallel. If errors occur during the download, you can resume it from where it left off.
- Use GNU Parallel to quickly verify the hashes of the downloaded files in parallel using multiple CPU cores.
Introduction
Downloading machine learning models and datasets from Hugging Face can be time-consuming and unreliable, especially when dealing with large files or unstable internet connections. Follow this guide to speed up and improve the reliability of your Hugging Face downloads using two powerful command-line tools: aria2 and GNU Parallel.
Prerequisites
Before we get started, make sure you have the following tools installed on your system:
- Git Large File Storage (git-lfs): An open source Git extension for versioning large files.
- aria2: A lightweight multi-protocol & multi-source command-line download utility.
- GNU Parallel: A shell tool for executing jobs in parallel using one or more computers.
- sha256sum: A command to compute checksums of files using the SHA-256 algorithm. Note: this command ships with typical Linux distributions; macOS users can install it with Homebrew (coreutils).
Ubuntu, macOS or Conda users can install these tools using the following commands:
Ubuntu
sudo apt install git-lfs aria2 parallel -y
macOS
brew install git-lfs aria2 parallel
brew install coreutils # for sha256sum command
Conda Environment
source ~/miniconda3/bin/activate
conda create -n hf_dl -y
conda activate hf_dl
conda install conda-forge::git-lfs conda-forge::aria2 conda-forge::parallel -y
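As a quick sanity check that everything is on your PATH, you can print each tool's version (the exact output will vary by platform and version):
git lfs version
aria2c --version | head -n 1
parallel --version | head -n 1
sha256sum --version | head -n 1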
Downloading Hugging Face Models
In this section, we will see how to download Hugging Face models (e.g. Qwen/Qwen2.5-72B-Instruct) using aria2.
First, let's clone a Hugging Face repository using git. To avoid downloading the large files, we set the GIT_LFS_SKIP_SMUDGE environment variable to 1.
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen2.5-72B-Instruct
cd Qwen2.5-72B-Instruct
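With smudge skipped, the large files are checked out as small git-lfs pointer stubs rather than the real weights. You can confirm this by peeking at one; a pointer file looks like this (the size line holds the real file's byte count):
head -c 200 model-00001-of-00037.safetensors
# version https://git-lfs.github.com/spec/v1
# oid sha256:18d5d2b73010054d1c9fc4a1ba777d575e871b10f1155f3ae22481b7752bc425
# size ...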
The git lfs ls-files command lists the files tracked by git-lfs. With the -l option, it shows the OID (SHA256 hash) and the filename. We will use this information to download the files using aria2 and to verify the SHA256 hashes.
git lfs ls-files -l
# 18d5d2b73010054d1c9fc4a1ba777d575e871b10f1155f3ae22481b7752bc425 - model-00001-of-00037.safetensors
# 802a3abf41ccdeb01931c5e40eb177ea114a1c47f68cb251d75c2de0fe196677 - model-00002-of-00037.safetensors
# c3a2ab093723d4981dcc6b20c7f48c444ccd9d8572b59f0bf7caa632715b7d36 - model-00003-of-00037.safetensors
# 5f35d5475cc4730ca9a38f958f74b5322d28acbd4aec30560987ed12e2748d8f - model-00004-of-00037.safetensors
# b7f066aef57e0fe29b516ef743fec7a90518151bd5a9df19263dfdee214dfe4d - model-00005-of-00037.safetensors
# ...
With the -n option, git lfs ls-files shows only the filenames.
git lfs ls-files -n
# model-00001-of-00037.safetensors
# model-00002-of-00037.safetensors
# model-00003-of-00037.safetensors
# model-00004-of-00037.safetensors
# model-00005-of-00037.safetensors
# ...
git lfs ls-files -n | wc -l # 37
Next, we create a list of files (files.txt) to download with aria2. We use xargs to generate the download URL and the output filename for each entry in the list.
git lfs ls-files -n | xargs -d "\n" -I {} echo -e "https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/{}\n out={}" >> files.txt
head files.txt
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/model-00001-of-00037.safetensors
# out=model-00001-of-00037.safetensors
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/model-00002-of-00037.safetensors
# out=model-00002-of-00037.safetensors
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/model-00003-of-00037.safetensors
# out=model-00003-of-00037.safetensors
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/model-00004-of-00037.safetensors
# out=model-00004-of-00037.safetensors
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/model-00005-of-00037.safetensors
# out=model-00005-of-00037.safetensors
wc -l files.txt # 74 files.txt
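Note that the -d option is specific to GNU xargs; the BSD xargs that ships with macOS does not support it. A portable sketch using a plain shell loop that produces the same files.txt:
git lfs ls-files -n | while IFS= read -r f; do
  printf 'https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/%s\n out=%s\n' "$f" "$f"
done > files.txt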
Before downloading the files, we need to remove the pointer files that are already in the directory. Otherwise, aria2 will add a suffix (e.g. .1) to the downloaded files.
git lfs ls-files -n | xargs -d '\n' rm
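As an alternative to deleting the pointer files, aria2 can be told to overwrite them in place. A sketch using two standard aria2c options, combined with the authenticated download command shown below:
aria2c -j 8 -i files.txt --auto-file-renaming=false --allow-overwrite=true \
  --header="Authorization: Bearer $(cat ~/.cache/huggingface/token)"
# --auto-file-renaming=false  disables the .1, .2, ... suffixes
# --allow-overwrite=true      lets aria2 replace the existing pointer stubs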
If the model or dataset requires authentication, you will need to log in to Hugging Face using the huggingface-cli login command. This command stores the authentication token in the file ~/.cache/huggingface/token. We can use this token to download the files with aria2.
huggingface-cli login
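For non-interactive environments (e.g. CI), huggingface-cli login also accepts the token as an argument; a sketch, assuming your token is stored in the HF_TOKEN environment variable:
huggingface-cli login --token "$HF_TOKEN"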
Finally, we download the files using aria2. The -j option specifies the number of simultaneous downloads. The appropriate value depends on your network speed and the server's capabilities, but I recommend starting somewhere between 4 and 12. Be careful not to hit the server's rate limit.
aria2c -j 8 -i files.txt --header="Authorization: Bearer $(cat ~/.cache/huggingface/token)"
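aria2 can also split each individual file across multiple connections, which often helps on high-latency links. A sketch using standard aria2c options; the numbers below are starting points, not tuned values:
aria2c -j 4 -x 4 -s 4 -k 10M -c -i files.txt \
  --header="Authorization: Bearer $(cat ~/.cache/huggingface/token)"
# -x  max connections per server    -s  split each file into N segments
# -k  minimum segment size          -c  resume partially downloaded files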
Verifying the SHA256 Hashes
After downloading the files, we need to verify the SHA256 hashes to ensure the integrity of the files. We use the sha256sum command to calculate the SHA256 hash of each file and compare it with the expected hash.
Unfortunately, sha256sum takes a long time to hash large files. We can speed up the process by using GNU Parallel (the parallel command) to compute the hashes in parallel.
First, we create files to store the expected SHA256 hashes for each file.
git lfs ls-files -l | awk '{print $1 " " $3 > $3".sha256"}'
find . -name "*.sha256" -print | wc -l # 37
cat model-00001-of-00037.safetensors.sha256
# 18d5d2b73010054d1c9fc4a1ba777d575e871b10f1155f3ae22481b7752bc425 model-00001-of-00037.safetensors
Let's compute the SHA256 hash of the first file using the sha256sum command.
sha256sum model-00001-of-00037.safetensors
# 18d5d2b73010054d1c9fc4a1ba777d575e871b10f1155f3ae22481b7752bc425 model-00001-of-00037.safetensors
GNU Parallel lets us compute the hashes in parallel across multiple CPU cores. The -j option specifies the number of parallel jobs to run; you can set it to the number of CPU cores on your system. On Linux, the nproc command reports the number of cores.
time find . -name "*.sha256" -print | sort | parallel -j 8 -u "sha256sum -c {} 2>&1" | tee sha256sum.log
# model-00001-of-00037.safetensors: OK
# model-00007-of-00037.safetensors: OK
# model-00003-of-00037.safetensors: OK
# ...
Let's check the contents of the file sha256sum.log. It should contain the results of the SHA256 hash verification for each file. The OK message indicates that the hash verification was successful.
wc -l sha256sum.log # 37 sha256sum.log
sort sha256sum.log | nl
# 1 model-00001-of-00037.safetensors: OK
# 2 model-00002-of-00037.safetensors: OK
# 3 model-00003-of-00037.safetensors: OK
# ...
# 35 model-00035-of-00037.safetensors: OK
# 36 model-00036-of-00037.safetensors: OK
# 37 model-00037-of-00037.safetensors: OK
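Rather than eyeballing all 37 lines, you can let grep surface any failures; sha256sum -c prints FAILED for a mismatch, so a clean log contains only OK lines:
grep -v ': OK$' sha256sum.log || echo "All hashes verified"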
OK! All the files have been successfully downloaded and verified!
Conclusion
- By using aria2 to download files in parallel and GNU Parallel to compute the SHA256 hashes in parallel, you can speed up and improve the reliability of your Hugging Face downloads.
- These tools are particularly useful when dealing with large files and/or unstable internet connections.
- Remember to adjust the number of parallel downloads or jobs based on your network speed and the server's capabilities.
Citation
@software{tange_2024_14550073,
  author    = {Tange, Ole},
  title     = {GNU Parallel 20241222 ('Bashar')},
  month     = dec,
  year      = 2024,
  note      = {{GNU Parallel is a general parallelizer to run
               multiple serial command line programs in parallel
               without changing them.}},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.14550073},
  url       = {https://doi.org/10.5281/zenodo.14550073}
}