Susumu OTA

Posted on Mar 23 • Edited on Mar 24

Faster and More Reliable Hugging Face Downloads Using aria2 and GNU Parallel

#huggingface #llm #python #ai

Summary

Speed up Hugging Face downloads and make them more reliable with aria2 and GNU Parallel.
Use aria2 to download Hugging Face models and datasets in parallel. If a download fails, you can resume it from where it left off.
Use GNU Parallel to verify the hashes of the downloaded files in parallel across multiple CPU cores.

Introduction

Downloading machine learning models and datasets from Hugging Face can be slow and unreliable, especially with large files or an unstable internet connection. This guide shows how to make those downloads faster and more reliable using two command-line tools: aria2 and GNU Parallel.

Prerequisites

Before we get started, make sure you have the following tools installed on your system:

Git Large File Storage (git-lfs): An open source Git extension for versioning large files.
aria2: A lightweight multi-protocol & multi-source command-line download utility.
GNU Parallel: A shell tool for executing jobs in parallel using one or more computers.
sha256sum: A command to compute checksums of files using the SHA-256 algorithm. Note: This command is available on typical Linux distributions. macOS users can install it using Homebrew.

Ubuntu, macOS or Conda users can install these tools using the following commands:

Ubuntu

sudo apt install git-lfs aria2 parallel -y

macOS

brew install git-lfs aria2 parallel
brew install coreutils  # for sha256sum command

Conda Environment

source ~/miniconda3/bin/activate
conda create -n hf_dl -y
conda activate hf_dl

conda install conda-forge::git-lfs conda-forge::aria2 conda-forge::parallel -y
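
Before moving on, it can help to confirm that every tool is actually on your PATH. The following is a minimal sketch, not part of the original workflow; the function name missing_tools is made up for illustration, and it assumes the usual binary names (git-lfs for Git LFS, aria2c for aria2).

```shell
#!/usr/bin/env bash
# Print the name of every command given as an argument that is not on PATH.
# "missing_tools" is a hypothetical helper name used only in this sketch.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed binary names: Git LFS installs "git-lfs", aria2 installs "aria2c".
missing="$(missing_tools git git-lfs aria2c parallel sha256sum)"
if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing" >&2
else
  echo "All required tools found."
fi
```

If anything is listed as missing, install it with the commands above before continuing.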

Downloading Hugging Face Models

In this section, we will see how to download Hugging Face models (e.g. Qwen/Qwen2.5-72B-Instruct) using aria2.

First, let's clone a Hugging Face repository using git. To avoid downloading the large files, we set the GIT_LFS_SKIP_SMUDGE environment variable to 1.

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen2.5-72B-Instruct
cd Qwen2.5-72B-Instruct
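
With smudging skipped, each LFS-tracked file in the working tree is a small text pointer rather than the real data. The sketch below shows what such a pointer looks like according to the Git LFS pointer format and how to read the hash and size out of it. The helper name lfs_pointer_info and the size value 12345 are made up for illustration; only the OID is taken from the listing in this article.

```shell
#!/usr/bin/env bash
# Print "<oid> <size>" for a Git LFS pointer file.
# "lfs_pointer_info" is a hypothetical helper name used only in this sketch.
lfs_pointer_info() {
  awk '$1 == "oid"  { sub(/^sha256:/, "", $2); oid = $2 }
       $1 == "size" { size = $2 }
       END          { print oid, size }' "$1"
}

# Illustrative pointer file. The size (12345) is invented for the demo;
# it is NOT the real size of the model shard.
pointer="$(mktemp)"
cat > "$pointer" <<'EOF'
version https://git-lfs.github.com/spec/v1
oid sha256:18d5d2b73010054d1c9fc4a1ba777d575e871b10f1155f3ae22481b7752bc425
size 12345
EOF

lfs_pointer_info "$pointer"
# prints: 18d5d2b73010054d1c9fc4a1ba777d575e871b10f1155f3ae22481b7752bc425 12345
rm -f "$pointer"
```

In the cloned repository, running the helper on model-00001-of-00037.safetensors before the download should print that file's expected hash and byte size.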

The git lfs ls-files command lists the files tracked by git-lfs. With the -l option it will show the OID (SHA256 hash) and the filename. We will use this information to download the files using aria2 and to verify the SHA256 hashes.

git lfs ls-files -l
# 18d5d2b73010054d1c9fc4a1ba777d575e871b10f1155f3ae22481b7752bc425 - model-00001-of-00037.safetensors
# 802a3abf41ccdeb01931c5e40eb177ea114a1c47f68cb251d75c2de0fe196677 - model-00002-of-00037.safetensors
# c3a2ab093723d4981dcc6b20c7f48c444ccd9d8572b59f0bf7caa632715b7d36 - model-00003-of-00037.safetensors
# 5f35d5475cc4730ca9a38f958f74b5322d28acbd4aec30560987ed12e2748d8f - model-00004-of-00037.safetensors
# b7f066aef57e0fe29b516ef743fec7a90518151bd5a9df19263dfdee214dfe4d - model-00005-of-00037.safetensors
# ...

With the -n option, git lfs ls-files will only show the filenames.

git lfs ls-files -n
# model-00001-of-00037.safetensors
# model-00002-of-00037.safetensors
# model-00003-of-00037.safetensors
# model-00004-of-00037.safetensors
# model-00005-of-00037.safetensors
# ...

git lfs ls-files -n | wc -l  # 37

Next, we create a list of files (files.txt) to download with aria2. We use xargs to generate the download URL and the output filename for the list.

git lfs ls-files -n | xargs -d "\n" -I {} echo -e "https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/{}\n    out={}" > files.txt

head files.txt
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/model-00001-of-00037.safetensors
#     out=model-00001-of-00037.safetensors
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/model-00002-of-00037.safetensors
#     out=model-00002-of-00037.safetensors
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/model-00003-of-00037.safetensors
#     out=model-00003-of-00037.safetensors
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/model-00004-of-00037.safetensors
#     out=model-00004-of-00037.safetensors
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/model-00005-of-00037.safetensors
#     out=model-00005-of-00037.safetensors

wc -l files.txt  # 74 files.txt
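
The xargs -d option and echo -e are GNU-specific and may not be available with the BSD tools that ship with macOS. As an alternative, here is a minimal sketch that builds the same list with a plain read loop and printf. The function name make_aria2_input is made up for illustration.

```shell
#!/usr/bin/env bash
# Read one filename per line on stdin and print aria2 input-file entries:
# a URL line followed by an indented "out=" option line.
# "make_aria2_input" is a hypothetical helper name used only in this sketch.
make_aria2_input() {
  local base_url="$1" name
  while IFS= read -r name; do
    printf '%s/%s\n    out=%s\n' "$base_url" "$name" "$name"
  done
}

# Demo with two filenames from the listing above.
printf '%s\n' model-00001-of-00037.safetensors model-00002-of-00037.safetensors |
  make_aria2_input "https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main"
```

In the repository you would pipe git lfs ls-files -n into the function and redirect the output to files.txt. Note that this sketch does not URL-encode filenames, so names containing spaces or other special characters would need extra handling.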

Before downloading, we need to remove the small Git LFS pointer files that git clone left in place of the real files. Otherwise, aria2 finds a file with the same name already present and saves the download under a suffixed name instead.

git lfs ls-files -n | xargs -d '\n' rm
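
The rm above deletes every LFS-tracked file, which is fine on a fresh clone but would also delete finished downloads if the step is re-run later. A more cautious variant, sketched below, removes a file only if it still looks like a pointer. It assumes the pointer begins with the version line defined by the Git LFS pointer format; the helper names is_lfs_pointer and remove_if_pointer are made up for illustration.

```shell
#!/usr/bin/env bash
# Succeed only if the first bytes of the file contain the Git LFS pointer header.
# Both function names are hypothetical helpers used only in this sketch.
is_lfs_pointer() {
  [ -f "$1" ] && head -c 100 "$1" | grep -qxF 'version https://git-lfs.github.com/spec/v1'
}

# Remove a file only when it is still a pointer; leave real downloads alone.
remove_if_pointer() {
  if is_lfs_pointer "$1"; then rm -- "$1"; fi
}

# Demo in a throwaway directory: one pointer file, one "downloaded" file.
# The size value in the pointer is invented for the demo.
demo_dir="$(mktemp -d)"
printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize 12345\n' \
  18d5d2b73010054d1c9fc4a1ba777d575e871b10f1155f3ae22481b7752bc425 \
  > "$demo_dir/pointer.safetensors"
printf 'real model bytes\n' > "$demo_dir/downloaded.safetensors"

for f in "$demo_dir"/*.safetensors; do remove_if_pointer "$f"; done
ls "$demo_dir"   # only downloaded.safetensors remains
rm -rf "$demo_dir"
```

In the repository, the loop would read names from git lfs ls-files -n instead of a glob.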

If the model or dataset requires authentication, you will need to log in to Hugging Face using the huggingface-cli login command. This command will store the authentication token in the file ~/.cache/huggingface/token. We can use this token to download the files using aria2.

huggingface-cli login

Finally, we download the files using aria2. The -j option specifies the number of simultaneous downloads. The right value depends on your network speed and the server's capacity; I recommend starting somewhere between 4 and 12. Be careful not to hit the server's rate limit. The --header option passes the token saved by huggingface-cli login; for public repositories that need no authentication, you can omit it.

aria2c -j 8 -i files.txt --header="Authorization: Bearer $(cat ~/.cache/huggingface/token)"
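
If the download is interrupted, re-running aria2c from the same directory should let it pick up the partial files. The sketch below wraps the command with aria2's --continue, --max-tries and --retry-wait options and adds the Authorization header only when a token file exists. The function name hf_download and the variables DRY_RUN and HF_DL_TOKEN_FILE are made up for illustration, and the retry values are arbitrary starting points rather than recommendations from aria2 or Hugging Face.

```shell
#!/usr/bin/env bash
# Run aria2c on an input file with resume and retry options.
# With DRY_RUN=1 the command is printed (one argument per line) instead of run.
# hf_download, DRY_RUN and HF_DL_TOKEN_FILE are hypothetical names for this sketch.
hf_download() {
  local input_file="$1" jobs="${2:-8}"
  local token_file="${HF_DL_TOKEN_FILE:-$HOME/.cache/huggingface/token}"

  # Replace the function's positional parameters with the aria2c arguments.
  set -- --continue=true --max-tries=10 --retry-wait=5 -j "$jobs" -i "$input_file"
  if [ -s "$token_file" ]; then
    # Only send the header when a non-empty token file is present.
    set -- "$@" --header="Authorization: Bearer $(cat "$token_file")"
  fi

  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' aria2c "$@"
  else
    aria2c "$@"
  fi
}

# Dry run without a token, so nothing is downloaded and no secret is printed.
DRY_RUN=1 HF_DL_TOKEN_FILE=/dev/null hf_download files.txt 8
```

Calling hf_download files.txt 8 without DRY_RUN runs the actual download; running it again after a failure should continue rather than start over.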

Verifying the SHA256 Hashes

After downloading the files, we need to verify the SHA256 hashes to ensure the integrity of the files. We use the sha256sum command to calculate the SHA256 hash of each file and compare it with the expected hash.

Unfortunately, sha256sum takes a long time to compute the hash of a large file. We can speed up the process by using GNU Parallel (the parallel command) to compute the hashes in parallel.

First, we create files to store the expected SHA256 hashes for each file.

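# Build the output filename in a variable so the redirection target is unambiguous in every awk
git lfs ls-files -l | awk '{ out = $3 ".sha256"; print $1 "  " $3 > out; close(out) }'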

find . -name "*.sha256" -print | wc -l  # 37

cat model-00001-of-00037.safetensors.sha256
# 18d5d2b73010054d1c9fc4a1ba777d575e871b10f1155f3ae22481b7752bc425  model-00001-of-00037.safetensors

Let's compute the SHA256 hash of the first file using the sha256sum command.

sha256sum model-00001-of-00037.safetensors
# 18d5d2b73010054d1c9fc4a1ba777d575e871b10f1155f3ae22481b7752bc425  model-00001-of-00037.safetensors
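
The .sha256 files use the same "hash, two spaces, filename" layout that sha256sum prints, which is what lets sha256sum -c read them back. The following self-contained sketch shows both outcomes on a throwaway file instead of a model shard. The helper name write_checksum_file is made up for illustration.

```shell
#!/usr/bin/env bash
# Write an expected hash in the "<hash>  <filename>" format that sha256sum -c reads.
# "write_checksum_file" is a hypothetical helper name used only in this sketch.
write_checksum_file() {
  printf '%s  %s\n' "$1" "$2" > "$2.sha256"
}

demo_dir="$(mktemp -d)"
(
  cd "$demo_dir" || exit 1
  printf 'hello\n' > sample.bin
  expected="$(sha256sum sample.bin | awk '{print $1}')"
  write_checksum_file "$expected" sample.bin

  sha256sum -c sample.bin.sha256        # reports "sample.bin: OK"

  printf 'corrupted\n' > sample.bin     # simulate a damaged download
  # A mismatch is reported as FAILED and sha256sum exits with a non-zero status.
  sha256sum -c sample.bin.sha256 2>/dev/null || echo "verification failed"
)
rm -rf "$demo_dir"
```

The non-zero exit status on a mismatch is what makes the check usable in scripts, not only by reading the output.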

Now we run the same check for every file at once with GNU Parallel (the parallel command), spreading the work across multiple CPU cores. The -j option sets the number of parallel jobs; the number of CPU cores on your system is a reasonable value. On Linux, the nproc command prints that number.

time find . -name "*.sha256" -print | sort | parallel -j 8 -u "sha256sum -c {} 2>&1" | tee sha256sum.log
# model-00001-of-00037.safetensors: OK
# model-00007-of-00037.safetensors: OK
# model-00003-of-00037.safetensors: OK
# ...
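
Instead of hard-coding 8, the job count can be derived from the machine. Here is a minimal sketch, assuming nproc is present on Linux and falling back to getconf elsewhere; the function name cpu_count is made up for illustration.

```shell
#!/usr/bin/env bash
# Print the number of online CPU cores, falling back to 1 if it cannot be found.
# "cpu_count" is a hypothetical helper name used only in this sketch.
cpu_count() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  elif getconf _NPROCESSORS_ONLN >/dev/null 2>&1; then
    getconf _NPROCESSORS_ONLN
  else
    echo 1
  fi
}

n_jobs="$(cpu_count)"
echo "parallel would run with -j $n_jobs"
```

You could then pass -j "$n_jobs" to parallel. On slow disks, hashing may be limited by read speed rather than CPU, so a smaller value can work just as well.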

Let's check the contents of sha256sum.log. It should contain one verification result per file. OK means the file's hash matched the expected value.

wc -l sha256sum.log  # 37 sha256sum.log

sort sha256sum.log | nl
#      1  model-00001-of-00037.safetensors: OK
#      2  model-00002-of-00037.safetensors: OK
#      3  model-00003-of-00037.safetensors: OK
# ...
#     35  model-00035-of-00037.safetensors: OK
#     36  model-00036-of-00037.safetensors: OK
#     37  model-00037-of-00037.safetensors: OK

OK! All the files have been successfully downloaded and verified!
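
For repositories with many files, scanning the log by eye gets tedious. The sketch below pulls out the names of files that did not verify, assuming sha256sum's usual "name: FAILED" and "name: FAILED open or read" lines. The function name failed_files and the sample log contents are made up for illustration.

```shell
#!/usr/bin/env bash
# Print the name of every file that sha256sum -c reported as FAILED.
# "failed_files" is a hypothetical helper name used only in this sketch.
failed_files() {
  sed -n 's/: FAILED.*$//p' "$1"
}

# Invented sample log for the demo (not real results from the model above).
log="$(mktemp)"
cat > "$log" <<'EOF'
model-00001-of-00037.safetensors: OK
model-00002-of-00037.safetensors: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
model-00003-of-00037.safetensors: OK
EOF

failed_files "$log"
# prints: model-00002-of-00037.safetensors
rm -f "$log"
```

Running failed_files sha256sum.log after the verification step prints nothing when every file is OK. Any file it lists can be deleted and downloaded again with an aria2 input list that contains only those names.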

Conclusion

By using aria2 to download files in parallel and GNU Parallel to compute the SHA256 hashes in parallel, you can speed up and improve the reliability of your Hugging Face downloads.
These tools are particularly useful when dealing with large files and/or unstable internet connections.
Remember to adjust the number of parallel downloads or jobs based on your network speed and the server capabilities.

Citation

@software{tange_2024_14550073,
      author       = {Tange, Ole},
      title        = {GNU Parallel 20241222 ('Bashar')},
      month        = Dec,
      year         = 2024,
      note         = {{GNU Parallel is a general parallelizer to run
                       multiple serial command line programs in parallel
                       without changing them.}},
      publisher    = {Zenodo},
      doi          = {10.5281/zenodo.14550073},
      url          = {https://doi.org/10.5281/zenodo.14550073}
}
