Susumu OTA

Faster and More Reliable Hugging Face Downloads Using aria2 and GNU Parallel

Summary

  • Make Hugging Face downloads faster and more reliable with aria2 and GNU Parallel.
  • Use aria2 to download Hugging Face models and datasets in parallel. If errors occur during the download, you can resume the download from where it left off.
  • Use GNU Parallel to quickly verify the hashes of the downloaded files in parallel using multiple CPU cores.

Introduction

Downloading machine learning models and datasets from Hugging Face can be slow and unreliable, especially with large files or an unstable internet connection. This guide shows how to make those downloads faster and more reliable with two command-line tools: aria2 and GNU Parallel.

Prerequisites

Before we get started, make sure you have the following tools installed on your system:

  • Git Large File Storage (git-lfs): An open source Git extension for versioning large files.
  • aria2: A lightweight multi-protocol & multi-source command-line download utility.
  • GNU Parallel: A shell tool for executing jobs in parallel using one or more computers.
  • sha256sum: A command that computes SHA-256 checksums of files. It ships with typical Linux distributions; macOS users can install it with Homebrew as part of coreutils.

Ubuntu, macOS or Conda users can install these tools using the following commands:

Ubuntu

sudo apt install git-lfs aria2 parallel -y

macOS

brew install git-lfs aria2 parallel
brew install coreutils  # for sha256sum command

Conda Environment

source ~/miniconda3/bin/activate
conda create -n hf_dl -y
conda activate hf_dl

conda install conda-forge::git-lfs conda-forge::aria2 conda-forge::parallel -y

Downloading Hugging Face Models

In this section, we will see how to download Hugging Face models (e.g. Qwen/Qwen2.5-72B-Instruct) using aria2.

First, let's clone a Hugging Face repository using git. Setting the GIT_LFS_SKIP_SMUDGE environment variable to 1 skips the large files, so the clone contains only small pointer files in their place.

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen2.5-72B-Instruct
cd Qwen2.5-72B-Instruct
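
Each pointer file left by the clone is a short text file that records the LFS spec version, the SHA256 OID, and the size of the real file. As a rough sketch, the expected hash can be read straight from a pointer. The function name lfs_pointer_oid is made up for illustration and is not part of git-lfs.

```shell
# Print the SHA256 OID stored in a Git LFS pointer file.
# Returns non-zero if the file does not start like a pointer.
# lfs_pointer_oid is a hypothetical helper name, not a git-lfs command.
lfs_pointer_oid() {
  local file="$1"
  # A pointer's first line names the LFS spec version (42 characters).
  head -c 42 "$file" |
    grep -qxF 'version https://git-lfs.github.com/spec/v1' || return 1
  # The hash is on the line "oid sha256:<64 hex digits>".
  sed -n 's/^oid sha256:\([0-9a-f]\{64\}\)$/\1/p' "$file"
}

# Example: lfs_pointer_oid model-00001-of-00037.safetensors
```

Once the real file has been downloaded over the pointer, the function returns non-zero, which is a quick way to tell the two states apart.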

The git lfs ls-files command lists the files tracked by git-lfs. With the -l option it will show the OID (SHA256 hash) and the filename. We will use this information to download the files using aria2 and to verify the SHA256 hashes.

git lfs ls-files -l
# 18d5d2b73010054d1c9fc4a1ba777d575e871b10f1155f3ae22481b7752bc425 - model-00001-of-00037.safetensors
# 802a3abf41ccdeb01931c5e40eb177ea114a1c47f68cb251d75c2de0fe196677 - model-00002-of-00037.safetensors
# c3a2ab093723d4981dcc6b20c7f48c444ccd9d8572b59f0bf7caa632715b7d36 - model-00003-of-00037.safetensors
# 5f35d5475cc4730ca9a38f958f74b5322d28acbd4aec30560987ed12e2748d8f - model-00004-of-00037.safetensors
# b7f066aef57e0fe29b516ef743fec7a90518151bd5a9df19263dfdee214dfe4d - model-00005-of-00037.safetensors
# ...

With the -n option, git lfs ls-files will only show the filenames.

git lfs ls-files -n
# model-00001-of-00037.safetensors
# model-00002-of-00037.safetensors
# model-00003-of-00037.safetensors
# model-00004-of-00037.safetensors
# model-00005-of-00037.safetensors
# ...

git lfs ls-files -n | wc -l  # 37

Next, we create the list of files (files.txt) for aria2 to download. xargs turns each filename into a download URL followed by an indented out= line that names the output file. The redirection overwrites files.txt, so rerunning the command does not duplicate entries.

git lfs ls-files -n | xargs -d "\n" -I {} echo -e "https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/{}\n    out={}" > files.txt

head files.txt
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/model-00001-of-00037.safetensors
#     out=model-00001-of-00037.safetensors
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/model-00002-of-00037.safetensors
#     out=model-00002-of-00037.safetensors
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/model-00003-of-00037.safetensors
#     out=model-00003-of-00037.safetensors
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/model-00004-of-00037.safetensors
#     out=model-00004-of-00037.safetensors
# https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/resolve/main/model-00005-of-00037.safetensors
#     out=model-00005-of-00037.safetensors

wc -l files.txt  # 74 files.txt
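
The same list can be built with a small shell function, which makes the repository and revision easy to change. This is only a sketch: make_aria2_input is a hypothetical name, the revision is assumed to default to main, and filenames are inserted into the URL without any escaping.

```shell
# Turn a list of filenames on stdin into an aria2 input file on stdout.
# make_aria2_input is a hypothetical helper; the revision defaults to "main".
make_aria2_input() {
  local repo="$1" revision="${2:-main}" name
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    # One URI line, then an indented option line naming the output file.
    printf 'https://huggingface.co/%s/resolve/%s/%s\n    out=%s\n' \
      "$repo" "$revision" "$name" "$name"
  done
}

# Example (inside the cloned repository):
#   git lfs ls-files -n | make_aria2_input Qwen/Qwen2.5-72B-Instruct > files.txt
```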

Before downloading the files, we need to remove the placeholder pointer files that the clone left in the directory. Otherwise, aria2 will not overwrite them and will save the downloads under renamed files instead.

git lfs ls-files -n | xargs -d '\n' rm
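
If you rerun these steps after some files have already been downloaded, a plain rm would delete the finished downloads too. A more cautious variant, sketched here under the assumption that every pointer starts with the Git LFS spec line, removes only the files that are still pointers. The name remove_lfs_pointers is hypothetical.

```shell
# Delete only the files that are still Git LFS pointers; names come from stdin.
# remove_lfs_pointers is a hypothetical helper name.
remove_lfs_pointers() {
  local name
  while IFS= read -r name; do
    [ -f "$name" ] || continue
    # Pointers start with the LFS spec line; real downloads do not.
    if head -c 42 "$name" |
        grep -qxF 'version https://git-lfs.github.com/spec/v1'; then
      rm -- "$name"
    fi
  done
}

# Example: git lfs ls-files -n | remove_lfs_pointers
```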

If the model or dataset requires authentication, log in to Hugging Face with the huggingface-cli login command, which is part of the huggingface_hub Python package. The command stores the access token in the file ~/.cache/huggingface/token, and aria2 can send that token with each request.

huggingface-cli login

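Finally, we download the files using aria2. The -j option sets the number of simultaneous downloads. The right value depends on your network speed and the server's capacity; I recommend starting somewhere between 4 and 12. Be careful not to hit the server's rate limit. If the download is interrupted, run the same command again: aria2 keeps a .aria2 control file next to each unfinished download and resumes from it. For public repositories that need no login, omit the --header option.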

aria2c -j 8 -i files.txt --header="Authorization: Bearer $(cat ~/.cache/huggingface/token)"
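
The command above can be wrapped so that the Authorization header is sent only when a token is actually available. This is a minimal sketch, not a definitive script: download_with_aria2, HF_TOKEN_FILE, and ARIA2C are names invented for this example, while HF_TOKEN is the environment variable that huggingface_hub itself reads.

```shell
# Run aria2c, adding the Authorization header only when a token is available.
# download_with_aria2, HF_TOKEN_FILE and ARIA2C are hypothetical names.
download_with_aria2() {
  local jobs="${1:-8}" input="${2:-files.txt}" token="${HF_TOKEN:-}"
  local token_file="${HF_TOKEN_FILE:-$HOME/.cache/huggingface/token}"
  # Fall back to the token saved by `huggingface-cli login`, if there is one.
  if [ -z "$token" ] && [ -s "$token_file" ]; then
    token="$(cat "$token_file")"
  fi
  # Collect the arguments in the positional parameters.
  set -- -j "$jobs" -i "$input"
  if [ -n "$token" ]; then
    set -- "$@" --header="Authorization: Bearer $token"
  fi
  "${ARIA2C:-aria2c}" "$@"
}

# Example: download_with_aria2 8 files.txt
```

Running the same call again after an interruption resumes the unfinished files, as with the plain aria2c command.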

Verifying the SHA256 Hashes

After downloading the files, we need to verify the SHA256 hashes to ensure the integrity of the files. We use the sha256sum command to calculate the SHA256 hash of each file and compare it with the expected hash.

Unfortunately, sha256sum takes a long time on large files because it hashes one file at a time on a single core. We can speed this up by using GNU Parallel (the parallel command) to hash several files at once.

First, we create files to store the expected SHA256 hashes for each file.

git lfs ls-files -l | awk '{print $1 "  " $3 > ($3 ".sha256")}'

find . -name "*.sha256" -print | wc -l  # 37

cat model-00001-of-00037.safetensors.sha256
# 18d5d2b73010054d1c9fc4a1ba777d575e871b10f1155f3ae22481b7752bc425  model-00001-of-00037.safetensors
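
The awk one-liner splits on whitespace, so it assumes filenames without spaces. A sketch of an alternative that keeps spaces intact reads each line with the shell instead. The name write_sha256_files is hypothetical, and the input is assumed to have the "OID marker path" layout shown earlier for git lfs ls-files -l.

```shell
# Read "<oid> <marker> <path>" lines (the layout of `git lfs ls-files -l`)
# from stdin and write one "<path>.sha256" file per entry.
# write_sha256_files is a hypothetical helper name.
write_sha256_files() {
  local oid marker path
  # The last variable receives the rest of the line, so spaces in paths survive.
  while read -r oid marker path; do
    [ -n "$path" ] || continue
    # sha256sum -c expects "<hash>", two spaces, then the filename.
    printf '%s  %s\n' "$oid" "$path" > "$path.sha256"
  done
}

# Example: git lfs ls-files -l | write_sha256_files
```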

Let's compute the SHA256 hash of the first file using the sha256sum command. The result should match the expected hash stored in its .sha256 file.

sha256sum model-00001-of-00037.safetensors
# 18d5d2b73010054d1c9fc4a1ba777d575e871b10f1155f3ae22481b7752bc425  model-00001-of-00037.safetensors

GNU Parallel runs these checks concurrently on multiple CPU cores. The -j option sets the number of parallel jobs; a reasonable starting point is the number of CPU cores on your system, which the nproc command reports on Linux. The -u option prints each result as soon as it is ready, so the output order can differ from the input order.

time find . -name "*.sha256" -print | sort | parallel -j 8 -u "sha256sum -c {} 2>&1" | tee sha256sum.log
# model-00001-of-00037.safetensors: OK
# model-00007-of-00037.safetensors: OK
# model-00003-of-00037.safetensors: OK
# ...
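
The verification step can also be wrapped in a function that picks the job count from nproc. The sketch below is an illustration under a few assumptions: verify_sha256_files is a hypothetical name, the job count falls back to 4 when nproc is missing, and the xargs -P branch is a stand-in for machines without GNU Parallel rather than the method used in this article.

```shell
# Check every *.sha256 file under a directory, several at a time.
# verify_sha256_files is a hypothetical helper name; the xargs branch is a
# substitute for GNU Parallel, not the method used in this article.
verify_sha256_files() {
  local dir="${1:-.}" njobs
  njobs="$(nproc 2>/dev/null || echo 4)"   # fall back to 4 jobs without nproc
  (
    cd "$dir" || exit 1
    if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
      find . -name '*.sha256' -print0 |
        parallel -0 -j "$njobs" -u 'sha256sum -c {} 2>&1'
    else
      find . -name '*.sha256' -print0 |
        xargs -0 -n 1 -P "$njobs" sha256sum -c 2>&1
    fi
  )
}

# Example: verify_sha256_files . | tee sha256sum.log
```

Hashing large files is often limited by disk speed as much as by CPU, so more jobs than cores rarely helps.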

Let's check the contents of the file sha256sum.log. It should contain one verification result per file. The OK message indicates that the hash verification was successful; a FAILED message means the file is corrupted or incomplete and should be downloaded again.

wc -l sha256sum.log  # 37 sha256sum.log

sort sha256sum.log | nl
#      1  model-00001-of-00037.safetensors: OK
#      2  model-00002-of-00037.safetensors: OK
#      3  model-00003-of-00037.safetensors: OK
# ...
#     35  model-00035-of-00037.safetensors: OK
#     36  model-00036-of-00037.safetensors: OK
#     37  model-00037-of-00037.safetensors: OK
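
With dozens of files it is easy to miss a single failure when reading the log by eye. A small check like the following sketch counts the OK lines and prints everything else. The name check_sha256_log is hypothetical, and the expected file count is passed in by the caller.

```shell
# Summarize a sha256sum log: succeed only if every expected file reported OK.
# check_sha256_log is a hypothetical helper name.
check_sha256_log() {
  local log="$1" expected="$2" ok
  ok="$(grep -c ': OK$' "$log" || true)"
  echo "verified: $ok / $expected"
  # Show anything that is not an OK line (FAILED entries, warnings, read errors).
  grep -v ': OK$' "$log" || true
  [ "$ok" -eq "$expected" ]
}

# Example: check_sha256_log sha256sum.log $(git lfs ls-files -n | wc -l)
```

The function returns non-zero when the count of OK lines differs from the expected number, so it can gate a later step in a script.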

OK! All the files have been successfully downloaded and verified!

Conclusion

  • By using aria2 to download files in parallel and GNU Parallel to compute the SHA256 hashes in parallel, you can speed up and improve the reliability of your Hugging Face downloads.
  • These tools are particularly useful when dealing with large files and/or unstable internet connections.
  • Remember to adjust the number of parallel downloads or jobs based on your network speed and the server capabilities.

Citation

@software{tange_2024_14550073,
      author       = {Tange, Ole},
      title        = {GNU Parallel 20241222 ('Bashar')},
      month        = Dec,
      year         = 2024,
      note         = {{GNU Parallel is a general parallelizer to run
                       multiple serial command line programs in parallel
                       without changing them.}},
      publisher    = {Zenodo},
      doi          = {10.5281/zenodo.14550073},
      url          = {https://doi.org/10.5281/zenodo.14550073}
}
