
Adam Mateusz Brożyński

How to copy lots of data fast?

The best way to copy a lot of data fast on Linux is to use tar. It's much faster than cp or copying in any file manager. Here's a simple command with a progress bar (pv needs to be installed); run it inside the folder you want to copy recursively:

$ tar cf - . | pv | (cd /destination/path; tar xf -)
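
By default pv shows only throughput, since it cannot know the total size of the stream. If you want a percentage and ETA, you can pass the size in yourself - a variant assuming GNU du (the byte-size flag differs on BSD/macOS):

$ tar cf - . | pv -s $(du -sb . | awk '{print $1}') | (cd /destination/path; tar xf -)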

Top comments (3)

 
Paweł bbkr Pabian • Edited

This is not universal advice.

Tar (for the younger audience - Tape ARchiving) is a way to present a set of files as a single continuous stream that can be written to or read from magnetic tape storage. So for this method to be advantageous, the additional CPU time spent on both sides must be lower than the cost of doing a full synchronous disk operation for each file in sequence. So tar | xxx | tar:

  • Will be faster for network transfers - much faster than scp, which needs to process each file sequentially and wait for a network confirmation from the other side (see the sketch after this list).
  • Will be slightly slower on very fast SSDs.
  • Will be much slower on filesystems with Copy on Write (CoW), which do not copy the actual file data until changes are made to the copy.
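
For the network case, the classic pattern is to pipe tar through ssh; user@host and /destination/path are placeholders here:

$ tar cf - . | ssh user@host 'cd /destination/path && tar xf -'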

I just did a quick benchmark on a 20 GB repository with 3000 files, on an APFS filesystem on a PCIe 3.0 NVMe disk:

  • tar | tar took 20s
  • regular cp -r managed to finish in 12s
  • cp -r -c (Copy on Write) finished in 1.3s
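
For reference, a minimal sketch of how such a benchmark can be run - the -c (clonefile) flag is specific to macOS cp on APFS, and repo plus the /tmp paths are placeholders:

$ mkdir -p /tmp/dst1 /tmp/dst2 /tmp/dst3
$ time (cd repo && tar cf - . | (cd /tmp/dst1 && tar xf -))   # tar | tar
$ time cp -R repo/ /tmp/dst2                                  # regular copy
$ time cp -c -R repo/ /tmp/dst3                               # CoW clone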
 
Adam Mateusz Brożyński • Edited

I had to copy files from HDD to SSD, SSD to SSD, SSD to NVMe 3.0, and NVMe 3.0 to NVMe 3.0, with data folders ranging from 80 GB to 2 TB. Tar was always dramatically faster: what tar did in minutes took cp hours. So from my perspective it is universal advice when someone has lots of data (which for me means more than 100 GB) made up of a lot of small files (cp fails completely in that case). I guess if someone is just copying a few big files cp could work, but that's not the case in most backup situations.

 
Paweł bbkr Pabian

Hours? I think something was really off with your system configuration (journaling issues? incomplete RAID array? not enough PCIe lanes? kernel setup?). That is extremely slow even for a SATA 3.0 SSD, which should crunch through a 2 TB folder with a moderate number of files in ~1h using plain cp.

Anyway - tar is helpful when the full, synchronous round trip of copying a single file is costly. But for those cases I prefer the find | parallel combo (see the sketch after this list), because:

  • it performs nearly identically to tar
  • it can take advantage of CoW if the copy is on the same disk
  • the actual copy method can be easily swapped - cp, scp, rsync, etc.
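
A minimal sketch of that combo, assuming GNU cp (for --parents) and GNU parallel, with /source/path and /destination/path as placeholders:

$ mkdir -p /destination/path
$ cd /source/path
$ find . -type f -print0 | parallel -0 cp --parents {} /destination/path/

Note that find -type f only picks up regular files, so empty directories and symlinks would need separate handling.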