In some machine learning tasks (e.g. stochastic gradient descent), data is often shuffled before training. That data can be huge, anywhere from 10 GB to 100 GB. Shuffling it on a laptop may require a lot of RAM and can slow down your overall workflow.
Here I introduce a CLI tool that resolves this issue:
rhuffle
rhuffle is a line shuffler for large text files that works with limited memory.
rhuffle is fast and supports skipping a header line, which is mainly useful for CSV/TSV files.
rhuffle stores temporary files in TEMPDIR to reduce RAM usage (both TEMPDIR and the amount of RAM to use are configurable).
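To give an idea of how a shuffler can handle more data than fits in RAM, here is a minimal sketch of the general two-pass approach: shuffle chunks in memory, spill them to temporary files, then merge them in random order. This is only an illustration, not rhuffle's actual code; the function names, the CHUNK_LINES constant, and the use of the rand and tempfile crates are my own assumptions.

```rust
// Illustration of a two-pass, limited-memory shuffle (not rhuffle's actual code).
// Phase 1: read the input in chunks that fit in RAM, shuffle each chunk,
//          and spill it to an anonymous temp file in the OS temp dir.
// Phase 2: draw the next output line from a chunk chosen with probability
//          proportional to its remaining line count, which yields a
//          uniform permutation of the whole input.
use rand::seq::SliceRandom;
use rand::Rng;
use std::fs::File;
use std::io::{self, BufRead, BufReader, BufWriter, Seek, SeekFrom, Write};

const CHUNK_LINES: usize = 1_000_000; // tune to the RAM you can spare

fn spill_chunk(buf: &mut Vec<String>, rng: &mut impl Rng) -> io::Result<(BufReader<File>, usize)> {
    buf.shuffle(rng);
    let mut file = tempfile::tempfile()?; // created in the OS temp dir, removed on drop
    {
        let mut w = BufWriter::new(&mut file);
        for line in buf.iter() {
            writeln!(w, "{}", line)?;
        }
        w.flush()?;
    }
    file.seek(SeekFrom::Start(0))?;
    let n = buf.len();
    buf.clear();
    Ok((BufReader::new(file), n))
}

fn shuffle_stream(input: impl BufRead, mut output: impl Write) -> io::Result<()> {
    let mut rng = rand::thread_rng();
    let mut chunks: Vec<(BufReader<File>, usize)> = Vec::new();

    // Phase 1: shuffled chunks spilled to temporary files.
    let mut buf = Vec::with_capacity(CHUNK_LINES);
    for line in input.lines() {
        buf.push(line?);
        if buf.len() == CHUNK_LINES {
            chunks.push(spill_chunk(&mut buf, &mut rng)?);
        }
    }
    if !buf.is_empty() {
        chunks.push(spill_chunk(&mut buf, &mut rng)?);
    }

    // Phase 2: weighted random merge of the chunks.
    let mut remaining: usize = chunks.iter().map(|(_, n)| *n).sum();
    while remaining > 0 {
        let mut pick = rng.gen_range(0..remaining);
        for (reader, left) in chunks.iter_mut() {
            if pick < *left {
                let mut line = String::new();
                reader.read_line(&mut line)?;
                output.write_all(line.as_bytes())?;
                *left -= 1;
                remaining -= 1;
                break;
            }
            pick -= *left;
        }
    }
    output.flush()
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    shuffle_stream(stdin.lock(), BufWriter::new(stdout.lock()))
}
```

Only one chunk's worth of lines is ever held in memory at a time, so disk space in the temp dir substitutes for RAM, which is the same trade-off rhuffle makes with TEMPDIR.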
Installation
This CLI tool is written in Rust, so you need to install Rust first.
After running
$ cargo install rhuffle
you can use rhuffle.
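Assuming rhuffle follows the usual filter convention of reading standard input and writing standard output (this is an assumption; check rhuffle --help or the README for the actual interface), shuffling a large file could look like:
$ rhuffle < train.csv > train_shuffled.csv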
Feedback and proposals are welcome at https://github.com/ctylim/rhuffle/issues.