DEV Community

ctylim


rhuffle: A line shuffler for huge text files that do not fit in memory

In some machine learning tasks (e.g. stochastic gradient descent), the data is often shuffled before training. The data can be huge, anywhere from 10 GB to 100 GB. Shuffling that much data on a laptop may require a lot of RAM, which hurts the performance of everything else you are running.

Here I introduce a CLI tool that addresses this issue:

rhuffle

rhuffle is a line shuffler for large text files, which works with limited memory.


rhuffle is fast and can skip the header line, which is mainly useful for CSV/TSV files.
rhuffle stores temporary files in TEMPDIR to reduce RAM usage (both TEMPDIR and the amount of RAM to use are configurable).
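
To give a feel for how temporary files can stand in for RAM, here is a minimal Python sketch of a two-pass, disk-backed shuffle. It is not rhuffle's actual implementation (rhuffle is written in Rust, and its internals are not described here). The function name `shuffle_lines_on_disk` and its parameters (`buckets`, `header_lines`, `tmp_dir`, `seed`) are hypothetical names chosen for illustration. The sketch assumes that each bucket, roughly 1/`buckets` of the input, fits in memory.

```python
import os
import random
import tempfile


def shuffle_lines_on_disk(src, dst, buckets=16, header_lines=0,
                          tmp_dir=None, seed=None):
    """Shuffle the lines of `src` into `dst`, holding one bucket in RAM at a time.

    Illustrative sketch only; not how rhuffle itself is implemented.
    """
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory(dir=tmp_dir) as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}.txt") for i in range(buckets)]
        files = [open(p, "w", encoding="utf-8") for p in paths]
        try:
            with open(src, encoding="utf-8") as fin, \
                    open(dst, "w", encoding="utf-8") as fout:
                # Copy the header lines (e.g. CSV column names) through unchanged.
                for _ in range(header_lines):
                    fout.write(fin.readline())

                # Pass 1: scatter every remaining line into a random bucket file.
                for line in fin:
                    if not line.endswith("\n"):
                        line += "\n"  # last line may lack a trailing newline
                    files[rng.randrange(buckets)].write(line)
                for f in files:
                    f.close()

                # Pass 2: load one bucket at a time, shuffle it in memory,
                # and append it to the output.
                for p in paths:
                    with open(p, encoding="utf-8") as f:
                        chunk = f.readlines()
                    rng.shuffle(chunk)
                    fout.writelines(chunk)
        finally:
            for f in files:
                f.close()  # closing an already closed file is a no-op
```

With hypothetical file names, a call such as `shuffle_lines_on_disk("train.csv", "train_shuffled.csv", header_lines=1)` keeps the first line at the top and shuffles the rest. Increasing `buckets` lowers peak memory at the cost of more temporary files, which mirrors the RAM-versus-disk trade-off described above.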

Installation

This CLI tool is written in Rust, so you need to install Rust first.

Then install rhuffle with Cargo:

$ cargo install rhuffle

After that, the rhuffle command is ready to use.

Feedback and proposals are welcome at https://github.com/ctylim/rhuffle/issues.
