
Zac J.Q. Yap

Originally published at pragmaticcs.substack.com

Reuse Data for Training During Upstream CPU Bottlenecks


Upstream operations in the neural network training pipeline (e.g. disk I/O and data preprocessing) do not run on hardware accelerators, so when they are slow the accelerator sits idle waiting for data.

“Data echoing” reuses intermediate outputs from earlier pipeline stages when the training pipeline has an upstream bottleneck, keeping the accelerator utilised instead of idle. The number of times each piece of data is reused is called the echoing factor. The effectiveness of this approach challenges the idea that repeating data for SGD updates is useless or even harmful.
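In a tf.data-style input pipeline (the kind used in the paper's experiments), echoing amounts to an extra stage that re-emits each upstream element before the next fresh one is fetched. A minimal sketch — the stand-in dataset, echoing factor of 2, and buffer size are illustrative, not taken from the paper:

```python
import tensorflow as tf

def echo(dataset, echoing_factor):
    # Emit each upstream element `echoing_factor` times before pulling
    # the next fresh element from the (slow) upstream stages.
    return dataset.flat_map(
        lambda example: tf.data.Dataset.from_tensors(example).repeat(echoing_factor)
    )

# Stand-in for an expensive upstream pipeline (I/O + preprocessing).
upstream = tf.data.Dataset.range(10).map(lambda x: x * 2)

# Echo each preprocessed example twice, then shuffle so the repeats
# are spread across nearby batches.
echoed = echo(upstream, echoing_factor=2).shuffle(buffer_size=100).batch(4)

for batch in echoed:
    print(batch.numpy())
```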

Echoing can be inserted at several points in the pipeline (see the sketch after this list):

  1. Before batching – data is repeated and shuffled at the training example level, increasing the likelihood that nearby batches will be different. This has the risk of duplicating examples within a batch.

  2. After batching – entire batches are repeated (“batch echoing”), so the duplicates appear as identical batches rather than as repeated examples spread across different batches

  3. Before augmentation – allows repeated data to be transformed differently, potentially making repeated data more akin to fresh data

  4. After augmentation – only mechanisms that add noise during the SGD update (e.g. dropout) can make the repeated data appear different
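The first two placements correspond to where the echoing stage sits relative to `batch` (and, here, a random augmentation). A rough sketch in the same tf.data style — the stand-in dataset, augmentation function, and echoing factor are illustrative:

```python
import tensorflow as tf

e = 2  # echoing factor (illustrative)

def augment(x):
    # Stand-in for a random augmentation; copies echoed *before* this map
    # each receive an independent random transform.
    return x + tf.random.uniform([], maxval=10, dtype=tf.int64)

upstream = tf.data.Dataset.range(8)  # stand-in for parsed/decoded examples

# Echoing before augmentation and before batching (example echoing):
# repeats are re-augmented and shuffled, so they look more like fresh data
# and usually land in different nearby batches.
example_echoed = (
    upstream
    .flat_map(lambda x: tf.data.Dataset.from_tensors(x).repeat(e))
    .map(augment)
    .shuffle(buffer_size=16)
    .batch(4)
)

# Echoing after augmentation and after batching (batch echoing):
# whole augmented batches are repeated, so consecutive SGD steps
# can see identical batches.
batch_echoed = (
    upstream
    .map(augment)
    .batch(4)
    .flat_map(lambda b: tf.data.Dataset.from_tensors(b).repeat(e))
)
```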

Data echoing reduces both the number of fresh examples required for training and the wall-clock training time, without harming predictive performance (up to an upper bound on the echoing factor). There is also empirical evidence that data echoing performs better with larger batch sizes and with more shuffling of the repeated data.

Source paper: https://arxiv.org/pdf/1907.05550.pdf

Liked this post?
This summary first appeared in the Pragmatic CS newsletter. Subscribers got it first!
