
Zac J.Q. Yap

Originally published at pragmaticcs.substack.com

Reuse Data for Training During Upstream CPU Bottlenecks


Upstream operations in the neural network training pipeline (e.g. disk I/O and data preprocessing) do not run on hardware accelerators, so when they are slow the accelerator sits idle waiting for data.

“Data echoing” reuses intermediate outputs from earlier pipeline stages when the training pipeline has an upstream bottleneck, keeping the accelerator utilised instead of idle. The number of times each piece of data is reused is called the echoing factor. The effectiveness of this approach challenges the idea that repeating data for SGD updates is useless or even harmful.
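In a tf.data-style input pipeline (the kind used in the paper's experiments), echoing amounts to an extra stage that re-emits each upstream element before the next fresh one is fetched. A minimal sketch — the stand-in dataset, echoing factor of 2, and buffer size are illustrative, not taken from the paper:

```python
import tensorflow as tf

def echo(dataset, echoing_factor):
    # Emit each upstream element `echoing_factor` times before pulling
    # the next fresh element from the (slow) upstream stages.
    return dataset.flat_map(
        lambda example: tf.data.Dataset.from_tensors(example).repeat(echoing_factor)
    )

# Stand-in for an expensive upstream pipeline (I/O + preprocessing).
upstream = tf.data.Dataset.range(10).map(lambda x: x * 2)

# Echo each preprocessed example twice, then shuffle so the repeats
# are spread across nearby batches.
echoed = echo(upstream, echoing_factor=2).shuffle(buffer_size=100).batch(4)

for batch in echoed:
    print(batch.numpy())
```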

Echoing can be inserted at several points in the pipeline (see the sketch after this list):

  1. Before batching – data is repeated and shuffled at the training example level, increasing the likelihood that nearby batches will be different. This has the risk of duplicating examples within a batch.

  2. After batching – entire batches are repeated (“batch echoing”), so the duplicates appear as identical batches rather than as repeated examples spread across different batches

  3. Before augmentation – allows repeated data to be transformed differently, potentially making repeated data more akin to fresh data

  4. After augmentation – only mechanisms that add noise during the SGD update (e.g. dropout) can make the repeated data appear different
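The first two placements correspond to where the echoing stage sits relative to `batch` (and, here, a random augmentation). A rough sketch in the same tf.data style — the stand-in dataset, augmentation function, and echoing factor are illustrative:

```python
import tensorflow as tf

e = 2  # echoing factor (illustrative)

def augment(x):
    # Stand-in for a random augmentation; copies echoed *before* this map
    # each receive an independent random transform.
    return x + tf.random.uniform([], maxval=10, dtype=tf.int64)

upstream = tf.data.Dataset.range(8)  # stand-in for parsed/decoded examples

# Echoing before augmentation and before batching (example echoing):
# repeats are re-augmented and shuffled, so they look more like fresh data
# and usually land in different nearby batches.
example_echoed = (
    upstream
    .flat_map(lambda x: tf.data.Dataset.from_tensors(x).repeat(e))
    .map(augment)
    .shuffle(buffer_size=16)
    .batch(4)
)

# Echoing after augmentation and after batching (batch echoing):
# whole augmented batches are repeated, so consecutive SGD steps
# can see identical batches.
batch_echoed = (
    upstream
    .map(augment)
    .batch(4)
    .flat_map(lambda b: tf.data.Dataset.from_tensors(b).repeat(e))
)
```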

Data echoing reduces both the number of fresh examples required for training and the wall-clock training time, without harming predictive performance (up to an upper bound on the echoing factor). There is also empirical evidence that data echoing performs better with larger batch sizes and with more shuffling of the repeated data.

Source paper: https://arxiv.org/pdf/1907.05550.pdf

Liked this post?
This summary first appeared in the Pragmatic CS newsletter. Subscribers got it first!
