Reading the paper "Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt", accepted this week at ICML 2022: the authors propose Reducible Holdout Loss Selection (RHO-LOSS).
They assume that training time is the bottleneck while data is abundant and possibly contains outliers, a frequent scenario with web-scraped data. Because the data is so abundant, it's also common to reach SOTA in less than half an epoch. Their data selection uses the following algorithm:
Given the training set and a holdout set, two models are trained: the main one, which we optimize with gradient descent, and a second, smaller one trained only on the holdout set. Since the holdout set follows the same distribution as the training set, this smaller model's loss can be used to filter the next batch of examples for the main model. Concretely: once the small model is trained, we pre-compute the irreducible loss (IL), i.e. the small model's loss, for every point in the training set. At each step we then select a large random batch from the training set and compute its loss under the main model. The RHO-LOSS of each sample is its main-model loss minus its irreducible loss; we sort the batch by this score and forward only the top-n samples, the small batch, to the main model.
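Here's a minimal sketch of one such training step in PyTorch. It assumes a classification setup with cross-entropy loss; the function names and the overall wiring are mine for illustration, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def precompute_irreducible_loss(il_model, loader, device):
    """Score every training point once with the small, holdout-trained model."""
    il_model.eval()
    losses = []
    for x, y in loader:  # loader must iterate the training set in a fixed order
        logits = il_model(x.to(device))
        losses.append(F.cross_entropy(logits, y.to(device), reduction="none").cpu())
    return torch.cat(losses)  # irreducible loss per training index

def rho_loss_step(model, optimizer, x_big, y_big, il_big, n_select):
    """Select the top-n RHO-LOSS points from a large batch, then train on them."""
    model.eval()
    with torch.no_grad():  # scoring pass: no gradients needed
        loss_big = F.cross_entropy(model(x_big), y_big, reduction="none")
    rho = loss_big - il_big                  # reducible holdout loss per sample
    top = torch.topk(rho, n_select).indices  # keep the most useful points
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_big[top]), y_big[top])
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice x_big/y_big would be a batch of, say, 320 samples with n_select = 32, matching the 10x ratio mentioned below.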
This approach has 3 main effects on data selection (a toy scoring example follows the list):
1) Redundant points: it filters out examples that are too easy (i.e., low loss) for the main model. Since the model already performs well on those, there's no real reason to lose training time on them again. And even if the model "unlearns" them, they can be recovered in a later batch of samples.
2) Noisy points: other works focus on selecting data with high training loss, but those points might be ambiguous or incorrectly labeled, especially under the assumption that the data was web-scraped and quality is not assured. These have high IL and therefore low reducible loss, which pushes them toward the bottom of the ordered list.
3) Less relevant points: another pitfall of selecting only high-loss data is that it may contain outliers, which shouldn't be prioritized. The holdout set is drawn from the same distribution as the true data and, being smaller, is expected to contain fewer outliers. Since both models perform badly on these outlier points, their RHO-LOSS tends to be small and they are deprioritized.
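To make the ranking concrete, here are toy numbers for the three cases (the loss/IL values are made up, purely for illustration):

```python
# Purely illustrative numbers: main-model loss vs. irreducible loss (IL).
points = {
    "redundant (easy)":      {"loss": 0.1, "il": 0.1},  # already learnt
    "noisy (mislabeled)":    {"loss": 2.5, "il": 2.4},  # the IL model fails too
    "learnable, not learnt": {"loss": 2.0, "il": 0.3},  # only the main model fails
}
# Rank by RHO-LOSS = loss - IL, highest first.
for name, p in sorted(points.items(), key=lambda kv: kv[1]["il"] - kv[1]["loss"]):
    print(f"{name}: RHO-LOSS = {p['loss'] - p['il']:.1f}")
# learnable, not learnt: RHO-LOSS = 1.7  <- prioritized
# noisy (mislabeled):    RHO-LOSS = 0.1
# redundant (easy):      RHO-LOSS = 0.0
```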
The paper points out that this approach works on both large and small datasets, provided the small ones are at least doubled. In the experiments they use a large batch 10x larger than the small output batch (the samples that will be forwarded to the main model).
The same IL model can be used to optimize multiple larger models at once, and it can even be trained without a holdout set: split the training data in two, train an IL model on each half, and let each one compute the irreducible loss for the half it was not trained on. Reusing the IL model across multiple models makes it possible to speed up hyperparameter sweeps.
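A sketch of that no-holdout variant, reusing precompute_irreducible_loss from above; train_il_model, make_loader, train_dataset and device are assumed helpers/objects, not names from the paper:

```python
from torch.utils.data import Subset

n = len(train_dataset)
half_a = Subset(train_dataset, list(range(n // 2)))     # first half
half_b = Subset(train_dataset, list(range(n // 2, n)))  # second half

il_a = train_il_model(half_a)  # assumed helper: fits a small model on half_a
il_b = train_il_model(half_b)  # assumed helper: fits a small model on half_b

# Each IL model scores only the half it never saw, so every training point
# still gets a holdout-style irreducible loss.
il_scores_b = precompute_irreducible_loss(il_a, make_loader(half_b), device)
il_scores_a = precompute_irreducible_loss(il_b, make_loader(half_a), device)
```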