This week Hugging Face has released what seems to be the largest (15 trillion tokens) open dataset specifically created for LLM training: FineWeb.
It is based on internet crawls from the Summer of 2013 to the Winter of 2024. The 15T size of the dataset resonates with the Llama 3 release just a week earlier - Llama 3 was also trained on a 15T-token dataset (versus the 2T used for the Llama 2 series). The leap in the amount of training data seems to be very effective: the new Llama models beat or match much larger SOTA models across a wide range of evals. And today a dataset of the same caliber is available to everyone.
There are a few facts and quirks I find particularly interesting about the FineWeb release. In this LinkedIn post, HF's co-founder shared several of them:
1) HF spent the equivalent of half a million USD on GPU compute to process and distill 38,000TB of Common Crawl dumps into a 45TB dataset ready for LLM base model training.
- $0.5M in GPU compute: the post mentions 120,000 hours of H100 compute time; at roughly $4/h per H100, that gives a ballpark of $480k.
- The 38,000TB of Common Crawl is also a ballpark, assuming one dump is about 400TB (the Feb/March one is 424.7TB) and there are 95 dumps in total. Both figures are checked in the quick sketch below.
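Both ballparks reduce to plain arithmetic; here is a quick check in Python (the $4/h price and the 400TB-per-dump size are the assumptions stated above, not official figures):

```python
# Back-of-the-envelope checks for the figures above.
# Assumptions (from the post, not official numbers): $4/hour per H100,
# ~400 TB per Common Crawl dump, 95 dumps in total.

h100_hours = 120_000
price_per_hour_usd = 4
compute_cost_usd = h100_hours * price_per_hour_usd
print(f"GPU compute ballpark: ${compute_cost_usd:,}")   # $480,000

dumps = 95
avg_dump_size_tb = 400
total_crawl_tb = dumps * avg_dump_size_tb
print(f"Common Crawl size ballpark: {total_crawl_tb:,} TB")  # 38,000 TB
```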
2) Most of this compute went into evaluating various filtering options. To do this, small models (1.8B parameters, trained on 28B or 350B tokens) were trained on different FineWeb variants and evaluated; if a trained model came out better, the filtering technique was deemed a success.
- A total of 100 smaller and 15 larger ablation models were trained while trialing filtering approaches (a rough sketch of the loop follows below).
As the post describes it, they settled on two ways of training ablation models: in the first, a 1.8B-parameter model was trained on 28B tokens (about 5 hours on 64 H100s); in the second, the same model was trained on 350B tokens (about 2.5 days). Notably, these larger ablations were trained on more tokens than GPT-3, for instance.
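Conceptually, the ablation loop is simple: train the same small model on each candidate filtering of the data, run a fixed benchmark suite, and keep whichever filter scores best. Here is a minimal sketch of that loop; the function names, filter names, and benchmark list are placeholders for illustration, not the actual FineWeb tooling:

```python
# Hypothetical sketch of the filtering-ablation loop described above:
# train the same small model on each candidate dataset variant, evaluate
# it on a fixed benchmark suite, and keep the variant with the best scores.
# build_dataset(), train_small_model() and evaluate() stand in for the
# real training/eval stack.

CANDIDATE_FILTERS = ["baseline", "url_filter", "quality_classifier", "minhash_dedup"]
BENCHMARKS = ["hellaswag", "arc", "mmlu"]

def run_ablation(build_dataset, train_small_model, evaluate):
    scores = {}
    for name in CANDIDATE_FILTERS:
        dataset = build_dataset(name)               # apply one filtering recipe
        model = train_small_model(dataset,          # ~1.8B params, ~28B tokens
                                  n_params=1.8e9,
                                  n_tokens=28e9)
        scores[name] = {b: evaluate(model, b) for b in BENCHMARKS}
    # a filter "wins" if it improves the average benchmark score
    best = max(scores, key=lambda n: sum(scores[n].values()) / len(BENCHMARKS))
    return best, scores
```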
3) The Common Crawl team started filtering out adult content between 2022 and 2023, and this hurt the quality of LLMs trained on those crawls.
Between 2022 and 2023 the "LLM quality" of Common Crawl dropped significantly, in the sense that training an LLM on the crawls from that period gives lower scores on a set of evals. What happened? It turns out the Common Crawl team had been filtering domains with adult content more aggressively.
FineWeb is indeed a great contribution to the open-source community and a curious glimpse into what preparing a base-model training dataset looks like!
It's a pity that the largest consumer hard drive (the WD Red Pro NAS HDD) holds just 24TB and costs $600 :)
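On the bright side, nobody has to download all 45TB to experiment with FineWeb: it is published on the Hugging Face Hub with per-dump configs and small sampled subsets, and it can be streamed with the datasets library. A minimal sketch, assuming the sample-10BT config name from the dataset card is still current:

```python
# Stream a small slice of FineWeb instead of downloading the full 45TB.
# Requires: pip install datasets
# "sample-10BT" is one of the sampled subsets listed on the dataset card;
# per-dump configs such as "CC-MAIN-2024-10" are also available.
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)

for i, doc in enumerate(fw):
    # each record carries the text plus crawl metadata
    print(doc["url"], doc["text"][:200])
    if i >= 4:
        break
```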
Top comments (1)
Yes, but we need some form of subsetting app,
so we can search the data and extract just enough records to work with for our subset! It's great, but right now it's a long download (not ready yet)... and still unmanageable for most people... it needs to be broken into realistic chunks.
Elon Musk released his Grok model, which was 300 gigs, but he released it in manageable files.
In the past, data like this was released as RDF triples and, after downloading, couldn't be opened by any software or used at all!! Sometimes open-source sharing is fake sharing... since it's unusable for most people.
Here the data is valid and we are getting quality... but the size is unmanageable!! So it needs to be subsetted! (Why release data in such a large chunk? Is it only for the rich people to use again??)