VelocityAI

Posted on Jun 30

The Data That Built the Brain: What Actually Went Into Training GPT-4 (and What Was Left Out)

#promptengineering #ai #chatgpt

You ask an AI a question. It answers. It sounds smart. It sounds like it has read everything. It has not, but it has read more than any person could in a lifetime. The model was trained on a vast, invisible library. By most estimates it contains millions of books, billions of web pages, and trillions of words. But it is not a complete library. It has gaps. It has biases. It has blind spots. The data that built the brain is not neutral. It is a curated slice of human knowledge.

We do not know exactly what went into GPT-4. OpenAI has not released the full dataset. But we can make an informed reconstruction. We can look at the likely sources, the languages, the time periods. We can estimate what was included and what was left out.

The Scale of the Corpus
The GPT-4 training corpus is massive.

The Numbers (orders of magnitude, not published figures):

Tokens: Trillions.

Books: Millions.

Web Pages: Billions.

Languages: Dozens.

The Sources:

Books: Public domain books, digitized libraries, and, reportedly, pirated book collections.

Websites: Common Crawl, Wikipedia, Reddit, GitHub, and news archives.

Other: Scientific papers, patents, and government documents.

A Contrarian Take: The Corpus Is Not a Library. It Is a Filter.

We think of the training data as a comprehensive archive. It is not. It is a filtered subset.

The data is chosen by engineers. It is cleaned by algorithms. It is prioritized by what is available and what is cheap. The corpus is not a mirror of humanity. It is a photograph taken from a specific angle.
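To make "cleaned by algorithms" concrete, here is a minimal sketch of a corpus filter. It is not OpenAI's pipeline, which has not been published. The function names, the 5-word minimum, and the 80% letter threshold are all hypothetical choices made for illustration.

```python
import hashlib


def looks_like_prose(text: str, min_words: int = 5) -> bool:
    """Hypothetical quality heuristic: enough words, mostly letters and spaces."""
    words = text.split()
    if len(words) < min_words:
        return False
    letters = sum(ch.isalpha() or ch.isspace() for ch in text)
    return letters / len(text) >= 0.8  # assumed threshold, not a known value


def filter_corpus(documents: list[str]) -> list[str]:
    """Drop exact duplicates (ignoring case and spacing), then apply the heuristic."""
    seen = set()
    kept = []
    for doc in documents:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # already saw this text
        seen.add(digest)
        if looks_like_prose(doc):
            kept.append(doc)
    return kept


raw = [
    "The river floods every spring and the village moves uphill.",
    "the river floods every spring and the village moves uphill.",  # duplicate
    "$$$ 1234 #### 5678 @@@@ 9999 !!!! 0000",  # mostly symbols
    "Too short.",
]
kept = filter_corpus(raw)
print(f"kept {len(kept)} of {len(raw)} documents")
```

Every threshold in a filter like this is a decision about what counts as good text. A rule tuned on English prose can discard a poem, a transcript, or a language it was never tested on.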

The Language Bias
The corpus is overwhelmingly English.

The Breakdown (rough estimates; OpenAI has not published figures):

English: ~70-80%.

Chinese: ~5-10%.

Spanish: ~3-5%.

Other Languages: The remainder.

The Consequence:

The model is more fluent in English than in other languages.

It has deeper knowledge of Western culture.

It reflects Western values and biases.

A Contrarian Take: The Language Bias Is Not a Bug. It Is a Reflection of the Internet.

The written web is disproportionately English. The AI is trained on the web. The bias is not a failure. It is a statistical reality.

The problem is not the AI. The problem is the uneven distribution of human knowledge online.
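A language breakdown like the one in this section is a share of tokens, not a count of documents. Here is a minimal sketch of how such a share could be computed. The sample texts and the `language_shares` function are hypothetical, and splitting on whitespace is a stand-in for a real tokenizer, which would give different counts.

```python
from collections import Counter


def language_shares(docs):
    """docs: iterable of (language_code, text) pairs.

    Returns each language's fraction of all whitespace-separated tokens.
    """
    counts = Counter()
    for lang, text in docs:
        counts[lang] += len(text.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}


# Toy corpus, invented for illustration only.
sample = [
    ("en", "the model was trained on a large body of text"),
    ("en", "most of it came from the web"),
    ("es", "el modelo aprende de los datos"),
    ("fr", "les données manquent"),
]
shares = language_shares(sample)
for lang, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{lang}: {share:.1%}")
```

The result depends on how you count. A tokenizer built mostly from English text tends to split other languages into more pieces, so shares by token and shares by word can differ.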

The Temporal Bias
The corpus is heavily skewed toward the recent past.

The Breakdown:

2000-2021: The vast majority. (Most of the original GPT-4's training data ends in September 2021.)

Pre-2000: A small fraction.

Pre-1900: A tiny fraction.

The Consequence:

The model is better at understanding contemporary issues.

It is weaker on historical context.

It reflects the values of the 21st century.

A Contrarian Take: The Temporal Bias Is Not a Bug. It Is a Reflection of Relevance.

The AI is trained on what is available and what is relevant. The internet is mostly recent. The AI is mostly recent.

The model does not "forget" history. It just has less data on it.

The Genre Bias
The corpus is heavily skewed toward certain genres.

The Breakdown:

News Articles: High.

Wikipedia: High.

Reddit: High.

Fiction: Moderate.

Poetry: Low.

Academic Papers: Moderate.

The Consequence:

The model is better at expository writing than creative writing.

It is better at explaining than at evoking.

It is better at summarizing than at imagining.

A Contrarian Take: The Genre Bias Is a Feature, Not a Bug.

The AI is designed to be a helpful assistant. It is trained on helpful text.

The bias toward expository writing is not an accident. It is intentional.

The Missing Voices
What is missing from the corpus is as important as what is included.

The Gaps:

Oral Traditions: Languages and stories that were never written down.

Marginalized Communities: Voices that are underrepresented online.

Non-Western Perspectives: Views that do not fit the dominant narrative.

Private Conversations: The texture of everyday life.

The Consequence:

The model is blind to certain experiences.

It reflects the dominant culture.

It erases the margins.

A Contrarian Take: The Missing Voices Are Not an Accident. They Are a Consequence of Power.

The internet is not a neutral space. It is a space shaped by power, money, and access.

The missing voices are not missing by chance. They are missing because they were silenced, ignored, or erased.

What You Can Do
You cannot change the corpus. But you can be aware of its limits.

Recognize the Bias:

The AI is not neutral. It reflects its training data.

Ask: "What is missing from this answer?"

Verify:

Do not trust the AI blindly.

Cross-reference with other sources.

Demand Transparency:

Ask: "What data was used to train this model?"

Support open datasets.

Contribute:

If you have access to underrepresented texts, digitize them.

Make them available for future models.

The Last Page
The last page of the library is not written. It is waiting.

You ask: "What is the most important book you have ever read?"
The model says: "I have not read any books. I have processed patterns."
You realize: The model does not have a memory. It has a map.

If you could add one book, one website, or one voice to the training corpus, what would it be? And why is it missing?
