DEV Community

Queued_

Posted on • Originally published at queued.hashnode.dev

I built the algorithm behind ChatGPT from scratch — here's what I learned

 By Tushar Singla | First Year BTech CSE (AI/ML) Student


"What I cannot create, I do not understand."
— Richard Feynman

That quote hit different when I was staring at my screen at 2am, watching my tokenizer learn the word "the" by merge #17.

Let me explain.


The origin story

Every time you type something into ChatGPT, Claude, or any LLM, something happens before the AI even sees your message. Your text gets tokenized.

"Hello, how are you?" 
→ [15496, 11, 703, 389, 345, 30]

Those numbers are what the model actually sees. Not your words. And the thing doing this conversion? A tokenizer.

I wanted to understand exactly how it works. Not from a YouTube video. Not from a HuggingFace tutorial. I wanted to build one myself, from scratch, in pure Python.

So I did.

Meet TewToken — a bilingual BPE tokenizer trained on English + Hindi text, built with zero ML libraries.

pip install git+https://github.com/tusharinqueue/tewtoken.git

Wait, what even is tokenization?

Computers do not understand language. They understand numbers. So the first step in any NLP pipeline is converting text to numbers.

The naive approach? Assign a number to every word:

{"hello": 0, "world": 1, "machine": 2, "learning": 3, ...}

Problem 1: What about "running", "runner", "ran"? They are all forms of "run" but you would need separate entries for each.

Problem 2: New words (like "ChatGPT") would be completely unknown to your model.

Problem 3: Your vocabulary would need millions of entries.

The smarter approach is subword tokenization — split words into meaningful pieces:

"running"  → ["run", "##ning"]
"ChatGPT"  → ["Chat", "G", "PT"]
"unbelievable" → ["un", "believ", "able"]

Even words your model never saw during training can be handled — by combining subwords it has seen.
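To make that concrete, here is a toy greedy longest-match splitter over a made-up subword vocabulary. (This is only an illustration of subword coverage — BPE itself segments by replaying merge rules, as we will see below.)

```python
def greedy_subword_split(word, vocab):
    """Split a word by greedily matching the longest known subword prefix."""
    pieces = []
    i = 0
    while i < len(word):
        # try the longest possible piece first
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character: fall back to a single char
            i += 1
    return pieces

vocab = {"un", "believ", "able"}
print(greedy_subword_split("unbelievable", vocab))
# → ['un', 'believ', 'able']
```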

BPE (Byte Pair Encoding) is one of the most popular subword tokenization algorithms. It is used in GPT-2, GPT-3, GPT-4, LLaMA, and dozens of other production LLMs.


The plan

Here is what I needed to build:

Raw text corpus
    ↓
Clean + merge into one file
    ↓
Build character vocabulary
    ↓
Count word frequencies
    ↓
BPE merge loop (the actual ML part)
    ↓
Save vocab.json + merges.txt
    ↓
encode() and decode() functions
    ↓
Importable Python package

Simple enough on paper. Let's go.


Step 1: Getting the data

I needed a text corpus. I went with YouTube transcripts — real speech patterns, easy to collect with youtube-transcript-api, and bilingual-friendly since I could grab both English and Hindi channels.

from youtube_transcript_api import YouTubeTranscriptApi

transcript = YouTubeTranscriptApi.get_transcript("VIDEO_ID")
text = " ".join([t["text"] for t in transcript])

I collected ~50-60 transcripts — a mix of tech and educational content. Enough to get started, not enough to be production-grade. (We will get to that.)


Step 2: Cleaning the data (train.py)

Raw YouTube transcripts are messy. Like, really messy.

[Music] so today we are gonna talk about [Applause] 
neural  networks  which are  [laughter] basically...

Two regex patterns fixed 90% of it:

import re

def clean_transcript(text):
    # Remove [Music], [Applause], [laughter] etc.
    text = re.sub(r"\[.*?\]", "", text)

    # Normalize whitespace
    text = re.sub(r"\s+", " ", text)

    text = text.strip()
    return text

Quick regex breakdown:

  • r"\[.*?\]" — matches anything inside square brackets. The ? makes it non-greedy
  • r"\s+" — matches one or more whitespace characters, collapses to single space

Then I merged all 50 files into one corpus.txt, one video per line.
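That merge step can be sketched roughly like this (the directory and file names here are hypothetical, not my actual script):

```python
from pathlib import Path

def merge_transcripts(src_dir, out_path):
    """Concatenate cleaned transcript files into one corpus, one transcript per line."""
    files = sorted(Path(src_dir).glob("*.txt"))
    with open(out_path, "w", encoding="utf-8") as out:
        for f in files:
            out.write(f.read_text(encoding="utf-8").strip() + "\n")
```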


Step 3: Understanding the corpus (vocab.py)

Before training, I wanted to see what I was actually working with.

chars = sorted(set(text))
print(f"Total unique characters: {len(chars)}")

Output included Hindi characters. Some of my transcripts were bilingual. There was also a corrupted byte that slipped through cleaning.

Decision point: go English-only, or keep it bilingual?

I kept it bilingual. More interesting to build, and honestly a better story.

Word frequency check confirmed the bilingual corpus:

the: 39580
i: 33185
hai: 8426     (Hindi for "is")
to: 5051      (Hindi discourse marker)
mein: 4295    (Hindi for "in")
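The frequency check itself is just collections.Counter over the corpus. A sketch with a made-up snippet of text:

```python
from collections import Counter

text = "the model the tokenizer the hai kya hai"
freqs = Counter(text.split())
for word, count in freqs.most_common(3):
    print(f"{word}: {count}")
# → the: 3
#   hai: 2
#   model: 1
```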

Step 4: The BPE algorithm (bpe.py)

Let me explain BPE simply, then show you the code.

BPE explained

Imagine you are creating a shorthand language. You start with individual letters:

h e l l o   w o r l d

You look at your entire text and ask: what two things appear next to each other most often?

You find that t and h appear together 142,000 times. So you create a new symbol th. Then you ask again — this time th and e appear 39,000 times together. New symbol: the.

You repeat this 8,000 times. By the end, common words are single tokens. Rare words get split into subwords.

No neural network. No gradient descent. Just frequency counting.

The code

def build_vocab(word_freqs):
    vocab = {}
    for word, freq in word_freqs.items():
        chars = tuple(list(word) + ["</w>"])
        vocab[chars] = freq
    return vocab

That </w> marker is crucial — it tells the tokenizer where words end. Without it, BPE cannot tell "the" as a standalone word from "the" as a prefix of "there".
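You can see the difference directly (reusing build_vocab from above):

```python
def build_vocab(word_freqs):
    vocab = {}
    for word, freq in word_freqs.items():
        chars = tuple(list(word) + ["</w>"])
        vocab[chars] = freq
    return vocab

vocab = build_vocab({"the": 5, "there": 2})
for word_tuple in vocab:
    print(word_tuple)
# → ('t', 'h', 'e', '</w>')
#   ('t', 'h', 'e', 'r', 'e', '</w>')
```

The `</w>` after the `e` in "the" can never merge with the `r` in "there", so the standalone word and the prefix stay distinct.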

from collections import defaultdict

def get_pair_freqs(vocab):
    pairs = defaultdict(int)
    for word_tuple, freq in vocab.items():
        for i in range(len(word_tuple) - 1):
            pairs[(word_tuple[i], word_tuple[i+1])] += freq
    return pairs
NUM_MERGES = 8000
merges = []

for i in range(NUM_MERGES):
    pair_freqs = get_pair_freqs(vocab)
    best_pair = max(pair_freqs, key=pair_freqs.get)

    if pair_freqs[best_pair] < 2:
        break

    vocab = merge_pair(best_pair, vocab)
    merges.append(best_pair)

That is the entire learning algorithm. ~20 lines of Python.
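One piece is not shown above: merge_pair, which rewrites the vocabulary after each chosen merge. A minimal sketch of what it could look like:

```python
def merge_pair(pair, vocab):
    """Replace every adjacent occurrence of `pair` with the merged token."""
    merged = pair[0] + pair[1]
    new_vocab = {}
    for word_tuple, freq in vocab.items():
        new_word = []
        i = 0
        while i < len(word_tuple):
            # merge this pair wherever the two symbols sit side by side
            if i < len(word_tuple) - 1 and (word_tuple[i], word_tuple[i+1]) == pair:
                new_word.append(merged)
                i += 2
            else:
                new_word.append(word_tuple[i])
                i += 1
        new_vocab[tuple(new_word)] = freq
    return new_vocab

print(merge_pair(("t", "h"), {("t", "h", "e", "</w>"): 3}))
# → {('th', 'e', '</w>'): 3}
```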

Watching it learn in real time

Merge 1:   e + </w>   → e</w>     (freq: 205878)
Merge 3:   t + h      → th        (freq: 142233)
Merge 15:  y + ou     → you       (freq: 42526)
Merge 17:  th + e</w> → the</w>   (freq: 39655)
Merge 101: (first Hindi virama combination)

By merge 3, BPE discovered th. By merge 17, it learned the as a complete word. By merge 101, it was figuring out Hindi virama combinations.

Nobody told it anything about English or Hindi grammar. It got there purely from counting — and I was not ready for how satisfying that would feel at 2am.


Step 5: The performance problem

Here is where it got humbling.

8,000 merges took 320 seconds in pure Python.

I tried heaps. I tried incremental updates. Still slow.

The root cause: every iteration, get_pair_freqs() rescans the entire vocabulary from scratch. That is one full pass per merge, so 8,000 merges means 8,000 full scans. With a large corpus, that gets expensive fast.

Production tokenizers like HuggingFace's tokenizers library do 8,000 merges in ~2 seconds — because they are written in Rust.

| Implementation | Time for 8k merges |
| --- | --- |
| TewToken (Python) | 320 seconds |
| HuggingFace tokenizers (Rust) | ~2 seconds |

The algorithm is identical; the language makes all the difference for performance-critical code. This is why v2.0.0 is getting a C++ trainer.


Step 6: Saving the tokenizer

import json

token_to_id = {token: idx for idx, token in enumerate(sorted(all_tokens))}
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(token_to_id, f, ensure_ascii=False, indent=2)

with open("merges.txt", "w", encoding="utf-8") as f:
    for pair in merges:
        f.write(f"{pair[0]} {pair[1]}\n")

ensure_ascii=False is critical — without it, Hindi characters get saved as ugly unicode escapes instead of actual readable characters.

These two files are your tokenizer. Everything else is just loading and applying them.
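Loading them back is the mirror image. A sketch (TewToken's actual loader may differ):

```python
import json

def load_tokenizer(vocab_path="vocab.json", merges_path="merges.txt"):
    """Read the two artifact files back into a vocab dict and an ordered merge list."""
    with open(vocab_path, encoding="utf-8") as f:
        token_to_id = json.load(f)
    merges = []
    with open(merges_path, encoding="utf-8") as f:
        for line in f:
            left, right = line.rstrip("\n").split(" ")
            merges.append((left, right))
    return token_to_id, merges
```

The merge list must keep its file order — as we will see, the rules only work when replayed in the sequence they were learned.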


Step 7: Encode and decode (tokeniser.py)

The encode function replays the training process on new text — applying all 8,000 merge rules in the same order they were learned. Decode does the reverse: look up each ID, join the tokens, replace </w> with spaces.
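In sketch form, for a single word (encode_word is a hypothetical helper; the real implementation handles whole sentences and unknown tokens):

```python
def encode_word(word, merges, token_to_id):
    """Apply the learned merges in training order, then map tokens to IDs."""
    tokens = list(word) + ["</w>"]
    for a, b in merges:  # same order the rules were learned in
        out = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return [token_to_id[t] for t in tokens]

merges = [("t", "h"), ("th", "e"), ("the", "</w>")]
ids = {"the</w>": 17}
print(encode_word("the", merges, ids))
# → [17]
```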

encode("machine learning is amazing")
# → [2839, 2700, 2506, 368]

decode([2839, 2700, 2506, 368])
# → "machine learning is amazing"

Both languages work. Round-trip perfect.


Step 8: Making it a real package

tewtoken/
├── __init__.py       ← makes it importable
├── tokeniser.py      ← encode/decode + 13 utility functions
├── vocab.json        ← trained vocabulary
└── merges.txt        ← 8000 merge rules

__init__.py:

from .tokeniser import (
    encode, decode, tokenize, count_tokens,
    encode_batch, decode_batch, vocab_size,
    get_vocab, get_token_id, get_id_token,
    is_known_token, truncate, encoding_info
)

setup.py:

from setuptools import setup, find_packages

setup(
    name="tewtoken",
    version="1.0.0",
    packages=find_packages(),
    package_data={"tewtoken": ["vocab.json", "merges.txt"]},
)

Now anyone can do:

pip install git+https://github.com/tusharinqueue/tewtoken.git
from tewtoken import encode, decode, count_tokens

print(encode("machine learning"))     # [2839, 2700]
print(decode([2839, 2700]))           # machine learning
print(count_tokens("hello world"))    # 2

That's a real Python package. Installable. Importable. Usable in any project.


The 13 utility functions

Inspired by tiktoken, TewToken ships with 13 functions:

| Function | What it does |
| --- | --- |
| encode(text) | text to token IDs |
| decode(ids) | token IDs to text |
| tokenize(text) | text to token strings |
| count_tokens(text) | count tokens in text |
| encode_batch(texts) | encode a list of texts |
| decode_batch(ids_list) | decode a list of ID lists |
| truncate(text, max) | truncate to N tokens |
| vocab_size() | total vocab size |
| get_vocab() | full vocabulary dict |
| get_token_id(token) | ID for a token string |
| get_id_token(id) | token string for an ID |
| is_known_token(token) | check if a token exists |
| encoding_info() | tokenizer summary |
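Most of these are thin wrappers over encode and decode. For example, truncate can be sketched like this (shown with a toy whitespace tokenizer so it runs standalone; TewToken's version uses its own encode/decode internally rather than taking them as parameters):

```python
def truncate(text, max_tokens, encode, decode):
    """Keep at most max_tokens tokens of text."""
    ids = encode(text)
    if len(ids) <= max_tokens:
        return text
    return decode(ids[:max_tokens])

# toy tokenizer: one "token" per whitespace-separated word
toy_encode = str.split
toy_decode = " ".join
print(truncate("machine learning is amazing", 2, toy_encode, toy_decode))
# → machine learning
```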

What actually clicked

Before building TewToken, I "knew" about tokenization the way you know a word from a dictionary — definition memorized, never really used it. After building it, I understood it the way you understand a road after driving it at night.

The thing that surprised me most: BPE has no neural network in it. No loss function, no backpropagation. It is a greedy algorithm that picks the most frequent pair at each step and calls it done. I kept expecting something more complex under the hood, and it just was not. Simple and effective, which somehow felt more satisfying than the alternative.

The </w> marker broke me for a couple of hours before it clicked. Without word boundary markers, your encode/decode roundtrip silently breaks in ways that are annoying to debug. Lesson: the small implementation details are where the actual understanding lives, not the high-level concept.

Data cleaning mattered more than I expected. Removing [Music] tags and normalizing whitespace improved training quality more than throwing in extra raw transcripts. I watched it happen in the frequency counts — corrupted data leaves fingerprints.

The performance gap is also real. 320 seconds vs 2 seconds is not a Python-is-slow meme — it is why HuggingFace rewrote their entire tokenizers library in Rust. Same algorithm, completely different story at scale.

And the one I keep coming back to: merge order matters more than the merges themselves. Rule 3 (t+h -> th) has to happen before rule 17 (th+e -> the). Shuffle the order and everything breaks — no recovering from it. The whole tokenizer is just a list of rules in a very specific sequence, and that sequence is what you are actually training.


What is next

v2.0.0 is in progress: training data scaled to ~1GB (Simple English Wikipedia at 178MB + Hindi Wikipedia at 742MB with 172k articles), 32,000 merges to match GPT-2, and a C++ trainer to fix the speed problem.

After that: tests with pytest, a PyPI release, a Streamlit demo, and maybe a blog series on extending it to handle special tokens.


The takeaway

If you are learning ML and want to really understand how LLMs work under the hood, do not just read about tokenization. Build a tokenizer.

You will spend a few days confused. You will hit weird bugs. Your first version will be slow. But when you finally see:

Merge 17: th + e</w> → the</w>  (freq: 39655)

And realize an algorithm you wrote just discovered the most common word in English purely from counting — that feeling is worth every confused hour.


Try it yourself

pip install git+https://github.com/tusharinqueue/tewtoken.git
from tewtoken import encode, decode, encoding_info

print(encoding_info())
# {
#   "vocab_size": 7915,
#   "num_merges": 8000,
#   "type": "BPE",
#   "language": "bilingual (English + Hindi)"
# }

GitHub: https://github.com/tusharinqueue/tewtoken


— Tushar Singla

Tags: #MachineLearning #NLP #Python #OpenSource #BuildInPublic #BPE #Tokenizer #LLM #Hindi
