DEV Community

Queued_

Posted on • Originally published at queued.hashnode.dev

I built the algorithm behind ChatGPT from scratch — here's what I learned

 By Tushar Singla | First Year BTech CSE (AI/ML) Student


"What I cannot create, I do not understand."
— Richard Feynman

That quote hit different when I was staring at my screen at 2am, watching my tokenizer learn the word "the" by merge #17.

Let me explain.


The origin story

Every time you type something into ChatGPT, Claude, or any LLM, something happens before the AI even sees your message. Your text gets tokenized.

"Hello, how are you?" 
→ [15496, 11, 703, 389, 345, 30]

Those numbers are what the model actually sees. Not your words. And the thing doing this conversion? A tokenizer.

I wanted to understand exactly how it works. Not from a YouTube video. Not from a HuggingFace tutorial. I wanted to build one myself, from scratch, in pure Python.

So I did.

Meet TewToken — a bilingual BPE tokenizer trained on English + Hindi text, built with zero ML libraries.

pip install git+https://github.com/tusharinqueue/tewtoken.git

Wait, what even is tokenization?

Computers do not understand language. They understand numbers. So the first step in any NLP pipeline is converting text to numbers.

The naive approach? Assign a number to every word:

{"hello": 0, "world": 1, "machine": 2, "learning": 3, ...}

Problem 1: What about "running", "runner", "ran"? They are all forms of "run" but you would need separate entries for each.

Problem 2: New words (like "ChatGPT") would be completely unknown to your model.

Problem 3: Your vocabulary would need millions of entries.

The smarter approach is subword tokenization — split words into meaningful pieces:

"running"  → ["run", "##ning"]
"ChatGPT"  → ["Chat", "G", "PT"]
"unbelievable" → ["un", "believ", "able"]

Even words your model never saw during training can be handled — by combining subwords it has seen.
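To make that concrete, here is a toy greedy longest-match splitter over a made-up subword vocabulary. (This is only an illustration of subword coverage — BPE itself segments by replaying merge rules, as we will see below.)

```python
def greedy_subword_split(word, vocab):
    """Split a word by greedily matching the longest known subword prefix."""
    pieces = []
    i = 0
    while i < len(word):
        # try the longest possible piece first
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character: fall back to a single char
            i += 1
    return pieces

vocab = {"un", "believ", "able"}
print(greedy_subword_split("unbelievable", vocab))
# → ['un', 'believ', 'able']
```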

BPE (Byte Pair Encoding) is one of the most popular subword tokenization algorithms. It is used in GPT-2, GPT-3, GPT-4, LLaMA, and dozens of other production LLMs.


The plan

Here is what I needed to build:

Raw text corpus
    ↓
Clean + merge into one file
    ↓
Build character vocabulary
    ↓
Count word frequencies
    ↓
BPE merge loop (the actual ML part)
    ↓
Save vocab.json + merges.txt
    ↓
encode() and decode() functions
    ↓
Importable Python package

Simple enough on paper. Let's go.


Step 1: Getting the data

I needed a text corpus. I went with YouTube transcripts — real speech patterns, easy to collect with youtube-transcript-api, and bilingual-friendly since I could grab both English and Hindi channels.

from youtube_transcript_api import YouTubeTranscriptApi

transcript = YouTubeTranscriptApi.get_transcript("VIDEO_ID")
text = " ".join([t["text"] for t in transcript])

I collected ~50-60 transcripts — a mix of tech and educational content. Enough to get started, not enough to be production-grade. (We will get to that.)


Step 2: Cleaning the data (train.py)

Raw YouTube transcripts are messy. Like, really messy.

[Music] so today we are gonna talk about [Applause] 
neural  networks  which are  [laughter] basically...

Two regex patterns fixed 90% of it:

import re

def clean_transcript(text):
    # Remove [Music], [Applause], [laughter] etc.
    text = re.sub(r"\[.*?\]", "", text)

    # Normalize whitespace
    text = re.sub(r"\s+", " ", text)

    text = text.strip()
    return text

Quick regex breakdown:

  • r"\[.*?\]" — matches anything inside square brackets. The ? makes it non-greedy
  • r"\s+" — matches one or more whitespace characters, collapses to single space

Then I merged all 50 files into one corpus.txt, one video per line.
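That merge step can be sketched roughly like this (the directory and file names here are hypothetical, not my actual script):

```python
from pathlib import Path

def merge_transcripts(src_dir, out_path):
    """Concatenate cleaned transcript files into one corpus, one transcript per line."""
    files = sorted(Path(src_dir).glob("*.txt"))
    with open(out_path, "w", encoding="utf-8") as out:
        for f in files:
            out.write(f.read_text(encoding="utf-8").strip() + "\n")
```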


Step 3: Understanding the corpus (vocab.py)

Before training, I wanted to see what I was actually working with.

chars = sorted(set(text))
print(f"Total unique characters: {len(chars)}")

Output included Hindi characters. Some of my transcripts were bilingual. There was also a corrupted byte that slipped through cleaning.

Decision point: go English-only, or keep it bilingual?

I kept it bilingual. More interesting to build, and honestly a better story.

Word frequency check confirmed the bilingual corpus:

the: 39580
i: 33185
hai: 8426     (Hindi for "is")
to: 5051      (Hindi discourse marker)
mein: 4295    (Hindi for "in")
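The frequency check itself is just collections.Counter over the corpus. A sketch with a made-up snippet of text:

```python
from collections import Counter

text = "the model the tokenizer the hai kya hai"
freqs = Counter(text.split())
for word, count in freqs.most_common(3):
    print(f"{word}: {count}")
# → the: 3
#   hai: 2
#   model: 1
```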

Step 4: The BPE algorithm (bpe.py)

Let me explain BPE simply, then show you the code.

BPE explained

Imagine you are creating a shorthand language. You start with individual letters:

h e l l o   w o r l d

You look at your entire text and ask: what two things appear next to each other most often?

You find that t and h appear together 142,000 times. So you create a new symbol th. Then you ask again — this time th and e appear 39,000 times together. New symbol: the.

You repeat this 8,000 times. By the end, common words are single tokens. Rare words get split into subwords.

No neural network. No gradient descent. Just frequency counting.

The code

def build_vocab(word_freqs):
    vocab = {}
    for word, freq in word_freqs.items():
        chars = tuple(list(word) + ["</w>"])
        vocab[chars] = freq
    return vocab

That </w> marker is crucial — it tells the tokenizer where words end. Without it, BPE cannot tell "the" as a standalone word from "the" as a prefix of "there".
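You can see the difference directly (reusing build_vocab from above):

```python
def build_vocab(word_freqs):
    vocab = {}
    for word, freq in word_freqs.items():
        chars = tuple(list(word) + ["</w>"])
        vocab[chars] = freq
    return vocab

vocab = build_vocab({"the": 5, "there": 2})
for word_tuple in vocab:
    print(word_tuple)
# → ('t', 'h', 'e', '</w>')
#   ('t', 'h', 'e', 'r', 'e', '</w>')
```

The `</w>` after the `e` in "the" can never merge with the `r` in "there", so the standalone word and the prefix stay distinct.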

from collections import defaultdict

def get_pair_freqs(vocab):
    pairs = defaultdict(int)
    for word_tuple, freq in vocab.items():
        for i in range(len(word_tuple) - 1):
            pairs[(word_tuple[i], word_tuple[i+1])] += freq
    return pairs
NUM_MERGES = 8000
merges = []

for i in range(NUM_MERGES):
    pair_freqs = get_pair_freqs(vocab)
    best_pair = max(pair_freqs, key=pair_freqs.get)

    if pair_freqs[best_pair] < 2:
        break

    vocab = merge_pair(best_pair, vocab)
    merges.append(best_pair)

That is the entire learning algorithm. ~20 lines of Python.
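One piece is not shown above: merge_pair, which rewrites the vocabulary after each chosen merge. A minimal sketch of what it could look like:

```python
def merge_pair(pair, vocab):
    """Replace every adjacent occurrence of `pair` with the merged token."""
    merged = pair[0] + pair[1]
    new_vocab = {}
    for word_tuple, freq in vocab.items():
        new_word = []
        i = 0
        while i < len(word_tuple):
            # merge this pair wherever the two symbols sit side by side
            if i < len(word_tuple) - 1 and (word_tuple[i], word_tuple[i+1]) == pair:
                new_word.append(merged)
                i += 2
            else:
                new_word.append(word_tuple[i])
                i += 1
        new_vocab[tuple(new_word)] = freq
    return new_vocab

print(merge_pair(("t", "h"), {("t", "h", "e", "</w>"): 3}))
# → {('th', 'e', '</w>'): 3}
```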

Watching it learn in real time

Merge 1:   e + </w>   → e</w>     (freq: 205878)
Merge 3:   t + h      → th        (freq: 142233)
Merge 15:  y + ou     → you       (freq: 42526)
Merge 17:  th + e</w> → the</w>   (freq: 39655)
Merge 101: (first Hindi virama combination)

By merge 3, BPE discovered th. By merge 17, it learned the as a complete word. By merge 101, it was figuring out Hindi virama combinations.

Nobody told it anything about English or Hindi grammar. It got there purely from counting — and I was not ready for how satisfying that would feel at 2am.


Step 5: The performance problem

Here is where it got humbling.

8,000 merges took 320 seconds in pure Python.

I tried heaps. I tried incremental updates. Still slow.

The root cause: every iteration, get_pair_freqs() rescans the entire vocabulary from scratch. That is one full pass per merge, so 8,000 merges means 8,000 full scans. With a large corpus, that gets expensive fast.

Production tokenizers like HuggingFace's tokenizers library do 8,000 merges in ~2 seconds — because they are written in Rust.

| Implementation | Time for 8k merges |
| --- | --- |
| TewToken (Python) | 320 seconds |
| HuggingFace tokenizers (Rust) | ~2 seconds |

The algorithm is identical; the language makes all the difference for performance-critical code. This is why v2.0.0 is getting a C++ trainer.


Step 6: Saving the tokenizer

import json

token_to_id = {token: idx for idx, token in enumerate(sorted(all_tokens))}
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(token_to_id, f, ensure_ascii=False, indent=2)

with open("merges.txt", "w", encoding="utf-8") as f:
    for pair in merges:
        f.write(f"{pair[0]} {pair[1]}\n")

ensure_ascii=False is critical — without it, Hindi characters get saved as ugly unicode escapes instead of actual readable characters.

These two files are your tokenizer. Everything else is just loading and applying them.
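Loading them back is the mirror image. A sketch (TewToken's actual loader may differ):

```python
import json

def load_tokenizer(vocab_path="vocab.json", merges_path="merges.txt"):
    """Read the two artifact files back into a vocab dict and an ordered merge list."""
    with open(vocab_path, encoding="utf-8") as f:
        token_to_id = json.load(f)
    merges = []
    with open(merges_path, encoding="utf-8") as f:
        for line in f:
            left, right = line.rstrip("\n").split(" ")
            merges.append((left, right))
    return token_to_id, merges
```

The merge list must keep its file order — as we will see, the rules only work when replayed in the sequence they were learned.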


Step 7: Encode and decode (tokeniser.py)

The encode function replays the training process on new text — applying all 8,000 merge rules in the same order they were learned. Decode does the reverse: look up each ID, join the tokens, replace </w> with spaces.
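In sketch form, for a single word (encode_word is a hypothetical helper; the real implementation handles whole sentences and unknown tokens):

```python
def encode_word(word, merges, token_to_id):
    """Apply the learned merges in training order, then map tokens to IDs."""
    tokens = list(word) + ["</w>"]
    for a, b in merges:  # same order the rules were learned in
        out = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return [token_to_id[t] for t in tokens]

merges = [("t", "h"), ("th", "e"), ("the", "</w>")]
ids = {"the</w>": 17}
print(encode_word("the", merges, ids))
# → [17]
```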

encode("machine learning is amazing")
# → [2839, 2700, 2506, 368]

decode([2839, 2700, 2506, 368])
# → "machine learning is amazing"

Both languages work. Round-trip perfect.


Step 8: Making it a real package

tewtoken/
├── __init__.py       ← makes it importable
├── tokeniser.py      ← encode/decode + 13 utility functions
├── vocab.json        ← trained vocabulary
└── merges.txt        ← 8000 merge rules

__init__.py:

from .tokeniser import (
    encode, decode, tokenize, count_tokens,
    encode_batch, decode_batch, vocab_size,
    get_vocab, get_token_id, get_id_token,
    is_known_token, truncate, encoding_info
)

setup.py:

from setuptools import setup, find_packages

setup(
    name="tewtoken",
    version="1.0.0",
    packages=find_packages(),
    package_data={"tewtoken": ["vocab.json", "merges.txt"]},
)

Now anyone can do:

pip install git+https://github.com/tusharinqueue/tewtoken.git
from tewtoken import encode, decode, count_tokens

print(encode("machine learning"))     # [2839, 2700]
print(decode([2839, 2700]))           # machine learning
print(count_tokens("hello world"))    # 2

That's a real Python package. Installable. Importable. Usable in any project.


The 13 utility functions

Inspired by tiktoken, TewToken ships with 13 functions:

| Function | What it does |
| --- | --- |
| encode(text) | text to token IDs |
| decode(ids) | token IDs to text |
| tokenize(text) | text to token strings |
| count_tokens(text) | count tokens in text |
| encode_batch(texts) | encode a list of texts |
| decode_batch(ids_list) | decode a list of ID lists |
| truncate(text, max) | truncate to N tokens |
| vocab_size() | total vocab size |
| get_vocab() | full vocabulary dict |
| get_token_id(token) | ID for a token string |
| get_id_token(id) | token string for an ID |
| is_known_token(token) | check if a token exists |
| encoding_info() | tokenizer summary |
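Most of these are thin wrappers over encode and decode. For example, truncate can be sketched like this (shown with a toy whitespace tokenizer so it runs standalone; TewToken's version uses its own encode/decode internally rather than taking them as parameters):

```python
def truncate(text, max_tokens, encode, decode):
    """Keep at most max_tokens tokens of text."""
    ids = encode(text)
    if len(ids) <= max_tokens:
        return text
    return decode(ids[:max_tokens])

# toy tokenizer: one "token" per whitespace-separated word
toy_encode = str.split
toy_decode = " ".join
print(truncate("machine learning is amazing", 2, toy_encode, toy_decode))
# → machine learning
```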

What actually clicked

Before building TewToken, I "knew" about tokenization the way you know a word from a dictionary — definition memorized, never really used it. After building it, I understood it the way you understand a road after driving it at night.

The thing that surprised me most: BPE has no neural network in it. No loss function, no backpropagation. It is a greedy algorithm that picks the most frequent pair at each step and calls it done. I kept expecting something more complex under the hood, and it just was not. Simple and effective, which somehow felt more satisfying than the alternative.

The </w> marker broke me for a couple of hours before it clicked. Without word boundary markers, your encode/decode roundtrip silently breaks in ways that are annoying to debug. Lesson: the small implementation details are where the actual understanding lives, not the high-level concept.

Data cleaning mattered more than I expected. Removing [Music] tags and normalizing whitespace improved training quality more than throwing in extra raw transcripts. I watched it happen in the frequency counts — corrupted data leaves fingerprints.

The performance gap is also real. 320 seconds vs 2 seconds is not a Python-is-slow meme — it is why HuggingFace rewrote their entire tokenizers library in Rust. Same algorithm, completely different story at scale.

And the one I keep coming back to: merge order matters more than the merges themselves. Rule 3 (t+h -> th) has to happen before rule 17 (th+e -> the). Shuffle the order and everything breaks — no recovering from it. The whole tokenizer is just a list of rules in a very specific sequence, and that sequence is what you are actually training.


What is next

v2.0.0 is in progress: training data scaled to ~1GB (Simple English Wikipedia at 178MB + Hindi Wikipedia at 742MB with 172k articles), 32,000 merges to match GPT-2, and a C++ trainer to fix the speed problem.

After that: tests with pytest, a PyPI release, a Streamlit demo, and maybe a blog series on extending it to handle special tokens.


The takeaway

If you are learning ML and want to really understand how LLMs work under the hood, do not just read about tokenization. Build a tokenizer.

You will spend a few days confused. You will hit weird bugs. Your first version will be slow. But when you finally see:

Merge 17: th + e</w> → the</w>  (freq: 39655)

And realize an algorithm you wrote just discovered the most common word in English purely from counting — that feeling is worth every confused hour.


Try it yourself

pip install git+https://github.com/tusharinqueue/tewtoken.git
from tewtoken import encode, decode, encoding_info

print(encoding_info())
# {
#   "vocab_size": 7915,
#   "num_merges": 8000,
#   "type": "BPE",
#   "language": "bilingual (English + Hindi)"
# }

GitHub: https://github.com/tusharinqueue/tewtoken


— Tushar Singla

Tags: #MachineLearning #NLP #Python #OpenSource #BuildInPublic #BPE #Tokenizer #LLM #Hindi
