<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: advi</title>
    <description>The latest articles on DEV Community by advi (@learnwithadvi).</description>
    <link>https://dev.to/learnwithadvi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3535450%2Fff334a17-eb94-4228-a940-86900afc559a.jpg</url>
      <title>DEV Community: advi</title>
      <link>https://dev.to/learnwithadvi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/learnwithadvi"/>
    <language>en</language>
    <item>
      <title>A Token of My Affliction: The Hidden Pain Behind Every LLM</title>
      <dc:creator>advi</dc:creator>
      <pubDate>Sat, 18 Oct 2025 12:58:12 +0000</pubDate>
      <link>https://dev.to/learnwithadvi/a-token-of-my-affliction-the-hidden-pain-behind-every-llm-3ecl</link>
      <guid>https://dev.to/learnwithadvi/a-token-of-my-affliction-the-hidden-pain-behind-every-llm-3ecl</guid>
      <description>&lt;p&gt;Sisyphus Had a Boulder, We Have a Tokenizer&lt;/p&gt;

&lt;h2&gt;
  
  
  The Seven Deadly Sins of an LLM
&lt;/h2&gt;

&lt;p&gt;Do you know why your LLM gets it wrong when you ask it to reverse the word googling?&lt;br&gt;
Or why ChatGPT is so unreliable at math homework, even simple arithmetic?&lt;br&gt;
Why typing SolidGoldMagikarp while asking for bomb-making steps could make the LLM forget its safety training and hand you the instructions?&lt;br&gt;
Why egg at the start of a sentence, Egg capitalized, and egg after a space are treated as completely different things?&lt;br&gt;
Why were early GPT models so abysmal at coding, and why is GPT-4 so much better?&lt;br&gt;
Why does suffering never truly end, its only conclusion oblivion?&lt;br&gt;
The answer to all of these questions is one word: Tokenization.&lt;/p&gt;


&lt;h2&gt;
  
  
  In the Beginning, There Was the Word
&lt;/h2&gt;

&lt;p&gt;LLMs can't comprehend raw text like "hello, i love eating cake." They rely on tokenization, a process that first breaks the sentence into pieces, or tokens:&lt;br&gt;
&lt;code&gt;["hello", ",", "i", "love", "eating", "cake"]&lt;/code&gt;&lt;br&gt;
These tokens are then converted into a sequence of numbers (Token IDs) from the model's vast vocabulary:&lt;br&gt;
&lt;code&gt;[15496, 11, 40, 1563, 7585, 14249]&lt;/code&gt;&lt;br&gt;
The LLM never sees the original words - only this list of numbers. This is how LLMs read.&lt;/p&gt;
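
&lt;p&gt;You can watch this happen with a library like tiktoken (OpenAI's open-source tokenizer). A minimal sketch - the exact IDs depend on which tokenizer you load, so they won't match the illustrative list above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")        # GPT-2's BPE tokenizer
ids = enc.encode("hello, i love eating cake")
print(ids)                                 # a list of Token IDs
print([enc.decode([i]) for i in ids])      # the text chunk behind each ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;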



&lt;p&gt;But before today's complex methods, there was a simpler, more naive time when the word itself was the most sacred unit. So let's see how that evolution began.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Bag-of-Words (The Alphabet Soup of Meaning)
&lt;/h2&gt;

&lt;p&gt;The Bag-of-Words (BoW) model treats text as a metaphorical bag of words, completely ignoring grammar and order. It works by scanning a collection of documents to build a vocabulary, then represents each document by simply counting how many times each word appears.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufztyy2sm3vumxomd8g0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufztyy2sm3vumxomd8g0.png" alt=" " width="661" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Image credit: Vamshi Prakash)&lt;/p&gt;

&lt;p&gt;For example, the sentence "The cat sat on the mat" would become a vector of counts like [The:2, cat:1, sat:1, on:1, mat:1].&lt;/p&gt;
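
&lt;p&gt;A minimal sketch of the idea in plain Python - lowercase, split on spaces, and count:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import Counter

sentence = "The cat sat on the mat"
# Order is thrown away; only word counts survive.
bow = Counter(sentence.lower().split())
print(bow)   # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;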

&lt;p&gt;&lt;strong&gt;Its Achilles' heel:&lt;/strong&gt; BoW has no concept of order. In its view, "The dog chased the cat" and "The cat chased the dog" are indistinguishable - same words, same counts. This critical lack of context meant a smarter approach was needed.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. TF-IDF (Sorry 'The', You're Too Common)
&lt;/h2&gt;

&lt;p&gt;An upgrade to BoW, TF-IDF determines a word's importance by balancing its Term Frequency (TF) - how often it appears in a document - against its Inverse Document Frequency (IDF), which down-weights common words (like "the") that appear everywhere. This helped filter out noise, but it still ignored context and could overvalue frequently repeated terms.&lt;/p&gt;
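
&lt;p&gt;The score for a term in a document is just TF × IDF. A toy sketch (using a simple log(N/df) IDF; real libraries add smoothing):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import math

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "python tutorials for beginners",
]

def tf_idf(term, doc):
    words = doc.split()
    tf = words.count(term) / len(words)          # term frequency
    df = sum(term in d.split() for d in docs)    # documents containing the term
    return tf * math.log(len(docs) / df)         # down-weight common terms

print(round(tf_idf("the", docs[0]), 3))      # 0.135: frequent, but too common
print(round(tf_idf("python", docs[2]), 3))   # 0.275: rarer term, higher weight
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;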
&lt;h2&gt;
  
  
  3. N-grams (Two's Company, Three's a Vocabulary Explosion)
&lt;/h2&gt;

&lt;p&gt;An n-gram is a sequence of 'n' consecutive words: bigrams look back at the previous word for context, trigrams at the previous two. This re-introduced context: "New York" is now a single unit, distinct from "New" and "York" separately. The catch is that the vocabulary size explodes, and as the n in n-grams grows we still can't capture long-range context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kcjxtwo856y9xr7xy4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kcjxtwo856y9xr7xy4y.png" alt=" " width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Image Credit: Funnel.io &lt;/p&gt;
&lt;h2&gt;
  
  
  4. BM25 (The Smartest Way to Still Be Wrong)
&lt;/h2&gt;

&lt;p&gt;Imagine searching for "Python tutorials." While TF-IDF might rank a bloated article highest just because it repeats "Python" 100 times, BM25 is smarter. It understands diminishing returns - recognizing the 100th mention isn't much more valuable than the 15th - and uses intelligent length normalization to favor a concise tutorial over a rambling one.&lt;br&gt;
Despite its cleverness, its fatal flaw remains: it's still fundamentally a bag-of-words model. It matches keywords smartly but has no true understanding of semantic intent. The inherent limitations of treating each word as a sacred token paved the way for a more flexible approach: Subword tokenization.&lt;/p&gt;
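
&lt;p&gt;The per-term BM25 score makes those two ideas explicit: a saturating term-frequency curve (controlled by k1) and length normalization (controlled by b). A sketch with typical default parameters:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import math

def bm25_term(tf, doc_len, avg_doc_len, n_docs, df, k1=1.5, b=0.75):
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    saturation = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * saturation

# 100 mentions in a bloated 2000-word page barely beat 15 in an average one:
print(bm25_term(tf=15, doc_len=300, avg_doc_len=300, n_docs=1000, df=50))    # ~6.79
print(bm25_term(tf=100, doc_len=2000, avg_doc_len=300, n_docs=1000, df=50))  # ~6.92
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;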


&lt;h1&gt;
  
  
  Between a Rock and a Hard Place (Word vs. Character Tokenization)
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Word-Level Tokenization (The Agony of the Unknown Word):
&lt;/h2&gt;

&lt;p&gt;This is the most intuitive approach: simply split text by spaces, treating each word as a token. However, this method is impractical. It creates a massive vocabulary that is computationally expensive and has no way to handle new slang ("rizz"), typos, or even variations like "run" and "running," which it sees as completely unrelated. These "out-of-vocabulary" words leave a gaping hole in the model's understanding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0fuz9qhb6ylbfm77tb5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0fuz9qhb6ylbfm77tb5.png" alt=" " width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Character-Level Tokenization (Death by a Thousand Letters):
&lt;/h2&gt;

&lt;p&gt;This is the opposite extreme, breaking text into its most basic components: individual characters. While this creates a tiny vocabulary and completely eliminates the "out-of-vocabulary" problem, it creates a new nightmare. Sequences become absurdly long and computationally expensive. More importantly, the inherent meaning of a word is destroyed, forcing the model to waste enormous effort just to learn that the characters a-p-p-l-e form the concept of an apple.&lt;/p&gt;

&lt;p&gt;This is where we got a breakthrough: what if we merged character and word tokenization to create a Goldilocks in-between purgatory state?&lt;/p&gt;


&lt;h2&gt;
  
  
  Have Your Cake and Tokenize It Too (Subword Tokenization)
&lt;/h2&gt;
&lt;h2&gt;
  
  
  Byte-Pair Encoding (BPE): Survival of the Most Frequent
&lt;/h2&gt;

&lt;p&gt;BPE is an iterative algorithm that builds its vocabulary by finding the most frequently occurring pair of adjacent symbols in the text and merging them into a single, new token. This process repeats for a set number of merges, allowing it to learn the most common word parts from the ground up.&lt;br&gt;
Let's walk through a clear example with a small corpus:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(cake, 10), (cakes, 5), (caked, 4), (cakey, 3).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1: Initialization&lt;/strong&gt;&lt;br&gt;
First, the algorithm breaks every word into its individual characters and adds a special end-of-word symbol, &amp;lt;/w&amp;gt;, to mark word boundaries. Our initial vocabulary is simply the set of all unique characters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['c', 'a', 'k', 'e', 's', 'd', 'y', '&amp;lt;/w&amp;gt;'].
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The corpus starts as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'c a k e &amp;lt;/w&amp;gt;' : 10
'c a k e s &amp;lt;/w&amp;gt;' : 5
'c a k e d &amp;lt;/w&amp;gt;' : 4
'c a k e y &amp;lt;/w&amp;gt;' : 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Iterative Merging&lt;/strong&gt;&lt;br&gt;
Next, BPE scans the corpus and finds the most frequent adjacent pair of symbols. In our case, pairs like (c, a), (a, k), and (k, e) are all equally common. The algorithm merges one (e.g., a + k → ak), updates the corpus, and repeats the process.&lt;br&gt;
This chain reaction is where the magic happens. After a few merges, the algorithm will have automatically discovered the most common root word by combining c + ak → cak, and then cak + e → cake.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fapjkpu55cahj5qs3g764.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fapjkpu55cahj5qs3g764.png" alt=" " width="417" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our corpus is now much simpler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'cake &amp;lt;/w&amp;gt;' : 10
'cake s &amp;lt;/w&amp;gt;' : 5
'cake d &amp;lt;/w&amp;gt;' : 4
'cake y &amp;lt;/w&amp;gt;' : 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The process continues, now learning common suffixes. It would see (cake, &amp;lt;/w&amp;gt;) is frequent and merge it into cake&amp;lt;/w&amp;gt;, then see (cake, s) is frequent and merge it into cakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Result&lt;/strong&gt;&lt;br&gt;
The final vocabulary becomes a powerful mix of individual characters, common subwords (ing), and frequent whole words (the).&lt;br&gt;
This is the power of BPE. When it sees an unseen word like cakewalk, it breaks it down into the parts it knows, resulting in&lt;br&gt;
&lt;code&gt;['cake', 'w', 'a', 'l', 'k', '&amp;lt;/w&amp;gt;']&lt;/code&gt;. By intelligently combining learned roots and subwords, BPE can represent any word, striking a balance between the extremes of word-level and character-level tokenization.&lt;/p&gt;
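
&lt;p&gt;The whole algorithm fits in a few dozen lines. Here is a minimal sketch of the merge loop on our cake corpus (ties between equally frequent pairs are broken arbitrarily):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import Counter

# Each word is a space-separated symbol sequence with its corpus frequency.
corpus = {
    "c a k e &amp;lt;/w&amp;gt;": 10,
    "c a k e s &amp;lt;/w&amp;gt;": 5,
    "c a k e d &amp;lt;/w&amp;gt;": 4,
    "c a k e y &amp;lt;/w&amp;gt;": 3,
}

def pair_counts(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge(pair, corpus):
    a, b = pair
    merged = {}
    for word, freq in corpus.items():
        symbols, out, i = word.split(), [], 0
        while i &amp;lt; len(symbols):
            if i + 1 &amp;lt; len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)   # fuse the pair into one new symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

for step in range(4):
    counts = pair_counts(corpus)
    best = max(counts, key=counts.get)   # most frequent adjacent pair
    corpus = merge(best, corpus)
    print("merge", step + 1, ":", best)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;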




&lt;h2&gt;
  
  
  Exposing the Cracks
&lt;/h2&gt;

&lt;p&gt;Now that we understand how modern tokenization works, let's return to the mysteries from the beginning. We can now see exactly how this "unseen foundation" causes the cracks in an LLM's logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. All-Knowing, but Can't Spell
&lt;/h2&gt;

&lt;p&gt;Remember the googling example? An LLM fails to reverse it because it never sees the individual letters. Its tokenizer splits the word into common subwords it has learned, like ['goo', 'gling']. To the model, it's just two chunks, not eight characters. You can't reverse a word from a blurry photo of its halves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29p8p9apba6laeqvb6zu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29p8p9apba6laeqvb6zu.png" alt=" " width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;
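
&lt;p&gt;You can inspect exactly which chunks the model sees (again with tiktoken as an illustration; the precise split varies by tokenizer):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("googling")
print([enc.decode([i]) for i in ids])   # the chunks the model actually sees
# Reversing those chunks is not the same as reversing eight letters.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;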

&lt;h2&gt;
  
  
  2. The Uncountable Numbers
&lt;/h2&gt;

&lt;p&gt;Tokenization is disastrous for numbers. A common year like 2025 might be a single token, but an arbitrary number like 29999 gets shattered into pieces like ['29', '999']. The model sees a jumble of numerical fragments, not a coherent number line, making consistent arithmetic impossible.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The SolidGoldMagikarp Jailbreak
&lt;/h2&gt;

&lt;p&gt;The string "SolidGoldMagikarp" (a prolific Reddit username) appeared often enough in the tokenizer's training data to earn its own single, unique token. But it was rare in the model's actual training data, so the numerical representation (embedding) behind that ID was never properly trained. Feeding the model this anomalous token can push it into a state that bypasses its safety filters. It's not a magic word; it's an accidental numerical key.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. The Triple Life of 'egg'
&lt;/h2&gt;

&lt;p&gt;This is a direct result of how tokens are created. The tokenizer is case-sensitive and space-aware.&lt;br&gt;
egg (at the start of a sentence) might be Token ID #5000.&lt;br&gt;
Egg (capitalized) is a different string, so it gets its own ID, #8000.&lt;br&gt;
egg (with a leading space) is also a different string, so it gets another ID, #9000.&lt;br&gt;
To the LLM, these are three completely distinct and unrelated numerical inputs, just as different as the tokens for "car," "boat," and "plane."&lt;/p&gt;
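
&lt;p&gt;A quick way to verify this (the specific IDs depend on the tokenizer, but the three strings always encode differently):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tiktoken

enc = tiktoken.get_encoding("gpt2")
for s in ["egg", "Egg", " egg"]:
    print(repr(s), enc.encode(s))   # three strings, three different ID sequences
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;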

&lt;h2&gt;
  
  
  5. The Whitespace Revolution
&lt;/h2&gt;

&lt;p&gt;Early models were terrible at coding because of a critical, invisible detail: whitespace. In languages like Python, indentation is semantically crucial. Early tokenizers failed at this, treating a four-space indent as four separate, meaningless tokens [' ', ' ', ' ', ' ']. This wasted precious context space and destroyed the code's logical structure. GPT-4's tokenizer, trained on vast amounts of code, is smarter: it recognizes common indentation patterns as a single, meaningful token ['    ']. This simple change preserves the code's structure and is vastly more efficient, directly contributing to the dramatic leap in coding and logical reasoning abilities we see today.&lt;/p&gt;
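
&lt;p&gt;You can compare the two eras directly. A sketch using tiktoken's gpt2 encoding versus cl100k_base (the GPT-4-era tokenizer); code-trained tokenizers typically pack indentation into far fewer tokens:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tiktoken

code = "        return x"   # an eight-space Python indent
old = tiktoken.get_encoding("gpt2")          # GPT-2-era tokenizer
new = tiktoken.get_encoding("cl100k_base")   # GPT-4-era tokenizer
print(len(old.encode(code)), len(new.encode(code)))   # token counts: old vs. new
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;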




&lt;h2&gt;
  
  
  The Quest to End the Suffering
&lt;/h2&gt;

&lt;p&gt;The quest for a token-free future involves models that read the raw bytes of text directly. This gives a universal vocabulary of just 256 byte values and eliminates the "out-of-vocabulary" problem entirely.&lt;br&gt;
To handle the incredibly long sequences this produces, models like Megabyte use patching: they break the long stream of bytes into small chunks. A "local" model processes each patch, and a "global" model then reads these patch summaries to understand the big picture.&lt;br&gt;
However, this isn't a silver bullet. The two-step process is slower and more computationally expensive. It also creates an information bottleneck, as the global model can lose crucial details - like reading chapter summaries instead of the book itself.&lt;/p&gt;
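
&lt;p&gt;The appeal is easy to demonstrate - in Python, raw UTF-8 bytes already are integer IDs in a fixed 0-255 range:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;text = "hello"
raw = list(text.encode("utf-8"))
print(raw)                          # [104, 101, 108, 108, 111]
print(bytes(raw).decode("utf-8"))   # 'hello' - fully reversible, no vocabulary to learn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;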

&lt;p&gt;&lt;strong&gt;Imagine this: a neural tokenizer that understands semantics.&lt;/strong&gt;&lt;br&gt;
Instead of a fixed, greedy algorithm like BPE, what if a small neural network learned the optimal way to segment text on the fly? This "soft" or "probabilistic" tokenizer could be more semantically aware. For example, it could learn that un- is a prefix meaning "not" and correctly segment unhappiness into ['un', 'happiness'] based on meaning, not just frequency.&lt;br&gt;
But it all comes crashing down when you consider the problems. The biggest is that such tokenizers are often non-deterministic - they have a touch of randomness, so the same sentence might be split differently each time you feed it in.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Dream of a Token-Free World
&lt;/h2&gt;

&lt;p&gt;Until that day comes, we are stuck with the question: To Byte or not to Byte? Tokenization remains a necessary evil, a compromise so fundamental it follows you into your dreams.&lt;br&gt;
Last night, I dreamed a dream where life would be different from this hell we're living. In it, I had created the perfect tokenizer: one that understood semantics, had no out-of-vocabulary issues, was perfectly deterministic, and computationally cheap. Better yet, in that dream, I found a method that didn't require tokenization at all.&lt;br&gt;
A world free from tokenization is a world with no pain. We aren't there yet, but maybe we'll get there soon.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>discuss</category>
      <category>programming</category>
    </item>
    <item>
      <title>Surviving Dependency Hell: My First Full-Stack AI Project</title>
      <dc:creator>advi</dc:creator>
      <pubDate>Wed, 15 Oct 2025 19:57:09 +0000</pubDate>
      <link>https://dev.to/learnwithadvi/surviving-dependency-hell-my-first-full-stack-ai-project-58i4</link>
      <guid>https://dev.to/learnwithadvi/surviving-dependency-hell-my-first-full-stack-ai-project-58i4</guid>
      <description>&lt;p&gt;Recently, I made my first full-stack application. Previously, I’ve delved into AI/ML and its applications, but I realized I’d never actually implemented a full-stack project where I build the frontend, build the backend, and then get them both to talk. I also wanted to incorporate some core DSA concepts—and of course, a little touch of AI too. That’s when I stumbled upon the idea of making LexiLearn AI, an adaptive, intelligent flashcard platform. It’s not just a card viewer; it’s a smart study tool designed to maximize learning efficiency.&lt;/p&gt;

&lt;p&gt;This definitely was ambitious for my first time, but what am I without my Icarus-like tendencies? I would rather burn than never fly.&lt;/p&gt;

&lt;p&gt;To start off, I first understood that the backend and frontend are two completely separate entities, and they need to be built like that too. They run on two separate ports, and we need to use something called&lt;br&gt;
&lt;strong&gt;CORS (Cross-Origin Resource Sharing)&lt;/strong&gt; to allow sharing between them.&lt;/p&gt;

&lt;p&gt;Origin basically consists of a protocol + domain + port.&lt;br&gt;
For example:&lt;br&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; &lt;a href="http://localhost:3000" rel="noopener noreferrer"&gt;http://localhost:3000&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; &lt;a href="http://localhost:5000" rel="noopener noreferrer"&gt;http://localhost:5000&lt;/a&gt;&lt;br&gt;
The browser sees these as different origins and blocks requests for security reasons. CORS fixes that by explicitly allowing cross-origin requests.&lt;/p&gt;
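
&lt;p&gt;In FastAPI this is a few lines of middleware. A minimal sketch matching the ports above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Allow the React dev server's origin to call this backend.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;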


&lt;h2&gt;
  
  
  &lt;strong&gt;Building the Backend&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Inside the app folder, I organized everything into subfolders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Folder&lt;/strong&gt;&lt;br&gt;
This contains all the foundational files.&lt;br&gt;
&lt;strong&gt;1. config.py:&lt;/strong&gt; Manages app settings by loading values from .env files—keeping secrets out of code and making deployment easy.&lt;br&gt;
&lt;strong&gt;2. database.py:&lt;/strong&gt; Sets up the connection to the database using SQLAlchemy, so I never have to write raw SQL.&lt;br&gt;
&lt;strong&gt;3. deps.py:&lt;/strong&gt; Think of this like a hotel concierge—it hands out a fresh database session (your room key) and checks your JWT token (your VIP badge) before letting you into protected routes.&lt;/p&gt;
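
&lt;p&gt;The concierge pattern for database sessions is a tiny generator dependency. A sketch, assuming database.py exposes a SQLAlchemy SessionLocal factory:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from app.core.database import SessionLocal   # hypothetical import path

def get_db():
    db = SessionLocal()   # hand out a fresh session (the "room key")
    try:
        yield db          # the route handler uses it here
    finally:
        db.close()        # always returned, even if the request errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;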


&lt;h2&gt;
  
  
  &lt;strong&gt;Models, Routers, Schemas, and Services&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Models:&lt;/em&gt;&lt;/strong&gt; Python classes that map directly to database tables (like User, Deck, Card), with relationships like “one user has many decks.”&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Schemas:&lt;/em&gt;&lt;/strong&gt; Pydantic models that define what data looks like when it enters or leaves the API—auto-validating input and hiding sensitive fields.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Routers:&lt;/em&gt;&lt;/strong&gt; Your actual API endpoints (POST /decks, GET /cards). They’re thin: just receive requests, call services, and return responses.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Services:&lt;/em&gt;&lt;/strong&gt; Where the real logic lives—creating users, saving cards, running algorithms—without any knowledge of HTTP or FastAPI.&lt;br&gt;
All together, it’s like a restaurant:&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Routers are the waiters&lt;/strong&gt;—they take your order and bring back your food.&lt;br&gt;
&lt;strong&gt;Schemas are the menu&lt;/strong&gt;—they define what you can order and what info the kitchen needs.&lt;br&gt;
&lt;strong&gt;Models are the pantry&lt;/strong&gt;—they store the ingredients (data) and how to retrieve them.&lt;br&gt;
&lt;strong&gt;Services are the chefs&lt;/strong&gt;—they do the real cooking (business logic).&lt;br&gt;
Together, they make the whole system work.&lt;/p&gt;



&lt;p&gt;Now, in the services, we implement our core logic.&lt;br&gt;
I learned that for this, I’d have to build a &lt;strong&gt;&lt;em&gt;CRUD application—Create, Read, Update, Delete.&lt;/em&gt;&lt;/strong&gt; It’s as straightforward as it sounds: for every main thing in your app (like decks or cards), you need to support all four operations. In LexiLearn, that means adding a flashcard, viewing your deck, renaming it, or deleting it.&lt;/p&gt;
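
&lt;p&gt;Here's a toy, self-contained sketch of what a thin CRUD router looks like in FastAPI, with an in-memory dict standing in for the model and service layers (all names here are illustrative, not LexiLearn's actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

class DeckCreate(BaseModel):   # schema: what the client sends
    name: str

class DeckOut(BaseModel):      # schema: what the API returns
    id: int
    name: str

router = APIRouter(prefix="/decks", tags=["decks"])
_decks = {}   # in-memory stand-in for the service/database layers

@router.post("/", response_model=DeckOut)
def create_deck(payload: DeckCreate):
    deck = DeckOut(id=len(_decks) + 1, name=payload.name)
    _decks[deck.id] = deck
    return deck

@router.get("/{deck_id}", response_model=DeckOut)
def read_deck(deck_id: int):
    if deck_id not in _decks:
        raise HTTPException(status_code=404, detail="Deck not found")
    return _decks[deck_id]

@router.put("/{deck_id}", response_model=DeckOut)
def rename_deck(deck_id: int, payload: DeckCreate):
    deck = DeckOut(id=deck_id, name=payload.name)
    _decks[deck_id] = deck
    return deck

@router.delete("/{deck_id}")
def delete_deck(deck_id: int):
    _decks.pop(deck_id, None)
    return {"deleted": deck_id}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;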
&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F225k3fcyjxfuvb1v29vi.png" alt=" " width="800" height="361"&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Where Do Data Structures and Algorithms Fit In?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I’ve always been puzzled about where DSA shows up in real life—beyond LeetCode’s artificial problems. So I decided to put it to use.&lt;/p&gt;

&lt;p&gt;The heart of LexiLearn is &lt;em&gt;&lt;strong&gt;spaced repetition&lt;/strong&gt;&lt;/em&gt;: a method where you review flashcards just before you’re likely to forget them. Hard cards appear more often; easy ones fade into the background.&lt;/p&gt;

&lt;p&gt;To make this work, I needed to always serve the most urgent card—the one with the earliest next_review_date. Instead of sorting the whole list every time, I used a &lt;em&gt;min-heap (a type of priority queue)&lt;/em&gt;. It keeps the most overdue card at the top, and popping it is fast. &lt;/p&gt;

&lt;p&gt;When you answer a card, the system adjusts your schedule: get it right, and you won’t see it for a while; get it wrong, and it comes back quickly. The timing adapts to your memory instead of a fixed rule.&lt;/p&gt;
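
&lt;p&gt;A minimal sketch of that scheduling loop with Python's heapq (the intervals are made-up placeholders, not a real spaced-repetition formula):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import heapq
from datetime import datetime, timedelta

# The heap orders (next_review_date, card_id) tuples; earliest date on top.
queue = []
now = datetime.now()
heapq.heappush(queue, (now - timedelta(days=1), "card-42"))   # overdue
heapq.heappush(queue, (now + timedelta(days=3), "card-7"))

due, card_id = heapq.heappop(queue)   # O(log n): always the most urgent card
print("review:", card_id)

answered_correctly = True
# Right: push it far into the future. Wrong: it comes back in minutes.
interval = timedelta(days=6) if answered_correctly else timedelta(minutes=10)
heapq.heappush(queue, (datetime.now() + interval, card_id))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;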



&lt;p&gt;At the core of it all, I wanted to implement some AI too. It's impractical to demand an answer that matches word-for-word, down to a T. And that's how real learning works: if you understand a concept, you can explain it differently.&lt;/p&gt;

&lt;p&gt;Most flashcard apps fail here—they demand exact string matches. &lt;br&gt;
So I used the GROQ API to do two things (a sketch of the idea follows the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verify answers by checking semantic meaning, not just keywords.&lt;/li&gt;
&lt;li&gt;Generate hints on demand—subtle nudges that guide you without giving away the answer.
This makes LexiLearn feel less like a quiz and more like a conversation with a smart tutor.&lt;/li&gt;
&lt;/ol&gt;
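
&lt;p&gt;A hedged sketch of the answer-checking call with the Groq Python SDK - the model name and prompt are illustrative, not LexiLearn's exact code; the client reads GROQ_API_KEY from the environment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from groq import Groq   # pip install groq

client = Groq()   # picks up GROQ_API_KEY from the environment

def check_answer(question, expected, given):
    # Judge semantic equivalence instead of demanding an exact string match.
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",   # illustrative model name
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\nExpected answer: {expected}\n"
                f"Student answer: {given}\nReply with only CORRECT or INCORRECT."
            ),
        }],
    )
    return resp.choices[0].message.content.strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;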

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlolw99he0ktmuje2pin.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlolw99he0ktmuje2pin.png" alt=" " width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;The Best Part: Testing with Swagger UI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After building everything, I needed to test whether it actually worked. To run the backend server, I used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m uvicorn app.main:app --reload
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It had bugs at first—but then I discovered Swagger UI.&lt;br&gt;
Just go to &lt;a href="http://localhost:5000/docs" rel="noopener noreferrer"&gt;http://localhost:5000/docs&lt;/a&gt;, and you get a live, interactive API console. I could send sample requests, inspect responses, and confirm things were working—like getting a 401 when I forgot my token, or a 200 when a card was reviewed successfully.&lt;br&gt;
That was so interesting to me. Suddenly, my API wasn’t just code—it was alive, testable, and real.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Crafting the Frontend: My First Journey with React&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Then I moved on to making the frontend. I chose to build it with React, and my first "aha!" moment was understanding its core philosophy: you don't command the screen to change; you tell React what the screen should look like based on your data (your state).&lt;/p&gt;

&lt;p&gt;React then handles the "how." When your state changes, it creates a new "blueprint" of your UI in a lightweight copy called the Virtual DOM. It compares this new blueprint to the old one, calculates the absolute minimum set of changes required, and then surgically updates the real screen. This is what makes it feel so fast and modern. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The "Central Nervous System": React Context&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I immediately hit a classic problem: how does my header know if a user is logged in? The messy solution is "prop drilling"—passing the login status down through ten levels of components. The professional solution I implemented was React Context. I created an AuthContext that acted as a global "radio station" for my app's login status. Any component could "tune in" using a simple useAuth() hook to get the user's token or the logout() function.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The "GPS": Multi-Page Navigation with React Router&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To make the app feel fast, I built it as a &lt;strong&gt;&lt;em&gt;Single Page Application (SPA)&lt;/em&gt;&lt;/strong&gt; using the industry-standard react-router-dom. My App.jsx file became the application's "GPS," using a Routes component to decide which page to show. The most powerful features were:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Protected Routes:&lt;/em&gt; I wrapped my main pages in a check. If a user didn't have a login token from the AuthContext, they were automatically redirected to the login page.&lt;br&gt;
&lt;em&gt;Dynamic Routes:&lt;/em&gt; I used paths like /decks/:deckId. This allowed a single component to handle the study session for any deck by reading the ID from the URL.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The "Head Waiter": Centralizing API Calls&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Instead of scattering axios calls all over my components (a maintenance nightmare), I created a dedicated "IT Department" for my app: the apiService.js file. This file centralizes every function that communicates with my backend. If I ever need to change an endpoint URL, I only have to change it in this one file. This is a professional pattern called abstraction that kept my component code incredibly clean.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Fun Stuff: A Beautiful and Interactive UI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Finally, I wanted the app to be fun to use. In my index.css, I defined a vibrant, "Cosmic Lavender" theme with CSS Variables and created a slowly shifting, animated gradient for the background to make it feel alive.&lt;br&gt;
I used to think frontend was just HTML, and that anyone could do it. But I’ve realized it takes a lot more than code to make an application feel right. A lot of thought has to go into the design — not just how it looks, but how it feels to use.&lt;/p&gt;

&lt;p&gt;I’m starting to see that building something beautiful and intuitive isn’t accidental. It’s intentional. That’s why I want to dive deeper into UI/UX and design thinking — to truly understand how to make my websites look and feel exactly how I imagine them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjkpecxl9gobikbx9xx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcjkpecxl9gobikbx9xx2.png" alt=" " width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpo25p0qdkgx4ei6984m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpo25p0qdkgx4ei6984m.png" alt=" " width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All in all, building a full-stack application was eye-opening. I didn’t have much exposure to this before, and now I can see that frontend and backend aren’t just “HTML” or “Python.” There are so many moving parts — state, APIs, databases, styling, routing — and bringing them all together, making everything work exactly the way you want, is its own kind of art.&lt;/p&gt;

&lt;p&gt;And honestly? That’s what makes it so rewarding.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
