Prince Raj

Part 3: Turning Text Into Numbers - Bag of Words, Keywords, and Embeddings Without the Magic

The question every beginner eventually asks

At some point in every AI project, you run into the same confusing idea:

How does a sentence become something a model can actually use?

Humans read:

"refund nahi mila yet"

A model does not "read" that in the human sense. A model receives numbers. So we need a bridge between human language and machine math. That bridge in this project is the feature extraction pipeline.

Start with the least magical explanation

Here is the simple version:

Text
  ->
Normalize it
  ->
Split it into tokens
  ->
Create several kinds of numeric clues
  ->
Join those clues into one vector

That is it. The interesting part is what kinds of clues we create.

Step 1: Normalize the text

Before counting anything, we reduce noise.

Examples:

  • Refund Nahi Mila!!! -> refund not received
  • my email is abc@x.com -> my email is <email>
  • call me at 12345 -> call me at <num>

Why do this?

Because models are sensitive to surface variation.

If one user writes PLZ refund and another writes please refund, we usually want the system to treat those as the same idea.

This project normalizes:

  • Lowercase
  • URLs
  • Emails
  • Numbers
  • Hinglish shortcuts
  • Punctuation and spacing

Plain-English version:

We clean away formatting noise so the model can focus on meaning.
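
Here is a minimal sketch of that normalization step in Go. The regex patterns and the tiny Hinglish map below are illustrative, not the project's exact rules:

import (
	"regexp"
	"strings"
)

var (
	urlRe   = regexp.MustCompile(`https?://\S+`)
	emailRe = regexp.MustCompile(`\S+@\S+\.\S+`)
	numRe   = regexp.MustCompile(`\d+`)

	// A few illustrative Hinglish / chat shortcuts mapped to plain English.
	hinglish = map[string]string{
		"plz":  "please",
		"nahi": "not",
		"mila": "received",
	}
)

// Normalize lowercases the text, replaces URLs, emails and numbers with
// placeholder tokens, expands known shortcuts, and collapses whitespace.
func Normalize(text string) string {
	t := strings.ToLower(text)
	t = urlRe.ReplaceAllString(t, "<url>")
	t = emailRe.ReplaceAllString(t, "<email>")
	t = numRe.ReplaceAllString(t, "<num>")

	words := strings.Fields(t)
	for i, w := range words {
		w = strings.Trim(w, "!?.,")
		if repl, ok := hinglish[w]; ok {
			w = repl
		}
		words[i] = w
	}
	return strings.Join(words, " ")
}

With this sketch, "Refund Nahi Mila!!!" comes out as "refund not received", matching the example above.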

Step 2: Tokenize the text

After normalization, we split the text into tokens. For this project, tokenization is intentionally simple:

  • whitespace split

Example:

"payment failed and money got deducted"

becomes:

["payment", "failed", "and", "money", "got", "deducted"]

This is not the fanciest tokenizer in the world. That is okay. For a narrow support-ticket domain, simple tokenization can work surprisingly well.
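
In Go, whitespace tokenization is essentially a one-liner (a sketch, not the project's exact code):

import "strings"

// Tokenize splits normalized text on any run of whitespace.
func Tokenize(text string) []string {
	return strings.Fields(text)
}

Tokenize("payment failed and money got deducted") returns exactly the six tokens shown above.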

This is a good beginner lesson:

"simple" is not the same as "bad"

The project uses a hybrid feature vector

This is where the system gets interesting. Instead of relying on only one representation of text, it builds three:

  1. Bag-of-words
  2. Keyword flags
  3. Averaged embeddings

Then it concatenates them into one big feature vector. Why hybrid? Because each representation has different strengths.

Feature type 1: Bag-of-words

Bag-of-words is one of the oldest and simplest ideas in NLP. The name sounds odd, but the concept is easy:

Keep a vocabulary of important words and count how many times each one appears.

If the vocabulary contains:

  • refund
  • payment
  • error
  • pricing

and the ticket is:

"refund for duplicate payment"

then the bag-of-words vector might look like:

[1, 1, 0, 0]

Plain-English version:

It is a checklist of words that showed up.

This project also applies log1p to those counts. Why?

Because raw counts can get too large. log1p compresses the scale so repeated words still matter, but not too aggressively.

You can think of it like this:

Seeing a word 3 times is more important than seeing it once, but not 3x more important.
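
A minimal sketch of log1p-scaled bag-of-words counts in Go, using the illustrative four-word vocabulary from above:

import "math"

// vocab maps each known word to its position in the bag-of-words vector.
var vocab = map[string]int{"refund": 0, "payment": 1, "error": 2, "pricing": 3}

// BagOfWords counts vocabulary words in the token list and compresses the
// counts with log1p, so repeated words matter, but not linearly more.
func BagOfWords(tokens []string) []float64 {
	vec := make([]float64, len(vocab))
	for _, tok := range tokens {
		if idx, ok := vocab[tok]; ok {
			vec[idx]++
		}
	}
	for i, c := range vec {
		vec[i] = math.Log1p(c) // log(1 + count)
	}
	return vec
}

For "refund for duplicate payment" the raw counts are [1, 1, 0, 0], and after log1p they become roughly [0.69, 0.69, 0, 0].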

Feature type 2: Keyword flags

Keyword flags are even simpler. For each important phrase, we ask:

Is this present or not?

Examples of keywords in the project:

  • refund
  • cancel
  • not working
  • pricing
  • demo
  • api
  • refund chahiye (Hinglish for "want a refund")

If a keyword appears, its flag becomes 1. Otherwise, it stays 0.

Why keep keyword flags if we already have bag-of-words? Because business signals often deserve direct emphasis.

For example:

  • refund
  • close account
  • pricing

are not just words. They are operationally meaningful patterns.

Plain-English version:

Keyword flags are the model’s "red flag" and "green flag" indicators.
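
A minimal sketch of keyword flags in Go, assuming simple substring matching over the normalized text (the keyword list is illustrative):

import "strings"

// keywords are domain phrases that deserve a dedicated 0/1 signal.
var keywords = []string{"refund", "cancel", "not working", "pricing", "demo", "api"}

// KeywordFlags returns one flag per keyword: 1 if the phrase occurs in the
// normalized text, 0 otherwise. Multi-word phrases like "not working" work
// because we match against the full string, not individual tokens.
func KeywordFlags(text string) []float64 {
	flags := make([]float64, len(keywords))
	for i, kw := range keywords {
		if strings.Contains(text, kw) {
			flags[i] = 1
		}
	}
	return flags
}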

Feature type 3: Token embeddings

This is the part that usually sounds magical, but it can be explained simply. An embedding is just a learned vector for a token.

Instead of saying:

the word refund is token 127

we say:

the word refund also has a learned numeric representation that captures patterns from training

In this project:

  • each token gets an ID
  • the ID looks up a small embedding vector
  • the ticket’s embedding vectors are averaged

So if a sentence has tokens:

["refund", "money", "not", "received"]

the model looks up four vectors and averages them.

Plain-English version:

Bag-of-words tells us what words were present.
Embeddings help the model learn what kinds of words behave similarly.

Technical term:

This is embedding lookup with average pooling.
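
Here is a sketch of embedding lookup with average pooling in Go, assuming a small embedding table indexed by token ID and an <unk> fallback for unknown tokens. The IDs and vector values are made up for illustration; real values come from training:

// tokenID maps tokens to rows of the embedding table; 0 is reserved for <unk>.
var tokenID = map[string]int{"<unk>": 0, "refund": 1, "money": 2, "not": 3, "received": 4}

// embeddings holds one learned vector per token ID (tiny 4-dim vectors here).
var embeddings = [][]float64{
	{0, 0, 0, 0},           // <unk>
	{0.9, 0.1, -0.2, 0.3},  // refund
	{0.7, 0.2, -0.1, 0.4},  // money
	{-0.3, 0.8, 0.1, -0.5}, // not
	{0.1, 0.6, 0.5, -0.2},  // received
}

// PooledEmbedding looks up each token's vector (falling back to <unk>)
// and averages the vectors element-wise.
func PooledEmbedding(tokens []string) []float64 {
	dim := len(embeddings[0])
	pooled := make([]float64, dim)
	if len(tokens) == 0 {
		return pooled
	}
	for _, tok := range tokens {
		id, ok := tokenID[tok]
		if !ok {
			id = tokenID["<unk>"]
		}
		for d, v := range embeddings[id] {
			pooled[d] += v
		}
	}
	for d := range pooled {
		pooled[d] /= float64(len(tokens))
	}
	return pooled
}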

Why average the embeddings?

Because this is a tiny model. We are not building a giant sequence model with attention. We are building something cheap, fast, and deployable in pure Go.

Averaging embeddings gives us:

  • low cost
  • low complexity
  • useful semantic signal

without needing a much bigger model.

This is a recurring theme in the whole project:

choose the smallest method that solves the problem well enough

The final feature vector

After building all three parts, we join them together:

[ bag_of_words | keyword_flags | pooled_embedding ]

That becomes the input to the neural network.

Plain-English version:

We combine direct word evidence, domain-specific business hints, and learned semantic context into one numeric summary.
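
The join itself is just appending three slices into one. This sketch composes the helper functions from the earlier sections (the names are mine, not the project's):

// BuildFeatureVector concatenates the three feature families into the
// single input vector the network sees.
func BuildFeatureVector(text string) []float64 {
	norm := Normalize(text)
	tokens := Tokenize(norm)

	bow := BagOfWords(tokens)
	flags := KeywordFlags(norm)
	emb := PooledEmbedding(tokens)

	vec := make([]float64, 0, len(bow)+len(flags)+len(emb))
	vec = append(vec, bow...)
	vec = append(vec, flags...)
	vec = append(vec, emb...)
	return vec
}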

Why not just use one representation?

This is worth pausing on.

If we only used bag-of-words:

  • we would miss softer semantic patterns

If we only used embeddings:

  • we might weaken explicit domain phrases like refund or pricing

If we only used keyword flags:

  • the model would be too brittle and too dependent on hand-written rules

The hybrid setup works because the three feature families complement each other.

That is a very useful design pattern in applied AI:

let simple signals and learned signals work together

A backend analogy

If you come from backend engineering, think of the feature vector like a request context object.

It contains:

  • raw facts
  • derived facts
  • domain-specific hints

No single field tells the whole story. But together, they make downstream decision logic much stronger.

Why the Go inference engine had to mirror this exactly

This part matters more than many beginners expect.

The production Go service cannot do "approximately the same preprocessing."
It has to do the same preprocessing.

If training used:

  • Hinglish normalization
  • log1p bag-of-words
  • <unk> token fallback
  • max token truncation

then inference has to do those too.

Otherwise you get a mismatch:

the model learned one world, but production serves another

That is why the exported artifact includes preprocessing metadata, vocabularies, keywords, and embedding info.
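
The exact export format is not shown here, but conceptually the artifact has to carry everything the Go service must reproduce. A hypothetical shape, purely for illustration:

// Artifact is a hypothetical shape for the exported model bundle; the real
// project's format may differ, but it needs the same ingredients.
type Artifact struct {
	Vocab      map[string]int `json:"vocab"`      // bag-of-words vocabulary
	Keywords   []string       `json:"keywords"`   // keyword-flag phrases
	TokenIDs   map[string]int `json:"token_ids"`  // embedding lookup table keys
	Embeddings [][]float64    `json:"embeddings"` // one vector per token ID
	MaxTokens  int            `json:"max_tokens"` // truncation limit used in training
	UnkToken   string         `json:"unk_token"`  // fallback token, e.g. "<unk>"
	// ... plus the network weights themselves
}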

If you only remember one thing from this article

Remember this:

AI models do not work on "text."
They work on representations of text.

The quality of that representation often matters as much as the model itself.

What comes next

In Part 4, we will take this feature vector and pass it into the actual neural network.

That is where we will cover:

  • dense layers
  • ReLU
  • shared base + multiple heads
  • loss functions
  • class weights
  • validation metrics
  • early stopping

In plain language first, of course.
