TAQİ EDDİNE EL MAMOUNİ

🧠 NLP: From Tokenization to Vectorization (with Practical Insights)

Natural Language Processing (NLP) bridges the gap between human language and machine intelligence. In this blog, we’ll explore foundational steps such as tokenization, stemming, lemmatization, and vectorization, along with modern tools like Transformers. Whether you're just starting out or want a refresher, this is your guide to transforming raw text into a machine-readable format.

1. 🔤 Tokenization
Tokenization breaks text into smaller units — called tokens — that could be words, subwords, or sentences.

🔹 Word Tokenization

Input: "Natural Language Processing"
Tokens: ["Natural", "Language", "Processing"]

🔹 Sentence Tokenization

Input: "NLP is fascinating. It has endless applications!"
Tokens: ["NLP is fascinating.", "It has endless applications!"]
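
A minimal sketch with NLTK (assuming nltk is installed and the punkt tokenizer data has been downloaded):

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt", quiet=True)  # tokenizer data; newer NLTK versions may also need "punkt_tab"

text = "NLP is fascinating. It has endless applications!"
print(word_tokenize(text))  # ['NLP', 'is', 'fascinating', '.', 'It', 'has', 'endless', 'applications', '!']
print(sent_tokenize(text))  # ['NLP is fascinating.', 'It has endless applications!']
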
2. ✂️ Stemming
Stemming reduces words to their root by stripping prefixes/suffixes — but it may not always produce real words.
Words: "running", "runs", "runner"
Stems: "run", "run", "runner"

🟢 Use Case: Fast text indexing and search systems.
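
A quick sketch with NLTK's PorterStemmer (nltk assumed installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runs", "runner"]
print([stemmer.stem(w) for w in words])  # ['run', 'run', 'runner']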

3. 🧬 Lemmatization
Lemmatization brings words to their proper dictionary root (lemma) using morphological analysis.
Words: "running", "ran", "runs"
Lemmas: "run", "run", "run"

🟢 Use Case: Sentiment analysis, text classification.
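
A sketch with NLTK's WordNetLemmatizer (assumes the wordnet data has been downloaded). Note that the part-of-speech hint matters: with the default (noun), "running" would come back unchanged.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lemma dictionary

lemmatizer = WordNetLemmatizer()
words = ["running", "ran", "runs"]
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # ['run', 'run', 'run'] (pos="v" treats them as verbs)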

4. 🛑 Stop Word Removal
Stop words are common words like “the”, “is”, “and” that are usually removed before analysis.
Input: "AI is transforming the world."
Output: "AI transforming world"
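
A sketch using NLTK's built-in English stop word list (assumes the stopwords and punkt data are downloaded):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("AI is transforming the world.")
filtered = [t for t in tokens if t.lower() not in stop_words and t.isalpha()]
print(" ".join(filtered))  # "AI transforming world"
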
5. 🏷️ Part-of-Speech (POS) Tagging
This tags each word with its grammatical role: noun, verb, adjective, etc.
Input: "AI transforms industries."
Output: [('AI', 'NNP'), ('transforms', 'VBZ'), ('industries', 'NNS')]
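
A sketch with NLTK's default perceptron tagger (assumes the tagger data is downloaded; the data package name differs across NLTK versions):

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)  # may be "averaged_perceptron_tagger_eng" in newer NLTK

tokens = word_tokenize("AI transforms industries.")
print(nltk.pos_tag(tokens))
# e.g. [('AI', 'NNP'), ('transforms', 'VBZ'), ('industries', 'NNS'), ('.', '.')]
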
6. 🔢 Text Normalization (Often Skipped but Important!)
Before further processing, normalize the text:

Lowercasing

Removing punctuation/numbers

Removing extra spaces

import re

text = "AI is Changing the WORLD! 2025."
clean = re.sub(r"[^a-zA-Z\s]", "", text.lower())  # drop punctuation and digits
clean = re.sub(r"\s+", " ", clean).strip()        # collapse extra whitespace
# Result: "ai is changing the world"
7. 🔠 TF-IDF (Vectorization)
TF-IDF weighs each word by how often it appears in a document relative to how common it is across all documents, so distinctive words get higher scores.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["AI is the future", "AI transforms industries"]
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)  # sparse matrix: one row per document, one column per term

print(tfidf.get_feature_names_out())  # vocabulary learned from the corpus
8. 🌐 Word Embeddings (Word2Vec, GloVe, FastText)
These convert words to dense vectors with semantic meaning.

| Model | Description |
| --- | --- |
| Word2Vec | Learns word vectors from surrounding context |
| GloVe | Combines local context with global co-occurrence statistics |
| FastText | Captures subword information (e.g., prefixes and suffixes) |

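As a toy sketch, here is how training Word2Vec looks with gensim (gensim assumed installed; a real model needs a far larger corpus than two sentences):

from gensim.models import Word2Vec

# tiny toy corpus: each document is a list of tokens
sentences = [["ai", "transforms", "industries"], ["ai", "is", "the", "future"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=42)
print(model.wv["ai"].shape)                 # (50,) dense vector for "ai"
print(model.wv.most_similar("ai", topn=2))  # nearest neighbors (not meaningful on a toy corpus)
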
9. 🤖 Transformers (BERT, RoBERTa, GPT)
Modern NLP uses transformer-based models that understand context much better than traditional methods.
from transformers import pipeline

clf = pipeline("sentiment-analysis")  # downloads a default model on first run
print(clf("I love NLP and transformers!"))
# e.g. [{'label': 'POSITIVE', 'score': ...}]

🟢 Use Cases: Sentiment analysis, question answering, summarization, translation, etc.

10. 🔧 Build a Simple NLP Pipeline (Practical Example)

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

model = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression())
])

X = ["I love this product", "This is terrible"]
y = [1, 0]  # 1 = positive, 0 = negative

model.fit(X, y)
print(model.predict(["Awesome experience"]))  # e.g. [1]; with only two training sentences the prediction is illustrative, not reliable

🚀 Where to Go From Here?
| Topic | Description |
| --- | --- |
| 🔍 NER | Recognize names, organizations, locations |
| 🧩 Dependency Parsing | Understand how words relate |
| 🏷️ Text Classification | Categorize emails, reviews, etc. |
| 📚 Topic Modeling | Discover themes in documents |
| 🤖 Transformers | BERT, GPT for deep understanding |
| 📝 Summarization | Shorten long documents |
| 💬 Chatbots | Build intelligent assistants |
| 🛠 NLP Project | Use spaCy, NLTK, or HuggingFace to combine all steps |
