AJAYA SHRESTHA

NLP: Tokenization to Vectorization

Natural Language Processing (NLP) is the field that bridges human language and computer intelligence. In this blog, we'll walk through the crucial steps, from basics like tokenization, stemming, and lemmatization to vectorization, and see how text data is transformed into machine-readable formats. Let's break down each foundational technique.

1. Tokenization

Tokenization is the process of breaking text into smaller units called tokens. These tokens can be words, sentences, or even subwords; both word- and sentence-level tokenization are sketched in code after the examples below.

  • Word Tokenization: Splits text into individual words.
# Example
Input: "Natural Language Processing"
Tokens: ["Natural", "Language", "Processing"]
  • Sentence Tokenization: Divides text into sentences, essential for tasks like summarization.
# Example
Input: "NLP is fascinating. It has endless applications!"
Tokens: ["NLP is fascinating.", "It has endless applications!"]
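In practice, both tokenizers are one-liners with NLTK. Here's a minimal sketch, assuming the nltk package is installed and its 'punkt' tokenizer models have been downloaded:

# Minimal sketch using NLTK (assumes nltk is installed and the
# 'punkt' tokenizer models are available)
import nltk
nltk.download("punkt", quiet=True)  # one-time model download
from nltk.tokenize import word_tokenize, sent_tokenize

text = "NLP is fascinating. It has endless applications!"
print(word_tokenize(text))
# ['NLP', 'is', 'fascinating', '.', 'It', 'has', 'endless', 'applications', '!']
print(sent_tokenize(text))
# ['NLP is fascinating.', 'It has endless applications!']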

2. Stemming

Stemming reduces words to their root forms by removing suffixes or prefixes. It’s fast but can produce roots that aren’t actual words.

# Example:
Words: "running", "runs", "runner"
Stems: "run", "run", "runner"
# Use Case: Information retrieval, indexing.
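NLTK's PorterStemmer can reproduce the example above; a minimal sketch, assuming nltk is installed:

# Minimal sketch using NLTK's Porter stemmer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "runner"]:
    print(word, "->", stemmer.stem(word))
# running -> run, runs -> run, runner -> runner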

3. Lemmatization

Lemmatization reduces words to their actual base form (lemma) using vocabulary and morphological analysis. It’s more accurate than stemming.

# Example:
Words: "running", "runs", "ran"
Lemmas: "run", "run", "run"
# Use Case: Sentiment analysis, chatbots.
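The same comparison with NLTK's WordNetLemmatizer; a minimal sketch, assuming nltk is installed and the 'wordnet' corpus has been downloaded. Note the part-of-speech hint, without which the lemmatizer treats every word as a noun:

# Minimal sketch using NLTK's WordNet lemmatizer
import nltk
nltk.download("wordnet", quiet=True)  # one-time corpus download
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["running", "runs", "ran"]:
    # pos="v" marks the words as verbs; the default (noun) would
    # leave "running" and "ran" unchanged
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
# running -> run, runs -> run, ran -> run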

4. Stop Word Removal

Stop words are common words (like "the", "and", "is") that occur frequently but carry little semantic meaning, so they tend to clutter text analysis.

Example:
Original: "AI is changing the world and transforming industries."
After Removal: "AI changing world transforming industries."
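With NLTK's built-in English stop word list, removal is a simple filter; a minimal sketch, assuming nltk is installed along with its 'stopwords' corpus and 'punkt' tokenizer models:

# Minimal sketch of stop word removal with NLTK
import nltk
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
text = "AI is changing the world and transforming industries."
filtered = [w for w in word_tokenize(text) if w.lower() not in stop_words]
print(filtered)
# ['AI', 'changing', 'world', 'transforming', 'industries', '.']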

5. Part-of-Speech (POS) Tagging

POS tagging classifies words based on grammatical categories (e.g., noun, verb, adjective). This enhances NLP tasks by adding grammatical context to text.

Example:
Input: "AI transforms industries."
POS Tags: [('AI', 'NNP'), ('transforms', 'VBZ'), ('industries', 'NNS'), ('.', '.')]
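NLTK's default tagger can reproduce the example above; a minimal sketch, assuming nltk is installed along with its 'punkt' and 'averaged_perceptron_tagger' models:

# Minimal sketch of POS tagging with NLTK
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("AI transforms industries.")))
# [('AI', 'NNP'), ('transforms', 'VBZ'), ('industries', 'NNS'), ('.', '.')]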

Common POS Tags:

  • NN: Noun, singular or mass
  • VB: Verb, base form
  • JJ: Adjective
  • RB: Adverb

6. Embeddings (Vectorization)

Embeddings convert words into continuous vectors, capturing semantic meaning and relationships between words.

Common Models:

  • Word2Vec: Learns embeddings from the contexts words appear in (a code sketch follows this list).
  • GloVe: Combines local context (the Word2Vec approach) with global co-occurrence statistics from large corpora.
  • FastText: Extends embeddings with subword (character n-gram) information, which helps with rare words and multilingual contexts.
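To get a feel for Word2Vec, here is a minimal sketch using the gensim library (my choice of toolkit, not prescribed by the post) on a toy corpus; real embeddings need far more data:

# Minimal sketch: training a tiny Word2Vec model with gensim
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of pre-tokenized words
sentences = [
    ["nlp", "bridges", "language", "and", "machines"],
    ["embeddings", "capture", "semantic", "meaning"],
    ["word2vec", "learns", "embeddings", "from", "context"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv["embeddings"][:5])           # first 5 dimensions of the vector
print(model.wv.most_similar("embeddings"))  # nearest words by cosine similarity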

Why Embeddings Matter:

  • Enables models to interpret semantic relationships (e.g., synonyms, antonyms, analogies).
  • Fundamental for deep learning NLP tasks such as text classification, sentiment analysis, and translation.

Mastering core NLP techniques like Tokenization, Stemming, Lemmatization, Stop Word Removal, POS Tagging, and Embeddings provides a strong foundation for advanced text analysis. With these basics, you're ready to dive deeper into NLP's exciting complexities.

Recommended Next Approaches:

  • NER: Detect names, places, organizations in text.
  • Dependency Parsing: Understand word relationships.
  • Text Classification: Categorize text (e.g., spam, sentiment).
  • Topic Modeling: Uncover hidden themes in documents.
  • Transformers (e.g., BERT): Use advanced models for deep language understanding.
  • Summarization: Create concise versions of long texts.
  • Q&A and Chatbots: Build systems that answer questions.
  • Text Generation: Generate human-like content automatically.
  • Build an NLP Pipeline: Apply all basics using NLTK, spaCy, or Hugging Face.
