Nithin Bharadwaj
Natural Language Processing with Python: 8 Text Analysis Techniques You Need to Know


Let's talk about making sense of words with a computer. This is what we call Natural Language Processing, or NLP. It's how we teach machines to read text, understand sentiment, or even chat with us. Python is my go-to tool for this, and I want to share some of the most effective ways I use it to work with text.

Think of raw text like a messy, cluttered room. Before you can find anything useful, you need to clean and organize. Text preprocessing is that essential first sweep. It's about taking a sentence like "The QUICK brown foxes are jumping over the lazy dogs!" and turning it into a neat list of clean words: ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog'].

We do this by following a few steps. First, we make everything lowercase so "The" and "the" are seen as the same word. Then, we remove punctuation and numbers that usually don't help with meaning. Next, we break the sentence into individual words or tokens, a process called tokenization. After that, we filter out common words like "the," "are," and "over" – these are stop words that add little value. Finally, we reduce words to their base form. "Jumping" becomes "jump," and "foxes" becomes "fox." This last step can be done through stemming, which is a crude cut, or lemmatization, which is smarter and uses a dictionary to find the root word.

Here is a class I often use to handle this entire process. It lets me choose which steps to apply.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string
import spacy

# One-time downloads for the necessary data files
nltk.download('punkt')
nltk.download('punkt_tab')   # needed by newer NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')

class TextPreprocessor:
    def __init__(self, use_spacy=False):
        # Load common English words to filter out later
        self.stop_words = set(stopwords.words('english'))
        # Tools for reducing words to stems or roots
        self.stemmer = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()
        self.use_spacy = use_spacy
        # spaCy is a powerful industrial-strength library
        if use_spacy:
            # You'd need to run: python -m spacy download en_core_web_sm
            self.nlp = spacy.load('en_core_web_sm')

    def clean_text(self, text):
        # Standardize case
        text = text.lower()
        # Strip out punctuation characters
        text = text.translate(str.maketrans('', '', string.punctuation))
        # Remove digits - sometimes you keep them, sometimes not.
        text = ''.join([char for char in text if not char.isdigit()])
        return text

    def tokenize(self, text):
        # Use spaCy's accurate tokenizer if requested
        if self.use_spacy:
            doc = self.nlp(text)
            return [token.text for token in doc]
        else:
            # Use NLTK's standard tokenizer
            return word_tokenize(text)

    def remove_stopwords(self, token_list):
        # Filter out any token that's in our stop words list
        return [token for token in token_list if token not in self.stop_words]

    def stem_tokens(self, token_list):
        # Apply stemming: "jumping" -> "jump", "running" -> "run"
        return [self.stemmer.stem(token) for token in token_list]

    def lemmatize_tokens(self, token_list):
        # Apply lemmatization (noun by default): "mice" -> "mouse", "foxes" -> "fox"
        return [self.lemmatizer.lemmatize(token) for token in token_list]

    def preprocess(self, text, steps=('clean', 'tokenize', 'remove_stopwords', 'lemmatize')):
        # A pipeline to run the text through selected steps
        processed_text = text
        if 'clean' in steps:
            processed_text = self.clean_text(processed_text)

        tokens = self.tokenize(processed_text)

        if 'remove_stopwords' in steps:
            tokens = self.remove_stopwords(tokens)
        if 'stem' in steps:
            tokens = self.stem_tokens(tokens)
        if 'lemmatize' in steps:
            tokens = self.lemmatize_tokens(tokens)

        return tokens

# Let's see it in action
print("=== Example 1: Basic NLTK ===")
basic_preprocessor = TextPreprocessor(use_spacy=False)
sample_text = "The quick brown foxes are jumping over the lazy dogs. They're having fun!"
result_tokens = basic_preprocessor.preprocess(sample_text)
print(f"Original: {sample_text}")
print(f"Cleaned Tokens: {result_tokens}")
print()

print("=== Example 2: Using spaCy ===")
# spaCy is great for accuracy and linguistic features
spacy_preprocessor = TextPreprocessor(use_spacy=True)
# Skip the clean step to let spaCy's tokenizer handle punctuation its own way.
# Note: the class uses spaCy only for tokenization; lemmatization still runs
# through NLTK's WordNetLemmatizer.
spacy_tokens = spacy_preprocessor.preprocess(sample_text, steps=['tokenize', 'lemmatize'])
print(f"spaCy-tokenized, lemmatized tokens: {spacy_tokens}")
print()

print("=== Example 3: Trying Stemming ===")
# Compare stemming vs lemmatization on a tricky sentence
test_sentence = "I believed the stories about the running mice and the better worlds."
stem_result = basic_preprocessor.preprocess(test_sentence, steps=['clean', 'tokenize', 'stem'])
lemma_result = basic_preprocessor.preprocess(test_sentence, steps=['clean', 'tokenize', 'lemmatize'])
print(f"Test Sentence: {test_sentence}")
print(f"Stemmed: {stem_result}")
print(f"Lemmatized: {lemma_result}")

Running this code shows the transformation. The first example produces something like ['quick', 'brown', 'fox', 'jumping', 'lazy', 'dog', 'theyre', 'fun']. Notice that "jumping" survives: WordNetLemmatizer treats every token as a noun unless you pass a POS tag, so verbs slip through. The spaCy example shows raw tokenization, punctuation and all, since we skipped the cleaning and stop word steps. The third example highlights the trade-off: stemming crudely chops "believed" to "believ" and "stories" to "stori", while lemmatization keeps real words, turning "stories" into "story" (verbs like "believed" only reduce to "believe" if you supply pos='v'). Choosing between them depends on your task; speed favors stemming, meaning favors lemmatization.

Once your text is clean, the next question is how to represent it for a machine. Machines don't understand words; they understand numbers. My second technique is about creating a simple but powerful numerical representation called Bag of Words.

Imagine you take all the unique words from a set of documents and throw them into a bag. For each document, you then create a vector—a list of numbers—that counts how many times each word from the bag appears in it. The order of the words is lost, hence the name "bag." It's simple but surprisingly effective for tasks like spam detection or topic categorization.

Let's build one from scratch to see how it works, then use a professional tool.

from collections import Counter
import pandas as pd

# Our simple corpus: three tiny documents
documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The cat and the dog are friends."
]

def simple_bag_of_words(docs):
    """
    Creates a basic Bag of Words model manually.
    """
    # Step 1: Preprocess each document (using our earlier class)
    prep = TextPreprocessor(use_spacy=False)
    processed_docs = []
    for doc in docs:
        tokens = prep.preprocess(doc, steps=['clean', 'tokenize', 'remove_stopwords'])
        processed_docs.append(tokens)

    # Step 2: Build the vocabulary - the set of all unique words
    vocabulary = set()
    for doc_tokens in processed_docs:
        vocabulary.update(doc_tokens)
    vocabulary = sorted(list(vocabulary)) # Sort for consistency
    print(f"Our Vocabulary: {vocabulary}")

    # Step 3: Create vectors by counting word occurrences
    vectors = []
    for doc_tokens in processed_docs:
        word_counts = Counter(doc_tokens)
        # Create a vector with a count for each word in the vocabulary
        vector = [word_counts.get(word, 0) for word in vocabulary]
        vectors.append(vector)

    return vocabulary, vectors

vocab, bow_vectors = simple_bag_of_words(documents)
print("\nManual Bag of Words Vectors:")
for i, vec in enumerate(bow_vectors):
    print(f"Doc {i+1}: {vec}")

# Now, let's do it the easy, professional way with scikit-learn
print("\n" + "="*50 + "\n")
print("Using scikit-learn's CountVectorizer:\n")

from sklearn.feature_extraction.text import CountVectorizer

# Re-create the documents (without preprocessing, CountVectorizer will do it)
raw_docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The cat and the dog are friends."
]

# Initialize the vectorizer. We can pass parameters to mimic our preprocessing.
vectorizer = CountVectorizer(lowercase=True, stop_words='english')
# 'fit_transform' learns the vocabulary and creates the vectors in one step
X = vectorizer.fit_transform(raw_docs)

# Convert the result to a readable array
bow_array = X.toarray()
feature_names = vectorizer.get_feature_names_out()

print("Vocabulary (Feature Names):", feature_names)
print("\nDocument-Term Matrix:")
# Use pandas for a nice table view
df = pd.DataFrame(bow_array, columns=feature_names, index=[f"Doc {i+1}" for i in range(len(raw_docs))])
print(df)

The manual function shows the process: we get a vocabulary like ['cat', 'chased', 'dog', 'friends', 'mat', 'sat']. Document 1 "The cat sat on the mat." becomes the vector [1, 0, 0, 0, 1, 1], showing counts for 'cat', 'mat', and 'sat'. The scikit-learn output presents this as a neat table, which is much more practical for real work.

The Bag of Words model has a flaw. Common words like "the" or "is" will have very high counts across all documents, drowning out rarer, more interesting words. This leads me to the third technique: TF-IDF, or Term Frequency-Inverse Document Frequency.

TF-IDF fixes the common-word problem by adjusting the weight of a word. It considers not just how often a word appears in a document (Term Frequency), but also how unique that word is across all documents (Inverse Document Frequency). A word that appears in every document gets a low score. A word that appears many times in one document but rarely elsewhere gets a very high score. This makes TF-IDF excellent for search engines and keyword extraction.

Here’s how you calculate and use it.

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Same sample documents
documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The cat and the dog are friends."
]

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')

# Learn vocabulary and transform documents into TF-IDF features
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get feature names and the dense matrix
feature_names = tfidf_vectorizer.get_feature_names_out()
dense_matrix = tfidf_matrix.toarray()

print("TF-IDF Feature Names:", feature_names)
print("\nTF-IDF Matrix:")
tfidf_df = pd.DataFrame(np.round(dense_matrix, 3), columns=feature_names, index=[f"Doc {i+1}" for i in range(len(documents))])
print(tfidf_df)

# Let's interpret one score manually for Doc 1, word 'cat'
print("\n--- Manual Check for 'cat' in Document 1 ---")
# Term Frequency (TF) for 'cat' in Doc 1: count of 'cat' / total words in doc
doc1_words = ["cat", "sat", "mat"] # after preprocessing
tf_cat = doc1_words.count('cat') / len(doc1_words)
print(f"TF('cat' in Doc1) = {doc1_words.count('cat')} / {len(doc1_words)} = {tf_cat:.3f}")

# Inverse Document Frequency (IDF) for 'cat'
# Number of documents
N = len(documents)
# Number of documents containing 'cat' (a substring check is fine for this toy corpus)
docs_with_cat = sum(1 for doc in documents if 'cat' in doc.lower())
# A common formula for IDF is log( (N+1) / (docs_with_word + 1) ) + 1 (sklearn uses a smoothed version)
idf_cat = np.log((N + 1) / (docs_with_cat + 1)) + 1
print(f"IDF('cat') = log(({N}+1)/({docs_with_cat}+1)) + 1 = {idf_cat:.3f}")

# TF-IDF (note: scikit-learn also L2-normalizes each document vector,
# so the raw product below won't match its output exactly)
manual_tfidf_cat = tf_cat * idf_cat
print(f"Manual TF-IDF('cat', Doc1) ≈ {manual_tfidf_cat:.3f}")
print(f"scikit-learn's value for 'cat' in Doc1 (after normalization): {dense_matrix[0][list(feature_names).index('cat')]:.3f}")

The TF-IDF table will show scores between 0 and 1. You'll notice that "cat" and "dog" have decent scores because they are common but not in every document. A word like "chased," which is unique to document 2, will have a very high score for that document, highlighting its importance as a distinguishing keyword.

These vector representations—Bag of Words and TF-IDF—turn text into a grid of numbers. This grid is often called a document-term matrix. It's the bridge that allows us to apply all the standard machine learning algorithms we use for numerical data to the world of text.
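To show why that bridge matters, here is a small sketch that compares the same toy documents with cosine similarity over their TF-IDF vectors, a building block of search and clustering:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The cat and the dog are friends.",
]

# Vectorize, then compare every document against every other one
X = TfidfVectorizer(stop_words='english').fit_transform(docs)
sim = cosine_similarity(X)

print(sim.round(3))
# Each row/column is a document; the diagonal is 1.0 (a document is
# identical to itself). Docs 2 and 3 share both 'cat' and 'dog', so they
# score higher against each other than either does against doc 1.
```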

Working with individual words is powerful, but language is about sequences and context. My fourth technique explores capturing word relationships through N-grams.

An N-gram is simply a contiguous sequence of N items from a text. If N=1, it's a unigram (single words). If N=2, it's a bigram (pairs of words), and N=3 is a trigram. The phrase "the quick brown fox" has unigrams: ["the", "quick", "brown", "fox"]; bigrams: ["the quick", "quick brown", "brown fox"]; trigrams: ["the quick brown", "quick brown fox"].

Bigrams and trigrams help us capture phrases and idioms. "New York" as a bigram has a meaning different from just "new" and "york" separately. "Not good" conveys the opposite sentiment of "good". By adding N-grams to our Bag of Words or TF-IDF model, we give our machine a better sense of context.

Let's see how to generate them and add them to our vectorizer.

from nltk import ngrams
from sklearn.feature_extraction.text import CountVectorizer

sample_sentence = "natural language processing is fascinating and natural learning is key"
tokens = word_tokenize(sample_sentence.lower())

print("Original Tokens:", tokens)
print()

# Generate bigrams and trigrams using NLTK
print("Generating N-grams with NLTK:")
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))
print(f"Bigrams: {[' '.join(bg) for bg in bigrams]}")
print(f"Trigrams: {[' '.join(tg) for tg in trigrams]}")
print()

# The practical way: Use CountVectorizer with an ngram_range parameter
print("Using CountVectorizer with N-gram ranges:\n")

corpus = [
    "I love natural language processing.",
    "I hate boring language tutorials.",
    "Natural learning is the best processing."
]

# This vectorizer will create features for unigrams AND bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
X_ngram = vectorizer.fit_transform(corpus)
ngram_features = vectorizer.get_feature_names_out()
ngram_df = pd.DataFrame(X_ngram.toarray(), columns=ngram_features, index=[f"Text {i+1}" for i in range(3)])

print("Vocabulary includes unigrams and bigrams:")
print(ngram_features)
print("\nDocument-Term Matrix with N-grams:")
print(ngram_df)

# Let's check a specific phrase
print("\n--- Checking for the phrase 'natural language' ---")
if 'natural language' in ngram_features:
    col_index = list(ngram_features).index('natural language')
    print(f"'natural language' is feature #{col_index}")
    print(f"Counts per document: {ngram_df.iloc[:, col_index].values}")
else:
    print("Phrase not found as a bigram feature.")

The output shows our vocabulary now includes single words and pairs. You'll see features like love natural and natural language. In the matrix, you can see that "natural language" appears once in the first document. This simple addition can dramatically improve model performance for tasks like sentiment analysis, where "not happy" is critical to catch.

So far, we've treated words as independent symbols. But words have relationships. "King" is to "man" as "queen" is to "woman." The fifth technique, Word Embeddings, captures these semantic meanings by representing each word as a dense vector in a high-dimensional space, where similar words are close together.

Tools like Word2Vec, GloVe, and FastText create these embeddings by learning from massive amounts of text. The word "python" might be represented by a 300-dimensional vector like [0.25, -0.1, 0.87, ...]. We can then do math with words: vector('king') - vector('man') + vector('woman') results in a vector very close to vector('queen').

Let's load pre-trained GloVe embeddings and explore these relationships.

import numpy as np
from scipy.spatial.distance import cosine

# In practice, you download a GloVe file (e.g., 'glove.6B.100d.txt')
# Let's simulate a tiny embedding space for demonstration
print("=== Simulating a Mini Word Embedding Space ===")

# A tiny dictionary of word vectors (3 dimensions for simplicity)
embedding_index = {
    'king':    np.array([0.8, 0.2, 0.1]),
    'man':     np.array([0.6, 0.3, 0.1]),
    'queen':   np.array([0.7, 0.9, 0.1]),
    'woman':   np.array([0.5, 0.8, 0.1]),
    'python':  np.array([0.1, 0.1, 0.9]),
    'java':    np.array([0.15, 0.1, 0.85]),
    'london':  np.array([0.9, 0.0, 0.0]),
    'paris':   np.array([0.88, 0.05, 0.01]),
}

def find_analogy(word_a, word_b, word_c, embeddings):
    """Finds word_d such that a:b :: c:d."""
    try:
        vec_a = embeddings[word_a]
        vec_b = embeddings[word_b]
        vec_c = embeddings[word_c]
        # The analogy calculation
        vec_d = vec_b - vec_a + vec_c

        # Find the word in our vocab whose vector is closest to vec_d
        best_word = None
        best_sim = -1 # Cosine similarity ranges from -1 to 1
        for word, vec in embeddings.items():
            if word in [word_a, word_b, word_c]:
                continue # Skip the input words
            # Cosine similarity: 1 means identical, 0 means orthogonal
            sim = 1 - cosine(vec_d, vec)
            if sim > best_sim:
                best_sim = sim
                best_word = word
        return best_word, best_sim
    except KeyError as e:
        return f"Word not in vocabulary: {e}", None

# Try the classic analogy
result, sim = find_analogy('king', 'man', 'woman', embedding_index)
print(f"'king' is to 'man' as 'woman' is to '{result}' (similarity: {sim:.3f})")

# Check similarity between related and unrelated words
print("\n--- Cosine Similarities ---")
def print_sim(word1, word2, emb):
    sim = 1 - cosine(emb[word1], emb[word2])
    print(f"cosine_sim('{word1}', '{word2}') = {sim:.3f}")

print_sim('king', 'queen', embedding_index)
print_sim('man', 'woman', embedding_index)
print_sim('python', 'java', embedding_index)
print_sim('king', 'python', embedding_index)
print_sim('london', 'paris', embedding_index)

# Practical use: Convert a sentence to an embedding by averaging its word vectors
print("\n--- Sentence Embedding by Averaging ---")
sentence = "king and queen in london"
words = [w for w in sentence.lower().split() if w in embedding_index]
if words:
    sentence_vector = np.mean([embedding_index[w] for w in words], axis=0)
    print(f"Sentence: '{sentence}'")
    print(f"Words found: {words}")
    print(f"Averaged Sentence Vector: {np.round(sentence_vector, 3)}")

In our mini example, the analogy function should suggest "queen." The similarities show high scores for related pairs (king/queen, london/paris) and a lower score for unrelated ones (king/python). Averaging word vectors to get a sentence vector is a common, simple way to represent longer text with embeddings.

For real projects, you don't simulate. You load a large pre-trained file. Here's a template for using real GloVe embeddings with Gensim.

# Template for using real GloVe embeddings with Gensim
print("\n=== Template for Real GloVe Embeddings (Requires File Download) ===")
"""
# First, convert GloVe format to Word2Vec format for Gensim
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.word2vec.txt'
glove2word2vec(glove_input_file, word2vec_output_file)

# Then load and use
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

# Now you can use it
similar_words = model.most_similar('python', topn=5)
print(similar_words)  # Might show: [('java', 0.85), ('code', 0.82), ...]

result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)  # Hopefully: [('queen', 0.79...)]
"""
print("(Code commented out as it requires the large GloVe file to be downloaded.)")

Word embeddings give us semantic understanding. The sixth technique builds on this to classify the overall feeling of a text: Sentiment Analysis. Is a product review positive or negative? Is a tweet angry or joyful?

We can approach this with simple rule-based methods using sentiment lexicons (dictionaries of words with pre-assigned scores) or with machine learning models trained on labeled data.

Let's try both. First, a lexicon-based approach with the VADER tool, which is great for social media text. Then, a machine learning approach using our TF-IDF vectors.

# Technique 6: Sentiment Analysis
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import nltk
nltk.download('vader_lexicon')

print("=== Part A: Lexicon-Based with VADER ===")
sid = SentimentIntensityAnalyzer()

sentences = [
    "This movie is absolutely fantastic! I loved every minute of it.",
    "Waste of time. The plot was terrible and the acting was worse.",
    "It was okay. Not great, not terrible.",
    "I'm so excited for the concert!!!",
    "This product broke after two days. Very disappointed."
]

for sentence in sentences:
    scores = sid.polarity_scores(sentence)
    # The 'compound' score aggregates pos, neg, neu into a single value from -1 (neg) to +1 (pos)
    compound = scores['compound']
    sentiment = "positive" if compound >= 0.05 else "negative" if compound <= -0.05 else "neutral"
    print(f"Text: {sentence[:50]}...")
    print(f"  Scores: {scores}")
    print(f"  Judgement: {sentiment} (based on compound: {compound:.3f})\n")

print("="*60)
print("=== Part B: Machine Learning-Based Sentiment ===")
# Let's create a small simulated dataset
print("\n1. Creating a simple dataset...")
# Positive sentences
pos_texts = [
    "I love this product it is amazing",
    "Great service and fast delivery",
    "Excellent quality highly recommend",
    "Very happy with my purchase",
    "Perfect exactly what I wanted"
]
# Negative sentences
neg_texts = [
    "Terrible experience would not buy again",
    "Poor quality broke immediately",
    "Waste of money completely useless",
    "Very disappointed with the service",
    "Bad product do not recommend"
]

texts = pos_texts + neg_texts
# Labels: 1 for positive, 0 for negative
labels = [1]*len(pos_texts) + [0]*len(neg_texts)

print(f"We have {len(texts)} total texts: {len(pos_texts)} positive, {len(neg_texts)} negative.")

print("\n2. Converting text to TF-IDF features...")
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)
print(f"Feature matrix shape: {X.shape}") # (10 documents, X unique words)

print("\n3. Training a simple classifier...")
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42, stratify=labels)
model = LogisticRegression()
model.fit(X_train, y_train)

print("\n4. Making predictions and evaluating...")
y_pred = model.predict(X_test)
print(f"True labels for test set: {y_test}")
print(f"Model predictions:        {y_pred}")
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on test set: {accuracy:.2f}")

print("\n5. Trying the model on new sentences...")
new_sentences = [
    "This is good and works well",
    "I hate it it is awful",
    "It is okay I guess"
]
new_X = tfidf.transform(new_sentences)
new_preds = model.predict(new_X)
sentiment_map = {1: 'POSITIVE', 0: 'NEGATIVE'}
for sent, pred in zip(new_sentences, new_preds):
    print(f"'{sent}' -> {sentiment_map[pred]}")

VADER gives us a detailed breakdown for each sentence, including positive, negative, neutral, and compound scores. The machine learning model, though trained on tiny data, picks up the pattern and should classify the clearly positive and negative examples correctly. Note that a neutral sentence like "It is okay I guess" gets forced into one of the two classes, since the model only knows positive and negative. In reality, you would train on thousands of labeled reviews for a robust system.
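Once you move past a demo, it helps to bundle the vectorizer and classifier so the exact same preprocessing runs at training and prediction time. A sketch using scikit-learn's Pipeline, with a tiny stand-in dataset:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative dataset; a real system needs thousands of labeled examples
texts = [
    "I love this product it is amazing", "Great service and fast delivery",
    "Excellent quality highly recommend", "Very happy with my purchase",
    "Terrible experience would not buy again", "Poor quality broke immediately",
    "Waste of money completely useless", "Very disappointed with the service",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative

# One object now owns both vectorization and classification
sentiment_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", LogisticRegression()),
])
sentiment_clf.fit(texts, labels)

# Raw strings go straight in; the pipeline vectorizes them identically
preds = sentiment_clf.predict([
    "highly recommend great quality",
    "waste of money very disappointed",
])
print(preds)
```

The pipeline can also be pickled as a single artifact, which is how you would ship it.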

Sometimes we need to go beyond words and sentences to understand the structure of a document. My seventh technique is Topic Modeling, specifically Latent Dirichlet Allocation (LDA). It's a way to discover hidden thematic patterns in a large collection of documents. You don't tell it the topics; it figures them out by grouping words that frequently appear together.

If you feed it 10,000 news articles, it might output topics characterized by words like ['election', 'vote', 'candidate', 'poll'] (Politics), ['stock', 'market', 'price', 'trade'] (Finance), and ['player', 'game', 'team', 'score'] (Sports). Each document is then seen as a mixture of these topics.

Let's apply LDA to a small set of fake news headlines to see it in action.

# Technique 7: Topic Modeling with LDA
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

print("=== Discovering Topics with LDA ===")

# A small corpus of fake headlines
corpus = [
    "stock market reaches all time high today",
    "investors celebrate rising share prices",
    "new election poll shows close race",
    "candidates debate economic policy",
    "football team wins championship final",
    "basketball player scores record points",
    "central bank raises interest rates",
    "trading volume surges on wall street",
    "voters head to the polls tomorrow",
    "sports fans celebrate victory parade"
]

print("Corpus of headlines:")
for i, h in enumerate(corpus):
    print(f"  {i+1:2d}. {h}")

print("\n1. Creating a Bag of Words model...")
# Remove English stop words so topics are built from content words.
# With such a tiny corpus, min_df=1 keeps every word; on a real corpus
# you would raise min_df to drop rare terms.
vectorizer = CountVectorizer(max_df=0.95, min_df=1, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names_out()

print(f"Document-Term Matrix shape: {doc_term_matrix.shape}")
print(f"Number of unique words (features): {len(feature_names)}")

print("\n2. Fitting the LDA model to find 3 topics...")
num_topics = 3
lda_model = LatentDirichletAllocation(n_components=num_topics, random_state=42, max_iter=10)
lda_model.fit(doc_term_matrix)

print("\n3. Displaying the top words for each discovered topic...")
def display_topics(model, feature_names, no_top_words=5):
    topics = {}
    for topic_idx, topic in enumerate(model.components_):
        # Get indices of the top words for this topic
        top_indices = topic.argsort()[:-no_top_words - 1:-1]
        top_words = [feature_names[i] for i in top_indices]
        topics[f"Topic_{topic_idx}"] = top_words
    return topics

topics = display_topics(lda_model, feature_names)
for topic_id, words in topics.items():
    print(f"{topic_id}: {', '.join(words)}")

print("\n4. Checking the topic mixture for the first few documents...")
# Transform the documents to see their topic distribution
doc_topic_dist = lda_model.transform(doc_term_matrix)
print("Document -> Topic Distribution (percentages):")
for i, dist in enumerate(doc_topic_dist[:5]): # First 5 docs
    print(f"Doc {i+1} ('{corpus[i][:30]}...'):")
    for t_idx, percent in enumerate(dist):
        print(f"  Topic {t_idx}: {percent*100:5.1f}%")
    print()

print("\n5. Assigning a dominant topic to each document...")
dominant_topics = doc_topic_dist.argmax(axis=1)
topic_labels = {0: "FINANCE?", 1: "POLITICS?", 2: "SPORTS?"} # Our interpretation
for i, (doc, dom_topic) in enumerate(zip(corpus, dominant_topics)):
    print(f"'{doc}'")
    print(f"  -> Dominant Topic: {dom_topic} ({topic_labels[dom_topic]})\n")

The output will show three lists of words. With luck, one clusters around finance terms like 'market' and 'stock', another around 'election' and 'poll', and a third around 'team' and 'player'. The document-topic distribution should show each headline weighted heavily toward one dominant topic. On a corpus this tiny the groupings can be noisy, but the point stands: LDA uncovers the latent themes without ever being told what they are.

Our final, eighth technique brings us to the current frontier: using deep learning models designed for sequences, like Recurrent Neural Networks (RNNs) and Transformers. While the previous techniques are foundational, models like LSTMs (a type of RNN) and BERT (a Transformer) handle context and long-range dependencies in text far more effectively.

An LSTM has a kind of memory, allowing it to consider previous words when processing the next one, which is ideal for text generation or classification where word order matters deeply. BERT, which stands for Bidirectional Encoder Representations from Transformers, reads the entire sentence at once from both directions, leading to a profound understanding of context. The word "bank" in "river bank" gets a different embedding than in "bank deposit."
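You rarely train a Transformer from scratch. The Hugging Face transformers library exposes pre-trained models behind a one-line pipeline; a sketch, assuming transformers and a backend such as PyTorch are installed (the default sentiment model is downloaded on first use, so this needs network access the first time):

```python
# Requires: pip install transformers torch
from transformers import pipeline

# Loads a pre-trained sentiment model (by default a DistilBERT
# fine-tuned on movie reviews) the first time it is called
classifier = pipeline("sentiment-analysis")

results = classifier([
    "I loved every minute of this film.",
    "The plot was a confusing mess.",
])
for r in results:
    print(r["label"], round(r["score"], 3))
```

Each result is a dict with a label and a confidence score; no preprocessing, vectorizing, or training on your side.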

Let's implement a simple sentiment classifier using an LSTM with Keras to see how a neural network approaches the problem.

# Technique 8: Deep Learning for Text (LSTM Example)
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from sklearn.model_selection import train_test_split

print("=== Building an LSTM for Text Sentiment ===")

# We'll use a slightly larger simulated dataset
print("1. Preparing the data...")
positive_statements = [
    "this is wonderful and great",
    "I feel amazing and happy today",
    "what a fantastic and perfect result",
    "we are thrilled with the excellent outcome",
    "so good and positive experience",
    "love it absolutely brilliant",
    "outstanding performance very impressed",
    "superb quality highly satisfied"
]
negative_statements = [
    "this is terrible and awful",
    "I feel sad and angry today",
    "what a horrible and disappointing result",
    "we are upset with the poor outcome",
    "so bad and negative experience",
    "hate it absolutely dreadful",
    "unacceptable performance very disappointed",
    "low quality highly dissatisfied"
]

texts = positive_statements + negative_statements
labels = [1] * len(positive_statements) + [0] * len(negative_statements) # 1=pos, 0=neg

print(f"Total texts: {len(texts)}")
print(f"Sample positive: '{positive_statements[0]}'")
print(f"Sample negative: '{negative_statements[0]}'")

print("\n2. Tokenizing the text and creating sequences...")
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index
print(f"Vocabulary size: {len(word_index)}")
print(f"Word index sample: {list(word_index.items())[:5]}")

sequences = tokenizer.texts_to_sequences(texts)
print(f"\nFirst text as sequence: {sequences[0]}")
print(f"Corresponding words: {[list(word_index.keys())[list(word_index.values()).index(idx)] for idx in sequences[0] if idx in word_index.values()]}")

# Pad sequences to ensure uniform length
max_length = 8
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post', truncating='post')
print(f"\nPadded sequences shape: {padded_sequences.shape}")
print(f"First padded sequence: {padded_sequences[0]}")

print("\n3. Building the LSTM model...")
model = Sequential([
    # Embedding layer: turns word indices into dense vectors
    Embedding(input_dim=100, output_dim=16, input_length=max_length),
    # LSTM layer with 32 memory units
    LSTM(32, dropout=0.2),
    # Dense output layer with sigmoid activation for binary classification
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

print("\n4. Training the model...")
X_train, X_val, y_train, y_val = train_test_split(padded_sequences, labels, test_size=0.2, random_state=42)
history = model.fit(X_train, np.array(y_train), epochs=20, validation_data=(X_val, np.array(y_val)), verbose=0)

print("Training complete. Checking final accuracy...")
train_loss, train_acc = model.evaluate(X_train, np.array(y_train), verbose=0)
val_loss, val_acc = model.evaluate(X_val, np.array(y_val), verbose=0)
print(f"Training Accuracy:   {train_acc:.3f}")
print(f"Validation Accuracy: {val_acc:.3f}")

print("\n5. Testing on new sentences...")
test_sentences = [
    "this is good and happy",
    "this is bad and sad",
    "I am feeling wonderful",
    "what a terrible day"
]
test_seq = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_seq, maxlen=max_length, padding='post', truncating='post')
predictions = model.predict(test_padded)

for sent, pred in zip(test_sentences, predictions):
    score = pred[0]  # predict() returns an array of shape (1,) per sentence
    sentiment = "POSITIVE" if score > 0.5 else "NEGATIVE"
    print(f"'{sent}' -> {sentiment} (confidence: {score:.3f})")

This simple LSTM learns to map sequences of word indices to a sentiment score. The Embedding layer starts with random vectors and learns meaningful ones during training. The LSTM layer processes the sequence, and the final dense layer makes the decision. You'll see it correctly classifies our simple test sentences. For real-world use, you would need much more data, deeper networks, and perhaps pre-trained word embeddings in the first layer.
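That last suggestion, seeding the Embedding layer with pre-trained vectors, is mostly a matter of building a weight matrix aligned with the tokenizer's word index. Here is a minimal sketch: the `pretrained` dictionary and its values are stand-ins for vectors you would actually load from a GloVe or word2vec file, and the words are illustrative only.

```python
import numpy as np

# Stand-in for vectors loaded from a real GloVe/word2vec file
pretrained = {
    "good":  np.array([0.8, 0.1, 0.3]),
    "bad":   np.array([-0.7, 0.2, -0.4]),
    "happy": np.array([0.9, -0.1, 0.2]),
}

# Suppose this came from tokenizer.word_index (index 0 is reserved for padding)
word_index = {"good": 1, "bad": 2, "happy": 3, "unseenword": 4}

vocab_size = len(word_index) + 1  # +1 for the padding index 0
embedding_dim = 3
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:  # words without a pre-trained vector stay all-zero
        embedding_matrix[idx] = vector

print(embedding_matrix.shape)  # (5, 3)

# In the model above, you would then replace the Embedding layer with:
# Embedding(input_dim=vocab_size, output_dim=embedding_dim,
#           weights=[embedding_matrix], trainable=False)
```

Setting `trainable=False` freezes the vectors so the tiny dataset can't distort them; with more data you can unfreeze the layer and fine-tune.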

These eight techniques form a pathway from raw text to intelligent understanding. We start by cleaning the text, then represent it as numbers (Bag of Words, TF-IDF), capture phrases (N-grams), infuse meaning (Word Embeddings), judge feeling (Sentiment Analysis), discover themes (Topic Modeling), and finally employ advanced learning (Deep Learning). Each method has its place. Sometimes a simple TF-IDF with a logistic regression is all you need. Other problems demand the contextual power of an LSTM or a Transformer.

The key is to start simple, understand your data, and iteratively choose more complex tools only when they provide a clear benefit. I often begin with a quick TF-IDF analysis to establish a baseline. This entire process, from messy sentences to a model that can gauge emotion or extract topics, is what makes working with language in Python so engaging. You're not just manipulating strings; you're building a lens to focus on the meaning within.
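For reference, that baseline takes only a few lines with scikit-learn. This sketch reuses an abbreviated version of the toy sentiment data from above; with real data you would cross-validate rather than score on the training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Abbreviated version of the toy data used in the LSTM example
texts = [
    "this is wonderful and great", "I feel amazing and happy today",
    "love it absolutely brilliant", "superb quality highly satisfied",
    "this is terrible and awful", "I feel sad and angry today",
    "hate it absolutely dreadful", "low quality highly dissatisfied",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1=positive, 0=negative

# TF-IDF features feeding a linear classifier: fast to train, easy to inspect
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(texts, labels)

print("Train accuracy:", baseline.score(texts, labels))
for sent in ["absolutely wonderful", "absolutely dreadful"]:
    label = baseline.predict([sent])[0]
    print(f"'{sent}' -> {'POSITIVE' if label == 1 else 'NEGATIVE'}")
```

If this baseline already performs well, the extra complexity of an LSTM or Transformer may not be worth it for your problem.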

📘 Check out my latest ebook for free on my channel!

Be sure to like, share, comment, and subscribe to the channel!


101 Books

101 Books is an AI-driven publishing company co-founded by author Aarav Joshi. By leveraging advanced AI technology, we keep our publishing costs incredibly low—some books are priced as low as $4—making quality knowledge accessible to everyone.

Check out our book Golang Clean Code available on Amazon.

Stay tuned for updates and exciting news. When shopping for books, search for Aarav Joshi to find more of our titles. Use the provided link to enjoy special discounts!

Our Creations

Be sure to check out our creations:

Investor Central | Investor Central Spanish | Investor Central German | Smart Living | Epochs & Echoes | Puzzling Mysteries | Hindutva | Elite Dev | Java Elite Dev | Golang Elite Dev | Python Elite Dev | JS Elite Dev | JS Schools


We are on Medium

Tech Koala Insights | Epochs & Echoes World | Investor Central Medium | Puzzling Mysteries Medium | Science & Epochs Medium | Modern Hindutva
