Hello, I'm Maneshwar. I’m building LiveReview, a private AI code review tool that runs on your LLM key (OpenAI, Gemini, etc.) with highly competitive pricing -- built for small teams. Do check it out and give it a try!
Extracting meaningful patterns, summarizing information, or identifying key themes from large volumes of text can be a daunting task.
This is where Natural Language Processing (NLP) comes into play.
One practical and powerful application is keyword extraction and clustering — identifying important terms from text and grouping them based on meaning.
In this blog, we explore how you can achieve this using spaCy, a leading NLP library, combined with clustering techniques.
Why Cluster Keywords?
Before diving into the technical details, let’s understand the benefits of keyword clustering:
1. Discover Hidden Patterns
Grouping similar keywords helps reveal themes or topics that are prevalent in a dataset but might not be obvious through manual analysis.
2. Enhance Search and Recommendations
Clustered keywords can improve search relevance, autocomplete suggestions, or content recommendations by associating similar terms.
3. Summarize Content
By extracting and grouping important terms, you can create concise summaries of long documents, making information easier to digest.
4. Support Decision Making
In business, clustering customer feedback keywords helps prioritize issues, detect trends, and tailor strategies based on user needs.
5. Power AI Applications
Chatbots, virtual assistants, and sentiment analysis tools can leverage clustered keywords for smarter responses and deeper contextual understanding.
What Is spaCy and Why Use It?
spaCy is an open-source library designed for advanced NLP tasks in Python. It’s widely used because it’s:
✔ Fast and Efficient – Processes large datasets quickly.
✔ Accurate – Provides state-of-the-art models for tagging, parsing, and named entity recognition.
✔ Extendable – Easily integrates with other Python tools and ML frameworks.
✔ User-Friendly – Offers intuitive APIs for developers at all levels.
Key Features Used in Keyword Clustering
- Tokenization – Breaks text into individual words, punctuation, etc.
- Part-of-Speech Tagging (POS) – Identifies nouns, verbs, adjectives, etc.
- Lemmatization – Reduces words to their base or dictionary form.
- Word Vectors – Represents words as mathematical vectors capturing semantic meaning.
- Similarity Measures – Compares how alike two words are using their vectors.
By combining these features, spaCy helps extract meaningful terms and compare their relationships.
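Under the hood, spaCy's similarity comparison is cosine similarity between word vectors. A minimal pure-Python sketch of that calculation, using toy 3-dimensional vectors (real spaCy vectors have hundreds of dimensions, and these values are invented for illustration):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), in the range [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "word vectors": cat and dog point in similar directions, car does not.
cat = [0.8, 0.1, 0.3]
dog = [0.7, 0.2, 0.4]
car = [0.1, 0.9, 0.1]

print(cosine_similarity(cat, dog))  # close to 1: semantically similar
print(cosine_similarity(cat, car))  # much lower: dissimilar
```

Two vectors pointing in the same direction score near 1, orthogonal vectors score near 0; this is exactly the comparison the clustering step later relies on.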
Common Use Cases of spaCy in Text Processing
- ✅ Content Classification – Grouping documents or articles by topic.
- ✅ Information Retrieval – Enhancing search engines or recommendation systems.
- ✅ Chatbots and Virtual Assistants – Understanding user queries and matching intents.
- ✅ Sentiment Analysis – Extracting opinion keywords and grouping sentiments.
- ✅ Market Research – Analyzing customer feedback or product reviews for trends.
- ✅ Healthcare – Extracting medical terms and grouping symptoms or conditions.
These applications span industries like finance, retail, education, healthcare, and more — wherever textual data needs to be understood at scale.
Step-by-Step: How the Script Works
Input Format
The script expects a CSV file (input.csv) with one column named text containing sentences or phrases. For example:
text
The cat sat on the mat.
A dog barked loudly in the park.
She bought a new car yesterday.
The bus was late because of traffic.
...
From this input, the script identifies nouns like "cat", "dog", "car", "bus", etc., and clusters them.
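If you want to try the script without an existing dataset, you can generate a matching input.csv yourself; a quick sketch using pandas (the file name input.csv matches the input format described above):

```python
import pandas as pd

# Sample sentences matching the expected single-column format.
rows = [
    "The cat sat on the mat.",
    "A dog barked loudly in the park.",
    "She bought a new car yesterday.",
    "The bus was late because of traffic.",
]
pd.DataFrame({"text": rows}).to_csv("input.csv", index=False)
```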
Step 1: Loading spaCy and Reading the Data
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_md")
df = pd.read_csv(INPUT_CSV)
texts = df[TEXT_COLUMN].dropna().tolist()
- We use the en_core_web_md model because it includes pre-trained word vectors.
- The CSV file is read into a pandas DataFrame, and the text column is extracted for processing.
Step 2: Extracting Nouns
nouns = set()
for doc in nlp.pipe(texts, disable=["ner", "parser"]):
    for token in doc:
        if token.pos_ in ["NOUN", "PROPN"] and not token.is_stop:
            nouns.add(token.lemma_.lower())
- spaCy processes the text and identifies nouns (NOUN) and proper nouns (PROPN).
- Lemmatization is applied to group similar forms of words (e.g., "cars" → "car").
- Stopwords are filtered out to keep only meaningful nouns.
This step helps extract relevant keywords for clustering.
Step 3: Computing Word Vectors
vectors = []
valid_nouns = []
for noun in nouns:
    doc = nlp(noun)  # nlp() returns a Doc; a single-word Doc still exposes a vector
    if doc.has_vector:
        vectors.append(doc.vector)
        valid_nouns.append(noun)
- For each noun, we retrieve its vector representation.
- Only words with valid vectors are used for similarity calculations.
These vectors capture semantic relationships based on the model’s training data.
Step 4: Calculating Similarity
similarity_matrix = cosine_similarity(vectors)
distance_matrix = 1 - similarity_matrix
- We calculate the cosine similarity between all pairs of noun vectors.
- The similarity matrix is converted to a distance matrix because clustering algorithms typically work with distances.
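To see what those two lines produce, here is the same transformation on a toy set of vectors (values invented for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three toy vectors: the first two point the same way, the third is orthogonal.
vectors = np.array([
    [1.0, 0.0],
    [2.0, 0.0],
    [0.0, 1.0],
])

similarity_matrix = cosine_similarity(vectors)
distance_matrix = 1 - similarity_matrix

print(np.round(distance_matrix, 2))
# Diagonal is 0 (every vector is identical to itself);
# parallel vectors get distance 0, orthogonal vectors get distance 1.
```

Cosine similarity ranges from -1 to 1, so the resulting distances fall between 0 and 2, with 0 meaning "same direction".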
Step 5: Clustering with Agglomerative Clustering
clustering = AgglomerativeClustering(
    n_clusters=None,
    metric="precomputed",
    linkage="average",
    distance_threshold=1 - SIMILARITY_THRESHOLD,
)
labels = clustering.fit_predict(distance_matrix)
- We use Agglomerative Clustering, a hierarchical clustering method.
- The algorithm groups nouns based on distance (i.e., how dissimilar they are).
- By setting a distance_threshold instead of a fixed n_clusters, we let the algorithm determine the appropriate number of clusters.
This method works well for datasets where you don’t know how many clusters to expect beforehand.
Step 6: Grouping Nouns
clusters = {}
for label, noun in zip(labels, valid_nouns):
    clusters.setdefault(int(label), []).append(noun)
- Each noun is assigned to a cluster based on the algorithm’s output.
- We collect nouns into dictionary groups.
This structure helps us easily see which nouns are semantically related.
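The setdefault pattern is easy to see on made-up labels and nouns:

```python
# Hypothetical clustering output: each label pairs with one noun.
labels = [0, 1, 0, 1, 2]
valid_nouns = ["cat", "car", "dog", "bus", "pizza"]

clusters = {}
for label, noun in zip(labels, valid_nouns):
    # setdefault creates an empty list the first time a label appears.
    clusters.setdefault(int(label), []).append(noun)

print(clusters)  # {0: ['cat', 'dog'], 1: ['car', 'bus'], 2: ['pizza']}
```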
Step 7: Outputting the Result as JSON
output = {"clusters": clusters, "total_clusters": len(clusters)}
with open(OUTPUT_JSON, "w") as f:
    json.dump(output, f, indent=4)
- The clusters are saved to a JSON file (output.json).
- The JSON format makes it easy to share, visualize, or use the results in other applications.
Example Output
{
    "clusters": {
        "0": ["car", "bus"],
        "1": ["cat", "dog"],
        "2": ["tree", "forest"],
        "3": ["river", "valley"],
        "4": ["pizza", "pasta"]
    },
    "total_clusters": 5
}
This output shows how related terms are grouped together based on meaning.
Why This Approach Works
- Word vectors capture meaning beyond exact words.
- Agglomerative clustering builds groups without needing predefined cluster numbers.
- spaCy’s NLP pipeline simplifies tokenization, tagging, and vector computation.
- JSON output makes the results portable and usable across different platforms.
Possible Improvements
- Use a larger model (en_core_web_lg) for better vector accuracy.
- Apply additional preprocessing like removing rare or ambiguous words.
- Experiment with different clustering algorithms like KMeans or DBSCAN.
- Visualize clusters using tools like t-SNE or UMAP.
- Extend the pipeline to handle multi-word phrases (noun chunks).
Final Thoughts
This script is a great starting point for anyone interested in semantic keyword analysis or clustering.
By leveraging spaCy’s language models and combining them with clustering algorithms, you can build powerful tools for content analysis, search optimization, and AI-driven applications.
This approach can be expanded, tuned, and adapted to specific domains — from e-commerce to healthcare to social media analysis.
LiveReview helps you get great feedback on your PR/MR in a few minutes.
Saves hours on every PR by giving fast, automated first-pass reviews.
If you're tired of waiting for your peer to review your code or are not confident that they'll provide valid feedback, here's LiveReview for you.