Struggling with duplicate content across multiple landing pages? I wrote a quick script that scores how similar the page texts are, using cosine similarity on TF-IDF vectors. Here's how:
python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def check_duplicates(texts):
    # Turn each page's text into a TF-IDF vector.
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(texts)
    # Pairwise cosine similarity between all pages (square, symmetric matrix).
    similarity = cosine_similarity(tfidf_matrix)
    for i in range(len(similarity)):
        # Start at i+1 so each pair is checked once and a page is never compared with itself.
        for j in range(i+1, len(similarity)):
            if similarity[i][j] > 0.8:
                print(f"High similarity between page {i} and {j}: {similarity[i][j]:.2f}")
With the 0.8 threshold, this flagged several pages that needed rewriting. For larger audits, I've been using a tool that automates this analysis. How do you handle duplicate content detection?
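If you want to reuse the results instead of just printing them, here's a rough sketch of a variant that returns the flagged pairs sorted by score. The function name `find_similar_pairs`, the `threshold` parameter, and the sample `pages` list are my own placeholders for illustration, not part of the script above, so adjust them to your setup.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_pairs(texts, threshold=0.8):
    """Return (i, j, score) for each pair of texts scoring above threshold."""
    tfidf_matrix = TfidfVectorizer().fit_transform(texts)
    similarity = cosine_similarity(tfidf_matrix)
    # Upper-triangle indices (k=1 skips the diagonal), so each pair appears once.
    rows, cols = np.triu_indices(len(texts), k=1)
    pairs = [
        (int(i), int(j), float(similarity[i, j]))
        for i, j in zip(rows, cols)
        if similarity[i, j] > threshold
    ]
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Made-up sample texts, only to show the call.
pages = [
    "Buy red running shoes online with free shipping",
    "Buy red running shoes online with free shipping",
    "Our guide to baking sourdough bread at home",
]
result = find_similar_pairs(pages)
for i, j, score in result:
    print(f"Page {i} and page {j}: {score:.2f}")
```

Returning a list makes it easy to dump the pairs into a spreadsheet or try a different threshold without rerunning the vectorizer by hand. The 0.8 default just mirrors the cutoff used above; it's a starting point, not a tuned value.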
Handle Duplicate Content: https://serpspur.com