Struggling with duplicate content across your client sites? I wrote a simple Python script that scores every pair of pages by the cosine similarity of their TF-IDF vectors. It helps me spot plagiarized or near-duplicate pages quickly.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def check_duplicates(texts):
    # Turn each text into a TF-IDF vector, ignoring common English stop words.
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(texts)
    # Entry [i][j] is the cosine similarity between texts i and j.
    similarity_matrix = cosine_similarity(tfidf_matrix)

    duplicates = []
    for i in range(len(texts)):
        # Start at i + 1 so each pair is checked once and never against itself.
        for j in range(i + 1, len(texts)):
            if similarity_matrix[i][j] > 0.8:  # Threshold
                duplicates.append((i, j, similarity_matrix[i][j]))
    return duplicates


texts = [
    "This is the first article about SEO best practices.",
    "This is the second article about SEO best practices.",
    "Completely different content here.",
]

result = check_duplicates(texts)
print(f"Found {len(result)} potential duplicates")
for i, j, score in result:
    print(f"Text {i} and {j}: {score:.2f} similarity")
```
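If the nested Python loops get slow on a bigger batch, one possible variation is to let NumPy pick out the upper triangle of the similarity matrix in one go. This is a rough sketch, not a benchmarked replacement: `find_duplicate_pairs`, its `threshold` parameter, and the sample `pages` are names I made up for illustration, and it still builds the full dense similarity matrix in memory.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def find_duplicate_pairs(texts, threshold=0.8):
    """Return (i, j, score) for every pair i < j scoring above threshold."""
    tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(texts)
    similarity_matrix = cosine_similarity(tfidf_matrix)

    # k=1 selects the strict upper triangle, so each pair appears once
    # and self-comparisons on the diagonal are skipped.
    rows, cols = np.triu_indices(len(texts), k=1)
    scores = similarity_matrix[rows, cols]
    keep = scores > threshold

    # Convert NumPy scalars to plain Python types for easy printing or JSON.
    return [
        (int(i), int(j), float(s))
        for i, j, s in zip(rows[keep], cols[keep], scores[keep])
    ]


# Hypothetical sample data: the first two pages are identical.
pages = [
    "Ten tips for faster page load times on mobile devices.",
    "Ten tips for faster page load times on mobile devices.",
    "A recipe for sourdough bread with a long cold ferment.",
]
pairs = find_duplicate_pairs(pages)
for i, j, score in pairs:
    print(f"Page {i} and {j}: {score:.2f} similarity")
```

The result has the same shape as what `check_duplicates` returns, so the reporting loop works unchanged. Memory still grows with the square of the number of texts, so this only helps with loop overhead, not with truly large crawls.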
For large-scale checks, I've used SERPSpur's content analysis tool, which handles millions of pages efficiently. What's your method for catching duplicate content?
https://serpspur.com/