Struggling with duplicate content across your site? I wrote a short Python script that uses fuzzy matching (`difflib.SequenceMatcher`) to score how similar the text of two pages is, so near-duplicates stand out. It's been a lifesaver for my SEO audits:
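```python
from difflib import SequenceMatcher

import requests
from bs4 import BeautifulSoup


def get_page_text(url):
    """Fetch a page and return its text content."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail on 4xx/5xx instead of comparing error pages
    soup = BeautifulSoup(response.text, 'html.parser')
    # A separator keeps words from adjacent tags from running together
    return soup.get_text(separator=' ', strip=True)


def similarity_ratio(text1, text2):
    """Return a similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, text1, text2).ratio()


urls = ['https://example.com/page1', 'https://example.com/page2']
texts = [get_page_text(url) for url in urls]

ratio = similarity_ratio(texts[0], texts[1])
print(f'Similarity: {ratio:.2%}')
if ratio > 0.8:
    print('Warning: Possible duplicate content!')
```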
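If you want to check more than two pages, one option is to compare every pair. Here is a minimal sketch of that idea, assuming the page text is already extracted into a dict. The names `normalize` and `find_near_duplicates`, the sample pages, and the choice to lowercase and collapse whitespace are my own assumptions for illustration, not part of the script above.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Collapse whitespace runs and lowercase so layout differences don't skew the score."""
    return " ".join(text.split()).lower()


def find_near_duplicates(pages, threshold=0.8):
    """Return (url_a, url_b, ratio) for every pair scoring at or above threshold.

    `pages` maps URL -> already extracted page text (hypothetical input shape).
    """
    cleaned = {url: normalize(text) for url, text in pages.items()}
    matches = []
    for (url_a, text_a), (url_b, text_b) in combinations(cleaned.items(), 2):
        matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
        # quick_ratio() is a cheap upper bound on ratio(), so pairs that
        # cannot reach the threshold are skipped without the full comparison.
        if matcher.quick_ratio() < threshold:
            continue
        ratio = matcher.ratio()
        if ratio >= threshold:
            matches.append((url_a, url_b, ratio))
    # Most similar pairs first
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Made-up sample text standing in for fetched pages
pages = {
    "https://example.com/page1": "Blue widgets are on sale this week. Order now for free shipping.",
    "https://example.com/page2": "Blue  widgets are on sale this week.\nOrder today for free shipping.",
    "https://example.com/about": "We are a small team of engineers based in Lisbon.",
}

for url_a, url_b, ratio in find_near_duplicates(pages):
    print(f"{ratio:.2%}  {url_a}  <->  {url_b}")
```

The number of pairs grows quadratically with the number of pages, and `SequenceMatcher` itself gets slow on long texts, so this is only practical for a few dozen pages at a time.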
For more advanced analysis, tools like SERPSpur's content audit feature can identify duplicates across large sites. But this script is great for quick checks. How do you handle duplicate content issues?