<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Zaylee</title>
    <description>The latest articles on DEV Community by Zaylee (@zaylee90).</description>
    <link>https://dev.to/zaylee90</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3941360%2F55074ddc-ac91-49f5-81dc-126d0de9c8b3.png</url>
      <title>DEV Community: Zaylee</title>
      <link>https://dev.to/zaylee90</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zaylee90"/>
    <language>en</language>
    <item>
      <title>Detecting Duplicate Content at Scale Using Python TF-IDF Cosine Similarity for SEO Optimization &amp; Content Analysis</title>
      <dc:creator>Zaylee</dc:creator>
      <pubDate>Mon, 25 May 2026 09:23:40 +0000</pubDate>
      <link>https://dev.to/zaylee90/detecting-duplicate-content-at-scale-using-python-tf-idf-cosine-similarity-for-seo-optimization--13ej</link>
      <guid>https://dev.to/zaylee90/detecting-duplicate-content-at-scale-using-python-tf-idf-cosine-similarity-for-seo-optimization--13ej</guid>
      <description>&lt;p&gt;Struggling with duplicate content across your client sites? I wrote a simple Python script to compare content similarity using cosine similarity with TF-IDF vectors. It helps me spot plagiarized or near-duplicate pages quickly.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def check_duplicates(texts, threshold=0.8):
    # Build TF-IDF vectors, ignoring common English stop words
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(texts)
    # Cosine similarity between every pair of texts
    similarity_matrix = cosine_similarity(tfidf_matrix)

    duplicates = []
    for i in range(len(texts)):
        # Start at i + 1 so each pair is checked once and never against itself
        for j in range(i + 1, len(texts)):
            if similarity_matrix[i][j] &amp;gt; threshold:
                duplicates.append((i, j, similarity_matrix[i][j]))
    return duplicates


texts = [
    "This is the first article about SEO best practices.",
    "This is the second article about SEO best practices.",
    "Completely different content here."
]
result = check_duplicates(texts)
print(f"Found {len(result)} potential duplicates")
for i, j, score in result:
    print(f"Text {i} and {j}: {score:.2f} similarity")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
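
&lt;p&gt;To make the scoring less of a black box, here is a minimal standard-library sketch of roughly what the two scikit-learn calls compute with their default settings: raw term counts weighted by a smoothed inverse document frequency, L2-normalized, then compared with a dot product. The helper names (&lt;code&gt;tokenize&lt;/code&gt;, &lt;code&gt;tfidf_vectors&lt;/code&gt;, &lt;code&gt;cosine&lt;/code&gt;) are made up for illustration, and the sketch skips stop-word removal, so its scores will not match the script above exactly.&lt;/p&gt;

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(texts):
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    # Document frequency: how many texts contain each term.
    doc_freq = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        # Weight each count by a smoothed IDF: ln((1 + n) / (1 + df)) + 1.
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in doc.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # L2-normalize so the dot product of two vectors is their cosine.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(w * b.get(term, 0.0) for term, w in a.items())
```

&lt;p&gt;Identical texts score 1.0 and texts with no shared terms score 0.0; everything else falls in between, which is why the script needs a threshold.&lt;/p&gt;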

&lt;p&gt;The script compares every pair of texts, so the work grows quadratically with the number of pages. For large-scale checks, I've used SERPSpur's content analysis tool, which handles millions of pages efficiently. What's your method for catching duplicate content?&lt;br&gt;
&lt;a href="https://serpspur.com/" rel="noopener noreferrer"&gt;https://serpspur.com/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>seo</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Python Script to Convert Invoice PDFs to CSV Automatically</title>
      <dc:creator>Zaylee</dc:creator>
      <pubDate>Mon, 25 May 2026 04:27:04 +0000</pubDate>
      <link>https://dev.to/zaylee90/python-script-to-convert-invoice-pdfs-to-csv-automatically-2jn3</link>
      <guid>https://dev.to/zaylee90/python-script-to-convert-invoice-pdfs-to-csv-automatically-2jn3</guid>
      <description>&lt;p&gt;I've been working on a small Python script to automate converting invoice PDFs to CSV for my freelance SEO projects. Sharing it here in case it helps others.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import csv

import pdfplumber


def extract_invoice_data(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can come back empty (or None) for pages without a
        # text layer; joining with a newline keeps pages from running together
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    # Simple parsing logic for common invoice fields
    data = {}
    for line in text.split('\n'):
        if ':' in line:
            # Split on the first colon only, so values may contain colons
            key, value = line.split(':', 1)
            data[key.strip()] = value.strip()
    return data


def save_to_csv(data, csv_path):
    with open(csv_path, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=data.keys())
        writer.writeheader()
        writer.writerow(data)


# Example usage
invoice_data = extract_invoice_data('invoice.pdf')
save_to_csv(invoice_data, 'invoice.csv')
print('Conversion complete!')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
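
&lt;p&gt;The script above writes one invoice per CSV. For a small batch, one option is to collect the dictionaries and write them into a single file. This is only a sketch: &lt;code&gt;write_invoices&lt;/code&gt; is a hypothetical helper, and it assumes each invoice has already been parsed into a dictionary, possibly with different fields.&lt;/p&gt;

```python
import csv


def write_invoices(rows, csvfile):
    # Collect every field name seen, in first-seen order, so invoices
    # with different fields can still share one header row.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval fills in fields that a given invoice does not have.
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames
```

&lt;p&gt;Pass it a file opened with &lt;code&gt;newline=''&lt;/code&gt;, the same way &lt;code&gt;save_to_csv&lt;/code&gt; opens its output.&lt;/p&gt;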

&lt;p&gt;This handles basic invoices whose fields appear as &lt;code&gt;Key: value&lt;/code&gt; lines. Lines without a colon are skipped, and a repeated key keeps only its last value. For more complex formats, you might need regex or a dedicated parser. If you're dealing with tons of invoices, tools like SERPSpur's converter can save time, but this script works for small batches. What's your go-to method for invoice data extraction?&lt;/p&gt;
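
&lt;p&gt;As a rough idea of what the regex route could look like, the sketch below picks up amount lines whether or not they contain a colon. The line format it expects (a label followed by an amount with two decimals) and the names &lt;code&gt;AMOUNT_LINE&lt;/code&gt; and &lt;code&gt;parse_amounts&lt;/code&gt; are assumptions for illustration, not something every invoice will follow.&lt;/p&gt;

```python
import re
from decimal import Decimal

# Assumed layout: a label, an optional colon, an optional dollar sign,
# then an amount such as 1,200.00 at the end of the line.
AMOUNT_LINE = re.compile(r"^([A-Za-z ]+?)\s*:?\s*\$?([\d,]+\.\d{2})\s*$")


def parse_amounts(text):
    amounts = {}
    for line in text.splitlines():
        match = AMOUNT_LINE.match(line.strip())
        if match:
            label, raw = match.groups()
            # Drop thousands separators; Decimal avoids float rounding.
            amounts[label.strip()] = Decimal(raw.replace(",", ""))
    return amounts
```

&lt;p&gt;Lines that do not fit the pattern are ignored, so the pattern would need adjusting for each invoice layout.&lt;/p&gt;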

</description>
      <category>python</category>
      <category>pdftocsv</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
