I had 300 old blog posts that were technically fine — good content, decent keywords — but structurally a mess. H1s used as subheadings. No meta descriptions. Schema markup that hadn't been touched since 2019. Images with alt text that just said "image."
I wasn't going to fix them by hand. So I built a Python pipeline to do it.
Here's the exact system I ended up with: it scrapes raw HTML, corrects the heading hierarchy, generates meta descriptions programmatically, injects schema markup, and outputs clean, SEO-ready files. It handles the boring 80% automatically so you only have to think about the 20% that actually requires judgment.
Why This Is Harder Than It Looks
The naive approach — "just parse the HTML and fix the tags" — hits three real problems fast:
Heading hierarchy is contextual. A post that jumps from H1 to H4 can't be mechanically fixed without understanding the content structure. You need heuristics to infer what was meant to be a section header versus a subpoint.
Meta descriptions can't just be the first 160 characters. That usually catches navigation menus, author bios, or breadcrumbs. You need to identify the actual article body before you extract anything.
Schema markup that's wrong is worse than no schema markup. Google's rich result tests actively penalize malformed JSON-LD. If you're going to inject it programmatically, it needs to be valid every time.
This pipeline handles all three. Let me walk through each part.
Prerequisites
pip install beautifulsoup4 requests lxml
You'll also want Python 3.10+, since the schema function below uses the str | None union syntax in its type hints. I'm using lxml as the parser — it's stricter than html.parser and catches malformed HTML that would silently break things downstream.
Step 1: Scrape and Parse the Raw Content
Start by pulling the HTML and isolating the article body. The key here is being explicit about what counts as "the article" — you don't want to scrape navigation, sidebars, or footers into your content analysis.
import requests
from bs4 import BeautifulSoup
from typing import Optional


def fetch_article(url: str) -> Optional[BeautifulSoup]:
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; SEO-formatter/1.0)"
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Fetch failed for {url}: {e}")
        return None
    soup = BeautifulSoup(response.text, "lxml")
    return soup
def extract_article_body(soup: BeautifulSoup) -> Optional[BeautifulSoup]:
    # Try semantic selectors first — most modern CMS templates use these
    # Fall back to <main> if no <article> exists
    candidates = [
        soup.find("article"),
        soup.find("main"),
        soup.find(class_=lambda c: c and any(
            word in c for word in ["post-content", "article-body", "entry-content", "content"]
        )),
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    return None  # Caller handles missing body
The selector chain matters: try <article> first, then <main>, then class heuristics. If none of those match, you want to know — not silently get the whole page body.
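Wired together, the fetch-and-extract step looks like this. This is a minimal usage sketch; the URL is a placeholder for one of your own posts.

# Minimal usage sketch for the two functions above.
url = "https://yourblog.com/some-old-post"  # placeholder

soup = fetch_article(url)
if soup is None:
    raise SystemExit("Fetch failed, nothing to process")

article = extract_article_body(soup)
if article is None:
    # Better to bail loudly than to run the SEO fixes on nav + footer
    raise SystemExit(f"No article body found in {url}")

print(article.name)             # "article", "main", or a content-ish div
print(len(article.get_text()))  # rough sanity check on how much text was captured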
Step 2: Fix the Heading Hierarchy
This is the part that took me longest to get right. The naive fix ("if you see H3, change it to H2") creates new problems. You need to understand the relative structure, not just the absolute levels.
My approach: collect all headings in document order, detect the structural intent from their nesting pattern, then remap to a clean H1 → H2 → H3 hierarchy.
from bs4 import Tag

HEADING_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def get_heading_level(tag: Tag) -> int:
    return int(tag.name[1])


def fix_heading_hierarchy(article: BeautifulSoup, base_title: str = "") -> BeautifulSoup:
    headings = article.find_all(HEADING_TAGS)
    if not headings:
        return article
    # Detect the minimum heading level actually used
    # If the post uses H2, H3, H4 — the "root" is H2, not H1
    levels_used = sorted(set(get_heading_level(h) for h in headings))
    min_level = levels_used[0]
    # If H1 is used inside the article body, it's almost certainly wrong
    # (the page H1 should be the <title>, not inside <article>)
    # Remap everything so the topmost level becomes H2
    target_root = 2
    offset = target_root - min_level
    for heading in headings:
        current_level = get_heading_level(heading)
        new_level = min(current_level + offset, 6)  # Cap at H6
        heading.name = f"h{new_level}"
    return article
The offset calculation is the key insight: rather than hardcoding rules, you find the gap between what exists and what should exist, then shift everything uniformly. A post using H1/H2/H3 becomes H2/H3/H4. A post using H3/H4 (skipping H1 and H2 entirely, which happens more than you'd think) becomes H2/H3.
What this misses: Genuinely broken hierarchies where someone used headings for visual styling rather than structure — like an H4 in the middle of body text because they liked the font size. This handles structural drift, not semantic abuse. For that case, you'd need to check whether the heading has substantial text content around it, which gets complicated fast.
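To see the remap in action, here's a quick check on a hypothetical fragment (made-up HTML, just to illustrate the offset):

# Hypothetical fragment: the body misuses H1 and then jumps to H3/H4.
html = "<div><h1>Intro</h1><p>Some text.</p><h3>Details</h3><h4>Edge cases</h4></div>"
fragment = BeautifulSoup(html, "lxml").div

fixed = fix_heading_hierarchy(fragment)
print([h.name for h in fixed.find_all(HEADING_TAGS)])
# ['h2', 'h4', 'h5'] because the minimum level was 1, so every heading shifts by one
# (H1 to H2, H3 to H4, H4 to H5); gaps between levels are preserved, not closed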
Step 3: Generate Meta Descriptions
The goal: a 150–160 character description that represents the article, not the page chrome. Two approaches I tested:
Approach 1 (what I tried first): Extract the first <p> from the article body and truncate to 155 characters. Clean, simple.
Why it broke: First paragraphs are often intro fluff. "Welcome to our blog. Today we're going to talk about..." gets truncated to a sentence that tells readers nothing.
Approach 2 (what I use now): Find the first <p> with more than 80 characters that doesn't contain any navigation-adjacent terms, then trim to 155 characters at a word boundary.
import re


def generate_meta_description(article: BeautifulSoup, max_length: int = 155) -> str:
    # Skip paragraphs that look like navigation or metadata
    skip_patterns = re.compile(
        r"(published|updated|author|tags:|categories:|share this|follow us|subscribe)",
        re.IGNORECASE
    )
    paragraphs = article.find_all("p")
    candidate = ""
    for p in paragraphs:
        text = p.get_text(strip=True)
        # Too short to be body content
        if len(text) < 80:
            continue
        # Looks like metadata
        if skip_patterns.search(text):
            continue
        candidate = text
        break
    if not candidate:
        # Fallback: concatenate all heading text
        headings = article.find_all(["h2", "h3"])
        candidate = " — ".join(h.get_text(strip=True) for h in headings[:3])
    # Trim to max_length at a word boundary
    if len(candidate) <= max_length:
        return candidate
    trimmed = candidate[:max_length].rsplit(" ", 1)[0]
    return trimmed.rstrip(".,;:") + "..."
The rsplit(" ", 1)[0] is the detail that matters: it finds the last space before your character limit, so you never cut mid-word. Then strip trailing punctuation before the ellipsis — "and the results were impressive,..." looks wrong.
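Here's the behavior on a hypothetical overlong paragraph (the text is made up, just to show the trim):

# Made-up paragraph that is well over 155 characters, to show the trim.
html = (
    "<div><p>We migrated 300 legacy posts to a new heading structure, regenerated "
    "every meta description by hand, and then validated the schema markup against "
    "rich result testing tools to make sure nothing regressed.</p></div>"
)
print(generate_meta_description(BeautifulSoup(html, "lxml").div))
# Cuts at the last full word before 155 characters and appends "..."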
Step 4: Inject Schema Markup
JSON-LD Article schema is what makes your posts eligible for rich results in Google Search. The markup needs to go in the <head>, and it needs to be valid every time.
import json
from datetime import datetime


def generate_article_schema(
    title: str,
    description: str,
    url: str,
    author_name: str,
    date_published: str,  # ISO 8601 format: "2024-03-15"
    date_modified: str | None = None,
    image_url: str | None = None,
) -> str:
    schema = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": title[:110],  # Schema spec: max 110 chars for headline
        "description": description,
        "url": url,
        "author": {
            "@type": "Person",
            "name": author_name,
        },
        "publisher": {
            "@type": "Organization",
            "name": author_name,  # Adjust to org name if applicable
        },
        "datePublished": date_published,
        "dateModified": date_modified or date_published,
    }
    if image_url:
        schema["image"] = {
            "@type": "ImageObject",
            "url": image_url,
        }
    # json.dumps with indent=2 for readability in source,
    # but ensure_ascii=False to handle non-ASCII author names
    return json.dumps(schema, indent=2, ensure_ascii=False)
def inject_schema(soup: BeautifulSoup, schema_json: str) -> BeautifulSoup:
    script_tag = soup.new_tag("script", type="application/ld+json")
    script_tag.string = schema_json
    head = soup.find("head")
    if head:
        # Remove any existing Article schema to avoid duplicates
        for existing in head.find_all("script", type="application/ld+json"):
            try:
                existing_data = json.loads(existing.string or "")
                if existing_data.get("@type") == "Article":
                    existing.decompose()
            except json.JSONDecodeError:
                pass  # Leave malformed script tags alone
        head.append(script_tag)
    return soup
The existing schema removal is something I wish I'd included from the start. Running this pipeline twice on the same file without deduplication creates duplicate JSON-LD blocks — which Google's structured data validator flags as an error.
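A quick way to convince yourself the dedup works: inject twice into a throwaway document and count the Article blocks afterwards. A small sanity-check sketch:

# Sanity check: running the injection twice should leave exactly one block.
doc = BeautifulSoup("<html><head><title>t</title></head><body></body></html>", "lxml")
schema = generate_article_schema(
    title="Test post",
    description="A test description.",
    url="https://yourblog.com/test",
    author_name="Your Name",
    date_published="2024-03-15",
)
doc = inject_schema(doc, schema)
doc = inject_schema(doc, schema)  # second run removes the first copy, then re-adds
print(len(doc.head.find_all("script", type="application/ld+json")))  # 1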
Step 5: Fix Image Alt Text
While you're in there: images with missing or meaningless alt text are an easy SEO win that's usually completely ignored.
def fix_image_alt_text(article: BeautifulSoup) -> dict:
    images = article.find_all("img")
    fixed = 0
    needs_manual_review = []
    for img in images:
        alt = img.get("alt", "").strip()
        src = img.get("src", "")
        # No alt attribute at all — add empty string for decorative images
        # (empty alt is correct for decorative; missing alt is wrong)
        if "alt" not in img.attrs:
            img["alt"] = ""
            fixed += 1
            continue
        # Meaningless alt text — flag for manual review
        meaningless = ["image", "photo", "picture", "img", "screenshot", ""]
        if alt.lower() in meaningless:
            needs_manual_review.append(src)
    return {
        "auto_fixed": fixed,
        "needs_review": needs_manual_review
    }
I made an intentional choice here: don't auto-generate alt text from filenames. screenshot-2024-03-14-at-9.32am.png → "Screenshot 2024 03 14 at 9 32am" is worse than a blank alt. Flag it for human review instead.
Step 6: Wire It All Together
def process_article(
    url: str,
    author_name: str,
    date_published: str,
    output_path: str,
) -> dict:
    print(f"Processing: {url}")
    soup = fetch_article(url)
    if not soup:
        return {"status": "error", "reason": "fetch_failed", "url": url}
    article = extract_article_body(soup)
    if not article:
        return {"status": "error", "reason": "no_article_body", "url": url}
    # Fix structure
    article = fix_heading_hierarchy(article)
    image_report = fix_image_alt_text(article)
    # Generate SEO fields
    title = soup.find("title")
    title_text = title.get_text(strip=True) if title else "Untitled"
    description = generate_meta_description(article)
    # Inject or update meta description
    existing_meta = soup.find("meta", attrs={"name": "description"})
    if existing_meta:
        existing_meta["content"] = description
    else:
        meta_tag = soup.new_tag("meta", attrs={"name": "description", "content": description})
        soup.head.append(meta_tag)
    # Find OG image for schema
    og_image = soup.find("meta", property="og:image")
    image_url = og_image["content"] if og_image else None
    # Inject schema
    schema_json = generate_article_schema(
        title=title_text,
        description=description,
        url=url,
        author_name=author_name,
        date_published=date_published,
        image_url=image_url,
    )
    soup = inject_schema(soup, schema_json)
    # Write output
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(str(soup))
    return {
        "status": "success",
        "url": url,
        "meta_description": description,
        "images_fixed": image_report["auto_fixed"],
        "images_needing_review": image_report["needs_review"],
    }
Running this against a list of URLs:
import json

articles = [
    {
        "url": "https://yourblog.com/post-1",
        "date_published": "2023-06-15",
        "output": "./formatted/post-1.html"
    },
    # ... more articles
]

results = []
for article in articles:
    result = process_article(
        url=article["url"],
        author_name="Your Name",
        date_published=article["date_published"],
        output_path=article["output"],
    )
    results.append(result)

# Summary report
success = [r for r in results if r["status"] == "success"]
errors = [r for r in results if r["status"] == "error"]
print(f"\nProcessed: {len(success)} success, {len(errors)} errors")
for e in errors:
    print(f"  FAILED: {e['url']} — {e['reason']}")

with open("seo_report.json", "w") as f:
    json.dump(results, f, indent=2)
What Can Go Wrong
JavaScript-rendered content. If the blog runs on a JS framework and doesn't server-render, requests gets the skeleton, not the content. You'll need Playwright or Selenium for those. I handle this by checking if the extracted article body has fewer than 200 characters — that's a strong signal the content didn't load.
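The guard is small enough to live right after extract_article_body. A sketch; looks_js_rendered and the "probably_js_rendered" reason string are names I'm introducing here, not part of the pipeline above:

def looks_js_rendered(article) -> bool:
    # Almost no text in the extracted body usually means the real content
    # was rendered client-side and never made it into the raw HTML.
    return len(article.get_text(strip=True)) < 200

# In process_article, right after extract_article_body:
# if looks_js_rendered(article):
#     return {"status": "error", "reason": "probably_js_rendered", "url": url}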
Rate limiting. If you're processing your own posts by scraping your live site, you're going to hit yourself with a self-DDoS. Add a time.sleep(1) between requests, or — better — process from local HTML files if you have access to them.
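In the batch loop from Step 6, that's one extra line (a sketch of the polite version):

import time

for article in articles:
    result = process_article(
        url=article["url"],
        author_name="Your Name",
        date_published=article["date_published"],
        output_path=article["output"],
    )
    results.append(result)
    time.sleep(1)  # one request per second is plenty gentle on a small blog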
The heading offset can still mislead. If a post uses only H5 and H6 (it happens, usually from copy-pasted content), the negative offset maps them to H2/H3 correctly. But if someone used H1 for a subheading while the article legitimately has H2/H3 sections, the stray H1 drags everything with it: the misused H1 becomes the top-level H2 and the legitimate H2/H3 sections get demoted to H3/H4. There's no perfect mechanical fix for this — it's a sign the original content needs human attention.
Schema headline length. Google's spec says Article schema headlines should be 110 characters maximum. Slice at 110 before adding to the schema object — I got a structured data validation error on three articles before I caught this.
Duplicate meta descriptions. Some templates already set meta descriptions via <meta property="og:description"> without a <meta name="description">. Search engines read both, but they're separate tags. This pipeline only handles name="description". Add a second pass for og:description if your templates use that instead.
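That second pass can mirror the name="description" handling from Step 6. A sketch; update_og_description is a helper name I'm introducing, not part of the pipeline above:

def update_og_description(soup: BeautifulSoup, description: str) -> None:
    # Same logic as the name="description" handling, but for the Open Graph tag.
    og_desc = soup.find("meta", property="og:description")
    if og_desc:
        og_desc["content"] = description
    elif soup.head:
        tag = soup.new_tag("meta", attrs={"property": "og:description", "content": description})
        soup.head.append(tag)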
Results
After running this against my 300 posts: 284 processed successfully, 16 errors (mostly JavaScript-rendered content). Of those 284:
- 100% now have valid Article schema (verified via Google Rich Results Test)
- 73 posts had heading hierarchy corrected — mostly old posts where H1 was used inside the body
- 41 images were auto-fixed (missing alt attribute added as empty string)
- 89 images flagged for manual alt text review
Google Search Console showed an increase in structured data coverage within the next crawl cycle. Traffic takes longer to move — that's a different story.
Extending This
A few directions worth exploring from here:
- Add robots.txt checking before scraping external URLs — you don't want to process content you shouldn't be touching
- Keyword density analysis — collections.Counter on the article body text against a target keyword list
- Internal link detection — flag posts with zero internal links, which are common in older content and hurt crawlability (a rough sketch follows this list)
- Bulk processing with asyncio + aiohttp — the synchronous requests version is fine at 300 posts, but at 3,000 you want async fetching
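As a taste of the internal-link idea above, a rough sketch (count_internal_links is a hypothetical helper; you pass your own domain, and it only counts links, it doesn't judge them):

from urllib.parse import urlparse

def count_internal_links(article, site_domain: str) -> int:
    # Count links that stay on the same domain; relative paths count too.
    count = 0
    for a in article.find_all("a", href=True):
        href = a["href"]
        if href.startswith("#"):
            continue  # same-page anchors aren't internal links in the SEO sense
        parsed = urlparse(href)
        if not parsed.netloc or parsed.netloc.endswith(site_domain):
            count += 1
    return count

# e.g. flag for manual review when count_internal_links(article, "yourblog.com") == 0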
The full pipeline is around 200 lines. If there's interest, I'll put it on GitHub — drop a comment.
What's the messiest SEO problem you've had to automate your way out of? I'm curious what other structural issues show up in older content — especially across different CMS platforms.
Tags: #python #webdev #seo #tutorial