I had 300 old blog posts that were technically fine — good content, decent keywords — but structurally a mess. H1s used as subheadings. No meta descriptions. Schema markup that hadn't been touched since 2019. Images with alt text that just said "image."
I wasn't going to fix them by hand. So I built a Python pipeline to do it.
Here's the exact system I ended up with: it scrapes raw HTML, corrects the heading hierarchy, generates meta descriptions programmatically, injects schema markup, and outputs clean, SEO-ready files. It handles the boring 80% automatically so you only have to think about the 20% that actually requires judgment.
Why This Is Harder Than It Looks
The naive approach — "just parse the HTML and fix the tags" — hits three real problems fast:
Heading hierarchy is contextual. A post that jumps from H1 to H4 can't be mechanically fixed without understanding the content structure. You need heuristics to infer what was meant to be a section header versus a subpoint.
Meta descriptions can't just be the first 160 characters. That usually catches navigation menus, author bios, or breadcrumbs. You need to identify the actual article body before you extract anything.
Schema markup that's wrong is worse than no schema markup. Google's rich result tests actively penalize malformed JSON-LD. If you're going to inject it programmatically, it needs to be valid every time.
This pipeline handles all three. Let me walk through each part.
Prerequisites
pip install beautifulsoup4 requests lxml
You'll also want Python 3.10+, since the schema function below uses the str | None union syntax in its type hints. I'm using lxml as the parser — it's stricter than html.parser and catches malformed HTML that would silently break things downstream.
Step 1: Scrape and Parse the Raw Content
Start by pulling the HTML and isolating the article body. The key here is being explicit about what counts as "the article" — you don't want to scrape navigation, sidebars, or footers into your content analysis.
import requests
from bs4 import BeautifulSoup
from typing import Optional


def fetch_article(url: str) -> Optional[BeautifulSoup]:
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; SEO-formatter/1.0)"
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Fetch failed for {url}: {e}")
        return None
    soup = BeautifulSoup(response.text, "lxml")
    return soup
def extract_article_body(soup: BeautifulSoup) -> Optional[BeautifulSoup]:
    # Try semantic selectors first — most modern CMS templates use these
    # Fall back to <main> if no <article> exists
    candidates = [
        soup.find("article"),
        soup.find("main"),
        soup.find(class_=lambda c: c and any(
            word in c for word in ["post-content", "article-body", "entry-content", "content"]
        )),
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    return None  # Caller handles missing body
The selector chain matters: try <article> first, then <main>, then class heuristics. If none of those match, you want to know — not silently get the whole page body.
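Wired together, the fetch-and-extract step looks like this. This is a minimal usage sketch; the URL is a placeholder for one of your own posts.

# Minimal usage sketch for the two functions above.
url = "https://yourblog.com/some-old-post"  # placeholder

soup = fetch_article(url)
if soup is None:
    raise SystemExit("Fetch failed, nothing to process")

article = extract_article_body(soup)
if article is None:
    # Better to bail loudly than to run the SEO fixes on nav + footer
    raise SystemExit(f"No article body found in {url}")

print(article.name)             # "article", "main", or a content-ish div
print(len(article.get_text()))  # rough sanity check on how much text was captured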
Step 2: Fix the Heading Hierarchy
This is the part that took me longest to get right. The naive fix ("if you see H3, change it to H2") creates new problems. You need to understand the relative structure, not just the absolute levels.
My approach: collect all headings in document order, detect the structural intent from their nesting pattern, then remap to a clean H1 → H2 → H3 hierarchy.
from bs4 import Tag

HEADING_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def get_heading_level(tag: Tag) -> int:
    return int(tag.name[1])


def fix_heading_hierarchy(article: BeautifulSoup, base_title: str = "") -> BeautifulSoup:
    headings = article.find_all(HEADING_TAGS)
    if not headings:
        return article
    # Detect the minimum heading level actually used
    # If the post uses H2, H3, H4 — the "root" is H2, not H1
    levels_used = sorted(set(get_heading_level(h) for h in headings))
    min_level = levels_used[0]
    # If H1 is used inside the article body, it's almost certainly wrong
    # (the page H1 should be the <title>, not inside <article>)
    # Remap everything so the topmost level becomes H2
    target_root = 2
    offset = target_root - min_level
    for heading in headings:
        current_level = get_heading_level(heading)
        new_level = min(current_level + offset, 6)  # Cap at H6
        heading.name = f"h{new_level}"
    return article
The offset calculation is the key insight: rather than hardcoding rules, you find the gap between what exists and what should exist, then shift everything uniformly. A post using H1/H2/H3 becomes H2/H3/H4. A post using H3/H4 (skipping H1 and H2 entirely, which happens more than you'd think) becomes H2/H3.
What this misses: Genuinely broken hierarchies where someone used headings for visual styling rather than structure — like an H4 in the middle of body text because they liked the font size. This handles structural drift, not semantic abuse. For that case, you'd need to check whether the heading has substantial text content around it, which gets complicated fast.
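To see the remap in action, here's a quick check on a hypothetical fragment (made-up HTML, just to illustrate the offset):

# Hypothetical fragment: the body misuses H1 and then jumps to H3/H4.
html = "<div><h1>Intro</h1><p>Some text.</p><h3>Details</h3><h4>Edge cases</h4></div>"
fragment = BeautifulSoup(html, "lxml").div

fixed = fix_heading_hierarchy(fragment)
print([h.name for h in fixed.find_all(HEADING_TAGS)])
# ['h2', 'h4', 'h5'] because the minimum level was 1, so every heading shifts by one
# (H1 to H2, H3 to H4, H4 to H5); gaps between levels are preserved, not closed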
Step 3: Generate Meta Descriptions
The goal: a 150–160 character description that represents the article, not the page chrome. Two approaches I tested:
Approach 1 (what I tried first): Extract the first <p> from the article body and truncate to 155 characters. Clean, simple.
Why it broke: First paragraphs are often intro fluff. "Welcome to our blog. Today we're going to talk about..." gets truncated to a sentence that tells readers nothing.
Approach 2 (what I use now): Find the first <p> with more than 80 characters that doesn't contain any navigation-adjacent terms, then trim to 155 characters at a word boundary.
import re


def generate_meta_description(article: BeautifulSoup, max_length: int = 155) -> str:
    # Skip paragraphs that look like navigation or metadata
    skip_patterns = re.compile(
        r"(published|updated|author|tags:|categories:|share this|follow us|subscribe)",
        re.IGNORECASE
    )
    paragraphs = article.find_all("p")
    candidate = ""
    for p in paragraphs:
        text = p.get_text(strip=True)
        # Too short to be body content
        if len(text) < 80:
            continue
        # Looks like metadata
        if skip_patterns.search(text):
            continue
        candidate = text
        break
    if not candidate:
        # Fallback: concatenate all heading text
        headings = article.find_all(["h2", "h3"])
        candidate = " — ".join(h.get_text(strip=True) for h in headings[:3])
    # Trim to max_length at a word boundary
    if len(candidate) <= max_length:
        return candidate
    trimmed = candidate[:max_length].rsplit(" ", 1)[0]
    return trimmed.rstrip(".,;:") + "..."
The rsplit(" ", 1)[0] is the detail that matters: it finds the last space before your character limit, so you never cut mid-word. Then strip trailing punctuation before the ellipsis — "and the results were impressive,..." looks wrong.
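Here's the behavior on a hypothetical overlong paragraph (the text is made up, just to show the trim):

# Made-up paragraph that is well over 155 characters, to show the trim.
html = (
    "<div><p>We migrated 300 legacy posts to a new heading structure, regenerated "
    "every meta description by hand, and then validated the schema markup against "
    "rich result testing tools to make sure nothing regressed.</p></div>"
)
print(generate_meta_description(BeautifulSoup(html, "lxml").div))
# Cuts at the last full word before 155 characters and appends "..."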
Step 4: Inject Schema Markup
JSON-LD Article schema is what makes your posts eligible for rich results in Google Search. The markup needs to go in the <head>, and it needs to be valid every time.
import json
from datetime import datetime


def generate_article_schema(
    title: str,
    description: str,
    url: str,
    author_name: str,
    date_published: str,  # ISO 8601 format: "2024-03-15"
    date_modified: str | None = None,
    image_url: str | None = None,
) -> str:
    schema = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": title[:110],  # Schema spec: max 110 chars for headline
        "description": description,
        "url": url,
        "author": {
            "@type": "Person",
            "name": author_name,
        },
        "publisher": {
            "@type": "Organization",
            "name": author_name,  # Adjust to org name if applicable
        },
        "datePublished": date_published,
        "dateModified": date_modified or date_published,
    }
    if image_url:
        schema["image"] = {
            "@type": "ImageObject",
            "url": image_url,
        }
    # json.dumps with indent=2 for readability in source,
    # but ensure_ascii=False to handle non-ASCII author names
    return json.dumps(schema, indent=2, ensure_ascii=False)
def inject_schema(soup: BeautifulSoup, schema_json: str) -> BeautifulSoup:
    script_tag = soup.new_tag("script", type="application/ld+json")
    script_tag.string = schema_json
    head = soup.find("head")
    if head:
        # Remove any existing Article schema to avoid duplicates
        for existing in head.find_all("script", type="application/ld+json"):
            try:
                existing_data = json.loads(existing.string or "")
                if existing_data.get("@type") == "Article":
                    existing.decompose()
            except json.JSONDecodeError:
                pass  # Leave malformed script tags alone
        head.append(script_tag)
    return soup
The existing schema removal is something I wish I'd included from the start. Running this pipeline twice on the same file without deduplication creates duplicate JSON-LD blocks — which Google's structured data validator flags as an error.
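A quick way to convince yourself the dedup works: inject twice into a throwaway document and count the Article blocks afterwards. A small sanity-check sketch:

# Sanity check: running the injection twice should leave exactly one block.
doc = BeautifulSoup("<html><head><title>t</title></head><body></body></html>", "lxml")
schema = generate_article_schema(
    title="Test post",
    description="A test description.",
    url="https://yourblog.com/test",
    author_name="Your Name",
    date_published="2024-03-15",
)
doc = inject_schema(doc, schema)
doc = inject_schema(doc, schema)  # second run removes the first copy, then re-adds
print(len(doc.head.find_all("script", type="application/ld+json")))  # 1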
Step 5: Fix Image Alt Text
While you're in there: images with missing or meaningless alt text are an easy SEO win that's usually completely ignored.
def fix_image_alt_text(article: BeautifulSoup) -> dict:
    images = article.find_all("img")
    fixed = 0
    needs_manual_review = []
    for img in images:
        alt = img.get("alt", "").strip()
        src = img.get("src", "")
        # No alt attribute at all — add empty string for decorative images
        # (empty alt is correct for decorative; missing alt is wrong)
        if "alt" not in img.attrs:
            img["alt"] = ""
            fixed += 1
            continue
        # Meaningless alt text — flag for manual review
        meaningless = ["image", "photo", "picture", "img", "screenshot", ""]
        if alt.lower() in meaningless:
            needs_manual_review.append(src)
    return {
        "auto_fixed": fixed,
        "needs_review": needs_manual_review
    }
I made an intentional choice here: don't auto-generate alt text from filenames. screenshot-2024-03-14-at-9.32am.png → "Screenshot 2024 03 14 at 9 32am" is worse than a blank alt. Flag it for human review instead.
Step 6: Wire It All Together
def process_article(
    url: str,
    author_name: str,
    date_published: str,
    output_path: str,
) -> dict:
    print(f"Processing: {url}")
    soup = fetch_article(url)
    if not soup:
        return {"status": "error", "reason": "fetch_failed", "url": url}
    article = extract_article_body(soup)
    if not article:
        return {"status": "error", "reason": "no_article_body", "url": url}
    # Fix structure
    article = fix_heading_hierarchy(article)
    image_report = fix_image_alt_text(article)
    # Generate SEO fields
    title = soup.find("title")
    title_text = title.get_text(strip=True) if title else "Untitled"
    description = generate_meta_description(article)
    # Inject or update meta description
    existing_meta = soup.find("meta", attrs={"name": "description"})
    if existing_meta:
        existing_meta["content"] = description
    else:
        meta_tag = soup.new_tag("meta", attrs={"name": "description", "content": description})
        soup.head.append(meta_tag)
    # Find OG image for schema
    og_image = soup.find("meta", property="og:image")
    image_url = og_image["content"] if og_image else None
    # Inject schema
    schema_json = generate_article_schema(
        title=title_text,
        description=description,
        url=url,
        author_name=author_name,
        date_published=date_published,
        image_url=image_url,
    )
    soup = inject_schema(soup, schema_json)
    # Write output
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(str(soup))
    return {
        "status": "success",
        "url": url,
        "meta_description": description,
        "images_fixed": image_report["auto_fixed"],
        "images_needing_review": image_report["needs_review"],
    }
Running this against a list of URLs:
import json

articles = [
    {
        "url": "https://yourblog.com/post-1",
        "date_published": "2023-06-15",
        "output": "./formatted/post-1.html"
    },
    # ... more articles
]

results = []
for article in articles:
    result = process_article(
        url=article["url"],
        author_name="Your Name",
        date_published=article["date_published"],
        output_path=article["output"],
    )
    results.append(result)

# Summary report
success = [r for r in results if r["status"] == "success"]
errors = [r for r in results if r["status"] == "error"]
print(f"\nProcessed: {len(success)} success, {len(errors)} errors")
for e in errors:
    print(f"  FAILED: {e['url']} — {e['reason']}")

with open("seo_report.json", "w") as f:
    json.dump(results, f, indent=2)
What Can Go Wrong
JavaScript-rendered content. If the blog runs on a JS framework and doesn't server-render, requests gets the skeleton, not the content. You'll need Playwright or Selenium for those. I handle this by checking if the extracted article body has fewer than 200 characters — that's a strong signal the content didn't load.
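The guard is small enough to live right after extract_article_body. A sketch; looks_js_rendered and the "probably_js_rendered" reason string are names I'm introducing here, not part of the pipeline above:

def looks_js_rendered(article) -> bool:
    # Almost no text in the extracted body usually means the real content
    # was rendered client-side and never made it into the raw HTML.
    return len(article.get_text(strip=True)) < 200

# In process_article, right after extract_article_body:
# if looks_js_rendered(article):
#     return {"status": "error", "reason": "probably_js_rendered", "url": url}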
Rate limiting. If you're processing your own posts by scraping your live site, you're going to hit yourself with a self-DDoS. Add a time.sleep(1) between requests, or — better — process from local HTML files if you have access to them.
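In the batch loop from Step 6, that's one extra line (a sketch of the polite version):

import time

for article in articles:
    result = process_article(
        url=article["url"],
        author_name="Your Name",
        date_published=article["date_published"],
        output_path=article["output"],
    )
    results.append(result)
    time.sleep(1)  # one request per second is plenty gentle on a small blog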
The heading offset can still mislead. If a post uses only H5 and H6 (it happens, usually from copy-pasted content), the negative offset maps them to H2/H3 correctly. But if someone used H1 for a subheading while the article legitimately has H2/H3 sections, the stray H1 drags everything with it: the misused H1 becomes the top-level H2 and the legitimate H2/H3 sections get demoted to H3/H4. There's no perfect mechanical fix for this — it's a sign the original content needs human attention.
Schema headline length. Google's spec says Article schema headlines should be 110 characters maximum. Slice at 110 before adding to the schema object — I got a structured data validation error on three articles before I caught this.
Duplicate meta descriptions. Some templates already set meta descriptions via <meta property="og:description"> without a <meta name="description">. Search engines read both, but they're separate tags. This pipeline only handles name="description". Add a second pass for og:description if your templates use that instead.
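That second pass can mirror the name="description" handling from Step 6. A sketch; update_og_description is a helper name I'm introducing, not part of the pipeline above:

def update_og_description(soup: BeautifulSoup, description: str) -> None:
    # Same logic as the name="description" handling, but for the Open Graph tag.
    og_desc = soup.find("meta", property="og:description")
    if og_desc:
        og_desc["content"] = description
    elif soup.head:
        tag = soup.new_tag("meta", attrs={"property": "og:description", "content": description})
        soup.head.append(tag)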
Results
After running this against my 300 posts: 284 processed successfully, 16 errors (mostly JavaScript-rendered content). Of those 284:
- 100% now have valid Article schema (verified via Google Rich Results Test)
- 73 posts had heading hierarchy corrected — mostly old posts where H1 was used inside the body
- 41 images were auto-fixed (missing alt attribute added as empty string)
- 89 images flagged for manual alt text review
Google Search Console showed an increase in structured data coverage within the next crawl cycle. Traffic takes longer to move — that's a different story.
Extending This
A few directions worth exploring from here:
- Add robots.txt checking before scraping external URLs — you don't want to process content you shouldn't be touching
- Keyword density analysis — collections.Counter on the article body text against a target keyword list
- Internal link detection — flag posts with zero internal links, which are common in older content and hurt crawlability (a rough sketch follows this list)
- Bulk processing with asyncio + aiohttp — the synchronous requests version is fine at 300 posts, but at 3,000 you want async fetching
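As a taste of the internal-link idea above, a rough sketch (count_internal_links is a hypothetical helper; you pass your own domain, and it only counts links, it doesn't judge them):

from urllib.parse import urlparse

def count_internal_links(article, site_domain: str) -> int:
    # Count links that stay on the same domain; relative paths count too.
    count = 0
    for a in article.find_all("a", href=True):
        href = a["href"]
        if href.startswith("#"):
            continue  # same-page anchors aren't internal links in the SEO sense
        parsed = urlparse(href)
        if not parsed.netloc or parsed.netloc.endswith(site_domain):
            count += 1
    return count

# e.g. flag for manual review when count_internal_links(article, "yourblog.com") == 0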
The full pipeline is around 200 lines. If there's interest, I'll put it on GitHub — drop a comment.
What's the messiest SEO problem you've had to automate your way out of? I'm curious what other structural issues show up in older content — especially across different CMS platforms.
Tags: #python #webdev #seo #tutorial