Content creators spend 30 minutes per article extracting metadata. Here's a Python script that does it in 10 seconds.
I've watched this happen: open the article, read it twice, draft a meta description, pick 5-8 SEO tags, write an Open Graph summary, think up a social caption, argue with yourself about the title. Multiply that by 50 articles a month and you've burned a full workday on metadata that nobody directly reads.
This article walks through building a CLI tool that takes raw markdown and outputs structured JSON with SEO tags, meta descriptions, social post drafts, and content summaries — all in under 10 seconds.
Why Automate This
Metadata isn't hard. It's expensive. You just finished writing; now you need to think like an SEO analyst and a social media manager simultaneously. That context switch costs real time.
Consistency is the second issue. Across a content library, human-generated metadata falls apart — some articles have 3 tags, some have 15. Descriptions range from 80 to 300 characters with no pattern.
Automation fixes both: zero cognitive switching, enforced output schema, identical results whether you're processing article 1 or article 500.
What We're Building
Three layers:
- Input — reads markdown or plain text from disk
- Claude API wrapper — sends structured prompt, parses JSON response
- Output — writes metadata as JSON, optionally as YAML frontmatter
Tools: anthropic SDK, click for CLI, rich for terminal output, concurrent.futures for batch processing.
bash
pip install anthropic click rich python-frontmatter
Set your API key:
bash
export ANTHROPIC_API_KEY="sk-ant-..."
The Core Extractor
This is extractor.py — where the actual work happens.
python
import anthropic
import json
import re
from pathlib import Path
client = anthropic.Anthropic()
METADATA_PROMPT = """You are a content strategist and SEO specialist. Analyze the article below and return ONLY a valid JSON object with no additional text.
Required JSON structure:
{
"title": "Optimized SEO title (60 chars max)",
"meta_description": "Compelling meta description (150-160 chars)",
"seo_tags": ["tag1", "tag2", "tag3", "tag4", "tag5"],
"summary": "2-3 sentence content summary for internal use",
"social_post": "LinkedIn/Twitter-ready post with hook (280 chars max)",
"reading_time_minutes": 5,
"primary_keyword": "main target keyword",
"content_category": "tutorial|opinion|news|case-study|reference"
}
Rules:
- seo_tags: 5-8 tags, lowercase, no spaces (use hyphens)
- social_post: start with a hook statement, end with a question or CTA
- reading_time_minutes: estimate based on 200 words per minute
- Return ONLY the JSON object, no markdown fences, no explanation
Article:
{article_content}
"""
def smart_truncate(content: str, max_chars: int = 8000) -> str:
"""Truncate at paragraph boundary to preserve semantic coherence."""
if len(content) <= max_chars:
return content
truncated = content[:max_chars]
last_para = truncated.rfind("\n\n")
if last_para > max_chars * 0.7:
return truncated[:last_para].strip()
return truncated.strip()
def extract_metadata(content: str, model: str = "claude-opus-4-5") -> dict:
"""Send article content to Claude and parse the JSON response."""
prompt = METADATA_PROMPT.format(article_content=smart_truncate(content))
message = client.messages.create(
model=model,
max_tokens=1024,
messages=[
{"role": "user", "content": prompt}
]
)
raw_response = message.content[0].text.strip()
# Strip markdown code fences if Claude added them
if raw_response.startswith(""):
raw_response = re.sub(r"^[a-z]*\n?", "", raw_response)
raw_response = re.sub(r"\n?$", "", raw_response)
return json.loads(raw_response)
def process_file(filepath: str | Path) -> dict:
"""Read a file and return its metadata."""
path = Path(filepath)
if not path.exists():
raise FileNotFoundError(f"File not found: {filepath}")
content = path.read_text(encoding="utf-8")
# Strip YAML frontmatter if present
if content.startswith("---"):
parts = content.split("---", 2)
if len(parts) >= 3:
content = parts[2].strip()
metadata = extract_metadata(content)
metadata["source_file"] = str(path.name)
return metadata
The smart_truncate function is crucial at scale. I ran this on 200 articles and hit json.JSONDecodeError on ~15 files. The issue: truncating at a hard character limit sometimes cuts mid-sentence, confusing the model. Solution: find the last paragraph break before the limit. Error rate dropped to zero.
Wire It to a CLI
Here's cli.py:
python
import click
import json
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from rich.console import Console
from rich.table import Table
from rich.progress import Progress, SpinnerColumn, TextColumn
from extractor import process_file
console = Console()
@click.group()
def cli():
"""Content metadata extractor powered by Claude API."""
pass
@cli.command()
@click.argument("filepath", type=click.Path(exists=True))
@click.option("--output", "-o", type=click.Path(), help="Write JSON to file")
@click.option("--format", "fmt", type=click.Choice(["json", "table"]), default="json")
@click.option("--model", default="claude-opus-4-5", help="Claude model to use")
def extract(filepath, output, fmt, model):
"""Extract metadata from a single article file."""
with Progress(SpinnerColumn(), TextColumn("[progress.description]{task.description}")) as progress:
task = progress.add_task("Analyzing article...", total=None)
result = process_file(filepath)
progress.remove_task(task)
if fmt == "table":
table = Table(title=f"Metadata: {filepath}", show_lines=True)
table.add_column("Field", style="cyan", no_wrap=True)
table.add_column("Value", style="white")
for key, value in result.items():
display = json.dumps(value) if isinstance(value, list) else str(value)
table.add_row(key, display[:120])
console.print(table)
else:
output_json = json.dumps(result, indent=2)
if output:
Path(output).write_text(output_json)
console.print(f"[green]✓[/green] Written to {output}")
else:
console.print(output_json)
@cli.command()
@click.argument("directory", type=click.Path(exists=True, file_okay=False))
@click.option("--output-dir", "-o", type=click.Path(), default="./metadata_output")
@click.option("--workers", "-w", default=5, help="Parallel workers (default: 5)")
@click.option("--glob", default="*.md", help="File pattern (default: *.md)")
def batch(directory, output_dir, workers, glob):
"""Process all articles in a directory."""
input_dir = Path(directory)
out_dir = Path(output_dir)
out_dir.mkdir(parents=True, exist_ok=True)
files = list(input_dir.glob(glob))
if not files:
console.print(f"[yellow]No files matching '{glob}' in {directory}[/yellow]")
sys.exit(1)
console.print(f"[blue]Processing {len(files)} files with {workers} workers...[/blue]")
results = []
errors = []
with Progress() as progress:
task = progress.add_task("Extracting metadata...", total=len(files))
with ThreadPoolExecutor(max_workers=workers) as executor:
future_to_file = {executor.submit(process_file, f): f for f in files}
for future in as_completed(future_to_file):
filepath = future_to_file[future]
progress.advance(task)
try:
result = future.result()
results.append(result)
# Write individual JSON file
out_path = out_dir / f"{filepath.stem}_metadata.json"
out_path.write_text(json.dumps(result, indent=2))
except Exception as e:
errors.append({"file": str(filepath), "error": str(e)})
console.print(f"[red]✗[/red] {filepath.name}: {e}")
# Write combined manifest
manifest_path = out_dir / "_manifest.json"
manifest_path.write_text(json.dumps({
"total": len(files),
"success": len(results),
"errors": len(errors),
"articles": results
}, indent=2))
console.print(f"\n[green]Done![/green] {len(results)}/{len(files)} files processed.")
console.print(f"Output: {out_dir.resolve()}")
console.print(f"Manifest: {manifest_path.resolve()}")
if errors:
console.print(f"\n[yellow]{len(errors)} errors logged in manifest.[/yellow]")
if name == "main":
cli()
The batch command is where speed comes from. ThreadPoolExecutor with 5 workers makes 5 concurrent API calls. A 100-article run drops from ~17 minutes to ~3-4 minutes.
Running It
Single file:
bash
python cli.py extract ./articles/my-post.md --format table
python cli.py extract ./articles/my-post.md -o ./output/my-post-meta.json
Batch process:
bash
python cli.py batch ./articles/ --output-dir ./metadata/ --workers 8 --glob "*.md"
The _manifest.json file becomes your content index — searchable, normalized tags, categorized for auditing.
Extending: Multi-Channel Social
The base prompt gives you one social post. For full multi-channel output, add a second API call:
python
SOCIAL_PROMPT = """You are a social media copywriter. Based on this article metadata, generate social content.
Article title: {title}
Summary: {summary}
Primary keyword: {primary_keyword}
Return ONLY a valid JSON object:
{{
"linkedin_hook": "First 2 lines of a LinkedIn post (hook only, 200 chars max)",
"twitter_thread": [
"Tweet 1 of 5: hook/claim (280 chars max)",
"Tweet 2 of 5: supporting point",
"Tweet 3 of 5: supporting point",
"Tweet 4 of 5: key insight or data",
"Tweet 5 of 5: CTA or question"
],
"email_subject_lines": [
"Subject line option 1 (50 chars max)",
"Subject line option 2 — curiosity gap style",
"Subject line option 3 — direct benefit style"
],
"newsletter_teaser": "2-sentence newsletter blurb to drive clicks"
}}
"""
def generate_social_pack(metadata: dict) -> dict:
"""Generate extended social content from existing metadata."""
prompt = SOCIAL_PROMPT.format(
title=metadata.get("title", ""),
summary=metadata.get("summary", ""),
primary_keyword=metadata.get("primary_keyword", "")
)
message = client.messages.create(
model="claude-haiku-4-5",
max_tokens=1024,
messages=[{"role": "user", "content": prompt}]
)
raw = message.content[0].text.strip()
if raw.startswith(""):
raw = re.sub(r"^[a-z]*\n?", "", raw)
raw = re.sub(r"\n?$", "", raw)
return json.loads(raw)
Note the use of claude-haiku-4-5 for the second pass. Lighter summarization tasks don't need Opus. On 100 articles, the cost difference is material.
The Product Angle
The _manifest.json output is an API response waiting to happen. Wrap it behind FastAPI, add a file upload UI, and you have a content ops tool. Plug it into any CMS API (Contentful, Sanity, WordPress REST) to write metadata back to your articles automatically.
Agencies pay $50-200/month for this kind of tool.
Get Started
Create a folder, drop in both files from above, then:
bash
pip install anthropic click rich python-frontmatter
export ANTHROPIC_API_KEY="sk-ant-your-key"
python cli.py extract ./article.md --format table
python cli.py batch ./articles/ --output-dir ./metadata/ --workers 5
Customize everything in the METADATA_PROMPT string. That's where your domain knowledge lives — adjust the rules for your content types, adjust the JSON schema for your workflow.
Run this on your entire blog once. You'll get back a normalized metadata library. Plug it into your CMS. Never write a meta description by hand again.
Follow for more practical AI and productivity content.
Top comments (0)