From Sitemap to Word Docs (Fast!): A Friendly, Non-Tech Guide

Need to copy your website’s text into Microsoft Word—clean, organized, and ready to edit? This short guide shows how to turn a site’s sitemap into a folder of Word files—one per page—without getting lost in code.

Want the full, copy-and-paste script with a one-click “Copy” UI?  

What You’ll Get (and Why It’s Useful)
A Word document for every page listed in your sitemap (perfect for editing, rewrites, and audits).
Headings preserved as real Word headings (H1→Heading 1, H2→Heading 2, etc.).
Meta Title (“Title tag”) and Meta Description extracted for each page.
No images or scripts—just the copy you care about.

This is ideal for content reviews, tone-of-voice cleanups, SEO audits, migrations, and stakeholder approvals where Word is the common ground.

First, What’s a Sitemap?

A sitemap is a site’s “table of contents” for search engines. Many sites publish it at /sitemap.xml or /sitemap. For example:

https://yourdomain.com/sitemap.xml
https://yourdomain.com/sitemap

If one doesn’t work, try the other (or search Google: site:yourdomain.com sitemap).
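If you open the XML version in a browser, it's usually just a list of <loc> entries. A minimal example (with illustrative URLs, not taken from any real site) looks roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yourdomain.com/</loc></url>
  <url><loc>https://yourdomain.com/services</loc></url>
  <url><loc>https://yourdomain.com/blog/hello-world</loc></url>
</urlset>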

What This Mini-Guide Covers

We’ll outline the simple steps to:

Install Python (if you don’t have it)
Create a tiny project folder
Install a few helper tools
Run a script that reads your sitemap and writes Word files

We’ll use tiny code snippets here. For the full script with extras (filters, CSV reports, per-page docs, and copy buttons), use the complete guide.

Step 1 — Install Python

Download and install Python 3.10+ from python.org. On Windows, tick “Add Python to PATH.” Confirm it’s installed:

python --version
Step 2 — Make a Project Folder
mkdir website-to-word
cd website-to-word

(Optional but recommended) Create a virtual environment so everything stays tidy:

python -m venv .venv

# Windows
.venv\Scripts\Activate

# macOS/Linux
source .venv/bin/activate
Step 3 — Install the Few Tools We Need

We’ll keep it light: page fetching, HTML parsing, and Word writing.

requests – fetch pages
beautifulsoup4 + lxml – parse HTML & XML
readability-lxml – fallback content cleanup
python-docx – create Word files
pip install requests beautifulsoup4 lxml readability-lxml python-docx
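
If you'd like to confirm the installs before moving on, an optional one-liner imports each library under its Python module name (beautifulsoup4 imports as bs4, readability-lxml as readability, and python-docx as docx):

python -c "import requests, bs4, lxml, readability, docx; print('All set!')"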
Prefer Zero-Code Copy/Paste?

Skip ahead: the full article includes a ready-to-run script with copy buttons, optional filters, and a CSV report. Open it in a new tab: Turn a Sitemap Into Hundreds of Word Pages.

Step 4 — Use a Short, Beginner-Friendly Script

Below is a compact version that:

finds URLs in your sitemap (XML or simple HTML sitemap)
grabs each page
extracts Title/Description and the main text
writes a Word file per page

Save this as sitemap_to_word_min.py in your project folder.

sitemap_to_word_min.py (compact version)

import os, re, gzip, time
from urllib.parse import urlparse, urljoin

import requests
from bs4 import BeautifulSoup
from lxml import etree, html as lxml_html
from readability import Document
from docx import Document as Docx
from docx.shared import Pt
from docx.oxml.ns import qn

UA = {"User-Agent": "ZigmaContentFetcher/1.0 (+https://zigma.ca)"}

def get(url, tries=3, t=20):
    # Fetch a URL with a few retries and exponential backoff.
    for i in range(tries):
        try:
            return requests.get(url, headers=UA, timeout=t, allow_redirects=True)
        except Exception:
            time.sleep(0.8 * (2 ** i))

def links_from_html_sitemap(content, base):
    # Collect same-host links from an HTML sitemap page.
    soup = BeautifulSoup(content, "lxml")
    host = urlparse(base).netloc
    out = []
    for a in soup.select("a[href]"):
        href = a.get("href", "").strip()
        if not href or href.startswith(("mailto:", "tel:", "javascript:", "#")):
            continue
        absu = urljoin(base, href)
        p = urlparse(absu)
        if p.scheme in ("http", "https") and p.netloc == host:
            out.append(absu.split("#")[0])
    # dedupe, preserving order
    seen = set(); res = []
    for u in out:
        if u not in seen:
            res.append(u); seen.add(u)
    return res

def parse_sitemap_xml(content, base):
    # Return (child sitemaps, page URLs) from an XML sitemap or sitemap index.
    cm, urls = [], []
    try:
        root = etree.fromstring(content, parser=etree.XMLParser(recover=True, huge_tree=True))
        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        for loc in root.findall(".//sm:sitemap/sm:loc", ns):
            cm.append(urljoin(base, loc.text.strip()))
        for loc in root.findall(".//sm:url/sm:loc", ns):
            urls.append(urljoin(base, loc.text.strip()))
    except Exception:
        pass
    if not cm and not urls:
        # Fallback parse for sitemaps that don't use the standard namespace.
        soup = BeautifulSoup(content, "xml")
        for sm in soup.find_all("sitemap"):
            loc = sm.find("loc")
            if loc and loc.text:
                cm.append(urljoin(base, loc.text.strip()))
        for u in soup.find_all("url"):
            loc = u.find("loc")
            if loc and loc.text:
                urls.append(urljoin(base, loc.text.strip()))
    def d(seq):  # dedupe, preserving order
        s = set(); o = []
        for x in seq:
            if x not in s:
                o.append(x); s.add(x)
        return o
    return d(cm), d(urls)

def gather(smap):
    # Walk the sitemap (and any child sitemaps) and return every page URL found.
    todo = [smap]; seen = set(); found = []
    while todo:
        cur = todo.pop()
        if cur in seen:
            continue
        seen.add(cur)
        r = get(cur, t=25)
        if not r or r.status_code != 200:
            continue
        ct = (r.headers.get("Content-Type") or "").lower()
        content = r.content
        if cur.endswith(".gz") or "gzip" in ct:
            try:
                content = gzip.decompress(content); ct = "application/xml"
            except Exception:
                pass
        if "text/html" in ct or cur.endswith("/sitemap"):
            found += links_from_html_sitemap(content, cur)
            continue
        children, urls = parse_sitemap_xml(content, cur)
        found += urls
        todo += children
    # keep only http(s) URLs, dedupe
    s = set(); out = []
    for u in found:
        if u not in s and urlparse(u).scheme in ("http", "https"):
            out.append(u); s.add(u)
    return out

def meta(doc):
    # Pull the <title> and meta description, with og: fallbacks.
    t = doc.xpath("//title")
    title = t[0].text.strip() if t and t[0].text else ""
    md = doc.xpath('//meta[translate(@name,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="description"]/@content')
    descr = md[0].strip() if md else ""
    if not title:
        ogt = doc.xpath('//meta[translate(@property,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="og:title"]/@content')
        if ogt: title = ogt[0].strip()
    if not descr:
        ogd = doc.xpath('//meta[translate(@property,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="og:description"]/@content')
        if ogd: descr = ogd[0].strip()
    return title, descr

def clean(html_bytes):
    # Strip scripts, images, and layout tags; keep only headings, paragraphs, and lists.
    soup = BeautifulSoup(html_bytes, "lxml")
    root = soup.find("main") or soup.find("article") or soup.body or soup
    if root:
        for t in root.find_all(["script", "style", "noscript", "iframe", "svg", "video", "picture", "source", "form"]):
            t.decompose()
        for img in root.find_all("img"):
            img.decompose()
        for t in root.find_all(["a", "span", "div", "section", "header", "footer", "em", "strong", "b", "i", "u", "small", "sup", "sub"]):
            t.unwrap()
        allowed = {"h1", "h2", "h3", "h4", "p", "ul", "ol", "li", "br"}
        for t in list(root.find_all(True)):
            if t.name not in allowed and t.name not in ("body", "html"):
                t.unwrap()
        # wrap the surviving fragments so they can be re-parsed later
        return "<div>" + "".join(str(c) for c in root.children) + "</div>"
    # fallback: Readability
    try:
        h = Document(html_bytes).summary(html_partial=True)
        s = BeautifulSoup(h, "lxml")
        for t in s(["script", "style", "noscript", "iframe", "svg", "video", "picture", "source"]):
            t.decompose()
        for img in s.find_all("img"):
            img.decompose()
        allowed = {"h1", "h2", "h3", "h4", "p", "ul", "ol", "li", "br"}
        for t in list(s.find_all(True)):
            if t.name not in allowed and t.name not in ("body", "html"):
                t.unwrap()
        body = s.body or s
        return "<div>" + "".join(str(c) for c in body.children) + "</div>"
    except Exception:
        return "<div></div>"

def write_doc(url, title, descr, body, path):
    # Build one Word document: URL + meta tags first, then headings/paragraphs/lists.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    doc = Docx()
    st = doc.styles["Normal"]
    st.font.name = "Calibri"
    st.element.rPr.rFonts.set(qn("w:eastAsia"), "Calibri")
    st.font.size = Pt(11)
    doc.add_heading(title or "(Untitled Page)", 1)
    doc.add_paragraph(f"URL: {url}")
    if title: doc.add_paragraph(f"Title tag: {title}")
    if descr: doc.add_paragraph(f"Description tag: {descr}")
    soup = BeautifulSoup(body, "lxml")
    for el in soup.find_all(["h1", "h2", "h3", "h4", "p", "ul", "ol"], recursive=True):
        n = el.name
        if n in ("h1", "h2", "h3", "h4"):
            lvl = {"h1": 1, "h2": 2, "h3": 3, "h4": 4}[n]
            txt = el.get_text(" ", strip=True)
            if txt: doc.add_heading(txt, level=lvl)
        elif n == "p":
            txt = el.get_text(" ", strip=True)
            if txt: doc.add_paragraph(txt)
        else:
            ordered = (n == "ol")
            for li in el.find_all("li", recursive=False):
                t = li.get_text(" ", strip=True)
                if not t:
                    continue
                p = doc.add_paragraph(t)
                try:
                    p.style = "List Number" if ordered else "List Bullet"
                except Exception:
                    pass
    doc.save(path)

def run(sitemap, outdir, max_urls=None):
    print("[1/3] Reading sitemap...")
    urls = gather(sitemap)
    if max_urls: urls = urls[:max_urls]
    if not urls:
        return print("No URLs found.")
    print(f"[2/3] Writing docs to {outdir} ...")
    os.makedirs(outdir, exist_ok=True)
    for i, u in enumerate(urls, 1):
        r = get(u, t=25)
        if not r or r.status_code >= 400:
            print("SKIP", u); continue
        ct = (r.headers.get("Content-Type") or "").lower()
        if "html" not in ct:
            print("SKIP non-HTML", u); continue
        doc = lxml_html.fromstring(r.content); doc.make_links_absolute(u)
        title, descr = meta(doc)
        body = clean(r.content)
        slug = re.sub(r"[^a-z0-9.-]+", "", (title or urlparse(u).path or "page").lower())[:80] or "page"
        path = os.path.join(outdir, f"{i:04d}_{slug}.docx")
        write_doc(u, title, descr, body, path)
        print("OK ", u, "->", os.path.basename(path))

if __name__ == "__main__":
    import argparse
    ap = argparse.ArgumentParser(description="Export sitemap pages to Word")
    ap.add_argument("--sitemap", required=True)
    ap.add_argument("--outdir", required=True)
    ap.add_argument("--max-urls", type=int, default=None)
    a = ap.parse_args()
    run(a.sitemap, a.outdir, a.max_urls)

This compact script is kept intentionally short. The full version adds filters, CSV output, per-page folders, and “Copy” buttons in the article for zero-friction setup.

Step 5 — Run It

From your project folder (with the virtual environment activated):

# Windows PowerShell
python .\sitemap_to_word_min.py --sitemap https://yourdomain.com/sitemap.xml --outdir .\exports --max-urls 50

# macOS / Linux
python ./sitemap_to_word_min.py --sitemap https://yourdomain.com/sitemap.xml --outdir ./exports --max-urls 50

Open the exports folder—you’ll see files like 0001_home.docx, 0002_services.docx, etc. Each starts with the URL, Title tag, Description tag, and then the page’s headings and paragraphs.

Smart Tips (Save Time & Headaches)
Test small first: Use --max-urls 10 to verify everything works before exporting the full site.
HTML or XML sitemap? The script handles both. If /sitemap.xml isn’t found, try /sitemap.
SEO audits: With everything in Word, it’s easy to check Title/Description presence, heading structure, and thin content.
Images & embeds: Intentionally removed to keep your docs lightweight and copy-focused.
Permissions matter: Only export content you own or have explicit permission to copy; respect robots and terms.
Troubleshooting (Quick Fixes)

“No URLs found.” Try the other sitemap URL (HTML vs XML). Some sites use a sitemap index that points to multiple child sitemaps—this script follows those too.
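
For reference, a sitemap index is just another small XML file whose <loc> entries point at child sitemaps; the script queues each child and keeps going. An illustrative example:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://yourdomain.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://yourdomain.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>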

“Output looks short / H1 missing.” Some hero text is injected by JavaScript and won’t appear in raw HTML. Most marketing sites still work fine. For heavy JS pages, consider a browser scraper (e.g., Playwright) or see our full guide for richer extraction options.
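
If you do hit a JavaScript-heavy site, one option is to swap the requests fetch for a headless browser. Here is a minimal sketch using Playwright (not part of the script above; it assumes you've run pip install playwright and playwright install chromium) that could feed rendered HTML into the same clean()/meta() pipeline:

# Hypothetical helper: fetch fully rendered HTML with a headless browser.
# Assumes: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def get_rendered_html(url: str) -> bytes:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for JS-injected content
        html = page.content()
        browser.close()
    return html.encode("utf-8")

# Drop-in idea: pass get_rendered_html(u) to clean() instead of r.content.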

“Where did my files go?” They’re saved in the folder you passed to --outdir. If unsure, use an absolute path like --outdir "C:\Users\you\Desktop\exports".

“I want one giant Word doc, not many.” Totally doable—append pages into the same Docx() object and add doc.add_page_break() between pages.  
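
Here's a rough sketch of that variation, reusing gather(), get(), meta(), and clean() from the compact script (the file names and the simplified body loop are just illustrative):

# one_big_doc.py -- illustrative sketch: every page appended to a single Word file
from bs4 import BeautifulSoup
from lxml import html as lxml_html
from docx import Document as Docx
from sitemap_to_word_min import gather, get, meta, clean  # reuse the helpers above

master = Docx()
for u in gather("https://yourdomain.com/sitemap.xml"):
    r = get(u, t=25)
    if not r or r.status_code >= 400:
        continue
    page = lxml_html.fromstring(r.content)
    title, descr = meta(page)
    master.add_heading(title or u, level=1)
    master.add_paragraph(f"URL: {u}")
    # Append the cleaned body as plain paragraphs (simplified vs. write_doc).
    for el in BeautifulSoup(clean(r.content), "lxml").find_all(["h2", "h3", "h4", "p", "li"]):
        txt = el.get_text(" ", strip=True)
        if txt:
            master.add_paragraph(txt)
    master.add_page_break()  # one page break between source pages
master.save("all_pages.docx")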

Want Filters, CSV Reports, and Copy Buttons?

The long-form guide includes:

Include/Exclude by path (e.g., only /blog/ or skip /privacy/); a simple version is sketched after this list
Per-page Word docs plus a master doc (optional)
CSV export (URL, Title, Description, word count)
Pretty code blocks with Copy buttons
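
As a taste of the path filtering, a minimal sketch (the include/exclude prefixes are placeholders) could run over the URL list before the export loop:

from urllib.parse import urlparse

def filter_urls(urls, include=("/blog/",), exclude=("/privacy/",)):
    # Keep URLs whose path starts with an include prefix; drop excluded paths.
    kept = []
    for u in urls:
        path = urlparse(u).path or "/"
        if exclude and any(path.startswith(p) for p in exclude):
            continue
        if include and not any(path.startswith(p) for p in include):
            continue
        kept.append(u)
    return kept

# e.g. urls = filter_urls(gather(sitemap), include=("/blog/",), exclude=("/privacy/",))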

Why Teams Love This Workflow
Writers & Editors: Work in Word, suggest changes, track edits; no CMS logins required.
SEO & Strategy: Verify title/description coverage, heading hierarchy, and content depth at a glance.
Brand & Compliance: Review copy in a neutral, design-free context; add comments where needed.
Migration & Localization: Hand clean source docs to translators or uploaders.

It’s simple, fast, and doesn’t require engineering time—especially with the full script’s filtering and reporting.

Recap (You’re Closer Than You Think)
Find your sitemap (/sitemap.xml or /sitemap).
Install Python and the helper libraries.
Save the compact script as sitemap_to_word_min.py.
Run it with your sitemap URL and an output folder.
Open your Word docs and start editing.

Need the “everything included” version with filters and CSV?  

Need a Hand? We’ll Do It For You.

Whether you’re planning a site rewrite, SEO audit, or content migration, Zigma can automate exports, preserve formatting, generate reports, and handle JavaScript-heavy pages.


If you have any projects coming up, contact us. We’re happy to help.
