From Sitemap to Word Docs (Fast!): A Friendly, Non-Tech Guide
Need to copy your website’s text into Microsoft Word—clean, organized, and ready to edit? This short guide shows how to turn a site’s sitemap into a folder of Word files—one per page—without getting lost in code.
Want the full, copy-and-paste script with a one-click “Copy” UI?
What You’ll Get (and Why It’s Useful)
- A Word document for every page listed in your sitemap (perfect for editing, rewrites, and audits).
- Headings preserved as real Word headings (H1→Heading 1, H2→Heading 2, etc.).
- Meta Title (“Title tag”) and Meta Description extracted for each page.
- No images or scripts—just the copy you care about.
This is ideal for content reviews, tone-of-voice cleanups, SEO audits, migrations, and stakeholder approvals where Word is the common ground.
First, What’s a Sitemap?
A sitemap is a site’s “table of contents” for search engines. Many sites publish it at /sitemap.xml or /sitemap. For example:
- https://yourdomain.com/sitemap.xml
- https://yourdomain.com/sitemap
If one doesn’t work, try the other (or search Google: site:yourdomain.com sitemap).
What This Mini-Guide Covers
We’ll outline the simple steps to:
- Install Python (if you don’t have it)
- Create a tiny project folder
- Install a few helper tools
- Run a script that reads your sitemap and writes Word files
We’ll use tiny code snippets here. For the full script with extras (filters, CSV reports, per-page docs, and copy buttons), use the complete guide.
Step 1 — Install Python
Download and install Python 3.10+ from python.org. On Windows, tick “Add Python to PATH.” Confirm it’s installed:
python --version
Step 2 — Make a Project Folder
mkdir website-to-word
cd website-to-word
(Optional but recommended) Create a virtual environment so everything stays tidy:
python -m venv .venv

# Windows:
.venv\Scripts\Activate

# macOS/Linux:
source .venv/bin/activate
Step 3 — Install the Few Tools We Need
We’ll keep it light: page fetching, HTML parsing, and Word writing.
- requests – fetch pages
- beautifulsoup4 + lxml – parse HTML & XML
- readability-lxml – fallback content cleanup
- python-docx – create Word files
pip install requests beautifulsoup4 lxml readability-lxml python-docx
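Optional sanity check: this one-liner simply imports each library and prints a message, so you can confirm the installs worked before moving on.
python -c "import requests, bs4, lxml, readability, docx; print('All set!')"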
Prefer Zero Code Copy/Paste?
Skip ahead: the full article includes a ready-to-run script with copy buttons, optional filters, and a CSV report. Open it in a new tab: Turn a Sitemap Into Hundreds of Word Pages.
Step 4 — Use a Short, Beginner-Friendly Script
Below is a compact version that:
- finds URLs in your sitemap (XML or simple HTML sitemap)
- grabs each page
- extracts Title/Description and the main text
- writes a Word file per page
Save this as sitemap_to_word_min.py in your project folder.
# sitemap_to_word_min.py (compact version)
import os, re, gzip, time
from urllib.parse import urlparse, urljoin

import requests
from bs4 import BeautifulSoup
from lxml import etree, html as lxml_html
from readability import Document
from docx import Document as Docx
from docx.shared import Pt
from docx.oxml.ns import qn

UA = {"User-Agent": "ZigmaContentFetcher/1.0 (+https://zigma.ca)"}

# Fetch a URL with basic retries and a polite User-Agent
def get(url, tries=3, t=20):
    for i in range(tries):
        try:
            return requests.get(url, headers=UA, timeout=t, allow_redirects=True)
        except:
            time.sleep(0.8 * (2 ** i))

# Collect same-host links from an HTML sitemap page
def links_from_html_sitemap(content, base):
    soup = BeautifulSoup(content, "lxml"); host = urlparse(base).netloc; out = []
    for a in soup.select("a[href]"):
        href = a.get("href", "").strip()
        if not href or href.startswith(("mailto:", "tel:", "javascript:", "#")):
            continue
        absu = urljoin(base, href); p = urlparse(absu)
        if p.scheme in ("http", "https") and p.netloc == host:
            out.append(absu.split("#")[0])
    # dedupe
    seen = set(); res = []
    for u in out:
        if u not in seen:
            res.append(u); seen.add(u)
    return res

# Parse an XML sitemap (or sitemap index); returns (child sitemaps, page URLs)
def parse_sitemap_xml(content, base):
    cm, urls = [], []
    try:
        root = etree.fromstring(content, parser=etree.XMLParser(recover=True, huge_tree=True))
        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        for loc in root.findall(".//sm:sitemap/sm:loc", ns):
            cm.append(urljoin(base, loc.text.strip()))
        for loc in root.findall(".//sm:url/sm:loc", ns):
            urls.append(urljoin(base, loc.text.strip()))
    except:
        pass
    if not cm and not urls:
        soup = BeautifulSoup(content, "xml")
        for sm in soup.find_all("sitemap"):
            loc = sm.find("loc")
            if loc and loc.text:
                cm.append(urljoin(base, loc.text.strip()))
        for u in soup.find_all("url"):
            loc = u.find("loc")
            if loc and loc.text:
                urls.append(urljoin(base, loc.text.strip()))
    # dedupe
    def d(seq):
        s = set(); o = []
        for x in seq:
            if x not in s:
                o.append(x); s.add(x)
        return o
    return d(cm), d(urls)

# Follow the sitemap (and any child sitemaps) and return all page URLs
def gather(smap):
    todo = [smap]; seen = set(); found = []
    while todo:
        cur = todo.pop()
        if cur in seen:
            continue
        seen.add(cur)
        r = get(cur, t=25)
        if not r or r.status_code != 200:
            continue
        ct = (r.headers.get("Content-Type") or "").lower(); content = r.content
        if cur.endswith(".gz") or "gzip" in ct:
            try:
                content = gzip.decompress(content); ct = "application/xml"
            except:
                pass
        if "text/html" in ct or cur.endswith("/sitemap"):
            found += links_from_html_sitemap(content, cur); continue
        children, urls = parse_sitemap_xml(content, cur)
        found += urls; todo += children
    # only http(s), dedupe
    s = set(); out = []
    for u in found:
        if u not in s and urlparse(u).scheme in ("http", "https"):
            out.append(u); s.add(u)
    return out

# Extract the Title tag and Meta Description (with Open Graph fallbacks)
def meta(doc):
    t = doc.xpath("//title"); title = t[0].text.strip() if t and t[0].text else ""
    md = doc.xpath('//meta[translate(@name,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="description"]/@content')
    descr = md[0].strip() if md else ""
    if not title:
        ogt = doc.xpath('//meta[translate(@property,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="og:title"]/@content')
        if ogt:
            title = ogt[0].strip()
    if not descr:
        ogd = doc.xpath('//meta[translate(@property,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="og:description"]/@content')
        if ogd:
            descr = ogd[0].strip()
    return title, descr

# Strip scripts/images/layout and keep only headings, paragraphs, and lists
def clean(html_bytes):
    soup = BeautifulSoup(html_bytes, "lxml")
    root = soup.find("main") or soup.find("article") or soup.body or soup
    if root:
        for t in root.find_all(["script", "style", "noscript", "iframe", "svg", "video", "picture", "source", "form"]):
            t.decompose()
        for img in root.find_all("img"):
            img.decompose()
        for t in root.find_all(["a", "span", "div", "section", "header", "footer", "em", "strong", "b", "i", "u", "small", "sup", "sub"]):
            t.unwrap()
        allowed = {"h1", "h2", "h3", "h4", "p", "ul", "ol", "li", "br"}
        for t in list(root.find_all(True)):
            if t.name not in allowed and t.name not in ("body", "html"):
                t.unwrap()
        return "\n" + "".join(str(c) for c in root.children) + "\n"
    # fallback: Readability
    try:
        h = Document(html_bytes).summary(html_partial=True)
        s = BeautifulSoup(h, "lxml")
        for t in s(["script", "style", "noscript", "iframe", "svg", "video", "picture", "source"]):
            t.decompose()
        for img in s.find_all("img"):
            img.decompose()
        allowed = {"h1", "h2", "h3", "h4", "p", "ul", "ol", "li", "br"}
        for t in list(s.find_all(True)):
            if t.name not in allowed and t.name not in ("body", "html"):
                t.unwrap()
        body = s.body or s
        return "\n" + "".join(str(c) for c in body.children) + "\n"
    except:
        return "\n"

# Write one Word document for a page: URL, meta tags, then headings and paragraphs
def write_doc(url, title, descr, body, path):
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    doc = Docx()
    st = doc.styles["Normal"]; st.font.name = "Calibri"
    st._element.rPr.rFonts.set(qn("w:eastAsia"), "Calibri"); st.font.size = Pt(11)
    doc.add_heading(title or "(Untitled Page)", 1)
    doc.add_paragraph(f"URL: {url}")
    if title:
        doc.add_paragraph(f"Title tag: {title}")
    if descr:
        doc.add_paragraph(f"Description tag: {descr}")
    soup = BeautifulSoup(body, "lxml")
    for el in soup.find_all(["h1", "h2", "h3", "h4", "p", "ul", "ol"], recursive=True):
        n = el.name
        if n in ("h1", "h2", "h3", "h4"):
            lvl = {"h1": 1, "h2": 2, "h3": 3, "h4": 4}[n]
            txt = el.get_text(" ", strip=True)
            if txt:
                doc.add_heading(txt, level=lvl)
        elif n == "p":
            txt = el.get_text(" ", strip=True)
            if txt:
                doc.add_paragraph(txt)
        else:
            ordered = (n == "ol")
            for li in el.find_all("li", recursive=False):
                t = li.get_text(" ", strip=True)
                if not t:
                    continue
                p = doc.add_paragraph(t)
                try:
                    p.style = "List Number" if ordered else "List Bullet"
                except:
                    pass
    doc.save(path)

# Main flow: read the sitemap, fetch each page, write one .docx per page
def run(sitemap, outdir, max_urls=None):
    print("[1/3] Reading sitemap...")
    urls = gather(sitemap)
    if max_urls:
        urls = urls[:max_urls]
    if not urls:
        print("No URLs found."); return
    print(f"[2/3] Writing docs to {outdir} ...")
    os.makedirs(outdir, exist_ok=True)
    for i, u in enumerate(urls, 1):
        r = get(u, t=25)
        if not r or r.status_code >= 400:
            print("SKIP", u); continue
        ct = (r.headers.get("Content-Type") or "").lower()
        if "html" not in ct:
            print("SKIP non-HTML", u); continue
        doc = lxml_html.fromstring(r.content); doc.make_links_absolute(u)
        title, descr = meta(doc); body = clean(r.content)
        slug = re.sub(r"[^a-z0-9._-]+", "_", (title or urlparse(u).path or "page").lower())[:80] or "page"
        path = os.path.join(outdir, f"{i:04d}_{slug}.docx")
        write_doc(u, title, descr, body, path)
        print("OK ", u, "->", os.path.basename(path))
    print("[3/3] Done.")

if __name__ == "__main__":
    import argparse
    ap = argparse.ArgumentParser(description="Export sitemap pages to Word")
    ap.add_argument("--sitemap", required=True)
    ap.add_argument("--outdir", required=True)
    ap.add_argument("--max-urls", type=int, default=None)
    a = ap.parse_args()
    run(a.sitemap, a.outdir, a.max_urls)
This compact script is kept intentionally short. The full version adds filters, CSV output, per-page folders, and “Copy” buttons in the article for zero-friction setup.
Step 5 — Run It
From your project folder (with the virtual environment activated):
# Windows PowerShell
python .\sitemap_to_word_min.py --sitemap https://yourdomain.com/sitemap.xml --outdir .\exports --max-urls 50

# macOS / Linux
python ./sitemap_to_word_min.py --sitemap https://yourdomain.com/sitemap.xml --outdir ./exports --max-urls 50
Open the exports folder—you’ll see files like 0001_home.docx, 0002_services.docx, etc. Each starts with the URL, Title tag, Description tag, and then the page’s headings and paragraphs.
Smart Tips (Save Time & Headaches)
- Test small first: Use --max-urls 10 to verify everything works before exporting the full site.
- HTML or XML sitemap? The script handles both. If /sitemap.xml isn’t found, try /sitemap.
- SEO audits: With everything in Word, it’s easy to check Title/Description presence, heading structure, and thin content.
- Images & embeds: These are intentionally removed to keep your doc lightweight and copy-focused.
- Permissions matter: Only export content you own or have explicit permission to copy; respect robots and terms.
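If you'd like a quick way to honor that last tip, Python's built-in urllib.robotparser can tell you whether a given page may be fetched. Here's a minimal sketch (the check_robots.py name and the example URLs are placeholders; swap in your own domain and match the user agent to the script's UA string):
# check_robots.py — optional robots.txt check (minimal sketch)
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://yourdomain.com/robots.txt")
rp.read()  # download and parse the robots.txt rules

# True means this user agent is allowed to fetch the page
print(rp.can_fetch("ZigmaContentFetcher/1.0", "https://yourdomain.com/services/"))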
Troubleshooting (Quick Fixes)
“No URLs found.” Try the other sitemap URL (HTML vs XML). Some sites use a sitemap index that points to multiple child sitemaps—this script follows those too.
“Output looks short / H1 missing.” Some hero text is injected by JavaScript and won’t appear in raw HTML. Most marketing sites still work fine. For heavy JS pages, consider a browser scraper (e.g., Playwright) or see our full guide for richer extraction options.
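If you do hit a JavaScript-heavy site, here's a rough sketch of what a Playwright-based fetch could look like (it assumes pip install playwright followed by playwright install; the fetch_rendered helper is just an example, not part of the script above). You could feed the returned HTML into the same clean() and write_doc() steps.
# js_fetch.py — rough sketch for JavaScript-heavy pages (assumes: pip install playwright, then: playwright install)
from playwright.sync_api import sync_playwright

def fetch_rendered(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for JS-injected content to settle
        html = page.content()                     # fully rendered HTML
        browser.close()
        return html

print(fetch_rendered("https://yourdomain.com/")[:300])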
“Where did my files go?” They’re saved in the folder you passed to --outdir. If unsure, use an absolute path like --outdir "C:\Users\you\Desktop\exports".
“I want one giant Word doc, not many.” Totally doable—append pages into the same Docx() object and add doc.add_page_break() between pages.
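Here's a minimal sketch of that idea. It reuses gather(), get(), and meta() from the compact script and appends each page to one master document; the run_one() helper is hypothetical, and the body-walking loop from write_doc() is left as a comment to keep it short.
# one_doc.py — minimal sketch: a single master Word file instead of many
from docx import Document as Docx
from lxml import html as lxml_html
from sitemap_to_word_min import gather, get, meta, clean  # reuse the compact script

def run_one(sitemap, outfile, max_urls=None):
    urls = gather(sitemap)
    if max_urls:
        urls = urls[:max_urls]
    master = Docx()
    for i, u in enumerate(urls, 1):
        r = get(u, t=25)
        if not r or r.status_code >= 400:
            continue
        title, descr = meta(lxml_html.fromstring(r.content))
        master.add_heading(title or u, level=1)
        master.add_paragraph(f"URL: {u}")
        if descr:
            master.add_paragraph(f"Description tag: {descr}")
        # ...walk clean(r.content) and add headings/paragraphs the way write_doc() does...
        if i < len(urls):
            master.add_page_break()  # page break between pages
    master.save(outfile)

run_one("https://yourdomain.com/sitemap.xml", "all_pages.docx", max_urls=10)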
Want Filters, CSV Reports, and Copy Buttons?
The long-form guide includes:
- Include/Exclude by path (e.g., only /blog/ or skip /privacy/)
- Per-page Word docs plus a master doc (optional)
- CSV export (URL, Title, Description, word count)
- Pretty code blocks with Copy buttons
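Can't wait? Here's a minimal sketch of the first and third ideas, reusing gather(), get(), and meta() from the compact script. The INCLUDE/EXCLUDE values and the report.csv name are just examples; the full guide adds word counts and friendlier options.
# filter_and_report.py — minimal sketch: path filters plus a CSV report
import csv
from urllib.parse import urlparse
from lxml import html as lxml_html
from sitemap_to_word_min import gather, get, meta  # reuse the compact script

INCLUDE = "/blog/"     # keep only URLs whose path contains this (example value)
EXCLUDE = "/privacy/"  # skip URLs whose path contains this (example value)

urls = [u for u in gather("https://yourdomain.com/sitemap.xml")
        if INCLUDE in urlparse(u).path and EXCLUDE not in urlparse(u).path]

with open("report.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["URL", "Title", "Description"])
    for u in urls:
        r = get(u, t=25)
        if not r or r.status_code >= 400:
            continue
        title, descr = meta(lxml_html.fromstring(r.content))
        w.writerow([u, title, descr])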
Why Teams Love This Workflow
- Writers & Editors: Work in Word, suggest changes, track edits; no CMS logins required.
- SEO & Strategy: Verify title/description coverage, heading hierarchy, and content depth at a glance.
- Brand & Compliance: Review copy in a neutral, design-free context; add comments where needed.
- Migration & Localization: Hand clean source docs to translators or uploaders.
It’s simple, fast, and doesn’t require engineering time—especially with the full script’s filtering and reporting.
Recap (You’re Closer Than You Think)
- Find your sitemap (/sitemap.xml or /sitemap).
- Install Python and the helper libraries.
- Save the compact script as sitemap_to_word_min.py.
- Run it with your sitemap URL and an output folder.
- Open your Word docs and start editing.
Need the “everything included” version with filters and CSV?
Need a Hand? We’ll Do It For You.
Whether you’re planning a site rewrite, SEO audit, or content migration, Zigma can automate exports, preserve formatting, generate reports, and handle JavaScript-heavy pages.
If you have any projects coming up, contact us. We’re happy to help.