<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Zigma</title>
    <description>The latest articles on DEV Community by Zigma (@zigma).</description>
    <link>https://dev.to/zigma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3557801%2Fd02d3d9d-f3fd-4e45-9ff4-fdb55ad6230f.png</url>
      <title>DEV Community: Zigma</title>
      <link>https://dev.to/zigma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zigma"/>
    <language>en</language>
    <item>
      <title>From Sitemap to Word Docs (Fast!): A Friendly, Non-Tech Guide</title>
      <dc:creator>Zigma</dc:creator>
      <pubDate>Fri, 10 Oct 2025 15:24:25 +0000</pubDate>
      <link>https://dev.to/zigma/from-sitemap-to-word-docs-fast-a-friendly-non-tech-guide-1800</link>
      <guid>https://dev.to/zigma/from-sitemap-to-word-docs-fast-a-friendly-non-tech-guide-1800</guid>
      <description>&lt;p&gt;Need to copy your website’s text into Microsoft Word—clean, organized, and ready to edit? This short guide shows how to turn a site’s sitemap into a folder of Word files—one per page—without getting lost in code.&lt;/p&gt;

&lt;p&gt;Want the full, copy-and-paste script with a one-click “Copy” UI?  &lt;/p&gt;

&lt;p&gt;What You’ll Get (and Why It’s Useful)&lt;br&gt;
A Word document for every page listed in your sitemap (perfect for editing, rewrites, and audits).&lt;br&gt;
Headings preserved as real Word headings (H1→Heading 1, H2→Heading 2, etc.).&lt;br&gt;
Meta Title (“Title tag”) and Meta Description extracted for each page.&lt;br&gt;
No images or scripts—just the copy you care about.&lt;/p&gt;

&lt;p&gt;This is ideal for content reviews, tone-of-voice cleanups, SEO audits, migrations, and stakeholder approvals where Word is the common ground.&lt;/p&gt;

&lt;p&gt;First, What’s a Sitemap?&lt;/p&gt;

&lt;p&gt;A sitemap is a site’s “table of contents” for search engines. Many sites publish it at /sitemap.xml or /sitemap. For example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://yourdomain.com/sitemap.xml" rel="noopener noreferrer"&gt;https://yourdomain.com/sitemap.xml&lt;/a&gt;&lt;br&gt;
&lt;a href="https://yourdomain.com/sitemap" rel="noopener noreferrer"&gt;https://yourdomain.com/sitemap&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If one doesn’t work, try the other (or search Google: site:yourdomain.com sitemap).&lt;/p&gt;
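&lt;p&gt;To see what the script will be working with, you can peek at a sitemap’s structure using nothing but Python’s standard library. This is an illustrative sketch: the URLs are placeholders, and the helper names are our own.&lt;/p&gt;

```python
# Sketch: list the page URLs inside a sitemap document (stdlib only).
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(xml_text):
    """Return every loc URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(NS + "loc")]

def sample_sitemap():
    # Build a tiny two-page sitemap in memory (stands in for a real download).
    urlset = ET.Element(NS + "urlset")
    for page in ("https://yourdomain.com/", "https://yourdomain.com/services"):
        url = ET.SubElement(urlset, NS + "url")
        ET.SubElement(url, NS + "loc").text = page
    return ET.tostring(urlset, encoding="unicode")

print(extract_urls(sample_sitemap()))
# → ['https://yourdomain.com/', 'https://yourdomain.com/services']
```

&lt;p&gt;A real sitemap is just this structure at scale: one loc entry per page, which is why one file can drive a whole export.&lt;/p&gt;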

&lt;p&gt;What This Mini-Guide Covers&lt;/p&gt;

&lt;p&gt;We’ll outline the simple steps to:&lt;/p&gt;

&lt;p&gt;Install Python (if you don’t have it)&lt;br&gt;
Create a tiny project folder&lt;br&gt;
Install a few helper tools&lt;br&gt;
Run a script that reads your sitemap and writes Word files&lt;/p&gt;

&lt;p&gt;We’ll use tiny code snippets here. For the full script with extras (filters, CSV reports, per-page docs, and copy buttons), use the complete guide.&lt;/p&gt;

&lt;p&gt;Step 1 — Install Python&lt;/p&gt;

&lt;p&gt;Download and install Python 3.10+ from python.org. On Windows, tick “Add Python to PATH.” Confirm it’s installed:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;python --version
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Step 2 — Make a Project Folder&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;mkdir website-to-word
cd website-to-word
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;(Optional but recommended) Create a virtual environment so everything stays tidy:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;python -m venv .venv
# Windows:
.venv\Scripts\Activate
# macOS/Linux:
source .venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Step 3 — Install the Few Tools We Need&lt;/p&gt;

&lt;p&gt;We’ll keep it light: page fetching, HTML parsing, and Word writing.&lt;/p&gt;

&lt;p&gt;requests – fetch pages&lt;br&gt;
beautifulsoup4 + lxml – parse HTML &amp;amp; XML&lt;br&gt;
readability-lxml – fallback content cleanup&lt;br&gt;
python-docx – create Word files&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;pip install requests beautifulsoup4 lxml readability-lxml python-docx
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Prefer Zero Code Copy/Paste?&lt;/p&gt;

&lt;p&gt;Skip ahead: the full article includes a ready-to-run script with copy buttons, optional filters, and a CSV report. Open it in a new tab: Turn a Sitemap Into Hundreds of Word Pages.&lt;/p&gt;

&lt;p&gt;Step 4 — Use a Short, Beginner-Friendly Script&lt;/p&gt;

&lt;p&gt;Below is a compact version that:&lt;/p&gt;

&lt;p&gt;finds URLs in your sitemap (XML or simple HTML sitemap)&lt;br&gt;
grabs each page&lt;br&gt;
extracts Title/Description and the main text&lt;br&gt;
writes a Word file per page&lt;/p&gt;

&lt;p&gt;Save this as sitemap_to_word_min.py in your project folder.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# sitemap_to_word_min.py (compact version)
import os, re, gzip, time
from urllib.parse import urlparse, urljoin

import requests
from bs4 import BeautifulSoup
from lxml import etree, html as lxml_html
from readability import Document
from docx import Document as Docx
from docx.shared import Pt
from docx.oxml.ns import qn

UA = {"User-Agent": "ZigmaContentFetcher/1.0 (+https://zigma.ca)"}

def get(url, tries=3, t=20):
    for i in range(tries):
        try:
            return requests.get(url, headers=UA, timeout=t, allow_redirects=True)
        except Exception:
            time.sleep(0.8 * (2 ** i))

def links_from_html_sitemap(content, base):
    soup = BeautifulSoup(content, "lxml")
    host = urlparse(base).netloc
    out = []
    for a in soup.select("a[href]"):
        href = a.get("href", "").strip()
        if not href or href.startswith(("mailto:", "tel:", "javascript:", "#")):
            continue
        absu = urljoin(base, href)
        p = urlparse(absu)
        if p.scheme in ("http", "https") and p.netloc == host:
            out.append(absu.split("#")[0])
    seen, res = set(), []  # dedupe, preserving order
    for u in out:
        if u not in seen:
            res.append(u)
            seen.add(u)
    return res

def parse_sitemap_xml(content, base):
    cm, urls = [], []
    try:
        root = etree.fromstring(content, parser=etree.XMLParser(recover=True, huge_tree=True))
        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        for loc in root.findall(".//sm:sitemap/sm:loc", ns):
            cm.append(urljoin(base, loc.text.strip()))
        for loc in root.findall(".//sm:url/sm:loc", ns):
            urls.append(urljoin(base, loc.text.strip()))
    except Exception:
        pass
    if not cm and not urls:  # lenient fallback parse
        soup = BeautifulSoup(content, "xml")
        for sm in soup.find_all("sitemap"):
            loc = sm.find("loc")
            if loc and loc.text:
                cm.append(urljoin(base, loc.text.strip()))
        for u in soup.find_all("url"):
            loc = u.find("loc")
            if loc and loc.text:
                urls.append(urljoin(base, loc.text.strip()))
    def d(seq):  # dedupe, preserving order
        s, o = set(), []
        for x in seq:
            if x not in s:
                o.append(x)
                s.add(x)
        return o
    return d(cm), d(urls)

def gather(smap):
    todo, seen, found = [smap], set(), []
    while todo:
        cur = todo.pop()
        if cur in seen:
            continue
        seen.add(cur)
        r = get(cur, t=25)
        if not r or r.status_code != 200:
            continue
        ct = (r.headers.get("Content-Type") or "").lower()
        content = r.content
        if cur.endswith(".gz") or "gzip" in ct:
            try:
                content = gzip.decompress(content)
                ct = "application/xml"
            except Exception:
                pass
        if "text/html" in ct or cur.endswith("/sitemap"):
            found += links_from_html_sitemap(content, cur)
            continue
        children, urls = parse_sitemap_xml(content, cur)
        found += urls
        todo += children
    s, out = set(), []  # keep http(s) only, dedupe
    for u in found:
        if u not in s and urlparse(u).scheme in ("http", "https"):
            out.append(u)
            s.add(u)
    return out

def meta(doc):
    t = doc.xpath("//title")
    title = t[0].text.strip() if t and t[0].text else ""
    md = doc.xpath('//meta[translate(@name,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="description"]/@content')
    descr = md[0].strip() if md else ""
    if not title:
        ogt = doc.xpath('//meta[translate(@property,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="og:title"]/@content')
        if ogt:
            title = ogt[0].strip()
    if not descr:
        ogd = doc.xpath('//meta[translate(@property,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="og:description"]/@content')
        if ogd:
            descr = ogd[0].strip()
    return title, descr

ALLOWED = {"h1", "h2", "h3", "h4", "p", "ul", "ol", "li", "br"}

def clean(html_bytes):
    soup = BeautifulSoup(html_bytes, "lxml")
    root = soup.find("main") or soup.find("article") or soup.body or soup
    if root:
        for t in root.find_all(["script", "style", "noscript", "iframe", "svg",
                                "video", "picture", "source", "form"]):
            t.decompose()
        for img in root.find_all("img"):
            img.decompose()
        for t in root.find_all(["a", "span", "div", "section", "header", "footer",
                                "em", "strong", "b", "i", "u", "small", "sup", "sub"]):
            t.unwrap()
        for t in list(root.find_all(True)):
            if t.name not in ALLOWED and t.name not in ("body", "html"):
                t.unwrap()
        return "&amp;lt;body&amp;gt;" + "".join(str(c) for c in root.children) + "&amp;lt;/body&amp;gt;"
    try:  # fallback: Readability extraction
        h = Document(html_bytes).summary(html_partial=True)
        s = BeautifulSoup(h, "lxml")
        for t in s(["script", "style", "noscript", "iframe", "svg", "video", "picture", "source"]):
            t.decompose()
        for img in s.find_all("img"):
            img.decompose()
        for t in list(s.find_all(True)):
            if t.name not in ALLOWED and t.name not in ("body", "html"):
                t.unwrap()
        body = s.body or s
        return "&amp;lt;body&amp;gt;" + "".join(str(c) for c in body.children) + "&amp;lt;/body&amp;gt;"
    except Exception:
        return "&amp;lt;body&amp;gt;&amp;lt;/body&amp;gt;"

def write_doc(url, title, descr, body, path):
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    doc = Docx()
    st = doc.styles["Normal"]
    st.font.name = "Calibri"
    st._element.rPr.rFonts.set(qn("w:eastAsia"), "Calibri")
    st.font.size = Pt(11)
    doc.add_heading(title or "(Untitled Page)", 1)
    doc.add_paragraph(f"URL: {url}")
    if title:
        doc.add_paragraph(f"Title tag: {title}")
    if descr:
        doc.add_paragraph(f"Description tag: {descr}")
    soup = BeautifulSoup(body, "lxml")
    for el in soup.find_all(["h1", "h2", "h3", "h4", "p", "ul", "ol"], recursive=True):
        n = el.name
        if n in ("h1", "h2", "h3", "h4"):
            txt = el.get_text(" ", strip=True)
            if txt:
                doc.add_heading(txt, level={"h1": 1, "h2": 2, "h3": 3, "h4": 4}[n])
        elif n == "p":
            txt = el.get_text(" ", strip=True)
            if txt:
                doc.add_paragraph(txt)
        else:
            ordered = (n == "ol")
            for li in el.find_all("li", recursive=False):
                t = li.get_text(" ", strip=True)
                if not t:
                    continue
                p = doc.add_paragraph(t)
                try:
                    p.style = "List Number" if ordered else "List Bullet"
                except Exception:
                    pass
    doc.save(path)

def run(sitemap, outdir, max_urls=None):
    print("[1/3] Reading sitemap...")
    urls = gather(sitemap)
    if max_urls:
        urls = urls[:max_urls]
    if not urls:
        return print("No URLs found.")
    print(f"[2/3] Writing docs to {outdir} ...")
    os.makedirs(outdir, exist_ok=True)
    for i, u in enumerate(urls, 1):
        r = get(u, t=25)
        if not r or r.status_code &gt;= 400:
            print("SKIP", u)
            continue
        ct = (r.headers.get("Content-Type") or "").lower()
        if "html" not in ct:
            print("SKIP non-HTML", u)
            continue
        doc = lxml_html.fromstring(r.content)
        doc.make_links_absolute(u)
        title, descr = meta(doc)
        body = clean(r.content)
        slug = re.sub(r"[^a-z0-9._-]+", "_", (title or urlparse(u).path or "page").lower())[:80] or "page"
        path = os.path.join(outdir, f"{i:04d}_{slug}.docx")
        write_doc(u, title, descr, body, path)
        print("OK ", u, "-&gt;", os.path.basename(path))

if __name__ == "__main__":
    import argparse
    ap = argparse.ArgumentParser(description="Export sitemap pages to Word")
    ap.add_argument("--sitemap", required=True)
    ap.add_argument("--outdir", required=True)
    ap.add_argument("--max-urls", type=int, default=None)
    a = ap.parse_args()
    run(a.sitemap, a.outdir, a.max_urls)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This compact script is kept intentionally short. The full version adds filters, CSV output, per-page folders, and “Copy” buttons in the article for zero-friction setup.&lt;/p&gt;

&lt;p&gt;Step 5 — Run It&lt;/p&gt;

&lt;p&gt;From your project folder (with the virtual environment activated):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Windows PowerShell
python .\sitemap_to_word_min.py --sitemap https://yourdomain.com/sitemap.xml --outdir .\exports --max-urls 50

# macOS / Linux
python ./sitemap_to_word_min.py --sitemap https://yourdomain.com/sitemap.xml --outdir ./exports --max-urls 50
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Open the exports folder—you’ll see files like 0001_home.docx, 0002_services.docx, etc. Each starts with the URL, Title tag, Description tag, and then the page’s headings and paragraphs.&lt;/p&gt;
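&lt;p&gt;Those filenames come from a small slugging rule in the script: a running index plus a cleaned-up version of the page title (or the URL path when there is no title). Here is that idea isolated; the example title and URL are made up:&lt;/p&gt;

```python
# How each .docx gets its name: index + slug from the title (or URL path).
import re
from urllib.parse import urlparse

def slugify(title, url):
    # Prefer the page title; fall back to the URL path, then a generic name.
    base = (title or urlparse(url).path or "page").lower()
    return re.sub(r"[^a-z0-9._-]+", "_", base)[:80] or "page"

print(f"0001_{slugify('Home', 'https://yourdomain.com/')}.docx")
# → 0001_home.docx
```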

&lt;p&gt;Smart Tips (Save Time &amp;amp; Headaches)&lt;br&gt;
Test small first: Use --max-urls 10 to verify everything works before exporting the full site.&lt;br&gt;
HTML or XML sitemap? The script handles both. If /sitemap.xml isn’t found, try /sitemap.&lt;br&gt;
SEO audits: With everything in Word, it’s easy to check Title/Description presence, heading structure, and thin content.&lt;br&gt;
Images &amp;amp; embeds: These are intentionally removed to keep your doc lightweight and copy-focused.&lt;br&gt;
Permissions matter: Only export content you own or have explicit permission to copy; respect robots and terms.&lt;/p&gt;

&lt;p&gt;Troubleshooting (Quick Fixes)&lt;/p&gt;

&lt;p&gt;“No URLs found.” Try the other sitemap URL (HTML vs XML). Some sites use a sitemap index that points to multiple child sitemaps—this script follows those too.&lt;/p&gt;

&lt;p&gt;“Output looks short / H1 missing.” Some hero text is injected by JavaScript and won’t appear in raw HTML. Most marketing sites still work fine. For heavy JS pages, consider a browser scraper (e.g., Playwright) or see our full guide for richer extraction options.&lt;/p&gt;

&lt;p&gt;“Where did my files go?” They’re saved in the folder you passed to --outdir. If unsure, use an absolute path like --outdir "C:\Users\you\Desktop\exports".&lt;/p&gt;

&lt;p&gt;“I want one giant Word doc, not many.” Totally doable—append pages into the same Docx() object and add doc.add_page_break() between pages.  &lt;/p&gt;

&lt;p&gt;Want Filters, CSV Reports, and Copy Buttons?&lt;/p&gt;

&lt;p&gt;The long-form guide includes:&lt;/p&gt;

&lt;p&gt;Include/Exclude by path (e.g., only /blog/ or skip /privacy/)&lt;br&gt;
Per-page Word docs plus a master doc (optional)&lt;br&gt;
CSV export (URL, Title, Description, word count)&lt;br&gt;
Pretty code blocks with Copy buttons&lt;/p&gt;
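&lt;p&gt;If all you need is the CSV report, Python’s built-in csv module covers it. A minimal sketch with made-up rows; in practice you would collect these fields during the export loop:&lt;/p&gt;

```python
# Sketch: a per-page CSV report (URL, Title, Description, word count).
import csv

def write_report(rows, path="report.csv"):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["URL", "Title", "Description", "Word count"])
        writer.writerows(rows)

write_report([
    ("https://yourdomain.com/", "Home", "Welcome to our site.", 412),
    ("https://yourdomain.com/services", "Services", "", 0),  # empty fields flag gaps
])
```

&lt;p&gt;Opening the result in a spreadsheet makes missing titles, empty descriptions, and thin pages easy to sort and filter.&lt;/p&gt;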

&lt;p&gt;Why Teams Love This Workflow&lt;br&gt;
Writers &amp;amp; Editors: Work in Word, suggest changes, track edits; no CMS logins required.&lt;br&gt;
SEO &amp;amp; Strategy: Verify title/description coverage, heading hierarchy, and content depth at a glance.&lt;br&gt;
Brand &amp;amp; Compliance: Review copy in a neutral, design-free context; add comments where needed.&lt;br&gt;
Migration &amp;amp; Localization: Hand clean source docs to translators or uploaders.&lt;/p&gt;

&lt;p&gt;It’s simple, fast, and doesn’t require engineering time—especially with the full script’s filtering and reporting.&lt;/p&gt;

&lt;p&gt;Recap (You’re Closer Than You Think)&lt;br&gt;
Find your sitemap (/sitemap.xml or /sitemap).&lt;br&gt;
Install Python and the helper libraries.&lt;br&gt;
Save the compact script as sitemap_to_word_min.py.&lt;br&gt;
Run it with your sitemap URL and an output folder.&lt;br&gt;
Open your Word docs and start editing.&lt;/p&gt;

&lt;p&gt;Need the “everything included” version with filters and CSV?  &lt;/p&gt;

&lt;p&gt;Need a Hand? We’ll Do It For You.&lt;/p&gt;

&lt;p&gt;Whether you’re planning a site rewrite, SEO audit, or content migration, Zigma can automate exports, preserve formatting, generate reports, and handle JavaScript-heavy pages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://zigma.ca/blog/turn-a-sitemap-into-hundreds-of-word-pages-perfect-for-reviews-audits-rewrites/" rel="noopener noreferrer"&gt;Digital Marketing&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have any projects coming up, contact us. We’re happy to help.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>beginners</category>
      <category>sitemap</category>
    </item>
  </channel>
</rss>
