
Dmitry

Posted on • Updated on • Originally published at abstractkitchen.com

The Definitive Guide To Sitemaps With Python

Sitemaps are important, especially for big websites. It is always a good idea to develop your website with SEO in mind. Unfortunately, most developers ignore this part. This article describes the general idea and how to implement your sitemaps with Python. I wrote it for myself in the first place, because I tend to forget things.

What Is a Sitemap

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/foo.html</loc>
    <lastmod>2022-06-04</lastmod>
  </url>
</urlset>

Sitemaps help search engines discover your website's pages. You combine your most important URLs in a bunch of XML files. Different sitemaps can contain different types of media: plain URLs, images, videos, and news entries. Images, videos, and news entries are just URLs with additional metadata.

Sitemaps are especially important if you have a website with a lot of pages. Now, I will not go into details, because obviously you're a smart person and will find everything at Google Search Central or sitemaps.org.

Just a few simple rules for you:

  • You can combine sitemaps in index sitemaps.
  • A single sitemap must not exceed 50 MB (uncompressed) or 50,000 URLs.
  • Sitemaps can be compressed with gzip.
  • Don't forget to link your sitemaps in robots.txt.
  • All sitemaps must be in the same domain.
  • "priority" and "changefreq" are ignored by Google, so don't bother wasting space.
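
To make the size rule concrete, here is a minimal sketch of a pre-save check. The function name sitemap_within_limits and the constants are my own for illustration; the limits are the ones listed above, measured on the uncompressed XML.

```python
from xml.etree import ElementTree as ET

MAX_URLS = 50_000               # per-file URL limit
MAX_BYTES = 50 * 1024 * 1024    # 50 MB, measured on the uncompressed XML

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def sitemap_within_limits(root_node):
    """Return True if the urlset fits into a single sitemap file."""
    # every direct child of <urlset> is one <url> entry
    url_count = len(root_node)
    # size of the serialized, uncompressed document
    xml_bytes = ET.tostring(root_node, encoding="utf-8", xml_declaration=True)
    return url_count <= MAX_URLS and len(xml_bytes) <= MAX_BYTES


# hypothetical demo data
root = ET.Element("urlset", xmlns=SITEMAP_NS)
for i in range(3):
    url = ET.SubElement(root, "url")
    ET.SubElement(url, "loc").text = f"https://example.com/page-{i}"

print(sitemap_within_limits(root))
```

If the check fails, split the URLs across several sitemaps and link them from an index sitemap.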

Can I Link To Multiple Sitemaps In robots.txt?

Yes, you can. The Sitemap directive can be used multiple times. Here is an example:

Sitemap: https://example.com/sitemap.en.us.xml
Sitemap: https://example.com/sitemap.en.gb.xml
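
If you want to double-check that all Sitemap lines are picked up, the standard library can read them back. This is a small sketch, assuming Python 3.8 or newer (where RobotFileParser.site_maps() is available); the robots.txt content is made up for the demo.

```python
from urllib.robotparser import RobotFileParser

# hypothetical robots.txt content with two Sitemap directives
robots_txt = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.en.us.xml
Sitemap: https://example.com/sitemap.en.gb.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns a list of Sitemap URLs, or None if there are none
sitemaps = parser.site_maps()
print(sitemaps)
```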

Create Your First Sitemap With Python

Here is the idea. You'll need three standard-library modules: xml, os and, optionally, gzip. This snippet shows how a sitemap can be created.

import os
import gzip

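# cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
from xml.etree import ElementTree as cElementTree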


def add_url(root_node, url, lastmod):
    doc = cElementTree.SubElement(root_node, "url")
    cElementTree.SubElement(doc, "loc").text = url
    cElementTree.SubElement(doc, "lastmod").text = lastmod

    return doc


def save_sitemap(root_node, save_as, **kwargs):
    compress = kwargs.get("compress", False)

    # split "sitemaps/sitemap" into directory and base name
    dest_path, sitemap_name = os.path.split(save_as)

    sitemap_name = f"{sitemap_name}.xml"
    if compress:
        sitemap_name = f"{sitemap_name}.gz"

    save_as = os.path.join(dest_path, sitemap_name)

    # create the destination directory if it does not exist
    if dest_path:
        os.makedirs(dest_path, exist_ok=True)

    tree = cElementTree.ElementTree(root_node)

    if not compress:
        tree.write(save_as, encoding='utf-8', xml_declaration=True)
    else:
        # gzip sitemap; keep the same encoding and XML declaration as the plain file
        with gzip.open(save_as, 'wb') as gzipped_sitemap_file:
            tree.write(gzipped_sitemap_file, encoding='utf-8', xml_declaration=True)

    return sitemap_name


# create root XML node
sitemap_root = cElementTree.Element('urlset')
sitemap_root.attrib['xmlns'] = "http://www.sitemaps.org/schemas/sitemap/0.9"

# add urls
add_url(sitemap_root, "https://example.com/url-1", "2022-04-07")
add_url(sitemap_root, "https://example.com/url-2", "2022-04-07")
add_url(sitemap_root, "https://example.com/url-3", "2022-04-07")

# save sitemap. xml extension will be added automatically
save_sitemap(sitemap_root, "sitemaps/sitemap")

# if you want to gzip sitemap
save_sitemap(sitemap_root, "sitemaps/sitemap", compress=True)
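
To verify what was written, you can read a sitemap back and list its URLs. The sketch below is one possible way to do it; read_sitemap_urls is a hypothetical helper name, and the demo writes its own tiny gzipped file to a temporary directory so it does not depend on the code above.

```python
import gzip
import os
import tempfile
from xml.etree import ElementTree as ET

# prefix map so find/findall can resolve the default sitemap namespace
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap_urls(path):
    """Return all <loc> values from a plain or gzipped sitemap file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as sitemap_file:
        tree = ET.parse(sitemap_file)
    return [loc.text for loc in tree.getroot().findall("sm:url/sm:loc", NS)]


# demo: write a tiny gzipped sitemap and read it back
xml = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/url-1</loc></url>"
    b"<url><loc>https://example.com/url-2</loc></url>"
    b"</urlset>"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "sitemap.xml.gz")
    with gzip.open(demo_path, "wb") as f:
        f.write(xml)
    urls = read_sitemap_urls(demo_path)

print(urls)
```

Note that parsed tags carry the namespace, which is why the lookup uses the "sm" prefix instead of plain "url/loc".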

If you want to add images, videos or news sections, you'll need to add the matching XML namespace attributes to your root node.

# create root XML node
sitemap_root = cElementTree.Element('urlset')
sitemap_root.attrib['xmlns'] = "http://www.sitemaps.org/schemas/sitemap/0.9"

# for images add
sitemap_root.attrib["xmlns:image"] = "http://www.google.com/schemas/sitemap-image/1.1"

# for videos add
sitemap_root.attrib["xmlns:video"] = "http://www.google.com/schemas/sitemap-video/1.1"

# for news add
sitemap_root.attrib["xmlns:news"] = "http://www.google.com/schemas/sitemap-news/0.9"

# add this snippet to attach image to url
def add_url_image(url_node, image_url):
    image_node = cElementTree.SubElement(url_node, "image:image")
    cElementTree.SubElement(image_node, "image:loc").text = image_url

    return image_node

# now when you want to add image to url
url_1 = add_url(sitemap_root, "https://example.com/url-1", "2022-04-07")
add_url_image(url_1, "https://example.com/image-1.jpg")

I will not describe here how to add videos or news to your URL, because with this code you can easily do it yourself.
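
For reference, here is a rough sketch of what the news variant could look like, following the same pattern as add_url_image. The function name add_url_news and the sample values are mine; the tag layout follows the Google News sitemap format (publication name and language, publication date, title), so verify it against the official documentation before relying on it.

```python
from xml.etree import ElementTree as ET


def add_url_news(url_node, publication_name, language, publication_date, title):
    """Attach a <news:news> entry to an existing <url> node."""
    news_node = ET.SubElement(url_node, "news:news")

    publication_node = ET.SubElement(news_node, "news:publication")
    ET.SubElement(publication_node, "news:name").text = publication_name
    ET.SubElement(publication_node, "news:language").text = language

    ET.SubElement(news_node, "news:publication_date").text = publication_date
    ET.SubElement(news_node, "news:title").text = title

    return news_node


# root node with the default and the news namespace
root = ET.Element("urlset")
root.attrib["xmlns"] = "http://www.sitemaps.org/schemas/sitemap/0.9"
root.attrib["xmlns:news"] = "http://www.google.com/schemas/sitemap-news/0.9"

# hypothetical article data
url_node = ET.SubElement(root, "url")
ET.SubElement(url_node, "loc").text = "https://example.com/news/article-1"
add_url_news(url_node, "Example Times", "en", "2022-04-07", "Example headline")

print(ET.tostring(root, encoding="unicode"))
```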

How To Create an Index Sitemap

If you have a lot of pages on your website, or you simply want to place your sitemaps in different sections, you'll need index sitemaps. An index sitemap is just an XML file whose root tag is sitemapindex and whose sitemap tags contain the URLs of your sitemaps.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap2.xml</loc>
  </sitemap>
</sitemapindex>

Let's extend our code to create an index sitemap. Add the function add_sitemap_url at the beginning of your file.

def add_sitemap_url(root_node, sitemap_url):
    sitemap_url_node = cElementTree.SubElement(root_node, "sitemap")
    cElementTree.SubElement(sitemap_url_node, "loc").text = sitemap_url

    return sitemap_url_node

Then use it whenever you need it.

# create sitemapindex tag
sitemap_index_node = cElementTree.Element('sitemapindex')
sitemap_index_node.attrib['xmlns'] = "http://www.sitemaps.org/schemas/sitemap/0.9"

# append links to other sitemaps
add_sitemap_url(sitemap_index_node, "https://example.com/sitemap1.xml")
add_sitemap_url(sitemap_index_node, "https://example.com/sitemap2.xml")

save_sitemap(sitemap_index_node, "sitemaps/sitemap")
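
Putting the pieces together: when a URL list is too long for one file, you can split it into chunks and register every chunk in an index. The sketch below only builds the XML trees in memory; the helper names chunked and build_sitemaps, the sitemapN.xml naming scheme and the base URL are assumptions for illustration.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_sitemaps(urls, base_url, chunk_size=MAX_URLS_PER_SITEMAP):
    """Return (index_root, {file_name: urlset_root}) for a list of URLs."""
    index_root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    sitemaps = {}

    for number, chunk in enumerate(chunked(urls, chunk_size), start=1):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in chunk:
            url_node = ET.SubElement(urlset, "url")
            ET.SubElement(url_node, "loc").text = url

        file_name = f"sitemap{number}.xml"
        sitemaps[file_name] = urlset

        # register the chunk in the index sitemap
        sitemap_node = ET.SubElement(index_root, "sitemap")
        ET.SubElement(sitemap_node, "loc").text = f"{base_url}/{file_name}"

    return index_root, sitemaps


# tiny chunk size so the demo produces several sitemaps
urls = [f"https://example.com/page-{i}" for i in range(5)]
index_root, sitemaps = build_sitemaps(urls, "https://example.com", chunk_size=2)

print(sorted(sitemaps))
print(ET.tostring(index_root, encoding="unicode"))
```

Each returned tree can then be written to disk the same way as any other root node.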

You can find code here. Feel free to comment or ask questions.

Sitemapa Library


Now, for small sitemaps, it's all pretty easy. If you need to generate lots of sitemaps with images, videos, or news metadata, your code will become messy at some point. I created sitemapa as a small abstraction over the XML burden.

Sitemapa is a small package that reduces the work of generating sitemaps. You describe your sitemaps with a JSON structure. Sitemapa is framework-agnostic and does not crawl or index your website: it just generates sitemaps from your description. Nothing more. I use it to generate sitemaps for millions of URLs on my websites.

Keep in mind that it's your job to validate your urls and lastmod dates.
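
What "validate" means is up to you. As a starting point, here is a minimal sketch with two hypothetical helpers: one checks that a URL is an absolute http(s) URL shorter than 2,048 characters (the limit from sitemaps.org), the other accepts only lastmod values that date.fromisoformat can parse. The sitemap protocol also allows full timestamps, so this date check is deliberately stricter than the spec.

```python
from datetime import date
from urllib.parse import urlsplit


def is_valid_url(url):
    """Basic check: absolute http(s) URL, shorter than 2048 characters."""
    parts = urlsplit(url)
    return (
        parts.scheme in ("http", "https")
        and bool(parts.netloc)
        and len(url) < 2048
    )


def is_valid_lastmod(lastmod):
    """Accept values that date.fromisoformat can parse, e.g. 2022-06-04."""
    try:
        date.fromisoformat(lastmod)
    except ValueError:
        return False
    return True


print(is_valid_url("https://example.com/foo.html"), is_valid_lastmod("2022-06-04"))
```

Run checks like these before you add a URL to a sitemap, so that bad entries never reach the generated files.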

Features

  • Use JSON to describe your sitemaps. Don't waste your time with XML.
  • No extra dependencies.
  • Create regular sitemaps. URLs, Images, Videos and News are supported.
  • Create index sitemaps to combine your regular sitemaps.
  • Create extra attributes for your tags like <video:price currency="EUR">1.99</video:price>.
  • Compress sitemaps with gzip.
  • Automatically adds the image, video, or news xmlns attributes.

Installation

pip install sitemapa

# import in your script
from sitemapa import Sitemap, IndexSitemap

Create Standard Sitemap. Sitemap Class API.

You need to import the Sitemap class to create a standard sitemap: from sitemapa import Sitemap. The Sitemap class has two methods: append_url and save.

append_url(url, url_data=None)
Parameters: url(str) — Website URL
            url_data(Optional[dict]) — URL Description
            url_data can contain the following keys:
              - lastmod
              - changefreq. Deprecated at Google
              - priority. Deprecated at Google
              - images. To describe URL images
              - videos. To describe URL videos
              - news. To describe URL news


Return type: dict. Dictionary with all urls and url_data

# ------

save(save_as, **kwargs)
Parameters: save_as(str) — Sitemap name and where to save. For example: sitemap1.xml or sitemap1.xml.gz

Return type: str. For example sitemap1.xml or sitemap1.xml.gz

Let's create a sitemap like this and save it as sitemap1.xml.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/url1.html</loc>
  </url>
  <url>
    <loc>http://www.example.com/foo.html</loc>
    <lastmod>2022-06-04</lastmod>
  </url>
</urlset>

And this is the implementation with sitemapa:

from sitemapa import Sitemap

standard_sitemap = Sitemap()

standard_sitemap.append_url("http://www.example.com/url1.html")
standard_sitemap.append_url("http://www.example.com/foo.html", {
    "lastmod": "2022-06-04"
})

# method 'save' will reset inner dictionary with URLs
sitemap1_name = standard_sitemap.save("sitemap1.xml")

# now, if you want to create new sitemap, just do this:
standard_sitemap.append_url("http://www.example.com/url-2.html")
standard_sitemap.append_url("http://www.example.com/url-3.html")
sitemap2_name = standard_sitemap.save("sitemap2.xml")

Add Images To Your Standard Sitemap

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>http://example.com/sample1.html</loc>
    <image:image>
      <image:loc>http://example.com/image.jpg</image:loc>
    </image:image>
    <image:image>
      <image:loc>http://example.com/photo.jpg</image:loc>
    </image:image>
  </url>
  <url>
    <loc>http://example.com/sample2.html</loc>
    <image:image>
      <image:loc>http://example.com/picture.jpg</image:loc>
    </image:image>
  </url>
</urlset>

To do so, we'll use url_data description.

from sitemapa import Sitemap

sitemap_with_images = Sitemap()

sitemap_with_images.append_url("http://example.com/sample1.html", {
    "images": [
        "http://example.com/image.jpg",
        "http://example.com/photo.jpg"
    ]
})

# you can also describe like this
sitemap_with_images.append_url("http://example.com/sample2.html", {
    "images": [
        {
            "loc": "http://example.com/picture.jpg",
            "lastmod": "2022-05-05"
        }
    ]
})

sitemap_with_images.save("sitemap.xml")

As you can see, you can use a list of image URLs or a list of dictionaries. I prefer the first option, since Google deprecated all keys except loc.

I described more use cases in the original article; don't forget to take a look.

This article is my summary for sitemaps. I hope it helps you on your journey. Don't forget to verify everything with official resources. If you have any questions or you see mistakes in this text, don't be shy and drop me a line.
