
🌈 Josh

Posted on • Originally published at joshwcomeau.com

Generate an SEO-Friendly Sitemap for your Gatsby Site

This is a cross-post from my personal blog, where you'll find many other posts on React and Gatsby!

So here's the thing. I really don't want to have to care about SEO. It's all very nebulous, and it attracts so many snake-oil salespeople. SEO websites are the worst. And yet, if you want people to see the stuff that you build, SEO remains super important.

Happily, we don't need to become SEO experts. A few key optimizations can play a big role in our search engine results!

I was intrigued by a recent blog post about how the Ghost team moved their blog to Gatsby. The move had a profound impact on their SEO:

A graph showing the SEO impacts of moving a site to Gatsby - after the switch, SEO climbed dramatically, about 2x as much organic traffic over 2-3 months!

In that article, the author explains how adding an XML sitemap (among other factors) helped them achieve remarkable organic traffic gains. So today, this tutorial will walk you through how to generate a sitemap for your Gatsby blog.

What is an XML Sitemap?

An XML sitemap is a raw document designed to help machines learn about the structure of a website. It looks something like this:

An HTML-like document shows a bunch of unintelligible markup. Upon close scrutiny, URLs from my blog can be seen.

This is different from the "sitemap" sometimes linked to in the footers of websites. No human is meant to look at this, and they shouldn't be linked to. This is a document exclusively for Googlebot and its cousins.
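For reference, a sitemap boils down to a `<urlset>` wrapping one `<url>` entry per page. Here's a minimal skeleton (the URL is a placeholder, not from my site):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/some-page/</loc>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
</urlset>
```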

Leveraging the ecosystem

Whenever I run into a new problem when working on a Gatsby project, my first instinct is always to check and see if a solution has been created by the community. A quick search reveals gatsby-plugin-sitemap, an officially-maintained plugin that solves this exact problem! 🎉

Let's install it, using either yarn or npm:

yarn add gatsby-plugin-sitemap
# or: npm install gatsby-plugin-sitemap

Next, we can add it to our gatsby-config.js:

module.exports = {
  siteMetadata: {
    // ✂️
  },
  plugins: [
    'gatsby-plugin-sitemap',
    // ✂️
  ],
}

Whenever we build our site, this plugin will generate a sitemap.xml file, alongside all the other files that Gatsby builds.

Critically, this plugin only runs when building for production. This means that you won't be able to test it when running in development mode. Let's build, and spin up a static server with serve:

yarn build && serve public

serve is an npm package that serves the files on your local filesystem. If you've never used it before, you'll first need to install it globally with yarn global add serve (or npm install -g serve).

We pass public as an argument, since Gatsby builds into the /public directory; that's where all our static files will be served from.

You should now be able to open localhost:5000/sitemap.xml, and see a beautifully ugly XML document.

Excluding certain paths

Unless you're extremely lucky, it's likely that this sitemap isn't quite right.

One of the biggest reasons to add a sitemap is to tell Google which pages not to worry about. For example, my blog's original sitemap included the following pages:

<url>
  <loc>https://www.joshwcomeau.com/admin</loc>
  <changefreq>daily</changefreq>
  <priority>0.5</priority>
</url>
<url>
  <loc>https://www.joshwcomeau.com/confirmed</loc>
  <changefreq>daily</changefreq>
  <priority>0.5</priority>
</url>

admin is an authenticated route I use for viewing stats about the website, and confirmed is shown when users join my newsletter. Neither of these pages makes sense to include in search results.

Happily, we can customize the plugin to pass an array of paths to exclude:

// gatsby-config.js

module.exports = {
  siteMetadata: {
    // ✂️
  },
  plugins: [
    {
      resolve: 'gatsby-plugin-sitemap',
      options: {
        exclude: ['/admin', '/confirmed'],
      },
    },
    // ✂️
  ],
}

Advanced customizations

When reading the Google sitemap recommendations, I found this bit of information:

  • List only canonical URLs in your sitemaps. If you have two versions of a page, list only the (Google-selected) canonical in the sitemap.

A "canonical" URL is the "true home" for a specific entity. If you have multiple URLs that contain the same content, you need to mark one as "canonical" for search engines to use.

If you don't do this, Google can penalize you for duplicate content, and it can hurt your search result rankings 😬

In addition to this sitemap stuff, it is also a good idea to add a <link rel="canonical"> tag to the head of each page with React Helmet. I'll be writing another post about this soon – subscribe to my newsletter so you don't miss it!
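For the curious, the end result of that tag is a single line in each page's `<head>`, pointing every duplicate at the one true URL. Something like this (using this post's canonical home as the example):

```html
<link rel="canonical" href="https://www.joshwcomeau.com/gatsby/seo-friendly-sitemap/" />
```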

On my blog, post URLs are in the following format: /:category/:slug. This presents a problem, since posts can belong to multiple categories. For example, the post that you're reading right now can be reached through both of these URLs:

  • /gatsby/seo-friendly-sitemap/
  • /seo/seo-friendly-sitemap/

The posts on my blog are all written using MDX. In the frontmatter for the posts, I have data that looks like this:

---
title: "Generate an SEO-Friendly Sitemap for your Gatsby Site"
type: tutorial
publishedOn: 2020-03-09T09:30:00-0400
categories: ['gatsby', 'seo']
---

Categories are listed in priority order, so the first category should always form the canonical URL.

The challenge is clear: I need to fetch the categories from my MDX frontmatter and use it to filter the sites generated in the sitemap. Delightfully, this is an option with the plugin!

Querying data with GraphQL

Inside our gatsby-config.js, we can write a GraphQL query to pull whatever data we need:

module.exports = {
  siteMetadata: {
    // ✂️
  },
  plugins: [
    {
      resolve: 'gatsby-plugin-sitemap',
      options: {
        exclude: ['/admin', '/confirmed'],
        query: `
          {
            site {
              siteMetadata {
                siteUrl
              }
            }

            allSitePage {
              edges {
                node {
                  path
                }
              }
            }
          }
        `,
      },
    },
  ],
};

By default, the plugin uses a query like this one, but we can overwrite it. It fetches the siteUrl, which in my case is https://www.joshwcomeau.com, and then it fetches the path for every page node (e.g. /gatsby/seo-friendly-sitemap). It stitches those two strings together for every page it finds, and produces a sitemap.

In order to filter out non-canonical results, we first need to expose the right data to GraphQL!

allSitePage is an index of every page created, either by putting a React component in src/pages, or using the createPage API. In my case, I'm generating all articles/tutorials programmatically with createPage.

Here's what a typical createPage call looks like, inside gatsby-node.js:

createPage({
  path: pathname,
  component: path.resolve(...),
  context: {
    /* component props */
  },
});

If you're building a blog with Markdown or MDX, you're probably already using this to generate your pages. You provide it a path to live, a component to mount, and some contextual data that the component might need. Anything passed to context becomes available to the component via props.

Happily, it turns out that context also gets exposed to GraphQL!

I added a new piece of data to context:

createPage({
  path: pathname,
  component: path.resolve(...),
  context: {
    isCanonical: currentCategory === canonicalCategory
  },
});

The currentCategory and canonicalCategory variables were already available to me, since I was iterating through all my data and using it to create these pages.
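To make that concrete, here's a rough sketch of the idea. The helper name and frontmatter shape below are my own inventions for illustration, not necessarily how my gatsby-node.js is actually organized:

```javascript
// Hypothetical helper: given a post's frontmatter, produce one page
// spec per category, flagging only the first (canonical) category.
function buildPagesForPost({ slug, categories }) {
  // Categories are listed in priority order, so the first one
  // forms the canonical URL.
  const canonicalCategory = categories[0];

  return categories.map((currentCategory) => ({
    path: `/${currentCategory}/${slug}/`,
    context: {
      isCanonical: currentCategory === canonicalCategory,
    },
  }));
}

const pages = buildPagesForPost({
  slug: 'seo-friendly-sitemap',
  categories: ['gatsby', 'seo'],
});
// pages[0] is the canonical /gatsby/... page;
// pages[1] is the non-canonical /seo/... duplicate.
```

Each of those page specs would then be handed to createPage inside gatsby-node.js.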

With this data added, I could update the GraphQL query passed to query, in my gatsby-config.js:

query: `
  {
    site {
      siteMetadata {
        siteUrl
      }
    }

    allSitePage {
      edges {
        node {
          path
          context {
            isCanonical
          }
        }
      }
    }
  }
`,

One of the most common stumbling blocks for Gatsby developers is GraphQL. It's a powerful tool, but it has a pretty steep learning curve. The Gatsby GraphQL Concepts doc should help clarify some of what we're doing here.

Filtering pages

We've now exposed each page's "canonical status" to GraphQL, and written it into the query that gatsby-plugin-sitemap will use. The final piece of this puzzle: overwriting the default "serializer" to specify what should be done with this queried data.

Here's what that looks like:

{
  resolve: `gatsby-plugin-sitemap`,
  options: {
    exclude: ['/admin', '/confirmed'],
    query: /* ✂️ */,
    serialize: ({ site, allSitePage }) => {
      return allSitePage.edges
        .filter(({ node }) => (
          node.context.isCanonical !== false
        ))
        .map(({ node }) => {
          return {
            url: site.siteMetadata.siteUrl + node.path,
            changefreq: 'daily',
            priority: 0.7,
          };
        });
    },
  },
}

serialize is a function that transforms the data from the query into an array of "sitemappy" objects. The items we return will be used as the raw data to generate the sitemap.

Now that we've specified it in GraphQL, we can access node.context.isCanonical to filter out duplicate pages.

You'll notice we're explicitly checking to see if isCanonical is false. This is important, since isCanonical will be undefined for all the non-blog-post pages, and these pages are totally worth including in the sitemap. We only want to remove pages that are false, not ones that are falsy.
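Here's that distinction in a tiny standalone example (the data is made up for illustration):

```javascript
// Only blog-post pages set isCanonical; other pages leave it undefined.
const edges = [
  { node: { path: '/about/', context: {} } },
  { node: { path: '/gatsby/some-post/', context: { isCanonical: true } } },
  { node: { path: '/seo/some-post/', context: { isCanonical: false } } },
];

// `!== false` keeps both `true` and `undefined`, dropping only
// the explicitly non-canonical duplicates.
const kept = edges.filter(({ node }) => node.context.isCanonical !== false);
// kept: '/about/' and '/gatsby/some-post/'
```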

By using the query and serialize escape hatches built into gatsby-plugin-sitemap, we gain far greater control over the generated sitemap. They also let us fine-tune some page-specific options!

Page-specific options

When generating the XML sitemap, you may have noticed a couple additional fields being shown:

<url>
  <loc>https://www.your-website.com/page-1</loc>
  <changefreq>daily</changefreq>
  <priority>0.5</priority>
</url>
<url>
  <loc>https://www.your-website.com/page-2</loc>
  <changefreq>daily</changefreq>
  <priority>0.5</priority>
</url>

In fact, there are a handful of options that can be used to tweak each page, for optimal effects.

changefreq

changefreq is a measure of how often your page changes. From the Sitemaps protocol:

This value provides general information to search engines and may not correlate exactly to how often they crawl the page. Valid values are:

  • always
  • hourly
  • daily
  • weekly
  • monthly
  • yearly
  • never

The value "always" should be used to describe documents that change each time they are accessed. The value "never" should be used to describe archived URLs.

For a blog, I feel like daily fits most use cases pretty well.

priority

priority is a relative measure of a page's importance. You can use this to signal to the crawler which pages it should care about, and which aren't so important. There are 11 values available to you: 0.0 through 1.0.

On this blog, I'm using it to rank article pages like this one above "index" pages like the latest content page.

If you're a clever trickster, you might be concocting a devious plan: set every page to a 1.0 priority, and watch as your site rockets to the top of the search results!

Unfortunately, this scheme doesn't work: priority is a relative measure of importance within your own site. It won't affect how your site compares to other sites.
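As a sketch of how you might express a ranking like mine inside serialize, here's a hypothetical helper; the path-depth heuristic is my own assumption, not something the plugin provides:

```javascript
// Hypothetical: article URLs look like /:category/:slug/, so any path
// with two or more segments gets a higher priority than index pages.
function priorityForPath(path) {
  const segments = path.split('/').filter(Boolean);
  return segments.length >= 2 ? 0.7 : 0.5;
}
```

Inside serialize, each returned object would then use priority: priorityForPath(node.path).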

lastmod

Finally, we can add a date-time stamp to indicate when the page was last modified.

I'm honestly not sure how valuable this is, since presumably Googlebot is smart enough to detect when a page's content has changed, but correctly following a specification can't hurt!
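If you do want to include it, the conversion is a one-liner; assuming you exposed a modification timestamp through page context (the field and value here are hypothetical), Date.prototype.toISOString produces the W3C datetime format that sitemaps expect:

```javascript
// Convert a frontmatter-style timestamp into the W3C datetime
// format that sitemaps expect for <lastmod>.
const lastmod = new Date('2020-03-09T09:30:00-0400').toISOString();
// → '2020-03-09T13:30:00.000Z'
```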

Even more customizations

If you feel like you're limited by the options presented by this plugin, the folks at Ghost created their own advanced sitemap plugin. It uses XSL templating for a much prettier output! Because it's a newer and less-battle-tested plugin, I opted to stick with the standard one for my blog, but it could be a powerful option for folks with advanced use cases!

Submitting your sitemap

Once your sitemap has been generated, and your site's been deployed, you'll need to let Google know that it exists!

For this, there are a number of options. I opted to submit it via the Google Search Console tool, though there are other options outlined in their documentation.

Top comments (4)

Zhandos Mukataev • Edited

Hello!

Thank you for the amazing post! Josh, please let me ask one question. In the MDX posts' frontmatter you set a publish date, a string like this: publishedOn: 2020-03-09T09:30:00-0400

How do you write the published date for MDX posts? Is there an automatic solution, or do you write the publish date for each post manually? That isn't convenient.

🌈 Josh

Hi Zhandos!

I write them myself. When I'm ready to publish a post, I set that timestamp to the current time and deploy (I also have an isPublished boolean that controls whether it's seen in the list).

It's manual, but I'm also not sure how it could be more convenient =) No matter what, I need to set the date somewhere, and I may as well keep it in the same place as the article. I don't have to do any other "work" to publish.

Gamerseo

The right sitemap helps a lot with getting a website indexed faster.

Murrough Foley

Great post Josh. It's a pity the SEO for Dev.to is so bad and interesting articles such as yours are hard to find.