I use GitHub Pages for my personal website, as well as for several project sites. Although some static site generators include support for sitemap generation (e.g., Jekyll has a plugin for sitemaps), my personal website is generated by a custom static site generator that I built for a few specialized reasons, and most of my project sites for Java libraries consist of a single hand-written HTML page combined with javadoc-generated documentation. So a while back I implemented a GitHub Action, generate-sitemap, that can generate an XML sitemap by crawling a GitHub repository containing the HTML of the site.
It uses the last commit date of each file to produce the <lastmod> tags. By default, it includes URLs for HTML and PDF files in the sitemap and skips other file extensions in the repository, but it can be configured to include URLs for whatever file extensions you want. It checks the head of each HTML page for noindex meta tags and excludes such files from the sitemap. It likewise excludes files that match a Disallow rule in your robots.txt. The generate-sitemap action can be configured in a few other ways as well (see the documentation in the GitHub repository for all details). It is implemented in Python as a container action.
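To give a feel for the two exclusion checks described above, here is a minimal Python sketch. It is not the action's actual source code. The names RobotsMetaParser, has_noindex, parse_disallow_rules, and is_disallowed are hypothetical, and the sketch makes simplifying assumptions: it only looks at meta tags inside an explicit head element, and it ignores User-agent grouping and wildcards in robots.txt.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag inside <head> contains noindex."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            # Attribute values can be None, so fall back to an empty string.
            name = (attributes.get("name") or "").lower()
            content = (attributes.get("content") or "").lower()
            if name == "robots" and "noindex" in content:
                self.noindex = True

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def has_noindex(html_text):
    """True if the page's head carries a robots noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.noindex


def parse_disallow_rules(robots_txt):
    """Collects non-empty Disallow paths (assumption: one group for all agents)."""
    rules = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("disallow:"):
            path = line[len("disallow:"):].strip()
            if path:
                rules.append(path)
    return rules


def is_disallowed(url_path, rules):
    """True if the URL path starts with any Disallow prefix."""
    return any(url_path.startswith(rule) for rule in rules)
```

A page would then be left out of the sitemap if either has_noindex or is_disallowed returns True for it.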
For the <lastmod> dates to be determined correctly, the step that checks out your repository must set the optional fetch-depth: 0 input of actions/checkout so that the full git history is fetched, as in the following step:
    steps:
    - name: Checkout the repo
      uses: actions/checkout@v3
      with:
        fetch-depth: 0
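To see why the full history matters, here is a rough Python sketch of how a last commit date can be looked up with git. It is an illustration under stated assumptions, not the action's actual code: the function names parse_lastmod and last_commit_date are hypothetical. With a shallow checkout of depth 1, git only knows about a single commit, so every file would report that same commit's date.

```python
import subprocess
from datetime import datetime, timezone


def parse_lastmod(git_output, now=None):
    """Returns the date git printed, or the current time if git printed nothing.

    Empty output means the file has no commit yet, for example because it was
    created during the workflow run.
    """
    stamp = git_output.strip()
    if stamp:
        return stamp
    now = now or datetime.now(timezone.utc)
    return now.isoformat(timespec="seconds")


def last_commit_date(path, repo_dir="."):
    """Asks git for the committer date of the most recent commit touching path."""
    # %cI is git's strict ISO 8601 committer date format.
    result = subprocess.run(
        ["git", "log", "-1", "--format=%cI", "--", path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_lastmod(result.stdout)
```

Keeping the parsing separate from the git call makes the fallback to the current date easy to check without a repository.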
Here is an example workflow. It runs on pushes to the branch main, and it starts with the checkout step described above. By default, the generate-sitemap action assumes that the entire repository is the website (you can change that behavior with the input path-to-root). The most important input is probably base-url-path, which specifies the URL to the root of your site. This example workflow includes html and pdf files in the sitemap by default. There are optional inputs that can be used to exclude either of these, and an optional input additional-extensions that can be used to include files of any other specific types you want in the sitemap.
    name: Generate xml sitemap

    on:
      push:
        branches: [ main ]

    jobs:
      sitemap_job:
        runs-on: ubuntu-latest
        name: Generate a sitemap
        steps:
        - name: Checkout the repo
          uses: actions/checkout@v3
          with:
            fetch-depth: 0

        - name: Generate the sitemap
          uses: cicirello/generate-sitemap@v1
          with:
            base-url-path: https://www.example.com/

        - name: Commit and push
          run: |
            if [[ `git status --porcelain sitemap.xml` ]]; then
              git config --global user.name 'github-actions'
              git config --global user.email '41898282+github-actions[bot]@users.noreply.github.com'
              git add sitemap.xml
              git commit -m "Automated sitemap update" sitemap.xml
              git push
            fi
The generate-sitemap action doesn't commit and push, so you need a step in your workflow to do that. In the above example workflow, the last step uses a simple shell script to commit and push. This example does the commit as the github-actions bot. If you'd rather be the committer, then adjust that step as necessary. There are also actions in the GitHub Marketplace that can be used for the commit and push step if you prefer.
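Going back to the base-url-path and path-to-root inputs used in the workflow, the following Python sketch shows one way a file in the repository could be mapped to its URL in the sitemap. The helper file_to_url is hypothetical and only illustrates the idea; the action itself may handle details differently.

```python
from pathlib import PurePosixPath


def file_to_url(file_path, base_url, path_to_root="."):
    """Maps a repository file path to a site URL (illustrative helper).

    file_path and path_to_root are both relative to the repository root.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(path_to_root))
    # Normalize the base URL so that exactly one slash joins the two parts.
    return base_url.rstrip("/") + "/" + relative.as_posix()
```

For example, with the whole repository as the site, file_to_url("about.html", "https://www.example.com/") yields the URL of about.html at the site root, while passing "docs" as path_to_root treats the docs directory as the site root instead.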
You can find more information about this GitHub Action in its GitHub repository:
Check out all of our GitHub Actions: https://actions.cicirello.org/
The generate-sitemap GitHub action generates a sitemap for a website hosted on GitHub Pages, and has the following features:
- Support for both xml and txt sitemaps (you choose using one of the action's inputs).
- When generating an xml sitemap, it uses the last commit date of each file to generate the <lastmod> tag in the sitemap entry. If the file was created during that workflow run, but not yet committed, then it instead uses the current date (however, we recommend committing newly created files first if possible).
- Supports URLs for html and pdf files in the sitemap, and has inputs to control the included file types (defaults include both html and pdf files in the sitemap).
- Now also supports including URLs for a user-specified list of additional file extensions in the sitemap.
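The features above can be sketched in a few lines of Python. This is a simplified illustration rather than the action's implementation: the function names included, render_xml, and render_txt are hypothetical, and the extension check simply compares the text after the last dot.

```python
from xml.sax.saxutils import escape


def included(path, extensions=("html", "pdf")):
    """True if the file's extension is in the allowed set (case-insensitive)."""
    _, dot, ext = path.rpartition(".")
    return bool(dot) and ext.lower() in extensions


def render_xml(entries):
    """Builds an xml sitemap from (url, lastmod) pairs."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url, lastmod in entries:
        lines.append("<url>")
        # escape() replaces &, <, and > so the URL is valid inside XML.
        lines.append(f"<loc>{escape(url)}</loc>")
        lines.append(f"<lastmod>{lastmod}</lastmod>")
        lines.append("</url>")
    lines.append("</urlset>")
    return "\n".join(lines)


def render_txt(entries):
    """Builds a txt sitemap: one URL per line, with no lastmod information."""
    return "\n".join(url for url, _ in entries)
```

Passing a longer tuple such as ("html", "pdf", "md") to included mimics the effect of listing additional file extensions.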
You can also find information about this GitHub Action, as well as others I've implemented and maintain, at the following site (which, by the way, is served via GitHub Pages and uses this action to generate its sitemap):
Follow me on GitHub:
Vincent A Cicirello
If you want to generate the equivalent to the above for your own GitHub profile, check out the cicirello/user-statistician GitHub Action.