How to spell check your docs in a continuous integration pipeline
As software engineers, we write a lot of documentation (or we should, at least). It is usual to write these docs in Markdown format in the code repository for many reasons, but mainly:
- Docs are kept in plain text, isolated from the format.
- Docs are under version control.
- Docs modifications can be reviewed in the same pull request than the code modifications.
While it should not be necessary to mention the importance of avoiding spelling mistakes in our docs, it is very common to see comments in code reviews about typos in them. If the errors are not detected on time, they would impact on our perceived trustworthiness. Tolerance for typos is low. It makes the writer look careless.
You can avoid typos by checking your spelling locally, using a plugin for the IDE, for example, but it would depend only on you. I personally prefer to add automatic check mechanisms to the continuous integration pipeline that don't depend on recommendations for local environments.
So, in this short post we'll see how to configure a Github action to check the spelling of Markdown documents whenever a pull request is opened on our repository.
Creating the Github workflow
If you already have a Github workflow configured in your repository and you want to add the spell check to it as another job or step, you can skip this chapter and read the next one directly.
We are going to create a new Github workflow to execute the spell check. Create a
.github/workflows/spellcheck.yml file in the repository.
├── .github/ │ └── workflows/ │ └── spellcheck.yml
We will name the workflow as
spellcheck. It will be executed each time a new pull request is opened, and it will contain only one job named
check-spelling. For the moment we will only add a step for checking out the repository.
name: spellcheck on: pull_request: jobs: check-spelling: runs-on: ubuntu-latest steps: - name: Checkout uses: actions/checkout@v3
Adding the spell check step
For the spellchecking, we will use the
rojopolis/spellcheck-github-actions Github action, which is able to check Python, Markdown, HTML and Text files, and it supports following languages:
Add the step to the workflow where correspond. Based on the above example, we will add it right after checking out the repository code:
name: spellcheck on: pull_request: jobs: check-spelling: runs-on: ubuntu-latest steps: - name: Checkout uses: actions/checkout@v3 - name: Check Spelling uses: email@example.com with: config_path: .spellcheck.yml task_name: Markdown
Now we have to add a configuration file for the spelling checker. It uses
PySpelling under the hood. When checking Markdown files, it first converts a Markdown text file's buffer using
Python Markdown and returns a single
SourceText object containing the text as HTML. Then it captures the HTML content, comments, and even attributes and performs the check. It has a lot of configuration options, but here we are going to see only an example with some basics. For further info you can read the docs of the
rojopolis/spellcheck-github-actions Github action.
- Create a file named
.spellcheck.ymlin the root folder of the repository, and paste the contents of the example below.
- Change the
sourcesproperty depending on your repository structure. Provide patterns for every folder containing the files that you want to automatically check.
matrix: - name: Markdown expect_match: false apsell: mode: en dictionary: wordlists: - .wordlist.txt output: wordlist.dic encoding: utf-8 pipeline: - pyspelling.filters.markdown: markdown_extensions: - markdown.extensions.extra: - pyspelling.filters.html: comments: false attributes: - alt ignores: - ':matches(code, pre)' - 'code' - 'pre' - 'blockquote' sources: - '*.md' - 'content/blog/**/*.md'
The above configuration will check the spelling of your repository's
README.md and other Markdown files in the
content/blog folder against an English dictionary. This is the configuration used to check the contents of this blog.
Checking for bad spelling
Now, whenever a new pull request is opened, the Github action will be executed and it will help you make sure most spelling errors do not make it into your repository. If the check detects any misspelling, the action will fail and you'll see a trace in the Github action logs like:
Misspelled words: <htmlcontent> README.md (1) -------------------------------------------------------------------------------- Github !!!Spelling check failed!!!
Now you can fix the error and push the docs again, or you can also add the word to your own dictionary in case it is a false positive, as described in the next chapter.
Adding custom words
Sometimes you'll probably have to use some words that are not contained in the default
Aspell dictionary used by
PySpelling. This is very usual when talking about terms used in technical docs. Look again the configuration example above, and you'll see that we have added a
wordlists property to the
dictionary one. It makes reference to a
.wordlist.txt file, so you can create that file and add your own words to it, and
PySpelling will ignore them.
Note also that the configuration includes some filters to ignore some patterns. So, if the word is surrounded by a
code tag in an HTML file it will be ignored, for example. Then, you could format your text in a more convenient way instead of adding the word to your custom dictionary. It would depend on the case, but sometimes it may be a good practice to give a different format to some kinds of terms.
In a Markdown file, Github will be detected as a misspelling, but `Github` won't.
Then, after fixing all errors or adding the necessary custom words, at some point we get:
Spelling check passed :)
And this is how our repository will finally like after adding all the needed configuration files:
├── .github/ │ └── workflows/ │ └── spellcheck.yml ├── .spellcheck.yml └── .wordlist.txt
Adding an automatic spellchecking step to your continuous integration pipeline will prevent your docs containing ugly typos, and it will save you lots of hours of code reviews. It won't prevent other types of orthographic errors, like semantic or style related ones, so I you'll still have to reread carefully your docs before publishing them, but it will be a great help.
This blog is being checked with the examples provided in this post, because it is generated from Markdown files, so you can get the code of the examples directly from
Top comments (1)
I’ve learned how to spellcheck a batch of Markdown files offline, with Aspell (OSS supporting 90 languages).
It was a nice experiment so I wrote a post about it: roneo.org/en/markdown-spellcheck-m...