<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Victor Shepelev</title>
    <description>The latest articles on DEV Community by Victor Shepelev (@zverok).</description>
    <link>https://dev.to/zverok</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F324515%2F4a64be3c-ba8a-4e5a-8c4f-0f209f0cdde0.png</url>
      <title>DEV Community: Victor Shepelev</title>
      <link>https://dev.to/zverok</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zverok"/>
    <language>en</language>
    <item>
      <title>Rebuilding the spellchecker, pt.4: Introduction to suggest algorithm</title>
      <dc:creator>Victor Shepelev</dc:creator>
      <pubDate>Fri, 22 Jan 2021 10:49:12 +0000</pubDate>
      <link>https://dev.to/zverok/rebuilding-the-spellchecker-pt-4-introduction-to-suggest-algorithm-814</link>
      <guid>https://dev.to/zverok/rebuilding-the-spellchecker-pt-4-introduction-to-suggest-algorithm-814</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;This is the fourth part of the "Rebuilding the spellchecker" series, dedicated to explaining how the world's most popular spellchecker Hunspell works.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Today's topic is &lt;strong&gt;suggest&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick recap&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the &lt;strong&gt;&lt;a href="https://dev.to/zverok/rebuilding-the-most-popular-spellchecker-part-1-25e4"&gt;first part&lt;/a&gt;&lt;/strong&gt;, I've described what Hunspell is and why I decided to rewrite it in Python. It is an &lt;strong&gt;explanatory rewrite&lt;/strong&gt; dedicated to uncovering the knowledge behind Hunspell by "translating" it into a high-level language, with a lot of comments.&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;&lt;a href="https://dev.to/zverok/rebuilding-the-spellchecker-pt-2-just-look-in-the-dictionary-they-said-gmb"&gt;second part&lt;/a&gt;&lt;/strong&gt;, I've covered the basics of the &lt;strong&gt;lookup&lt;/strong&gt; (word correctness check through the dictionary) algorithm, including &lt;em&gt;affix compression&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;&lt;a href="https://dev.to/zverok/rebuilding-the-spellchecker-pt-3-lookup-compounds-and-solutions-2a0j"&gt;third part&lt;/a&gt;&lt;/strong&gt;, the rest of the lookup is explained: compounding, word breaking, and text case.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And now, we'll switch to the juiciest part of the spellchecking problem: guessing the corrections for a misspelled word, called &lt;em&gt;suggest&lt;/em&gt; in Hunspell. This post only draws the big picture of suggestion algorithms in general and Hunspell's particular flavor. Even more &lt;del&gt;nasty&lt;/del&gt; amazingly curious details will be covered in the next issue (or, rather, issues).&lt;/p&gt;

&lt;h2&gt;The problem with suggest&lt;/h2&gt;

&lt;p&gt;The question "how does the suggest work?" was what initially drew me to the project. The lookup part seemed trivial. And even if, as I understood later, it is not that trivial, lookup is still a task with a &lt;em&gt;known answer&lt;/em&gt;: the word is either correct or not, and the spellchecker, however it is implemented and however it stores its data, should just say whether it is correct. All the complexity of lookup implementation is only a set of optimizations, necessary because it is hard or impossible to just store a list of "all correct words".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But suggest is a different beast altogether.&lt;/strong&gt; There are many ways to misspell a word, due to mis&lt;i&gt;typing&lt;/i&gt;, genuine error, or OCR glitch; and going back from the misspelled word to the correct one is no easy task. Frequently, only the text's author can say for sure what is right: was "throught" meant to be "through", "thought", or maybe "throughout"?.. What about "restraunt": "restraint" or "restaurant"? Ideally, there should be exactly one guess (then we can even apply auto-correct to the user's text), but that's rarely the case.&lt;/p&gt;

&lt;p&gt;Even when a human can guess "what word was misspelled here", it is not always obvious how to algorithmically deduce the correct word from the misspelled one, such that the results &lt;em&gt;feel correct&lt;/em&gt; to a human. Moreover, an algorithm found for one case or set of cases may produce irrelevant results in others, and it is hard to find an objective measure of whether your suggester is "good".&lt;/p&gt;

&lt;p&gt;So, while lookup approaches vary only by their performance, the smallest tweaks in the suggestion algorithm might produce dramatically different results.&lt;/p&gt;

&lt;h2&gt;How it can be done&lt;/h2&gt;

&lt;p&gt;The famous article by Peter Norvig, "&lt;a href="https://norvig.com/spell-correct.html"&gt;How to Write a Spelling Corrector&lt;/a&gt;", describes a possible algorithm in these steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generate multiple "edits" of the word (insert one letter, remove one letter, swap two adjacent letters, etc.)&lt;/li&gt;
&lt;li&gt;from all edits, select the words that are present in the dictionary;&lt;/li&gt;
&lt;li&gt;rank them by word's commonness (using a source dictionary with weights, or a big source text which is summarized to "word → how often it is used");&lt;/li&gt;
&lt;li&gt;take the first one as a singular good suggestion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire algorithm's implementation in Python takes fewer lines than most of the core methods of Spylls.&lt;/p&gt;
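&lt;p&gt;To make the idea concrete, here is a condensed sketch in the spirit of Norvig's corrector—a toy, not Spylls code; the "corpus" is invented for illustration:&lt;/p&gt;

```python
from collections import Counter

# Toy corpus; Norvig builds this Counter from a large text file instead.
WORDS = Counter("the quick brown fox jumps over the lazy dog the".split())
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    """Pick the known edit with the highest corpus frequency."""
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=WORDS.__getitem__)
```

&lt;p&gt;With this toy corpus, &lt;code&gt;correction("teh")&lt;/code&gt; returns &lt;code&gt;"the"&lt;/code&gt;; an unknown word with no known edits is returned unchanged.&lt;/p&gt;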

&lt;blockquote&gt;
&lt;p&gt;Note that Norvig's article is an awesome, concise, and friendly explanation of the basic &lt;em&gt;idea&lt;/em&gt; of how spellchecking &lt;em&gt;might&lt;/em&gt; work, intended to create an intuition about the process. But it is by no means enough to build a good spellchecker. Unfortunately, quite a few libraries exist that claim to be production-ready spellchecking solutions implementing "the famous Norvig's algorithm". They ignore both "The full details of an industrial-strength spell corrector are quite complex..." at the very beginning of the article and the large "Future Work" section at the end. In real life, the results are typically less than satisfying. Much less.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some of the modern approaches to spellchecking still take this road: for example, the &lt;a href="https://github.com/wolfgarbe/SymSpell"&gt;SymSpell&lt;/a&gt; algorithm (claiming to be "1 million times faster") is, at its core, just a brilliant idea for a novel storage format for a flat word list, which allows significantly optimizing the calculation of edit distance.&lt;/p&gt;

&lt;p&gt;Most of the "industrial-strength spell correctors" (using Norvig's definition), though, are multi-stage. They produce possible corrections with several different algorithms and, most frequently, return several suggestions, not relying on the algorithm's ability to guess the very best one.&lt;/p&gt;

&lt;p&gt;For example, &lt;a href="http://aspell.net/"&gt;Aspell&lt;/a&gt;, one of Hunspell's "uncles"&lt;sup id="fnref1"&gt;1&lt;/sup&gt; (still considered by some to have better suggestion quality &lt;em&gt;for English&lt;/em&gt;), has a quite &lt;a href="http://aspell.net/man-html/Aspell-Suggestion-Strategy.html"&gt;succinct description&lt;/a&gt; of its suggestion strategy, and even exposes command-line options for the user to control &lt;a href="http://aspell.net/0.50-doc/man-html/4_Customizing.html#suggestion"&gt;some parameters&lt;/a&gt; of this strategy.&lt;/p&gt;

&lt;p&gt;Hunspell's approach is much more complicated, not to say "cumbersome". From what I can guess—I didn't dive deep into the history and reasoning behind all the decisions—it grew organically with Hunspell's popularity, resulting from a multitude of cases and requirements from users of a variety of languages. There is no single "complex algorithm" that can be extracted and explained on a whiteboard, but rather a sequence of simpler algorithms. They are guided by a ton of settings that can be present in aff-files and held together by lots of tests.&lt;/p&gt;

&lt;h2&gt;How Hunspell does it&lt;/h2&gt;

&lt;p&gt;Hunspell searches for a correction in the following stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate a list of edits and check their correctness with the lookup, but

&lt;ul&gt;
&lt;li&gt;there are many more of them than the classic insert-delete-swap-replace; in fact, more than a dozen, depending on the particular language meta-information provided by the aff-file;&lt;/li&gt;
&lt;li&gt;there is no ranking/reordering of edits (neither by word popularity nor by closeness to the original word); the order of their calculation &lt;em&gt;is&lt;/em&gt; the order they will be returned: it is assumed that Hunspell's code already applies edits in the highest-probability-first order.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;If there were no results at the edit stage, or they weren't considered very good (more on this later), a search through the entire dictionary is performed:

&lt;ul&gt;
&lt;li&gt;the similarity of the misspelled word to each dictionary stem is calculated with a rough and quick formula;&lt;/li&gt;
&lt;li&gt;for the top 100 similar stems, all of their affixed forms are produced, and similarity to them is calculated with another rough and quick formula;&lt;/li&gt;
&lt;li&gt;for the top 200 similar affixed forms, a very complicated and precise formula is used to choose only the best suggestions.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;There &lt;em&gt;might&lt;/em&gt; be an optional third stage: metaphone (pronunciation-based) suggestions... Although it depends on the existence of metaphone encoding data in the dictionary's aff-file, and there is a &lt;em&gt;very&lt;/em&gt; small number of such dictionaries in the wild (namely, one). We'll touch on this curious topic briefly in the future.&lt;/li&gt;
&lt;li&gt;Finally, some post-processing is performed on the suggestions, like converting them to the same character case as the initial misspelling (&lt;em&gt;unless&lt;/em&gt; it is a prohibited case for this word!) or replacing some characters with "output conversion" rules.&lt;/li&gt;
&lt;/ol&gt;
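&lt;p&gt;Schematically, the stages above can be sketched like this—a toy, runnable illustration, not the actual Hunspell/Spylls code: the dictionary and the set of edits are invented, and &lt;code&gt;difflib&lt;/code&gt; stands in for Hunspell's ngram-based scoring in the dictionary-scan fallback:&lt;/p&gt;

```python
import difflib

# A made-up word list standing in for the real stem+affix lookup.
DICTIONARY = {"cat", "hat", "rat", "catalog", "rather"}

def edits(word):
    # Stage 1 edits, generated in a fixed order: the generation order IS
    # the order of the resulting suggestions (no re-ranking happens).
    for i in range(len(word)):
        yield word[:i] + word[i + 1:]                          # deletion
    for i in range(len(word) - 1):
        yield word[:i] + word[i + 1] + word[i] + word[i + 2:]  # transposition

def suggest(word, limit=5):
    # Stage 1: edits that pass the lookup, deduplicated, order preserved.
    seen, suggestions = set(), []
    for e in edits(word):
        if e in DICTIONARY and e not in seen:
            seen.add(e)
            suggestions.append(e)
    if not suggestions:
        # Stage 2: scan the whole dictionary with a rough similarity
        # measure (real Hunspell scores stems first, then affixed forms,
        # then applies a precise final formula).
        suggestions = difflib.get_close_matches(word, DICTIONARY, n=limit)
    # Stage 4: post-processing, e.g. restoring the original case.
    if word[:1].isupper():
        suggestions = [s.capitalize() for s in suggestions]
    return suggestions[:limit]
```

&lt;p&gt;Here &lt;code&gt;suggest("caat")&lt;/code&gt; is resolved by the edit stage, while &lt;code&gt;suggest("ctlog")&lt;/code&gt; falls through to the dictionary scan.&lt;/p&gt;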

&lt;blockquote&gt;
&lt;p&gt;For the impatient: we'll cover the details of the implementation of each stage in future posts, but you can begin reading the docs and the code right now, starting from the &lt;a href="https://spylls.readthedocs.io/en/latest/hunspell/algo_suggest.html"&gt;&lt;code&gt;algo.suggest&lt;/code&gt;&lt;/a&gt; module.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Quality estimation&lt;/h2&gt;

&lt;p&gt;Is Hunspell's suggestion algorithm good? And &lt;em&gt;how&lt;/em&gt; good is it?&lt;/p&gt;

&lt;p&gt;Those questions are open—and even how they can be answered is unclear. Intuitively, Hunspell's suggestions are quite decent—otherwise, it wouldn't be the most widespread spellchecker, after all. A fair number of "unhappy customers" can easily be found, too, in &lt;a href="https://github.com/hunspell/hunspell/issues"&gt;hunspell's repo issues&lt;/a&gt;. At the same time, one should distinguish between different reasons for sub-par suggestion quality. It might be due to the algorithm itself, or due to the source data quality: the literal absence of the desired suggestion from the dictionary, or a lack of aff-file settings that could've guided Hunspell toward finding it.&lt;/p&gt;

&lt;p&gt;Hunspell's development process, to the best of my knowledge, doesn't use any realistic text corpora to evaluate the suggestion algorithm—only feature-by-feature synthetic tests.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In contrast, Aspell's site &lt;a href="http://aspell.net/test/cur/"&gt;provides an evaluation dataset&lt;/a&gt; for English, including a comparison with Hunspell (Aspell wins, by a large margin). Hunspell's repo actually &lt;a href="https://github.com/hunspell/hunspell/tree/master/tests/suggestiontest"&gt;contains&lt;/a&gt; something similar: a script that evaluates Hunspell vs. Aspell based on Wikipedia's &lt;a href="https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings"&gt;List of common misspellings&lt;/a&gt; (Hunspell wins), but mostly for informational purposes: the results are neither promoted nor used as a reference point for further development.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The current development consensus on "what's the best suggestion algorithm" is maintained by a multitude of synthetic &lt;a href="https://github.com/hunspell/hunspell/tree/master/tests"&gt;test dictionaries&lt;/a&gt;, validating that one of the suggestion features, or a set of them, works (and frequently indirectly validating other features). This situation is both a blessing and a curse: synthetic tests provide a stable enough environment to refactor Hunspell (or to rewrite it in a different language, IYKWIM); on the other hand, there is no direct way to test the &lt;em&gt;quality&lt;/em&gt;—the tests only confirm that &lt;em&gt;features work in the expected order&lt;/em&gt;. So, there is no way to prove that some big redesign, or some alternative spellchecker, performs at least &lt;em&gt;as well as Hunspell&lt;/em&gt; and improves over this baseline.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There is, for example, a curious &lt;a href="https://github.com/bakwc/JamSpell#benchmarks"&gt;evaluation table&lt;/a&gt; provided by a modern ML-based spellchecker, JamSpell. According to it, JamSpell is awesome—while Hunspell is a mere 0.03% better than a dummy ("fix nothing") spellchecker... Which doesn't ring true, somehow!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My initial assumption for the Spylls project was that understanding the current implementation in full would be a precondition for public experimentation to improve it significantly. Or—as I dreamed—we'd be able to mix and match the approaches of several spellcheckers (at least Hunspell and Aspell, considering, say, &lt;a href="https://battlepenguin.com/tech/aspell-and-hunspell-a-tale-of-two-spell-checkers/"&gt;the popular article&lt;/a&gt; demonstrating cases where the latter beats the former). What I uncovered, though, makes me suspect that relying on feature-by-feature tests and a strict ordering of simple algorithms makes Hunspell too rigid for a breakthrough quality improvement... But more on this later.&lt;/p&gt;

&lt;p&gt;For now—however we estimate the quality, &lt;em&gt;practically, it works&lt;/em&gt;. &lt;strong&gt;In the next part,&lt;/strong&gt; we'll look closely at all the hoops Hunspell jumps through in order to provide meaningful &lt;strong&gt;edit-based suggestions&lt;/strong&gt;.  Follow me &lt;a href="https://twitter.com/zverok"&gt;on Twitter&lt;/a&gt; or &lt;a href="///subscribe.html"&gt;subscribe to my mailing list&lt;/a&gt; if you don't want to miss the follow-up!&lt;/p&gt;

&lt;p&gt;PS: Huge thanks to &lt;a href="https://twitter.com/squadette"&gt;@squadette&lt;/a&gt;, my faithful editor. Without his precious help, the text would be even more convoluted!&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Aspell is older than Hunspell, but it is not its direct ancestor. There was once an old &lt;a href="https://en.wikipedia.org/wiki/Ispell"&gt;Ispell&lt;/a&gt;; then &lt;a href="https://en.wikipedia.org/wiki/GNU_Aspell"&gt;Aspell&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/MySpell"&gt;MySpell&lt;/a&gt; were created independently to replace it; then Hunspell superseded MySpell (and Aspell took some features from MySpell too, namely affix compression). It's complicated. "Uncle" would be the most appropriate family relation. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>spellcheck</category>
      <category>python</category>
      <category>textprocessing</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Rebuilding the spellchecker, pt.3: Lookup—compounds and solutions</title>
      <dc:creator>Victor Shepelev</dc:creator>
      <pubDate>Fri, 15 Jan 2021 10:11:32 +0000</pubDate>
      <link>https://dev.to/zverok/rebuilding-the-spellchecker-pt-3-lookup-compounds-and-solutions-2a0j</link>
      <guid>https://dev.to/zverok/rebuilding-the-spellchecker-pt-3-lookup-compounds-and-solutions-2a0j</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;This is the third part of the "Rebuilding the spellchecker" series, dedicated to the explanation of how the world's most popular spellchecker Hunspell works.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick recap&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the &lt;strong&gt;&lt;a href="https://dev.to/zverok/rebuilding-the-most-popular-spellchecker-part-1-25e4"&gt;first part&lt;/a&gt;&lt;/strong&gt;, I've described what Hunspell is and why I decided to rewrite it in Python. It is an &lt;strong&gt;explanatory rewrite&lt;/strong&gt; dedicated to uncovering the knowledge behind Hunspell by "translating" it into a high-level language, with a lot of comments.&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;&lt;a href="https://dev.to/zverok/rebuilding-the-spellchecker-pt-2-just-look-in-the-dictionary-they-said-gmb"&gt;second part&lt;/a&gt;&lt;/strong&gt;, I've covered the basics of the &lt;strong&gt;lookup&lt;/strong&gt; (word correctness check through the dictionary) algorithm, including &lt;em&gt;affix compression&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This part continues the &lt;strong&gt;lookup algorithm explanation&lt;/strong&gt;, dedicated to &lt;strong&gt;word compounding&lt;/strong&gt; and some less complicated but nonetheless important concerns: word case and word breaking. To understand this part, reading &lt;a href="https://dev.to/zverok/rebuilding-the-spellchecker-pt-2-just-look-in-the-dictionary-they-said-gmb"&gt;the previous one&lt;/a&gt; is strongly suggested. At the very least, you should remember that there are &lt;em&gt;stems&lt;/em&gt; with &lt;em&gt;flags&lt;/em&gt;, specified by the &lt;code&gt;.dic&lt;/code&gt;-file, with the meaning of flags defined in the &lt;code&gt;.aff&lt;/code&gt;-file.&lt;/p&gt;

&lt;h2&gt;Word compounding&lt;/h2&gt;

&lt;p&gt;Many languages, like German, Dutch, or Norwegian, have &lt;em&gt;word compounding&lt;/em&gt;: two stems can be joined together, producing a new word. To check the spelling of a word in a language with compounding, the spellchecker needs to break it into all possible parts and check whether there exists a combination of parts such that all of them are correct words that are allowed inside compounds.&lt;/p&gt;

&lt;p&gt;Hunspell has two independent mechanisms to specify the compounding logic of a language in the aff-file: &lt;strong&gt;per-stem flags&lt;/strong&gt; and &lt;strong&gt;regexp-like rules&lt;/strong&gt;. Sometimes both mechanisms are used in the same dictionary.&lt;/p&gt;

&lt;h3&gt;Per-stem flag checks&lt;/h3&gt;

&lt;p&gt;There is a generic &lt;code&gt;COMPOUNDFLAG&lt;/code&gt; directive to specify a flag which, when attached to a stem, means "this stem can appear anywhere in a compound" (examples from LibreOffice's Norwegian dictionary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# In nb_NO.aff:
...
# Directive defines: any word with "z" flag is allowed to be in a compound
COMPOUNDFLAG z

# In nb_NO.dic:
...
fritt/CEGKVz
...
røyk/AEGKVWz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both &lt;code&gt;fritt&lt;/code&gt; ("free") and &lt;code&gt;røyk&lt;/code&gt; ("smoke") have the &lt;code&gt;z&lt;/code&gt; flag, which means they can appear anywhere in a compound word, and thus "røykfritt" ("smoke-free" = "non-smoking") is a valid one—and "frittrøyk" too&lt;sup id="fnref1"&gt;1&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;There are also more precise &lt;code&gt;COMPOUNDBEGIN&lt;/code&gt;/&lt;code&gt;COMPOUNDMIDDLE&lt;/code&gt;/&lt;code&gt;COMPOUNDEND&lt;/code&gt; directives, setting the flags for stems which can only be at a certain place in compounds. Flags designated by those directives can be freely mixed: a compound can consist of a part marked with the generic &lt;code&gt;COMPOUNDFLAG&lt;/code&gt; and another part marked with &lt;code&gt;COMPOUNDEND&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To check a compound word for correctness, Hunspell needs to chop off the beginning of the word, of every possible length, and check if it is a valid stem which is allowed at the beginning of a compound. If so, the algorithm recursively chops off the next parts, till the whole word is split into compound parts (or no suitable parts are found).&lt;/p&gt;

&lt;p&gt;Note that depending on the word's length, and on how many dictionary words are allowed to be in compounds, the loop can take quite some time: the process we described in the &lt;a href="https://dev.to/zverok/rebuilding-the-spellchecker-pt-2-just-look-in-the-dictionary-they-said-gmb"&gt;previous part&lt;/a&gt; (affix-based search of the correct form) can be repeated dozens of times for various "part candidates".&lt;/p&gt;
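&lt;p&gt;The chop-and-recurse process can be sketched like this (a toy illustration with an invented two-word dictionary; the real algorithm runs the full affix-aware lookup on every candidate part and honors many more restrictions):&lt;/p&gt;

```python
# "z" plays the role of the COMPOUNDFLAG-designated flag from the
# nb_NO example above.
DIC = {"røyk": {"z"}, "fritt": {"z"}}

def compound_parts(word, min_part=2):
    """Return one possible split into compound-allowed parts, or None."""
    if "z" in DIC.get(word, ()):
        return [word]
    # Chop off a beginning of every possible length and recurse on the rest.
    for i in range(min_part, len(word) - min_part + 1):
        head, rest = word[:i], word[i:]
        if "z" in DIC.get(head, ()):
            tail = compound_parts(rest, min_part)
            if tail:
                return [head] + tail
    return None
```

&lt;p&gt;&lt;code&gt;compound_parts("røykfritt")&lt;/code&gt; returns &lt;code&gt;["røyk", "fritt"]&lt;/code&gt;, while a word with an unknown tail yields &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;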

&lt;blockquote&gt;
&lt;p&gt;Let &lt;a href="https://spylls.readthedocs.io/en/latest/hunspell/algo_lookup.html#spylls.hunspell.algo.lookup.Lookup.compounds_by_flags"&gt;&lt;code&gt;Lookup.compounds_by_flags&lt;/code&gt;&lt;/a&gt; in the Spylls documentation be your guide!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;Defining compounds as regexp-like rules&lt;/h3&gt;

&lt;p&gt;There is another way to specify compounding logic, implemented by the &lt;code&gt;COMPOUNDRULE&lt;/code&gt; directive, with statements like &lt;code&gt;A*B?C&lt;/code&gt; (meaning "a correct compound consists of any number of words with the flag &lt;code&gt;A&lt;/code&gt;, then one or zero words with the flag &lt;code&gt;B&lt;/code&gt;, then a mandatory word with the flag &lt;code&gt;C&lt;/code&gt;"). The most common use of it is specifying suffixes of numerals. For example, in the &lt;code&gt;en_US&lt;/code&gt; dictionary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# en_US.aff
COMPOUNDRULE 2     # we have 2 compound rules listed below
COMPOUNDRULE n*1t  # rule 1: any number (*) of "n"-marked stems, then "1"-marked stem, then "t"-marked stem
COMPOUNDRULE n*mp  # rule 2: any number (*) of "n"-marked stems, then "m"-marked stem, then "p"-marked stem

# en_US.dic
# ...defines numbers as "stems" for this rule:
0/nm
1/n1
2/nm
3/nm
4/nm
5/nm
6/nm
7/nm
8/nm
9/nm
# ...and numerical suffixes as stems, with different flags, too!
0th/pt
1st/p
1th/tc
2nd/p
2th/tc
3rd/p
3th/tc
4th/pt
5th/pt
6th/pt
7th/pt
8th/pt
9th/pt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This leads to Hunspell being able to say that "1201st" is correct (the rule &lt;code&gt;n*mp&lt;/code&gt; matched: "1" and "2" with "n" flags, "0" with "m", and "1st" with "p"), and "1211th" is correct (another rule in action: &lt;code&gt;n*1t&lt;/code&gt;), but "1211st" is not.&lt;/p&gt;

&lt;p&gt;Handling the word correctness check in the presence of &lt;code&gt;COMPOUNDRULE&lt;/code&gt; again requires recursively splitting the word into possible parts—but this time, the already-found parts should be checked for a partial match against the known rules.&lt;/p&gt;
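&lt;p&gt;The core trick can be sketched as follows (a toy with an abridged, invented fragment of the en_US-style data; real Hunspell also supports parenthesized multi-character flags and prunes dead-end splits via partial matching, which this brute-force version skips):&lt;/p&gt;

```python
import itertools
import re

# Abridged en_US-like data: entry -> its flags (one character per flag here).
DIC = {"0": "nm", "1": "n1", "2": "nm", "0th": "pt", "1st": "p", "1th": "t"}
RULES = ["n*1t", "n*mp"]  # COMPOUNDRULE patterns are already regexp-like

def splits(word):
    """All ways to split the word into known dictionary entries."""
    if not word:
        yield []
        return
    for entry in DIC:
        if word.startswith(entry):
            for rest in splits(word[len(entry):]):
                yield [entry] + rest

def is_compound(word):
    for parts in splits(word):
        # Each part may match a rule position through any of its flags,
        # so try every one-flag-per-part combination against every rule.
        for combo in itertools.product(*(DIC[p] for p in parts)):
            if any(re.fullmatch(rule, "".join(combo)) for rule in RULES):
                return True
    return False
```

&lt;p&gt;This reproduces the examples above: "1201st" and "1211th" pass, "1211st" doesn't.&lt;/p&gt;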

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://spylls.readthedocs.io/en/latest/hunspell/algo_lookup.html#spylls.hunspell.algo.lookup.Lookup.compounds_by_rules"&gt;&lt;code&gt;Lookup.compounds_by_rules&lt;/code&gt;&lt;/a&gt; implements this in a complicated, yet concise way.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;But wait, there is more!&lt;/h3&gt;

&lt;p&gt;&lt;del&gt;To make things more complicated&lt;/del&gt; To match the complexity of real life, both algorithms of compound word checking need to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Numeric &lt;strong&gt;limitations&lt;/strong&gt;: some dictionaries might limit the minimum size of a part of the compound, or the maximum number of parts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Affixes:&lt;/strong&gt; By default, any prefix is allowed at the beginning of the compound word, and any suffix is allowed at the end; and yet, some affixes might have flags saying "it should never be in any compound", and some others might have flags saying "it is allowed &lt;em&gt;in the middle&lt;/em&gt; of the compound" (e.g. prefix to non-first or suffix to a non-last part).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Several settings&lt;/strong&gt; that, when present in the aff-file, reject some compound words with seemingly correct parts as incorrect: for example, if the letter at the boundary of the compound is tripled (&lt;code&gt;fall+lucka&lt;/code&gt;); if some parts of the compound are repeated (&lt;code&gt;dubb+bon+bon&lt;/code&gt;); if a non-first part of the compound is capitalized; or "this regexp-like pattern is prohibited at the boundary of the compound parts", and so on.&lt;/li&gt;
&lt;li&gt;Some of those settings might lead to a whole &lt;strong&gt;new word checking loop&lt;/strong&gt; in the middle of compound checking: for example, the &lt;code&gt;CHECKCOMPOUNDREP&lt;/code&gt; setting tells the algorithm to use the &lt;code&gt;REP&lt;/code&gt; table specified in the aff-file (typical misspelled sequences of letters, like "f=&amp;gt;ph", usually used on suggest) to check whether some part of the compound, with a replacement applied, becomes a valid word. If yes, then it is an incorrect compound! E.g. "badabum" splits into parts "ba", "da", "bum", but if we apply the replacement "u=&amp;gt;oo", it turns out "daboom" is a correct non-compound word... Then we should consider "badabum" a misspelling, and "ba daboom" is most probably what was meant.&lt;/li&gt;
&lt;/ul&gt;
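&lt;p&gt;The &lt;code&gt;CHECKCOMPOUNDREP&lt;/code&gt; idea alone can be sketched like this (invented REP table and word list, loosely mirroring the "badabum" example; in real Hunspell this check is woven into the compound-splitting loop itself):&lt;/p&gt;

```python
# Toy REP table ("typical misspelling" -> replacement) and word list.
REP = [("u", "oo")]
WORDS = {"daboom", "ba"}

def rep_suspicious(part):
    """True if a REP replacement turns the part into a valid word,
    which makes the whole compound candidate suspect."""
    return any(
        part.replace(frm, to) in WORDS
        for frm, to in REP
        if frm in part
    )
```

&lt;p&gt;With this data, a candidate part like &lt;code&gt;"dabum"&lt;/code&gt; is flagged as suspicious (it becomes "daboom"), so the compound containing it would be rejected.&lt;/p&gt;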

&lt;blockquote&gt;
&lt;p&gt;Are you thrilled? Then follow the &lt;a href="https://spylls.readthedocs.io/en/latest/hunspell/algo_lookup.html#spylls.hunspell.algo.lookup.Lookup.compound_forms"&gt;&lt;code&gt;Lookup.compound_forms&lt;/code&gt;&lt;/a&gt; docs to uncover even more dirty details.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;...and other complications&lt;/h2&gt;

&lt;p&gt;Affix checking and (de)compounding are the main parts of the algorithm, yet there is more! Just a brief overview to give you a taste:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Word case: "kitten" can be spelled "Kitten" or "KITTEN", but "Paris" can't be spelled "paris"

&lt;ul&gt;
&lt;li&gt;...but the word might have a flag defined as &lt;code&gt;KEEPCASE&lt;/code&gt; in the aff-file, meaning it should appear ONLY in the exact case it has in the dictionary;&lt;/li&gt;
&lt;li&gt;...and there are complications with the German language: "SS" can be downcased as "ss" or "ß" ("sharp s"), and both should be checked through the dictionary; also, when the word is uppercased, it is allowed to keep "ß": "STRAßE";&lt;/li&gt;
&lt;li&gt;...and in Turkic languages, the casing rules for "i" are different: "i=&amp;gt;İ" and "I=&amp;gt;ı";&lt;/li&gt;
&lt;li&gt;...and the ending part of a compound might have a flag saying "this compound should be titlecased": in the Swedish dictionary, there are special words like "afrika" which are allowed only at the end of compounds and require the whole compound to be titlecased: "Sydafrika" (South Africa);&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Word breaking: "foo-bar" should be checked as the whole word, and also as two separate words "foo" and "bar"

&lt;ul&gt;
&lt;li&gt;...unless the aff-file redefines this, by prohibiting word breaking or changing the patterns by which words should be broken.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Some words might be present in the dictionary with a flag defined as &lt;code&gt;FORBIDDENWORD&lt;/code&gt;: it is used to disallow words that are logically producible (an allowed stem with an allowed suffix), but where this specific combination is incorrect in the language.&lt;/li&gt;
&lt;li&gt;There might be an &lt;code&gt;ICONV&lt;/code&gt; ("input conversion") directive defined in the aff-file, saying which characters to convert before spellchecking: for example, replacing several kinds of typographic apostrophes with the simple &lt;code&gt;'&lt;/code&gt; to simplify the dictionary, or unpacking ligatures (&lt;code&gt;ﬁ&lt;/code&gt; → &lt;code&gt;fi&lt;/code&gt;).

&lt;ul&gt;
&lt;li&gt;But this feature can be used for more than handling fancy typography: for example, the Dutch dictionary uses it to enforce the proper case of "ij": in Dutch, it is considered a single entity, and both letters should always have the same case. This is achieved by &lt;code&gt;ICONV&lt;/code&gt;-ing to ligatures: &lt;code&gt;ij&lt;/code&gt;→&lt;code&gt;ĳ&lt;/code&gt; and &lt;code&gt;IJ&lt;/code&gt;→&lt;code&gt;Ĳ&lt;/code&gt; (but &lt;code&gt;Ij&lt;/code&gt; wouldn't be converted, and wouldn't be found in the dictionary, as all dictionary words also contain the ligatures).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;An &lt;code&gt;IGNORE&lt;/code&gt; directive, defined in the aff-file, says which characters to drop before spellchecking (useful in Arabic and other Semitic languages, where vowel marks may be present but should be ignored).&lt;/li&gt;
&lt;/ul&gt;
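&lt;p&gt;The pre-processing from the last two points can be sketched like this (invented pair lists for illustration; real &lt;code&gt;ICONV&lt;/code&gt; applies the longest matching pattern at each position rather than a global replace):&lt;/p&gt;

```python
# Illustrative ICONV pairs: typographic apostrophe, plus the Dutch
# "ij"/"IJ"-to-ligature trick described above (note: no pair for "Ij",
# so mixed case is left as is and will then fail the dictionary lookup).
ICONV = [("\u2019", "'"), ("ij", "\u0133"), ("IJ", "\u0132")]
# A few Arabic short-vowel diacritics, as an IGNORE-style character set.
IGNORE = set("\u064B\u064C\u064D\u064E\u064F\u0650")

def preprocess(word):
    for pattern, replacement in ICONV:
        word = word.replace(pattern, replacement)
    return "".join(ch for ch in word if ch not in IGNORE)
```

&lt;p&gt;So &lt;code&gt;preprocess("lijst")&lt;/code&gt; yields "lĳst" (with the ligature), while "Ijst" stays unchanged.&lt;/p&gt;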

&lt;p&gt;That's mostly the size of it!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To go for the full ride, start reading the Spylls docs from the &lt;a href="https://spylls.readthedocs.io/en/latest/hunspell/algo_lookup.html#spylls.hunspell.algo.lookup.Lookup.__call__"&gt;&lt;code&gt;Lookup.__call__&lt;/code&gt;&lt;/a&gt; method. You won't be disappointed!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Lookup: takeout&lt;/h2&gt;

&lt;p&gt;To reiterate everything said above: there are good and useful dictionaries for spellchecking many languages, freely available in Hunspell's format, and one might be tempted to reuse them in one's own code. But the process of going from the not-that-complicated input format to full, reliable spellchecking includes at least:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reading aff-files (consisting of multiple directive "types", with reading logic depending on the particular directive) and dic-files (words with flags)—we'll talk about this interesting task in later installments&lt;sup id="fnref2"&gt;2&lt;/sup&gt;;&lt;/li&gt;
&lt;li&gt;Affix analysis: either on the fly (as Hunspell and Spylls do), or once, "unpacking" the list of stems with flags into words with affixes;&lt;/li&gt;
&lt;li&gt;Compounding analysis—unless you just want to omit support for languages with compounding (this apparently &lt;em&gt;can't&lt;/em&gt; be solved with a pre-generated list of "all correct compound words");&lt;/li&gt;
&lt;li&gt;Handling of complications with word breaking, text case, special characters, and whatnot.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of those tasks are directly related to how Hunspell is built and could've been done differently. But mostly, this chapter tries to show that &lt;strong&gt;"check if the word is spelled correctly", especially in languages other than English, should be seen as a not-that-trivial task&lt;/strong&gt;, to say the least.&lt;/p&gt;

&lt;p&gt;And we haven't even started on &lt;strong&gt;correction suggestions, which we'll gladly cover in the next part&lt;/strong&gt;. Follow me &lt;a href="https://twitter.com/zverok"&gt;on Twitter&lt;/a&gt; or &lt;a href="///subscribe.html"&gt;subscribe to my mailing list&lt;/a&gt; if you don't want to miss the follow-up!&lt;/p&gt;

&lt;p&gt;PS: Huge thanks to &lt;a href="https://twitter.com/squadette"&gt;@squadette&lt;/a&gt;, my faithful editor. Without his precious help, the text would be even more convoluted!&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;The second one can't be found in Google—this compound form is &lt;em&gt;grammatically correct&lt;/em&gt;, but doesn't make sense. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;I admit that starting with the proper format description would be &lt;em&gt;logically correct&lt;/em&gt;, but I wanted to tell the juiciest stuff first. There will be a post on file formats and how to read them. For the impatient, the Spylls docs cover &lt;a href="https://spylls.readthedocs.io/en/latest/hunspell/data_aff.html"&gt;aff&lt;/a&gt; and &lt;a href="https://spylls.readthedocs.io/en/latest/hunspell/data_dic.html"&gt;dic&lt;/a&gt; files quite comprehensively. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Rebuilding the spellchecker, pt.2: Just look in the dictionary, they said!</title>
      <dc:creator>Victor Shepelev</dc:creator>
      <pubDate>Sun, 10 Jan 2021 13:10:14 +0000</pubDate>
      <link>https://dev.to/zverok/rebuilding-the-spellchecker-pt-2-just-look-in-the-dictionary-they-said-gmb</link>
      <guid>https://dev.to/zverok/rebuilding-the-spellchecker-pt-2-just-look-in-the-dictionary-they-said-gmb</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;This is the second part of the "Rebuilding the spellchecker" series, dedicated to the explanation of how the world's most popular spellchecker Hunspell works.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick recap:&lt;/strong&gt; &lt;a href="https://dev.to/zverok/rebuilding-the-most-popular-spellchecker-part-1-25e4"&gt;In the first part&lt;/a&gt;, I've described what Hunspell is, and why I decided to rewrite it in Python. It is an &lt;strong&gt;explanatory rewrite&lt;/strong&gt; dedicated to uncovering the knowledge behind Hunspell by "translating" it into a high-level language, with a lot of comments.&lt;/p&gt;

&lt;p&gt;Now, let's dive into how the stuff really works!&lt;/p&gt;

&lt;p&gt;There are two main parts of word-by-word spellchecker algorithms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check if a word is correct: &lt;strong&gt;"lookup"&lt;/strong&gt; part&lt;/li&gt;
&lt;li&gt;Propose the correction for incorrect words: &lt;strong&gt;"suggest"&lt;/strong&gt; part&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Hunspell also implements several other algorithms to be useful as standalone software. It can extract plain text from numerous formats, like HTML or TeX, and split it into words (tokenize), correctly handling punctuation—but at the end of the day, a word-by-word correctness check is applied. I excused myself from implementing these "wrapper" algorithms: text extraction and tokenization are thoroughly investigated topics, and there are numerous libraries in any language solving them with decent speed and quality.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Hunspell works on a word-by-word basis (no context is taken into account). Each word is just &lt;strong&gt;looked up&lt;/strong&gt; in the &lt;strong&gt;dictionary&lt;/strong&gt; loaded from the plaintext &lt;code&gt;&amp;lt;langname&amp;gt;.dic&lt;/code&gt; file in a Hunspell-specific format. If it is not considered correct by dictionary lookup (which, as we'll see soon, is more complex than "is it present in the dictionary"), several &lt;strong&gt;suggest&lt;/strong&gt; algorithms are applied sequentially, trying to find correct words similar to the given one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hunspell's lookup algorithm, or, Just look in the dictionary, they said!
&lt;/h2&gt;

&lt;p&gt;Coming from English-only spellchecking, developers tend to perceive the "lookup" part as trivial (e.g., the famous &lt;a href="https://norvig.com/spell-correct.html"&gt;Peter Norvig's article&lt;/a&gt; starts from the assumption that only the correction—"suggest"—part deserves any explanation). But that's not quite so.&lt;/p&gt;

&lt;p&gt;The first and most straightforward idea&lt;sup id="fnref1"&gt;1&lt;/sup&gt; for the lookup would be: we'll just take the dictionary (presumably, the flat list of all correct words) and look for our candidate word in this list: if it is there, it is correct. End of story.&lt;/p&gt;
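&lt;p&gt;As a minimal sketch of this naive idea (not Hunspell's or Spylls' actual code; the word set here is a tiny stand-in for a real word list), the whole "lookup" is a set membership test:&lt;/p&gt;

```python
# Naive "flat list" spellchecking: a word is correct iff it is in the set.
# WORDS is a tiny stand-in; a real list would be loaded from a file.
WORDS = {"ache", "ached", "aches", "aching", "ache's"}

def correct(word):
    return word.lower() in WORDS

print(correct("Aches"))  # True
print(correct("achs"))   # False
```

&lt;p&gt;The rest of this article is about why real Hunspell dictionaries can't be used this way.&lt;/p&gt;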

&lt;p&gt;Now, Hunspell's dictionaries exist for many languages and have a plaintext format which, at first sight, is quite close to a plain word list—so presumably it would be easy to reuse them? Let's take a look into the &lt;a href="https://github.com/LibreOffice/dictionaries/blob/master/en/en_US.dic"&gt;&lt;code&gt;en_US.dic&lt;/code&gt;&lt;/a&gt; in the LibreOffice dictionary repository. You'll see a list of words in the following format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
acetyl
acetylene/M
ache/DSMG
achene/MS
achievable/U
achieve/BLZGDRS
achievement/SM
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The line &lt;code&gt;ache/DSMG&lt;/code&gt; specifies the &lt;em&gt;stem&lt;/em&gt; &lt;code&gt;ache&lt;/code&gt;, having &lt;code&gt;D&lt;/code&gt;, &lt;code&gt;S&lt;/code&gt;, &lt;code&gt;M&lt;/code&gt;, &lt;code&gt;G&lt;/code&gt; &lt;em&gt;flags&lt;/em&gt; associated with it. The meaning of flags is defined by &lt;a href="https://github.com/LibreOffice/dictionaries/blob/master/en/en_US.aff"&gt;&lt;code&gt;en_US.aff&lt;/code&gt;&lt;/a&gt; (called "affix file", or just aff-file; every Hunspell dictionary is distributed as a pair of &lt;code&gt;.dic&lt;/code&gt; and &lt;code&gt;.aff&lt;/code&gt; files).&lt;/p&gt;
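&lt;p&gt;As a rough sketch (Spylls' real reader handles much more: alternative flag formats, escaped slashes, optional data fields), parsing such a line boils down to splitting off the flags:&lt;/p&gt;

```python
def parse_dic_line(line):
    """Split a .dic entry like 'ache/DSMG' into (stem, set of flags).

    Simplified: assumes the default one-character flag format and
    no optional data fields after the word.
    """
    if "/" in line:
        stem, flags = line.split("/", 1)
        return stem, set(flags)
    return line, set()

print(parse_dic_line("ache/DSMG"))  # ('ache', {'D', 'S', 'M', 'G'}) -- set order may vary
print(parse_dic_line("acetyl"))     # ('acetyl', set())
```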

&lt;h3&gt;
  
  
  Affix compression
&lt;/h3&gt;

&lt;p&gt;In this particular case, all four flags are associated with word &lt;em&gt;suffixes&lt;/em&gt;. Here's the definition of &lt;code&gt;D&lt;/code&gt; suffix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SFX D Y 4                    # Suffix header: suffix (SFX), with flag D, combinable with prefixes (Y), 4 entries:
SFX D   0  d    e            # * if the stem ends in "e", strip nothing (0), and add "d"
SFX D   y  ied  [^aeiou]y    # * if the stem ends with "y", preceded by non-vowel, strip "y" and add "ied"
SFX D   0  ed   [^ey]        # * if the stem ends with not "e", and not "y", strip nothing, add "ed"
SFX D   0  ed   [aeiou]y     # * "y" with preceding vowel: strip nothing, add "ed"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For our &lt;code&gt;ache&lt;/code&gt; stem, this definition says that the form &lt;code&gt;ached&lt;/code&gt; exists. Similarly, the &lt;code&gt;S&lt;/code&gt; flag defines that the word &lt;code&gt;aches&lt;/code&gt; exists, the &lt;code&gt;G&lt;/code&gt; flag defines &lt;code&gt;aching&lt;/code&gt;, and the &lt;code&gt;M&lt;/code&gt; flag defines &lt;code&gt;ache's&lt;/code&gt;. So, the &lt;code&gt;ache/DSMG&lt;/code&gt; line in the &lt;code&gt;.dic&lt;/code&gt; file specifies 5 correct words: "ache", "ached", "aches", "aching", "ache's".&lt;/p&gt;
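&lt;p&gt;To make the mechanics concrete, here is a sketch (my simplification, not Spylls' actual code) of applying the four &lt;code&gt;SFX D&lt;/code&gt; entries above to a stem:&lt;/p&gt;

```python
import re

# (strip, add, condition) triples transcribed from the SFX D entries above
SFX_D = [
    ("",  "d",   r"e$"),
    ("y", "ied", r"[^aeiou]y$"),
    ("",  "ed",  r"[^ey]$"),
    ("",  "ed",  r"[aeiou]y$"),
]

def apply_suffix(stem, rules):
    """Return all forms the rules produce for this stem."""
    forms = []
    for strip, add, condition in rules:
        if re.search(condition, stem):
            base = stem[:-len(strip)] if strip else stem
            forms.append(base + add)
    return forms

print(apply_suffix("ache", SFX_D))  # ['ached']
print(apply_suffix("try", SFX_D))   # ['tried']
print(apply_suffix("play", SFX_D))  # ['played']
```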

&lt;blockquote&gt;
&lt;p&gt;Note: the fact that flags are similar to the suffixes they define (the &lt;code&gt;S&lt;/code&gt; flag defines the suffix &lt;code&gt;-s&lt;/code&gt;, and so on) is just a convention. It is a handy mnemonic that the creators of the &lt;code&gt;en_US&lt;/code&gt; dictionary used; any other symbol could be there instead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In a similar fashion, word prefixes might be defined (specified with flags in dic-file, and described with &lt;code&gt;PFX&lt;/code&gt; directive in aff-file). Say, this definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;advantage/GMEDS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;...defines these forms: &lt;em&gt;advantage, advantage's, advantaging, advantaged, advantages, disadvantage's, disadvantaging, disadvantaged, disadvantages, disadvantage&lt;/em&gt; (all combinations of four suffixes and a prefix "dis-").&lt;/p&gt;
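&lt;p&gt;A rough sketch of how such an entry unfolds (my simplification: the strip/add pairs are hardcoded, and the conditions are omitted since they all match this stem):&lt;/p&gt;

```python
from itertools import product

def inflect(stem, strip, add):
    """Apply one affix entry: strip `strip` from the end, append `add`."""
    base = stem[:-len(strip)] if strip else stem
    return base + add

# (strip, add) pairs for the suffix flags on "advantage"
# (the first pair stands for the bare stem itself):
suffixes = [("", ""), ("", "s"), ("", "d"), ("e", "ing"), ("", "'s")]
prefixes = ["", "dis"]  # the prefix flag allows "dis-", per the text above

forms = {pre + inflect("advantage", strip, add)
         for pre, (strip, add) in product(prefixes, suffixes)}
print(len(forms))                 # 10
print("disadvantaging" in forms)  # True
```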

&lt;p&gt;This technique of "packing" dictionaries is called &lt;strong&gt;affix compression&lt;/strong&gt;, and its primary goal is to optimize dictionary size, on disk and in memory. It becomes extremely important for languages with rich inflection. For example, in English one stem might produce no more than ten forms, but in Ukrainian it could easily be dozens or even hundreds, so the number of possible correct forms quickly grows to tens of millions. "Just a flat list of all known words" &lt;em&gt;might&lt;/em&gt; become impractical (not on today's top MacBook, probably, but still), and that's where affix compression comes in handy.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is not the only possible approach to making dictionary storage more efficient: for example, &lt;a href="https://github.com/morfologik/morfologik-stemming"&gt;morfologik&lt;/a&gt; (the default internal spellchecker of the most widely used open-source proofreading software, &lt;a href="https://languagetool.org"&gt;LanguageTool&lt;/a&gt;) encodes &lt;em&gt;all&lt;/em&gt; possible forms in binary files, using finite-state automata. This approach is very efficient in speed and memory, but very hard for humans to edit and review, and thus to keep the dictionary up-to-date.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Another benefit of splitting words into stems and affixes: when Hunspell's user wants to add a new word to their personal dictionary, Hunspell allows them to just specify "it is inflected the same way as (some other word)", sparing the user from teaching the dictionary "'monad' is a word, and 'monads' too, as well as 'monad's'...".&lt;/p&gt;

&lt;p&gt;Note, though, that the "suffixes" and "prefixes" specified in some language's dictionary do not necessarily correspond to &lt;em&gt;grammatical&lt;/em&gt; suffixes/prefixes of the language. The splitting into stems and affixes is deduced automatically by dictionary authoring tools from flat word lists, so it is a matter of chance whether the "common endings" deduced have any grammatical meaning. Actually, Hunspell's format has a rich &lt;a href="https://manpages.debian.org/experimental/libhunspell-dev/hunspell.5.en.html#Optional_data_fields"&gt;sub-language for specifying grammatical information&lt;/a&gt;, but of all the LibreOffice and Firefox dictionaries I've checked, only a few (Latvian, Slovak, Galician, Breton) made use of this feature.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One important factor to mention is Hunspell's limit on the number of suffixes/prefixes a word may have. Currently, the software understands no more than 2 suffixes and 2 prefixes per word, which is lower than the number of grammatical suffixes many languages routinely allow. Most dictionaries solve this by "linearizing" the word list: the Ukrainian word "громадянство" (citizenship) grammatically consists of the stem "громад-" and the suffixes "-ян-", "-ств-", "-о" (the latter is called an ending, in grammatically correct terms), but Hunspell's Ukrainian dictionary includes the full word "громадянство". For some other languages, the suffix number limitation makes Hunspell totally unusable: to spellcheck Finnish, you need to install the &lt;a href="https://voikko.puimula.org/"&gt;Voikko&lt;/a&gt; spellchecker.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Affix compression comes with a price&lt;/strong&gt; paid in lookup algorithm complexity and performance. Instead of just a quick lookup through a hashtable or other lookup-optimized structure, we now have to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check if the whole word is in the list of stems. If yes, it is correct,

&lt;ul&gt;
&lt;li&gt;...unless it has a flag corresponding to the aff-file directive "this stem &lt;em&gt;requires&lt;/em&gt; prefixes or suffixes".&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;If no, check if the word has some of the known suffixes; and if so, whether the stem without one of those suffixes is in the stem list, &lt;em&gt;and&lt;/em&gt; has a flag corresponding to this suffix.&lt;/li&gt;
&lt;li&gt;If no, check if one more suffix can be found in the stem (then we'll have a stem and two suffixes, and need to check the compatibility of their flags).&lt;/li&gt;
&lt;li&gt;Repeat with prefixes (up to two), and with all possible suffix-prefix combination (taking into account whether suffix and prefix both have "cross-product" allowed).&lt;/li&gt;
&lt;li&gt;Consider that suffixes and prefixes can have flags of their own, specifying "suffixes with this flag might be attached after me", or "if used, this suffix requires that at least one other affix be present", and... many other things.&lt;/li&gt;
&lt;li&gt;And there are funny cases like &lt;code&gt;CIRCUMFIX&lt;/code&gt; flag: if the suffix has it, this means that this suffix is only allowed in words having a prefix with the same flag.&lt;/li&gt;
&lt;/ol&gt;
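&lt;p&gt;Steps 1 and 2 can be sketched like this (a toy model of my own, ignoring flag compatibility, prefixes, and all the complications of the later steps):&lt;/p&gt;

```python
import re

# Toy data: stems with their flags, and suffix entries per flag
# in (strip, add, condition) form, as in the aff-file above.
STEMS = {"ache": set("DSMG")}
SUFFIXES = {
    "D": [("", "d", r"e$"), ("y", "ied", r"[^aeiou]y$"),
          ("", "ed", r"[^ey]$"), ("", "ed", r"[aeiou]y$")],
    "S": [("", "s", r".$")],  # grossly simplified
}

def lookup(word):
    # Step 1: the whole word is a known stem
    if word in STEMS:
        return True
    # Step 2: try chopping off each known suffix; the remaining stem must
    # exist, carry the suffix's flag, and satisfy the suffix's condition
    for flag, entries in SUFFIXES.items():
        for strip, add, condition in entries:
            if add and word.endswith(add):
                stem = word[:len(word) - len(add)] + strip
                if flag in STEMS.get(stem, set()) and re.search(condition, stem):
                    return True
    return False

print(lookup("ache"), lookup("ached"), lookup("aches"), lookup("achs"))
# True True True False
```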

&lt;blockquote&gt;
&lt;p&gt;It is &lt;em&gt;still&lt;/em&gt; a simplified description. To follow the algorithm in full, you can read Spylls docs starting from &lt;a href="https://spylls.readthedocs.io/en/latest/hunspell/algo_lookup.html#spylls.hunspell.algo.lookup.Lookup.affix_forms"&gt;&lt;code&gt;Lookup.affix_forms&lt;/code&gt;&lt;/a&gt;, follow the links to methods it invokes, and read inline comments under "Show code".&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Performance-wise, each word correctness check may require many scans through the lists of known suffixes and prefixes, and quite a few dictionary lookups (as the consecutive chopping-off of suffixes and prefixes produces new stem candidates we need to check).&lt;/p&gt;

&lt;p&gt;And once you tackle the affixes problem, it only gets harder from there!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stay tuned for the next installment about the Hunspell lookup, where we'll cover word compounding problem, and some other important edge cases of the word correctness check.&lt;/strong&gt; &lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://prog21.dadgum.com/29.html"&gt;This blog entry&lt;/a&gt; circa 2008, recently &lt;a href="https://news.ycombinator.com/item?id=25296900"&gt;seen on HackerNews&lt;/a&gt;, even uses the spellchecking as an example of how much we've progressed with our tools, making the spellchecker implementation as trivial as hashmap lookup. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>spellcheck</category>
      <category>python</category>
      <category>textprocessing</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Rebuilding the most popular spellchecker. Part 1</title>
      <dc:creator>Victor Shepelev</dc:creator>
      <pubDate>Wed, 06 Jan 2021 12:25:19 +0000</pubDate>
      <link>https://dev.to/zverok/rebuilding-the-most-popular-spellchecker-part-1-25e4</link>
      <guid>https://dev.to/zverok/rebuilding-the-most-popular-spellchecker-part-1-25e4</guid>
      <description>&lt;h2&gt;
  
  
  How I decided to write a spellchecker and almost died trying
&lt;/h2&gt;

&lt;p&gt;A few years ago I had a fun idea for a "weekend project": a pure-Ruby spellchecker. Ruby is my language of choice, and a no-dependencies spellchecker seemed a small useful tool for the CI environment: for example, to check comments/docs spelling without installing any third-party software. I actually &lt;em&gt;could've&lt;/em&gt; pulled off the project in its limited scope (only English, only spotting misspelled words without fixing them, limited dictionary) with just a flat list of known words, but that's not what happened.&lt;/p&gt;

&lt;p&gt;Back then, I decided to make a moderately generic tool, at least able to work with multiple languages. Fortunately (or so I believed!), there were many already existing and freely available spellchecking dictionaries, distributed as LibreOffice and Firefox extensions. All of those dictionaries are in the format defined by the &lt;strong&gt;&lt;a href="http://hunspell.github.io/"&gt;Hunspell&lt;/a&gt;&lt;/strong&gt; tool/library—an open-source library used for spellchecking in Libre/OpenOffice and Mozilla products (Firefox, Thunderbird), but also in Google Chrome/Chromium, macOS, several Adobe products, and so on.&lt;/p&gt;

&lt;p&gt;The dictionaries looked like easy-to-reuse text files with some ("insignificant", as it seemed) metadata, and the whole "use Hunspell dictionaries from a pure Ruby spellchecker" project &lt;em&gt;still&lt;/em&gt; felt like a "weekend-long" one, for the first few weekends. Only gradually did the underwater complexity of multilanguage word-by-word spellchecking reveal itself. Eventually, I was distracted from the project and abandoned it, but I kept my fascination with the seemingly-simple, actually-mind-blowingly-complicated Hunspell, the software everybody uses daily and hardly ever notices.&lt;/p&gt;

&lt;p&gt;The idea to dig deeper into it, to &lt;em&gt;understand&lt;/em&gt; it and &lt;em&gt;explain&lt;/em&gt; it, grew on me and bothered me for quite some time. And what better way to understand something than to retell it in your own words? After several lazy and not-very-far-progressed attempts to write something Hunspell-alike (twice in Ruby, once in Rust, once in Python), eventually, in February 2020, the task I settled down to solve was an "explanatory rewrite" of Hunspell into a high-level language, with a lot of comments. I achieved this goal by December 2020, with the first release of the &lt;a href="https://github.com/zverok/spylls"&gt;Spylls&lt;/a&gt; project: &lt;strong&gt;the port of Hunspell's core algorithms into modern, well-documented, well-structured Python&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And now I want to share some insights from what I uncovered along the way: about spellchecking in general and Hunspell in particular.&lt;/p&gt;

&lt;p&gt;In the ongoing article series, I'll cover these topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is Hunspell, why is it significant, and why try to "explain" it (current article)&lt;/li&gt;
&lt;li&gt;Base spellchecking concepts: lookup and suggest, as seen by Hunspell&lt;/li&gt;
&lt;li&gt;How lookup (checking if the word is correct) works, and why it could be much more complicated than "just look in the list of the known words"&lt;/li&gt;
&lt;li&gt;How suggest (proposed fix for the incorrect word) works, and how hard it is to estimate its quality&lt;/li&gt;
&lt;li&gt;A closer look into Hunspell's dictionary format. It is the most widespread open dictionary format in the world, and we'll see what linguistic and algorithmic information it &lt;em&gt;potentially&lt;/em&gt; can carry, and what part of it is actually used in existing dictionaries&lt;/li&gt;
&lt;li&gt;Some details on Spylls implementation process and results&lt;/li&gt;
&lt;li&gt;Closing thoughts on the big picture of word-by-word spellchecker problem, and Hunspell's approach to it&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What is Hunspell?
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The information on Hunspell's origins and history is mostly my guesswork, pieced together from partial and incomplete sources.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Hunspell (&lt;small&gt;&lt;a href="https://en.wikipedia.org/wiki/Hunspell"&gt;Wikipedia article&lt;/a&gt;&lt;/small&gt;), initially a &lt;strong&gt;Hun&lt;/strong&gt;garian spellchecker, emerged as an alternative to the previously existing aspell/ispell/myspell somewhere around 2002 (I guess?). It was created by László Németh, out of a need to support languages with complicated suffixing/prefixing rules and word compounding (such as Hungarian). Hunspell's design seemingly proved itself flexible enough to support most of the world's languages, and in a few years, it became the most used spellchecker in the world. You have most probably used it even if you've never heard the name before today: Hunspell is the default spellchecking engine in Chrome and Firefox, Libre/OpenOffice, Adobe products, and macOS (not an exhaustive list). Dictionaries in Hunspell format exist for almost all actively used languages for which the concept of word-by-word spellchecking makes sense&lt;sup id="fnref1"&gt;1&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Currently, Hunspell is maintained &lt;a href="https://github.com/hunspell/hunspell"&gt;on GitHub&lt;/a&gt; (the repo has only around 1k stars, would you believe it?). It seems that maintenance is not that easy, if you weigh the number of open issues and PRs and the timeline of the latest commits: at the time of writing (Jan 2021), the last commit to master was from May 2020, and the last release was 1.7, in Dec 2018. Hunspell's codebase is mostly "old-school" C++. It is being slowly modernized; it has very few comments, and there are thousands of two-branch &lt;code&gt;if&lt;/code&gt;s to handle non-Unicode and Unicode text separately. There is also an attempt to rewrite Hunspell from scratch in modern C++, which at some point was developed under the &lt;code&gt;hunspell&lt;/code&gt; GitHub organization. Now it is independent and called &lt;a href="https://github.com/nuspell/nuspell"&gt;nuspell&lt;/a&gt; (and, while not yet supporting all of Hunspell's features, it has already "achieved" version 4.2.0).&lt;/p&gt;

&lt;p&gt;Obviously, there are open-source spellcheckers other than Hunspell. GNU aspell (which at one point was superseded by Hunspell, but still holds its ground in English suggestion quality), to name one of the older ones; but there are also novel approaches, like &lt;a href="https://github.com/wolfgarbe/SymSpell"&gt;SymSpell&lt;/a&gt;, claiming to be "1 million times faster", or the ML-based &lt;a href="https://github.com/bakwc/JamSpell"&gt;JamSpell&lt;/a&gt;, claiming to be much more accurate.&lt;/p&gt;

&lt;p&gt;And yet, what makes Hunspell stand out is its coverage of the world's languages. It is not ideal, but the number of dictionaries ready to use immediately, and the amount of &lt;em&gt;experience&lt;/em&gt; in dealing with typical problems and corner cases coded into the codebase, is hard to beat.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why rewrite it?
&lt;/h3&gt;

&lt;p&gt;As I've already stated above, the goal of the Spylls project was to create an &lt;em&gt;explanatory&lt;/em&gt; rewrite: i.e., a "retelling" of how Hunspell works in a way that is easy to follow and to play with.&lt;/p&gt;

&lt;p&gt;The necessity of this approach came to me from three facts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hunspell is used almost everywhere and is taken for granted;&lt;/li&gt;
&lt;li&gt;It is much more complicated than one might naively expect;&lt;/li&gt;
&lt;li&gt;This complexity—and years of human work that was spent growing the project—is notoriously hard to follow through the Hunspell's codebase and grasp in full.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In other words, I wanted to &lt;strong&gt;make the knowledge behind Hunspell more open&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The way I have chosen was not, of course, the only one possible. I could've just read through the original code and written a series of articles (or, rather, a book?) on how it works. I could've thoroughly commented and republished the original source code. But I felt that &lt;em&gt;reimplementing&lt;/em&gt; is the only way of understanding the whats and whys of the algorithms (at least for somebody who is not a Hunspell core developer); and that an implementation in a high-level language would allow focusing on words and language-related algorithms, not on memory management or fighting with Unicode.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note that there are also a few "pragmatic" ports of Hunspell into other languages (in order to use it in environments where a C++ dependency is undesirable), namely &lt;a href="https://github.com/aarondandy/WeCantSpell.Hunspell"&gt;WeCantSpell.Hunspell&lt;/a&gt; in C# and &lt;a href="https://github.com/wooorm/nspell"&gt;nspell&lt;/a&gt; in JS (very incomplete); and the aforementioned &lt;a href="https://github.com/nuspell/nuspell"&gt;nuspell&lt;/a&gt; can also be considered a "port" (from legacy C++ to modern C++).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Why Python?
&lt;/h3&gt;

&lt;p&gt;My language of choice is Ruby. It was also the first language that I tried to port Hunspell into. I'd have been happy to proceed with Ruby if my goal had been just a "pragmatic" library. And yet, when I decided that my goal was to make the knowledge of Hunspell's algorithms accessible to a wide audience, I understood that Ruby is not the best choice: the language's reputation (slightly esoteric and mostly-for-web) would make my project less noticeable; and my preferred coding style (a mix of OO and functional, with lots of small immutable domain objects and fluent chains of iterators), while allowing me to be very effective, would make the code less accessible to users of other languages.&lt;/p&gt;

&lt;p&gt;What I needed was a high-level language, with as low boilerplate as possible; as mainstream as possible; as easy to experiment with and prototype as possible. Without diving into too much argument here, Python and modern JavaScript seemed to be the most suitable options, and, to be honest, Python was just closer to my soul. So, here we are!&lt;/p&gt;

&lt;p&gt;The code style is mostly imperative (as that corresponds to how Hunspell is structured), with large-ish but clearly structured methods, and a small number of classes/objects (mostly they are either "whole algorithm as a class" or almost-passive "structs" -- or, in Python, dataclasses). I tried to limit my usage of complex Python-specific features (like functools or itertools), but made decent use of list comprehensions (as they are quite readable and Pythonic) and generators (lazy lists). Overall, I wanted the code to be good Python, but not too smart. Whether I succeeded is up to you to decide.&lt;/p&gt;

&lt;p&gt;Currently, &lt;a href="https://github.com/zverok/spylls"&gt;Spylls&lt;/a&gt; has &lt;strong&gt;≈1.5k lines of library code&lt;/strong&gt; in 14 files. It conforms (with &lt;a href="https://spylls.readthedocs.io/en/latest/#completeness"&gt;some reservations&lt;/a&gt;) to all of Hunspell's integration tests. Each of those tests is a set of files consisting of a test dictionary, plus lists of what words should be considered good, what words should be considered bad, and what should be suggested instead of the bad words; there are &lt;strong&gt;127 such sets to pass&lt;/strong&gt;. There are &lt;strong&gt;2 thousand comment lines&lt;/strong&gt; in the code, thoroughly explaining every detail of the algorithm and rendered at the &lt;a href="https://spylls.readthedocs.io/en/latest/hunspell.html"&gt;Spylls documentation site&lt;/a&gt;; note that besides docstrings at the beginning of each class and method, there are also inline comments in the code—that's why the documentation site uses a custom theme with an inline "Show code" feature.&lt;/p&gt;




&lt;p&gt;With this being said, I am wrapping up the introductory post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In the next series: An introduction to Hunspell's "lookup" and "suggest" concepts; and deeper dive into the lookup.&lt;/strong&gt;&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Word-by-word spellchecking makes less sense for hieroglyphic languages like Chinese and Japanese; it is also problematic for languages where words aren't separated by whitespace, like Lao or Thai. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>spellcheck</category>
      <category>python</category>
      <category>news</category>
      <category>textprocessing</category>
    </item>
  </channel>
</rss>
