DEV Community

igor.kulakov

Posted on • Originally published at flounder.dev

Duplicate Finder for Documentation

Anyone who has worked on technical documentation in a big team is certainly aware of the content duplication issue. Even with the best tools and practices at hand, duplication is fundamentally difficult to overcome.

As a project grows, duplicated content inevitably creeps in. This is especially true for large projects that cover many similar products or features.


Good:

define once:

<p>
    If you encounter any issues, refer to the troubleshooting guide or contact support.
</p>

then reuse elsewhere:

<TroubleshootingNote/>

Bad:

<p>
    If you encounter any issues, refer to the troubleshooting guide or contact support.
</p>

<!-- same meaning, slightly different wording-->
<p>
    In case of problems, consult the troubleshooting guide or contact support
</p>

The idea of avoiding duplication is commonly known as the DRY (Don't Repeat Yourself) principle. Though it is primarily associated with programming, the same principle is just as valuable in documentation.

Modern authoring tools typically have features for content reuse, making technical constraints less of a concern. The real problem, on the other hand, lies in spotting duplicates. Before you extract something to a reusable chunk, you need to know what to extract.

If you are a programmer, your IDE might highlight duplicate code for you:

IntelliJ IDEA highlights duplicated code

Unfortunately, the same feature is not suitable for documentation, as it relies on comparing abstract syntax trees (ASTs). Natural-language text has no comparably strict syntax to parse, so this approach doesn't work well for it.

One of my ongoing projects is to implement a duplicate finder for documentation. The tool will be capable of quickly finding non-exact, or fuzzy, matches, such as the bad example above.
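To make "fuzzy match" concrete, here is a minimal sketch that compares the two paragraphs from the bad example. It is not the algorithm used in the tool (that is covered in later posts); it simply uses `difflib.SequenceMatcher` from the Python standard library on word sequences. The helper names `words` and `similarity`, the unrelated sample sentence, and the 0.4 threshold are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def words(text: str) -> list[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1, computed over word sequences."""
    return SequenceMatcher(None, words(a), words(b)).ratio()


first = "If you encounter any issues, refer to the troubleshooting guide or contact support."
second = "In case of problems, consult the troubleshooting guide or contact support"
unrelated = "Install the package with your preferred package manager."

# Hypothetical cut-off for this sketch; a real tool would need to tune it.
THRESHOLD = 0.4

print(first == second)                             # exact comparison misses the duplicate
print(similarity(first, second) >= THRESHOLD)      # the fuzzy score flags it
print(similarity(first, unrelated) >= THRESHOLD)   # an unrelated sentence is not flagged
```

An exact comparison treats the two paragraphs as different, while the word-level ratio picks up the shared tail "the troubleshooting guide or contact support". Note that comparing every pair of fragments this way scales quadratically, so it would likely be too slow for a large project; it only illustrates what a non-exact match looks like.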

Current status

As of this writing, the project is a work in progress, but there is already a working prototype:

The UI of the duplicate finder tool prototype showing several detected duplicates in a dummy project

The algorithm takes under 30 seconds to analyze a project with about 6,000 source files on my M1 MacBook Pro, and I'm planning to improve it so that it highlights duplicates instantly as you type in the editor.

The prototype has already helped me and my colleagues find a lot of duplicates in real projects, so I'm quite enthusiastic about the results and future improvements.

What's next

In the following posts, I will lay out the algorithm step-by-step and perform benchmarks to evaluate its performance. If you are into programming, you are welcome to code along.

Alternatively, you can keep an eye on the progress and use the final deliverable when the project is complete. Once finished, this feature will be available in Writerside, a great authoring tool made by my colleagues.

I hope that the project description resonates with you, and that you'll find the walkthrough useful.

To receive updates on the new posts, follow me on X or subscribe to the mailing list on my blog.

See you in the next posts!
