I made a tool for debugging character encoding problems

#mojibake #unicode #text #debugging

I made a thing! It’s called string inspector, and you can install it through cargo, the Rust package manager (cargo install string-inspector).

What it does

This tool takes a string, decodes it in one or more character encodings, and prints out the characters below the raw bytes (as hex). Here’s what it looks like:

The above example shows UTF-8, but you can also interpret strings as Latin-1, and I plan on adding more character encodings later.
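To make the "characters below the raw bytes" idea concrete, here is a minimal sketch in Rust. It is not the tool's actual source: the function names hex_line and char_line are made up for illustration, and it assumes valid UTF-8 input made of printable, single-column characters.

```rust
/// Format each byte as two lowercase hex digits, separated by spaces.
fn hex_line(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|b| format!("{:02x}", b))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Place each character under the first byte of its UTF-8 encoding.
/// (Hypothetical helper; assumes every character is one column wide.)
fn char_line(text: &str) -> String {
    let mut out = String::new();
    for c in text.chars() {
        // Every byte occupies three columns in hex_line ("xx "), so a
        // character encoded as n bytes gets a cell 3 * n columns wide.
        let width = 3 * c.len_utf8();
        out.push(c);
        for _ in 1..width {
            out.push(' ');
        }
    }
    out.trim_end().to_string()
}

fn main() {
    let text = "£5";
    println!("{}", hex_line(text.as_bytes())); // c2 a3 35
    println!("{}", char_line(text)); // '£' under c2, '5' under 35
}
```

Control characters, combining marks, and double-width characters would break the alignment in this sketch, so a real tool needs to handle those separately.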

Note: this currently only works on Unix systems; Windows is not supported.

Why I made this

I made this partly because I wanted to learn Rust, and partly because I often find myself debugging text data by writing code or stepping through a debugger, when all I really want is a simple tool I can paste text into.

When I say “debug text data”, I mean I want to understand what data the string is made up of, as opposed to just printing it to the screen.

Sometimes it can be non-obvious what characters make up a string, especially when you consider characters from unfamiliar writing systems. You can get invisible characters, characters that combine with each other to produce a single “grapheme”, and characters that change the direction of the text. These things can create problems for applications.

If I’m dealing with a weird bug that only affects certain inputs, one of the first things I want to rule out is whether the input contains any unusual characters that aren’t obvious from just looking at it. If it does, then what are those characters? Is what looks like an apostrophe actually a RIGHT SINGLE QUOTATION MARK? Or perhaps my coworkers are trolling me by replacing all my semicolons with Greek question marks?
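As a rough illustration of that kind of check, the sketch below lists every non-ASCII character in a string along with its byte offset. The function name non_ascii_chars is an assumption for this example, and it is not how string inspector itself is implemented.

```rust
/// Return (byte offset, character) for every non-ASCII character in `text`.
fn non_ascii_chars(text: &str) -> Vec<(usize, char)> {
    text.char_indices().filter(|(_, c)| !c.is_ascii()).collect()
}

fn main() {
    // "it’s" with U+2019 RIGHT SINGLE QUOTATION MARK instead of an apostrophe,
    // and a statement ending in U+037E GREEK QUESTION MARK instead of ';'.
    let samples = ["it\u{2019}s", "let x = 1\u{037E}", "plain ascii"];
    for text in samples.iter() {
        println!("{:?}", text);
        for (offset, c) in non_ascii_chars(text) {
            println!("  byte {}: U+{:04X}", offset, c as u32);
        }
    }
}
```

The escapes are written out as \u{...} on purpose, because the lookalike characters would be indistinguishable from the ASCII ones if pasted directly into the source.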

Even worse is when you’re dealing with code that creates mangled text, aka mojibake.

This can happen when you take text encoded in one character encoding (such as ISO 8859-1), and interpret it as if it were encoded in another (such as UTF-8).
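Here is a small sketch of both directions of that mix-up, using only the Rust standard library. The helpers decode_latin1 and encode_latin1 are hypothetical names; they rely on the fact that ISO 8859-1 maps each byte to the Unicode code point with the same value.

```rust
/// Decode bytes as ISO 8859-1: each byte becomes the code point of the same value.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Encode text as ISO 8859-1, or return None if a character is above U+00FF.
fn encode_latin1(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| {
            let cp = c as u32;
            if cp <= 0xFF { Some(cp as u8) } else { None }
        })
        .collect()
}

fn main() {
    // UTF-8 bytes misread as Latin-1: one character turns into two.
    let utf8_bytes = "£".as_bytes(); // [0xc2, 0xa3]
    println!("{}", decode_latin1(utf8_bytes)); // Â£

    // Latin-1 bytes misread as UTF-8: 0xa3 on its own is not valid UTF-8,
    // so lossy decoding substitutes U+FFFD REPLACEMENT CHARACTER.
    let latin1_bytes = encode_latin1("£").expect("£ exists in Latin-1");
    println!("{}", String::from_utf8_lossy(&latin1_bytes));
}
```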

There’s this great example of mojibake where somebody has written out a Russian address in gibberish Latin characters and a postal worker has dutifully translated it back to Cyrillic characters:

Source: Programming with Unicode

In English, it’s less obvious when code is doing this, because English text uses mostly characters that are part of the ASCII character set (128 characters, including most of the symbols you’d find on an English keyboard). A lot of western character sets are extensions of ASCII, and one of the goals of UTF-8 was to be interoperable with ASCII, so most of the time sloppy text handling still works.

But ASCII doesn’t contain all the symbols you’d find in English text. Since I’m British, I often see the pound symbol (£) get messed up, because different character encodings represent it with different bytes. I think it’s easier to debug if you can see the actual bytes. If you know that £ is supposed to be 0xc2 0xa3 in valid UTF-8, you can check whether your string has really been encoded that way.
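That check is easy to sketch with the standard library: std::str::from_utf8 validates a byte slice, and its error reports how many leading bytes were valid. The wrapper name first_invalid_utf8 is made up for this example.

```rust
/// Return the offset of the first byte that breaks UTF-8 validity,
/// or None if the whole slice is valid UTF-8.
fn first_invalid_utf8(bytes: &[u8]) -> Option<usize> {
    std::str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}

fn main() {
    let as_utf8 = [0xc2, 0xa3, 0x35]; // "£5" encoded as UTF-8
    let as_latin1 = [0xa3, 0x35]; // "£5" encoded as ISO 8859-1

    for bytes in [&as_utf8[..], &as_latin1[..]].iter() {
        match first_invalid_utf8(bytes) {
            None => println!("{:02x?}: valid UTF-8", bytes),
            Some(offset) => println!("{:02x?}: invalid at byte {}", bytes, offset),
        }
    }
}
```

Passing validation only tells you the bytes could be UTF-8; it doesn't prove that's what the sender intended, which is why seeing the bytes next to the decoded characters still helps.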

There are lots of tools for intelligently guessing character encodings, and I looked at some of them when I started this project. Generally they use statistical analysis and guess the language of the text at the same time as they guess the encoding. But that’s really designed for long documents, not short snippets of text, and since my tool is geared towards debugging I want to help the user understand how text has been encoded, rather than spitting out a guess that may be wrong.

What's next?

Please let me know if you find this tool useful!

The initial version is just a proof of concept, so I’m expecting it to be a bit buggy. The next thing I want to do is make it more reliable.

That means adding a load more tests, and throwing lots of weird input at it to see what happens. Feel free to contribute PRs if you know Rust! I'm still figuring it out as I go.

Assuming I get that done, I then want to explore ways the tool could help identify mojibake in English text. For example, if there is a sequence of bytes that you would get by double-encoding UTF-8, maybe it could suggest what the original string is?
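One possible shape for that suggestion, purely as a sketch of the idea and not something the tool does today: treat each character as a Latin-1 byte, and see whether the resulting bytes form valid UTF-8 that differs from the input. The name undo_double_encoding is hypothetical, and the sketch assumes the wrong intermediate encoding was ISO 8859-1 (text mangled via Windows-1252 would need a different mapping).

```rust
/// If `text` looks like UTF-8 that was decoded as Latin-1 and re-encoded,
/// return the string it probably started as. Returns None when there is
/// no evidence of double encoding.
fn undo_double_encoding(text: &str) -> Option<String> {
    // Step 1: map every character back to a single Latin-1 byte.
    let mut bytes = Vec::with_capacity(text.len());
    for c in text.chars() {
        let cp = c as u32;
        if cp > 0xFF {
            return None; // not representable in Latin-1
        }
        bytes.push(cp as u8);
    }
    // Step 2: those bytes must themselves be valid UTF-8.
    let candidate = String::from_utf8(bytes).ok()?;
    // Pure ASCII round-trips unchanged, which tells us nothing.
    if candidate == text {
        None
    } else {
        Some(candidate)
    }
}

fn main() {
    for text in ["Â£5", "£5", "hello"].iter() {
        match undo_double_encoding(text) {
            Some(original) => println!("{:?} may originally have been {:?}", text, original),
            None => println!("{:?}: no sign of double encoding", text),
        }
    }
}
```

Because a genuine string could happen to pass both steps, the result is best presented as a hint for the user to judge rather than an automatic fix.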
