TAHRI Ahmed R.

How I used brute-force where I least expected it

Detecting the encoding of a text file is a very old problem, one that programs like Chardet have only partially solved. I did not like the idea of a single prober per encoding table, which can lead to hard-coding specifications.

I wanted to challenge the existing methods of discovering the originating encoding.

You could consider this issue obsolete because of current norms:

You should indicate the charset encoding in use, as described in the standards.

But the reality is different: a huge part of the internet still serves content with an unknown encoding. (SubRip subtitle (SRT) files are one example.)

This is why a popular package like Requests embeds Chardet to guess the apparent encoding of remote resources.

You should know that:

  • You should not care about the originating charset encoding, because two different tables can produce two identical files.
  • The BOM (byte order mark) is not universal: it concerns only a small number of encodings, and not only Unicode ones!
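
Both points can be illustrated with the standard library alone. This is a minimal sketch, not code from Charset Normalizer: the `sniff_bom` helper and the list of marks are assumptions made for the example, and the GB18030 signature is simply derived by encoding U+FEFF.

```python
import codecs

text = "Hello, world! 123"

# Pure ASCII text yields the same bytes under many different tables,
# so the "original" encoding cannot be recovered and does not matter.
encodings = ["ascii", "utf_8", "latin_1", "cp1252", "iso8859_15"]
payloads = {name: text.encode(name) for name in encodings}
identical = len(set(payloads.values())) == 1


def sniff_bom(data: bytes):
    """Return a codec name if data starts with a known mark, else None.

    Hypothetical helper: most encodings have no mark at all, so None
    is the common answer.
    """
    # Longer marks first: the UTF-32 LE mark starts with the UTF-16 LE one.
    marks = [
        (codecs.BOM_UTF32_LE, "utf_32"),
        (codecs.BOM_UTF32_BE, "utf_32"),
        ("\ufeff".encode("gb18030"), "gb18030"),
        (codecs.BOM_UTF8, "utf_8_sig"),
        (codecs.BOM_UTF16_LE, "utf_16"),
        (codecs.BOM_UTF16_BE, "utf_16"),
    ]
    for mark, name in marks:
        if data.startswith(mark):
            return name
    return None


print(identical)
print(sniff_bom(codecs.BOM_UTF8 + b"abc"))
print(sniff_bom(b"plain bytes, no mark"))
```

A file without a mark tells you nothing, which is why the rest of the approach cannot rely on it.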

I'm brute-forcing on three premises (in that order):

  • The binary content fits the encoding table
  • Chaos
  • Coherence
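
The first premise can be sketched as a plain loop over candidate tables. This is only an illustration under assumptions: the `CANDIDATES` list and the `fitting_encodings` name are made up for the example, and the real package may select and order candidates differently.

```python
# Hypothetical, deliberately short list of tables to try.
CANDIDATES = ["ascii", "utf_8", "cp1252", "latin_1", "cp1251", "shift_jis", "big5"]


def fitting_encodings(payload: bytes, candidates=CANDIDATES):
    """Return {encoding: decoded text} for every table that decodes cleanly."""
    fits = {}
    for name in candidates:
        try:
            # Strict mode: a single byte that is not in the table rules it out.
            fits[name] = payload.decode(name, errors="strict")
        except UnicodeDecodeError:
            continue
    return fits


sample = "Déjà vu, café crème".encode("utf_8")
survivors = fitting_encodings(sample)
for name, decoded in survivors.items():
    print(name, repr(decoded))
```

More than one table usually survives this step, each producing a different text, which is why the chaos and coherence checks come next.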

Chaos: I opened hundreds of text files, written by humans, with the wrong encoding table. I observed, then established some ground rules about what is obvious when a text looks like a mess. I know that my interpretation of what is chaotic is very subjective; feel free to contribute to improve or rewrite it.
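
To give an idea of what such a ground rule can look like, here is a rough sketch. The two rules below (stray control or unassigned characters, and symbols glued to a letter) and the `mess_ratio` name are assumptions chosen for the example; they are not the rules the package actually applies.

```python
import unicodedata


def mess_ratio(text: str) -> float:
    """Fraction of characters that look out of place in human-written text."""
    if not text:
        return 0.0
    suspicious = 0
    previous = ""
    for char in text:
        category = unicodedata.category(char)
        if category in ("Cc", "Cn", "Co") and char not in "\t\n\r":
            # Control, unassigned or private-use code points.
            suspicious += 1
        elif category in ("Sc", "Sk", "Sm", "So") and previous.isalpha():
            # A symbol glued directly to a letter, as in "Ã©".
            suspicious += 1
        previous = char
    return suspicious / len(text)


payload = "Déjà vu, café crème".encode("utf_8")
right = payload.decode("utf_8")
wrong = payload.decode("cp1252")  # decodes without error, but with the wrong table

print(mess_ratio(right), mess_ratio(wrong))
```

The wrongly decoded text scores higher than the correct one, so candidates can be ranked by this ratio and the messiest ones dropped.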

Coherence: For each language on earth (as best we can), we have computed a ranking of how often each letter appears. I thought that this information was worth something here, so I compare those records against the decoded text to check whether I can detect intelligent design.
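
A simplified version of that comparison could look like the sketch below. The `REFERENCE_RANKS` table holds only an abbreviated, commonly cited ordering of frequent English letters, and `coherence_ratio` is a hypothetical name; both are assumptions for illustration, not the data or scoring the package ships.

```python
from collections import Counter

# Illustrative, abbreviated ranking of frequent letters (an assumption,
# not the library's data).
REFERENCE_RANKS = {"english": "etaoinshrdlu"}


def coherence_ratio(text: str, language: str = "english", top: int = 12) -> float:
    """Share of the language's most frequent letters found among the
    most frequent letters of the decoded text."""
    letters = [char for char in text.lower() if char.isalpha()]
    if not letters:
        return 0.0
    observed = [letter for letter, _ in Counter(letters).most_common(top)]
    expected = set(REFERENCE_RANKS[language][:top])
    return len(expected.intersection(observed)) / len(expected)


english = "the rain in the north set on all the old stone houses and their dull red roofs"
# Russian text encoded with cp1251, then decoded with the wrong table.
garbled = "Привет, как дела".encode("cp1251").decode("latin_1")

print(coherence_ratio(english), coherence_ratio(garbled))
```

A decoded text whose frequent letters match none of the known rankings is unlikely to be the right reading, even if it decoded without error.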

So I present to you Charset Normalizer. The Real First Universal Charset Detector.

Feel free to help through testing or contributing.
