Dviejo

Posted on Nov 23, 2019

Character set walkthrough

#beginners #tutorial #productivity

How do we write a character like Ñ, or the word привет ("hello" in Russian, pronounced privet), to a file? We tend to assume that one character takes one byte, but this has not been true for many years.
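As a quick check of that claim, here is a minimal Python sketch that encodes both examples as UTF-8 and compares the number of characters with the number of bytes. The variable names are only illustrative.

```python
# Encode the two examples from the text as UTF-8.
enye_bytes = "Ñ".encode("utf-8")
privet_bytes = "привет".encode("utf-8")

# Number of characters versus number of bytes.
print(len("Ñ"), len(enye_bytes))          # 1 character, 2 bytes
print(len("привет"), len(privet_bytes))   # 6 characters, 12 bytes

# The bytes of the Ñ in hexadecimal.
print(" ".join(f"{b:02X}" for b in enye_bytes))  # C3 91
```

In UTF-8, neither the Ñ nor any of the Russian letters fits in a single byte.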

At the programming level, this is currently one of the biggest problems when we build an application that has to support several languages with different character sets.

It means that when we create a file we have to state the character set it is written in. Whoever receives the file needs to know that character set to be able to read it properly.

And in databases, in which character set is the data stored? When a database is created, we usually have to state which character set we want to work with. If our database must store information in Russian and Spanish, we must use a character set that supports both.
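The same question can be tried out in Python before choosing a character set for a database. This is only a sketch: `can_store` is a hypothetical helper, and the three encodings are just examples of a Western European set, a Cyrillic set, and UTF-8.

```python
text = "Ñ привет"  # Spanish and Russian in the same value

def can_store(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# latin-1 has no Cyrillic letters, cp1251 has no Ñ, UTF-8 covers both.
results = {enc: can_store(text, enc) for enc in ("latin-1", "cp1251", "utf-8")}
print(results)
```

Of the three, only UTF-8 can hold both languages in the same column.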

If we open Notepad++, type a Ñ, save it as test.txt and run type test.txt in cmd, we do not see a Ñ but two unrelated characters.

If we do the same in PowerShell, the result is different again.

We have written one Ñ and we get two different representations. This already poses a problem. What is in the file (a Ñ) is one thing; how programs interpret it is another, and in this case cmd and PowerShell interpret it differently.

If we read it with a program, for example in Python, without specifying an encoding, the Ñ does not come out right either.

What is happening is that when we open the file we are not telling Python that the character set is UTF-8.

But why did we have to say UTF-8 and not something else? In which character set did Notepad++ save the file? Unless we tell it otherwise, Notepad++ saves the file in UTF-8. When we tell Python that the file is in UTF-8, we see the Ñ.
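The effect can be reproduced with a short Python sketch. It recreates the file in a temporary folder instead of relying on Notepad++, and it assumes cp1252 as the mismatched encoding, which is a common default on Western Windows machines but may not be the one on yours.

```python
import os
import tempfile

# Recreate the file from the text: a single Ñ saved as UTF-8 (bytes C3 91).
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "wb") as f:
    f.write(b"\xc3\x91")

# Reading with a mismatched encoding (cp1252 is an assumption here)
# turns the two bytes into two unrelated characters.
with open(path, encoding="cp1252") as f:
    wrong = f.read()

# Reading with the encoding the file was actually written in gives the Ñ.
with open(path, encoding="utf-8") as f:
    right = f.read()

# ascii() keeps the output readable whatever the console's code page is.
print(ascii(wrong), ascii(right))
```

The file is identical in both reads; only the `encoding` argument changes what we get back.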

When we programmers see "weird characters" we tend to say "that is a character set problem", but we don't really know what has happened.

One lesson to learn is that whenever we open or save a file we should specify the character set it uses.
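In Python this lesson comes down to always passing `encoding=`. A minimal sketch, with a hypothetical file name written to a temporary folder:

```python
from pathlib import Path
import tempfile

path = Path(tempfile.mkdtemp()) / "greeting.txt"  # hypothetical file name

# State the encoding when writing...
path.write_text("Ñ привет", encoding="utf-8")

# ...and state the same encoding when reading.
content = path.read_text(encoding="utf-8")

# 8 characters on screen, 15 bytes on disk.
size_on_disk = len(path.read_bytes())
print(len(content), size_on_disk)
```

Leaving `encoding` out makes Python fall back to a platform-dependent default, which is how the same script can work on one machine and produce weird characters on another.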

But why is the Ñ displayed as something else in cmd? Well, when we open cmd it starts with a character set that is not UTF-8. If we run chcp, on my machine it reports:

Active code page: 850

If we look up code page 850 in a character table, we see that the first character displayed is in row C, column 3, and the second is in row 9, column 1. In other words, cmd is showing the bytes C3 and 91 as two separate characters.
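Python ships with a cp850 codec, so the table lookup can be reproduced without a console. This sketch decodes the same two bytes both ways and prints the resulting code points instead of the characters themselves, so the output does not depend on the console's own code page.

```python
raw = bytes.fromhex("C3 91")  # the two bytes stored in test.txt

# Row C, column 3 of a code page table is byte 0xC3; row 9, column 1 is 0x91.
as_cp850 = raw.decode("cp850")  # how a console on code page 850 reads them
as_utf8 = raw.decode("utf-8")   # how a UTF-8 aware program reads them

print([f"U+{ord(c):04X}" for c in as_cp850])  # two characters: ├ and æ
print([f"U+{ord(c):04X}" for c in as_utf8])   # one character: Ñ
```

The same two bytes give two characters under code page 850 and a single Ñ under UTF-8.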

If we change the character set to utf8 with the command

chcp 65001

and run type test.txt again, the Ñ appears.

The CMD program has correctly interpreted the character set of the file.

We have to keep in mind that sometimes the problem will not be in the file, but in the program that interprets the file.

If we open the file with notepad.exe, it interprets it correctly and the Ñ appears.

What bytes have actually been written to the file? We can look at the content in hexadecimal with PowerShell:

Format-Hex test.txt

The output shows that the Ñ is saved using two bytes: C3 91.
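If PowerShell is not at hand, the same check can be sketched in Python by reading the file in binary mode. The file here is recreated in a temporary folder as a stand-in for the one saved from Notepad++.

```python
import os
import tempfile

# Stand-in for the test.txt saved from Notepad++: a single Ñ in UTF-8.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Ñ")

# Binary mode returns the raw bytes with no decoding applied.
with open(path, "rb") as f:
    data = f.read()

hex_dump = " ".join(f"{b:02X}" for b in data)
print(hex_dump)  # C3 91
```

Binary mode is the way to see what is really in a file, because no character set is involved in reading it.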

If we check a UTF-8 table, we see that the Ñ is indeed represented with the two bytes C3 91 (https://www.utf8-chartable.de/)

UTF-8 is a variable-length (multibyte) character encoding that lets different systems exchange text reliably.

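The first 128 characters (the ASCII range) are each serialized as a single byte, identical to ASCII. From code point 128 up to 2047, characters are serialized with 2 bytes. The remaining characters take 3 or 4 bytes. The original UTF-8 design allowed sequences of 5 and 6 bytes, but the current standard (RFC 3629) limits it to 4.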
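To see the different lengths side by side, here is a small sketch that encodes one sample character from each range. The four samples are my own choice for illustration.

```python
# One sample per length: A, Ñ, the euro sign and an emoji (written as an
# escape so the source itself stays easy to copy).
samples = ["A", "\u00d1", "\u20ac", "\U0001F600"]

# Map each code point to the number of bytes UTF-8 needs for it.
lengths = {f"U+{ord(ch):04X}": len(ch.encode("utf-8")) for ch in samples}
print(lengths)
# {'U+0041': 1, 'U+00D1': 2, 'U+20AC': 3, 'U+1F600': 4}
```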

This article is the first in a series dedicated to character sets.
