DEV Community

Nirmal Patel

UTF-8

UTF-8 is a multi-byte, variable-width character encoding for storing Unicode code points, which together cover almost every character used in the world's written languages.
UTF-8 uses a single byte to store code points 0-127, so English text looks exactly the same in UTF-8 as it does in ASCII.

ASCII

ASCII represents every character using a number between 0 and 127; the printable characters occupy 32 to 126. Space is 32, the letter “A” is 65, etc. Every value can conveniently be stored in 7 bits.

Using 7 bits gives 128 possible values from 0000000 to 1111111, so ASCII has enough room for all lower case and upper case Latin letters, along with each numerical digit, common punctuation marks, spaces, tabs and other control characters.
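
The arithmetic above is easy to check. Here is a minimal Python sketch (the variable names are just for illustration) that uses the built-in `ord()` to look up the numbers behind a couple of characters:

```python
# Number of distinct values that fit in 7 bits.
ascii_size = 2 ** 7

# ord() returns the number behind a character.
space_code = ord(" ")
letter_a_code = ord("A")

print(ascii_size)                    # 128
print(space_code, letter_a_code)     # 32 65
print(format(letter_a_code, "07b"))  # 1000001, i.e. "A" written in 7 bits
```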

ANSI

The so-called ANSI encodings are the same as ASCII below 128, but there were lots of different ways to handle the values from 128 and up, depending on where you lived. These different systems were called code pages. For example, in Israel DOS used code page 862, while Greek users used code page 737.
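
To see why code pages were a problem, here is a small Python sketch. It assumes the `cp862` and `cp737` codecs that ship with Python's standard library, and decodes the same byte under each code page:

```python
# One byte value above the ASCII range.
raw = bytes([0x80])

# The same byte means a different letter in each code page.
hebrew = raw.decode("cp862")  # Hebrew DOS code page
greek = raw.decode("cp737")   # Greek DOS code page
print(hebrew, greek, hebrew == greek)

# Bytes below 128 decode identically under both code pages.
plain = b"DOS"
same_below_128 = plain.decode("cp862") == plain.decode("cp737")
print(same_below_128)  # True
```

A file written on one machine would show different letters on another unless both happened to use the same code page.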

Unicode

Unicode is a single character set that aims to include every reasonable writing system on the planet.

Characters are represented as code points

Every platonic (abstract) letter in every alphabet is assigned a magic number by the Unicode Consortium, which is written like this: U+0639.
This magic number is called a code point.
The U+ means “Unicode” and the digits after it are hexadecimal.
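
As a quick illustration, here is a Python sketch that prints code points in the U+ notation. The helper `code_point_label` is a made-up name for this example; `ord()` and `unicodedata.name()` are from the standard library:

```python
import unicodedata

def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

ain = "\u0639"  # the character at code point U+0639
print(code_point_label(ain))   # U+0639
print(unicodedata.name(ain))   # ARABIC LETTER AIN
print(code_point_label("A"))   # U+0041
```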

UTF-8

(8-bit Unicode Transformation Format)

UTF-8 is a system for storing strings of Unicode code points, those magic U+ numbers, in memory using 8-bit bytes.

In UTF-8, every code point from 0-127 is stored in a single byte. This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII.
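
A short Python check of that side effect, using only the built-in `str.encode()`:

```python
text = "Hello"

as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")

# For code points 0-127 the two encodings produce identical bytes.
print(as_ascii == as_utf8)  # True
print(list(as_utf8))        # [72, 101, 108, 108, 111]
```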

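Code points 128 and above are stored using 2, 3, or 4 bytes. (The original design allowed sequences of up to 6 bytes, but RFC 3629 limits UTF-8 to 4 bytes, which is enough for every code point up to U+10FFFF.)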
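
To make the byte layout concrete, here is a hand-written encoder sketch in Python. It assumes the modern definition of UTF-8 (RFC 3629), where a sequence is at most 4 bytes long. The function name `utf8_encode` is hypothetical; in real code you would simply call `str.encode("utf-8")`, which the loop at the end uses as a cross-check:

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the RFC 3629 bit layout."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Compare against Python's built-in encoder for 1-, 2-, 3- and 4-byte cases.
for ch in ["H", "Я", "€", "😀"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(ch, len(encoded), encoded.hex())
```

The first byte's leading bits tell a decoder how many bytes the sequence has, and every continuation byte starts with the bits `10`.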

UTF-8 is therefore a multi-byte, variable-width encoding. It is multi-byte because a single character like Я takes more than one byte to store (two, in this case).
It is variable-width because some characters, like H, take only 1 byte and others take up to 4.
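
One practical consequence is that the number of characters in a string and the number of bytes it occupies can differ. A minimal Python sketch (the sample string is arbitrary):

```python
text = "Hi Я"

code_points = len(text)                 # characters (code points) in the string
byte_count = len(text.encode("utf-8"))  # bytes once encoded as UTF-8

# Width of each character on its own.
widths = {ch: len(ch.encode("utf-8")) for ch in text}

print(code_points, byte_count)  # 4 5
print(widths)                   # {'H': 1, 'i': 1, ' ': 1, 'Я': 2}
```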

UTF-8 is universal: it can encode every Unicode code point, so it covers Latin characters as well as Cyrillic, Arabic, Japanese, and every other script in Unicode.

References used:

  1. https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

  2. https://www.smashingmagazine.com/2012/06/all-about-unicode-utf8-character-sets/
