
"Text" is an intuitive concept to humans, but it's fairly involved for a computer to "understand" text.

Computers natively understand only sequences of (small) numbers. Most computers treat all memory as a sequence of bytes (also known as octets, meaning patterns of 8 bits). A byte has 256 distinct values, which we usually identify with the numbers 0, 1, 2, ..., 255.

To store text in a computer, we need to encode that text as a sequence of bytes. In a single-byte encoding (like ASCII), we break text into characters and assign a byte value to each character.

For example, in ASCII, the text Hello is broken into the characters

  • H with value 72
  • e with value 101
  • l with value 108
  • l with value 108
  • o with value 111

So we call the sequence of bytes [72, 101, 108, 108, 111] the ASCII encoding of the string Hello.
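Most languages let you perform this encoding directly. As a sketch in Python, `str.encode` produces exactly these byte values:

```python
# Encode the string "Hello" as ASCII and inspect the byte values.
encoded = "Hello".encode("ascii")
print(list(encoded))  # [72, 101, 108, 108, 111]

# Decoding the bytes recovers the original string.
print(encoded.decode("ascii"))  # Hello
```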

ASCII is a character encoding, meaning it is a method for encoding text into bytes.

ASCII is special in a few important ways:

  • It's a single byte encoding. Each character is mapped to exactly one byte.
    • It is by far the most popular single-byte encoding that survives today.
  • It only uses 7 of the 8 bits; it doesn't use byte values 128 to 255.

The first fact makes it very easy for computers to use ASCII. However, it also means you can only use a small number of distinct symbols in your text (at most 256 in any single-byte encoding) -- this makes it impossible to represent, for example, Chinese or Japanese text.

The second fact means that you can make "ASCII compatible" encodings by utilizing the unused eighth bit. UTF-8, the most popular (and best) Unicode encoding, is "ASCII compatible", so text that is encoded as ASCII can be safely decoded as UTF-8 (the reverse is not true, however).
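You can see both directions of this asymmetry in Python:

```python
# ASCII-encoded bytes decode cleanly as UTF-8...
ascii_bytes = "Hello".encode("ascii")
print(ascii_bytes.decode("utf-8"))  # Hello

# ...but the reverse is not true: UTF-8 bytes for non-ASCII
# characters are not valid ASCII.
utf8_bytes = "héllo".encode("utf-8")
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as e:
    print("not valid ASCII:", e)
```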

If you are only using English, and no funny symbols, ASCII will be enough. However, if you want to work with the full set of available symbols and languages, you will want to use a Unicode encoding. The best Unicode encoding is UTF-8.


UTF-8 is different from ASCII in a few crucial ways:

  • UTF-8 is a variable-length encoding. Depending on the character, it may take 1 to 4 bytes to store.
  • UTF-8 is "ASCII compatible". The ASCII characters still use byte values 0 to 127; all the characters that aren't ASCII characters are at least 2 bytes long, and are made up of only the bytes 128 to 255.
  • UTF-8 encodes all of Unicode. Unicode assigns a codepoint (a numeric identifier) to nearly every symbol used in modern languages and typesetting (and also many extinct languages!). Unicode code points range from 0 to 1114111. As mentioned, code points 0 to 127 align with how ASCII assigns byte values to its characters.
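The properties above are easy to check for yourself. A small sketch using a handful of example characters:

```python
# UTF-8 is variable length: different characters need 1 to 4 bytes.
for ch in ["A", "é", "日", "🎉"]:
    encoded = ch.encode("utf-8")
    print(ch, "codepoint", ord(ch), "->", len(encoded), "byte(s)")

# ASCII characters keep their single-byte values; non-ASCII
# characters are encoded using only bytes in the range 128-255.
assert list("A".encode("utf-8")) == [65]
assert all(b >= 128 for b in "日".encode("utf-8"))
```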
 

Computers work with numbers (binary encoded). To make computers work with letters, we came up with encodings. We agreed that, for example, 65 stands for A, 66 for B, and so on. ASCII is one standard that describes such an encoding.
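In Python, you can inspect this character-to-number agreement directly with the built-ins `ord` and `chr`:

```python
# ord() gives a character's numeric code; chr() goes the other way.
print(ord("A"))  # 65
print(ord("B"))  # 66
print(chr(72))   # H
```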

ASCII stands for American Standard Code for Information Interchange. ASCII was originally designed for use with teletypes.

For historical reasons, ASCII outlived the original teletypes. Teletypes were used to interact with mainframes (very big old computers, the size of a room); the standard was then adopted by other computers, and it survives to this day. UTF-8 keeps these exact mappings (for compatibility reasons).
