re: What are ASCII values?


"Text" is an intuitive concept to humans, but it's fairly involved for a computer to "understand" text.

Computers natively understand only sequences of (small) numbers. Most computers treat all memory as a sequence of bytes (also called octets, meaning a pattern of 8 bits). A byte has 256 distinct values, which we usually identify with the numbers 0, 1, 2, ..., 255.

To store text in a computer, we need to encode that text as a sequence of bytes. In a single byte encoding (like ASCII), we break text into characters, and assign a byte value to each character.

For example, in ASCII, the text Hello is broken into the characters

  • H with value 72
  • e with value 101
  • l with value 108
  • l with value 108
  • o with value 111

So we call the sequence of bytes [72, 101, 108, 108, 111] the ASCII encoding of the string Hello.
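You can see this for yourself in, for example, Python, which exposes encoding directly through `str.encode`:

```python
# Encode the string "Hello" as ASCII and look at the raw byte values.
data = "Hello".encode("ascii")
print(list(data))  # [72, 101, 108, 108, 111]
```

Each character maps to exactly one byte, so the list has one number per character.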

ASCII is a character encoding, meaning it is a method for encoding text into bytes.

ASCII is special in a few important ways:

  • It's a single byte encoding. Each character is mapped to exactly one byte.
    • It is by far the most popular single-byte encoding that survives today.
  • It only uses 7 of the 8 bits; it doesn't use byte values 128 to 255.

The first fact makes it very easy for computers to use ASCII. However, it also means you can use at most 256 distinct symbols in your text (ASCII itself defines only 128) -- this makes it impossible to represent, for example, Chinese or Japanese text.

The second fact means that you can make "ASCII compatible" encodings by utilizing the extra unused bit. UTF-8, the most popular (and best) Unicode encoding, is "ASCII compatible", so text that is encoded as ASCII can be safely decoded as UTF-8 (the reverse is not true, however).
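A quick Python sketch of that asymmetry -- ASCII bytes are always valid UTF-8, but UTF-8 bytes are not always valid ASCII:

```python
# ASCII-encoded bytes decode cleanly as UTF-8, since ASCII is a subset.
ascii_bytes = "plain text".encode("ascii")
print(ascii_bytes.decode("utf-8"))  # works fine

# The reverse fails: non-ASCII characters use byte values >= 128.
utf8_bytes = "café".encode("utf-8")
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError:
    print("not valid ASCII")
```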

If you are only using English, and no funny symbols, ASCII will be enough. However, if you want to work with the full set of available symbols and languages, you will want to use a Unicode encoding. The best Unicode encoding is UTF-8.

UTF-8 is different from ASCII in a few crucial ways:

  • UTF-8 is a variable-length encoding. Depending on the character, it may take 1 to 4 bytes to store.
  • UTF-8 is "ASCII compatible". The ASCII characters still use byte values 0 to 127; all the characters that aren't ASCII characters are at least 2 bytes long, and are made up of only the bytes 128 to 255.
  • UTF-8 encodes all of Unicode. Unicode assigns a codepoint (a numeric identifier) to nearly every symbol used in modern languages and typesetting (and also many extinct languages!). Unicode code points range from 0 to 1114111. As mentioned, 0 to 127 align with how ASCII assigns byte values to the characters in ASCII.
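The variable-length behavior is easy to observe in Python: `ord` gives the Unicode codepoint, and the length of the UTF-8 encoding grows with the codepoint.

```python
# Codepoint and UTF-8 byte length for characters of increasing "size".
for ch in ["A", "é", "€", "🙂"]:
    b = ch.encode("utf-8")
    print(ch, ord(ch), len(b), list(b))
```

Plain ASCII characters like A stay 1 byte (value 65, same as ASCII), while characters further up the Unicode range take 2, 3, or 4 bytes, all built from byte values 128 to 255 (plus a lead byte above 127).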