junyu fang

ASCII and Unicode: The Evolution of Computer Languages

Today, the A, B, and C we type on our keyboards are letters to humans, but to computers, they're just 0s and 1s.

So, how do computers understand that A represents the capital letter A, rather than just a random number?

The answer is—ASCII.

ASCII

In the 1960s, American teletype (TTY) manufacturers each had their own incompatible "character sets," making communication difficult.

Each letter needed to be converted into an electrical signal (a 0/1 pulse) for transmission. Telegraph systems at the time typically used 5-bit (Baudot) or 6-bit (Fieldata) codes to represent characters.

The American Standards Association (ASA, later renamed ANSI) released ASCII (American Standard Code for Information Interchange) in 1963.

ASCII is the first widely adopted English-based character encoding standard. It uses 7-bit binary numbers to represent characters and can represent up to 2^7 = 128 symbols.

A standard ASCII chart splits the 7 bits across its two axes: the three high-order bits (b7, b6, b5) run along the top, and the four low-order bits (b4, b3, b2, b1) run down the left side.

Character Example

For example, the ASCII code for 5 is:

First, look at b7, b6, and b5 above. The corresponding value is 011.

Then look at b4, b3, b2, and b1 on the left. The corresponding value is 0101.

Therefore, the ASCII code for 5 is 011 0101 in binary, and 53 in decimal.
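A few lines of Python reproduce the lookup above (the character and values come straight from this example):

```python
# Quick check of the worked example: the character "5" and its ASCII code.
ch = "5"
code = ord(ch)                # numeric code of the character

print(code)                   # 53
print(format(code, "07b"))    # 0110101 -> high bits 011, low bits 0101
print(chr(0b0110101))         # 5       -> decoding goes the other way
```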

The Relationship Between ASCII and Computers

1. The Origin of Computer Storage Units

The design of a computer's CPU, memory chips, and bus determines the minimum number of bits that can be read or written at once.

Early machines used a variety of word and character sizes; the 36-bit DEC PDP-6, for example, packed 6-bit characters into its words, and the 60-bit CDC 6600 used its own 6-bit display code, so there was no uniform storage unit.

Later, 8 bits were gradually standardized as the minimum storage unit.

Reasons:

  1. A 7-bit ASCII character fits neatly into 8 bits, with the remaining bit available for a parity check or extensions;

  2. 8 bits map cleanly onto hexadecimal notation (1 byte = 8 bits = 2 hexadecimal digits);

  3. Memory alignment is highly efficient (byte addressing is much simpler than bit addressing).

Consequently, hardware manufacturers gradually adopted 8 bits as the minimum addressable unit, called a byte.

Since ASCII itself is only 7 bits, when an ASCII character is stored in a byte, the most significant bit (the 8th bit) is usually filled with 0.

Note: A bit is the smallest unit of computer storage, capable of storing only 0 or 1, and is typically represented by the lowercase letter "b." One byte equals 8 bits, which is equivalent to one ASCII character, typically represented by an uppercase "B."
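A short Python sketch makes the byte/bit relationship concrete (the letter "A" here is just an illustrative character):

```python
# One byte holds one ASCII character; the 8th bit is a leading 0.
ch = "A"
code = ord(ch)                      # 65

print(format(code, "08b"))          # 01000001 -> 7-bit ASCII padded to 8 bits
print(format(code, "02X"))          # 41       -> one byte is exactly two hex digits
print(len(ch.encode("ascii")))      # 1        -> one ASCII character occupies one byte
```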

2. ASCII History Timeline

1963: The first version of ASCII was released (7 bits, 128 characters).

1968: US President Lyndon B. Johnson issued a memorandum requiring all computers purchased by the US federal government to support ASCII, which promoted its widespread adoption.

1970s: IBM, DEC, and other major companies adopted ASCII, making it the universal language for computer and terminal communication.

1981: The IBM PC (personal computer) was released, using extended ASCII (8 bits, 256 characters), which carried ASCII to desktops around the world.

Extended ASCII

In the 1980s, as computing spread internationally, ASCII's English-only repertoire became a limitation: accented European letters, scientific symbols, and graphical glyphs had no codes.

ASCII uses only 7 bits, and the 8th bit of each byte is 0 by default. Various manufacturers therefore used the byte values with the highest bit set to 1 to define additional characters; this is known as Extended ASCII.

Extended ASCII uses 8-bit binary numbers to represent characters, allowing for a maximum of 2^8 = 256 symbols.

On top of the ASCII base (0–127, unchanged), 128 additional characters were added:

  1. Accented letters (é, ñ, ü) for various languages;

  2. Line-drawing symbols and other special graphics;

  3. Some mathematical symbols.

Extension standards varied by manufacturer:

  1. The IBM PC defined symbols such as smiley faces, box-drawing outlines, and Greek letters.

  2. ISO developed the ISO 8859 series (such as ISO-8859-1, "Latin-1," which covers Western European characters; in Latin-1, é = 233 (0xE9) and ç = 231 (0xE7)).

  3. Macintosh used the MacRoman encoding.

  4. Windows used the Windows-1252 encoding.

Thus, extended ASCII is not a unified standard. In an 8-bit byte, the first 128 values (0–127) are compatible with ASCII, while the upper 128 values (128–255) carry each manufacturer's own extension.
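To see the incompatibility directly, the sketch below decodes the same byte, 0xE9, under several of the legacy code pages mentioned above (the codec names are simply what Python calls these encodings; any language's encoding library would show the same effect):

```python
# The same byte value decodes to different characters under different
# vendor code pages, which is why extended ASCII was never one standard.
raw = bytes([0xE9])          # the byte 233, which Latin-1 defines as "é"

for codec in ("latin-1", "cp1252", "cp437", "mac_roman"):
    print(f"{codec:10s} -> {raw.decode(codec)}")

# latin-1 and cp1252 agree on "é"; cp437 and mac_roman print entirely
# different characters for the very same byte.
```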

Unicode Encoding

ASCII was the beginning of everything, but it had only 128 characters and covered only the English-speaking world. As computing became global, extended ASCII increasingly produced "garbled characters":

  1. On a French-speaking computer, 128–255 corresponded to é, à, and so on.

  2. On a Russian-speaking computer, 128–255 corresponded to кириллица (Cyrillic).

  3. On a PC, 128–255 might correspond to a border glyph.

The same byte value could display as completely different characters on different computers, because manufacturers extended ASCII in different, incompatible ways.
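The garbling is easy to reproduce. The sketch below writes a string under one extended-ASCII code page and reads it back under another (the string and the two code pages are arbitrary choices for illustration):

```python
# Mojibake in two lines: text stored under Latin-1 but interpreted as CP437.
text = "déjà vu"

stored = text.encode("latin-1")   # bytes written on a Western European system
print(stored.decode("cp437"))     # read back with the IBM PC code page:
                                  # the accented letters come out as unrelated symbols
```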

To address this, the Unicode Consortium was founded in 1991, and the Unicode 1.0 specification was released in 1991–1992. Unicode defines a globally unified character set that aims to cover nearly all of the world's written symbols, from Chinese and Arabic to emoji, assigning each character a unique code point in a code space of more than a million possible values. It is fully compatible with the ASCII range (0–127), preserving historical compatibility.

For example, 😊 = U+1F60A. U+ indicates a Unicode code point, and 1F60A is a hexadecimal number, which converts to decimal as 128522.

In UTF-8, it occupies 4 bytes, as the sketch below shows.

Note: UTF-8 is one encoding form of Unicode; UTF-16 and UTF-32 are others.
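A few lines of Python confirm the code-point arithmetic and the 4-byte UTF-8 encoding (the values are just the 😊 example from above):

```python
# Verifying the U+1F60A example with Python's built-in Unicode support.
ch = "\U0001F60A"                    # 😊

print(hex(ord(ch)))                  # 0x1f60a
print(ord(ch))                       # 128522
print(ch.encode("utf-8"))            # b'\xf0\x9f\x98\x8a' -> 4 bytes in UTF-8
print(len(ch.encode("utf-16-be")))   # 4 -> a surrogate pair in UTF-16
print(len(ch.encode("utf-32-be")))   # 4 -> one fixed-width unit in UTF-32
```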

Summary

ASCII was the first widely adopted character standard for computers, letting machines agree on A–Z, 0–9, punctuation, and control characters.

ASCII uses 7 bits and can represent 128 characters; extended ASCII uses 8 bits and can represent up to 256 characters, with different definitions across countries and manufacturers. Unicode resolved those incompatibilities with a single universal character set.

ASCII laid the foundation for modern encodings and still plays a key role in Unicode today.

So, the next time you type "A" into your computer, think about this—to the computer, it's actually just 0100 0001.

Supplementary Charts

[Image: ASCII code table]

[Image: Extended ASCII table]

Reference Articles

1. ASCII – Wikipedia: https://en.wikipedia.org/wiki/ASCII

2. Lyndon B. Johnson – Wikipedia (Chinese): https://zh.wikipedia.org/zh-cn/%E6%9E%97%E7%99%BB%C2%B7%E7%BA%A6%E7%BF%B0%E9%80%8A

3. Teleprinter – Wikipedia: https://en.wikipedia.org/wiki/Teleprinter

4. Unicode – Wikipedia (Chinese): https://zh.wikipedia.org/zh-cn/%E7%BB%9F%E4%B8%80%E7%A0%81
