<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bradley Neumaier</title>
    <description>The latest articles on DEV Community by Bradley Neumaier (@neumaneuma).</description>
    <link>https://dev.to/neumaneuma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F152560%2F194421fe-8ee9-4c84-998c-3cd25fbf3635.jpeg</url>
      <title>DEV Community: Bradley Neumaier</title>
      <link>https://dev.to/neumaneuma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/neumaneuma"/>
    <language>en</language>
    <item>
      <title>On working incrementally</title>
      <dc:creator>Bradley Neumaier</dc:creator>
      <pubDate>Tue, 21 Jul 2020 15:20:47 +0000</pubDate>
      <link>https://dev.to/neumaneuma/on-working-incrementally-3227</link>
      <guid>https://dev.to/neumaneuma/on-working-incrementally-3227</guid>
      <description>&lt;p&gt;I have learned a few different things through trial and error over the course of my software developer career. The most prominent of which is how to work incrementally. I got this phrase from &lt;a href="https://www.educba.com/agile-frameworks/"&gt;agile&lt;/a&gt; though my use of it here has nothing to do with that methodology. What I am referring to is the mindset and workflow that I use when developing.&lt;/p&gt;

&lt;p&gt;Before defining what I mean in abstract terms, allow me to present an example. Suppose you have a major refactoring task. Several hours in, you try running the code for the first time and find that it doesn’t even build. Frowning, you dive into the compiler errors and emerge with buildable code an hour later. At this point you finish working for the day, thinking that all is well. The next day you log several more hours of refactoring, fix all the compiler errors, and finally finish the refactor. The code is now, of course, a perfection the likes of which no one has ever seen before nor will ever see again. You commit your code into git and sit back, thinking that all is right with the world. Belatedly, you realize you haven’t run any unit tests yet. Doing so produces more red than green (i.e., you’ve broken so many tests that more fail now than pass). Panic begins to set in. You frantically look through the errors and try to debug the problem, but now you’re dealing with tests and code you had no part in writing, and it’s not as easy to figure out how to fix them. It’s the end of the day and you haven’t made any progress on the unit tests. You spend the entire next day attempting to fix them, with limited success. At this point you begin to wonder whether it wouldn’t be easier to start over from scratch.&lt;/p&gt;

&lt;p&gt;Ever been in a similar situation before? I have, on more than one occasion. If my example didn’t convey a sense of dread, then let me say outright that being in that situation sucks. The solution is, as the title and intro suggested, to work incrementally. Instead of waiting until several hours have elapsed to try building the code, build it frequently. Instead of waiting two days to run the existing unit tests, run them frequently. And instead of waiting until you’re finished to commit your code, commit frequently. TDD advocates a similar approach of building and running unit tests often, but what I’m talking about is more generic and also relies heavily on version control (specifically git, though by no means does it have to be).&lt;/p&gt;

&lt;p&gt;My approach to a giant refactoring task involves a frequent cycle of building (assuming I’m working with a compiled language), running the unit tests, and committing my code. In particular, I make sure my commits are atomic. This means that the code in each of my commits does one thing only (sort of like the &lt;a href="https://en.m.wikipedia.org/wiki/Single-responsibility_principle"&gt;single responsibility principle&lt;/a&gt; for git commits). Much like SRP, the granularity can be taken to an excessive degree, but what I generally mean is that I limit myself to, for example, refactoring one class at a time, or even one method at a time. Note that I don’t mean you should commit one file at a time, either. If refactoring a class requires changing 20 files, that is totally fine. Atomic commits are not the same as small commits. With respect to size, an atomic commit changes only one thing at a time, which implies the change will be as small as possible but doesn’t make smallness a goal.&lt;/p&gt;

&lt;p&gt;Atomic commits also mean that I could check out any commit in my git history and have code in a working state: it builds and all the unit tests pass. Sometimes, however, it isn’t practical to have every unit test passing at all times. You might go an entire day refactoring and still not have code that passes every test, simply because of the nature of the refactor. It is still useful to commit frequently in these instances. These commits are called work-in-progress commits. I prefix the commit title with &lt;code&gt;WIP&lt;/code&gt; to differentiate them from regular commits. I also don’t push them to the branch I’m working on: either don’t push them at all, or push them to a temp branch. For example, I might make a work-in-progress commit after every several tests I fix. The important thing is that once I get all the tests back into a passing state, I &lt;a href="https://linuxhint.com/how-to-squash-git-commits/"&gt;squash&lt;/a&gt; all those work-in-progress commits into one atomic commit.&lt;/p&gt;
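&lt;p&gt;A toy run of that WIP-then-squash cycle might look like this (scratch repo; the file names and commit messages are invented, and &lt;code&gt;git reset --soft&lt;/code&gt; is just one of several ways to squash):&lt;/p&gt;

```shell
# Toy demonstration of WIP commits followed by a squash. Everything here
# (messages, file names) is made up for illustration.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "dev@example.com" && git config user.name "Dev"
git commit -q --allow-empty -m "base: refactor starting point"

echo "fix tests 1-5" > notes.txt && git add notes.txt
git commit -qm "WIP: parser tests 1-5 passing again"
echo "fix tests 6-12" >> notes.txt
git commit -qam "WIP: parser tests 6-12 passing again"

# All tests pass again: collapse the two WIP commits into one atomic commit.
git reset -q --soft HEAD~2
git commit -qm "Refactor parser; all unit tests passing"
git log --oneline   # history is now the base commit plus one atomic commit
```

&lt;p&gt;An interactive rebase (&lt;code&gt;git rebase -i&lt;/code&gt;) achieves the same squash; &lt;code&gt;git reset --soft&lt;/code&gt; is simply the easiest to show non-interactively.&lt;/p&gt;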

&lt;p&gt;Being this fastidious might seem like major overkill, but believe me when I say that it is a tremendous help when you need to check out or revert an old commit. The person in my example scenario was contemplating reverting the entirety of their changes. Wouldn’t it be far preferable to only have to reset to the last commit? This approach also makes finding where a bug was introduced much easier. Start from a commit where you know the bug didn’t exist, and work your way up, commit by commit, checking whether the bug was introduced there. Once you find the offending commit, there is a drastically smaller diff to pore over than if you were going into it blind.&lt;/p&gt;

&lt;p&gt;I’ve even been in a non-coding situation where I used this “working incrementally” mindset. I was helping a friend set up some furniture and we had to use a tool we were unfamiliar with. I approached the problem methodically, seeing how the tool worked on different surfaces and limiting myself to changing only one variable at a time in my “unit tests.” In a sense, what I’m describing is just the scientific method.&lt;/p&gt;

&lt;p&gt;Following my “working incrementally” guidelines might not make much sense to you if you’re still new in your programming career. If someone else had written this blog post and I had tried to read it in college, I probably would’ve fallen asleep midway through. It also might feel like a huge pain to constantly run unit tests and commit your code. To which I would respond that putting forth some effort upfront is worth it to save yourself a whole lot of pain down the road.&lt;/p&gt;

&lt;p&gt;For additional resources on atomic commits I would recommend &lt;a href="https://www.freshconsulting.com/atomic-commits/"&gt;this&lt;/a&gt; and &lt;a href="https://curiousprogrammer.dev/blog/how-to-craft-your-changes-into-small-atomic-commits-using-git/"&gt;this&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>tdd</category>
      <category>git</category>
      <category>atomiccommits</category>
    </item>
    <item>
      <title>Why I hate refactoring</title>
      <dc:creator>Bradley Neumaier</dc:creator>
      <pubDate>Mon, 30 Dec 2019 01:54:09 +0000</pubDate>
      <link>https://dev.to/neumaneuma/why-i-hate-refactoring-kaj</link>
      <guid>https://dev.to/neumaneuma/why-i-hate-refactoring-kaj</guid>
      <description>&lt;p&gt;Okay, I admit it. This was a clickbait title. So sue me. I don't actually hate refactoring. What I really hate, and what will be the topic of this post, is being lazy (particularly when it applies to refactoring).&lt;/p&gt;

&lt;p&gt;How does this phenomenon arise? Well, it's pretty simple. We as developers get lazy. For example, suppose someone points out in a code review that we could DRY some code and thereby remove duplicated effort. We respond by saying there's no time for that now and that we'll add it to the technical debt in our backlog. Or we settle for a few TODO comments instead of doing the due diligence to write clean code. It will get done eventually, after all.&lt;/p&gt;

&lt;p&gt;Wrong! These things rarely get done in practice once they've been relegated to the purgatory known as "to be done later." There's always more important work to be done than humble ol' refactoring. I've worked for two different employers since graduating college, and I honestly cannot think of a single instance where cleaning up technical debt and refactoring code actually got prioritized in a sprint.&lt;/p&gt;

&lt;p&gt;And to be fair, I'm not arguing that refactoring is always more important than writing new features. I'm not even arguing that it's more important most of the time. At best you could say that, occasionally, the benefits probably outweigh the costs for some specific instance. Which, as the clever reader that you are will surmise, is hardly a standing ovation for refactoring. So what is the point I'm trying to make?&lt;/p&gt;

&lt;p&gt;Stop being lazy! Stop procrastinating! The correct answer to the first hypothetical scenario from above should have been, "I'll DRY the code in my upcoming code review," and the correct way to handle the second scenario is to not leave TODOs littered in your code unless absolutely necessary.&lt;/p&gt;

&lt;p&gt;Legacy TODOs are so obnoxious. No one on the team knows what they mean anymore, so everyone is afraid to remove them in case they could still be useful. What you end up with is developers trained to ignore TODOs, which is not a good habit to ingrain.&lt;/p&gt;

&lt;p&gt;Similarly, technical debt cleanup in the backlog is always relegated to being a second class citizen. It's very hard to justify "improving code quality" over "delivering features that earn money."&lt;/p&gt;

&lt;p&gt;Fortunately there's a simple, albeit effortful, solution to all this. Take pride in writing clean code. Do your due diligence to ensure you follow the &lt;a href="https://medium.com/@biratkirat/step-8-the-boy-scout-rule-robert-c-martin-uncle-bob-9ac839778385"&gt;&lt;del&gt;boy&lt;/del&gt; person scout rule&lt;/a&gt;. Taking an extra hour in the sprint to refactor code to be maintainable could mean not having to spend 15 hours a year from now refactoring the bigger mess your procrastination induced. Or perhaps, as is more likely, you never get around to refactoring it, and any work involving that gross, legacy part of the codebase just becomes more and more difficult as time goes on.&lt;/p&gt;

</description>
      <category>refactoring</category>
      <category>todo</category>
      <category>lazy</category>
    </item>
    <item>
      <title>Decoding the confusing world of encodings (Part 2)</title>
      <dc:creator>Bradley Neumaier</dc:creator>
      <pubDate>Thu, 23 May 2019 00:54:07 +0000</pubDate>
      <link>https://dev.to/neumaneuma/decoding-the-confusing-world-of-encodings-part-2-4lo</link>
      <guid>https://dev.to/neumaneuma/decoding-the-confusing-world-of-encodings-part-2-4lo</guid>
<description>&lt;h1&gt;What is an encoding? Part 2&lt;/h1&gt;

&lt;p&gt;In &lt;a href="https://dev.to/neumaneuma/decoding-the-confusing-world-of-encodings-part-1-3oke"&gt;part 1&lt;/a&gt; we demystified the following ways the term "encoding" is used:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This file is hex encoded&lt;/p&gt;

&lt;p&gt;This file uses an ASCII encoding&lt;/p&gt;

&lt;p&gt;This string is Unicode encoded&lt;/p&gt;

&lt;p&gt;Let's write the output to a UTF-8 encoded file&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In part 2 we'll address the remaining ways "encoding" could be used:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Our message is safe because it's encoded using Base64&lt;/p&gt;

&lt;p&gt;Python uses Unicode strings for encoding&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;h2&gt;Our message is safe because it's encoded using Base64&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;This statement deals with several different concepts. I'll start by going over the types of encoding.&lt;/p&gt;

&lt;p&gt;As best I can tell, there are two categories of encoding: &lt;a href="https://en.wikipedia.org/wiki/Character_encoding" rel="noopener noreferrer"&gt;character encodings&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Binary-to-text_encoding" rel="noopener noreferrer"&gt;binary-to-text encodings&lt;/a&gt;. ASCII and UTF-8 are examples of character encodings. Base64 is an example of a binary-to-text encoding.&lt;/p&gt;

&lt;p&gt;What's the difference? Both share the goal of turning bits into characters. A character encoding's job is to represent text: it maps bits to the full range of characters that humans read and write. A binary-to-text encoding's job is to represent arbitrary binary data using only printable characters, so that any byte sequence can travel safely through channels designed for text.&lt;/p&gt;

&lt;p&gt;Wait, what? That was a nebulous distinction you say? Okay, let me try to explain it in a different way. A character encoding like ASCII is really good for data storage and transmission. For example, say you're writing a speech. You want to save it on your computer so you don't have to re-type it every time. The computer stores that speech as a bunch of &lt;code&gt;1&lt;/code&gt;s and &lt;code&gt;0&lt;/code&gt;s. ASCII is needed to translate those bits back into the words, letters, and punctuation that make up the speech. In the same way, say you want to upload the speech to the cloud. The exact same process is needed to transport that speech over the Internet.&lt;/p&gt;

&lt;p&gt;Base64 is an example of a binary-to-text encoding. In fact, it's by far the most common one in use, much like UTF-8 is for character encodings on the web. Its alphabet is a subset of ASCII, containing 64 of the 128 ASCII characters: &lt;code&gt;a-z&lt;/code&gt;, &lt;code&gt;A-Z&lt;/code&gt;, &lt;code&gt;0-9&lt;/code&gt;, &lt;code&gt;+&lt;/code&gt;, and &lt;code&gt;/&lt;/code&gt;. It doesn't contain non-printable characters like &lt;code&gt;NUL&lt;/code&gt; or the other ASCII control characters. Base64 is often used to translate a binary file to text, or a text file containing non-printable characters to one with only printable characters. The benefit is that you can output the contents of any type of file, no matter what data it contains. It doesn't have to be a file, either; it can be just a string, such as a password. You are guaranteed to always get characters that can be displayed, no matter what the underlying bits are. That is something UTF-8 cannot accomplish. How does Base64 do it?&lt;/p&gt;
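&lt;p&gt;That guarantee is easy to check on the command line (the bytes below are arbitrary, and none of them are valid UTF-8):&lt;/p&gt;

```shell
# Three bytes that UTF-8 cannot render still Base64-encode to
# printable characters:
printf '\xff\xfe\xfd' | base64    # prints //79
```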

&lt;p&gt;I described in the UTF-8 section in part 1 how certain bit patterns at the start of a byte indicate how many bytes the character will be. &lt;code&gt;0&lt;/code&gt; for 1 byte, &lt;code&gt;110&lt;/code&gt; for 2 bytes, &lt;code&gt;1110&lt;/code&gt; for 3 bytes, and &lt;code&gt;11110&lt;/code&gt; for 4 bytes. And it uses &lt;code&gt;10&lt;/code&gt; to indicate a byte is a continuation byte. This means that byte sequences that don't follow this pattern are incomprehensible to UTF-8. A byte that doesn't start with &lt;code&gt;0&lt;/code&gt;, &lt;code&gt;10&lt;/code&gt;, &lt;code&gt;110&lt;/code&gt;, &lt;code&gt;1110&lt;/code&gt;, or &lt;code&gt;11110&lt;/code&gt; wouldn't be rendered properly by UTF-8. For example, UTF-8 doesn't understand &lt;code&gt;11111111&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's show this on the command line with a new file, &lt;code&gt;file3.txt&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;file3.txt
123


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd &lt;span class="nt"&gt;-b&lt;/span&gt; file3.txt
00000000: 00110001 00110010 00110011 00001010                    123.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'\xff'&lt;/span&gt; | &lt;span class="nb"&gt;dd &lt;/span&gt;&lt;span class="nv"&gt;of&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;file3.txt &lt;span class="nv"&gt;bs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;seek&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;conv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;notrunc &lt;span class="c"&gt;# overwrite the first byte with 11111111&lt;/span&gt;
1+0 records &lt;span class="k"&gt;in
&lt;/span&gt;1+0 records out
1 byte copied, 0.0009188 s, 1.1 kB/s


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd &lt;span class="nt"&gt;-b&lt;/span&gt; file3.txt
00000000: 11111111 00110010 00110011 00001010                    .23.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is what the file looked like in VSCode using a UTF-8 encoding before being overwritten with the &lt;code&gt;printf '\xff' | dd...&lt;/code&gt; command:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcdd6ptj3ekhti8qnv5c.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcdd6ptj3ekhti8qnv5c.JPG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this is what it looked like after:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cyccm4f2rmx4dx0weff.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cyccm4f2rmx4dx0weff.JPG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As mentioned before, Base64 can always display printable characters, even when UTF-8 cannot. Let's see that in action:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;base64 &lt;/span&gt;file3.txt &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; file4.txt


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And now the file has printable characters:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjpdjvc2v8xp1uko3405.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjpdjvc2v8xp1uko3405.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, great. But how did we end up with &lt;code&gt;/zIzCg==&lt;/code&gt;? I'll take this one step at a time to avoid confusion.&lt;/p&gt;
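&lt;p&gt;(For reference, the whole example can be reproduced in one short session; these are the same four bytes that the &lt;code&gt;dd&lt;/code&gt; incantation produced above:)&lt;/p&gt;

```shell
# Recreate file3.txt: bytes 11111111 00110010 00110011 00001010
printf '\xff23\n' > file3.txt
base64 file3.txt    # prints /zIzCg==
```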

&lt;p&gt;Base64 has 64 characters in its alphabet. That means it only needs 6 bits to represent the whole alphabet (2&lt;sup&gt;6&lt;/sup&gt; == 64). UTF-8 uses the leading bits in a byte as metadata to determine whether it's a starting byte or a continuation byte. Those metadata bits don't hold any information about the character being stored (i.e., the actual data). In contrast, Base64 uses every bit as data; it has no metadata. However, as I mentioned, it only uses 6 bits per character, while a byte has 8 bits. How does this math line up?&lt;/p&gt;

&lt;p&gt;Let's start by examining the Base64 table, which looks very similar to the ASCII table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgot48wfz8cj9x4ajma8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgot48wfz8cj9x4ajma8u.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;file3.txt&lt;/code&gt;'s binary representation is &lt;code&gt;11111111 00110010 00110011 00001010&lt;/code&gt;. The way Base64 works is to interpret the bits in groups of 6. So even though the logical grouping of a byte is 8 bits, we're going to modify the groupings to be 6 bits (to reflect how Base64 sees this): &lt;code&gt;111111 110011 001000 110011 000010 10&lt;/code&gt;. In fact, let's look at it in a table format to make things easier:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bits&lt;/th&gt;
&lt;th&gt;Base64 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;111111&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;110011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;001000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;I&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;110011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;000010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;C&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;10&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;???&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first 5 groupings of 6 bits line up perfectly with the first 5 characters of our Base64 encoded &lt;code&gt;file4.txt&lt;/code&gt;. But we only have 2 bits remaining at the end, which is not enough to make a valid character in Base64. &lt;code&gt;file3.txt&lt;/code&gt; had 4 bytes, which is 32 bits. 32 is not divisible by 6.&lt;/p&gt;

&lt;p&gt;When a file's size in bits is not divisible by 6, Base64 resorts to padding. To make our 32-bit file compatible with Base64 we'll append &lt;code&gt;0000&lt;/code&gt; to the end so that the final character can be properly rendered. Here is the new bit string: &lt;code&gt;111111 110011 001000 110011 000010 100000&lt;/code&gt;. Let's view it in table format too:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bits&lt;/th&gt;
&lt;th&gt;Base64 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;111111&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;110011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;001000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;I&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;110011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;000010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;C&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;100000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;g&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's much better. Now the first 6 characters match. But what about the &lt;code&gt;==&lt;/code&gt; at the end? We have no bits remaining. In fact, &lt;code&gt;=&lt;/code&gt; isn't even in the Base64 table! What gives?&lt;/p&gt;

&lt;p&gt;Base64 requires that the number of characters output be divisible by 4, so those &lt;code&gt;=&lt;/code&gt; are padding characters added to satisfy that requirement. But why does the requirement exist? Well, let's hypothesize a bit. Base64 characters use 6 bits each. A byte uses 8 bits. Bytes are the fundamental building blocks in a file system; we don't measure things in bits, but in bytes. So how many Base64 characters does it take for the total number of bits to fit neatly into a whole number of bytes (i.e., be divisible by 8)?&lt;/p&gt;

&lt;p&gt;It takes 24 bits, which is 3 bytes. And there are 4 Base64 characters (of 6 bits each) in 24 bits. I suppose this was the rationale behind the &lt;code&gt;=&lt;/code&gt; padding requirement.&lt;/p&gt;

&lt;p&gt;Here is a table that displays how the original file size affects the Base64 output:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Original file size&lt;/th&gt;
&lt;th&gt;# of Base64 characters&lt;/th&gt;
&lt;th&gt;
&lt;code&gt;=&lt;/code&gt; padding&lt;/th&gt;
&lt;th&gt;
&lt;code&gt;0&lt;/code&gt; padding&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 byte&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;==&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0000&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 bytes&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;=&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;00&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 bytes&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 bytes&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;code&gt;==&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0000&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 bytes&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;code&gt;=&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;00&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6 bytes&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
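&lt;p&gt;The first three rows of the table are easy to verify from the command line (any short strings will do):&lt;/p&gt;

```shell
# 1, 2, and 3 input bytes all yield 4 Base64 characters, with
# ==, =, and no padding respectively:
printf 'A'   | base64    # QQ==
printf 'AB'  | base64    # QUI=
printf 'ABC' | base64    # QUJD
```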

&lt;p&gt;Let's walk through some examples of strings that both require padding and do not require it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;2 &lt;code&gt;=&lt;/code&gt; of padding: &lt;code&gt;@&lt;/code&gt; (&lt;code&gt;01000000&lt;/code&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bytes&lt;/th&gt;
&lt;th&gt;UTF-8 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01000000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bits&lt;/th&gt;
&lt;th&gt;Bit positions&lt;/th&gt;
&lt;th&gt;Base64 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;010000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;010000&lt;/strong&gt; 00&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Q&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;000000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;010000 &lt;strong&gt;00&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;A&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;padding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;none&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;=&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;padding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;none&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;=&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice that since there were only 2 bits left over at the end, &lt;code&gt;0000&lt;/code&gt; was appended as padding to make the bit length (excluding any &lt;code&gt;=&lt;/code&gt; padding) divisible by 6.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;1 &lt;code&gt;=&lt;/code&gt; of padding: &lt;code&gt;AB&lt;/code&gt; (&lt;code&gt;0100000101000010&lt;/code&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bytes&lt;/th&gt;
&lt;th&gt;UTF-8 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01000001&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;A&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01000010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;B&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bits&lt;/th&gt;
&lt;th&gt;Bit positions&lt;/th&gt;
&lt;th&gt;Base64 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;010000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;010000&lt;/strong&gt; 0101000010&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Q&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;010100&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;010000 &lt;strong&gt;010100&lt;/strong&gt; 0010&lt;/td&gt;
&lt;td&gt;&lt;code&gt;U&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;001000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;010000010100 &lt;strong&gt;0010&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;I&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;padding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;none&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;=&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This time &lt;code&gt;00&lt;/code&gt; was used as padding at the end of the string.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;No padding: &lt;code&gt;v3c&lt;/code&gt; (&lt;code&gt;011101100011001101100011&lt;/code&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bytes&lt;/th&gt;
&lt;th&gt;UTF-8 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01110110&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;v&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;00110011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01100011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;c&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bits&lt;/th&gt;
&lt;th&gt;Bit positions&lt;/th&gt;
&lt;th&gt;Base64 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;011101&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;011101&lt;/strong&gt; 100011001101100011&lt;/td&gt;
&lt;td&gt;&lt;code&gt;d&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;100011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;011101 &lt;strong&gt;100011&lt;/strong&gt; 001101100011&lt;/td&gt;
&lt;td&gt;&lt;code&gt;j&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;001101&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;011101100011 &lt;strong&gt;001101&lt;/strong&gt; 100011&lt;/td&gt;
&lt;td&gt;&lt;code&gt;N&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;100011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;011101100011001101 &lt;strong&gt;100011&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;j&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No &lt;code&gt;0&lt;/code&gt;s were needed as bit padding this time since the number of bits (24) is divisible by 6, and no &lt;code&gt;=&lt;/code&gt; characters were needed either since the input is a whole multiple of 3 bytes.&lt;/p&gt;
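&lt;p&gt;The same result can be reproduced with Python's &lt;code&gt;base64&lt;/code&gt; module:&lt;/p&gt;

```python
import base64

# "v3c" is 3 bytes (24 bits): the bits divide evenly into four 6-bit groups,
# so neither padding bits nor "=" characters are required.
print(base64.b64encode(b"v3c"))  # b'djNj'
```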




&lt;p&gt;Now we should be able to understand when padding is required and when it isn't. Let's take a look at the completed table of &lt;code&gt;file4.txt&lt;/code&gt; (the Base64 representation of &lt;code&gt;file3.txt&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Raw binary of &lt;code&gt;file3.txt&lt;/code&gt; (4 bytes in total): &lt;code&gt;11111111001100100011001100001010&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bytes&lt;/th&gt;
&lt;th&gt;Bit positions&lt;/th&gt;
&lt;th&gt;Base64 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;111111&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;111111&lt;/strong&gt; 11001100100011001100001010&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;110011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;111111 &lt;strong&gt;110011&lt;/strong&gt; 00100011001100001010&lt;/td&gt;
&lt;td&gt;&lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;001000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;111111110011 &lt;strong&gt;001000&lt;/strong&gt; 11001100001010&lt;/td&gt;
&lt;td&gt;&lt;code&gt;I&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;110011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;111111110011001000 &lt;strong&gt;110011&lt;/strong&gt; 00001010&lt;/td&gt;
&lt;td&gt;&lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;000010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;111111110011001000110011 &lt;strong&gt;000010&lt;/strong&gt; 10&lt;/td&gt;
&lt;td&gt;&lt;code&gt;C&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;100000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;111111110011001000110011000010 &lt;strong&gt;10&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;g&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;padding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;none&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;=&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;padding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;none&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;=&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Since &lt;code&gt;file3.txt&lt;/code&gt; is 4 bytes (32 bits), and 32 is not divisible by 6, the last Base64 character required &lt;code&gt;0000&lt;/code&gt; as bit padding; and since 4 bytes is one more than a multiple of 3, the complete Base64 output required &lt;code&gt;==&lt;/code&gt; as character padding.&lt;/p&gt;
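&lt;p&gt;Python confirms the completed table. The raw binary above corresponds to the bytes &lt;code&gt;0xff&lt;/code&gt;, &lt;code&gt;2&lt;/code&gt;, &lt;code&gt;3&lt;/code&gt;, and a trailing newline:&lt;/p&gt;

```python
import base64

# file3.txt's four bytes: 0xff, "2", "3", and a newline
# (11111111 00110010 00110011 00001010).
data = b"\xff23\n"
print(base64.b64encode(data))  # b'/zIzCg=='
```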

&lt;p&gt;One last thing to be aware of is that &lt;code&gt;file4.txt&lt;/code&gt;, whose contents are &lt;code&gt;/zIzCg==&lt;/code&gt;, will be stored as UTF-8 (which is byte-for-byte identical to ASCII in this instance, since the Base64 alphabet is a subset of ASCII). Remember that Base64 isn't a character encoding! It's a binary-to-text encoding. It is the character encoding that determines the bytes actually stored on disk. One mistaken assumption I had while learning this was that the Base64 file would have the exact same bytes on disk as the original file (i.e., &lt;code&gt;file4.txt&lt;/code&gt; and &lt;code&gt;file3.txt&lt;/code&gt; would have the same bytes). However, this is not the case! Observe:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd &lt;span class="nt"&gt;-b&lt;/span&gt; file4.txt
00000000: 00101111 01111010 01001001 01111010 01000011 01100111  /zIzCg
00000006: 00111101 00111101 00001010                             &lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So Base64 took the underlying bits of &lt;code&gt;file3.txt&lt;/code&gt;, used its algorithm to map those to Base64 characters, and then wrote those characters to &lt;code&gt;file4.txt&lt;/code&gt; in UTF-8. If we created a new file and manually typed in &lt;code&gt;/zIzCg==&lt;/code&gt;, it would have the exact same binary representation. This is simply a UTF-8 encoding of text.&lt;/p&gt;
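&lt;p&gt;A quick sketch in Python makes the distinction concrete: the UTF-8 bytes of the text &lt;code&gt;/zIzCg==&lt;/code&gt; are not the bytes of &lt;code&gt;file3.txt&lt;/code&gt;, but decoding the Base64 recovers them:&lt;/p&gt;

```python
import base64

text = "/zIzCg=="              # the contents of file4.txt
print(text.encode("utf-8"))    # b'/zIzCg==' -> the bytes stored on disk
print(base64.b64decode(text))  # b'\xff23\n' -> the bytes of file3.txt
```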




&lt;h3&gt;
  
  
  What is Base64url?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Base64#URL_applications" rel="noopener noreferrer"&gt;Base64url&lt;/a&gt; is something that will occasionally show up. This is a variant on Base64 where &lt;code&gt;+&lt;/code&gt; and &lt;code&gt;/&lt;/code&gt; are replaced with &lt;code&gt;-&lt;/code&gt; and &lt;code&gt;_&lt;/code&gt; so that the output will be &lt;a href="https://en.wikipedia.org/wiki/Percent-encoding" rel="noopener noreferrer"&gt;URL-safe&lt;/a&gt;. &lt;code&gt;+&lt;/code&gt; and &lt;code&gt;/&lt;/code&gt; must be encoded in a URL (i.e., &lt;code&gt;+&lt;/code&gt; becomes &lt;code&gt;%2B&lt;/code&gt;, &lt;code&gt;/&lt;/code&gt; becomes &lt;code&gt;%2F&lt;/code&gt;), but &lt;code&gt;-&lt;/code&gt; and &lt;code&gt;_&lt;/code&gt; are considered safe.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;=&lt;/code&gt; is also not URL-safe, but there is no standardization on how to handle it. Some libraries will percent-encode it (&lt;code&gt;%3D&lt;/code&gt;) and some will encode it as a period (&lt;code&gt;.&lt;/code&gt;).&lt;/p&gt;
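&lt;p&gt;Python's standard library exposes both variants, which makes the character swap easy to see (reusing the bytes of &lt;code&gt;file3.txt&lt;/code&gt; from above):&lt;/p&gt;

```python
import base64

data = b"\xff23\n"
print(base64.b64encode(data))          # b'/zIzCg=='
print(base64.urlsafe_b64encode(data))  # b'_zIzCg==' ("/" became "_")
```

Note that Python's URL-safe variant keeps the `=` padding; handling it is left to the caller.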




&lt;h3&gt;
  
  
  Encoding vs. encryption
&lt;/h3&gt;

&lt;p&gt;For some reason people often mix these two terms up. I think the reason why, specifically when it involves Base64, is because of the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Authorization#Examples" rel="noopener noreferrer"&gt;HTTP &lt;code&gt;Authorization&lt;/code&gt; request header&lt;/a&gt; and &lt;a href="https://jwt.io/" rel="noopener noreferrer"&gt;JWTs&lt;/a&gt;. Both of these concepts are security-related and involve Base64 to transform plaintext into seemingly "scrambled" output. As a result, people mistakenly think Base64 encoding is the same thing as encryption.&lt;/p&gt;

&lt;p&gt;Well it's not.&lt;/p&gt;

&lt;p&gt;Encryption is the process of mathematically transforming plaintext into ciphertext (a bunch of gibberish) using a key (basically just a random number). Depending on the type of encryption used, the only way to transform ciphertext back to plaintext is with that same key (symmetric encryption) or with a different-but-mathematically-related key (asymmetric encryption). The only way to break encryption without the key is through brute force, which depending on the strength of encryption used, could take &lt;a href="https://www.thesslstore.com/blog/what-is-256-bit-encryption/" rel="noopener noreferrer"&gt;6.4 quadrillion years&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Encoding, in the binary-to-text sense, is the process of transforming bits into output that's human-printable. It's meant to be a trivially reversible process that anyone can perform. Even if a different encoding than Base64 were used, there is only a small number of encodings in common use. Trying every one of them by brute force would take a modern computer a handful of milliseconds.&lt;/p&gt;

&lt;p&gt;This of course implies that the HTTP &lt;code&gt;Authorization&lt;/code&gt; request header and JWTs do not provide any inherent data confidentiality. Not to say that they are useless, but just that encryption is not one of their benefits. Anyone who intercepts those pieces of data can simply decode the Base64 with ease (if they are technically savvy enough to sniff network traffic then the odds are pretty good they also know what Base64 is). Base64 is meant to ensure that you won't have to deal with binary data (i.e., bytes that the standard character encodings don't know how to interpret) or characters like &lt;code&gt;NUL&lt;/code&gt; or &lt;code&gt;EOF&lt;/code&gt;. It is often used in security-related concepts (such as the &lt;a href="https://en.wikipedia.org/wiki/Privacy-Enhanced_Mail" rel="noopener noreferrer"&gt;PEM format&lt;/a&gt; for example), but it is not itself a security technique!&lt;/p&gt;
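&lt;p&gt;As a sketch of how little protection Base64 offers on its own, here is a Basic auth header decoded in one line (the header value below is the classic documentation example pair, not a real credential):&lt;/p&gt;

```python
import base64

# "Basic" auth merely Base64-encodes "username:password"; no key is
# needed to read it back.
header = "Basic YWxhZGRpbjpvcGVuc2VzYW1l"
token = header.split(" ", 1)[1]
print(base64.b64decode(token))  # b'aladdin:opensesame'
```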




&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Python uses Unicode strings for encoding
&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Python 2 there is a class of string literals known as &lt;a href="https://docs.python.org/2/howto/unicode.html#encodings" rel="noopener noreferrer"&gt;unicode strings&lt;/a&gt;. They are denoted by prefixing the character &lt;code&gt;u&lt;/code&gt; to a string literal (e.g., &lt;code&gt;u'abc'&lt;/code&gt;). I am not a fan of the term "unicode string" because it invites the confusion that Unicode is an encoding. So what exactly does Python mean when it refers to unicode strings?&lt;/p&gt;

&lt;p&gt;Let's look at some examples in Python 2.7.12:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;u&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;u&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abcŔŖ&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="sa"&gt;u&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="sa"&gt;u&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="se"&gt;\u0154\u0156&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So we define two strings, &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;, which contain the same contents as &lt;code&gt;file1.txt&lt;/code&gt; and &lt;code&gt;file2.txt&lt;/code&gt; from &lt;a href="https://dev.to/neumaneuma/decoding-the-confusing-world-of-encodings-part-1-3oke"&gt;part 1&lt;/a&gt;. &lt;code&gt;a&lt;/code&gt; is displayed without an issue, but the &lt;code&gt;ŔŖ&lt;/code&gt; at the end of &lt;code&gt;b&lt;/code&gt; is not. Instead those characters are replaced with their Unicode code points: &lt;code&gt;\u0154&lt;/code&gt; (&lt;code&gt;U+0154&lt;/code&gt;) and &lt;code&gt;\u0156&lt;/code&gt; (&lt;code&gt;U+0156&lt;/code&gt;). The Python 2 interactive interpreter echoes a string's repr, which uses only ASCII characters and escapes everything else.&lt;/p&gt;

&lt;p&gt;Let's try explicitly encoding these strings:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ascii&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="se"&gt;\xc5\x94\xc5\x96&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ascii&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;Traceback &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;most&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;stdin&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nb"&gt;UnicodeEncodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ascii&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="n"&gt;codec&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t encode characters in position 3-4: ordinal not in range(128)


&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;String &lt;code&gt;a&lt;/code&gt; can be encoded using both ASCII and UTF-8 as expected. Also as expected, encoding string &lt;code&gt;b&lt;/code&gt; using ASCII results in an error since neither &lt;code&gt;Ŕ&lt;/code&gt; nor &lt;code&gt;Ŗ&lt;/code&gt; is in the ASCII character set. And encoding string &lt;code&gt;b&lt;/code&gt; using UTF-8 produces a byte string in which the ASCII characters appear as themselves and each non-ASCII character becomes two bytes, displayed as hex escapes: &lt;code&gt;\xc5\x94&lt;/code&gt; for &lt;code&gt;Ŕ&lt;/code&gt; and &lt;code&gt;\xc5\x96&lt;/code&gt; for &lt;code&gt;Ŗ&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A unicode string in Python 2, then, is a sequence of Unicode code points; its repr shows ASCII-compatible characters as themselves and escapes everything else as code points. What about Python 3? Python 3 got rid of the distinction between a regular string (e.g., &lt;code&gt;abc&lt;/code&gt;) and a unicode string (e.g., &lt;code&gt;u'abc'&lt;/code&gt;), and just has regular strings without any prefixes. Does this mean there are no unicode strings in Python 3?&lt;/p&gt;

&lt;p&gt;Let's find out using Python 3.5.2:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abcŔŖ&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abcŔŖ&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Python 3 treats every string as a unicode string, and on top of that, it can now print non-ASCII characters to the console. The &lt;code&gt;encode()&lt;/code&gt; function still works the same, except that it now explicitly returns a byte string (note the &lt;code&gt;b&lt;/code&gt; prefix):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="se"&gt;\xc5\x94\xc5\x96&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The only remaining question is how to print out the code points:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unicode_escape&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;u0154&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;u0156&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
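&lt;p&gt;In Python 3 the relationship between characters and code points can also be inspected directly with &lt;code&gt;ord()&lt;/code&gt; and &lt;code&gt;chr()&lt;/code&gt;:&lt;/p&gt;

```python
b = "abcŔŖ"

# Each character in a Python 3 string is a Unicode code point.
print([hex(ord(c)) for c in b])  # ['0x61', '0x62', '0x63', '0x154', '0x156']
print(chr(0x154))                # Ŕ
```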




&lt;p&gt;Now readers should have a good idea of what Base64 is and how it works, the difference between encoding and encryption, and what python means by unicode strings. That was a lot to get through! But that is indicative of the complexities and overloaded terms surrounding what an "encoding" is.&lt;/p&gt;

</description>
      <category>unicode</category>
      <category>utf8</category>
      <category>ascii</category>
      <category>base64</category>
    </item>
    <item>
      <title>Decoding the confusing world of encodings (Part 1)</title>
      <dc:creator>Bradley Neumaier</dc:creator>
      <pubDate>Wed, 08 May 2019 18:28:58 +0000</pubDate>
      <link>https://dev.to/neumaneuma/decoding-the-confusing-world-of-encodings-part-1-3oke</link>
      <guid>https://dev.to/neumaneuma/decoding-the-confusing-world-of-encodings-part-1-3oke</guid>
      <description>&lt;h1&gt;
  
  
  What is an encoding?
&lt;/h1&gt;

&lt;p&gt;Have you ever come across some of these statements?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This file is hex encoded&lt;/p&gt;

&lt;p&gt;This file uses an ASCII encoding&lt;/p&gt;

&lt;p&gt;This string is Unicode encoded&lt;/p&gt;

&lt;p&gt;Let's write the output to a UTF-8 encoded file&lt;/p&gt;

&lt;p&gt;Our message is safe because it's encoded using Base64&lt;/p&gt;

&lt;p&gt;Python uses Unicode strings for encoding&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These represent many of the ways the term "encode" is used across the industry. Frankly I found it all really confusing until I set out to write this post! I'm going to address each of these statements and attempt to define and disambiguate exactly what encoding means.&lt;/p&gt;




&lt;blockquote&gt;
&lt;h2&gt;
  
  
  This file is hex encoded
&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;A similar phrase to hex encoding is binary encoding. Personally I don't like the use of the term "encoding" here. Technically an argument could be made that the semantics are correct. However I prefer using the term "representation." It makes encoding less of an overloaded definition. Also, "representation" does a better job (in my mind at least) of describing what is actually happening.&lt;/p&gt;

&lt;p&gt;Hexadecimal (abbreviated as hex) and binary are both numeral systems. That's a fancy way of saying, "here's how to represent a number." If you step back and think about it, numbers are funny things. A number seems pretty straightforward, but it's actually an abstract concept. What is the number for how many fingers you have? You could say it's &lt;code&gt;00001010&lt;/code&gt;, &lt;code&gt;10&lt;/code&gt;, or &lt;code&gt;a&lt;/code&gt; and all three would be accurate! We learn to say &lt;code&gt;10&lt;/code&gt; because the easiest and most common numeral system for humans is decimal, also known as base-10. We have 10 fingers and 10 toes, so that makes learning how to count far more intuitive when we are infants.&lt;/p&gt;

&lt;p&gt;If we instead applied that ease-of-use criterion to computers we would get binary (or base-2). Why? Because computers fundamentally think of things as being &lt;a href="https://www.howtogeek.com/367621/what-is-binary-and-why-do-computers-use-it/" rel="noopener noreferrer"&gt;"on" or "off."&lt;/a&gt; Computers rely on electrons having either a positive charge or a negative charge to represent &lt;code&gt;1&lt;/code&gt;s and &lt;code&gt;0&lt;/code&gt;s. And it is with these &lt;code&gt;1&lt;/code&gt;s and &lt;code&gt;0&lt;/code&gt;s that the fundamentals of computing are accomplished, such as storing data or performing mathematical calculations.&lt;/p&gt;

&lt;p&gt;Great, so we can represent the same number in multiple ways. What use is that? Let's refer back to the number ten. We could represent it in binary (&lt;code&gt;00001010&lt;/code&gt;) or in hex (&lt;code&gt;a&lt;/code&gt;). It takes eight characters in binary (or four without the padding of &lt;code&gt;0&lt;/code&gt;s), but only one in hex! That's due to the number of symbols each system uses. Binary uses two: &lt;code&gt;0&lt;/code&gt; and &lt;code&gt;1&lt;/code&gt;. Hex uses 16: &lt;code&gt;0&lt;/code&gt;-&lt;code&gt;9&lt;/code&gt; and &lt;code&gt;a&lt;/code&gt;-&lt;code&gt;f&lt;/code&gt;. The difference in representation size is stark even for the number ten, and it grows significantly more unequal for larger numbers. So the advantage is that hex can represent large numbers much more efficiently than binary (and more efficiently than decimal too for that matter).&lt;/p&gt;
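&lt;p&gt;Python's built-ins make it easy to move between these numeral systems:&lt;/p&gt;

```python
n = 10  # the number of fingers you have

print(bin(n))          # 0b1010
print(hex(n))          # 0xa
print(int("1010", 2))  # 10
print(int("a", 16))    # 10
```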

&lt;p&gt;Let's explore how to turn this theory into practical knowledge. To provide some examples for this post I created two files via the command line: &lt;code&gt;file1.txt&lt;/code&gt; and &lt;code&gt;file2.txt&lt;/code&gt;. Here are their contents outputted:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;file1.txt
abc


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;file2.txt
abcŔŖ


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Don't worry about the unfamiliar &lt;code&gt;Ŕ&lt;/code&gt; and &lt;code&gt;Ŗ&lt;/code&gt; characters at the end of &lt;code&gt;file2.txt&lt;/code&gt;. I'll go over those details in-depth in the UTF-8 and Unicode sections. For now I will just show the binary and hex representations of each file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd &lt;span class="nt"&gt;-b&lt;/span&gt; file1.txt &lt;span class="c"&gt;# binary&lt;/span&gt;
00000000: 01100001 01100010 01100011 00001010                    abc.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd file1.txt &lt;span class="c"&gt;# hex&lt;/span&gt;
00000000: 6162 630a                                abc.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd &lt;span class="nt"&gt;-b&lt;/span&gt; file2.txt &lt;span class="c"&gt;# binary&lt;/span&gt;
00000000: 01100001 01100010 01100011 11000101 10010100 11000101  abc...
00000006: 10010110 00001010                                      ..


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd file2.txt &lt;span class="c"&gt;# hex&lt;/span&gt;
00000000: 6162 63c5 94c5 960a                      abc.....


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Again we see the compactness of hex on display. &lt;code&gt;file1.txt&lt;/code&gt; requires 32 characters to represent in binary, but only 8 in hex. &lt;code&gt;file2.txt&lt;/code&gt; requires 64 characters to represent in binary, but only 16 in hex. If we were to use a &lt;a href="https://www.mathsisfun.com/binary-decimal-hexadecimal-converter.html" rel="noopener noreferrer"&gt;hex to binary converter&lt;/a&gt; we can see how these representations line up with one another.&lt;/p&gt;

&lt;p&gt;Let's dissect &lt;code&gt;file1.txt&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Binary&lt;/th&gt;
&lt;th&gt;Hexadecimal&lt;/th&gt;
&lt;th&gt;Decimal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01100001&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;61&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;97&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01100010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;62&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;98&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01100011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;63&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;99&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;00001010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As mentioned above, binary is the numeral system that computers "understand." The binary representation of these two files is literally how they are stored on the computer, as bits (&lt;code&gt;1&lt;/code&gt;s and &lt;code&gt;0&lt;/code&gt;s). The hex and decimal representations are just different ways of representing those same bits. We can see that every byte in binary (1 byte is equal to 8 bits) lines up with 2 hex characters. And we can see what those same values would be if they were represented in decimal. For reference, the largest 1 byte binary value is &lt;code&gt;11111111&lt;/code&gt;, which is &lt;code&gt;ff&lt;/code&gt; in hex and &lt;code&gt;255&lt;/code&gt; in decimal. The smallest 1 byte binary value is &lt;code&gt;00000000&lt;/code&gt;, which is &lt;code&gt;00&lt;/code&gt; in hex and &lt;code&gt;0&lt;/code&gt; in decimal. But even armed with this understanding of hex and binary, there's still a lot of ground to cover. How does all this relate to the contents of &lt;code&gt;file1.txt&lt;/code&gt;?&lt;/p&gt;
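&lt;p&gt;We can reproduce this table in Python by iterating over the raw bytes of &lt;code&gt;file1.txt&lt;/code&gt;:&lt;/p&gt;

```python
data = b"abc\n"  # the contents of file1.txt

print(list(data))              # [97, 98, 99, 10]  (decimal values)
print([hex(v) for v in data])  # ['0x61', '0x62', '0x63', '0xa']  (hex)
print(data.hex())              # 6162630a
```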

&lt;blockquote&gt;
&lt;h2&gt;
  
  
  This file uses an ASCII encoding
&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;Remember that these binary, hex, and decimal representations are all of the same number. But we're not storing a number! We're storing &lt;code&gt;abc&lt;/code&gt;. The problem is that computers have no concept of letters. They only understand numbers. So we need a way to say to the computer, "I want this character to translate to number X, this next character to translate to number Y, etc..." Enter ASCII.&lt;/p&gt;

&lt;p&gt;Back in the day, ASCII was more or less the de facto standard for encoding text written using the English alphabet. It assigns a numeric value to all 26 lowercase letters, all 26 uppercase letters, punctuation, symbols, and even the digits 0-9. Here is a picture of the ASCII table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyoyoret9crloksu3ycfn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyoyoret9crloksu3ycfn.jpg" alt="asciitable"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the mapping of &lt;code&gt;file1.txt&lt;/code&gt;'s hex values to their ASCII characters using the ASCII table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hexadecimal&lt;/th&gt;
&lt;th&gt;ASCII&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;61&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;a&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;62&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;b&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;63&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;c&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;LF&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We can see &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt;, and &lt;code&gt;c&lt;/code&gt; there just as we would expect. What is that &lt;code&gt;LF&lt;/code&gt; doing there at the end though? &lt;code&gt;LF&lt;/code&gt; is a newline character in Unix (standing for "line feed"). I pressed the &lt;code&gt;Return&lt;/code&gt; key when editing &lt;code&gt;file1.txt&lt;/code&gt;, so that added a newline.&lt;/p&gt;

&lt;p&gt;Any character in the ASCII character set requires only 1 byte to store. ASCII supports 128 characters, as we saw in the ASCII table. However, 1 byte allows for 256 (or 2&lt;sup&gt;8&lt;/sup&gt;) values to be represented. In decimal that would be &lt;code&gt;0&lt;/code&gt; (&lt;code&gt;00000000&lt;/code&gt; in binary) through &lt;code&gt;255&lt;/code&gt; (&lt;code&gt;11111111&lt;/code&gt; in binary). Shouldn't that mean ASCII could support 128 more characters? It could, but 128 characters were all that English text and its accompanying symbols required, so presumably that was all that was taken into account when the ASCII standard was formalized. As a result, ASCII only uses 7 of the 8 bits in a byte, leaving half of the possible values unused.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/" rel="noopener noreferrer"&gt;Joel Spolsky&lt;/a&gt; wrote an excellent blog post on this problem. Basically the issue was fragmentation. Everyone agreed what the first 128 values should map to, but then everyone went and decided their own usage for the remaining 128 values. As a result there was no consistency among different locales.&lt;/p&gt;

&lt;p&gt;Let's review what we've learned so far. We saw that the computer encodes the string &lt;code&gt;abc&lt;/code&gt; into numbers (which are stored as bits). We can view those bits in binary, just as the computer stores them, or through a different representation such as hex. &lt;code&gt;a&lt;/code&gt; becomes &lt;code&gt;97&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt; becomes &lt;code&gt;98&lt;/code&gt;, &lt;code&gt;c&lt;/code&gt; becomes &lt;code&gt;99&lt;/code&gt;, and the newline character in Unix is &lt;code&gt;10&lt;/code&gt;. ASCII is just a way to map bits (that computers understand) to characters (that humans understand).&lt;/p&gt;
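&lt;p&gt;That whole round trip can be reproduced in one line of Python 3 (again, my choice for illustration), and it matches the &lt;code&gt;xxd&lt;/code&gt; output byte for byte:&lt;/p&gt;

```python
# Encoding "abc\n" as ASCII yields exactly the numbers described above.
data = "abc\n".encode("ascii")
print(list(data))   # [97, 98, 99, 10]
print(data.hex())   # 6162630a
```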

&lt;p&gt;ASCII leaves a gaping issue though. There are a lot more than 128 characters in use! What do we do about characters from other languages? Other random symbols? Emojis???&lt;/p&gt;

&lt;blockquote&gt;
&lt;h2&gt;
  
  
  This string is Unicode encoded
&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;As anglocentric as programming is in 2019, English is not the only language that needs to be supported on the web. ASCII is fine for encoding English, but it is incapable of supporting anything else. This is where Unicode enters the fray. Unicode is not an encoding. That point bears repeating. Unicode is &lt;em&gt;not&lt;/em&gt; an encoding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Unicode" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt; calls it a standard that can be implemented by different character encodings. I find that definition, while succinct, too abstract. Instead, I prefer to think of it like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Imagine you have a giant alphabet. It can support over 1 million characters. It is a superset of every language known to humankind. It can support made-up languages. It contains every bizarre symbol you can think of. It has emojis. And all that only fills about 15% of its character set. There is space for much more to be added. However, it's impractical to have a keyboard that has button combinations for over 1 million different characters. The keyboard I'm using right now has 47 buttons dedicated to typeable characters. With the &lt;code&gt;Shift&lt;/code&gt; key that number is doubled. That's nowhere close to 1 million though. There needs to be some way to use the characters in this alphabet!&lt;/p&gt;

&lt;p&gt;In order to make this alphabet usable we're going to put it in a giant dictionary.  A normal dictionary would map words to their respective definitions. In this special dictionary we'll have numbers mapping to all these characters. So to produce the character you want, you will type the corresponding number for it. And then it will be someone else's job to replace those numbers with the characters that they map to in the dictionary. Just as the words are in alphabetical order, the numbers will be in ascending order. And for the characters not yet filled in, we'll just have a blank entry next to the unused numbers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is Unicode in a nutshell. It's a dictionary that supports an alphabet of over &lt;a href="https://stackoverflow.com/questions/27415935/does-unicode-have-a-defined-maximum-number-of-code-points#27416004" rel="noopener noreferrer"&gt;1.1 million characters&lt;/a&gt;. It does so through an abstraction called a code point. Every character has a &lt;a href="https://unicode-table.com/en/" rel="noopener noreferrer"&gt;unique code point&lt;/a&gt;. For example, &lt;code&gt;a&lt;/code&gt; has a code point of &lt;code&gt;U+0061&lt;/code&gt;. &lt;code&gt;b&lt;/code&gt; has a code point of &lt;code&gt;U+0062&lt;/code&gt;. And &lt;code&gt;c&lt;/code&gt; has a code point of &lt;code&gt;U+0063&lt;/code&gt;. Notice a pattern? &lt;code&gt;61&lt;/code&gt; is the hex value for the character &lt;code&gt;a&lt;/code&gt; in ASCII, and &lt;code&gt;U+0061&lt;/code&gt; is the code point for &lt;code&gt;a&lt;/code&gt; in Unicode. I'll come back to this point in the UTF-8 section.&lt;/p&gt;
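&lt;p&gt;You can look up these code points yourself with Python 3's built-in &lt;code&gt;ord()&lt;/code&gt;, which returns a character's Unicode code point; formatting it as four hex digits reproduces the &lt;code&gt;U+XXXX&lt;/code&gt; notation:&lt;/p&gt;

```python
# ord() gives the Unicode code point of a character.
for ch in "abc":
    print(ch, f"U+{ord(ch):04X}")
# a U+0061
# b U+0062
# c U+0063
```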

&lt;p&gt;The structure of a code point is as follows: &lt;code&gt;U+&lt;/code&gt; followed by a hex string. The smallest that hex string could be is &lt;code&gt;0000&lt;/code&gt; and the largest is &lt;code&gt;10FFFF&lt;/code&gt;. So &lt;code&gt;U+0000&lt;/code&gt; is the smallest code point (representing the &lt;code&gt;Null&lt;/code&gt; character) and &lt;code&gt;U+10FFFF&lt;/code&gt; is the largest code point (currently unassigned). As of &lt;a href="http://www.unicode.org/versions/Unicode12.0.0/" rel="noopener noreferrer"&gt;Unicode 12.0.0&lt;/a&gt; there are almost 138,000 code points in use, meaning slightly under 1 million remain. I think it's safe to say we won't be running out anytime soon.&lt;/p&gt;
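&lt;p&gt;Python 3 exposes both ends of this range directly, which makes for a nice sketch of the boundaries (illustrative only; the post's own demos use the shell):&lt;/p&gt;

```python
import sys

# The largest legal code point, U+10FFFF:
print(hex(sys.maxunicode))  # 0x10ffff

# chr() maps a code point back to its character...
print(chr(0x61))  # a

# ...and refuses anything beyond U+10FFFF:
try:
    chr(0x110000)
except ValueError:
    print("0x110000 is past the largest code point")
```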

&lt;p&gt;ASCII can map bits on a computer to the English alphabet, but it wouldn't know what to do with Unicode. So we need a character encoding that can map bits on a computer to Unicode code points (which in turn maps to a giant alphabet). This is where UTF-8 comes into play.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Let's write the output to a UTF-8 encoded file
&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;UTF-8 is one of several encodings that support Unicode. In fact, the UTF in UTF-8 stands for Unicode Transformation Format. You may have heard of some of the others: UTF-16 LE, UTF-16 BE, UTF-32, UCS-2, UTF-7, etc. I'm going to ignore the rest of these, though. Why? Because UTF-8 is by far the dominant encoding of the group: it is backwards compatible with ASCII, and according to &lt;a href="https://en.wikipedia.org/wiki/UTF-8" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt; it accounts for over 90% of all web page encodings.&lt;/p&gt;

&lt;p&gt;UTF-8 uses different byte sizes depending on what code point is being referenced. This is the feature that allows it to maintain backwards compatibility with ASCII.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3os3g91qpzc1wqylz9b.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3os3g91qpzc1wqylz9b.JPG" alt="UTF8"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;sup&gt;Source: Wikipedia&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;If UTF-8 encounters a byte that starts with &lt;code&gt;0&lt;/code&gt;, it knows it found a starting byte and that the character is only one byte long. If it encounters a byte that starts with &lt;code&gt;110&lt;/code&gt;, it knows it found a starting byte and to look for two bytes in total. For three bytes the prefix is &lt;code&gt;1110&lt;/code&gt;, and for four bytes it is &lt;code&gt;11110&lt;/code&gt;. All continuation bytes (i.e., the non-starting bytes: bytes 2, 3, or 4) start with &lt;code&gt;10&lt;/code&gt;. The &lt;a href="https://www.quora.com/Why-do-subsequent-bytes-in-UTF-8-need-to-start-with-10-when-the-first-byte-already-contains-the-information-on-how-many-bytes-in-total-are-used" rel="noopener noreferrer"&gt;reason for these continuation bytes&lt;/a&gt; is that they make it easy to find the starting byte of a character from any position in the stream.&lt;/p&gt;
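&lt;p&gt;The leading-bit rule is simple enough to write down as a few shift-and-compare checks. Here's a hypothetical Python 3 sketch of the classification described above -- not a full decoder, just the byte-kind logic:&lt;/p&gt;

```python
# Classify a single byte of a UTF-8 stream by its leading bits.
def byte_kind(b: int) -> str:
    if b >> 7 == 0b0:
        return "start of 1-byte character"
    if b >> 5 == 0b110:
        return "start of 2-byte character"
    if b >> 4 == 0b1110:
        return "start of 3-byte character"
    if b >> 3 == 0b11110:
        return "start of 4-byte character"
    if b >> 6 == 0b10:
        return "continuation byte"
    return "invalid in UTF-8"

# Walk the bytes of "aŔ": one 1-byte character, then a 2-byte one.
for b in "aŔ".encode("utf-8"):
    print(format(b, "08b"), byte_kind(b))
```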

&lt;p&gt;As a refresher, this is what &lt;code&gt;file2.txt&lt;/code&gt; looks like on the command line:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;file2.txt
abcŔŖ


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd &lt;span class="nt"&gt;-b&lt;/span&gt; file2.txt &lt;span class="c"&gt;# binary&lt;/span&gt;
00000000: 01100001 01100010 01100011 11000101 10010100 11000101  abc...
00000006: 10010110 00001010                                      ..


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd file2.txt &lt;span class="c"&gt;# hex&lt;/span&gt;
00000000: 6162 63c5 94c5 960a                      abc.....


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Let's dissect &lt;code&gt;file2.txt&lt;/code&gt; to understand how UTF-8 works:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hexadecimal&lt;/th&gt;
&lt;th&gt;UTF-8&lt;/th&gt;
&lt;th&gt;Unicode Code Point&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;61&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;U+0061&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;62&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;U+0062&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;63&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;c&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;U+0063&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;c594&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Ŕ&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;U+0154&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;c596&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Ŗ&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;U+0156&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;LF&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;U+000A&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We can see that the hex representations for &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt;, &lt;code&gt;c&lt;/code&gt;, and &lt;code&gt;LF&lt;/code&gt; are the same as for &lt;code&gt;file1.txt&lt;/code&gt;, and that they align perfectly with their respective code points. The hex representations for &lt;code&gt;Ŕ&lt;/code&gt; and &lt;code&gt;Ŗ&lt;/code&gt; are twice as long as the other hex representations though. This means that they require 2 bytes to store instead of 1 byte.&lt;/p&gt;
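&lt;p&gt;You can reproduce the table above by encoding each character individually -- a minimal Python 3 sketch (my language choice for illustration):&lt;/p&gt;

```python
# Re-encode each character from file2.txt and compare with the table.
for ch in "abcŔŖ":
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(), f"{len(encoded)} byte(s)")
```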

&lt;p&gt;Here is a table showing the different representations and the type of byte side-by-side:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Byte type&lt;/th&gt;
&lt;th&gt;Binary&lt;/th&gt;
&lt;th&gt;Hexadecimal&lt;/th&gt;
&lt;th&gt;Decimal&lt;/th&gt;
&lt;th&gt;UTF-8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Starting Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;01100001&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;61&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;97&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;a&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Starting Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;01100010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;62&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;98&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;b&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Starting Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;01100011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;63&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;99&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;c&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Starting Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;11000101&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;c5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;197&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Ŕ&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Continuation Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10010100&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;94&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;148&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Ŕ&lt;/code&gt; (contd.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Starting Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;11000101&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;c5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;197&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Ŗ&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Continuation Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10010110&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;96&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;150&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Ŗ&lt;/code&gt; (contd.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Starting Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;00001010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;LF&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;UTF-8 uses 1 byte to encode ASCII characters and multiple bytes to encode non-ASCII characters. To be precise, it uses 7 bits to encode ASCII characters, exactly like ASCII does. Every byte on disk that maps to an ASCII character maps to the exact same character in UTF-8; any code point outside that range simply uses additional bytes.&lt;/p&gt;

&lt;p&gt;As I alluded to earlier, the code points for &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt;, and &lt;code&gt;c&lt;/code&gt; match up exactly with the hex representations of those letters in ASCII. I suppose the designers of Unicode did this in the hopes that it would make backwards compatibility with ASCII easier. UTF-8 made full use of this: its first 128 characters require one byte to encode. Despite having room for 128 more values in its first byte, UTF-8 requires its 129th character to use 2 bytes (it has to -- single bytes with the high bit set are reserved for the starting and continuation patterns described above). &lt;a href="https://unicode-table.com/en/007F/" rel="noopener noreferrer"&gt;&lt;code&gt;DEL&lt;/code&gt;&lt;/a&gt; is the 128th character (#127 on the page because the table starts at 0) and has the hex representation &lt;code&gt;7F&lt;/code&gt;, totalling 1 byte. &lt;a href="https://unicode-table.com/en/0080/" rel="noopener noreferrer"&gt;&lt;code&gt;XXX&lt;/code&gt;&lt;/a&gt; (no, not the character for porn) is the 129th character and has the hex representation &lt;code&gt;C280&lt;/code&gt;, totalling 2 bytes.&lt;/p&gt;
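&lt;p&gt;The 127/128 boundary is easy to verify in Python 3 (illustrative sketch, not from the original post):&lt;/p&gt;

```python
# ASCII text produces identical bytes under both encodings:
assert "abc".encode("ascii") == "abc".encode("utf-8")

# The boundary sits exactly at code point 128: U+007F (DEL) still
# fits in one byte, while U+0080 needs two.
print(chr(0x7F).encode("utf-8").hex())  # 7f
print(chr(0x80).encode("utf-8").hex())  # c280
```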

&lt;p&gt;If you're curious here are examples of characters requiring over 2 bytes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 bytes: &lt;a href="https://unicode-table.com/en/3688/" rel="noopener noreferrer"&gt;&lt;code&gt;㚈&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;4 bytes: &lt;a href="https://unicode-table.com/en/1F701/" rel="noopener noreferrer"&gt;&lt;code&gt;🜁&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just to re-emphasize what is happening here: UTF-8 maps bytes on disk to a code point. That code point maps to a character in Unicode. A different encoding, like UTF-32 for example, would map those same bytes to a completely different code point. Or perhaps it wouldn't even have a mapping from those bytes to a valid code point. The point is that a series of bytes could be interpreted in totally different ways depending on the encoding.&lt;/p&gt;
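&lt;p&gt;To make that concrete, here is a small Python 3 sketch decoding the same two bytes under two different encodings (I picked Windows-1252, a common single-byte encoding, as the counterexample):&lt;/p&gt;

```python
# The same two bytes mean different things under different encodings.
raw = bytes([0xC5, 0x94])
print(raw.decode("utf-8"))   # Ŕ  -- one 2-byte character
print(raw.decode("cp1252"))  # Å” -- two 1-byte characters
```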




&lt;p&gt;That's it for part 1. We covered numeral systems like hex and binary (which I like to call representations instead of encodings), different character encodings such as ASCII and UTF-8, and what Unicode is (and why it's &lt;em&gt;not&lt;/em&gt; an encoding). In &lt;a href="https://dev.to/neumaneuma/decoding-the-confusing-world-of-encodings-part-2-4lo"&gt;part 2&lt;/a&gt; we'll address the remaining points and hopefully clear up the confusion surrounding the term "encoding."&lt;/p&gt;

</description>
      <category>unicode</category>
      <category>utf8</category>
      <category>ascii</category>
      <category>base64</category>
    </item>
  </channel>
</rss>
