<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bradley Neumaier</title>
    <description>The latest articles on DEV Community by Bradley Neumaier (@neumaneuma).</description>
    <link>https://dev.to/neumaneuma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F152560%2F194421fe-8ee9-4c84-998c-3cd25fbf3635.jpeg</url>
      <title>DEV Community: Bradley Neumaier</title>
      <link>https://dev.to/neumaneuma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/neumaneuma"/>
    <language>en</language>
    <item>
      <title>On working incrementally</title>
      <dc:creator>Bradley Neumaier</dc:creator>
      <pubDate>Tue, 21 Jul 2020 15:20:47 +0000</pubDate>
      <link>https://dev.to/neumaneuma/on-working-incrementally-3227</link>
      <guid>https://dev.to/neumaneuma/on-working-incrementally-3227</guid>
      <description>&lt;p&gt;I have learned a few different things through trial and error over the course of my software developer career. The most prominent of which is how to work incrementally. I got this phrase from &lt;a href="https://www.educba.com/agile-frameworks/"&gt;agile&lt;/a&gt; though my use of it here has nothing to do with that methodology. What I am referring to is the mindset and workflow that I use when developing.&lt;/p&gt;

&lt;p&gt;Before defining what I mean in abstract terms, allow me to present an example. Suppose you have a major refactoring task. Several hours in, you try running the code for the first time and find that it doesn’t even build. Frowning, you dive into the compiler errors and emerge with buildable code an hour later. At this point you finish working for the day, thinking that all is well. The next day you log several more hours of refactoring, fix all the compiler errors, and finally finish the refactor. The code is now, of course, a perfection the likes of which no one has ever seen before nor will ever see again. You commit your code into git and sit back, thinking that all is right with the world. Belatedly, you realize you haven’t run any unit tests yet. Doing so produces more red than green (i.e., you’ve broken so many tests that more fail now than pass). Panic begins to set in. You frantically look through the errors and try to debug the problem, but now you’re dealing with tests and code you had no part in writing, and it’s not as easy to figure out how to fix them. It’s the end of the day and you haven’t made any progress on the unit tests. You spend the entire next day attempting to fix them, with limited success. At this point you begin to wonder whether it wouldn’t be easier to start over from scratch.&lt;/p&gt;

&lt;p&gt;Ever been in a similar situation before? I have, on more than one occasion. If my example didn’t convey a sense of dread, then let me say outright that being in that situation sucks. The solution is, as the title and intro suggested, to work incrementally. Instead of waiting until several hours have elapsed to try building the code, build it frequently. Instead of waiting two days to run the existing unit tests, run them frequently. And instead of waiting until you’re finished to commit your code, commit frequently. TDD advocates a similar approach of building and running unit tests often, but what I’m talking about is more generic and also relies heavily on version control (specifically git, though by no means does it have to be).&lt;/p&gt;

&lt;p&gt;My approach to a giant refactoring task involves a frequent cycle of building (assuming I’m working with a compiled language), running the unit tests, and committing my code. In particular, I make sure my commits are atomic. This means that the code in each of my commits does one thing only (sort of like the &lt;a href="https://en.m.wikipedia.org/wiki/Single-responsibility_principle"&gt;single responsibility principle&lt;/a&gt; for git commits). Much like SRP, the granularity can be taken to an excessive degree, but what I generally mean is that I limit myself to, for example, refactoring one class at a time, or even one method at a time. Note that I don’t mean you should commit one file at a time, either. If refactoring a class requires changing 20 files, that is totally fine. Atomic commits are not the same as small commits. With respect to size, an atomic commit changes only one thing at a time, which implies the change will be as small as possible but doesn’t make smallness a goal.&lt;/p&gt;

&lt;p&gt;Atomic commits also mean that I could check out any commit in my git history and have code in a working state: it builds and all the unit tests pass. Sometimes, however, it isn’t practical to have every unit test passing at all times. You might go an entire day refactoring and still not have code that passes every test, simply because of the nature of the refactor. It is still useful to commit frequently in these instances. These commits are called work-in-progress commits. I prefix the commit title with &lt;code&gt;WIP&lt;/code&gt; to differentiate them from regular commits. I also don’t push them to the branch I’m working on: either don’t push them at all, or push them to a temp branch. For example, I might make a work-in-progress commit after every several tests I fix. The important thing is that once I get all the tests back into a passing state, I &lt;a href="https://linuxhint.com/how-to-squash-git-commits/"&gt;squash&lt;/a&gt; all those work-in-progress commits into one atomic commit.&lt;/p&gt;
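&lt;p&gt;A toy run of that WIP-then-squash cycle might look like this (scratch repo; the file names and commit messages are invented, and &lt;code&gt;git reset --soft&lt;/code&gt; is just one of several ways to squash):&lt;/p&gt;

```shell
# Toy demonstration of WIP commits followed by a squash. Everything here
# (messages, file names) is made up for illustration.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email "dev@example.com" && git config user.name "Dev"
git commit -q --allow-empty -m "base: refactor starting point"

echo "fix tests 1-5" > notes.txt && git add notes.txt
git commit -qm "WIP: parser tests 1-5 passing again"
echo "fix tests 6-12" >> notes.txt
git commit -qam "WIP: parser tests 6-12 passing again"

# All tests pass again: collapse the two WIP commits into one atomic commit.
git reset -q --soft HEAD~2
git commit -qm "Refactor parser; all unit tests passing"
git log --oneline   # history is now the base commit plus one atomic commit
```

&lt;p&gt;An interactive rebase (&lt;code&gt;git rebase -i&lt;/code&gt;) achieves the same squash; &lt;code&gt;git reset --soft&lt;/code&gt; is simply the easiest to show non-interactively.&lt;/p&gt;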

&lt;p&gt;Being this fastidious might seem like major overkill, but believe me when I say that it is a tremendous help when you need to check out or revert an old commit. The person in my example scenario was contemplating reverting the entirety of their changes. Wouldn’t it be far preferable to only have to reset to the last commit? This approach also makes finding where a bug was introduced much easier. Start from a commit where you know the bug didn’t exist, and work your way up, commit by commit, checking whether the bug was introduced there. Once you find the offending commit, there is a drastically smaller diff to pore over than if you were going into it blind.&lt;/p&gt;

&lt;p&gt;I’ve even been in a non-coding situation where I used this “working incrementally” mindset. I was helping a friend set up some furniture and we had to use a tool we were unfamiliar with. I approached the problem methodically, seeing how the tool worked on different surfaces and limiting myself to changing only one variable at a time in my “unit tests.” In a sense, what I’m describing is just the scientific method.&lt;/p&gt;

&lt;p&gt;Following my “working incrementally” guidelines might not make much sense to you if you’re still new in your programming career. If someone else had written this blog post and I had tried to read it in college, I probably would’ve fallen asleep midway through. It also might feel like a huge pain to constantly run unit tests and commit your code. To which I would respond that putting forth some effort upfront is worth it to save yourself a whole lot of pain down the road.&lt;/p&gt;

&lt;p&gt;For additional resources on atomic commits I would recommend &lt;a href="https://www.freshconsulting.com/atomic-commits/"&gt;this&lt;/a&gt; and &lt;a href="https://curiousprogrammer.dev/blog/how-to-craft-your-changes-into-small-atomic-commits-using-git/"&gt;this&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>tdd</category>
      <category>git</category>
      <category>atomiccommits</category>
    </item>
    <item>
      <title>Why I hate refactoring</title>
      <dc:creator>Bradley Neumaier</dc:creator>
      <pubDate>Mon, 30 Dec 2019 01:54:09 +0000</pubDate>
      <link>https://dev.to/neumaneuma/why-i-hate-refactoring-kaj</link>
      <guid>https://dev.to/neumaneuma/why-i-hate-refactoring-kaj</guid>
      <description>&lt;p&gt;Okay, I admit it. This was a clickbait title. So sue me. I don't actually hate refactoring. What I really hate, and what will be the topic of this post, is being lazy (particularly when it applies to refactoring).&lt;/p&gt;

&lt;p&gt;How does this phenomenon arise? Well, it's pretty simple. We as developers get lazy. For example, suppose someone points out in a code review that we could DRY some code and thereby remove duplicated effort. We respond by saying there's no time for that now and that we'll add it to the technical debt in our backlog. Or we settle for a few TODO comments instead of doing the due diligence to write clean code. It will get done eventually, after all.&lt;/p&gt;

&lt;p&gt;Wrong! These things rarely get done in practice once they've been relegated to the purgatory known as "to be done later." There's always more important work to be done than humble ol' refactoring. I've worked for two different employers since graduating college, and I honestly cannot think of a single instance where cleaning up technical debt and refactoring code actually got prioritized in a sprint.&lt;/p&gt;

&lt;p&gt;And to be fair, I'm not arguing that refactoring is always more important than writing new features. I'm not even arguing that it's more important most of the time. At best you could say that, occasionally, the benefits probably outweigh the costs for some specific instance. Which, as the clever reader that you are will surmise, is hardly a standing ovation for refactoring. So what is the point I'm trying to make?&lt;/p&gt;

&lt;p&gt;Stop being lazy! Stop procrastinating! The correct answer to the first hypothetical scenario from above should have been, "I'll DRY the code in my upcoming code review," and the correct way to handle the second scenario is to not leave TODOs littered in your code unless absolutely necessary.&lt;/p&gt;

&lt;p&gt;Legacy TODOs are so obnoxious. No one on the team knows what they mean anymore, so everyone is afraid to remove them in case they could still be useful. What you end up with is developers trained to ignore TODOs, which is not a good habit to ingrain.&lt;/p&gt;

&lt;p&gt;Similarly, technical debt cleanup in the backlog is always relegated to being a second class citizen. It's very hard to justify "improving code quality" over "delivering features that earn money."&lt;/p&gt;

&lt;p&gt;Fortunately there's a simple, albeit effortful, solution to all this. Take pride in writing clean code. Do your due diligence to ensure you follow the &lt;a href="https://medium.com/@biratkirat/step-8-the-boy-scout-rule-robert-c-martin-uncle-bob-9ac839778385"&gt;&lt;del&gt;boy&lt;/del&gt; person scout rule&lt;/a&gt;. Taking an extra hour in the sprint to refactor code to be maintainable could mean not having to spend 15 hours a year from now refactoring the bigger mess your procrastination induced. Or perhaps, as is more likely, you never get around to refactoring it, and any work involving that gross, legacy part of the codebase just becomes more and more difficult as time goes on.&lt;/p&gt;

</description>
      <category>refactoring</category>
      <category>todo</category>
      <category>lazy</category>
    </item>
    <item>
      <title>Decoding the confusing world of encodings (Part 2)</title>
      <dc:creator>Bradley Neumaier</dc:creator>
      <pubDate>Thu, 23 May 2019 00:54:07 +0000</pubDate>
      <link>https://dev.to/neumaneuma/decoding-the-confusing-world-of-encodings-part-2-4lo</link>
      <guid>https://dev.to/neumaneuma/decoding-the-confusing-world-of-encodings-part-2-4lo</guid>
<description>&lt;h1&gt;What is an encoding? Part 2&lt;/h1&gt;

&lt;p&gt;In &lt;a href="https://dev.to/neumaneuma/decoding-the-confusing-world-of-encodings-part-1-3oke"&gt;part 1&lt;/a&gt; we demystified the following ways the term "encoding" is used:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This file is hex encoded&lt;/p&gt;

&lt;p&gt;This file uses an ASCII encoding&lt;/p&gt;

&lt;p&gt;This string is Unicode encoded&lt;/p&gt;

&lt;p&gt;Let's write the output to a UTF-8 encoded file&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In part 2 we'll address the remaining ways "encoding" could be used:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Our message is safe because it's encoded using Base64&lt;/p&gt;

&lt;p&gt;Python uses Unicode strings for encoding&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;h2&gt;Our message is safe because it's encoded using Base64&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;This statement deals with several different concepts. I'll start by going over the types of encoding.&lt;/p&gt;

&lt;p&gt;As best I can tell, there are two categories of encoding: &lt;a href="https://en.wikipedia.org/wiki/Character_encoding" rel="noopener noreferrer"&gt;character encodings&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Binary-to-text_encoding" rel="noopener noreferrer"&gt;binary-to-text encodings&lt;/a&gt;. ASCII and UTF-8 are examples of character encodings. Base64 is an example of a binary-to-text encoding.&lt;/p&gt;

&lt;p&gt;What's the difference? Both share the goal of turning bits into characters. A character encoding's job is to represent text: it maps bits to the full range of characters that humans read and write. A binary-to-text encoding's job is to represent arbitrary binary data using only printable characters, so that any byte sequence can travel safely through channels designed for text.&lt;/p&gt;

&lt;p&gt;Wait, what? That was a nebulous distinction you say? Okay, let me try to explain it in a different way. A character encoding like ASCII is really good for data storage and transmission. For example, say you're writing a speech. You want to save it on your computer so you don't have to re-type it every time. The computer stores that speech as a bunch of &lt;code&gt;1&lt;/code&gt;s and &lt;code&gt;0&lt;/code&gt;s. ASCII is needed to translate those bits back into the words, letters, and punctuation that make up the speech. In the same way, say you want to upload the speech to the cloud. The exact same process is needed to transport that speech over the Internet.&lt;/p&gt;

&lt;p&gt;Base64 is an example of a binary-to-text encoding. In fact, it's by far the most common one in use, much like UTF-8 is for character encodings on the web. Its alphabet is a subset of ASCII, containing 64 of the 128 ASCII characters: &lt;code&gt;a-z&lt;/code&gt;, &lt;code&gt;A-Z&lt;/code&gt;, &lt;code&gt;0-9&lt;/code&gt;, &lt;code&gt;+&lt;/code&gt;, and &lt;code&gt;/&lt;/code&gt;. It doesn't contain non-printable characters like &lt;code&gt;NUL&lt;/code&gt; or the other ASCII control characters. Base64 is often used to translate a binary file to text, or a text file containing non-printable characters to one with only printable characters. The benefit is that you can output the contents of any type of file, no matter what data it contains. It doesn't have to be a file, either; it can be just a string, such as a password. You are guaranteed to always get characters that can be displayed, no matter what the underlying bits are. That is something UTF-8 cannot accomplish. How does Base64 do it?&lt;/p&gt;
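&lt;p&gt;That guarantee is easy to check on the command line (the bytes below are arbitrary, and none of them are valid UTF-8):&lt;/p&gt;

```shell
# Three bytes that UTF-8 cannot render still Base64-encode to
# printable characters:
printf '\xff\xfe\xfd' | base64    # prints //79
```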

&lt;p&gt;I described in the UTF-8 section in part 1 how certain bit patterns at the start of a byte indicate how many bytes the character will be. &lt;code&gt;0&lt;/code&gt; for 1 byte, &lt;code&gt;110&lt;/code&gt; for 2 bytes, &lt;code&gt;1110&lt;/code&gt; for 3 bytes, and &lt;code&gt;11110&lt;/code&gt; for 4 bytes. And it uses &lt;code&gt;10&lt;/code&gt; to indicate a byte is a continuation byte. This means that byte sequences that don't follow this pattern are incomprehensible to UTF-8. A byte that doesn't start with &lt;code&gt;0&lt;/code&gt;, &lt;code&gt;10&lt;/code&gt;, &lt;code&gt;110&lt;/code&gt;, &lt;code&gt;1110&lt;/code&gt;, or &lt;code&gt;11110&lt;/code&gt; wouldn't be rendered properly by UTF-8. For example, UTF-8 doesn't understand &lt;code&gt;11111111&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's show this on the command line with a new file, &lt;code&gt;file3.txt&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;file3.txt
123


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd &lt;span class="nt"&gt;-b&lt;/span&gt; file3.txt
00000000: 00110001 00110010 00110011 00001010                    123.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'\xff'&lt;/span&gt; | &lt;span class="nb"&gt;dd &lt;/span&gt;&lt;span class="nv"&gt;of&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;file3.txt &lt;span class="nv"&gt;bs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;seek&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;conv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;notrunc &lt;span class="c"&gt;# overwrite the first byte with 11111111&lt;/span&gt;
1+0 records &lt;span class="k"&gt;in
&lt;/span&gt;1+0 records out
1 byte copied, 0.0009188 s, 1.1 kB/s


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd &lt;span class="nt"&gt;-b&lt;/span&gt; file3.txt
00000000: 11111111 00110010 00110011 00001010                    .23.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is what the file looked like in VSCode using a UTF-8 encoding before being overwritten with the &lt;code&gt;printf '\xff' | dd...&lt;/code&gt; command:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcdd6ptj3ekhti8qnv5c.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcdd6ptj3ekhti8qnv5c.JPG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this is what it looked like after:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cyccm4f2rmx4dx0weff.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cyccm4f2rmx4dx0weff.JPG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As mentioned before, Base64 can always display printable characters, even when UTF-8 cannot. Let's see that in action:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;base64 &lt;/span&gt;file3.txt &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; file4.txt


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And now the file has printable characters:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjpdjvc2v8xp1uko3405.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjpdjvc2v8xp1uko3405.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, great. But how did we end up with &lt;code&gt;/zIzCg==&lt;/code&gt;? I'll take this one step at a time to avoid confusion.&lt;/p&gt;
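&lt;p&gt;(For reference, the whole example can be reproduced in one short session; these are the same four bytes that the &lt;code&gt;dd&lt;/code&gt; incantation produced above:)&lt;/p&gt;

```shell
# Recreate file3.txt: bytes 11111111 00110010 00110011 00001010
printf '\xff23\n' > file3.txt
base64 file3.txt    # prints /zIzCg==
```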

&lt;p&gt;Base64 has 64 characters in its alphabet. That means it only needs 6 bits to represent the whole alphabet (2&lt;sup&gt;6&lt;/sup&gt; == 64). UTF-8 uses the leading bits in a byte as metadata to determine whether it's a starting byte or a continuation byte. Those metadata bits don't hold any information about the character being stored (i.e., the actual data). In contrast, Base64 uses every bit as data; it has no metadata. However, as I mentioned, it only uses 6 bits per character, while a byte has 8 bits. How does this math line up?&lt;/p&gt;

&lt;p&gt;Let's start by examining the Base64 table, which looks very similar to the ASCII table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgot48wfz8cj9x4ajma8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgot48wfz8cj9x4ajma8u.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;file3.txt&lt;/code&gt;'s binary representation is &lt;code&gt;11111111 00110010 00110011 00001010&lt;/code&gt;. The way Base64 works is to interpret the bits in groups of 6. So even though the logical grouping of a byte is 8 bits, we're going to modify the groupings to be 6 bits (to reflect how Base64 sees this): &lt;code&gt;111111 110011 001000 110011 000010 10&lt;/code&gt;. In fact, let's look at it in a table format to make things easier:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bits&lt;/th&gt;
&lt;th&gt;Base64 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;111111&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;110011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;001000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;I&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;110011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;000010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;C&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;10&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;???&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first 5 groupings of 6 bits line up perfectly with the first 5 characters of our Base64 encoded &lt;code&gt;file4.txt&lt;/code&gt;. But we only have 2 bits remaining at the end, which is not enough to make a valid character in Base64. &lt;code&gt;file3.txt&lt;/code&gt; had 4 bytes, which is 32 bits. 32 is not divisible by 6.&lt;/p&gt;

&lt;p&gt;When a file's size in bits is not divisible by 6, Base64 resorts to padding. To make our 32-bit file compatible with Base64 we'll append &lt;code&gt;0000&lt;/code&gt; to the end so that the final character can be properly rendered. Here is the new bit string: &lt;code&gt;111111 110011 001000 110011 000010 100000&lt;/code&gt;. Let's view it in table format too:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bits&lt;/th&gt;
&lt;th&gt;Base64 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;111111&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;110011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;001000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;I&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;110011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;000010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;C&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;100000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;g&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's much better. Now the first 6 characters match. But what about the &lt;code&gt;==&lt;/code&gt; at the end? We have no bits remaining. In fact, &lt;code&gt;=&lt;/code&gt; isn't even in the Base64 table! What gives?&lt;/p&gt;

&lt;p&gt;Base64 requires that the number of characters output be divisible by 4, so those &lt;code&gt;=&lt;/code&gt; are padding characters added to satisfy that requirement. But why does the requirement exist? Well, let's hypothesize a bit. Base64 characters use 6 bits each. A byte uses 8 bits. Bytes are the fundamental building blocks in a file system; we don't measure things in bits, but in bytes. So how many Base64 characters does it take for the total number of bits to fit neatly into a whole number of bytes (i.e., be divisible by 8)?&lt;/p&gt;

&lt;p&gt;It takes 24 bits, which is 3 bytes. And there are 4 Base64 characters (of 6 bits each) in 24 bits. I suppose this was the rationale behind the &lt;code&gt;=&lt;/code&gt; padding requirement.&lt;/p&gt;

&lt;p&gt;Here is a table that displays how the original file size affects the Base64 output:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Original file size&lt;/th&gt;
&lt;th&gt;# of Base64 characters&lt;/th&gt;
&lt;th&gt;
&lt;code&gt;=&lt;/code&gt; padding&lt;/th&gt;
&lt;th&gt;
&lt;code&gt;0&lt;/code&gt; padding&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 byte&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;==&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0000&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 bytes&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;=&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;00&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 bytes&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 bytes&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;code&gt;==&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0000&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 bytes&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;code&gt;=&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;00&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6 bytes&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
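&lt;p&gt;The first three rows of the table are easy to verify from the command line (any short strings will do):&lt;/p&gt;

```shell
# 1, 2, and 3 input bytes all yield 4 Base64 characters, with
# ==, =, and no padding respectively:
printf 'A'   | base64    # QQ==
printf 'AB'  | base64    # QUI=
printf 'ABC' | base64    # QUJD
```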

&lt;p&gt;Let's walk through some examples of strings that both require padding and do not require it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;2 &lt;code&gt;=&lt;/code&gt; of padding: &lt;code&gt;@&lt;/code&gt; (&lt;code&gt;01000000&lt;/code&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bytes&lt;/th&gt;
&lt;th&gt;UTF-8 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01000000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bits&lt;/th&gt;
&lt;th&gt;Bit positions&lt;/th&gt;
&lt;th&gt;Base64 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;010000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;010000&lt;/strong&gt; 00&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Q&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;000000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;010000 &lt;strong&gt;00&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;A&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;padding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;none&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;=&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;padding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;none&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;=&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice that since there were only 2 bits left over at the end, &lt;code&gt;0000&lt;/code&gt; was appended as padding to make the bit length (excluding any &lt;code&gt;=&lt;/code&gt; padding) divisible by 6.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;1 &lt;code&gt;=&lt;/code&gt; of padding: &lt;code&gt;AB&lt;/code&gt; (&lt;code&gt;0100000101000010&lt;/code&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bytes&lt;/th&gt;
&lt;th&gt;UTF-8 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01000001&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;A&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01000010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;B&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bits&lt;/th&gt;
&lt;th&gt;Bit positions&lt;/th&gt;
&lt;th&gt;Base64 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;010000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;010000&lt;/strong&gt; 0101000010&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Q&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;010100&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;010000 &lt;strong&gt;010100&lt;/strong&gt; 0010&lt;/td&gt;
&lt;td&gt;&lt;code&gt;U&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;001000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;010000010100 &lt;strong&gt;0010&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;I&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;padding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;none&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;=&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This time &lt;code&gt;00&lt;/code&gt; was used as padding at the end of the string.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;No padding: &lt;code&gt;v3c&lt;/code&gt; (&lt;code&gt;011101100011001101100011&lt;/code&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bytes&lt;/th&gt;
&lt;th&gt;UTF-8 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01110110&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;v&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;00110011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;3&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01100011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;c&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bits&lt;/th&gt;
&lt;th&gt;Bit positions&lt;/th&gt;
&lt;th&gt;Base64 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;011101&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;011101&lt;/strong&gt; 100011001101100011&lt;/td&gt;
&lt;td&gt;&lt;code&gt;d&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;100011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;011101 &lt;strong&gt;100011&lt;/strong&gt; 001101100011&lt;/td&gt;
&lt;td&gt;&lt;code&gt;j&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;001101&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;011101100011 &lt;strong&gt;001101&lt;/strong&gt; 100011&lt;/td&gt;
&lt;td&gt;&lt;code&gt;N&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;100011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;011101100011001101 &lt;strong&gt;100011&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;j&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No &lt;code&gt;0&lt;/code&gt;s were needed as bit padding this time since the number of bits (24) is divisible by 6, and no &lt;code&gt;=&lt;/code&gt; characters were needed either since the input is a whole multiple of 3 bytes.&lt;/p&gt;
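&lt;p&gt;The same result can be reproduced with Python's &lt;code&gt;base64&lt;/code&gt; module:&lt;/p&gt;

```python
import base64

# "v3c" is 3 bytes (24 bits): the bits divide evenly into four 6-bit groups,
# so neither padding bits nor "=" characters are required.
print(base64.b64encode(b"v3c"))  # b'djNj'
```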




&lt;p&gt;Now we should be able to understand when padding is required and when it isn't. Let's take a look at the completed table of &lt;code&gt;file4.txt&lt;/code&gt; (the Base64 representation of &lt;code&gt;file3.txt&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Raw binary of &lt;code&gt;file3.txt&lt;/code&gt; (4 bytes in total): &lt;code&gt;11111111001100100011001100001010&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bytes&lt;/th&gt;
&lt;th&gt;Bit positions&lt;/th&gt;
&lt;th&gt;Base64 character&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;111111&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;111111&lt;/strong&gt; 11001100100011001100001010&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;110011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;111111 &lt;strong&gt;110011&lt;/strong&gt; 00100011001100001010&lt;/td&gt;
&lt;td&gt;&lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;001000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;111111110011 &lt;strong&gt;001000&lt;/strong&gt; 11001100001010&lt;/td&gt;
&lt;td&gt;&lt;code&gt;I&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;110011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;111111110011001000 &lt;strong&gt;110011&lt;/strong&gt; 00001010&lt;/td&gt;
&lt;td&gt;&lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;000010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;111111110011001000110011 &lt;strong&gt;000010&lt;/strong&gt; 10&lt;/td&gt;
&lt;td&gt;&lt;code&gt;C&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;100000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;111111110011001000110011000010 &lt;strong&gt;10&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;g&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;padding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;none&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;=&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;padding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;none&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;=&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Since &lt;code&gt;file3.txt&lt;/code&gt; is 4 bytes (32 bits), and 32 is not divisible by 6, the last Base64 character required &lt;code&gt;0000&lt;/code&gt; as bit padding; and since 4 bytes is one more than a multiple of 3, the complete Base64 output required &lt;code&gt;==&lt;/code&gt; as character padding.&lt;/p&gt;
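&lt;p&gt;Python confirms the completed table. The raw binary above corresponds to the bytes &lt;code&gt;0xff&lt;/code&gt;, &lt;code&gt;2&lt;/code&gt;, &lt;code&gt;3&lt;/code&gt;, and a trailing newline:&lt;/p&gt;

```python
import base64

# file3.txt's four bytes: 0xff, "2", "3", and a newline
# (11111111 00110010 00110011 00001010).
data = b"\xff23\n"
print(base64.b64encode(data))  # b'/zIzCg=='
```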

&lt;p&gt;One last thing to be aware of is that &lt;code&gt;file4.txt&lt;/code&gt;, whose contents are &lt;code&gt;/zIzCg==&lt;/code&gt;, will be stored as UTF-8 (which is byte-for-byte identical to ASCII in this instance, since the Base64 alphabet is a subset of ASCII). Remember that Base64 isn't a character encoding! It's a binary-to-text encoding. It is the character encoding that determines the bytes actually stored on disk. One mistaken assumption I had while learning this was that the Base64 file would have the exact same bytes on disk as the original file (i.e., &lt;code&gt;file4.txt&lt;/code&gt; and &lt;code&gt;file3.txt&lt;/code&gt; would have the same bytes). However, this is not the case! Observe:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd &lt;span class="nt"&gt;-b&lt;/span&gt; file4.txt
00000000: 00101111 01111010 01001001 01111010 01000011 01100111  /zIzCg
00000006: 00111101 00111101 00001010                             &lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So Base64 took the underlying bits of &lt;code&gt;file3.txt&lt;/code&gt;, used its algorithm to map those to Base64 characters, and then wrote those characters to &lt;code&gt;file4.txt&lt;/code&gt; in UTF-8. If we created a new file and manually typed in &lt;code&gt;/zIzCg==&lt;/code&gt;, it would have the exact same binary representation. This is simply a UTF-8 encoding of text.&lt;/p&gt;
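&lt;p&gt;A quick sketch in Python makes the distinction concrete: the UTF-8 bytes of the text &lt;code&gt;/zIzCg==&lt;/code&gt; are not the bytes of &lt;code&gt;file3.txt&lt;/code&gt;, but decoding the Base64 recovers them:&lt;/p&gt;

```python
import base64

text = "/zIzCg=="              # the contents of file4.txt
print(text.encode("utf-8"))    # b'/zIzCg==' -> the bytes stored on disk
print(base64.b64decode(text))  # b'\xff23\n' -> the bytes of file3.txt
```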




&lt;h3&gt;
  
  
  What is Base64url?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Base64#URL_applications" rel="noopener noreferrer"&gt;Base64url&lt;/a&gt; is something that will occasionally show up. This is a variant on Base64 where &lt;code&gt;+&lt;/code&gt; and &lt;code&gt;/&lt;/code&gt; are replaced with &lt;code&gt;-&lt;/code&gt; and &lt;code&gt;_&lt;/code&gt; so that the output will be &lt;a href="https://en.wikipedia.org/wiki/Percent-encoding" rel="noopener noreferrer"&gt;URL-safe&lt;/a&gt;. &lt;code&gt;+&lt;/code&gt; and &lt;code&gt;/&lt;/code&gt; must be encoded in a URL (i.e., &lt;code&gt;+&lt;/code&gt; becomes &lt;code&gt;%2B&lt;/code&gt;, &lt;code&gt;/&lt;/code&gt; becomes &lt;code&gt;%2F&lt;/code&gt;), but &lt;code&gt;-&lt;/code&gt; and &lt;code&gt;_&lt;/code&gt; are considered safe.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;=&lt;/code&gt; is also not URL-safe, but there is no standardization on how to handle it. Some libraries will percent-encode it (&lt;code&gt;%3D&lt;/code&gt;) and some will encode it as a period (&lt;code&gt;.&lt;/code&gt;).&lt;/p&gt;
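&lt;p&gt;Python's standard library exposes both variants, which makes the character swap easy to see (reusing the bytes of &lt;code&gt;file3.txt&lt;/code&gt; from above):&lt;/p&gt;

```python
import base64

data = b"\xff23\n"
print(base64.b64encode(data))          # b'/zIzCg=='
print(base64.urlsafe_b64encode(data))  # b'_zIzCg==' ("/" became "_")
```

Note that Python's URL-safe variant keeps the `=` padding; handling it is left to the caller.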




&lt;h3&gt;
  
  
  Encoding vs. encryption
&lt;/h3&gt;

&lt;p&gt;For some reason people often mix these two terms up. I think the reason why, specifically when it involves Base64, is because of the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Authorization#Examples" rel="noopener noreferrer"&gt;HTTP &lt;code&gt;Authorization&lt;/code&gt; request header&lt;/a&gt; and &lt;a href="https://jwt.io/" rel="noopener noreferrer"&gt;JWTs&lt;/a&gt;. Both of these concepts are security-related and involve Base64 to transform plaintext into seemingly "scrambled" output. As a result, people mistakenly think Base64 encoding is the same thing as encryption.&lt;/p&gt;

&lt;p&gt;Well it's not.&lt;/p&gt;

&lt;p&gt;Encryption is the process of mathematically transforming plaintext into ciphertext (a bunch of gibberish) using a key (basically just a random number). Depending on the type of encryption used, the only way to transform ciphertext back to plaintext is with that same key (symmetric encryption) or with a different-but-mathematically-related key (asymmetric encryption). The only way to break encryption without the key is through brute force, which depending on the strength of encryption used, could take &lt;a href="https://www.thesslstore.com/blog/what-is-256-bit-encryption/" rel="noopener noreferrer"&gt;6.4 quadrillion years&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Encoding, in the binary-to-text sense, is the process of transforming bits into output that's human-printable. It's meant to be a trivially reversible process that anyone can perform. Even if a different encoding than Base64 were used, there is only a small number of encodings in common use. Trying every one of them by brute force would take a modern computer a handful of milliseconds.&lt;/p&gt;

&lt;p&gt;This of course implies that the HTTP &lt;code&gt;Authorization&lt;/code&gt; request header and JWTs do not provide any inherent data confidentiality. Not to say that they are useless, but just that encryption is not one of their benefits. Anyone who intercepts those pieces of data can simply decode the Base64 with ease (if they are technically savvy enough to sniff network traffic then the odds are pretty good they also know what Base64 is). Base64 is meant to ensure that you won't have to deal with binary data (i.e., bytes that the standard character encodings don't know how to interpret) or characters like &lt;code&gt;NUL&lt;/code&gt; or &lt;code&gt;EOF&lt;/code&gt;. It is often used in security-related concepts (such as the &lt;a href="https://en.wikipedia.org/wiki/Privacy-Enhanced_Mail" rel="noopener noreferrer"&gt;PEM format&lt;/a&gt; for example), but it is not itself a security technique!&lt;/p&gt;
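&lt;p&gt;As a sketch of how little protection Base64 offers on its own, here is a Basic auth header decoded in one line (the header value below is the classic documentation example pair, not a real credential):&lt;/p&gt;

```python
import base64

# "Basic" auth merely Base64-encodes "username:password"; no key is
# needed to read it back.
header = "Basic YWxhZGRpbjpvcGVuc2VzYW1l"
token = header.split(" ", 1)[1]
print(base64.b64decode(token))  # b'aladdin:opensesame'
```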




&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Python uses Unicode strings for encoding
&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Python 2 there is a class of string literals known as &lt;a href="https://docs.python.org/2/howto/unicode.html#encodings" rel="noopener noreferrer"&gt;unicode strings&lt;/a&gt;. They are denoted by prefixing the character &lt;code&gt;u&lt;/code&gt; to a string literal (e.g., &lt;code&gt;u'abc'&lt;/code&gt;). I am not a fan of the term "unicode string" because it invites the confusion that Unicode is an encoding. So what exactly does Python mean when it refers to unicode strings?&lt;/p&gt;

&lt;p&gt;Let's look at some examples in Python 2.7.12:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;u&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;u&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abcŔŖ&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="sa"&gt;u&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="sa"&gt;u&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="se"&gt;\u0154\u0156&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So we define two strings, &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;, which contain the same contents as &lt;code&gt;file1.txt&lt;/code&gt; and &lt;code&gt;file2.txt&lt;/code&gt; from &lt;a href="https://dev.to/neumaneuma/decoding-the-confusing-world-of-encodings-part-1-3oke"&gt;part 1&lt;/a&gt;. &lt;code&gt;a&lt;/code&gt; is displayed without an issue, but the &lt;code&gt;ŔŖ&lt;/code&gt; at the end of &lt;code&gt;b&lt;/code&gt; is not. Instead those characters are replaced with their Unicode code points: &lt;code&gt;\u0154&lt;/code&gt; (&lt;code&gt;U+0154&lt;/code&gt;) and &lt;code&gt;\u0156&lt;/code&gt; (&lt;code&gt;U+0156&lt;/code&gt;). The Python 2 interactive interpreter echoes a string's repr, which uses only ASCII characters and escapes everything else.&lt;/p&gt;

&lt;p&gt;Let's try explicitly encoding these strings:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ascii&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="se"&gt;\xc5\x94\xc5\x96&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ascii&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;Traceback &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;most&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;stdin&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nb"&gt;UnicodeEncodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ascii&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="n"&gt;codec&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t encode characters in position 3-4: ordinal not in range(128)


&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;String &lt;code&gt;a&lt;/code&gt; can be encoded using both ASCII and UTF-8 as expected. Also as expected, encoding string &lt;code&gt;b&lt;/code&gt; using ASCII results in an error since neither &lt;code&gt;Ŕ&lt;/code&gt; nor &lt;code&gt;Ŗ&lt;/code&gt; is in the ASCII character set. And encoding string &lt;code&gt;b&lt;/code&gt; using UTF-8 produces a byte string in which the ASCII characters appear as themselves and each non-ASCII character becomes two bytes, displayed as hex escapes: &lt;code&gt;\xc5\x94&lt;/code&gt; for &lt;code&gt;Ŕ&lt;/code&gt; and &lt;code&gt;\xc5\x96&lt;/code&gt; for &lt;code&gt;Ŗ&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A unicode string in Python 2, then, is a sequence of Unicode code points; its repr shows ASCII-compatible characters as themselves and escapes everything else as code points. What about Python 3? Python 3 got rid of the distinction between a regular string (e.g., &lt;code&gt;abc&lt;/code&gt;) and a unicode string (e.g., &lt;code&gt;u'abc'&lt;/code&gt;), and just has regular strings without any prefixes. Does this mean there are no unicode strings in Python 3?&lt;/p&gt;

&lt;p&gt;Let's find out using Python 3.5.2:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abcŔŖ&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abcŔŖ&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Python 3 treats every string as a unicode string, and on top of that, it can now print non-ASCII characters to the console. The &lt;code&gt;encode()&lt;/code&gt; function still works the same, except that it now explicitly returns a byte string (note the &lt;code&gt;b&lt;/code&gt; prefix):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="se"&gt;\xc5\x94\xc5\x96&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The only remaining question is how to print out the code points:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unicode_escape&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;abc&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;u0154&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;u0156&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
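&lt;p&gt;In Python 3 the relationship between characters and code points can also be inspected directly with &lt;code&gt;ord()&lt;/code&gt; and &lt;code&gt;chr()&lt;/code&gt;:&lt;/p&gt;

```python
b = "abcŔŖ"

# Each character in a Python 3 string is a Unicode code point.
print([hex(ord(c)) for c in b])  # ['0x61', '0x62', '0x63', '0x154', '0x156']
print(chr(0x154))                # Ŕ
```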




&lt;p&gt;Now readers should have a good idea of what Base64 is and how it works, the difference between encoding and encryption, and what python means by unicode strings. That was a lot to get through! But that is indicative of the complexities and overloaded terms surrounding what an "encoding" is.&lt;/p&gt;

</description>
      <category>unicode</category>
      <category>utf8</category>
      <category>ascii</category>
      <category>base64</category>
    </item>
    <item>
      <title>Decoding the confusing world of encodings (Part 1)</title>
      <dc:creator>Bradley Neumaier</dc:creator>
      <pubDate>Wed, 08 May 2019 18:28:58 +0000</pubDate>
      <link>https://dev.to/neumaneuma/decoding-the-confusing-world-of-encodings-part-1-3oke</link>
      <guid>https://dev.to/neumaneuma/decoding-the-confusing-world-of-encodings-part-1-3oke</guid>
      <description>&lt;h1&gt;
  
  
  What is an encoding?
&lt;/h1&gt;

&lt;p&gt;Have you ever come across some of these statements?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This file is hex encoded&lt;/p&gt;

&lt;p&gt;This file uses an ASCII encoding&lt;/p&gt;

&lt;p&gt;This string is Unicode encoded&lt;/p&gt;

&lt;p&gt;Let's write the output to a UTF-8 encoded file&lt;/p&gt;

&lt;p&gt;Our message is safe because it's encoded using Base64&lt;/p&gt;

&lt;p&gt;Python uses Unicode strings for encoding&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These represent many of the ways the term "encode" is used across the industry. Frankly I found it all really confusing until I set out to write this post! I'm going to address each of these statements and attempt to define and disambiguate exactly what encoding means.&lt;/p&gt;




&lt;blockquote&gt;
&lt;h2&gt;
  
  
  This file is hex encoded
&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;A similar phrase to hex encoding is binary encoding. Personally I don't like the use of the term "encoding" here. Technically an argument could be made that the semantics are correct. However I prefer using the term "representation." It makes encoding less of an overloaded definition. Also, "representation" does a better job (in my mind at least) of describing what is actually happening.&lt;/p&gt;

&lt;p&gt;Hexadecimal (abbreviated as hex) and binary are both numeral systems. That's a fancy way of saying, "here's how to represent a number." If you step back and think about it, numbers are funny things. A number seems pretty straightforward, but it's actually an abstract concept. What is the number for how many fingers you have? You could say it's &lt;code&gt;00001010&lt;/code&gt;, &lt;code&gt;10&lt;/code&gt;, or &lt;code&gt;a&lt;/code&gt; and all three would be accurate! We learn to say &lt;code&gt;10&lt;/code&gt; because the easiest and most common numeral system for humans is decimal, also known as base-10. We have 10 fingers and 10 toes, so that makes learning how to count far more intuitive when we are infants.&lt;/p&gt;

&lt;p&gt;If we instead applied that ease-of-use criterion to computers we would get binary (or base-2). Why? Because computers fundamentally think of things as being &lt;a href="https://www.howtogeek.com/367621/what-is-binary-and-why-do-computers-use-it/" rel="noopener noreferrer"&gt;"on" or "off."&lt;/a&gt; Computers rely on electrons having either a positive charge or a negative charge to represent &lt;code&gt;1&lt;/code&gt;s and &lt;code&gt;0&lt;/code&gt;s. And it is with these &lt;code&gt;1&lt;/code&gt;s and &lt;code&gt;0&lt;/code&gt;s that the fundamentals of computing are accomplished, such as storing data or performing mathematical calculations.&lt;/p&gt;

&lt;p&gt;Great, so we can represent the same number in multiple ways. What use is that? Let's refer back to the number ten. We could represent it in binary (&lt;code&gt;00001010&lt;/code&gt;) or in hex (&lt;code&gt;a&lt;/code&gt;). It takes eight characters in binary (or four without the padding of &lt;code&gt;0&lt;/code&gt;s), but only one in hex! That's due to the number of symbols each system uses. Binary uses two: &lt;code&gt;0&lt;/code&gt; and &lt;code&gt;1&lt;/code&gt;. Hex uses 16: &lt;code&gt;0&lt;/code&gt;-&lt;code&gt;9&lt;/code&gt; and &lt;code&gt;a&lt;/code&gt;-&lt;code&gt;f&lt;/code&gt;. The difference in representation size is stark even for the number ten, and it grows significantly more unequal for larger numbers. So the advantage is that hex can represent large numbers much more efficiently than binary (and more efficiently than decimal too for that matter).&lt;/p&gt;
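&lt;p&gt;Python's built-ins make it easy to move between these numeral systems:&lt;/p&gt;

```python
n = 10  # the number of fingers you have

print(bin(n))          # 0b1010
print(hex(n))          # 0xa
print(int("1010", 2))  # 10
print(int("a", 16))    # 10
```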

&lt;p&gt;Let's explore how to turn this theory into practical knowledge. To provide some examples for this post I created two files via the command line: &lt;code&gt;file1.txt&lt;/code&gt; and &lt;code&gt;file2.txt&lt;/code&gt;. Here are their contents outputted:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;file1.txt
abc


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;file2.txt
abcŔŖ


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Don't worry about the unfamiliar &lt;code&gt;Ŕ&lt;/code&gt; and &lt;code&gt;Ŗ&lt;/code&gt; characters at the end of &lt;code&gt;file2.txt&lt;/code&gt;. I'll go over those details in-depth in the UTF-8 and Unicode sections. For now I will just show the binary and hex representations of each file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd &lt;span class="nt"&gt;-b&lt;/span&gt; file1.txt &lt;span class="c"&gt;# binary&lt;/span&gt;
00000000: 01100001 01100010 01100011 00001010                    abc.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd file1.txt &lt;span class="c"&gt;# hex&lt;/span&gt;
00000000: 6162 630a                                abc.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd &lt;span class="nt"&gt;-b&lt;/span&gt; file2.txt &lt;span class="c"&gt;# binary&lt;/span&gt;
00000000: 01100001 01100010 01100011 11000101 10010100 11000101  abc...
00000006: 10010110 00001010                                      ..


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd file2.txt &lt;span class="c"&gt;# hex&lt;/span&gt;
00000000: 6162 63c5 94c5 960a                      abc.....


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Again we see the compactness of hex on display. &lt;code&gt;file1.txt&lt;/code&gt; requires 32 characters to represent in binary, but only 8 in hex. &lt;code&gt;file2.txt&lt;/code&gt; requires 64 characters to represent in binary, but only 16 in hex. If we were to use a &lt;a href="https://www.mathsisfun.com/binary-decimal-hexadecimal-converter.html" rel="noopener noreferrer"&gt;hex to binary converter&lt;/a&gt; we can see how these representations line up with one another.&lt;/p&gt;

&lt;p&gt;Let's dissect &lt;code&gt;file1.txt&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Binary&lt;/th&gt;
&lt;th&gt;Hexadecimal&lt;/th&gt;
&lt;th&gt;Decimal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01100001&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;61&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;97&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01100010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;62&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;98&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;01100011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;63&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;99&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;00001010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As mentioned above, binary is the numeral system that computers "understand." The binary representation of these two files is literally how they are stored on the computer, as bits (&lt;code&gt;1&lt;/code&gt;s and &lt;code&gt;0&lt;/code&gt;s). The hex and decimal representations are just different ways of representing those same bits. We can see that every byte in binary (1 byte is equal to 8 bits) lines up with 2 hex characters. And we can see what those same values would be if they were represented in decimal. For reference, the largest 1 byte binary value is &lt;code&gt;11111111&lt;/code&gt;, which is &lt;code&gt;ff&lt;/code&gt; in hex and &lt;code&gt;255&lt;/code&gt; in decimal. The smallest 1 byte binary value is &lt;code&gt;00000000&lt;/code&gt;, which is &lt;code&gt;00&lt;/code&gt; in hex and &lt;code&gt;0&lt;/code&gt; in decimal. But even armed with this understanding of hex and binary, there's still a lot of ground to cover. How does all this relate to the contents of &lt;code&gt;file1.txt&lt;/code&gt;?&lt;/p&gt;
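&lt;p&gt;We can reproduce this table in Python by iterating over the raw bytes of &lt;code&gt;file1.txt&lt;/code&gt;:&lt;/p&gt;

```python
data = b"abc\n"  # the contents of file1.txt

print(list(data))              # [97, 98, 99, 10]  (decimal values)
print([hex(v) for v in data])  # ['0x61', '0x62', '0x63', '0xa']  (hex)
print(data.hex())              # 6162630a
```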

&lt;blockquote&gt;
&lt;h2&gt;
  
  
  This file uses an ASCII encoding
&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;Remember that these binary, hex, and decimal representations are all of the same number. But we're not storing a number! We're storing &lt;code&gt;abc&lt;/code&gt;. The problem is that computers have no concept of letters. They only understand numbers. So we need a way to say to the computer, "I want this character to translate to number X, this next character to translate to number Y, etc..." Enter ASCII.&lt;/p&gt;

&lt;p&gt;Back in the day, ASCII was more or less the de facto standard for encoding text written using the English alphabet. It assigns a numeric value to all 26 lowercase letters, all 26 uppercase letters, punctuation, symbols, and even the digits 0-9. Here is a picture of the ASCII table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyoyoret9crloksu3ycfn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyoyoret9crloksu3ycfn.jpg" alt="asciitable"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the mapping of &lt;code&gt;file1.txt&lt;/code&gt;'s hex values to their ASCII characters using the ASCII table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hexadecimal&lt;/th&gt;
&lt;th&gt;ASCII&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;61&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;a&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;62&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;b&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;63&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;c&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;LF&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We can see &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt;, and &lt;code&gt;c&lt;/code&gt; there just as we would expect. What is that &lt;code&gt;LF&lt;/code&gt; doing there at the end though? &lt;code&gt;LF&lt;/code&gt; is a newline character in Unix (standing for "line feed"). I pressed the &lt;code&gt;Return&lt;/code&gt; key when editing &lt;code&gt;file1.txt&lt;/code&gt;, so that added a newline.&lt;/p&gt;

&lt;p&gt;Any character in the ASCII character set requires only 1 byte to store. ASCII supports 128 characters, as we saw in the ASCII table. However, 1 byte allows for 256 (or 2&lt;sup&gt;8&lt;/sup&gt;) values to be represented. In decimal that would be &lt;code&gt;0&lt;/code&gt; (&lt;code&gt;00000000&lt;/code&gt; in binary) through &lt;code&gt;255&lt;/code&gt; (&lt;code&gt;11111111&lt;/code&gt; in binary). Shouldn't that mean ASCII could support 128 more characters? It could, but 128 characters were all that English text and its accompanying symbols required, so presumably that was all that was taken into account when the ASCII standard was formalized. As a result, ASCII only uses 7 of the 8 bits in a byte, leaving half of the possible values unused.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/" rel="noopener noreferrer"&gt;Joel Spolsky&lt;/a&gt; wrote an excellent blog post on this problem. Basically the issue was fragmentation. Everyone agreed what the first 128 values should map to, but then everyone went and decided their own usage for the remaining 128 values. As a result there was no consistency among different locales.&lt;/p&gt;

&lt;p&gt;Let's review what we've learned so far. We saw that the computer encodes the string &lt;code&gt;abc&lt;/code&gt; into numbers (which are stored as bits). We can view those bits in binary, just as the computer stores them, or through a different representation such as hex. &lt;code&gt;a&lt;/code&gt; becomes &lt;code&gt;97&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt; becomes &lt;code&gt;98&lt;/code&gt;, &lt;code&gt;c&lt;/code&gt; becomes &lt;code&gt;99&lt;/code&gt;, and the newline character in Unix is &lt;code&gt;10&lt;/code&gt;. ASCII is just a way to map bits (that computers understand) to characters (that humans understand).&lt;/p&gt;
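&lt;p&gt;That whole round trip can be reproduced in one line of Python 3 (again, my choice for illustration), and it matches the &lt;code&gt;xxd&lt;/code&gt; output byte for byte:&lt;/p&gt;

```python
# Encoding "abc\n" as ASCII yields exactly the numbers described above.
data = "abc\n".encode("ascii")
print(list(data))   # [97, 98, 99, 10]
print(data.hex())   # 6162630a
```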

&lt;p&gt;ASCII leaves a gaping issue though. There are a lot more than 128 characters in use! What do we do about characters from other languages? Other random symbols? Emojis???&lt;/p&gt;

&lt;blockquote&gt;
&lt;h2&gt;
  
  
  This string is Unicode encoded
&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;As anglocentric as programming is in 2019, English is not the only language that needs to be supported on the web. ASCII is fine for encoding English, but it is incapable of supporting anything else. This is where Unicode enters the fray. Unicode is not an encoding. That point bears repeating. Unicode is &lt;em&gt;not&lt;/em&gt; an encoding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Unicode" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt; calls it a standard that can be implemented by different character encodings. I find that definition, while succinct, too abstract. Instead, I prefer to think of it like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Imagine you have a giant alphabet. It can support over 1 million characters. It is a superset of every language known to humankind. It can support made-up languages. It contains every bizarre symbol you can think of. It has emojis. And all that only fills about 15% of its character set. There is space for much more to be added. However, it's impractical to have a keyboard that has button combinations for over 1 million different characters. The keyboard I'm using right now has 47 buttons dedicated to typeable characters. With the &lt;code&gt;Shift&lt;/code&gt; key that number is doubled. That's nowhere close to 1 million though. There needs to be some way to use the characters in this alphabet!&lt;/p&gt;

&lt;p&gt;In order to make this alphabet usable we're going to put it in a giant dictionary.  A normal dictionary would map words to their respective definitions. In this special dictionary we'll have numbers mapping to all these characters. So to produce the character you want, you will type the corresponding number for it. And then it will be someone else's job to replace those numbers with the characters that they map to in the dictionary. Just as the words are in alphabetical order, the numbers will be in ascending order. And for the characters not yet filled in, we'll just have a blank entry next to the unused numbers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is Unicode in a nutshell. It's a dictionary that supports an alphabet of over &lt;a href="https://stackoverflow.com/questions/27415935/does-unicode-have-a-defined-maximum-number-of-code-points#27416004" rel="noopener noreferrer"&gt;1.1 million characters&lt;/a&gt;. It does so through an abstraction called a code point. Every character has a &lt;a href="https://unicode-table.com/en/" rel="noopener noreferrer"&gt;unique code point&lt;/a&gt;. For example, &lt;code&gt;a&lt;/code&gt; has a code point of &lt;code&gt;U+0061&lt;/code&gt;. &lt;code&gt;b&lt;/code&gt; has a code point of &lt;code&gt;U+0062&lt;/code&gt;. And &lt;code&gt;c&lt;/code&gt; has a code point of &lt;code&gt;U+0063&lt;/code&gt;. Notice a pattern? &lt;code&gt;61&lt;/code&gt; is the hex value for the character &lt;code&gt;a&lt;/code&gt; in ASCII, and &lt;code&gt;U+0061&lt;/code&gt; is the code point for &lt;code&gt;a&lt;/code&gt; in Unicode. I'll come back to this point in the UTF-8 section.&lt;/p&gt;
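&lt;p&gt;You can look up these code points yourself with Python 3's built-in &lt;code&gt;ord()&lt;/code&gt;, which returns a character's Unicode code point; formatting it as four hex digits reproduces the &lt;code&gt;U+XXXX&lt;/code&gt; notation:&lt;/p&gt;

```python
# ord() gives the Unicode code point of a character.
for ch in "abc":
    print(ch, f"U+{ord(ch):04X}")
# a U+0061
# b U+0062
# c U+0063
```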

&lt;p&gt;The structure of a code point is as follows: &lt;code&gt;U+&lt;/code&gt; followed by a hex string. The smallest that hex string could be is &lt;code&gt;0000&lt;/code&gt; and the largest is &lt;code&gt;10FFFF&lt;/code&gt;. So &lt;code&gt;U+0000&lt;/code&gt; is the smallest code point (representing the &lt;code&gt;Null&lt;/code&gt; character) and &lt;code&gt;U+10FFFF&lt;/code&gt; is the largest code point (currently unassigned). As of &lt;a href="http://www.unicode.org/versions/Unicode12.0.0/" rel="noopener noreferrer"&gt;Unicode 12.0.0&lt;/a&gt; there are almost 138,000 code points in use, meaning slightly under 1 million remain. I think it's safe to say we won't be running out anytime soon.&lt;/p&gt;
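&lt;p&gt;Python 3 exposes both ends of this range directly, which makes for a nice sketch of the boundaries (illustrative only; the post's own demos use the shell):&lt;/p&gt;

```python
import sys

# The largest legal code point, U+10FFFF:
print(hex(sys.maxunicode))  # 0x10ffff

# chr() maps a code point back to its character...
print(chr(0x61))  # a

# ...and refuses anything beyond U+10FFFF:
try:
    chr(0x110000)
except ValueError:
    print("0x110000 is past the largest code point")
```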

&lt;p&gt;ASCII can map bits on a computer to the English alphabet, but it wouldn't know what to do with Unicode. So we need a character encoding that can map bits on a computer to Unicode code points (which in turn maps to a giant alphabet). This is where UTF-8 comes into play.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Let's write the output to a UTF-8 encoded file
&lt;/h2&gt;
&lt;/blockquote&gt;

&lt;p&gt;UTF-8 is one of several encodings that support Unicode. In fact, the UTF in UTF-8 stands for Unicode Transformation Format. You may have heard of some of the others: UTF-16 LE, UTF-16 BE, UTF-32, UCS-2, UTF-7, etc. I'm going to ignore the rest of these, though. Why? Because UTF-8 is by far the dominant encoding of the group: it is backwards compatible with ASCII, and according to &lt;a href="https://en.wikipedia.org/wiki/UTF-8" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt; it accounts for over 90% of all web page encodings.&lt;/p&gt;

&lt;p&gt;UTF-8 uses different byte sizes depending on what code point is being referenced. This is the feature that allows it to maintain backwards compatibility with ASCII.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3os3g91qpzc1wqylz9b.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3os3g91qpzc1wqylz9b.JPG" alt="UTF8"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;sup&gt;Source: Wikipedia&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;If UTF-8 encounters a byte that starts with &lt;code&gt;0&lt;/code&gt;, it knows it found a starting byte and that the character is only one byte long. If it encounters a byte that starts with &lt;code&gt;110&lt;/code&gt;, it knows it found a starting byte and to look for two bytes in total. For three bytes the prefix is &lt;code&gt;1110&lt;/code&gt;, and for four bytes it is &lt;code&gt;11110&lt;/code&gt;. All continuation bytes (i.e., the non-starting bytes: bytes 2, 3, or 4) start with &lt;code&gt;10&lt;/code&gt;. The &lt;a href="https://www.quora.com/Why-do-subsequent-bytes-in-UTF-8-need-to-start-with-10-when-the-first-byte-already-contains-the-information-on-how-many-bytes-in-total-are-used" rel="noopener noreferrer"&gt;reason for these continuation bytes&lt;/a&gt; is that they make it easy to find the starting byte of a character from any position in the stream.&lt;/p&gt;
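&lt;p&gt;The leading-bit rule is simple enough to write down as a few shift-and-compare checks. Here's a hypothetical Python 3 sketch of the classification described above -- not a full decoder, just the byte-kind logic:&lt;/p&gt;

```python
# Classify a single byte of a UTF-8 stream by its leading bits.
def byte_kind(b: int) -> str:
    if b >> 7 == 0b0:
        return "start of 1-byte character"
    if b >> 5 == 0b110:
        return "start of 2-byte character"
    if b >> 4 == 0b1110:
        return "start of 3-byte character"
    if b >> 3 == 0b11110:
        return "start of 4-byte character"
    if b >> 6 == 0b10:
        return "continuation byte"
    return "invalid in UTF-8"

# Walk the bytes of "aŔ": one 1-byte character, then a 2-byte one.
for b in "aŔ".encode("utf-8"):
    print(format(b, "08b"), byte_kind(b))
```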

&lt;p&gt;As a refresher, this is what &lt;code&gt;file2.txt&lt;/code&gt; looks like on the command line:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;file2.txt
abcŔŖ


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd &lt;span class="nt"&gt;-b&lt;/span&gt; file2.txt &lt;span class="c"&gt;# binary&lt;/span&gt;
00000000: 01100001 01100010 01100011 11000101 10010100 11000101  abc...
00000006: 10010110 00001010                                      ..


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;xxd file2.txt &lt;span class="c"&gt;# hex&lt;/span&gt;
00000000: 6162 63c5 94c5 960a                      abc.....


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Let's dissect &lt;code&gt;file2.txt&lt;/code&gt; to understand how UTF-8 works:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hexadecimal&lt;/th&gt;
&lt;th&gt;UTF-8&lt;/th&gt;
&lt;th&gt;Unicode Code Point&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;61&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;U+0061&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;62&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;U+0062&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;63&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;c&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;U+0063&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;c594&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Ŕ&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;U+0154&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;c596&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Ŗ&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;U+0156&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;LF&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;U+000A&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We can see that the hex representations for &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt;, &lt;code&gt;c&lt;/code&gt;, and &lt;code&gt;LF&lt;/code&gt; are the same as for &lt;code&gt;file1.txt&lt;/code&gt;, and that they align perfectly with their respective code points. The hex representations for &lt;code&gt;Ŕ&lt;/code&gt; and &lt;code&gt;Ŗ&lt;/code&gt; are twice as long as the other hex representations though. This means that they require 2 bytes to store instead of 1 byte.&lt;/p&gt;
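&lt;p&gt;You can reproduce the table above by encoding each character individually -- a minimal Python 3 sketch (my language choice for illustration):&lt;/p&gt;

```python
# Re-encode each character from file2.txt and compare with the table.
for ch in "abcŔŖ":
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(), f"{len(encoded)} byte(s)")
```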

&lt;p&gt;Here is a table showing the different representations and the type of byte side-by-side:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Byte type&lt;/th&gt;
&lt;th&gt;Binary&lt;/th&gt;
&lt;th&gt;Hexadecimal&lt;/th&gt;
&lt;th&gt;Decimal&lt;/th&gt;
&lt;th&gt;UTF-8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Starting Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;01100001&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;61&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;97&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;a&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Starting Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;01100010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;62&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;98&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;b&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Starting Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;01100011&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;63&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;99&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;c&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Starting Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;11000101&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;c5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;197&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Ŕ&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Continuation Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10010100&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;94&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;148&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Ŕ&lt;/code&gt; (contd.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Starting Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;11000101&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;c5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;197&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Ŗ&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Continuation Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10010110&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;96&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;150&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Ŗ&lt;/code&gt; (contd.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Starting Byte&lt;/td&gt;
&lt;td&gt;&lt;code&gt;00001010&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;LF&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;UTF-8 uses 1 byte to encode ASCII characters and multiple bytes to encode non-ASCII characters. To be precise, it uses 7 bits to encode ASCII characters, exactly like ASCII does. Every byte on disk that maps to an ASCII character maps to the exact same character in UTF-8; any code point outside that range simply uses additional bytes.&lt;/p&gt;

&lt;p&gt;As I alluded to earlier, the code points for &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt;, and &lt;code&gt;c&lt;/code&gt; match up exactly with the hex representations of those letters in ASCII. I suppose the designers of Unicode did this in the hopes that it would make backwards compatibility with ASCII easier. UTF-8 made full use of this: its first 128 characters require one byte to encode. Despite having room for 128 more values in its first byte, UTF-8 requires its 129th character to use 2 bytes (it has to -- single bytes with the high bit set are reserved for the starting and continuation patterns described above). &lt;a href="https://unicode-table.com/en/007F/" rel="noopener noreferrer"&gt;&lt;code&gt;DEL&lt;/code&gt;&lt;/a&gt; is the 128th character (#127 on the page because the table starts at 0) and has the hex representation &lt;code&gt;7F&lt;/code&gt;, totalling 1 byte. &lt;a href="https://unicode-table.com/en/0080/" rel="noopener noreferrer"&gt;&lt;code&gt;XXX&lt;/code&gt;&lt;/a&gt; (no, not the character for porn) is the 129th character and has the hex representation &lt;code&gt;C280&lt;/code&gt;, totalling 2 bytes.&lt;/p&gt;
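&lt;p&gt;The 127/128 boundary is easy to verify in Python 3 (illustrative sketch, not from the original post):&lt;/p&gt;

```python
# ASCII text produces identical bytes under both encodings:
assert "abc".encode("ascii") == "abc".encode("utf-8")

# The boundary sits exactly at code point 128: U+007F (DEL) still
# fits in one byte, while U+0080 needs two.
print(chr(0x7F).encode("utf-8").hex())  # 7f
print(chr(0x80).encode("utf-8").hex())  # c280
```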

&lt;p&gt;If you're curious here are examples of characters requiring over 2 bytes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 bytes: &lt;a href="https://unicode-table.com/en/3688/" rel="noopener noreferrer"&gt;&lt;code&gt;㚈&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;4 bytes: &lt;a href="https://unicode-table.com/en/1F701/" rel="noopener noreferrer"&gt;&lt;code&gt;🜁&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just to re-emphasize what is happening here: UTF-8 maps bytes on disk to a code point. That code point maps to a character in Unicode. A different encoding, like UTF-32 for example, would map those same bytes to a completely different code point. Or perhaps it wouldn't even have a mapping from those bytes to a valid code point. The point is that a series of bytes could be interpreted in totally different ways depending on the encoding.&lt;/p&gt;
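&lt;p&gt;To make that concrete, here is a small Python 3 sketch decoding the same two bytes under two different encodings (I picked Windows-1252, a common single-byte encoding, as the counterexample):&lt;/p&gt;

```python
# The same two bytes mean different things under different encodings.
raw = bytes([0xC5, 0x94])
print(raw.decode("utf-8"))   # Ŕ  -- one 2-byte character
print(raw.decode("cp1252"))  # Å” -- two 1-byte characters
```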




&lt;p&gt;That's it for part 1. We covered numeral systems like hex and binary (which I like to call representations instead of encodings), different character encodings such as ASCII and UTF-8, and what Unicode is (and why it's &lt;em&gt;not&lt;/em&gt; an encoding). In &lt;a href="https://dev.to/neumaneuma/decoding-the-confusing-world-of-encodings-part-2-4lo"&gt;part 2&lt;/a&gt; we'll address the remaining points and hopefully clear up the confusion surrounding the term "encoding."&lt;/p&gt;

</description>
      <category>unicode</category>
      <category>utf8</category>
      <category>ascii</category>
      <category>base64</category>
    </item>
  </channel>
</rss>
