jD91mZM2

Posted on Feb 9, 2018

Base64 is fun!

#base64 #encoding #rust

Hello! Today I decided to stop feeling like Base64 is black magic, and go really learn it. You can see the result here. I learned a lot of interesting things, so I'll share it with you here.
Because of mixed backgrounds I'll make sections for everything so you can skip anything you already know.

Bitwise operators

To properly learn Base64, you need to learn bitwise operators. I'm sure you've seen stuff like 50 & 1 == 0 somewhere before. What are these weird half-boolean operators doing on the integers?!
As you know, computers work with binary. But the relationship is closer than you think. What bitwise operators do is flip certain bits.
Each binary digit (0 or 1) is called a bit.

Bitwise And: &. This calculates the integer result of two integers where each bit is the same. In other words, it sees which bits number one and number two has. So it's called bitwise and. Examples:

  42 // <- Binary: 101010
& 15 // <- Binary: 001111
= 10 // <- Binary: 001010

  5 // <- Binary: 101
& 6 // <- Binary: 110
= 4 // <- Binary: 100

Bitwise Shift: << and >>. Basically "shifts" all numbers to left or right. These can be compared to multiplying by 10^n in decimal. In fact, it's exactly the same as multiplying by 2^n. When you multiply something by 2, the compiler sometimes optimizes it to bit-shifting by one.

   16 // <- Binary:  10000
<< 1
=  32 // <- Binary: 100000

   42 // <- Binary: 101010
>> 3
=  5  // <- Binary:    101

Byte order

As you may know, an int is a 32-bit integer and a byte is an 8-bit integer.
Since 32 (and all other types of integers) is a multiple of 8, you could technically convert a bigger type of integer to multiple bytes and then back... right?
Like say 666, 0000001010011010, could be stored as 00000010 and 10011010?

Turns out, of course, I wasn't the first one to think of this. In fact, in memory they're represented like this too. In C you could simply cast a int* to a byte* and then access the specific bytes inside the integer.
What you may find (it depends), is that they are in the reverse order of what I just showed (10011010 first, not last). This is called little endian, where as what I showed is big endian. Big endian means the "big end" (the byte that makes the largest impact, just like changing the one in 100 makes a larger impact than changing the one in 10) is first.

Putting it all together

Normal binary data is stored in 8 bits, which means it can go to 256.
64 happened to be a good amount of characters that is a power of 2.
The reason it's a power of two is quite brilliant: It can be converted to by simply reading 6 bits instead of 8.

Say we have the following: "hello".
In bytes, this is 104 101 108 108 111.
In binary: 01101000 01100101 01101100 01101100 01101111.

Now we make this into one integer. Before we do that though, there's a problem. This quickly gets unbelievably large.
We solve this using a convenient fact: 4 6-bit integers has the same bit length as 3 8-bit ones. That's also why it's padded to 4: 6*4 = 8*3. This is really convenient for the programmer when decoding, so he/she doesn't have to bounds check valid base64.

Because of this, we only have to worry about 3 bytes at a time (4 when decoding).

for bytes in input.chunks(3) {
    // Fun fact, you *could* do this, but it's more fun to use bit shifting.
    // let buf = mem::transmute::<&[u8], u32>(bytes).to_be();
    let mut buf = 0;
    buf += (bytes.get(0).cloned().unwrap_or(0) as u32) << 8*2;
    buf += (bytes.get(1).cloned().unwrap_or(0) as u32) << 8*1;
    buf +=  bytes.get(2).cloned().unwrap_or(0) as u32;

The reason for .get(n).cloned().unwrap_or(0) is basically because indexing can crash. The input doesn't have to be dividable by 3.
Now we have

011010000110010101101100

and

011011000110111100000000

.
Note: The last number gets padded out with 0s.

Second step now sounds really simple: Re-split the numbers every 6 bytes and insert = if it runs out before it's a multiple of 4.

    for i in 0..4 {
        let j = (3-i) * 6;
        let buf = ((buf >> j) & 0x3F) as u8;

The first part is actually rather simple but unecessarily verbose. It counts down from 24, 6 steps at a time. It could probably have been written like for j in (0..=24).rev().step_by(6), but I need i later.

But next comes the fun part:
Let's imagine j is 18.

   011010000110010101101100
>> 24
=  011010000110

   011010000110
&  000000111111 // <- 0x3F in binary.
=  000110

See the pattern? In a way, we basically just took a substring... of the integer! The bitwise shift allowed us to fetch a value down, and the bitwise and allowed us to specify which values we cared about.

Now we have

011010 000110 010101 101100

and

011011 000110 111100 000000

Finally, let's insert the value, or = if out of bounds.

        if bytes.len() >= i {
            output.push(char(buf));
        } else {
            output.push('=');
        }
    }
}

Note: I didn't include the definition of char function. It basically just maps a value to a base64 character, like 0 to A.

We now have

a G V s

and

b G 8 =

aGVsbG8=

Top comments (1)

David Paine20 • Jul 3 '20

I must say, that is some unique you shared. Really a wonderful sharing. And something new I learned. As my Google search, for base64, I landed on your page. There is a tool, I would recommend you, for base64 decode/encode, that I also found during my Google search. That tool is url-decode.com/tool/base64-decode. You also check it out as well.