Quack Quack

Posted on Feb 15, 2023 • Edited on Feb 17, 2023

Implement base64 encoding using Rust - [Part 1] Base64 for non-unicode characters

#rust #algorithms #programming #tutorial

Introduction

I've been using base64 encoding for a while and haven't got an idea how it works internally so i decided to read some documentation and implement it using Rust.

What will we learn?

Some bit wise and bit shift operations
How base64 works internally?
Why base64 takes up more space?

Implementation

First, let's read up the RFC 4648 to know how things work.

How it works

Basically, the base64 encoding takes a chunk of 3 characters and gives back 4 characters by using the following rule. Each character has 8 bits so 3 character has 3*8 = 24 bit, we simple just split that 24 bits into 4 part (6 bits every part) and encode it using the table below to get 4 characters.

If we don't have enough character to fill up 3 bytes segment. For example, our input is just an "a" which is 1 byte, we will fill up 2 empty byte 0x0 to get the 3 bytes segment. The first 2 characters should be encoded using "a" first 6 left most bits and 2 right most bits following by 2 padding characters "=".

The code

First, we define our alphabet.

static alphabet: [char; 64] = [
    'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S',
    'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
    'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1', '2', '3', '4',
    '5', '6', '7', '8', '9', '+', '/',
];

And also the padding character

static padding_char: char = '=';

We also need to define the signature of our base64_encode function. It will take a &str as input and output a String

fn base64_encoding(input:&str) -> String
{
    todo!()
}

I initially wanted to loop over the string input and create 3 characters segment first, each group contains a pointer to a portion of the input string. Then i would iterate over each group and process them. But i found that it would consume more memory if the input is a very long string. So, i decided to go with one character at the time.

for char in input.chars(){
//do stuff here
}

Before we can process the list, we need to define some rules to process each character.

Each character is a byte, which is 0b00000000. (If you don't know about binary numbers, please spend sometimes to get familiar with the concept Google). So, 3 bytes could be represent like this

0b00000000 0b00000000 0b00000000.

If the character is the first one in the group, then we take it's first 6 leftmost bits and encode it as a character.

0b*000000*00 0b00000000 0b00000000. (take the bold bits)
If the character is the second one in the group, then we take the 2 right most bit of the first character in group and combine it with the 4 left most bits of the second character to get the 2nd encoded character.

0b000000*00* 0b*0000*0000 0b00000000.
If the character is the third one in the group, then we take the 4 right most bits of the second character and combine it with the 2 left most bits of the third characters to get the 3rd encoded character and take the rest for the 4th encoded character. 3rd encoded character should look like this: 0b00000000 0b0000*0000* 0b*000000 and 4th as follow 0b00000000 0b00000000 0b00000000*

Before we can process inside the loop, we will need to create a couple of variables

    let mut result = String::new(); //The result to return
    let mut counter = 0; //the current index in the loop, could only be 0 1 2 represent 1st 2nd 3rd byte in the 3 bytes segment
    let mut left_carry: u8 = 0; //bits left from the previous byte

We will also define the function to handle each character in 3 bytes segment

//we will return the first encoded char and 
fn handle_first_char(first_char: char) -> (char, u8) {
    let char = alphabet[(first_char as u8 >> 2) as usize]; //take the first 6 bits and encode it as a char;
    let carry = ((first_char as u8) & 0x3) as u8; //last 2 bits;

    (char, carry)
}

Next, we write 2 separate functions to handle the second and third character

fn handle_second_char(second_char: char, carry: u8) -> (char, u8) {
    let char = alphabet[(((second_char as u8) >> 4) | (carry << 4)) as usize];
    let carry = (second_char as u8) & 0xF;
    (char, carry)
}

fn handle_third_char(third_char: char, carry: u8) -> (char, char) {
    let first_char = alphabet[((carry << 2) | ((third_char as u8) >> 6)) as usize];
    let second_char = alphabet[((third_char as u8) & 0x3F) as usize];
    (first_char, second_char)
}

Let's go back to our base64_encode function, we will use these handle_first_character, handle_second_character and handle_third_character inside this method.


pub fn base64_encode(input: &str) -> String {
    let mut result = String::new();
    let mut counter = 0;
    let mut left_carry: u8 = 0;

    //return early if the string is empty
    if input.is_empty() {
        return result;
    }

    for char in input.chars() {
        println!("{}", char);
        counter += 1;

        if counter == 1 {
            let (char, carry) = handle_first_char(char);
            result.push(char);
            left_carry = carry;
        }

        if counter == 2 {
            let (char, carry) = handle_second_char(char, left_carry);
            result.push(char);
            left_carry = carry;
        }

        if counter == 3 {
            let (first_char, second_char) = handle_third_char(char, left_carry);

            result.push(first_char);
            result.push(second_char);

            counter = 0; //set counter = 0 to create another 3 byte segment
            left_carry = 0; //reset the carry here
        }
    }

Add padding characters

The algorithm works now, but it's only correct if our string has enough character to fill up 3 bytes segment without any odd characters. If our string has only one character, it would give us wrong result.
Let's add a bit more code to handle padding.

We've mentioned above that a segment is made of 3 characters. If we have 1 character left, we should add 2 more characters as padding and add 1 character in case we have 2 characters left.

//from the padding above
fn calculate_padding(counter: u8) -> u8 {
    //the result could only be 1 or 2
    3 - counter
}

Let's add 2 more functions to handle padding for each case (1 or 2 padding char).

fn handle_padding_first_char(carry: u8) -> char {
    let char = alphabet[(0x00 | (carry << 4)) as usize];
    char
}

fn handle_padding_second_char(carry: u8) -> char {
    let char = alphabet[((carry << 2) | (0x00)) as usize];
    char
}

And modify our base64_encode accordingly

pub fn base64_encode(input: &str) -> String
...
    if counter != 0 {
        let padding = calculate_padding(counter);
        if padding == 2 {
            let char = handle_padding_first_char(left_carry);
            result.push(char);
            result.push(padding_char);
            result.push(padding_char);
        } else {
            let char = handle_padding_second_char(left_carry);
            result.push(char);
            result.push(padding_char);
        }
    }

    println!("{}", result);
    result

Now that's what we call base64 for non-unicode characters.

Testing

Let's write some test as well and use an online base64_encode website to verify that we got the correct answer (You can use this site).

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn it_works() {
        assert_eq!(&base64_encode("crab"), "Y3JhYg==");
        assert_eq!(
            &base64_encode("the brown fox jump over the lazy dog!"),
            "dGhlIGJyb3duIGZveCBqdW1wIG92ZXIgdGhlIGxhenkgZG9nIQ=="
        );
        assert_eq!(&base64_encode(""), "");
        assert_eq!(&base64_encode("a"), "YQ==");
    }
}

Conclusion

You have just learned:

Bit wise and bit shift operations.
How base64 works internally.
The base64 take 3 characters as input then gives back 4 characters as output, that's why it occupies more space than the original string.

As i mentioned above, our implementation only works for non-unicode characters. In the next part, i will show you guys how to do that.

The source code is available on Github here

Top comments (2)

Julien Cretin • Feb 23 '23

Hey,

If you're interested to understand all the subtleties of base64 and understand the general ideas behind RFC 4648, then you might be interested by the data-encoding crate (which also supports base32, base16 aka hexadecimal, base8 aka octal, base4, and base2 aka binary with the same core function).

In particular you can: