Victoria Drake

Posted on Jan 7, 2018 • Edited on Mar 16, 2019

A unicode substitution cipher algorithm

#javascript #showdev #algorithms

Full transparency: I occasionally waste time messing around on Twitter. (Gasp! Shock!) One of the ways I waste time messing around on Twitter is by writing my name in my profile with different unicode character "fonts," 𝖑𝖎𝖐𝖊 𝖙𝖍𝖎𝖘 𝖔𝖓𝖊. I previously did this by searching for different unicode characters on Google, then one-by-one copying and pasting them into the "Name" field on my Twitter profile. Since this method of wasting time was a bit of a time waster, I decided (in true programmer fashion) to write a tool that would help me save some time while wasting it.

I dubbed the tool uni-pretty. It lets you type any characters into a field and then converts them into unicode characters that also represent letters, giving you fancy "fonts" that override a website's CSS, like in your Twitter profile. (Sorry, Internet.)

The tool's first naive iteration existed for about twenty minutes while I copy-pasted unicode characters into a data structure. This approach of storing the characters in the JavaScript file, called hard-coding, is fraught with issues. Besides having to store every character from every font style, it's painstaking to build and hard to update, and more code means more room for errors.

Fortunately, working with unicode means that there's a way to avoid the whole mess of having to store all the font characters: unicode numbers are sequential. More importantly, the special characters in unicode that could be used as fonts (meaning that there's a matching character for most or all of the letters of the alphabet) generally follow the same sequence: capital A-Z, lowercase a-z. (A few styles have gaps where a letter lives elsewhere in unicode, such as the double-struck capital "C," but the style used here is complete.)

For example, in the fancy unicode above, the lowercase letter "L" character has the unicode number U+1D591 and HTML code &#120209;. The next letter in the sequence, a lowercase letter "M," has the unicode number U+1D592 and HTML code &#120210;. Notice how the numbers in those codes increment by one.
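If you'd like to check those numbers yourself, here's a quick sketch using JavaScript's built-in String.fromCodePoint and codePointAt. The variable names are just for illustration and aren't part of the tool.

```javascript
// U+1D591 is the fancy lowercase "l" from the example above
const fancyL = String.fromCodePoint(0x1D591);

// The decimal value is the number that goes inside the HTML code
const decimal = fancyL.codePointAt(0);
console.log(decimal); // 120209

// Adding one lands on the next letter in the sequence, the fancy "m"
const fancyM = String.fromCodePoint(decimal + 1);
console.log(fancyM.codePointAt(0).toString(16)); // 1d592
```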

Why's this relevant? Since each special character can be referenced by a number, and we know that the order of the sequence is always the same (capital A-Z, lowercase a-z), we're able to produce any character simply by knowing the first number of its font sequence (the capital "A"). If this reminds you of anything, you can borrow my decoder pin.

In cryptography, the Caesar cipher (or shift cipher) is a simple method of encryption that utilizes substitution of one character for another in order to encode a message. This is typically done using the alphabet and a shift "key" that tells you which letter to substitute for the original one. For example, if I were trying to encode the word "cat" with a right shift of 3, it would look like this:

c a t
f d w
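As a rough sketch of that shift in JavaScript (caesarShift is a made-up helper for this example, and it assumes lowercase input only):

```javascript
// Hypothetical helper: shift each lowercase letter by `shift` places
function caesarShift(text, shift) {
    const alphabet = 'abcdefghijklmnopqrstuvwxyz';
    return text.split('').map(letter => {
        const i = alphabet.indexOf(letter);
        // Leave anything that isn't a lowercase letter untouched
        if (i === -1) return letter;
        // Wrap around the ends of the alphabet (also works for negative shifts)
        const shifted = ((i + shift) % alphabet.length + alphabet.length) % alphabet.length;
        return alphabet[shifted];
    }).join('');
}

console.log(caesarShift('cat', 3));  // fdw
console.log(caesarShift('fdw', -3)); // cat
```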

With this concept, encoding our plain text letters as a unicode "font" is a simple process. All we need is an array to reference our plain text letters with, and the starting number of our unicode capital "A" representation. Since unicode numbers are written in hexadecimal, which includes letters (still sequential, but an unnecessary complication), and since the intent is to display the page in HTML, we'll use the decimal HTML code number &#120172;, with the extra bits (the &# and the ;) removed for brevity.

var plain = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'];

var fancyA = 120172;

Since we know that the letter sequence of the fancy unicode is the same as our plain text array, any letter can be found by using its index in the plain text array as an offset from the fancy capital "A" number. For example, capital "B" in fancy unicode is the capital "A" number, 120172, plus B's index, which is 1: 120173.
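That offset arithmetic can be sketched on its own like this. The names plainLetters, fancyCapitalA, and fancyNumber are assumptions for this example, and the alphabet is written as a string purely to keep it short.

```javascript
// Same A-Z, a-z order as the plain array, written as a string for brevity
const plainLetters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
const fancyCapitalA = 120172;

// Hypothetical helper: the HTML code number for a single plain letter
function fancyNumber(letter) {
    const offset = plainLetters.indexOf(letter);
    // Anything outside A-Z, a-z has no fancy counterpart in this sketch
    return offset === -1 ? null : fancyCapitalA + offset;
}

console.log(fancyNumber('B')); // 120173
console.log(fancyNumber('l')); // 120209
```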

Here's our conversion function:

function convert(string) {
    // Create a variable to store our converted letters
    let converted = [];
    // Break string into substrings (letters)
    let arr = string.split('');
    // Search plain array for indexes of letters
    arr.forEach(element => {
        let i = plain.indexOf(element);
        // If the letter isn't a letter (not found in the plain array)
        if (i === -1) {
            // Return as a whitespace
            converted.push(' ');
        } else {
            // Get relevant character from fancy number + index
            let unicode = fancyA + i;
            // Return as HTML code
            converted.push('&#' + unicode + ';');
        }

    });
    // Print the converted letters as a string
    console.log(converted.join(''));
}
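If you'd rather get real characters back instead of HTML codes (say, for pasting straight into a profile field), one possible variation is to build the string with String.fromCodePoint. This is only a sketch and not how uni-pretty itself works: convertToText is a made-up name, and unlike the function above it leaves non-letters as they are rather than turning them into spaces.

```javascript
const plain = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
const fancyA = 120172;

// Hypothetical variant: returns the fancy characters as a string
function convertToText(string) {
    return Array.from(string, char => {
        const i = plain.indexOf(char);
        // Keep spaces, digits and punctuation unchanged
        if (i === -1) return char;
        // Fancy capital "A" number plus the letter's offset
        return String.fromCodePoint(fancyA + i);
    }).join('');
}

console.log(convertToText('Hi there'));
```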

A neat possibility for this method of encoding requires a departure from my original purpose, which was to create a human-readable representation of the original string. If the purpose was instead to produce a cipher, this could be done by using any unicode index in place of fancyA as long as the character indexed isn't a representation of a capital "A."

Here's the same code set up with a simplified plain text array, and a non-letter-representation unicode key:

var plain = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'];

var key = 9016;

You might be able to imagine that decoding a cipher produced by this method would be relatively straightforward, once you knew the encoding secret. You'd simply need to subtract the key from the HTML code numbers of the encoded characters, then find the relevant plain text letters at the remaining indexes.
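Here's a minimal sketch of that round trip, assuming the simplified lowercase array and the key above. The encode and decode functions are illustrative names, and this version leaves spaces and other non-letters as they are.

```javascript
const plain = 'abcdefghijklmnopqrstuvwxyz';
const key = 9016;

// Encode: add each letter's index to the key
function encode(message) {
    return Array.from(message, char => {
        const i = plain.indexOf(char);
        return i === -1 ? char : String.fromCodePoint(key + i);
    }).join('');
}

// Decode: subtract the key to get back to an index in the plain array
function decode(secret) {
    return Array.from(secret, char => {
        const i = char.codePointAt(0) - key;
        // Anything that doesn't land inside the array wasn't encoded
        return i >= 0 && i < plain.length ? plain[i] : char;
    }).join('');
}

console.log(decode(encode('secret code'))); // secret code
```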

Well, that's it for today. Be sure to drink your Ovaltine and we'll see you right here next Monday at 5:45!

Oh, and... ⍔⍠⍟⍘⍣⍒⍥⍦⍝⍒⍥⍚⍠⍟⍤ ⍒⍟⍕ ⍨⍖⍝⍔⍠⍞⍖ ⍥⍠ ⍥⍙⍖ ⍔⍣⍪⍡⍥⍚⍔ ⍦⍟⍚⍔⍠⍕⍖ ⍤⍖⍔⍣⍖⍥ ⍤⍠⍔⍚⍖⍥⍪

Top comments (9)

Ben Halpern • Jan 7 '18

𝔻𝕒𝕞𝕟 𝕥𝕙𝕒𝕥 𝕚𝕤 𝕒 𝕗𝕦𝕟 𝕥𝕠𝕠𝕝

𝔸𝕞𝕒𝕫𝕚𝕟𝕘 𝕛𝕠𝕓 𝕨𝕚𝕥𝕙 𝕒𝕝𝕝 𝕠𝕗 𝕥𝕙𝕚𝕤 𝕍𝕚𝕔𝕜𝕪

Dave Cridland • Jan 8 '18 • Edited

>>> u''.join([ c if c == ' ' else unichr(ord(c) - 0x2352 + ord('A')) for c in s ])

I can never resist these things.

The three letter words were helpful - there's limited options there, so I thought aiming for an AND or a THE would be a good crib. The fact that you've left spaces unencoded does, of course, make this much simpler.

I do get, though, that this article isn't about cryptography. :-)

Victoria Drake • Jan 8 '18

Python3:

''.join([ c if c == ' ' else chr(ord(c) - 0x2352 + ord('A')) for c in s ])

I love this. Let's be secret code buddies.

Dave Cridland • Jan 8 '18

⍢⎄⎁⍴⌽⌯⍙⎄⎂⎃⌯⎁⍴⍼⍴⍼⍱⍴⎁⌯⎃⍾⌯⍴⍽⍲⍾⍳⍴⌯⎈⍾⎄⎁⌯⎂⍿⍰⍲⍴⎂⌯⍽⍴⎇⎃⌯⎃⍸⍼⍴⌽

Max Cerrina • Jan 8 '18

I feel so 𝓯𝓪𝓷𝓬𝔂!

Mihail Malo • Feb 2 '19

I like this approach in node:

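const big = str => {
  // "ucs2" stores each UTF-16 code unit as two bytes: low byte, then high byte
  const out = Buffer.from(str, "ucs2"),
    len = out.length
  for (let i = 0; i < len; i += 2) {
    const ascii = out[i]
    // Skip anything that isn't a printable, non-space ASCII character
    if (out[i + 1] !== 0 || ascii < 0x21 || ascii > 0x7E) continue
    // Fullwidth forms start at U+FF01: shift the low byte, set the high byte to 0xFF
    out[i] = ascii - 0x20
    out[i + 1] = 0xff
  }
  return out.toString("ucs2")
}
big("Big Chungus")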