Massimo Artizzu

Posted on Aug 28, 2021

Let's develop a QR Code Generator, part VII: other kinds of data

#javascript #qrcode #tutorial

So now we're able to create QR Codes for certain data. Specifically, short Latin-1 strings (i.e., strings with just 256 possible symbols, defined in the Latin-1 table). But, as we've said since part 1, we can also encode numbers, alphanumeric strings and even Kanji characters, thus wasting less of our available space.

After all, it's a shame if we can use 256 symbols but end up using just a limited set, no? But we're still working with codewords, and a codeword roughly translates into 8-bit bytes. So we need a way to stick more data in those bytes.

In the end, what we'll need is some function that spouts values that we'll need to write in our buffer (that consists of codewords, or better our Uint8Arrays). Keep in mind that those values aren't going to be 8-bit long, but rather they'll have variable bit length, as we'll see.

Preparing the field

Since we're using JavaScript, what better function to emit values than a generator? We'll come up with 4 different generator functions - one for each encoding mode - with the following signature (pardon the TypeScript):



type ContentValuesFn = (content: string) => Generator<{
  value: number;
  bitLength: number;
}, void, unknown>;

Each yielded value will go with its length in bits. Our old function getByteData (see part 2) will be replaced by a generic encoding function with the same arguments, and a fairly simple getByteValues generator function like this:



function* getByteValues(content) {
  for (const char of content) {
    yield {
      value: char.charCodeAt(0),
      bitLength: 8
    };
  }
}

Numbers

If we improperly accepted that a kilobyte is not 1000 bytes (as it should be), but rather 1024, it's because 1024 and 1000 are so close. We can actually take advantage of that!

So, how do we encode numbers? Let's start with a large number, for example the 10th perfect number: it's 191561942608236107294793378084303638130997321548169216, a 54-digit behemoth (yes, perfect numbers grow quite fast).

The next step is to split the number in groups of 3 digits:

191 561 942 608 236 107 294 793 378 084 303 638 130 997 321 548 169 216

Each of these groups can be stored in 10 bits (as 2¹⁰ = 1024), wasting just above 2% of space. If the last group is just 2 digits long, instead of 10 bits it will take 7 (since 2⁷ = 128 is enough to cover 100 values), and if the last group is just one digit it will take 4.
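To get a feel for the savings, here's a small sketch that counts how many bits a string of digits takes in numeric mode, following the rule above. The helper name numericBitLength is made up for this example and isn't part of the generator.

```javascript
// Hypothetical helper, for illustration only: bits needed by the data
// part of a numeric-mode string (mode and length fields not included).
function numericBitLength(digits) {
  const fullGroups = Math.floor(digits.length / 3);
  const rest = digits.length % 3;
  // A trailing group of 1 or 2 digits takes 4 or 7 bits respectively
  const restBits = [0, 4, 7][rest];
  return fullGroups * 10 + restBits;
}

const perfect = '191561942608236107294793378084303638130997321548169216';
console.log(numericBitLength(perfect)); // 18 groups of 10 bits each
console.log(perfect.length * 8);        // the same string in byte mode
```

For our 54-digit number that's 180 bits, instead of the 432 that byte mode would need.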

This will be the final result (version 2 QR Code, medium correction):

In code

We need to come up with a function that does just the above. We'll also use a BIT_WIDTHS constant as something to map the length of the group to its bit length:



const BIT_WIDTHS = [0, 4, 7, 10];
function* getNumericValues(content) {
  for (let index = 0; index < content.length; index += 3) {
    const chunk = content.slice(index, index + 3);
    const bitLength = BIT_WIDTHS[chunk.length];
    const value = parseInt(chunk, 10);
    yield { value, bitLength };
  }
}

Alphanumeric

Only 45 symbols are supported in alphanumeric mode, and they are:

numeric Arabic digits (codes from 0 to 9);
uppercase Latin letters (codes 10-35);
the following symbols: " " (space, code 36), "$" (37), "%" (38), "*" (39), "+" (40), "-" (41), "." (42), "/" (43), ":" (44).

If you notice, these symbols are enough for most URLs, although in uppercase and without query strings or fragments (as in our example from the previous parts, we'd encode HTTPS://WWW.QRCODE.COM/), but more in general alphanumeric mode should be used for simple messages in Latin letters and Arabic digits, plus some punctuation.

Why 45 symbols? I think it's because 45² = 2025. So, since 2¹¹ = 2048, similarly to numeric mode, we can encode two characters using 11 bits, wasting even less space (~1%).

All we have to do, then, is split our string into groups of two characters:



HT TP S: // WW W. QR CO DE .C OM /

Then, for each group, map each character to its alphanumeric code, multiply the first by 45 and add the second (as you'd do in a base-45 arithmetic). For the first group, H is code 17, T is 29, so the value to be written in our buffer is 17 * 45 + 29 = 794.

If the last group consists of only one character (as in our case), we'd need only 6 bits to write its value.
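As a quick check of the arithmetic above, here's a sketch that computes the values for our example string by hand. The names ALPHANUMERIC_CHARS and pairValue are chosen just for this snippet.

```javascript
// The 45 symbols in code order, so a symbol's index is its code
const ALPHANUMERIC_CHARS = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:';

// Hypothetical helper: value of a group of one or two characters
function pairValue(group) {
  const first = ALPHANUMERIC_CHARS.indexOf(group[0]);
  if (group.length === 1) {
    return first;
  }
  return first * 45 + ALPHANUMERIC_CHARS.indexOf(group[1]);
}

const groups = 'HTTPS://WWW.QRCODE.COM/'.match(/.{1,2}/g);
const values = groups.map(pairValue);
console.log(groups.length);             // 12 groups, the last one is '/'
console.log(values[0]);                 // 'HT' -> 17 * 45 + 29 = 794
console.log(values[values.length - 1]); // '/' alone -> 43
```

With 11 full groups and one single character, the whole string takes 11 * 11 + 6 = 127 bits.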

We'll get this result (version 2, quartile quality):

In code

The generator function for alphanumeric mode will be, predictably, very similar to the one for numeric mode. We'll use a constant string as a lookup table for mapping characters to their alphanumeric codes.



const ALPHACHAR_MAP = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:';
function* getAlphanumericValues(content) {
  for (let index = 0; index < content.length; index += 2) {
    const chunk = content.slice(index, index + 2);
    const bitLength = chunk.length === 1 ? 6 : 11;
    const codes = chunk.split('').map(
      char => ALPHACHAR_MAP.indexOf(char)
    );
    const value = chunk.length === 1
      ? codes[0]
      : codes[0] * ALPHACHAR_MAP.length + codes[1];
    yield { value, bitLength };
  }
}

Kanji mode

Kanji is a very complex alphabet. I don't even know if it can be actually called that, as it's not phoneme-based, but rather a set of logographic characters. But being so complex, you wouldn't expect encoding Kanji characters to be simple, would you?

Encoding Kanji in QR Codes uses the so-called Shift JIS code table, so for each character we'll have to find its equivalent code in Shift JIS. Not only that: QR Codes can accept characters with codes from (in hex) 0x8140 to 0x9FFC, and again from 0xE040 to 0xEBBF, for 6593 characters in total.

I won't go into detail about how to map a character into its Shift JIS code for now, as there are good libraries for the job (iconv-lite comes to mind, and you can even have a look at the actual table if you want to whip up your own solution). It's sufficient to say that we'll need 13 bits (2¹³ = 8192) for each one of them.

But we won't use the Shift JIS codes directly, as they're all well above 8192 in value. We'll need to do the following:

get the Shift JIS code;
if the code is between 0x8140 and 0x9FFC, subtract 0x8140; otherwise, subtract 0xC140;
get the most significant byte from the difference above (basically, shift the value 8 bits to the right), and multiply it by 0xC0 (192);
add the least significant byte of the difference (i.e., get the rest modulo 256).

For example, the character 荷 is 0x89D7 in Shift JIS, and the operations above will give us 1687; 茗 is 0xE4AA, so we'll get 6826.
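Spelled out step by step, the two examples look like this. It's a sketch that takes the Shift JIS code as a number, so no conversion library is needed; the function name shiftJISToQRValue is made up for the occasion.

```javascript
// Hypothetical helper: turns a Shift JIS code into its 13-bit QR value
function shiftJISToQRValue(code) {
  // Bring the code down by the offset of the range it belongs to
  const difference = code - (code <= 0x9ffc ? 0x8140 : 0xc140);
  // Most significant byte of the difference, times 0xC0
  const high = (difference >> 8) * 0xc0;
  // Plus the least significant byte of the difference
  const low = difference & 0xff;
  return high + low;
}

console.log(shiftJISToQRValue(0x89d7)); // 荷: 0x0897 -> 8 * 192 + 151 = 1687
console.log(shiftJISToQRValue(0xe4aa)); // 茗: 0x236A -> 35 * 192 + 106 = 6826
```

Note that the highest accepted code, 0xEBBF, maps to 8191, which is the largest value that fits in 13 bits.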

In code

Let's suppose we have a magical getShiftJISCode function, so we won't need to write our own:



function* getKanjiValues(content) {
  for (const char of content) {
    const code = getShiftJISCode(char);
    const reduced = code - (code >= 0xe040 ? 0xc140 : 0x8140);
    const value = (reduced >> 8) * 192 + (reduced & 255);
    yield { value, bitLength: 13 };
  }
}

Wrap everything up

In part 2 we had a getByteData function to fill our available codewords, so we'll need something similar.

But first, we need a function to actually write value bits into our buffer. Something like this:



function putBits(buffer, value, bitLength, offset) {
  const byteStart = offset >> 3;
  const byteEnd = (offset + bitLength - 1) >> 3;
  let remainingBits = bitLength;
  for (let index = byteStart; index <= byteEnd; index++) {
    const availableBits = index === byteStart ? 8 - (offset & 7) : 8;
    const bitMask = (1 << availableBits) - 1;
    const rightShift = Math.max(0, remainingBits - availableBits);
    const leftShift = Math.max(0, availableBits - remainingBits);
    // chunk might get over 255, but it won't fit a Uint8 anyway, so no
    // problem here. Watch out using other languages or data structures!
    const chunk = ((value >> rightShift) & bitMask) << leftShift;
    buffer[index] |= chunk;
    remainingBits -= availableBits;
  }
}

It takes four arguments:

buffer is a Uint8Array (where we need to write);
value is the value we need to write;
bitLength is the length in bits of value;
offset is the index of the bit we'll start writing from.

I won't go into details, but basically it takes 8-bit chunks from value and writes them into the buffer, preserving the existing data (that's why the OR assignment |=).
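If the shifting and masking above feels opaque, it may help to compare it with a slower, bit-by-bit variant. This is only a sketch for illustration (putBitsNaive is not part of the generator), but it should fill the buffer the same way:

```javascript
// Illustrative alternative: writes one bit at a time, most significant first
function putBitsNaive(buffer, value, bitLength, offset) {
  for (let index = 0; index < bitLength; index++) {
    const bit = (value >> (bitLength - 1 - index)) & 1;
    const position = offset + index;
    // Bit 0 of the stream is the leftmost bit of the first byte
    buffer[position >> 3] |= bit << (7 - (position & 7));
  }
}

const buffer = new Uint8Array(3);
putBitsNaive(buffer, 0b0010, 4, 0); // alphanumeric mode indicator
putBitsNaive(buffer, 23, 9, 4);     // a length of 23, in 9 bits (assumed here)
putBitsNaive(buffer, 794, 11, 13);  // the 'HT' group from the alphanumeric example
console.log(Array.from(buffer, byte => byte.toString(2).padStart(8, '0')));
// -> ['00100000', '10111011', '00011010']
```

Reading the three bytes left to right gives back 0010, then 000010111, then 01100011010: the three values, one right after the other.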

Next, we'll need to map the encoding mode values to our generator functions:



const valueGenMap = {
  [0b0001]: getNumericValues,
  [0b0010]: getAlphanumericValues,
  [0b0100]: getByteValues,
  [0b1000]: getKanjiValues
};

Then, we're going to refactor the aforementioned getByteData function into something similar, but that works for every encoding mode:



function getData(content, lengthBits, dataCodewords) {
  const encodingMode = getEncodingMode(content);
  let offset = 4 + lengthBits;
  const data = new Uint8Array(dataCodewords);
  putBits(data, encodingMode, 4, 0);
  putBits(data, content.length, lengthBits, 4);
  const dataGenerator = valueGenMap[encodingMode];
  for (const { value, bitLength } of dataGenerator(content)) {
    putBits(data, value, bitLength, offset);
    offset += bitLength;
  }
  const remainderBits = 8 - (offset & 7);
  const fillerStart = (offset >> 3) + (remainderBits < 4 ? 2 : 1);
  for (let index = 0; index < dataCodewords - fillerStart; index++) {
    const byte = index & 1 ? 17 : 236;
    data[fillerStart + index] = byte;
  }
  return data;
}
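getData relies on getEncodingMode, which isn't shown in this part. Now that all four modes are covered, one possible shape for it is the sketch below. The regular expressions and the fallback to Kanji are assumptions made for this example: they only look at which characters appear, and don't check whether a character actually falls in the accepted Shift JIS ranges.

```javascript
// Assumed patterns, one per mode, from the most compact to the least
const NUMERIC_RE = /^\d*$/;
const ALPHANUMERIC_RE = /^[\dA-Z $%*+\-.\/:]*$/;
const LATIN1_RE = /^[\x00-\xff]*$/;

// Sketch: picks the mode indicator used as a key in valueGenMap
function getEncodingMode(content) {
  if (NUMERIC_RE.test(content)) {
    return 0b0001;
  }
  if (ALPHANUMERIC_RE.test(content)) {
    return 0b0010;
  }
  if (LATIN1_RE.test(content)) {
    return 0b0100;
  }
  // Anything else is treated as Kanji here, without further checks
  return 0b1000;
}

console.log(getEncodingMode('2021'));                    // 1
console.log(getEncodingMode('HTTPS://WWW.QRCODE.COM/')); // 2
console.log(getEncodingMode('https://www.qrcode.com/')); // 4
console.log(getEncodingMode('荷茗'));                     // 8
```

Testing the modes in this order means the most compact encoding that fits the content is the one that gets picked.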

Coming soon…

We've gotten around the first of the main limitations of our QR Code generator so far: the encoding mode. We haven't seen ECI mode yet, but we've covered the basic 4 modes.

In the next parts, we'll create QR Codes of different sizes too, as we've only created version 2 codes. So keep in touch and see you around! 👋
