Shalvah

Originally published at blog.shalvah.me

Packing and unpacking bytes

Goal: A brief exploration of what it means to "pack" and "unpack" bytes.

Inspiration

I've come across Ruby's Array#pack and String#unpack methods, but never had the time to dive into them. While researching another article, I came across a Stack Overflow question on the topic and decided to stop and explore it.

Exploration 1: Packing into two bytes

I can't give a formal definition of "packing", but I've gathered that it's a term for representing a series of bytes as a string. Depending on how you do it, the packed form can even take fewer bytes than the original. Unpacking is the reverse: recovering the original information.

I'll try an example based on the Stack Overflow question. I have a bunch of bytes, i.e. values between 0 (00000000) and 255 (11111111). Suppose I take two of them, say 126 and 2.

let [a, b] = [126, 2]

console.log(a.toString(2).padStart(8, '0'))  // 01111110
console.log(b.toString(2).padStart(8, '0'))  // 00000010

I could represent them in a string by using the JS hexadecimal escape sequence (\xNN):

console.log(a.toString(16).padStart(2, '0'))  // 7e
console.log(b.toString(16).padStart(2, '0'))  // 02

console.log('\x7E\x02') // "~\x02" (a tilde followed by an unprintable control character)

However, this isn't what I want, as this string has two characters. JavaScript strings are UTF-16 [note 1], so this string has 4 bytes, which is more than the original.

Buffer.from('\x7E\x02', 'utf16le').byteLength
// 4

This string has two characters of two bytes each: 00 7e and 00 02. I want to pack the bytes so the string has only one character, 7e 02. Here's how:

let char = String.fromCharCode((a << 8) | b)
console.log(char); // "縂"
Buffer.from(char, 'utf16le').byteLength // 2

This is a bit of bit arithmetic (haha).

  • a << 8 means "shift the bits of a left by 8 places", filling the vacated places with zeros
  • shifting 126 (01111110) left by 8 places gives us 01111110 00000000
  • | b is a bitwise OR operation
  • 01111110 00000000 ORed with 2 (00000010) gives 01111110 00000010, which is what I want (7E 02)

So there it is. I started with two bytes, and was able to fit them into a 2-byte character [note 2]. How about unpacking? Some more bitwise magic.

let bytes = char.charCodeAt(0)
let byteA = bytes >> 8 // Shift the bits to the right 8 times to get the first byte
let byteB = bytes & 0xFF // Bitwise AND the bits with 11111111 to keep only the second byte
// Alternative:
// byteB = bytes ^ (byteA << 8)
console.log(byteA, byteB) // 126, 2

Cool, cool.
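
To convince myself the shift-and-OR trick holds for every pair of bytes and not just 126 and 2, here is a minimal sketch that wraps the two steps in helper functions and round-trips all 65,536 combinations. The names packPair and unpackPair are made up for this illustration.

```javascript
// Hypothetical helpers (the names are not from any library).
function packPair(hi, lo) {
  // Mask both inputs to 8 bits so an out-of-range value cannot spill into the other byte.
  return String.fromCharCode(((hi & 0xff) << 8) | (lo & 0xff));
}

function unpackPair(char) {
  const unit = char.charCodeAt(0); // the 16-bit code unit
  return [unit >> 8, unit & 0xff]; // [high byte, low byte]
}

// Round-trip every possible pair of bytes and count failures.
let mismatches = 0;
for (let hi = 0; hi < 256; hi++) {
  for (let lo = 0; lo < 256; lo++) {
    const [h, l] = unpackPair(packPair(hi, lo));
    if (h !== hi || l !== lo) mismatches++;
  }
}
console.log(mismatches); // 0

// Caveat: a high byte from 0xD8 to 0xDF produces a lone surrogate,
// which is a valid JS string element but not valid text on its own.
const lone = packPair(0xd8, 0x00);
console.log(lone.length); // 1
console.log(unpackPair(lone)); // [ 216, 0 ]
```

The round trip works because charCodeAt reads the raw code unit back. The surrogate caveat only starts to matter once the string passes through something that insists on well-formed text, such as a UTF-8 encoder.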

I also found out you can do this packing natively with the TextDecoder API! [note 3]

let byteArray = new Uint8Array([a, b])
let packedStr = new TextDecoder('utf-16be').decode(byteArray)
console.log(packedStr) // "縂"

However, unpacking with TextEncoder gives wrong results for this use case, since it only supports UTF-8:

let unpackedArray = new TextEncoder().encode(packedStr) // note the parentheses: encode() is an instance method
console.log(unpackedArray) // Uint8Array [231, 184, 130]
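
Since TextEncoder can't emit UTF-16, one way around it is to write the code units out by hand. This is a rough sketch, assuming big-endian output is what's wanted (to mirror the utf-16be decoder); unpackUtf16be is a name I made up.

```javascript
// Hypothetical helper: writes each UTF-16 code unit of `str` as two bytes, high byte first.
function unpackUtf16be(str) {
  const bytes = new Uint8Array(str.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < str.length; i++) {
    // The third argument (littleEndian = false) selects big-endian byte order.
    view.setUint16(i * 2, str.charCodeAt(i), false);
  }
  return bytes;
}

const packed = String.fromCharCode((126 << 8) | 2); // "縂"
console.log(Array.from(unpackUtf16be(packed))); // [ 126, 2 ]
```

DataView is used here so the byte order is explicit rather than depending on the platform.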

Exploration 2: packing into one byte

Speaking of UTF-8, it's time to try that. But I'm changing some things:

  1. I won't use JS here, since its strings are UTF-16. I probably can use it, but I don't want that headache. Plus, I love any excuse to work with Ruby.
  2. All the bytes I'll pack are in the range 0 to 15. I've intentionally made it smaller so that I can pack two bytes into one UTF-8 character (one byte). I'll use 13 and 2 as my test bytes.

Packing in Ruby is pretty similar:

a, b = 13, 2

puts a.to_s(2).rjust(8, '0') # 00001101
puts b.to_s(2).rjust(8, '0') # 00000010

# hex
puts a.to_s(16).rjust(2, '0') # 0d
puts b.to_s(16).rjust(2, '0') # 02

char = ((a << 4) | b).chr # Shift by 4 bits, not 8, since I'm now packing in one byte
puts char # => "\xD2"
puts char.length # => 1
puts char.bytes.length # => 1

bytes = char[0].ord
byteA = bytes >> 4
byteB = bytes & 0x0F # AND with 0F, not FF, since I'm splitting up one byte
puts byteA, byteB # 13, 2

The output string here is a single byte "\xD2"...which is simply the original 0D and 02 bytes packed together 😀 Unfortunately, it's not a valid character on its own (in UTF-8, a lone byte above 7F never is), so printing it shows �, but it's there.

As mentioned earlier, Ruby has built-in pack and unpack methods, but with the integer directives I tried, each value maps to at least one whole byte, so I couldn't use them for this example.

packed = [a, b].pack('c*') # => "\r\x02"
packed.unpack('c*') # => [13, 2]

But they work with the original UTF-16 example:

a, b = 126, 2
packed = [a, b].pack('c*') # => "~\x02"
packed.unpack('c*') # => [126, 2]

It may not look like it, but the packed version here ("~\x02") is exactly the same as my manually packed JavaScript version. It contains the exact two bytes, 7E 02. The difference is the encoding; in Ruby, the packed string is tagged as binary (ASCII-8BIT), so it's rendered byte by byte. But I can change the encoding and see for myself!

packed.force_encoding 'utf-16be' # => "\u7E02" 
packed.length # => 1
packed.bytes.length # => 2

Possible uses of packing

Why would you want to pack, though? I'm thinking, perhaps in a constrained environment like gaming over the Internet. If there is a limited number of possible buttons a player can press (say 12), instead of transmitting each button press as one byte, I could:

  • wait for a few milliseconds, to gather the next few keypresses and send in a batch
  • pack these keypresses into a byte. 12 possible buttons can fit in 4 bits (2^4 = 16), so two keypresses can go in one byte (8 bits).

Here, packing serves as a form of compression: less data goes over the network, so responses can arrive faster and the gaming experience improves.
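
The keypress idea can be sketched as follows. Everything here is an assumption for illustration: the button codes are taken to be integers from 0 to 15, and packNibbles and unpackNibbles are made-up names.

```javascript
// Hypothetical helpers; button codes are assumed to be integers from 0 to 15.
function packNibbles(codes) {
  const out = new Uint8Array(Math.ceil(codes.length / 2));
  codes.forEach((code, i) => {
    if (!Number.isInteger(code) || code < 0 || code > 15) {
      throw new RangeError(`code out of range: ${code}`);
    }
    // Even positions go in the high 4 bits, odd positions in the low 4 bits.
    out[i >> 1] |= i % 2 === 0 ? code << 4 : code;
  });
  return out;
}

function unpackNibbles(bytes, count) {
  const codes = [];
  for (let i = 0; i < count; i++) {
    const byte = bytes[i >> 1];
    codes.push(i % 2 === 0 ? byte >> 4 : byte & 0x0f);
  }
  return codes;
}

const presses = [13, 2, 7];        // three keypresses gathered in one batch
const wire = packNibbles(presses); // 2 bytes instead of 3
console.log(Array.from(wire));     // [ 210, 112 ]
console.log(unpackNibbles(wire, presses.length)); // [ 13, 2, 7 ]
```

With an odd number of keypresses the last byte carries four padding bits, so the receiver needs to know the count some other way (for example, a length prefix). That detail is left out of this sketch.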

I also found this question, from a user who wanted to send a UUID as binary data. This is a valid use, since UUIDs are often rendered as strings, but they're actually a sequence of 16 bytes. Sending them as a string would take 36 bytes, so packing is useful here. You could also do this for other "binary-but-look-like-strings" data, like SHA-512 hashes for instance.
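
For the UUID case, a minimal sketch of the conversion might look like this. The helper names are made up, the sample UUID is just a placeholder value, and no particular wire format is assumed beyond "16 raw bytes in the order the hex digits appear".

```javascript
// Hypothetical helpers for illustration.
function uuidToBytes(uuid) {
  const hex = uuid.replace(/-/g, '');
  if (!/^[0-9a-fA-F]{32}$/.test(hex)) {
    throw new Error('expected 32 hex digits');
  }
  const bytes = new Uint8Array(16);
  for (let i = 0; i < 16; i++) {
    // Each pair of hex digits becomes one byte.
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return bytes;
}

function bytesToUuid(bytes) {
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
  // Re-insert the dashes in the usual 8-4-4-4-12 grouping.
  return [
    hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16),
    hex.slice(16, 20), hex.slice(20),
  ].join('-');
}

const uuid = '123e4567-e89b-12d3-a456-426614174000'; // sample value
const raw = uuidToBytes(uuid);
console.log(uuid.length, raw.byteLength); // 36 16
console.log(bytesToUuid(raw) === uuid);   // true
```

Note that bytesToUuid always produces lowercase hex, so an uppercase input would come back lowercased.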

Let me know if you can think of any other uses.

Notes

1. The ECMAScript spec says:

When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.

So JS strings are UTF-16. However, many modern Web APIs, like Blob and TextEncoder, and even older Node.js ones like Buffer assume (or accept only) UTF-8. My guess is that they expect the string to be from the outside world (reading a file, an API response, etc), in which case, it's most likely UTF-8.

2. The only reliable way I found to get the byte length of a native JS string (UTF-16) is Buffer.from(string, 'utf16le').byteLength. Commonly suggested ways I found include TextEncoder and Blob, but they always assume UTF-8.
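
A possible alternative that doesn't need Buffer: a string's length property counts UTF-16 code units, and each code unit is two bytes, so doubling it should give the same number. The sketch below compares that against the UTF-8 size from TextEncoder; the two function names are made up.

```javascript
// Hypothetical helpers for comparing the two encodings' sizes.
function utf16ByteLength(str) {
  return str.length * 2; // length counts 16-bit code units
}

function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

console.log(utf16ByteLength('\x7E\x02'), utf8ByteLength('\x7E\x02')); // 4 2
console.log(utf16ByteLength('縂'), utf8ByteLength('縂'));             // 2 3
```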

3. For this to work as expected, I had to specify UTF-16 Big Endian (utf-16be) as the encoding. UTF-16 because I want 2 bytes per character, and big-endian because I want the most significant ("big") byte to come first, like I did in the custom packer.
