When I first began researching binary file formats, I was met with a complete absence of human-friendly explanations anywhere online. All of the resources that I came across were full of unfamiliar technical terms that made me feel like I was reading a textbook written for someone with twice my IQ. Hence, this article. This is a summary of my research, explained in English so that you can understand it without the trouble that I had. Now, without further delays, let's get into the learning!
As I said, there were a number of technical terms that I encountered in my research that I had never heard of before. These made little sense to me at the time, and took even more painful research to understand. Here is a list of those terms, explained in friendly ways:
Binary - binary is a number system that has a base of 2. That means that the only digits it uses are `0` and `1`.
Bit - a bit is the smallest unit of data in a computer, consisting of nothing more than a single binary digit (a `0` or a `1`).
Byte - a byte is the next unit of data in a computer. A byte consists of 8 bits.
Signed Integer - a signed integer is an integer (whole number) that is associated with a sign that declares whether the number is positive or negative. Basically, it is an integer with a `+` or `-` attached to it.
Unsigned Integer - an unsigned integer is an integer that is not associated with a sign (`+` or `-`), and is always considered to be zero or a positive value.
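To make these terms a little more concrete, here is a small Python sketch. The value `200` is just an arbitrary number I picked for illustration, and the signed reading assumes the two's complement representation that Python's `int.from_bytes` uses for signed values.

```python
# Hypothetical illustration; 200 is an arbitrary example value.
value = 200

# A byte is 8 bits, so write the number out as 8 binary digits.
bits = format(value, "08b")

# The very same byte can be read as an unsigned or as a signed integer.
raw = bytes([value])
as_unsigned = int.from_bytes(raw, byteorder="big", signed=False)
as_signed = int.from_bytes(raw, byteorder="big", signed=True)

print(bits)         # 11001000
print(as_unsigned)  # 200
print(as_signed)    # -56
```

The bits never change; only the way we choose to interpret them does.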
A file is a collection of bits that is stored on a computer. Files are generally divided into bytes, and measured by the number of bytes that they contain (Kilo*bytes*, Mega*bytes*, Giga*bytes*, etc.).
Before we can truly begin working with binary file encoding, we need to understand the types of data that can be stored within a file. There are two main types of data in a file: integers and strings.
Integers are separated into two sub-types, signed and unsigned, as I explained at the beginning of the article.
Unsigned integers, thanks to the lack of a sign, can store a maximum value roughly twice that of a signed integer of the same size.
Note: unsigned integer is abbreviated as `Uint` in almost every programming situation.
Signed integers (`Int`s) are pretty much exactly the same, aside from their reduced maximum value.
Both kinds of integers are found in different sizes in binary files, and are generally named based on how many bits they use to store their value. For example, `Uint8` and `Int8` both use 8 bits in the file, `Uint16` and `Int16` both use 16 bits, and so on. These numbers have different value limits based on the number of bits that they use:
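|Type||Minimum Value||Maximum Value|
|Uint8||0||255|
|Int8||-128||127|
|Uint16||0||65,535|
|Int16||-32,768||32,767|
|Uint32||0||4,294,967,295|
|Int32||-2,147,483,648||2,147,483,647|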
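The limits follow directly from the number of bits, so they can be calculated rather than memorized. Here is a rough sketch; `int_range` is a helper name I made up for this example, and the signed limits assume two's complement storage, which is what most formats use.

```python
def int_range(bits, signed):
    """Return (minimum, maximum) for an integer stored in `bits` bits."""
    if signed:
        # One bit is spent on the sign, leaving bits - 1 for the magnitude.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Every bit contributes to the value, and the minimum is always 0.
    return 0, 2 ** bits - 1


for bits in (8, 16, 32):
    print(f"Uint{bits}", int_range(bits, signed=False))
    print(f"Int{bits}", int_range(bits, signed=True))
```

The same function works for any bit count, so it should also cover less common sizes.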
These are not the only sizes that can be used, however, and you may sometimes encounter some oddball sizes. Here are a couple that I ran into:
|Type||Minimum Value||Maximum Value|
Strings, as you may already know, are ordered sets of text characters. When a string is stored in a binary file, it is converted into a set of bytes, usually with an encoding such as `utf-8`. Each basic (ASCII) character is stored in a single byte as a `Uint8` character ID, while other characters take more than one byte.
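Here is a quick way to see those character IDs for yourself. This is only a sketch using Python's built-in `str.encode`; the strings are arbitrary examples. Note that a character outside the basic ASCII range, such as `é`, takes more than one byte in UTF-8.

```python
text = "Hi!"

# Convert the string into bytes; each byte is a number from 0 to 255.
encoded = text.encode("utf-8")
print(list(encoded))            # [72, 105, 33]

# Decoding reverses the process and gives the original string back.
print(encoded.decode("utf-8"))  # Hi!

# A non-ASCII character needs two bytes here instead of one.
accented = "é".encode("utf-8")
print(list(accented))           # [195, 169]
```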
Endianness is one of the more difficult concepts to grasp, so I'm going to take an example-based approach here. Let's say I have a variable that is a `Uint16` with the value of `255`. In order to store this variable in a binary file, it must first be converted into a set of bytes. Because our variable is a `Uint16`, taking up 16 bits of space, it requires two bytes in the file to store. But which of those two bytes comes first in the file? This is where endianness comes into play.
Let's convert our `Uint16` variable into binary, making sure to keep 16 digits: `0000000011111111`. Notice how the number is split evenly between `0`s and `1`s. Each of those groups is a single byte. The byte on the left (the `0`s) is the high byte, meaning it comes first when the value is written out normally. The byte on the right (the `1`s) is called the low byte. As you go across the bytes from left to right, the bytes go from higher to lower "order". So, in this number, the left byte is of a higher order than the right byte.
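If you want to pull the high and low bytes out of a number yourself, one common way is a bit shift and a mask. This is a minimal sketch of that idea, using the value `255` from the example.

```python
value = 255  # the Uint16 value from the example

# Shift the top 8 bits down to get the high byte.
high_byte = (value >> 8) & 0xFF
# Mask off everything except the bottom 8 bits to get the low byte.
low_byte = value & 0xFF

print(format(value, "016b"))     # 0000000011111111
print(format(high_byte, "08b"))  # 00000000
print(format(low_byte, "08b"))   # 11111111
```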
Now, there are two types of "endianness" that a number can be encoded in: little endian and big endian format. The name tells you which end of the number is written first. A number that is encoded in little endian format will have the bytes ordered from lowest to highest, meaning that the lowest ("littlest") byte comes first in the file. Numbers encoded in big endian format are ordered in the reverse of little endian numbers, meaning that the highest ("biggest") byte comes first instead.
Returning to our `Uint16` variable, here is what the number will look like when in the file:
|Endianness||Byte 0||Byte 1|
|Little endian||11111111||00000000|
|Big endian||00000000||11111111|
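You can check the byte order yourself with Python's standard `struct` module. In its format strings, `<` means little endian, `>` means big endian, and `H` is an unsigned 16-bit integer. This is just a sketch using the value `255` from the example.

```python
import struct

value = 255

little = struct.pack("<H", value)  # low byte is written first
big = struct.pack(">H", value)     # high byte is written first

print(list(little))  # [255, 0]
print(list(big))     # [0, 255]

# Reading bytes with the wrong endianness gives a very different number.
wrong = struct.unpack(">H", little)[0]
print(wrong)         # 65280
```

That last line is why a file format has to state which endianness it uses: the bytes alone do not tell you.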
From here on, we can only discuss the general pattern of binary file formats, so keep in mind that any file format you see does not need to adhere to what is discussed here.
In most binary files, the data is split into two main types of sections: the file header and binary data blocks.
A header in a binary file is generally a collection of bytes that have fixed positions at the beginning of the file. A file header typically contains data about the version of the file format that the file was encoded with, along with some other data about the contents of the file. Headers do not, however, have to be the same length in every file of the same type. Sometimes, the header needs to contain data that has a variable length, such as a string that is provided by the program. Strings are not always the same length, so in order to accommodate this, headers will typically define the total length of the header and a constant position for the string to start. Any program that parses the file will then start reading the string at the constant index and continue until it has read the full length of the header.
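To show what that could look like in practice, here is a sketch of a made-up header layout. Everything about it is my own assumption for the example, not a real file format: one `Uint8` for the version, one little endian `Uint16` for the total header length, and then a UTF-8 string that always starts at byte 3. The names `build_header` and `parse_header` are hypothetical too.

```python
import struct

# Hypothetical layout: Uint8 version, then Uint16 total header length.
FIXED_PART = struct.Struct("<BH")
# The string always starts right after the fixed part (byte 3).
STRING_START = FIXED_PART.size


def build_header(version, name):
    """Encode a header whose length depends on the string it carries."""
    encoded = name.encode("utf-8")
    header_length = STRING_START + len(encoded)
    return FIXED_PART.pack(version, header_length) + encoded


def parse_header(data):
    """Read the fixed fields, then the string up to the header's end."""
    version, header_length = FIXED_PART.unpack_from(data, 0)
    name = data[STRING_START:header_length].decode("utf-8")
    return version, name, header_length


file_bytes = build_header(2, "example") + b"\x00\x01"  # header + other data
print(parse_header(file_bytes))  # (2, 'example', 10)
```

The returned header length also tells the parser where the rest of the file begins.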
The rest of a file, after the header, is generally devoted to data blocks. Each data block can be either a fixed or variable size, and will commonly have its own header that tells the program that is parsing the file how to use the data inside the block, as well as the length of the block if it is a variable size.
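Here is a rough sketch of reading variable-size data blocks. Again, the layout is something I invented for illustration: each block starts with a small header made of a `Uint8` block type and a little endian `Uint32` payload length, followed by the payload itself. `write_block` and `read_blocks` are hypothetical helper names.

```python
import struct

# Hypothetical block header: Uint8 block type, Uint32 payload length.
BLOCK_HEADER = struct.Struct("<BI")


def write_block(block_type, payload):
    """Prefix a payload with its block header."""
    return BLOCK_HEADER.pack(block_type, len(payload)) + payload


def read_blocks(data, offset=0):
    """Walk the data block by block, using each length to find the next."""
    blocks = []
    while offset < len(data):
        block_type, length = BLOCK_HEADER.unpack_from(data, offset)
        offset += BLOCK_HEADER.size
        blocks.append((block_type, data[offset:offset + length]))
        offset += length
    return blocks


data = write_block(1, b"abc") + write_block(2, b"")
print(read_blocks(data))  # [(1, b'abc'), (2, b'')]
```

Because every block states its own length, a parser can skip over block types it does not understand.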
I hope that I have successfully explained how binary file formatting works. If you have any questions or see an issue with a part of this article, feel free to drop a comment below and I will fix it as soon as I can. Happy Hacking!
I am a 17-year-old, self-taught web developer trying to make a living while stuck with oppressive parents, and trying to find a way to pay for college while not being allowed to have a job. I would appreciate any donations.
Even if you don't donate, simply liking this post or sharing it with anyone that might find it useful is a huge help.