Binary File Formats Explained

#programming #binary #file #beginners

When I first began researching binary file formats, I was met with a complete absence of human-friendly explanations anywhere online. All of the resources that I came across were full of unfamiliar technical terms that made me feel like I was reading a textbook written for someone with twice my IQ. Hence, this article. This is a summary of my research, explained in English so that you can understand it without the trouble that I had. Now, without further delays, let's get into the learning!

All of Those Confusing Technical Terms

As I said, I encountered a number of technical terms in my research that I had never heard of before. These made little sense to me at the time, and understanding them took even more painful research. Here is a list of those terms, explained in friendly ways:

Binary - binary is a number system that has a base of 2. That means that the only digits it uses are 0 and 1.
Bit - a bit is the smallest unit of data in a computer, consisting of nothing more than a single binary digit (a 0 or 1).
Byte - a byte is the next unit of data in a computer. A byte consists of 8 individual bits.
Signed Integer - a signed integer is an integer (whole number) that is associated with a sign that declares whether the number is positive or negative. Basically, it is an integer with a + or - attached to it.
Unsigned Integer - an unsigned integer is an integer that is not associated with a sign (+ or -), and is always considered to be zero or a positive value.
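
To make these terms concrete, here is a minimal Python sketch (Python is just the language picked for the examples here, and the variable names are only illustrative). It writes one byte as 8 bits, then reads that same byte back as an unsigned and as a signed integer. It assumes the usual "two's complement" way of storing signed values, which is what Python's `int.from_bytes` uses.

```python
# One byte, written out as 8 binary digits (bits).
bits = "11111111"

# Interpret the digits as a base-2 number.
value = int(bits, 2)

# Store that value in a single byte.
# (byteorder does not matter for one byte, but older Python versions require it.)
raw = value.to_bytes(1, byteorder="big")

# The same byte can be read as unsigned or as signed (two's complement).
as_unsigned = int.from_bytes(raw, byteorder="big", signed=False)
as_signed = int.from_bytes(raw, byteorder="big", signed=True)

print(as_unsigned)  # 255
print(as_signed)    # -1
```

The bits in the byte never change; only the way the program chooses to interpret them does.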

What exactly is a file, anyway?

A file is a collection of bits that is stored on a computer's storage device (such as a hard drive or SSD). Files are generally separated into bytes, and measured by the number of bytes that they contain (Kilo*bytes*, Mega*bytes*, Giga*bytes*, etc.).
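
As a small illustration, the sketch below writes three bytes to a temporary file, asks the operating system for the file's size, and reads the bytes back. The file name `example.bin` is made up for this example.

```python
import os
import tempfile

data = bytes([72, 105, 33])  # three bytes

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.bin")  # hypothetical file name

    with open(path, "wb") as f:  # "wb" = write binary
        f.write(data)

    size = os.path.getsize(path)  # size of the file, in bytes

    with open(path, "rb") as f:  # "rb" = read binary
        contents = f.read()

print(size)            # 3
print(list(contents))  # [72, 105, 33]
```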

Data Types

Before we can truly begin working with binary file encoding, we need to understand the types of data that can be stored within a file. There are two main types of data in a file: integers and strings.

Integers

Integers are separated into two sub-types, signed and unsigned, as I explained at the beginning of the article.

Unsigned Integers, thanks to the lack of a sign, can store a maximum value roughly twice as large as that of a signed integer of the same size.

Note: unsigned integer is abbreviated as Uint in almost every programming situation.

Signed Integers (Ints) are pretty much exactly the same, aside from their reduced maximum value and their ability to store negative numbers.

Both Uints and Ints are found in different sizes in binary files, and are generally named based on how many bits they use to store their value. For example, Uint8 and Int8 both use 8 bits in the file, Uint16 and Int16 use 16 bits, and so on. These numbers have different value limits based on the number of bits that they use:

Type	Minimum Value	Maximum Value
`Uint8`	0	255
`Int8`	-128	127
`Uint16`	0	65,535
`Int16`	-32,768	32,767
`Uint32`	0	4,294,967,295
`Int32`	-2,147,483,648	2,147,483,647
`Uint64`	0	18,446,744,073,709,551,615
`Int64`	-9,223,372,036,854,775,808	9,223,372,036,854,775,807
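
These limits follow directly from the number of bits: n bits can form 2^n different patterns. The sketch below computes the range for any bit count. The helper name `int_range` is invented for this example, and the signed case assumes two's complement storage, where half of the patterns are used for negative numbers.

```python
def int_range(bits, signed):
    """Return (minimum, maximum) for an integer stored in `bits` bits."""
    if signed:
        # Two's complement: half of the bit patterns represent negative numbers.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Unsigned: every pattern represents zero or a positive number.
    return 0, 2 ** bits - 1

for bits in (8, 16, 32, 64):
    print(f"Uint{bits}", int_range(bits, signed=False))
    print(f"Int{bits}", int_range(bits, signed=True))
```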

These are not the only sizes that can be used, however, and you may sometimes encounter some oddball sizes. Here are a couple that I ran into:

Type	Minimum Value	Maximum Value
`Uint4`	0	15
`Uint24`	0	16,777,215
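
Most languages have no built-in type for these sizes, so they are usually read by hand. Here is a rough sketch of one way to do it in Python; the byte values are arbitrary examples, and the order in which the two Uint4 values are packed into a byte is an assumption that depends on the file format.

```python
# Three bytes holding one Uint24 value.
# (byteorder="big" means the first byte is the most significant one;
# byte order is covered in the Endianness section.)
raw24 = bytes([0x12, 0x34, 0x56])
uint24 = int.from_bytes(raw24, byteorder="big")
print(hex(uint24))  # 0x123456

# One byte holding two Uint4 values (often called "nibbles").
packed = 0xA7
high_nibble = packed >> 4    # top 4 bits
low_nibble = packed & 0x0F   # bottom 4 bits
print(high_nibble, low_nibble)  # 10 7
```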

Strings

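Strings, as you may already know, are ordered sequences of text characters. When a string is stored in a binary file, it is converted into a set of bytes. In the common UTF-8 encoding, each basic (ASCII) character, such as an English letter or digit, is stored in one byte as a Uint8 character ID, while other characters take between two and four bytes each.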
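
Here is a short sketch of that conversion using Python's built-in `str.encode` and `bytes.decode`. The sample text is arbitrary. Note that in UTF-8 only ASCII characters fit in a single byte; other characters need more.

```python
text = "Hi!"

# Convert the string into bytes using UTF-8.
encoded = text.encode("utf-8")
print(list(encoded))  # [72, 105, 33]

# Convert the bytes back into a string.
decoded = encoded.decode("utf-8")
print(decoded)  # Hi!

# A character outside ASCII (here "e" with an acute accent, U+00E9)
# takes more than one byte in UTF-8.
accent_length = len("\u00e9".encode("utf-8"))
print(accent_length)  # 2
```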

Endianness

Endianness is one of the more difficult concepts to grasp, so I'm going to take an example-based approach here. Let's say I have a variable that is a Uint16 with the value of 255. In order to store this variable in a binary file, it must first be converted into a set of bytes. Because our variable is a Uint16, taking up 16 bits of space, it requires two bytes in the file to store. But which of those two bytes comes first in the file? This is where endianness comes into play.

High and Low Bytes

Let's convert our Uint16 variable into binary, making sure to keep 16 digits: 0000000011111111. Notice how the number is split evenly between 0s and 1s. Each of those groups is a single byte. The byte on the left (the 0s) is the high byte, meaning it comes first in the normal order of the value. The byte on the right (the 1s) is called the low byte. As you go across the bytes from left to right, the bytes go from higher to lower "order". So, in this number, the left byte is of a higher order than the right byte.
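
The same split can be done in code with a bit shift and a mask. This is a minimal sketch with illustrative variable names:

```python
value = 255

# Show the value as 16 binary digits.
print(format(value, "016b"))  # 0000000011111111

# The high byte is the upper 8 bits, the low byte is the lower 8 bits.
high_byte = (value >> 8) & 0xFF
low_byte = value & 0xFF

print(high_byte, low_byte)  # 0 255
```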

Now, there are two types of "endianness" that a number can be encoded in: little endian and big endian. The name tells you which end of the number is stored first. A number that is encoded in little endian format will have the bytes ordered from lowest to highest, meaning that the lowest ("littlest") byte comes first. Numbers encoded in big endian format are ordered in the reverse of little endian numbers, meaning that the highest ("biggest") byte comes first instead.

Returning to our Uint16 variable, here is what the number will look like when in the file:

Endianness	Byte 0	Byte 1
little	11111111	00000000
big	00000000	11111111
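
Python's standard `struct` module can produce both byte orders for the Uint16 value 255. In its format strings, `<` selects little endian (low byte first), `>` selects big endian (high byte first), and `H` stands for an unsigned 16-bit integer. The last line of this sketch shows what happens when a parser guesses the wrong byte order.

```python
import struct

value = 255

little = struct.pack("<H", value)  # little endian Uint16
big = struct.pack(">H", value)     # big endian Uint16

print(little.hex())  # ff00
print(big.hex())     # 00ff

# Reading big endian bytes as if they were little endian gives a different number.
misread = struct.unpack("<H", big)[0]
print(misread)  # 65280
```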

Generic File Format

From here on, we can only discuss the general pattern of binary file formats, so keep in mind that any file format you see does not need to adhere to what is discussed here.

In most binary files, the data is generally split into two main types of sections: the file header and binary data blocks.

File Headers

A header in a binary file is generally a collection of bytes that have fixed positions at the beginning of the file. A file header typically contains data about the version of the file format that the file was encoded with, along with some other data about the contents of the file. Headers do not, however, have to be the same length in every file of the same type. Sometimes, the header needs to contain data that has a variable length, such as a string that is provided by the program. Strings are not always the same length, so in order to accommodate this, headers will typically define the total length of the header and a constant position for the string to start. Any program that parses the file will then start reading the string at the constant index and continue until it has read the full length of the header.
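
To illustrate, here is a sketch of a parser for a made-up header. Everything about this layout (the `DEMO` magic bytes, the field order, the little endian byte order, and the function names) is invented for the example and does not describe any real file format.

```python
import struct

# Hypothetical header layout (little endian), invented for this example:
#   bytes 0-3: magic bytes b"DEMO"
#   bytes 4-5: format version (Uint16)
#   bytes 6-7: total header length in bytes (Uint16)
#   bytes 8..: title string (UTF-8), running to the end of the header
FIXED_PART = struct.Struct("<4sHH")
TITLE_START = FIXED_PART.size  # constant position where the string starts (8)

def build_header(version, title):
    """Create header bytes for the hypothetical format."""
    title_bytes = title.encode("utf-8")
    header_length = TITLE_START + len(title_bytes)
    return FIXED_PART.pack(b"DEMO", version, header_length) + title_bytes

def parse_header(data):
    """Read the fixed fields, then the string up to the end of the header."""
    magic, version, header_length = FIXED_PART.unpack_from(data, 0)
    if magic != b"DEMO":
        raise ValueError("not a DEMO file")
    title = data[TITLE_START:header_length].decode("utf-8")
    return version, header_length, title

# A header followed by three bytes of other file data.
file_bytes = build_header(2, "hello") + b"\x01\x02\x03"
print(parse_header(file_bytes))  # (2, 13, 'hello')
```

Because the header stores its own total length, the parser knows exactly where the string ends and where the rest of the file begins.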

Data Blocks

The rest of a file, after the header, is generally devoted to data blocks. Each data block can be either a fixed or variable size, and will commonly have its own header that tells the program that is parsing the file how to use the data inside the block, as well as the length of the block if it is a variable size.
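
Here is a sketch of how a parser might walk through variable-size blocks. The block layout (a one-byte type followed by a Uint16 length) and the function names are assumptions made up for this example; real formats define their own.

```python
import struct

# Hypothetical block layout, invented for this example:
#   byte 0:    block type (Uint8)
#   bytes 1-2: payload length in bytes (Uint16, little endian)
#   bytes 3..: payload
BLOCK_HEADER = struct.Struct("<BH")

def make_block(block_type, payload):
    """Create one block: a small header followed by the payload."""
    return BLOCK_HEADER.pack(block_type, len(payload)) + payload

def read_blocks(data):
    """Read blocks one after another until the data runs out."""
    blocks = []
    offset = 0
    while offset < len(data):
        block_type, length = BLOCK_HEADER.unpack_from(data, offset)
        offset += BLOCK_HEADER.size
        blocks.append((block_type, data[offset:offset + length]))
        offset += length  # jump to the start of the next block
    return blocks

data = make_block(1, b"abc") + make_block(2, b"") + make_block(1, b"\x00\xff")
print(read_blocks(data))
# [(1, b'abc'), (2, b''), (1, b'\x00\xff')]
```

The length field is what lets the parser skip from one block to the next, even when it does not understand a block's type.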

I hope that I have successfully explained how binary file formatting works. If you have any questions or see an issue with a part of this article, feel free to drop a comment below and I will fix it as soon as I can. Happy Hacking!

Feeling Generous?

I am a 17-year-old, self-taught web developer trying to make a living while stuck with oppressive parents, and trying to find a way to pay for college while not being allowed to have a job. I would appreciate any donations.