DEV Community

Cover image for How Really File Compressors Actually Works
something something
something something

Posted on

How Really File Compressors Actually Works

You may have used zip files once in your life.If you have not it's a technique to smaller the size of a file before sending it or storing it.

In modern ages the need for zip files are less because of the faster internet and free storage capacity.But Method itself can be applied for many things in computer science such a you want store your database but also want to save some space then the methods can be handy

So today We are going to learn How actually the native file compressor works in our computers

For clarity zip is one of the method in many other options for file compression. We will walk-through them

There Are 2 kinds of file compression actually

  1. Lossy Compression
  2. Lossless Compression

Lossy reduce the size of your file but it losses some data which in some cases may not matter

example: You are watching a youtube video in 4K suddenly the internet speed drops it becomes blurry.Here it losses data while compressing.Note that this kind of compression is very essential for streaming type services

Lossless reduce the size of your file without losing your data. It uses some algorithm which we are going to discuss to compress and decompress your file

Example: When you compress a file to make it .zip or .tar.gz file

Lossy compression.

Note that based on the usecase the technique can change we are going to talk about the most common one

Let's analyze a image. If you understand how it works for image you can guess how it works for videos (stream of images)

Windows

So this is probably the most views image for computer guys. Now the portion I highlighted kind of share same or similar blue color. So for lossy compression it just replace any variant of it and shows only the fixed color. In this way the file don't need to contain extra information hence the size reduced

Lossless Compression

So Now comes the most interesting one lossless compression.
Note: There are many algorithm and format for lossless compression but we will discuss the very basic and common types of them

The most popular algorithm still now is deflate algortihm which is a mixture of LZ77 & Huffman coding

For simplicity I will explain the very basic form of this algorithm using an example

Suppose our file have following text only

"ABABABA"

Step 1: LZ77 Compression

LZ77 works by finding repeated substrings and replacing them with pointers to where those substrings appeared earlier. The pointer typically consists of:

Distance: How far back to look.
Length: How long the repeated substring is.
Let’s compress "ABABABA" using LZ77:

First three characters "ABA" are unique, so we leave them as they are.
The next two characters "BA" match the pattern found at the beginning. Instead of writing "BA" again, we can create a reference to the earlier "BA".
So, after LZ77, the string is transformed into:

ABA(3, 2)

This means:

"ABA" is written as it is.
The sequence "BA" is replaced by (3, 2), which refers to a match 3 positions back, 2 characters long.

Step 2: Huffman Coding

Now, we apply Huffman Coding to the result of LZ77. Huffman coding replaces frequently occurring symbols with shorter bit representations. To keep it simple, let's assume we're only compressing characters and using a fixed binary code for them.

Assume we have the following Huffman codes for the characters:

A → 0
B → 10
(3,2) → 110
So, after applying Huffman coding to the LZ77 output "ABA(3,2)", we get:

A → 0
B → 10
A → 0
(3,2) → 110
Now, the compressed output using DEFLATE would be:

0100110
(Note: This seems like it has not lessen the size much but in larger file it will reduce the length in rapid manner)

Final Compression Steps
LZ77 found the repeated substring "BA" and replaced it with a pointer.
Huffman Coding then further compressed the result by encoding the characters and pointer using shorter codes.

Decompression Process

Huffman Decoding: Read the bitstream and convert back to characters and LZ77 references.
LZ77 Decompression: Use the distance and length to reconstruct the original data.

So Now we have the basic algorithm idea of compression.
Now let's explore archieving

Tar vs Zip

So we know how the compression algorithm works but there is still some mystery in archieving them. Say for example we 500 photos Here different compressing software will act differently but we will focus on the most common one for windows(zip) and linux(tar)

  1. Tar (Linux)

Step 1:
Archiving (Pipe Creation):
Tar stands for "Tape Archive." It first bundles files together into a single archive (without compression) in a .tar file. This step preserves file structure and metadata (like creation date, permissions).

Step 2:
Compression:
After creating the archive, you can apply a compression algorithm we discusses but there are more modern version available (like gzip or 7z) to the archive. But the underlying basics are same

The compressed archive is named tar.gz, tar.bz2, or tar.7z, where the extension indicates the compression method used.

Advantage of Tar:

Efficiency: Since it archives first, it compresses all files as one block, leading to better compression for many small files.

Metadata Preservation: Tar keeps full file metadata like creation dates, permissions, and ownership intact.

Disadvantage of Tar:
Decompression is All-or-Nothing: You need to decompress the whole archive to access any file within it.

  1. Zip (Windows)

Step 1: Compress Each File:

Zip compresses each file individually using algorithms like Deflate. It creates a compressed file for each and then combines them into a single .zip file.

Step 2: Merge into Archive:
After compressing, zip combines these compressed files into a single archive file. The metadata (file names, structure) is stored alongside the compressed data.

Advantage of Zip:

Random Access: You can extract individual files from a .zip file without decompressing the whole archive.

Cross-Platform: .zip is natively supported on Windows, macOS, and many other systems.

Disadvantage of Zip:

Less Compression for Many Files: Since it compresses each file separately, it may not be as efficient for compressing many small files compared to tar.gz.

Metadata Support: It doesn’t preserve all Unix-specific metadata (like ownership, permissions) as well as tar.

Among searching 7z is the better and recommended

One interesting story to wrap up this article
The pioneer zip software pkzip which inspired all the zip softwares was created by phill katz who survived a very struggling life because of some bad habits and also accused of coping the source code of then popular another zip software company.You can follow an video about him in the reference.

Thanks For Reading

Ref:
Articles:
Deflate Algorithm
How File compression works
Types of file compression

Videos:

Probably Best Video Explanation
Kind of Talks About All Zip Files Types And The Algorithm They Use
Phil Katz
Tar Vs Zip
Lempel-zip
Zip in Linux
LZ Method

Top comments (0)