Pavel Zeman

The Pitfalls of Streamed ZIP Decompression: An In-Depth Analysis

Wikipedia says it clearly: "Tools that correctly read ZIP archives ... must not scan for entries from the top of the ZIP file". In other words, streamed decompression of ZIP archives is not possible. Still, there are libraries (e.g. unzipper) that support it. Is it safe to use them? Does it work? Let's analyze it.

Motivation

As a motivation and basis for further discussion, let's create a simple ZIP archive using the following script (you can clone it from my GitHub repository):

const JSZip = require("jszip");
const fs = require("fs");

// Create an inner ZIP archive with three small text files.
const zipItem = new JSZip();
for (let i = 0; i < 3; i++) {
  const content = Array.from({ length: 100 }, (_, j) => String.fromCharCode(j)).join("");
  zipItem.file(`dummy-${i}.txt`, content);
}

// Wrap two copies of the inner archive in the outer archive invalid.zip.
zipItem.generateAsync({ type: "nodebuffer", streamFiles: true })
  .then((content) => {
    const zip = new JSZip();
    for (let i = 0; i < 2; i++) {
      zip.file(`invalid-item-${i}.zip`, content);
    }
    zip.generateNodeStream({ type: "nodebuffer", streamFiles: true })
      .pipe(fs.createWriteStream("invalid.zip"));
  });

The script creates a ZIP archive invalid.zip, which contains nested ZIP archives in the following structure:

  • invalid-item-0.zip - 664 bytes
    • dummy-0.txt - 100 bytes
    • dummy-1.txt - 100 bytes
    • dummy-2.txt - 100 bytes
  • invalid-item-1.zip - 664 bytes
    • dummy-0.txt - 100 bytes
    • dummy-1.txt - 100 bytes
    • dummy-2.txt - 100 bytes

We can verify that the ZIP archive is valid using unzip:

user@localhost:~$ unzip -tl invalid.zip
Archive:  invalid.zip
    testing: invalid-item-0.zip       OK
    testing: invalid-item-1.zip       OK
No errors detected in compressed data of invalid.zip.

Now let's try to decompress the ZIP archive as a stream using the unzipper library and list all files inside it together with their sizes:

const fs = require("fs");
const unzipper = require("unzipper");

fs.createReadStream("invalid.zip")
  .pipe(unzipper.Parse())
  .on("entry", (entry) => {
    // Count the bytes of each entry as it streams through.
    let size = 0;
    entry.on("data", (chunk) => size += chunk.length);
    entry.on("end", () => console.log(`File: ${entry.path}, size: ${size} bytes`));
  })
  .on("finish", () => console.log("Finished processing all entries"))
  .on("error", (err) => console.error("Error during processing:", err));

The output is as follows:

File: invalid-item-0.zip, size: 141 bytes
File: dummy-1.txt, size: 100 bytes
File: dummy-2.txt, size: 100 bytes

We expect the output to contain just two files - invalid-item-0.zip and invalid-item-1.zip - each 664 bytes in size. Instead, we get three files. The first one has an invalid size, and the other two are read from inside invalid-item-0.zip, which makes no sense at all. Additionally, you can see that the finish event is never emitted, so its log record is missing.

If you modify the archive content, you may get other results, including various errors. And if you search the unzipper library's issues, you can find about ten of them mentioning similar problems.

All of these have the same root cause - zipped data cannot be reliably decompressed as a stream. The only way to reliably decompress zipped data is to save it to a file and then decompress the file.

We can use the same library and decompress the same ZIP archive as a file as follows:

const unzipper = require("unzipper");

(async () => {
  // Read the central directory first, then stream each entry's data.
  const directory = await unzipper.Open.file("invalid.zip");
  for (const file of directory.files) {
    let size = 0;
    const stream = file.stream();
    stream.on("data", (chunk) => size += chunk.length);
    await new Promise((resolve) => stream.on("finish", resolve));
    console.log(`File: ${file.path}, size: ${size} bytes`);
  }
  console.log("Finished processing all entries");
})();

Now we get the expected output:

File: invalid-item-0.zip, size: 664 bytes
File: invalid-item-1.zip, size: 664 bytes
Finished processing all entries
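
If the data arrives as a stream in the first place (for example an HTTP download), a safe pattern is to persist it to a temporary file first and only then open it with the file-based API. The following is a minimal sketch of that pattern; the temporary file name and the sourceStream parameter are placeholders, and error handling is omitted:

const fs = require("fs");
const os = require("os");
const path = require("path");
const { pipeline } = require("stream/promises");
const unzipper = require("unzipper");

// sourceStream is a placeholder for wherever the data comes from
// (an HTTP response, a queue consumer, ...).
async function unzipFromStream(sourceStream) {
  const tmpFile = path.join(os.tmpdir(), `download-${Date.now()}.zip`);
  // 1. Persist the whole stream to a temporary file.
  await pipeline(sourceStream, fs.createWriteStream(tmpFile));
  // 2. Decompress it reliably using the file-based API; uncompressedSize
  //    comes straight from the central directory.
  const directory = await unzipper.Open.file(tmpFile);
  for (const file of directory.files) {
    console.log(`File: ${file.path}, size: ${file.uncompressedSize} bytes`);
  }
  // 3. Clean up the temporary file.
  await fs.promises.unlink(tmpFile);
}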

ZIP archive structure

In order to understand the problem, let's analyze the structure of a ZIP archive. A complete description is available on Wikipedia, and all the details are in PKWARE's ZIP File Format Specification. This text contains only the brief summary needed to understand the problem at hand.

The ZIP archive (usually) starts with a series of entries, each of them representing a single stored file. Each entry consists of the following items:

  • Local file header - Contains a signature (4-byte constant), the file name, the compressed and uncompressed sizes, and other metadata. The compressed and uncompressed sizes are optional and can be set to 0 when they are not known during compression. This happens when the compression is streamed and the sizes are not known in advance (see the sketch after this list).
  • Actual compressed data - The byte stream containing the compressed data.
  • Optional Data descriptor - Contains a signature (4-byte constant), the compressed and uncompressed sizes and other metadata. It is present only when the compression is streamed, and in that case the sizes are mandatory, because by the time the descriptor is written, they are already known.
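
For illustration, this is roughly how the fixed, 30-byte part of a Local file header can be decoded; it is only a sketch, with field offsets taken from the PKWARE specification (all multi-byte values are little-endian, file names are assumed to be UTF-8, and the buffer is assumed to contain the header together with the file name):

// Decode the fixed part of a Local file header (little-endian fields).
function parseLocalFileHeader(buf) {
  if (buf.readUInt32LE(0) !== 0x04034b50) throw new Error("Not a Local file header");
  const flags = buf.readUInt16LE(6);            // general purpose bit flags
  const fileNameLength = buf.readUInt16LE(26);
  return {
    compressionMethod: buf.readUInt16LE(8),     // 0 = stored, 8 = deflate
    crc32: buf.readUInt32LE(14),
    compressedSize: buf.readUInt32LE(18),       // may be 0 for streamed archives
    uncompressedSize: buf.readUInt32LE(22),     // may be 0 for streamed archives
    hasDataDescriptor: (flags & 0x0008) !== 0,  // bit 3: sizes follow in a Data descriptor
    fileName: buf.toString("utf8", 30, 30 + fileNameLength),
  };
}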

At the end of the archive, there is a central directory. It consists of the following items:

  • Central directory file header - An extension of the Local file header for each stored file. It always contains the real compressed and uncompressed sizes, and it also contains the offset of the file's Local file header within the archive.
  • End of central directory record - A data structure that must always be present at the very end of the archive. Among other things, it contains the offset of the start of the central directory.

The End of central directory record is the only data structure that has a guaranteed position in the archive - it must be at the very end. The positions of all other data structures are not guaranteed, although the specification says they should appear in the order described here, with no gaps between them.

The whole structure is summarized in the following diagram. Again, please note that archives should be created with the data structures in the order shown in the diagram, but this is not guaranteed. For example, an archive with gaps between entries is perfectly valid and must be decompressed correctly.

ZIP file structure

ZIP archive decompression from a file

To decompress a ZIP archive, we just need to follow the arrows in the previous diagram. You may notice that all the arrows go bottom-up, which means that the archive needs to be read from the end to the beginning, as follows:

  1. We read the End of central directory record. It is always located at the very end of the archive. Among other things, it contains the offset of the start of the central directory.
  2. We scan the central directory to get the list of all files from the Central directory file headers. This gives us file names, offsets within the archive, and compressed and uncompressed sizes.
  3. For each file, we read its compressed data and decompress it using the ZIP decompression algorithm (details about the compression method are stored in the Central directory file header as well).

To summarize, the only reliable way to read a ZIP archive is to read it from the end to the beginning. This makes it impossible to decompress zipped content as a stream, because a stream can only be read from the beginning to the end.
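
The first two steps are easy to sketch in code. The snippet below reads invalid.zip from the end and lists its entries from the central directory; it is only a sketch - it ignores ZIP64 extensions and archive comments that could contain the signature, and it loads the whole file into memory for simplicity:

const fs = require("fs");

// List all files in a ZIP archive by reading it from the end, as described above.
function listZipEntries(fileName) {
  const buf = fs.readFileSync(fileName);

  // Step 1: find the End of central directory record (signature 0x06054b50)
  // by scanning backwards from the end of the file.
  let eocd = -1;
  for (let i = buf.length - 22; i >= 0; i--) {
    if (buf.readUInt32LE(i) === 0x06054b50) { eocd = i; break; }
  }
  if (eocd < 0) throw new Error("End of central directory record not found");

  const totalEntries = buf.readUInt16LE(eocd + 10);
  let offset = buf.readUInt32LE(eocd + 16); // offset of the central directory

  // Step 2: walk the Central directory file headers (signature 0x02014b50).
  const entries = [];
  for (let i = 0; i < totalEntries; i++) {
    if (buf.readUInt32LE(offset) !== 0x02014b50) throw new Error("Corrupted central directory");
    const fileNameLength = buf.readUInt16LE(offset + 28);
    const extraLength = buf.readUInt16LE(offset + 30);
    const commentLength = buf.readUInt16LE(offset + 32);
    entries.push({
      path: buf.toString("utf8", offset + 46, offset + 46 + fileNameLength),
      compressedSize: buf.readUInt32LE(offset + 20),
      uncompressedSize: buf.readUInt32LE(offset + 24),
      localHeaderOffset: buf.readUInt32LE(offset + 42),
    });
    offset += 46 + fileNameLength + extraLength + commentLength;
  }
  return entries;
}

console.log(listZipEntries("invalid.zip"));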

ZIP archive decompression from a stream

But wait. The unzipper library is actually able to decompress a ZIP archive from a stream. How is it implemented?

The library is based on a simple assumption: the ZIP entries start at the beginning of the archive and follow one another with no gaps between them. This is a fairly safe assumption, since the ZIP file format specification states that all tools should create ZIP archives exactly this way.

With this assumption in mind, we can design an alternative algorithm to decompress a ZIP archive from a stream (a simplified sketch follows the list):

  1. We read the Local file header of a ZIP entry (the first one is at the beginning of the stream, the other ones follow without gaps). It contains the file name as well as its compressed and uncompressed sizes.
  2. Based on the compressed size from the previous step, we read the compressed data and decompress it using the ZIP decompression algorithm.
  3. If there is a Data descriptor present, we can either skip it or use it to verify the uncompressed data checksum (CRC-32).
  4. We continue from the first step until we reach the start of the central directory. This can easily be recognized by the signature of the Central directory file header (or of the End of central directory record, if the archive contains no files).
  5. We skip all central directory entries as well as the End of central directory record. These are not needed anymore.
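
Here is a simplified sketch of this forward-only approach for the easy case in which the sizes in the Local file header are known (i.e. the archive was not created as a stream) and no Data descriptors are present. It reads from a buffer, but only ever moves forward, exactly like a stream parser would:

// Walk the ZIP entries from the beginning, assuming they start at offset 0,
// follow each other with no gaps, and have their sizes in the Local file header.
function listEntriesForward(buf) {
  const entries = [];
  let offset = 0;
  // 0x04034b50 is the Local file header signature.
  while (offset + 4 <= buf.length && buf.readUInt32LE(offset) === 0x04034b50) {
    const compressedSize = buf.readUInt32LE(offset + 18);
    const fileNameLength = buf.readUInt16LE(offset + 26);
    const extraLength = buf.readUInt16LE(offset + 28);
    entries.push({
      path: buf.toString("utf8", offset + 30, offset + 30 + fileNameLength),
      compressedSize,
    });
    // Skip the header, the file name, the extra field and the compressed data.
    offset += 30 + fileNameLength + extraLength + compressedSize;
  }
  // The loop stops at the first Central directory file header (0x02014b50)
  // or End of central directory record; everything after that is ignored.
  return entries;
}

This sketch only handles the case where the sizes are present in the header; as discussed next, that is not always true.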

This algorithm works, but there is a catch. The sizes in the Local file header are optional, and they are not set when the ZIP archive is created as a stream. As a result, we do not know the size of the compressed data. How does the library solve this? Let's check its source code. The relevant part is located in parse.js at lines 181 through 187:

if (fileSizeKnown) {
  entry.size = vars.uncompressedSize;
  eof = vars.compressedSize;
} else {
  eof = Buffer.alloc(4);
  eof.writeUInt32LE(0x08074b50, 0);
}

We can see that if the compressed size is known, it is used. Otherwise, the library searches for the end of the compressed data using a 4-byte signature (0x08074b50). This is the signature of the Data descriptor, which is located immediately after the compressed data. Searching for the end of the compressed data by its signature may work, but it fails when the signature happens to appear in the compressed data itself. Since we can hardly assume anything about the compressed data, I consider this approach too risky.

To summarize, the library makes the following assumptions in order to decompress a ZIP archive from a stream:

  1. The ZIP entries start at the beginning of the archive and follow one by one with no gaps between them.
  2. One of the following is true:
    1. The archive was not created as a stream (i.e. all the sizes in the Local file header are known).
    2. The compressed data does not contain the signature of the Data descriptor.

Based on these assumptions, the algorithm to decompress a ZIP archive from a stream can be refined as follows (this is the algorithm used by the library):

  1. We read the Local file header of a ZIP entry (the first one is at the beginning of the stream, the others follow without gaps). It contains the file name as well as the compressed and uncompressed sizes.
  2. If the compressed size from the previous step is known, we read exactly that many bytes of compressed data. If it is unknown, we read the compressed data until we find the Data descriptor signature.
  3. We decompress the compressed data using the ZIP decompression algorithm.
  4. If there is a Data descriptor present, we just discard it.
  5. We continue from the first step until we reach the start of the central directory. This can easily be recognized by the signature of the Central directory file header (or of the End of central directory record, if the archive contains no files).
  6. We skip all central directory entries as well as the End of central directory record. These are not needed anymore.

invalid.zip archive analysis

Now it should be clear why our invalid.zip archive cannot be decompressed as a stream. The first assumption is satisfied, but the second one is not: the invalid.zip archive is intentionally created as a stream (notice streamFiles: true in the source code), and its compressed data contains the signature of the Data descriptor.

The last point may not be obvious at first sight, so let's analyze it in more detail. First of all, notice that no compression level is specified in the source code when the invalid.zip archive is created. As a result, compression level 0 (i.e. no compression) is used by default. This means that the contents of the invalid-item-0.zip and invalid-item-1.zip files are simply copied into the invalid.zip archive, which leads to the structure shown in the following diagram.

invalid.zip structure

This structure is then processed as follows:

  1. We read the Local file header of the first file in the archive, i.e. invalid-item-0.zip.
  2. The Local file header does not contain the data size, so we search for the signature of the Data descriptor.
  3. The first Data descriptor we find belongs to the first file inside invalid-item-0.zip, i.e. dummy-0.txt. This is not the Data descriptor we want, but we have no way of knowing that, so we finish processing the first file.
  4. We read the Data descriptor and discard it (the library does not use it in any way).
  5. We continue with the next Local file header, which belongs to dummy-1.txt. Please note that we are now processing a file that does not exist in the invalid.zip archive; it exists only inside invalid-item-0.zip.
  6. We search for the next Data descriptor, find it, and finish processing dummy-1.txt.
  7. In the same way, we process dummy-2.txt.
  8. Now we identify a Central directory file header by its signature. As far as the library can tell, this means we have reached the end of the archive - in reality, it is the central directory of the inner invalid-item-0.zip.
  9. We drain all the following Central directory file headers until the End of central directory record is reached.
  10. The End of central directory record denotes the very end of the archive, so we stop processing here without any error - but without processing invalid-item-1.zip at all.
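
You can confirm that the Data descriptor signature really does occur inside the entry data with a quick check that counts its occurrences anywhere in invalid.zip (a rough check only - it also counts the two legitimate Data descriptors of the outer entries):

const fs = require("fs");

// The Data descriptor signature 0x08074b50 as it appears in the byte stream
// (little-endian, i.e. the characters "PK\x07\x08").
const signature = Buffer.from([0x50, 0x4b, 0x07, 0x08]);

const content = fs.readFileSync("invalid.zip");
let count = 0;
let index = content.indexOf(signature);
while (index !== -1) {
  count++;
  index = content.indexOf(signature, index + 1);
}
// The outer archive has only two entries of its own, so any count above two
// means the signature also occurs inside the stored entry data.
console.log(`Data descriptor signature found ${count} times`);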

Trying to improve it

Based on the previous text, it should be clear that streamed decompression of zipped data is a bad idea and can never be fully reliable. The ZIP file format is simply not designed for it. Still, it is tempting to use it, since it works in many cases. So how could the unzipper library be improved to provide better results than those presented in this text?

I would suggest the following improvements:

  • Documentation - Add a big red warning to the documentation stating that streamed decompression of zipped data is not reliable and that users of this feature do so at their own risk.
  • Verify the compressed data size based on the Data descriptor - Currently, the library does not use the Data descriptor, even though it contains useful information - among other things, the real compressed data size. We can compare it with the amount of data already processed. If there is a difference, we can either fail with a reasonable error message stating that the archive cannot be decompressed as a stream, or continue processing the archive until all the compressed data has been read (see the sketch after this list).
  • Verify the CRC-32 of the uncompressed data based on the Data descriptor - Same as the previous suggestion, but in this case the CRC-32 of the uncompressed data is verified.
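
To illustrate the last two suggestions, the checks themselves are cheap once the Data descriptor has been read. The sketch below assumes the caller tracks how many compressed bytes were consumed and has the decompressed data at hand; zlib.crc32 requires a recent Node.js version:

const zlib = require("zlib");

// Verify a Data descriptor (including its leading signature) against what was
// actually read and decompressed. Throws when the entry was not parsed
// correctly from the stream.
function verifyDataDescriptor(descriptor, compressedBytesConsumed, uncompressedData) {
  const expectedCrc = descriptor.readUInt32LE(4);
  const expectedCompressedSize = descriptor.readUInt32LE(8);
  const expectedUncompressedSize = descriptor.readUInt32LE(12);
  if (expectedCompressedSize !== compressedBytesConsumed) {
    throw new Error("Compressed size mismatch - the archive cannot be decompressed as a stream");
  }
  if (expectedUncompressedSize !== uncompressedData.length) {
    throw new Error("Uncompressed size mismatch - the archive cannot be decompressed as a stream");
  }
  // zlib.crc32 is available in recent Node.js versions; older ones need a
  // third-party CRC-32 implementation.
  if (zlib.crc32(uncompressedData) !== expectedCrc) {
    throw new Error("CRC-32 mismatch - the decompressed data is corrupted");
  }
}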

Alternative libraries

There are other libraries that can be used to decompress ZIP archives. Their support for streamed decompression is as follows:

  • adm-zip - Streamed decompression is not supported. The API only accepts a file name as its input.
  • decompress - Streamed decompression is not supported. The API only accepts a file name or a buffer as its input.
  • decompress-zip - Streamed decompression is not supported. The API only accepts a file name as its input.
  • extract-zip - Based on yauzl, so streamed decompression is not supported.
  • jszip - Streamed decompression is not supported. The API only accepts a file name or in-memory data as its input.
  • node-stream-zip - Streamed decompression is not supported. The API only accepts a file name as its input.
  • unzip-stream - Based on unzipper, so streamed decompression is supported. It works better than unzipper, because it implements the second improvement mentioned above. Thanks to that, it is even able to successfully decompress the invalid.zip file as a stream. However, the documentation clearly warns that streamed decompression of a ZIP archive is not fully reliable and that the library may fail in some cases.
  • yauzl - Streamed decompression is not supported. The documentation explicitly states that this is intentional, because streamed decompression of a ZIP archive is not possible.
  • zip-lib - Streamed decompression is not supported. The API only accepts a file name as its input.

Key takeaways

  • The ZIP file format is not designed for streamed decompression, because the archive must be read from the end to the beginning.
  • Avoid streamed decompression of zipped data as much as you can. Always prefer to store the stream in a file or a memory buffer and decompress it from there.
  • Streamed decompression of zipped data is reliable only when you can make certain assumptions about the archive you are decompressing - and this is rarely the case.
  • The unzipper library can be improved so that it at least fails with a reasonable error message when the archive cannot be decompressed as a stream.
  • When streamed decompression is required, consider using unzip-stream, which provides better results than unzipper.
