ndesmic
Writing a simple browser zip file decompressor with CompressionStreams

Zips are a common and ubiquitous format for compression. You've probably downloaded them to get multiple files packed together, which is great for reducing bandwidth. In fact, you might even use formats that you didn't know were zips. A classic example is Microsoft Office files: try changing the extension to "zip" and see what happens when you open it. They are just zip archives! However, zip has traditionally been a bit hard to work with in the browser. There are certainly good libraries out there, but as good front-end engineers we want to minimize our dependencies and the amount of code we ship. This was traditionally perplexing because the browser already needs to support the DEFLATE algorithm used by zip: it's what decompresses server responses negotiated via the Accept-Encoding request header (and signaled back via Content-Encoding). So why couldn't devs have access to that? Well, as of making this post it is possible in some environments using CompressionStream and DecompressionStream. However, we need to do a little bit of work to deal with the file format.

Zip, Gzip, Zlib, Deflate

You'll likely see these terms used interchangeably, and this makes sense because they all refer to data that has been compressed with the DEFLATE algorithm. However, they are slightly different. "Gzip" and "zlib" are both data formats that wrap compressed data with some headers and a checksum. "zlib" also refers to a C library, one of the most popular implementations of DEFLATE, from which the zlib data format descends. "Gzip" can also refer to a file format containing a gzip stream, which usually uses the file extension .gz.

DEFLATE is the algorithm itself, which combines Huffman coding and LZ77. Web devs might also have noticed the Accept-Encoding: deflate value. Confusingly, this is actually the "zlib" format.

"Zip" is a file format and what we are most interested in here. Zips can take multiple files, put them together, and compress them using DEFLATE. While the spec technically allows other algorithms, only DEFLATE ever seems to be used.

For a more in-depth look I suggest this post: https://dev.to/biellls/compression-clearing-the-confusion-on-zip-gzip-zlib-and-deflate-15g1

A couple utilities

Streams aren't the easiest thing to work with, especially if we're looking at whole files and ArrayBuffers. For this I've created some helpers to turn a stream into a Blob and an ArrayBuffer into a stream. I'm also adding a few helper functions for reading different data types out of a DataView and for downloading buffers and blobs for verification.

//utils.js
export function bufferToStream(arrayBuffer) {
    return new ReadableStream({
        start(controller) {
            controller.enqueue(arrayBuffer);
            controller.close();
        }
    });
}

export function downloadArrayBuffer(arrayBuffer, fileName) {
    const blob = new Blob([arrayBuffer]);
    downloadBlob(blob, fileName);
}
export function downloadBlob(blob, fileName = "download.x"){
    const blobUrl = window.URL.createObjectURL(blob);
    const link = document.createElement('a');
    link.href = blobUrl;
    link.setAttribute('download', fileName);
    link.click();
    window.URL.revokeObjectURL(blobUrl);
}

export async function streamToBlob(stream, type) {
    const reader = stream.getReader();
    let done = false;
    const data = [];

    while (!done) {
        const result = await reader.read();
        done = result.done;
        if (result.value) {
            data.push(result.value);
        }
    }

    return new Blob(data, { type });
}

export function unixTimestampToDate(timestamp){
    return new Date(timestamp * 1000);
}

export function readString(dataView, offset, length) {
    const str = [];
    for (let i = 0; i < length; i++) {
        str.push(String.fromCharCode(dataView.getUint8(offset + i)));
    }
    return str.join("");
}

export function readTerminatedString(dataView, offset) {
    const str = [];
    let i = 0;
    let val = dataView.getUint8(offset);

    while (val !== 0) {
        str.push(String.fromCharCode(val));
        i++;
        val = dataView.getUint8(offset + i);
    }
    return str.join("");
}

export function readBytes(dataView, offset, length) {
    const bytes = [];
    for (let i = 0; i < length; i++) {
        bytes.push(dataView.getUint8(offset + i));
    }
    return bytes;
}

export function readFlags(dataView, offset, flagLabels) {
    const flags = {};

    for (let i = 0; i < flagLabels.length; i++) {
        const byte = dataView.getUint8(offset + Math.floor(i / 8)); //advance one byte for every 8 flags
        flags[flagLabels[i]] = (((1 << (i % 8)) & byte) >> (i % 8)) === 1;
    }

    return flags;
}
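As a quick sanity check of the flag reader (the function is inlined here from utils.js so the snippet stands alone), the byte 0b00001010 should set the second and fourth labels:

```javascript
// Same readFlags as in utils.js, repeated for a self-contained demo.
function readFlags(dataView, offset, flagLabels) {
    const flags = {};
    for (let i = 0; i < flagLabels.length; i++) {
        const byte = dataView.getUint8(offset + Math.floor(i / 8)); //advance one byte per 8 flags
        flags[flagLabels[i]] = (((1 << (i % 8)) & byte) >> (i % 8)) === 1;
    }
    return flags;
}

// Bit 1 and bit 3 are set in 0b00001010, so "fhcrc" and "fname" come back true.
const view = new DataView(new Uint8Array([0b00001010]).buffer);
const flags = readFlags(view, 0, ["ftext", "fhcrc", "fextra", "fname"]);
console.log(flags); // { ftext: false, fhcrc: true, fextra: false, fname: true }
```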

Gzip file

On Unix-based systems we often see .gz files (or even .tar.gz) for compressed archives of data. This is the gzip format and it's basically a DEFLATE stream with some extra data.

Note: I'll be working with DataViews on ArrayBuffers, so you just need to get your data into an ArrayBuffer of some sort.

If you don't care about the gzip file details you can run the whole thing through a DecompressionStream with type gzip:

const blob = await streamToBlob(bufferToStream(arrayBuffer).pipeThrough(new DecompressionStream("gzip")));

Easy! But if you actually want to read it we start at the beginning of the file:

1) Signature (2 bytes) - Should be 0x1f,0x8b
2) Compression Method (1 byte) - Always 8 (DEFLATE)
3) Flags (1 byte) - A set of 5 bit flags:
   1) ftext - the file is text (can be ignored)
   2) fhcrc - a CRC16 header checksum is present (probably can be ignored)
   3) fextra - indicates the presence of extra data (important)
   4) fname - the original file name is present (important)
   5) fcomment - a comment field is present (important)
   The rest are reserved and can be discarded.
4) Modification time (4 bytes) - seconds since UTC 1970-01-01
5) Extra flags (1 byte) - Either 2 or 4 for best or fastest compression (can be ignored)
6) OS (1 byte) - The OS the file was created on (can be ignored)

This header should be 10 bytes total. Once we have that data we can use the flag values to see if there are extra fields we need to check.

1) If fextra, then we read a 16-bit value indicating the length of the field, followed by that many bytes. This data isn't really specified so it can be discarded and is unlikely to be used.
2) If fname, then we read a string ended with a null terminator (0), which represents the original file name.
3) If fcomment, then we read a string ended with a null terminator. This is just comments and can be ignored.
4) If fhcrc, we read a 16-bit value representing the CRC16 checksum. Unclear to me if this is actually used in practice.

These flag checks cannot be ignored because the data will start in a different place if they are present.

After that comes the actual DEFLATE bitstream. Unless you have the whole file you won't know how big it is. If you do have the whole file, it runs up until the last 8 bytes.

Finally there are 8 bytes of trailer:

1) CRC32 (4 bytes) - The checksum of the uncompressed data
2) Original Size (4 bytes) - The original uncompressed file size

You can use these two values to check that decompression went okay. But for simplicity we can just skip them (DecompressionStream will error if there was a problem anyway).

Here's some code:

import { bufferToStream, streamToBlob, unixTimestampToDate, readBytes, readString, readTerminatedString, readFlags } from "./utils.js";

export class GZip {
    #dataView;
    #index = 0;
    #crc16;

    header;
    fileName;
    comment;

    constructor(arrayBuffer) {
        this.#dataView = new DataView(arrayBuffer);
        this.read();
    }
    read(){
        this.header = {
            signature: readString(this.#dataView, 0, 2), //should be 0x1f,0x8b
            compressionMethod: this.#dataView.getUint8(2), //should be 0x08 (DEFLATE)
            flags: readFlags(this.#dataView, 3, ["ftext", "fhcrc", "fextra", "fname", "fcomment", "reserved1", "reserved2", "reserved3"]), //need to figure out if we read extra data in stream
            modificationTime: unixTimestampToDate(this.#dataView.getUint32(4, true)),
            extraFlags: this.#dataView.getUint8(8), //not important but is either 2 (best compression) or 4 (fast)
            os: this.#dataView.getUint8(9), //not useful but usually 0 on windows, 3 Unix, 7 mac
        };
        this.#index = 10;

        if(this.header.flags.fextra){
            const extraLength = this.#dataView.getUint16(this.#index, true);
            this.extra = readBytes(this.#dataView, this.#index + 2, extraLength);
            this.#index += extraLength + 2;
        } else {
            this.extra = [];
        }

        if(this.header.flags.fname){
            this.fileName = readTerminatedString(this.#dataView, this.#index);
            this.#index += this.fileName.length + 1; //+1 for null terminator
        } else {
            this.fileName = "";
        }

        if(this.header.flags.fcomment){
            this.comment = readTerminatedString(this.#dataView, this.#index);
            this.#index += this.comment.length + 1; //+1 for null terminator
        } else {
            this.comment = "";
        }

        if(this.header.flags.fhcrc){
            this.#crc16 = this.#dataView.getUint16(this.#index, true);
            this.#index += 2;
        } else {
            this.#crc16 = null;
        }

        //footer
        this.footer = {
            crc: this.#dataView.getUint32(this.#dataView.byteLength - 8, true),
            uncompressedSize: this.#dataView.getUint32(this.#dataView.byteLength - 4, true),
        }
    }
    extract(){
        //If you don't care about the file data just do this:
        //return streamToBlob(bufferToStream(this.#dataView.buffer).pipeThrough(new DecompressionStream("gzip")));
        //Otherwise slice where the data starts to the last 8 bytes
        return streamToBlob(bufferToStream(this.#dataView.buffer.slice(this.#index, this.#dataView.byteLength - 8)).pipeThrough(new DecompressionStream("deflate-raw")));
    }
}

The DEFLATE bitstream can be decoded by running it through a DecompressionStream with type deflate-raw. I've separated extraction into its own method, mainly to maintain parity with the zip file reader.

extract(){
    return streamToBlob(bufferToStream(this.#dataView.buffer.slice(this.#index, this.#dataView.byteLength - 8)).pipeThrough(new DecompressionStream("deflate-raw")));
}

This should illustrate the difference between the gzip and deflate-raw types. It's just extra fields to parse.

For further improvement we could actually verify the checksums. Also, a .gz file can technically contain multiple members; we're assuming there's just one because it's much more common to use .tar.gz for bundles, but it doesn't have to be.
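If we did want to verify, CRC32 (the IEEE polynomial, used by both gzip and zip) can be sketched in a few lines. This is a minimal, unoptimized bit-by-bit version; real implementations typically use a lookup table:

```javascript
// Minimal CRC32 (reflected, polynomial 0xEDB88320) over an iterable of bytes.
function crc32(bytes) {
    let crc = 0xFFFFFFFF;
    for (const byte of bytes) {
        crc ^= byte;
        for (let i = 0; i < 8; i++) {
            // if the low bit is set, shift and xor with the polynomial
            crc = (crc >>> 1) ^ (0xEDB88320 & -(crc & 1));
        }
    }
    return (crc ^ 0xFFFFFFFF) >>> 0; //>>> 0 forces an unsigned 32-bit result
}

// The standard check value: CRC32 of "123456789" is 0xCBF43926.
console.log(crc32(new TextEncoder().encode("123456789")).toString(16)); // "cbf43926"
```

Running the decompressed bytes through this and comparing against the stored trailer value would catch corruption that DEFLATE itself doesn't.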

Zip Archive

Zip was originally created by a company called PKWARE (founded by Phil Katz). I'm not exactly sure why, but the format proved so popular it wound up in a bunch of places as the de facto compression format, at least on Windows.

The Zip format combines multiple files into an archive, so internally it actually has a directory structure. You've probably noticed this when extracting zip files before. The way it keeps track of the files is with something called the "central directory". This is a series of data records at the very end of the archive that can be used to figure out where each file is inside the archive. This matters because when the format was created disk space was limited, so archives could span multiple disks, and this layout lets you see all the files in the archive with just the final disk. Each entry in the central directory gives an offset and a disk number to locate the file (we can ignore the disk number, and pretty much the whole central directory, these days because everything fits on one drive). Preceding all that is the actual list of files with their compressed data.

The way the archive is structured is:
1) A series of local file entries with signature 0x04034b50. Each file header is immediately followed by the compressed data (the header tells you how long it is).
2) A series of central directory entries with signature 0x02014b50.
3) An "end of central directory" block with signature 0x06054b50 that tells you that you are at the end.

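One thing that trips people up: these signatures are stored little-endian, so in a hex dump a local file entry actually begins with the bytes "PK\x03\x04". A quick sketch:

```javascript
// The first 4 bytes of a zip are the local file signature, "P", "K", 0x03, 0x04.
// Read little-endian, they come back as the documented value 0x04034b50.
const bytes = new Uint8Array([0x50, 0x4b, 0x03, 0x04]);
const view = new DataView(bytes.buffer);
console.log(view.getUint32(0, true).toString(16)); // "4034b50"
```

This is why all the multi-byte reads below pass `true` (little-endian) to the DataView getters.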

Reading the zip archive

Reading isn't too hard. We just need to grab chunks from the file using the headers to tell us what type and how long.


We start with the local file entries:
1) Signature (4 bytes) - Should be 0x04034b50 for local file entries.
2) Version (2 bytes) - The version of zip needed to extract (I don't think this is updated anymore...)
3) General Purpose flags (2 bytes) - These vary depending on the compression used, but for our purposes (DEFLATE):
   1) bit 0 - the entry is encrypted
   2) bits 1 and 2 - speed vs size flags used (don't care)
   3) bit 3 - the CRC32 and sizes are unknown. If set, we'll get some data at the end of the entry's bitstream and those values will be zero in the local file header.
   4) bits 4-15 - Some stuff I don't think is used anymore involving encryption, plus some PKWARE-specific things.
4) Compression Method (2 bytes) - Either 0 for uncompressed or 8 for DEFLATE. Others are theoretically available but no longer used.
5) Last Modified Time (2 bytes) - The time of modification in MS-DOS format.
6) Last Modified Date (2 bytes) - The date of modification in MS-DOS format.
7) CRC32 (4 bytes) - The checksum to validate decompression.
8) Compressed Size (4 bytes) - How big the compressed data is, so we know how far to read.
9) Uncompressed Size (4 bytes) - The size of the original data, mostly to verify decompression was correct.
10) Filename Length (2 bytes) - How long the filename field is in bytes.
11) Extra Length (2 bytes) - The length of the extra field in bytes.

Following the end of these fields we read the filename and then the extra field, using the lengths from above. After this the bitstream data starts, and it should be compressed-size bytes long.
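The MS-DOS date and time fields are bit-packed 16-bit values rather than Unix timestamps: the date packs the day into bits 0-4, the month into bits 5-8, and years-since-1980 into bits 9-15; the time packs seconds/2 into bits 0-4, minutes into bits 5-10, and hours into bits 11-15. A hypothetical helper to decode them might look like this:

```javascript
// Decode MS-DOS packed date/time into a JS Date (local time).
function dosDateTimeToDate(dosDate, dosTime) {
    return new Date(
        ((dosDate >> 9) & 0x7F) + 1980, // years since 1980
        ((dosDate >> 5) & 0x0F) - 1,    // month, 1-based in DOS, 0-based for Date
        dosDate & 0x1F,                 // day of month
        (dosTime >> 11) & 0x1F,         // hours
        (dosTime >> 5) & 0x3F,          // minutes
        (dosTime & 0x1F) * 2            // seconds, stored at 2-second resolution
    );
}

// (40 << 9) | (1 << 5) | 1 encodes 2020-01-01, (12 << 11) | (30 << 5) | 5 encodes 12:30:10.
console.log(dosDateTimeToDate((40 << 9) | (1 << 5) | 1, (12 << 11) | (30 << 5) | 5));
```

Note the 2-second resolution on seconds, so round-tripping a timestamp through zip can lose a second.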

Since we don't really need the central directory I'll omit that part here, but both Wikipedia and the official documentation are pretty readable if you want those fields.

I'll make a class to do this:

import { bufferToStream, streamToBlob, readString } from "./utils.js";

export class Zip {
    #dataView;
    #index = 0;
    #localFiles = [];
    #centralDirectories = [];
    #endOfCentralDirectory;

    constructor(arrayBuffer){
        this.#dataView = new DataView(arrayBuffer);
        this.read();
    }
    async extract(entry) {
        const buffer = this.#dataView.buffer.slice(entry.startsAt, entry.startsAt + entry.compressedSize);

        if(entry.compressionMethod === 0x00){
            return new Blob([buffer]);
        } else if(entry.compressionMethod === 0x08) {
            const decompressionStream = new DecompressionStream("deflate-raw");
            const stream = bufferToStream(buffer);
            const readable = stream.pipeThrough(decompressionStream);
            return await streamToBlob(readable);
        }
    }
    read(){
        while(!this.#endOfCentralDirectory){
            const signature = this.#dataView.getUint32(this.#index, true);
            if (signature === 0x04034b50){ //local file
                const entry = this.readLocalFile(this.#index);
                entry.startsAt = this.#index + 30 + entry.fileNameLength + entry.extraLength;
                entry.extract = this.extract.bind(this, entry);
                this.#localFiles.push(entry);
                this.#index = entry.startsAt + entry.compressedSize; //startsAt is absolute, so assign rather than add
            } else if (signature === 0x02014b50){ //central directory
                const entry = this.readCentralDirectory(this.#index);
                this.#centralDirectories.push(entry);
                this.#index += 46 + entry.fileNameLength + entry.extraLength + entry.fileCommentLength;
            } else if (signature === 0x06054b50) { //end of central directory
                this.#endOfCentralDirectory = this.readEndCentralDirectory(this.#index);
            } 
        }
    }
    readLocalFile(offset){
        const fileNameLength = this.#dataView.getUint16(offset + 26, true);
        const extraLength = this.#dataView.getUint16(offset + 28, true);

        const entry = {
            signature: readString(this.#dataView, offset, 4),
            version: this.#dataView.getUint16(offset + 4, true),
            generalPurpose: this.#dataView.getUint16(offset + 6, true),
            compressionMethod: this.#dataView.getUint16(offset + 8, true),
            lastModifiedTime: this.#dataView.getUint16(offset + 10, true),
            lastModifiedDate: this.#dataView.getUint16(offset + 12, true),
            crc: this.#dataView.getUint32(offset + 14, true),
            compressedSize: this.#dataView.getUint32(offset + 18, true),
            uncompressedSize: this.#dataView.getUint32(offset + 22, true),
            fileNameLength,
            fileName: readString(this.#dataView, offset + 30, fileNameLength),
            extraLength,
            extra: readString(this.#dataView, offset + 30 + fileNameLength, extraLength),
        }

        return entry;
    }
    readCentralDirectory(offset) {
        const fileNameLength = this.#dataView.getUint16(offset + 28, true);
        const extraLength = this.#dataView.getUint16(offset + 30, true);
        const fileCommentLength = this.#dataView.getUint16(offset + 32, true);

        const centralDirectory = {
            signature: readString(this.#dataView, offset, 4),
            versionCreated: this.#dataView.getUint16(offset + 4, true),
            versionNeeded: this.#dataView.getUint16(offset + 6, true),
            generalPurpose: this.#dataView.getUint16(offset + 8, true),
            compressionMethod: this.#dataView.getUint16(offset + 10, true),
            lastModifiedTime: this.#dataView.getUint16(offset + 12, true),
            lastModifiedDate: this.#dataView.getUint16(offset + 14, true),
            crc: this.#dataView.getUint32(offset + 16, true),
            compressedSize: this.#dataView.getUint32(offset + 20, true),
            uncompressedSize: this.#dataView.getUint32(offset + 24, true),
            fileNameLength,
            extraLength,
            fileCommentLength,
            diskNumber: this.#dataView.getUint16(offset + 34, true),
            internalAttributes: this.#dataView.getUint16(offset + 36, true),
            externalAttributes: this.#dataView.getUint32(offset + 38, true),
            offset: this.#dataView.getUint32(offset + 42, true),
            fileName: readString(this.#dataView, offset + 46, fileNameLength),
            extra: readString(this.#dataView, offset + 46 + fileNameLength, extraLength),
            comments: readString(this.#dataView, offset + 46 + fileNameLength + extraLength, fileCommentLength),
        }

        return centralDirectory;
    }
    readEndCentralDirectory(offset){
        const commentLength = this.#dataView.getUint16(offset + 20, true);

        const endOfDirectory = {
            signature: readString(this.#dataView, offset, 4),
            numberOfDisks: this.#dataView.getUint16(offset + 4, true),
            centralDirectoryStartDisk: this.#dataView.getUint16(offset + 6, true),
            numberCentralDirectoryRecordsOnThisDisk: this.#dataView.getUint16(offset + 8, true),
            numberCentralDirectoryRecords: this.#dataView.getUint16(offset + 10, true),
            centralDirectorySize: this.#dataView.getUint32(offset + 12, true),
            centralDirectoryOffset: this.#dataView.getUint32(offset + 16, true),
            commentLength: commentLength,
            comment: readString(this.#dataView, offset + 22, commentLength)
        };

        return endOfDirectory;
    }
    get entries(){
        return this.#localFiles;
    }
}

There's a bunch here but it's not complicated. We set up a class to hold the ArrayBuffer and a read index while it reads data in. When read() is called things kick off. We start with the first 4 bytes and check the signature; depending on whether it's a local file, central directory, or end of central directory we read in that data structure with a DataView and then advance the index by the length of that section. Once we've hit the end of central directory we're done.

Most of the data isn't necessary for getting the file, but we do need to read it correctly to know where the bitstream starts. So you need to check the filename, extra, and comment lengths and add them to the final offset. I add this to the entry as the property startsAt because it's useful to have it precomputed. Also, when extracting there are 2 main compression modes: if the value is 0x00 then there's no compression and we just return the data as-is, and if it's 0x08 then it's DEFLATE-compressed and we need to run it through a DecompressionStream with type deflate-raw.

This isn't a complete implementation though. We could also deal with Zip64 (which adds some data), encrypted zips, or the lesser (never) used compression methods, but this should be enough to read basic zips.

Creating a very basic UI

To start let's build a small application. I've created a simple component that implements file drag and drop and a manual file picker:

import { Zip } from "../libs/zip.js";
import { GZip } from "../libs/gzip.js";
import { downloadBlob } from "../libs/utils.js";

export class WcUnzip extends HTMLElement {
    #zip;

    static get observedAttributes() {
        return [];
    }
    constructor() {
        super();
        this.bind(this);
    }
    bind(element) {
        element.attachEvents = element.attachEvents.bind(element);
        element.cacheDom = element.cacheDom.bind(element);
        element.onChange = element.onChange.bind(element);
        element.onDragLeave = element.onDragLeave.bind(element);
        element.onDragOver = element.onDragOver.bind(element);
        element.onDrop = element.onDrop.bind(element);
    }
    connectedCallback() {
        this.render();
        this.cacheDom();
        this.attachEvents();
    }
    render(){
        this.attachShadow({ mode: "open" });
        this.shadowRoot.innerHTML = `
            <style>
                :host { display: block; background: #999; min-inline-size: 320px; min-block-size: 240px; }
                :host(.over) {
                    border: 8px solid green;
                }
            </style>
            <input type="file" id="file" accept=".gz,.zip">
            <ul id="list"></ul>
        `
    }
    cacheDom() {
        this.dom = {
            list: this.shadowRoot.querySelector("#list"),
            file: this.shadowRoot.querySelector("#file")
        };
    }
    attachEvents() {
        this.addEventListener("dragover", this.onDragOver);
        this.addEventListener("dragleave", this.onDragLeave);
        this.addEventListener("drop", this.onDrop);
        this.dom.file.addEventListener("change", this.onChange);
    }
    onDragOver(e){
        e.preventDefault();
        this.classList.add("over");
    }
    onDragLeave(e){
        e.preventDefault();
        this.classList.remove("over");
    }
    onDrop(e){
        e.preventDefault();
        const file = e.dataTransfer.files[0];
        this.readFile(file);
    }
    onChange(e){
        e.preventDefault();
        const file = this.dom.file.files[0];
        this.readFile(file)
    }
    readFile(file){
        this.dom.list.innerHTML = "";

        if (file.type === "application/x-zip-compressed" || file.type === "application/zip"){ //mime-type varies by platform
            const reader = new FileReader();
            reader.onload = () => {
                this.#zip = new Zip(reader.result);
                this.#zip.entries.forEach(e => {
                    const li = document.createElement("li");
                    li.textContent = e.fileName;
                    li.addEventListener("click", async () => {
                        const blob = await e.extract();
                        downloadBlob(blob, e.fileName);
                    });
                    this.dom.list.append(li);
                });
            }
            reader.readAsArrayBuffer(file);
        }

        if(file.type === "application/x-gzip" || file.type === "application/gzip"){ //mime-type varies by platform
            const reader = new FileReader();
            reader.onload = () => {
                this.#zip = new GZip(reader.result);
                const li = document.createElement("li");
                li.textContent = file.name;
                li.addEventListener("click", async () => {
                    const blob = await this.#zip.extract();
                    downloadBlob(blob, this.#zip.fileName);
                });
                this.dom.list.append(li);
            }
            reader.readAsArrayBuffer(file);
        }

        this.classList.remove("over");
    }
    attributeChangedCallback(name, oldValue, newValue) {
        this[name] = newValue;
    }
}

customElements.define("wc-unzip", WcUnzip);



The interesting parts here are that on dragover and dragleave I add and remove classes to show the change, since at least at the moment there is no good pseudo-class for this. We take the first file (as users can drop multiple selected files) and then inspect the mime-type. If it's a zip file we'll try to open it with our zip library, and if it's a gzip file we'll use the gzip library. In the case of a zip file we list out the entries, and if you click on one we download the uncompressed payload with the file name. For gzip we only consider 1 file and we download it the same way.

Conclusion

So hopefully if you need to read some .zips, .xlsxs, .docxs, .pptxs, .jars, .wars, or anything gzipped in the browser you can do so pretty easily and without 3rd-party dependencies. Note that some other implementations can actually be faster, but if we assume that browsers get optimized over time, it's most sustainable to use what you are given.

Code

Here's the code: https://github.com/ndesmic/zip/tree/v1.0

Also check the repo's examples folder if you need some test cases: hello.zip is uncompressed, lore-ipsum.zip is compressed.
