Stefor07

Posted on Oct 27 • Edited on Nov 12

Decompression in C/C++: Virtual Files and Memory I/O Principles and Techniques

#cpp #programming #libraries

In this article, we will analyze techniques for managing buffers and virtual files during decompression operations in C/C++, with practical examples showing how to work with both high-level and low-level libraries.

Why Decompress Directly in Memory?

Decompressing archives directly in memory instead of using temporary files offers several advantages:

Performance: Avoiding disk I/O significantly reduces latency, especially for repeated reads or streaming scenarios.
Security: Sensitive data never touches the disk, reducing the risk of exposure or leaving traces.
Single-File Deployment: Combined with techniques like embedding binary data in executables, you can load and unpack archives without deploying external files.
Flexible Data Sources: Works with network streams, embedded resources, or any custom buffer—not bound by the file system.
Cleaner Resource Management: No temporary files means less cleanup, fewer permissions issues, and simpler code.

Memory vs Disk Decompression: Understanding the Landscape

Decompression libraries fall into two categories based on how they handle data sources:

High-level libraries provide native APIs for extracting data directly into memory buffers. These abstract away the complexity of I/O operations, making memory-based decompression straightforward.

Low-level, file-centric libraries assume data comes from disk files. They expect standard file operations (read, write, seek) and require developers to implement custom I/O callbacks to work with memory buffers.

Understanding this distinction helps you choose the right tool and integration approach for your use case.

High-Level Libraries with Native Memory Support

Several modern libraries provide built-in memory extraction without requiring custom callbacks:

libzip

zip_fread(): Reads files directly into memory buffers
zip_source_buffer_create(): Creates a ZIP source from existing memory
zip_open_from_source(): Opens archives using in-memory sources

libarchive

archive_read_data(): Extracts entries into memory buffers
archive_write_data(): Writes from memory into archives
Supports multiple formats (ZIP, TAR, etc.)

zlib

uncompress(): Single-call buffer decompression
inflateInit(), inflate(): Stream-based API for progressive decompression

minizip (part of zlib)

unzOpenMemory(): Opens ZIP archives stored in memory
unzReadCurrentFile(): Reads files directly into buffers

These libraries handle all the complexity internally—reading compressed chunks, decompressing them, and writing output—while exposing simple, memory-friendly APIs.

Low-Level Libraries: The Custom Callback Approach

Unlike high-level libraries, some decompression tools are designed exclusively for file-based I/O:

LZMA SDK (7-Zip)
Windows Cabinet APIs (FCICreate, FDICopy, etc.)

For archive-level extraction, these libraries don't offer a ready-made memory API. Instead, they expect standard file operations and process data through internal buffers that aren't exposed to developers.

To work with memory buffers, you must implement custom I/O callbacks that simulate file behavior. The library requests data—thinking it's reading from a file—while your callbacks feed it data from memory instead.

How Custom Callbacks Work

The process is conceptually simple:

Your compressed archive is loaded into a memory buffer
The library performs read, seek, or write operations
Your callbacks intercept these operations and provide data from your buffer
The library processes this data through its internal buffers, decompressing as if reading from a real file

This approach requires more boilerplate but provides full control over memory usage and enables working with non-disk data sources.
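The four steps above can be sketched with a stand-in for a file-centric library. Everything in this example is hypothetical: ReaderCallbacks, MemRead, and FakeLibrary_Checksum are illustrative names, not part of any real SDK. The "library" only knows how to call a read function, so it never learns that the data lives in memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback table a file-centric library might ask for. */
typedef struct
{
    void* user;                                          /* opaque handle passed back to us */
    size_t (*read)(void* user, void* dst, size_t bytes); /* returns bytes read, 0 at end */
} ReaderCallbacks;

/* Our "virtual file": a buffer plus a cursor. */
typedef struct
{
    const uint8_t* data;
    size_t size;
    size_t pos;
} MemoryStream;

/* Callback that serves reads from memory instead of a file. */
static size_t MemRead(void* user, void* dst, size_t bytes)
{
    MemoryStream* ms = (MemoryStream*)user;
    size_t remaining = ms->size - ms->pos;

    if (bytes > remaining)
    {
        bytes = remaining;
    }

    memcpy(dst, ms->data + ms->pos, bytes);
    ms->pos += bytes;
    return bytes;
}

/* Stand-in for the library: pulls data through its own small internal
   buffer and sums the bytes. A real library would decompress here. */
static uint32_t FakeLibrary_Checksum(const ReaderCallbacks* cb)
{
    uint8_t internal[4];
    uint32_t sum = 0;
    size_t n;

    while ((n = cb->read(cb->user, internal, sizeof internal)) > 0)
    {
        for (size_t i = 0; i < n; ++i)
        {
            sum += internal[i];
        }
    }

    return sum;
}
```

The caller fills a ReaderCallbacks with a pointer to its MemoryStream and the MemRead function, then hands it to the library. Swapping the data source later only means swapping the callback.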

Implementing Memory Callbacks

To make file-centric libraries work with memory, we encapsulate the buffer in a structure that tracks position:

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

typedef struct
{
    const uint8_t* data;      // pointer to memory buffer
    size_t size;           // total size of buffer
    size_t pos;            // current position
} MemoryStream;
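
A small initializer helps keep the three fields consistent. MemoryStream_Init is a hypothetical helper name used only for this sketch. It assumes the caller keeps the buffer alive for as long as the stream is in use, because the struct borrows the pointer rather than copying the data.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    const uint8_t* data;   /* pointer to memory buffer (borrowed, not owned) */
    size_t size;           /* total size of buffer */
    size_t pos;            /* current position */
} MemoryStream;

/* Hypothetical helper: wraps an existing buffer without copying it.
   Rejects a NULL stream and a NULL buffer with a nonzero size. */
static bool MemoryStream_Init(MemoryStream* ms, const void* buffer, size_t size)
{
    if (!ms || (!buffer && size > 0))
    {
        return false;
    }

    ms->data = (const uint8_t*)buffer;
    ms->size = size;
    ms->pos = 0;
    return true;
}
```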

Core Callback Implementations

Read Callback
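Copies up to the requested number of bytes (the bytes argument) from the stream into dst, stopping early at the end of the buffer.
Returns the number of bytes actually read, or 0 at end of stream.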

size_t Read(MemoryStream* ms, void* dst, size_t bytes) 
{
    if (!ms || !dst || ms->pos >= ms->size)
    {
        return 0;
    }

    size_t remainingBytes = ms->size - ms->pos;

    if (bytes > remainingBytes)
    {
        bytes = remainingBytes;
    }

    memcpy(dst, ms->data + ms->pos, bytes);

    ms->pos += bytes;
    return bytes;
}
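
A consumer typically calls Read in a loop and treats a return value of 0 as end of stream. The sketch below repeats the MemoryStream and Read definitions so it compiles on its own; CountChunks is a hypothetical helper added purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct
{
    const uint8_t* data;
    size_t size;
    size_t pos;
} MemoryStream;

static size_t Read(MemoryStream* ms, void* dst, size_t bytes)
{
    if (!ms || !dst || ms->pos >= ms->size)
    {
        return 0;
    }

    size_t remainingBytes = ms->size - ms->pos;

    if (bytes > remainingBytes)
    {
        bytes = remainingBytes;
    }

    memcpy(dst, ms->data + ms->pos, bytes);
    ms->pos += bytes;
    return bytes;
}

/* Hypothetical consumer: drains the stream in pieces of chunkSize bytes
   (at most 16) and reports how many Read calls returned data. */
static size_t CountChunks(MemoryStream* ms, size_t chunkSize)
{
    uint8_t chunk[16];
    size_t calls = 0;

    if (chunkSize == 0 || chunkSize > sizeof chunk)
    {
        return 0;
    }

    while (Read(ms, chunk, chunkSize) > 0)
    {
        ++calls;
    }

    return calls;
}
```

With a 10-byte buffer and a chunk size of 4, the loop sees reads of 4, 4, and 2 bytes before Read returns 0. The short final read is normal and not an error.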

Seek Callback
Moves the current position relative to the beginning, current position, or end of the stream.

// Prefixed with MS_ to avoid clashing with the SEEK_* macros from <stdio.h>.
typedef enum { MS_SEEK_SET, MS_SEEK_CUR, MS_SEEK_END } SeekOrigin;

bool Seek(MemoryStream* ms, long offset, SeekOrigin origin) 
{
    if (!ms)
    {
        return false;
    }

    // Signed on purpose: a negative result must be detectable.
    int64_t newPos = 0;

    switch (origin) 
    {
        case MS_SEEK_SET: 
        {
            newPos = (int64_t)offset; 
            break;
        }
        case MS_SEEK_CUR:
        {
            newPos = (int64_t)ms->pos + offset; 
            break;
        } 
        case MS_SEEK_END: 
        {
            newPos = (int64_t)ms->size + offset; 
            break;
        }

        default:
        {
            return false;
        }
    }

    // Clamp to the valid range [0, size].
    if (newPos < 0)
    {
        newPos = 0;
    }
    else if (newPos > (int64_t)ms->size)
    {
        newPos = (int64_t)ms->size;
    }

    ms->pos = (size_t)newPos;
    return true;
}

⚠️ Warning: Inside this callback, newPos is declared as int64_t, a signed 64-bit type, and the enum constants carry an MS_ prefix so they don't collide with the SEEK_SET, SEEK_CUR, and SEEK_END macros from <stdio.h>. ms->pos and ms->size are size_t (unsigned), so adding a negative long to them directly would first convert the offset to a very large unsigned value. Casting to int64_t keeps the arithmetic signed: negative offsets behave as expected, and a result below zero can be detected and clamped to 0, which a uint64_t could never do because it is never negative. A 64-bit position also covers buffers larger than 4 GB (2^32 bytes) in 64-bit builds. In 32-bit processes size_t is already limited to 32 bits, so the wider type changes nothing there. Keep in mind that long is 32 bits on Windows, which limits a single offset to about ±2 GB on that platform.
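
The effect of the position type can be checked in isolation. In this sketch, ClampedSeek and WrappedSeek are hypothetical helpers: the first does the arithmetic in int64_t, the second does the same sum in uint64_t to show what happens when a negative offset overshoots the start of the buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Signed arithmetic: a result below zero is visible and can be clamped to 0. */
static size_t ClampedSeek(size_t pos, size_t size, long offset)
{
    int64_t newPos = (int64_t)pos + offset;

    if (newPos < 0)
    {
        newPos = 0;
    }
    if (newPos > (int64_t)size)
    {
        newPos = (int64_t)size;
    }

    return (size_t)newPos;
}

/* Unsigned arithmetic: the negative offset wraps to a huge value, so
   "before the start" is indistinguishable from "past the end". */
static size_t WrappedSeek(size_t pos, size_t size, long offset)
{
    uint64_t newPos = (uint64_t)pos + (uint64_t)offset;

    if (newPos > (uint64_t)size)
    {
        newPos = (uint64_t)size;
    }

    return (size_t)newPos;
}
```

Both versions agree while the target stays inside the buffer, for example moving back 3 bytes from position 4. They diverge when moving back 10 bytes from position 4: the signed version lands on 0, while the unsigned version jumps to the end of the buffer.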

Look Callback
Provides a direct pointer to the unread portion of the buffer (for “peek” access).

bool Look(MemoryStream* ms, const void** buf, size_t* size)
{
    if (!ms || !buf || !size)
    {
        return false;
    }

    if (ms->pos >= ms->size)
    {
        *buf = NULL;
        *size = 0;
        return false;
    }

    *buf = ms->data + ms->pos;
    *size = ms->size - ms->pos;

    return true;
}
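
Look is handy for sniffing a format signature before committing to a decoder. The sketch below repeats the definitions so it stands alone. StartsWithGzipMagic is a hypothetical helper, and the only outside fact it relies on is that gzip streams begin with the bytes 0x1F 0x8B.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    const uint8_t* data;
    size_t size;
    size_t pos;
} MemoryStream;

static bool Look(MemoryStream* ms, const void** buf, size_t* size)
{
    if (!ms || !buf || !size)
    {
        return false;
    }

    if (ms->pos >= ms->size)
    {
        *buf = NULL;
        *size = 0;
        return false;
    }

    *buf = ms->data + ms->pos;
    *size = ms->size - ms->pos;
    return true;
}

/* Hypothetical helper: checks the next two bytes without consuming them. */
static bool StartsWithGzipMagic(MemoryStream* ms)
{
    const void* buf = NULL;
    size_t avail = 0;

    if (!Look(ms, &buf, &avail) || avail < 2)
    {
        return false;
    }

    const uint8_t* p = (const uint8_t*)buf;
    return p[0] == 0x1F && p[1] == 0x8B;
}
```

Because Look never moves the cursor, the decoder chosen afterwards still starts reading from the first byte of the signature.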

Skip Callback
Advances the current position by offset bytes (without reading).

bool Skip(MemoryStream* ms, size_t offset) 
{
    if (!ms)
    {
        return false;
    }

    if (offset > SIZE_MAX - ms->pos)
    {
        return false;
    }

    if (ms->pos + offset > ms->size)
    {
        ms->pos = ms->size;
    }
    else
    {
        ms->pos += offset;
    }

    return true;
}

Write Callback
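Writes up to size bytes from src into the buffer, truncating at the end of the buffer.
Returns the number of bytes successfully written.

size_t Write(MemoryStream* ms, const void* src, size_t size)
{
    if (!ms || !src || ms->pos >= ms->size)
    {
        return 0;
    }

    size_t remainingBytes = ms->size - ms->pos;

    if (size > remainingBytes)
    {
        size = remainingBytes;
    }

    // data is declared const for the read path; this cast is only valid
    // when the underlying buffer is actually writable.
    uint8_t* dest = (uint8_t*)ms->data + ms->pos;
    memcpy(dest, src, size);

    ms->pos += size;
    return size;
}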
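
When the decompressed size isn't known up front, a fixed-size target truncates the output. One possible alternative, sketched here under our own assumptions, is an output stream that grows as needed. GrowStream and GrowWrite are hypothetical names and are not part of the template above.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
    uint8_t* data;     /* heap buffer owned by the stream (free() when done) */
    size_t size;       /* bytes written so far */
    size_t capacity;   /* bytes allocated */
} GrowStream;

/* Appends size bytes, doubling the capacity when needed.
   Returns the number of bytes written, or 0 on failure. */
static size_t GrowWrite(GrowStream* gs, const void* src, size_t size)
{
    if (!gs || !src || size == 0 || size > SIZE_MAX - gs->size)
    {
        return 0;
    }

    size_t needed = gs->size + size;

    if (needed > gs->capacity)
    {
        size_t newCap = gs->capacity ? gs->capacity : 64;

        while (newCap < needed)
        {
            if (newCap > SIZE_MAX / 2)
            {
                newCap = needed;   /* doubling would overflow */
                break;
            }
            newCap *= 2;
        }

        uint8_t* grown = (uint8_t*)realloc(gs->data, newCap);
        if (!grown)
        {
            return 0;              /* old buffer is still valid */
        }

        gs->data = grown;
        gs->capacity = newCap;
    }

    memcpy(gs->data + gs->size, src, size);
    gs->size += size;
    return size;
}
```

Start from a zeroed GrowStream, since realloc with a NULL pointer behaves like malloc. The trade-off is that pointers into the buffer become invalid whenever it grows, so callers should index by offset rather than hold on to addresses.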

Tell Callback
Returns the current position within the buffer.

size_t Tell(const MemoryStream* ms)
{
    return (ms && ms->pos <= ms->size) ? ms->pos : 0;
}

Callback Purpose and Usage

Each callback serves a specific role when simulating file access in memory:

Read: Provides the next chunk of data from the memory buffer. Essential for decompression, parsing, or streaming input.
Seek: Moves the current position relative to the beginning, current position, or end of the buffer. Required for random-access libraries.
Look: Provides a pointer to the unread portion of the buffer for “peek” operations without changing the current position. Optional but useful for inspection.
Skip: Advances the current position by a given number of bytes without copying data. Useful for efficiently ignoring unwanted sections.
Write: Stores processed or decompressed output back into memory. Not required for read-only operations.
Tell: Returns the current position within the buffer. Needed for offset tracking or progress reporting.

Usage Notes

Sequential operations (e.g., streaming decompression) may only require Read and Write.
Random-access archives or seeking operations require Seek, Skip, and Tell.
Inspection-only operations might only need Read and Look, skipping Write entirely.

Adapting Callbacks to Specific Libraries

The callbacks above are a starting template. Each library has different requirements for function signatures, parameters, or behavior. Some might pass opaque pointers instead of MemoryStream*, or require additional flags.

The key principle is adapting memory buffers to behave like files. From this foundation, customize the callbacks to match your library's API.
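
As an illustration of that adaptation, suppose a library expects a read function that returns an error code, reports the byte count through an in/out parameter, and receives a pointer to its own interface struct rather than to your stream. The interface below (IReader, READ_OK, READ_ERROR) is hypothetical and is not the actual definition from any SDK. The adapter places the interface as its first member so the callback can recover the MemoryStream from the pointer it is given.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical library-side interface (not a real SDK definition). */
typedef struct IReader
{
    /* Reads up to *size bytes; on return *size holds the count actually read.
       Returns READ_OK on success, nonzero on error. */
    int (*Read)(struct IReader* self, void* buf, size_t* size);
} IReader;

enum { READ_OK = 0, READ_ERROR = 1 };

typedef struct
{
    const uint8_t* data;
    size_t size;
    size_t pos;
} MemoryStream;

/* Adapter: vt is the first member, so a pointer to it is also a
   pointer to the whole MemReader. */
typedef struct
{
    IReader vt;
    MemoryStream ms;
} MemReader;

static int MemReader_Read(IReader* self, void* buf, size_t* size)
{
    if (!self || !buf || !size)
    {
        return READ_ERROR;
    }

    MemReader* r = (MemReader*)self;
    size_t remaining = r->ms.size - r->ms.pos;

    if (*size > remaining)
    {
        *size = remaining;
    }

    memcpy(buf, r->ms.data + r->ms.pos, *size);
    r->ms.pos += *size;
    return READ_OK;
}

static void MemReader_Init(MemReader* r, const void* data, size_t size)
{
    r->vt.Read = MemReader_Read;
    r->ms.data = (const uint8_t*)data;
    r->ms.size = size;
    r->ms.pos = 0;
}
```

The library would receive &reader.vt and call through it. Here, end of stream is signaled by READ_OK with *size set to 0, which is one common convention; check the documentation of the library you are targeting for its own rule.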

Library-Specific Requirements

LZMA SDK
Requires ILookInStream interface with 4 callbacks:

Look → peek at upcoming bytes without advancing
Skip → skip bytes forward
Read → read and advance cursor
Seek → move to specific position

Windows Cabinet (CAB) SDK
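Requires two groups of callbacks:

File operations (CabRead, CabWrite, CabSeek, plus open and close handlers), registered together with memory allocation routines via FDICreate
Notify callback (CabNotify) for operation events, passed to FDICopy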

This comparison shows how different libraries require varying levels of manual I/O adaptation.

Library Comparison

Library	Memory Support	Custom Callbacks	Complexity	Key Features
libzip	✅ Native	❌ Not needed	Low	`zip_fread()`, `zip_source_buffer_create()`, `zip_open_from_source()`
libarchive	✅ Native	❌ Not needed	Low	Multi-format support, `archive_read_data()`, `archive_write_data()`
zlib	✅ Native	❌ Not needed	Low/Medium	`uncompress()` and streaming APIs, efficient for simple compression
minizip (part of zlib)	✅ Native	❌ Not needed	Medium	`unzOpenMemory()`, `unzReadCurrentFile()` for complete in-memory ZIP handling
LZMA SDK (7-Zip)	❌ Manual	✅ Required	High	Requires `ISeqInStream` (1 callback) or `ILookInStream` (4 callbacks), granular control
Windows Cabinet (CAB) APIs	❌ Manual	✅ Required	Medium	Core callbacks: `Read`, `Write`, `Seek`, plus `CabNotify`, simpler than LZMA

Conclusion

Memory-based decompression in C/C++ requires understanding two distinct approaches: high-level libraries with native memory APIs, and low-level file-centric libraries requiring custom callbacks.

High-level libraries like libzip, libarchive, zlib, and minizip provide straightforward, efficient APIs for working directly with memory buffers. For most use cases, these should be your first choice.

When working with specialized formats or legacy libraries like LZMA SDK or Windows Cabinet APIs, custom callbacks bridge the gap. By implementing virtual file operations on memory buffers, you gain the same flexibility while maintaining full control over data flow.

Whether you're embedding archives in executables, processing network streams, or avoiding disk I/O for security reasons, these techniques enable efficient, flexible decompression workflows entirely in memory.

Apply these patterns in your projects: start with high-level APIs when available, and implement custom callbacks when necessary. With this foundation, you're equipped to build secure, performant decompression solutions in C/C++.
