DEV Community

Stefor07


Decompression in C/C++: Virtual Files and Memory I/O Principles and Techniques

In this article, we will analyze techniques for managing buffers and virtual files during decompression operations in C/C++, with practical examples showing how to work with both high-level and low-level libraries.

Why Decompress Directly in Memory?

Decompressing archives directly in memory instead of using temporary files offers several advantages:

  • Performance: Avoiding disk I/O significantly reduces latency, especially for repeated reads or streaming scenarios.
  • Security: Sensitive data never touches the disk, reducing the risk of exposure or leaving traces.
  • Single-File Deployment: Combined with techniques like embedding binary data in executables, you can load and unpack archives without deploying external files.
  • Flexible Data Sources: Works with network streams, embedded resources, or any custom buffer—not bound by the file system.
  • Cleaner Resource Management: No temporary files means less cleanup, fewer permissions issues, and simpler code.

Memory vs Disk Decompression: Understanding the Landscape

Decompression libraries fall into two categories based on how they handle data sources:

High-level libraries provide native APIs for extracting data directly into memory buffers. These abstract away the complexity of I/O operations, making memory-based decompression straightforward.

Low-level, file-centric libraries assume data comes from disk files. They expect standard file operations (read, write, seek) and require developers to implement custom I/O callbacks to work with memory buffers.

Understanding this distinction helps you choose the right tool and integration approach for your use case.

High-Level Libraries with Native Memory Support

Several modern libraries provide built-in memory extraction without requiring custom callbacks:

libzip

  • zip_fread(): Reads files directly into memory buffers
  • zip_source_buffer_create(): Creates a ZIP source from existing memory
  • zip_open_from_source(): Opens archives using in-memory sources

libarchive

  • archive_read_data(): Extracts entries into memory buffers
  • archive_write_data(): Writes from memory into archives
  • Supports multiple formats (ZIP, TAR, etc.)

zlib

  • uncompress(): Single-call buffer decompression
  • inflateInit(), inflate(): Stream-based API for progressive decompression

minizip (shipped in zlib's contrib directory)

  • unzOpenMemory(): Opens ZIP archives stored in memory
  • unzReadCurrentFile(): Reads files directly into buffers

These libraries handle all the complexity internally—reading compressed chunks, decompressing them, and writing output—while exposing simple, memory-friendly APIs.

Low-Level Libraries: The Custom Callback Approach

Unlike high-level libraries, some decompression tools are designed exclusively for file-based I/O:

  • LZMA SDK (7-Zip)
  • Windows Cabinet APIs (FCICreate, FDICopy, etc.)

These libraries don't provide memory extraction APIs. Instead, they expect standard file operations and process data through internal buffers that aren't exposed to developers.

To work with memory buffers, you must implement custom I/O callbacks that simulate file behavior. The library requests data—thinking it's reading from a file—while your callbacks feed it data from memory instead.

How Custom Callbacks Work

The process is conceptually simple:

  1. Your compressed archive is loaded into a memory buffer
  2. The library performs read, seek, or write operations
  3. Your callbacks intercept these operations and provide data from your buffer
  4. The library processes this data through its internal buffers, decompressing as if reading from a real file

This approach requires more boilerplate but provides full control over memory usage and enables working with non-disk data sources.
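The four steps above can be sketched with a toy consumer standing in for a file-centric library. Everything here is hypothetical and for illustration only: ReadFn, MemorySource, MemoryRead, LibraryConsume, and SumBuffer are made-up names, and a real library would decompress the chunks rather than sum them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical callback type: the only thing the "library" knows about I/O.
typedef size_t (*ReadFn)(void* user, void* dst, size_t bytes);

// Step 1: the compressed archive lives in a memory buffer.
typedef struct
{
    const uint8_t* data;
    size_t size;
    size_t pos;
} MemorySource;

// Step 3: the callback answers read requests from memory instead of a file.
size_t MemoryRead(void* user, void* dst, size_t bytes)
{
    MemorySource* src = (MemorySource*)user;
    size_t remaining = src->size - src->pos;

    if (bytes > remaining)
    {
        bytes = remaining;
    }

    if (bytes == 0)
    {
        return 0;
    }

    memcpy(dst, src->data + src->pos, bytes);
    src->pos += bytes;
    return bytes;
}

// Steps 2 and 4: a stand-in for the library. It pulls data through the
// callback into its own internal 4-byte buffer and processes it (here it
// only sums the bytes), never knowing where the data comes from.
uint32_t LibraryConsume(ReadFn read, void* user)
{
    uint8_t chunk[4];
    uint32_t sum = 0;
    size_t got;

    while ((got = read(user, chunk, sizeof chunk)) > 0)
    {
        for (size_t i = 0; i < got; i++)
        {
            sum += chunk[i];
        }
    }

    return sum;
}

// Wires a memory buffer to the consumer.
uint32_t SumBuffer(const uint8_t* data, size_t size)
{
    MemorySource src = { data, size, 0 };
    return LibraryConsume(MemoryRead, &src);
}
```

The opaque void* user parameter is what lets the same consumer work with a file handle, a socket, or a memory buffer without any change on its side.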

Implementing Memory Callbacks

To make file-centric libraries work with memory, we encapsulate the buffer in a structure that tracks position:

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

typedef struct
{
    const uint8_t* data;      // pointer to memory buffer
    size_t size;           // total size of buffer
    size_t pos;            // current position
} MemoryStream;
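Setting up such a structure could look like the sketch below. MemoryStream_Init is a hypothetical helper, not something the article or any library defines; it assumes the stream only borrows the buffer, so the buffer has to outlive the stream.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    const uint8_t* data;   // pointer to memory buffer (not owned)
    size_t size;           // total size of buffer
    size_t pos;            // current position
} MemoryStream;

// Hypothetical helper: points the stream at an existing buffer and rewinds it.
// Returns false for a NULL stream or a NULL buffer with a non-zero size.
bool MemoryStream_Init(MemoryStream* ms, const void* buffer, size_t size)
{
    if (!ms || (!buffer && size > 0))
    {
        return false;
    }

    ms->data = (const uint8_t*)buffer;
    ms->size = size;
    ms->pos = 0;
    return true;
}
```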

Core Callback Implementations

Read Callback
Reads up to the requested number of bytes from the stream into dst.
Returns the number of bytes actually read.

size_t Read(MemoryStream* ms, void* dst, size_t bytes) 
{
    if (!ms || !dst || ms->pos >= ms->size)
    {
        return 0;
    }

    size_t remainingBytes = ms->size - ms->pos;

    if (bytes > remainingBytes)
    {
        bytes = remainingBytes;
    }

    memcpy(dst, ms->data + ms->pos, bytes);

    ms->pos += bytes;
    return bytes;
}

Seek Callback
Moves the current position relative to the beginning, current position, or end of the stream.

#include <stdio.h>   // SEEK_SET, SEEK_CUR, SEEK_END

bool Seek(MemoryStream* ms, long offset, int origin)
{
    if (!ms)
    {
        return false;
    }

    // Signed on purpose: a negative result must be detectable
    int64_t newPos = 0;

    switch (origin)
    {
        case SEEK_SET:
        {
            newPos = (int64_t)offset;
            break;
        }
        case SEEK_CUR:
        {
            newPos = (int64_t)ms->pos + offset;
            break;
        }
        case SEEK_END:
        {
            newPos = (int64_t)ms->size + offset;
            break;
        }
        default:
        {
            return false;
        }
    }

    // Clamp to the valid range [0, size]
    if (newPos < 0)
    {
        newPos = 0;
    }
    else if ((uint64_t)newPos > (uint64_t)ms->size)
    {
        newPos = (int64_t)ms->size;
    }

    ms->pos = (size_t)newPos;
    return true;
}

⚠️ Warning: Inside this callback, the newPos variable is declared as an int64_t, a signed 64-bit type. ms->size and ms->pos are of type size_t (unsigned), so adding a negative offset in unsigned arithmetic wraps around to a huge value instead of going below zero, and a check such as newPos < 0 could never be true. Converting to int64_t first lets negative offsets be detected and clamped correctly. The 64-bit width also keeps positions larger than 4 GB (2^32 bytes) intact in 64-bit processes. In a 32-bit process, size_t and long are both 32 bits wide, so the wider type does not change behavior. The origin constants come from <stdio.h>; declaring SEEK_SET, SEEK_CUR, and SEEK_END again as enumerators would clash with those macros.
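To make the signed versus unsigned point concrete, here is a small sketch that computes a clamped position both ways. ResolvePosition and ResolvePositionUnsigned are hypothetical helpers written for this comparison, and the signed version assumes the base position is far below INT64_MAX, which holds for any realistic in-memory buffer.

```c
#include <stddef.h>
#include <stdint.h>

// Signed arithmetic: a result below zero is visible and clamps to 0.
size_t ResolvePosition(size_t base, long offset, size_t size)
{
    int64_t target = (int64_t)base + (int64_t)offset;

    if (target < 0)
    {
        target = 0;
    }
    else if ((uint64_t)target > (uint64_t)size)
    {
        target = (int64_t)size;
    }

    return (size_t)target;
}

// Unsigned arithmetic, for comparison: a result that should be negative
// wraps around to a huge value, so it ends up clamped to size, not to 0.
size_t ResolvePositionUnsigned(size_t base, long offset, size_t size)
{
    uint64_t target = (uint64_t)base + (uint64_t)offset;

    if (target > (uint64_t)size)
    {
        target = (uint64_t)size;
    }

    return (size_t)target;
}
```

Seeking 5 bytes back from position 2 in a 10-byte buffer lands on 0 with the signed version and on 10 with the unsigned one. The two agree whenever the true result is non-negative, which is why the bug is easy to miss in testing.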

Look Callback
Provides a direct pointer to the unread portion of the buffer (for “peek” access).

bool Look(MemoryStream* ms, const void** buf, size_t* size)
{
    if (!ms || !buf || !size)
    {
        return false;
    }

    if (ms->pos >= ms->size)
    {
        *buf = NULL;
        *size = 0;
        return false;
    }

    *buf = ms->data + ms->pos;
    *size = ms->size - ms->pos;

    return true;
}
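One typical use of a peek is checking a format signature before committing to a parser. The sketch below repeats the Look callback so that it compiles on its own and adds a hypothetical LooksLikeZip helper that tests for the ZIP local file header signature (the bytes 'P', 'K', 0x03, 0x04). It is only a heuristic: an empty ZIP archive, for instance, starts with a different record.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct
{
    const uint8_t* data;
    size_t size;
    size_t pos;
} MemoryStream;

// Same behavior as the Look callback above.
bool Look(MemoryStream* ms, const void** buf, size_t* size)
{
    if (!ms || !buf || !size)
    {
        return false;
    }

    if (ms->pos >= ms->size)
    {
        *buf = NULL;
        *size = 0;
        return false;
    }

    *buf = ms->data + ms->pos;
    *size = ms->size - ms->pos;
    return true;
}

// Hypothetical helper: inspects the next four bytes without moving pos.
bool LooksLikeZip(MemoryStream* ms)
{
    static const uint8_t signature[4] = { 'P', 'K', 0x03, 0x04 };
    const void* view = NULL;
    size_t available = 0;

    if (!Look(ms, &view, &available) || available < sizeof signature)
    {
        return false;
    }

    return memcmp(view, signature, sizeof signature) == 0;
}
```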

Skip Callback
Advances the current position by offset bytes (without reading).

bool Skip(MemoryStream* ms, size_t offset) 
{
    if (!ms)
    {
        return false;
    }

    if (offset > SIZE_MAX - ms->pos)
    {
        return false;
    }

    if (ms->pos + offset > ms->size)
    {
        ms->pos = ms->size;
    }
    else
    {
        ms->pos += offset;
    }

    return true;
}

Write Callback
Writes up to size bytes from src into the buffer.
Returns the number of bytes actually written.

size_t Write(MemoryStream* ms, const void* src, size_t size)
{
    if (!ms || !src || ms->pos >= ms->size)
    {
        return 0;
    }

    size_t remainingBytes = ms->size - ms->pos;

    if (size > remainingBytes)
    {
        size = remainingBytes;
    }

    // data is declared const for reading; the underlying buffer must be writable
    uint8_t* dest = (uint8_t*)ms->data + ms->pos;
    memcpy(dest, src, size);

    ms->pos += size;
    return size;
}

Tell Callback
Returns the current position within the buffer.

size_t Tell(const MemoryStream* ms)
{
    return (ms && ms->pos <= ms->size) ? ms->pos : 0;
}

Callback Purpose and Usage

Each callback serves a specific role when simulating file access in memory:

  • Read: Provides the next chunk of data from the memory buffer. Essential for decompression, parsing, or streaming input.
  • Seek: Moves the current position relative to the beginning, current position, or end of the buffer. Required for random-access libraries.
  • Look: Provides a pointer to the unread portion of the buffer for “peek” operations without changing the current position. Optional but useful for inspection.
  • Skip: Advances the current position by a given number of bytes without copying data. Useful for efficiently ignoring unwanted sections.
  • Write: Stores processed or decompressed output back into memory. Not required for read-only operations.
  • Tell: Returns the current position within the buffer. Needed for offset tracking or progress reporting.

Usage Notes

  • Sequential operations (e.g., streaming decompression) may only require Read and Write.
  • Random-access archives or seeking operations require Seek, Skip, and Tell.
  • Inspection-only operations might only need Read and Look, skipping Write entirely.
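As a rough sketch of the sequential case, the loop below moves data from an input stream to an output stream using only Read and Write. Pump is a hypothetical name, and the plain copy stands in for the decompression step a real library would perform between the two calls. Read and Write are repeated so the block compiles on its own, with Write returning the number of bytes written.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct
{
    const uint8_t* data;
    size_t size;
    size_t pos;
} MemoryStream;

size_t Read(MemoryStream* ms, void* dst, size_t bytes)
{
    if (!ms || !dst || ms->pos >= ms->size)
    {
        return 0;
    }

    size_t remaining = ms->size - ms->pos;

    if (bytes > remaining)
    {
        bytes = remaining;
    }

    memcpy(dst, ms->data + ms->pos, bytes);
    ms->pos += bytes;
    return bytes;
}

size_t Write(MemoryStream* ms, const void* src, size_t size)
{
    if (!ms || !src || ms->pos >= ms->size)
    {
        return 0;
    }

    size_t remaining = ms->size - ms->pos;

    if (size > remaining)
    {
        size = remaining;
    }

    // The output buffer must be writable even though data is declared const.
    memcpy((uint8_t*)ms->data + ms->pos, src, size);
    ms->pos += size;
    return size;
}

// Hypothetical pump: reads fixed-size chunks until the input is exhausted
// or the output is full. Returns the total number of bytes written.
size_t Pump(MemoryStream* in, MemoryStream* out)
{
    uint8_t chunk[8];
    size_t total = 0;
    size_t got;

    while ((got = Read(in, chunk, sizeof chunk)) > 0)
    {
        size_t put = Write(out, chunk, got);
        total += put;

        if (put < got)
        {
            break;   // output buffer is full
        }
    }

    return total;
}
```

No Seek, Skip, or Tell is involved: both positions only ever move forward, which is all a streaming consumer needs.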

Adapting Callbacks to Specific Libraries

The callbacks above are a starting template. Each library has different requirements for function signatures, parameters, or behavior. Some might pass opaque pointers instead of MemoryStream*, or require additional flags.

The key principle is adapting memory buffers to behave like files. From this foundation, customize the callbacks to match your library's API.

Library-Specific Requirements

LZMA SDK
Requires ILookInStream interface with 4 callbacks:

  • Look → peek at upcoming bytes without advancing
  • Skip → skip bytes forward
  • Read → read and advance cursor
  • Seek → move to specific position
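An interface of this kind is usually a struct of function pointers embedded as the first member of your own stream object, so each callback can cast the interface pointer back to the full object. The sketch below only imitates that shape: LookStream, MemLookStream, and the int return codes are hypothetical stand-ins, not the SDK's declarations, so take the exact signatures from the SDK headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>    // SEEK_SET, SEEK_CUR, SEEK_END
#include <string.h>

// Hypothetical four-callback interface; every callback returns 0 on success.
typedef struct LookStream LookStream;
struct LookStream
{
    int (*Look)(LookStream* self, const void** buf, size_t* size);
    int (*Skip)(LookStream* self, size_t offset);
    int (*Read)(LookStream* self, void* buf, size_t* size);
    int (*Seek)(LookStream* self, int64_t* pos, int origin);
};

// The interface comes first, so LookStream* and MemLookStream* share an address.
typedef struct
{
    LookStream vt;
    const uint8_t* data;
    size_t size;
    size_t pos;
} MemLookStream;

static int MemLook(LookStream* self, const void** buf, size_t* size)
{
    MemLookStream* ms = (MemLookStream*)self;
    *buf = ms->data + ms->pos;
    *size = ms->size - ms->pos;   // everything still unread
    return 0;
}

static int MemSkip(LookStream* self, size_t offset)
{
    MemLookStream* ms = (MemLookStream*)self;
    size_t remaining = ms->size - ms->pos;
    ms->pos += (offset < remaining) ? offset : remaining;
    return 0;
}

static int MemRead(LookStream* self, void* buf, size_t* size)
{
    MemLookStream* ms = (MemLookStream*)self;
    size_t remaining = ms->size - ms->pos;
    if (*size > remaining) { *size = remaining; }   // report the short read
    if (*size > 0) { memcpy(buf, ms->data + ms->pos, *size); }
    ms->pos += *size;
    return 0;
}

static int MemSeek(LookStream* self, int64_t* pos, int origin)
{
    MemLookStream* ms = (MemLookStream*)self;
    int64_t base;

    switch (origin)
    {
        case SEEK_SET: base = 0; break;
        case SEEK_CUR: base = (int64_t)ms->pos; break;
        case SEEK_END: base = (int64_t)ms->size; break;
        default: return 1;
    }

    int64_t target = base + *pos;
    if (target < 0 || (uint64_t)target > (uint64_t)ms->size) { return 1; }

    ms->pos = (size_t)target;
    *pos = target;   // hand the resulting absolute position back
    return 0;
}

void MemLookStream_Init(MemLookStream* ms, const void* data, size_t size)
{
    ms->vt.Look = MemLook;
    ms->vt.Skip = MemSkip;
    ms->vt.Read = MemRead;
    ms->vt.Seek = MemSeek;
    ms->data = (const uint8_t*)data;
    ms->size = size;
    ms->pos = 0;
}
```

The library would receive &ms.vt and call through the function pointers. Unlike the clamping Seek shown earlier, this sketch rejects out-of-range positions, which is a design choice to settle per library.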

Windows Cabinet (CAB) SDK
Requires two groups of callbacks:

  1. File operations (CabRead, CabWrite, CabSeek)
  2. Notify callback (CabNotify) for operation events

This comparison shows how different libraries require varying levels of manual I/O adaptation.

Library Comparison

| Library | Memory Support | Custom Callbacks | Complexity | Key Features |
| --- | --- | --- | --- | --- |
| libzip | ✅ Native | ❌ Not needed | Low | zip_fread(), zip_source_buffer_create(), zip_open_from_source() |
| libarchive | ✅ Native | ❌ Not needed | Low | Multi-format support, archive_read_data(), archive_write_data() |
| zlib | ✅ Native | ❌ Not needed | Low/Medium | uncompress() and streaming APIs, efficient for simple compression |
| minizip (part of zlib) | ✅ Native | ❌ Not needed | Medium | unzOpenMemory(), unzReadCurrentFile() for complete in-memory ZIP handling |
| LZMA SDK (7-Zip) | ❌ Manual | ✅ Required | High | Requires ISeqInStream (1 callback) or ILookInStream (4 callbacks), granular control |
| Windows Cabinet (CAB) APIs | ❌ Manual | ✅ Required | Medium | Core callbacks: Read, Write, Seek, plus CabNotify, simpler than LZMA |

Conclusion

Memory-based decompression in C/C++ requires understanding two distinct approaches: high-level libraries with native memory APIs, and low-level file-centric libraries requiring custom callbacks.

High-level libraries like libzip, libarchive, zlib, and minizip provide straightforward, efficient APIs for working directly with memory buffers. For most use cases, these should be your first choice.

When working with specialized formats or legacy libraries like LZMA SDK or Windows Cabinet APIs, custom callbacks bridge the gap. By implementing virtual file operations on memory buffers, you gain the same flexibility while maintaining full control over data flow.

Whether you're embedding archives in executables, processing network streams, or avoiding disk I/O for security reasons, these techniques enable efficient, flexible decompression workflows entirely in memory.

Apply these patterns in your projects: start with high-level APIs when available, and implement custom callbacks when necessary. With this foundation, you're equipped to build secure, performant decompression solutions in C/C++.
