Tejas

Posted on May 24

Under the Hood: Demystifying Database Internals (InnoDB vs. ARCHIVE)

#systemdesign #database #intenal #systems

Most developers treat database engines as black boxes: you write a query, and data magically returns. But when data scales to millions of rows, understanding how that data sits on physical hardware changes how you build systems.

If you throw every single piece of data into your default database setup, your storage costs will eventually skyrocket and your queries will slow down.

To build systems that scale, you need to understand two completely opposing storage philosophies: Page-Based Transactional Storage (InnoDB) and Stream-Based Compressed Storage (ARCHIVE), and how they connect using Table Partitioning.

1. The InnoDB Hierarchy (From Bytes to Tablespaces)

MySQL’s default engine, InnoDB, is built for speed, multi-user accuracy, and heavy read/write traffic. It doesn't just append raw text strings to a file. It manages data through a highly structured, strict multi-tiered hierarchy.

The Storage Stack

Tablespaces: This is the highest logical container. When you turn on modern settings like innodb_file_per_table, every single database table gets its own physical .ibd file on your hard drive.
Extents: To prevent your operating system's filesystem from scattering your table across different physical sectors of your drive, InnoDB allocates space in chunks called Extents. Each extent is exactly 1 Megabyte in size and bundles multiple continuous pages together, ensuring that sequential data stays physically close on disk.
Pages: The Page is the core atomic unit where InnoDB does its actual work. By default, an InnoDB page is exactly 16 Kilobytes. Whenever InnoDB reads or writes data, it loads entire 16KB pages into your server's RAM (the Buffer Pool), even if you only requested a single row.
Rows: At the absolute base of this structure are individual rows, packed tightly inside Data Pages. Because InnoDB uses a Clustered Index B+Tree, your rows are physically sorted and stored on disk by their Primary Key.

2. The ARCHIVE Engine (The Anti-Page Approach)

What if you don’t need to update data? What if you are storing billions of system logs, clickstreams, or audit trails that you only look at when something breaks?

Throwing that into InnoDB is incredibly wasteful because index trees and 16KB page structures bloat your disk usage. The ARCHIVE engine completely flips InnoDB's design on its head to solve this exact problem:

No Fixed Pages: Rather than structuring data into strict 16KB blocks, ARCHIVE treats your data as a continuous, unbounded append-only binary byte stream saved inside an .arz file.
On-the-Fly Stream Compression: When data is inserted, it passes through an in-memory compression buffer. Trailing spaces are stripped out, and an optimized bit-header handles NULL values. The raw rows are then compressed on-the-fly using the zlib algorithm before hitting the disk.
The Index Trade-off: To maintain its tiny footprint, ARCHIVE allows zero secondary indexes. The only permissible index is an AUTO_INCREMENT column.

Because it acts as a raw compressed string of text rather than an indexed block structure, ARCHIVE routinely achieves 3:1 to 10:1 compression ratios compared to InnoDB.

3. Architecture Comparison Matrix

Architectural Feature	InnoDB Engine	ARCHIVE Engine
Storage Layout	Strict 16KB Pages / 1MB Extents	Continuous `zlib` Compressed Byte Stream
Primary File Format	`.ibd` (Data + Indexes combined)	`.arz` (Data) + `.frm` (Metadata)
Write Model	B+Tree Insertion, Page Splitting	Append-Only Stream via Memory Buffer
Supported Mutations	`INSERT`, `SELECT`, `UPDATE`, `DELETE`	`INSERT`, `SELECT` only (WORM model)
Locking Mechanism	Fine-grained Row Locks (MVCC)	Interlaced Row Stream Locking

4. Connecting the Worlds: Table Partitioning & Exchange

Now that you understand both engines, you can combine these mental models into a Hot/Cold Data Tiering System. The goal is simple: Keep recent data (Hot) in InnoDB so it can be updated and indexed quickly. Move older data (Cold) to alternative storage so it takes up less disk space.

Since MySQL doesn't allow you to mix storage engines inside a single partitioned table, we use a strategy called Partition Exchange. This is a metadata-only swap that transfers millions of rows in less than a second.

Step A: Create the Partitioned InnoDB Table

We partition an active InnoDB table by months using a timestamp or date:

CREATE TABLE system_logs (
    log_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    log_date DATE NOT NULL,
    subsystem VARCHAR(50) NOT NULL,
    message TEXT NOT NULL,
    PRIMARY KEY (log_id, log_date)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(log_date)) (
    PARTITION p_past_month VALUES LESS THAN (TO_DAYS('2026-05-01')),
    PARTITION p_current_month VALUES LESS THAN (TO_DAYS('2026-06-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

Step B: The Instant Partition Exchange

When a month ends, you want to compress that older data. Running a massive DELETE statement would lock your tables and spike your CPU. Instead, you create an empty, identical InnoDB staging table, and instantly swap the old partition out:

-- 1. Create a temporary staging table matching the schema
CREATE TABLE system_logs_stage LIKE system_logs;
ALTER TABLE system_logs_stage REMOVE PARTITIONING;

-- 2. Instantly swap the old partition into the staging table (Takes < 1 second)
ALTER TABLE system_logs EXCHANGE PARTITION p_past_month WITH TABLE system_logs_stage;

-- 3. Drop the now empty partition from the active table
ALTER TABLE system_logs DROP PARTITION p_past_month;

From here, you can pipe the contents of system_logs_stage directly into an ARCHIVE-backed table to keep your cold logs compressed natively in the database.

5. Production Reality Check: Where the Industry is Today

Is Partition Exchange actually used in production? Yes, every single day. It is the gold standard for dropping or isolating historical data without locking your live tables.

However, while the ARCHIVE engine is an awesome architectural blueprint for stream compression, modern cloud infrastructure teams usually handle the "cold" tiering differently. Today, production database storage and IOPS (Input/Output Operations Per Second) are incredibly expensive. Engineers don't want cold, historical log data sitting on the same live production database instance, even if it is compressed.

Instead, companies use Partition Exchange to isolate old data into a staging table, stream it completely out of MySQL into a cloud data lake (like AWS S3) formatted as hyper-compressed columnar Apache Parquet files, and then drop the staging table entirely. This keeps the live production database lean, fast, and remarkably cheap to run.

Conclusion

When you are building side projects or working through software engineering courses, it is easy to just default to standard database configurations. But looking under the hood changes how you approach application performance.

Understanding the tension between InnoDB's rigid page allocations and ARCHIVE's fluid compression streams helps you think about data access patterns critically - matching the right storage strategy to the right workload is how you move from just writing standard CRUD apps to engineering optimized, production-ready systems.

Top comments (1)

Tejas • May 24

Note: Some implementation details (especially around the ARCHIVE engine and metadata files) vary across MySQL versions, the features have improved obv across the years