Posted on • Originally published at zenn.dev

[Side B] Have you ever wanted to extract a ZIP file in memory? I have.

Author's Note:
I am a software engineer based in Japan. This article is an English translation (with AI assistance for clarity) of a development chronicle originally written for the Japanese developer community platform, Zenn.
Recently, I introduced my project, D-MemFS, on Reddit (r/Python), where it sparked intense architectural discussions. This response confirmed that in-memory I/O bottlenecks and OOM crashes are truly universal pain points for developers everywhere. Therefore, I decided to cross the language barrier and share these insights globally.

🧭 About this Series: The Two Sides of Development

In Japan, I publish this series across two distinct platforms to serve different developer needs. To provide the complete picture here on Dev.to, I've brought them together as two "Sides":

  • Side A (Practical / originally on Qiita): Focuses on the "How". Implementation details, benchmarks, and concrete solutions for practical use cases.
  • Side B (Philosophy / originally on Zenn): Focuses on the "Why". The development war stories, design decisions, and how I collaborated with AI through Specification-Driven Development (SDD).

15 to 20 Minutes. Every Single Time.

There is a Python desktop application I built for business use. I developed it about three years ago, and I have been modifying and using it ever since. One of its features is a process to "download a ZIP from the internet, extract it, and place the contents in appropriate locations."

The problem was the test waiting time that occurred every time I touched this app.

When you modify code, naturally, you test it. A single trial of that test took 15 to 20 minutes. Whether it was adding a feature, fixing a bug, or changing a specification, I had to wait over 15 minutes for every single operation check. Modify, wait, check the results, fix again, wait again. During development, this waiting time occurred over and over.

The bottleneck was obvious without even investigating.

Extracting the ZIP and copying files.

The downloaded file is saved to storage, extracted on that storage, and then copied to the appropriate location (including overwrites). It's entirely disk I/O. Of course it takes time.
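To make the bottleneck concrete, here is a minimal sketch of that disk-bound flow. This is a simplified reconstruction, not the app's actual code; the function name and paths are hypothetical:

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def deploy_zip_via_disk(zip_path: Path, dest_dir: Path) -> None:
    """Extract a downloaded ZIP on storage, then copy its contents into place."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        # Step 1: extract onto storage (a disk write for every entry)
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(tmp_dir)
        # Step 2: copy into the destination, overwriting existing files
        # (a disk read plus another disk write for every entry)
        shutil.copytree(tmp_dir, dest_dir, dirs_exist_ok=True)
```

Every file takes at least one round trip through storage during extraction and another during the copy, which is where the minutes go.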

"Wouldn't this be a lot faster if it were all done in memory?"

I Tried PyFilesystem2

The very first thing that came to mind was pyfilesystem2. The name itself suggests "doing file system-like things in Python," and it’s supposed to have memory FS features.

I installed it, wrote the code, and executed it.

UserWarning: pkg_resources is deprecated ...

A mysterious warning appeared. I looked into it and found that the deprecation of pkg_resources in setuptools 82 caused pyfilesystem2 to stop working properly. Looking at Issue #597, it has been left open since late 2024. It seems maintenance has stalled.

It worked when I started development, but before I knew it, it was broken. This is a classic external library story.

I Looked Elsewhere, But Nothing Clicked

I checked other libraries too.
However, including pyfilesystem2, they either had far more features than I needed or lacked the ones I did.

The thing that bothered me the most was the lack of quotas (capacity limits). Running in memory means that if something runs out of control, the process dies instantly from OOM (Out of Memory). Since I was developing this app with the intention of eventually putting it directly in end users' hands, having it "eat up all the memory and crash when fed invalid data" was out of the question. I wanted it to raise an error at a set limit, say "up to 100 MB max." I couldn't find a library with that kind of feature.

So I Decided to Build It Myself (Just Out of Curiosity)

I thought, "Then I'll just build a prototype and embed it directly into the app." Just out of curiosity.

Within a few hours, I had something that looked the part. The flow was: receive the ZIP with BytesIO, extract it into an in-memory directory structure, and read from there.
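The prototype itself isn't published, but the core idea can be sketched in a few lines: map each ZIP entry's path to its bytes in a plain dict (the function name here is mine, not actual D-MemFS code):

```python
import io
import zipfile

def extract_zip_to_memory(zip_bytes: bytes) -> dict[str, bytes]:
    """Extract a ZIP's files into a {path: bytes} mapping, entirely in RAM."""
    tree: dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not name.endswith("/"):  # directory entries carry no data
                tree[f"/{name}"] = zf.read(name)
    return tree
```

Reading a file back is then just a dict lookup, optionally wrapped in `io.BytesIO` when a file-like object is needed.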

I ran it.

It was fast.

It finished before I could even finish a cup of coffee. According to the logs, the whole run now completed in under 2 minutes. From 15-20 minutes down to less than 2, for the exact same process. The wait that used to dominate every test cycle was gone, and that alone drastically changed the development experience.

Part of me suspected, "Isn't this cheating?" But thinking about it, it's obvious: things get faster when you eliminate the round trips of writing to disk and reading back.
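A rough micro-benchmark illustrates the point. This is a sketch with synthetic data, not the app's workload; absolute numbers depend entirely on your machine and OS cache:

```python
import io
import shutil
import tempfile
import time
import zipfile
from pathlib import Path

def make_test_zip(n_files: int = 100, size: int = 32 * 1024) -> bytes:
    """Build a throwaway ZIP in memory so both paths start from the same bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for i in range(n_files):
            zf.writestr(f"data/file_{i}.bin", b"x" * size)
    return buf.getvalue()

def via_disk(zip_bytes: bytes) -> float:
    """Save to storage, extract on storage, copy into place; return elapsed seconds."""
    start = time.perf_counter()
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = Path(tmp) / "in.zip"
        zip_path.write_bytes(zip_bytes)
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(Path(tmp) / "extracted")
        shutil.copytree(Path(tmp) / "extracted", Path(tmp) / "dest")
    return time.perf_counter() - start

def via_memory(zip_bytes: bytes) -> float:
    """Extract straight from bytes into a dict; nothing touches storage."""
    start = time.perf_counter()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        tree = {name: zf.read(name) for name in zf.namelist()}
    return time.perf_counter() - start

zip_bytes = make_test_zip()
print(f"disk:   {via_disk(zip_bytes):.3f}s")
print(f"memory: {via_memory(zip_bytes):.3f}s")
```

On a workload dominated by many small files, the gap widens further, since per-file syscall and filesystem-metadata overhead is paid twice in the disk path.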

"Why Not Release It as a Library?"

After being satisfied embedding it in my app, I suddenly remembered.
"Wait, pyfilesystem2 is broken right now. Aren't there other people struggling with the exact same thing?"

Surveying PyPI again from the memory-FS angle, there are a few memory FS-related libraries, but they are all either unmaintained or limited in features. If I built this properly and published it, there might be demand for it.

Before I knew it, I started writing a design document, and I found I had rewritten it 13 times.

Here's What It Looks Like To Use

When you use the actually published D-MemFS (PyPI: dmemfs), it looks like this. The core MemoryFileSystem class implementation is in dmemfs/_fs.py, so feel free to check that out as well if you're interested.

import zipfile
import io
from dmemfs import MemoryFileSystem

# The byte sequence of a ZIP (assuming received from the network)
zip_bytes = b"..."  # requests.get(...).content, etc.

# Extract in memory (import_tree automatically creates parent directories)
mfs = MemoryFileSystem(max_quota=100 * 1024 * 1024)  # 100MB limit
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    tree = {f"/{name}": zf.read(name) for name in zf.namelist() if not name.endswith("/")}
    mfs.import_tree(tree)

# You can read it just like a normal file system
with mfs.open("/data/report.csv", "rb") as f:
    for line in f.read().decode("utf-8").splitlines():
        print(line)

If the quota is exceeded, an MFSQuotaExceededError is raised. This prevents "accidentally eating up all the memory."

from dmemfs import MemoryFileSystem, MFSQuotaExceededError

mfs = MemoryFileSystem(max_quota=1024)  # 1KB limit (for testing)
try:
    with mfs.open("/huge.bin", "wb") as f:
        f.write(b"x" * 2048)
except MFSQuotaExceededError as e:
    print(f"Capacity exceeded: {e}")

And So, My First OSS Release

When I built the prototype, my intention was merely to "just embed it in my own app." And yet, I rewrote the design document countless times, wrote an absurd number of tests, and even set up CI to test on macOS, a platform I had never used before...

For someone who had managed code in private repositories all along, placing what I had built myself into a "sort of public square" like GitHub takes a fair amount of courage.

Having been a systems engineer for a long time, this was my somewhat delayed OSS debut.
It's quite deeply moving.


πŸ”— Links & Resources

If you find this project interesting, a ⭐ on GitHub would be the best way to support my work!
