Posted on • Originally published at zenn.dev

[Side B] Pursuing OSS Quality Assurance with AI: Achieving 369 Tests, 97% Coverage, and GIL-Free Compatibility

From the Author:
Recently, I introduced D-MemFS on Reddit. The response was overwhelming, confirming that memory management and file I/O performance are truly universal challenges for developers everywhere. This series is my response to that global interest.

🧭 About this Series: The Two Sides of Development

To provide a complete picture of this project, I’ve split each update into two perspectives:

  • Side A (Practical / from Qiita): Implementation details, benchmarks, and technical solutions.
  • Side B (Philosophy / from Zenn): The development war stories, AI-collaboration, and design decisions.

Testing is a "Contract between the Design Document and the Code"

Why do we write tests? "To prevent bugs" is a correct answer, but I want to phrase it differently.

I believe tests are a contract between the design document and the code.

In the context of "Spec-First AI Development" that I wrote about in the previous article—a method I later learned is called SDD (Specification Driven Development)—testing is the embodiment of the specification. The design document says, "If the quota is exceeded, throw an MFSQuotaExceededError." The code implements that. Testing is a mechanism to eternally verify "is it really doing exactly that?" I think software is trustworthy only when the triangle of Design Document -> Test -> Code aligns consistently.
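The quota example above can be written down as a spec-derived test. Below is a minimal sketch of that contract using a tiny stand-in class (`TinyFS` is hypothetical and NOT the dmemfs implementation; only the exception name `MFSQuotaExceededError` and the quoted spec sentence come from the article):

```python
class MFSQuotaExceededError(Exception):
    """Stand-in for the real dmemfs exception named in the spec."""

class TinyFS:
    """Hypothetical minimal stand-in, not the dmemfs implementation."""

    def __init__(self, max_quota):
        self.max_quota = max_quota
        self.used = 0

    def write(self, path, data):
        # Spec: "If the quota is exceeded, throw an MFSQuotaExceededError."
        if self.used + len(data) > self.max_quota:
            raise MFSQuotaExceededError(path)
        self.used += len(data)

def test_quota_contract():
    fs = TinyFS(max_quota=16)
    fs.write("/ok.txt", b"x" * 16)   # exactly at the quota: allowed
    try:
        fs.write("/over.txt", b"y")  # one byte over: must raise
    except MFSQuotaExceededError:
        return
    raise AssertionError("spec violated: quota overflow did not raise")
```

The test quotes the spec sentence in a comment, so a reviewer can check the expected behavior against the design document without leaving the file.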

Precisely because strict specifications (SDD) exist, AI can propose exhaustive test cases, and a human can judge their validity (supervise) against those specifications. Without a specification, there is no standard to determine whether the test cases output by the AI are "correct."

In D-MemFS, I derived the vast majority of the test cases directly from the design document. Reading through each section, I kept asking, "What should be tested to verify this specification?" This property, where every test can be traced back to the design document, is called traceability.

AI Gives Us Tons of Test Cases (But Needs Supervision)

If you have Claude Opus or GitHub Copilot write test cases, they will give you a massive amount, including edge cases humans wouldn't readily think of.

For example, when asking about a simple operation like "deleting a directory":

  • Delete an empty directory
  • Delete a directory with contents
  • Attempt to delete a non-existent directory
  • Attempt to delete the root directory
  • While deleting, another thread tries to write to the same path
  • Attempt to delete using symbolic link-like paths (paths with ..)

...exhaustive cases line up like this. There’s always one where you think, "Ah, I hadn't thought of that."
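As a concrete illustration, several of these cases can be pinned down mechanically. The sketch below uses the standard library's on-disk directory semantics as a stand-in, not the dmemfs API:

```python
import os
import tempfile

def run_deletion_cases():
    """Exercise a few directory-deletion edge cases against the real OS."""
    results = {}
    with tempfile.TemporaryDirectory() as base:
        # Delete an empty directory: succeeds
        empty = os.path.join(base, "empty")
        os.mkdir(empty)
        os.rmdir(empty)
        results["empty"] = not os.path.exists(empty)

        # Delete a directory with contents: raises OSError
        full = os.path.join(base, "full")
        os.mkdir(full)
        with open(os.path.join(full, "f.txt"), "w") as f:
            f.write("x")
        try:
            os.rmdir(full)
            results["non_empty"] = "no error"
        except OSError:
            results["non_empty"] = "OSError"

        # Delete a non-existent directory: raises FileNotFoundError
        try:
            os.rmdir(os.path.join(base, "missing"))
            results["missing"] = "no error"
        except FileNotFoundError:
            results["missing"] = "FileNotFoundError"
    return results
```

Each bullet in the list maps onto one branch like these; the concurrency and path-traversal cases need the virtual FS itself, which is exactly why the AI-generated list is a starting point rather than a finished suite.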

However, you must not blindly trust the tests an AI writes.

An AI writes "tests that look correct." But when its understanding of the specification is wrong, it asserts incorrect expected values. In particular, I often found it reversing "cases that should error" and "cases that should run normally." Verifying them one by one against the design document is tedious, but if you don't, it leads to the false conviction that "the tests passed, therefore it is correct."

AI provides speed. Humans verify the direction. This division of roles worked perfectly.

Thread-Safe Tests Look Like This

Since I touted support for GIL-free Python (PYTHON_GIL=0), I had to prove it doesn't break even in multithreading.

import threading
from dmemfs import MemoryFileSystem

def stress_test():
    # One shared instance with a 50 MiB quota, hammered by all threads
    mfs = MemoryFileSystem(max_quota=50 * 1024 * 1024)
    errors = []

    def worker(thread_id):
        for i in range(1000):
            path = f"/thread_{thread_id}/file_{i}.txt"
            try:
                mfs.mkdir(f"/thread_{thread_id}", exist_ok=True)
                # Write, then read back and verify round-trip integrity
                with mfs.open(path, "wb") as f:
                    f.write(f"data-{thread_id}-{i}".encode())
                with mfs.open(path, "rb") as f:
                    data = f.read()
                assert data == f"data-{thread_id}-{i}".encode()
            except Exception as e:
                errors.append(e)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(50)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Any exception on any thread fails the whole stress test
    assert not errors, f"Thread errors: {errors}"

50 threads × 1000 iterations. Running this under PYTHON_GIL=0 (free-threaded Python) and passing was a requirement.

Why is this important? In the free-threaded mode (PYTHON_GIL=0) introduced in Python 3.13, the "implicit lock" known as the GIL disappears. In normal Python, thanks to the GIL, inter-thread conflicts rarely surface. But when the GIL is removed, if your RW lock implementation has even the slightest flaw, a race condition occurs instantly.

D-MemFS's Read/Write locks are designed as explicit locks independent of the GIL. By executing the quota verification and reservation atomically under a single lock, the typical conflict of "another thread jumping in between verification and writing" was eliminated. This design was solidified around v8 of the design document when Claude Opus's review pointed out the potential for race conditions.
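That check-and-reserve pattern can be sketched in a few lines (a simplified stand-in to show the idea, not the actual D-MemFS lock internals):

```python
import threading

class QuotaReserver:
    """Simplified sketch: verify and reserve quota atomically under one lock."""

    def __init__(self, max_quota):
        self._lock = threading.Lock()
        self._max = max_quota
        self._used = 0

    def reserve(self, nbytes):
        # The check and the update happen inside a single lock acquisition,
        # so no other thread can slip in between verification and the write.
        with self._lock:
            if self._used + nbytes > self._max:
                return False
            self._used += nbytes
            return True

    def release(self, nbytes):
        with self._lock:
            self._used -= nbytes
```

Splitting the check and the update into two separate lock acquisitions would reintroduce exactly the "another thread jumping in between verification and writing" conflict described above.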

Indeed, in the initial implementation, a few problems emerged in the GIL-free environment, and I fixed them. GIL-free compatibility is currently a massive topic across the entire Python community. I believe whether a library can claim to be "GIL-free safe" will become a metric of reliability going forward.

Testing the Memory Guard Was a Battle Against "Environmental Dependencies"

The Memory Guard feature (memory_guard parameter), added in v0.3.0, checks the host machine's physical memory balance. If a Hard Quota manages the "budget within the virtual FS," the Memory Guard verifies "whether the machine has the luxury to pay that budget in the first place."

What made this testing troublesome was that available physical memory differed in every test environment. Free memory on the CI Ubuntu runners was completely different from my local Windows machine. So, using unittest.mock.patch, I mocked get_available_memory_bytes and artificially created situations like "a state with only 100 bytes of free memory." That secured reproducibility independent of the environment. Because the design document had stated from the start that the Memory Guard relies on OS memory info, I could design the tests presupposing a mock.
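The shape of that test looks roughly like this. `MemoryGuard` and its methods below are hypothetical stand-ins following the article's description (only the helper name `get_available_memory_bytes` comes from the article), not the dmemfs source:

```python
from unittest import mock

class MemoryGuard:
    """Hypothetical stand-in for the memory_guard feature described above."""

    def get_available_memory_bytes(self):
        # In the real library this would query the OS; tests mock it out.
        raise RuntimeError("real OS query; should be mocked in tests")

    def allows(self, required_bytes):
        # Does the host have enough free physical memory to "pay the budget"?
        return self.get_available_memory_bytes() >= required_bytes

def test_memory_guard_with_mock():
    guard = MemoryGuard()
    # Simulate "a state with only 100 bytes of free memory",
    # independent of the machine the test actually runs on.
    with mock.patch.object(guard, "get_available_memory_bytes", return_value=100):
        assert guard.allows(100) is True
        assert guard.allows(101) is False
```

The test never touches the real OS, so it produces the same result on a loaded CI runner and on an idle workstation.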

What is the "Remaining 3%"?

When I write "97% coverage," I know I'll be asked "Why not 100%?", so I'll answer that in advance.

The remaining roughly 3% consists of defensive code: error handling placed "just in case" on paths that are theoretically unreachable. For example:

# Fallback when internal state breaks
# A path that will absolutely not be traversed under normal usage
if self._internal_state is None:
    raise MFSInternalError("Internal state is corrupted (Possible bug)")

To let tests reach such defensive code, you'd have to manually break the object's internal state, which defeats the purpose of the test. Rather, this is code that is correct precisely because it is never reached.

Rather than obsessing over 100%, I feel it's more honest to report 97% while making clear what the remaining 3% is.

A Matrix of 3 OS × 3 Python Versions

I configured my CI like this:

| OS | Python 3.11 | Python 3.12 | Python 3.13 |
| --- | --- | --- | --- |
| Ubuntu | ✅ | ✅ | ✅ |
| Windows | ✅ | ✅ | ✅ |
| macOS | ✅ | ✅ | ✅ |

I verified that all 369 tests pass across a total of 9 environments. Python 3.13 includes execution on the free-threaded build (PYTHON_GIL=0).
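A matrix like this takes only a few lines to express in GitHub Actions. The fragment below is a hypothetical sketch of such a workflow, not the project's actual CI config (the free-threaded 3.13 run would need additional setup on top of this):

```yaml
jobs:
  test:
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        python-version: ["3.11", "3.12", "3.13"]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - run: python -m pytest
```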

Setting this up myself was more tedious than it sounds. Tests failed particularly over nuances of path handling between Windows and macOS (path separators, case sensitivity vs. insensitivity). Setting "multi-OS support" as an initial requirement, rather than settling for "it passes in Linux CI," proved to be the right call.

The Moment All Tests Turned Green

Even now, looking back at the screenshot of the moment all 369 tests turned green after I fixed the final failing test and reran CI, I feel a little emotional.

If I write "I built it hand-in-hand with AI," people might assume "So in the end, the AI did everything?" But the reality is different. AI offers speed and exhaustiveness. However, it was the human that judged what was correct.

Which specifications in the design document to implement. Which expected values in a test are correct. Judging whether a bug stems from a design issue or a code flaw. Deciding "not to implement this feature." All of these decisions could only be made because there was a standard called the design document.

Treat AI like an incredibly fast rookie who still needs supervision. Treat the design document as the task specification handed to that rookie. This combination functioned remarkably well.

Conclusion: With a Design Doc, I Feel We Can Trust AI a Bit More

  • Maintain the triangle: Design Document -> Test -> Code
  • Leave speed and exhaustiveness to the AI
  • Keep judgment and philosophy with the human

This gave birth to D-MemFS. 369 tests, 97% coverage, and zero external dependencies.

As someone who used to solely work inside Azure DevOps, I published an OSS on GitHub for the very first time. I rewrote the design document 13 times, sparred hundreds of times with AI, and reviewed the design doc every time a test failed.

It all started as "just speeding up a small script," and before I knew it, I had built something like this.


🔗 Links & Resources

If you find this project interesting, a ⭐ on GitHub would be the best way to support my work!
