
Posted on • Originally published at zenn.dev

[Side B] Breaking Free from Vibe Coding Fatigue: A Practical Record of Building an OSS with 'Spec-First AI Development'

From the Author:
D-MemFS was featured in Python Weekly Issue #737 (March 19, 2026) under Interesting Projects, Tools and Libraries. Being picked up by one of the most widely-read Python newsletters confirmed that in-memory I/O bottlenecks and memory management are truly universal challenges for developers everywhere. This series is my response to that interest.

🧭 About this Series: The Two Sides of Development

To provide a complete picture of this project, I’ve split each update into two perspectives:

  • Side A (Practical / from Qiita): Implementation details, benchmarks, and technical solutions.
  • Side B (Philosophy / from Zenn): The development war stories, AI-collaboration, and design decisions.

Are You Realizing the Limits of "Vibe Coding"?

Having AI write code for us has become the norm. Throw a prompt at it, and it returns plausible code. It runs. The tests pass. It's convenient.

However, if you continue this way of "having AI generate code based on a vague vibe"—often called Vibe Coding—you will inevitably hit a wall at some point.

  • You don't understand the underlying principles of the code the AI generated.
  • Trying to fix a bug creates another bug.
  • You want to refactor, but have no standard for what must be protected.
  • In code reviews, you can't answer "Why is it done this way?"

This is Vibe Coding Fatigue. By continually accepting AI's output, the codebase eventually slips out of your control.

So, what should you do?

My answer was: "Write the design document before the code." Inspired by the idea "Write the specifications first," which I had heard somewhere last year, I arbitrarily called this "Spec-First AI Development." Before letting the AI write any code, we thoroughly iron out the specifications and design documents.

Why "Design Documents Before Code"?

When you suddenly let an AI write code, the following problems arise:

Problem 1: Reviewing AI-generated code is too difficult for humans.

AI yields hundreds of lines of code in an instant. But the human reviewing it must read and understand every single line. AI code frequently "works but the intention is unreadable." This is because the choices of variables and logic don't reflect human design intent.

On the other hand, reviewing a design document is overwhelmingly easier. If it says, "Throw an exception before writing if the quota is exceeded," you can assess that statement's validity in plain language. What's more, you can even have the AI itself review the design document.
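To show why such a design-document sentence is easy to review, here is what "throw an exception before writing if the quota is exceeded" might translate to in code. This is an illustrative sketch only; the class and method names are mine, not D-MemFS's actual API.

```python
class QuotaExceededError(Exception):
    """Raised when a write would push usage past the quota."""


class QuotaTracker:
    """Hypothetical quota bookkeeping, sketched from the design-doc sentence."""

    def __init__(self, quota_bytes: int) -> None:
        self.quota_bytes = quota_bytes
        self.used_bytes = 0

    def reserve(self, size: int) -> None:
        # Check BEFORE mutating any state, so a rejected write leaves
        # the file system untouched and no rollback is ever needed.
        if self.used_bytes + size > self.quota_bytes:
            raise QuotaExceededError(
                f"write of {size} B would exceed quota "
                f"({self.used_bytes}/{self.quota_bytes} B used)"
            )
        self.used_bytes += size


tracker = QuotaTracker(quota_bytes=10)
tracker.reserve(8)      # fine: 8/10 bytes used
try:
    tracker.reserve(4)  # would be 12/10 -> rejected up front
except QuotaExceededError as e:
    print(e)
```

Notice how directly the code mirrors the reviewed sentence: the check happens before the write, which is exactly the property a reviewer could validate in plain language.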

Problem 2: Code alone cannot guarantee "Design Philosophy."

Code states "what to do," but not "why it does so." When adding a new feature, AI might suggest an implementation that violates the existing design philosophy. If the philosophy is explicitly stated in the design document, you can instruct the AI to "think of a method based on the design philosophy." The design document becomes a guardrail.

For D-MemFS, I explicitly stated design philosophies like "Zero External Dependencies," "Safety First," and "Binary-Exclusive FS" in the design document. Even if AI suggested using a convenient external library, I could reject it with a single phrase: "It goes against the design philosophy."

Problem 3: Context Explosion.

Basic design documents, detailed design documents, test design documents—if you separate documents by phase or purpose, you don't have to make the implementing AI read all the documents. When implementing tests, you just need to hand over the test design document. The AI's context window is finite. If you flood it with irrelevant information, the accuracy of the crucial parts drops.

And the biggest advantage: Human understanding advances.

Through the process of bouncing ideas off the AI to refine the design document, the human's system comprehension deepens. "What should this class do?", "Where does this error occur?", "Is this operation thread-safe?"—you begin to hold the answers to these questions in your own mind. Consequently, the code output by the AI becomes remarkably easier to read. You can read it because you understand it.

Using Browser AIs as Sparring Partners

Let me talk about what I actually did to create the design documents.

The first thing I did was refine the raw idea. I uploaded the source file of the memory FS feature I had embedded in my personal app to both Gemini and ChatGPT, and a sparring session started: "If I were to carve this out as a library, what kind of design would be best?"

I summarized the sparring contents into Markdown and handed it to another AI. Spurred on by that, we sparred some more. Of course, I relentlessly interjected with my own questions.

Once the ideas were somewhat aligned, I had one of the AIs write the design document. From there, the two naturally settled into roles: one modifying the design document, the other reviewing it.

:::details [Instruction to Gemini]
Based on the following ideas and requirements,
please write a design document for a Python in-memory file system library.
Requirements: Quota management, Hierarchical directories, Thread-safe, No external dependencies
[Paste the Markdown of ideas]
:::

Gemini outputs a design document (v1). I throw that straight into ChatGPT.

:::details [Instruction to ChatGPT]
Please critique the following design document.
List problems, omissions, and design inconsistencies as much as possible.
[Paste Design Document v1]
:::

ChatGPT lists the problems. I hand that over to Gemini to let it fix them. Then I evaluate it with ChatGPT again. I repeated this back-and-forth. Sometimes I hit Gemini's Pro usage limits.

And I myself also relentlessly kept putting in my two cents.

"Why does this API look like this?", "Are these error types sufficient?", "What happens during concurrent writes?", "When is the quota calculated?" If a vague answer returned, I kept pressing for "more specifics." I didn't let "something roughly like this" slide.

There is one thing to note. If you keep sparring over technical details, the context fills up rapidly, and the AI's responses gradually get weird. A dark descent, so to speak. Before that happens, you have the AI generate a handover message and switch to a new chat. I did this "chat switching" numerous times.

Claude Opus Overturned Everything

When the design document had somewhat come together (around v7-v8), I asked Claude Opus to review it.

What came back was a massive amount of feedback.

Problems I couldn't catch in sparring with Gemini and ChatGPT surfaced one after another. Issues with lock granularity, naive rollback design upon quota excess, omissions in path normalization rules, latent race conditions in the asynchronous wrapper design...

It was a moment I realized, "This is fundamentally much harder than I thought."

From here, we repeated: modify → Opus review → modify → Sparring → modify... and the design document eventually reached v13. Note that I started implementation around v11 or v12.

Never Bend the Design Philosophy

Along the way, the topic of "Wouldn't it be convenient to add this feature too?" came up many times. Every single time, I returned to this question:

"Does this align with the original design philosophy?"

The design philosophy for D-MemFS was set from the beginning:

  1. Zero External Dependencies (stdlib only)
  2. Safety First (No dangerous operations provided)
  3. Not just a buffer, but a real FS with a hierarchical structure
  4. Direct extraction support for Archives (ZIP)

Any proposal conflicting with these principles, no matter how convenient it seemed, was entirely dropped.

I insisted on zero external dependencies because I had seen the failure of pyfilesystem2. It stopped working due to a single change in setuptools. Relying on a library means adopting that library's risks wholesale. By using only the stdlib, it runs as long as Python runs.

"Safety First" concretely manifests in decisions like "Not providing operations to wipe the entire file system" and "Rejecting operations that could result in path traversal attacks by default." It's better not to possess convenient but dangerous features from the start.
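As a rough illustration of "rejecting operations that could result in path traversal attacks by default," a virtual FS can normalize every path and refuse anything that tries to climb above the virtual root. This is my own sketch of the idea using only the stdlib, not D-MemFS's actual implementation.

```python
import posixpath


class UnsafePathError(ValueError):
    """Raised when a path would escape the virtual root."""


def normalize_virtual_path(path: str) -> str:
    # Normalize relative to the virtual root, collapsing "." and "..".
    norm = posixpath.normpath(path.lstrip("/"))
    # Any ".." surviving normalization means the caller tried to climb
    # above the root -> reject outright instead of silently clamping.
    if norm == ".." or norm.startswith("../"):
        raise UnsafePathError(f"path escapes the virtual root: {path!r}")
    return "/" if norm == "." else "/" + norm


print(normalize_virtual_path("docs/../notes.txt"))  # -> /notes.txt
```

Rejecting (rather than clamping) the escape attempt is the "Safety First" choice: a caller passing `../` is almost certainly doing something unintended, and failing loudly is safer than guessing.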

Looking Back, This Was SDD

To summarize, this is what I did:

1. Spar ideas with AI and write design docs (use multiple AIs)
2. Human thoroughly reviews the design docs
3. Generate code ONLY after the design doc is solidified
4. If a problem is found, fix the design doc first
5. Only after fixing the design doc, fix the code

I learned recently that this is precisely the method known in the industry as SDD (Specification Driven Development). While I casually labeled it "Spec-First AI Development" earlier, it seems it already had a proper name. While Vibe Coding is "letting AI write code by vibe," SDD is "controlling AI through specifications." What I was doing was practicing this SDD in collaboration with AI.

The critical points are 4 and 5. When a problem is found during the implementation phase, it's easy to just fix the code first. But doing so causes the design document and the code to drift apart. The design document turns into "a dream we wrote down at the start," misaligned with reality.

Therefore, when a problem is found, return to the design document first. Fix the design document saying, "This specification should be changed like this," and then fix the code according to that modification. The design document is permanently the "source of truth," and the code follows it. Maintaining this sequence was vital.

By doing so, the evaluation criteria for the code written by AI transforms from "Does it run?" to "Does it correctly implement the intentions of the design document?" I believe this is the framework for "using AI while trusting it."

In my personal opinion, I even feel that the design document is more important than the code. As long as the design document is correct, the code can be rewritten infinitely. Whether you let an AI write it or write it by hand, having the standard of a design document allows you to evaluate "if it's correct." Conversely, without a design doc, the standard itself to judge whether the code is correct doesn't exist.

"It Got Bigger Than I Expected"

To be honest, I initially envisioned something much smaller. My intention was just to "tidy up the prototype a bit and turn it into a library."

But the more I wrote the design document, the more things I realized I had to consider. If I seriously consider thread safety, I realize I need RW locks. By implementing RW locks, I naturally want to check if it runs properly in GIL-free Python (PYTHON_GIL=0). To check that, I need tests, and writing tests exposes the flaws in the design...
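To give a feel for the machinery that "thread-safe" pulls in: Python's stdlib has `threading.Lock` but no reader-writer lock, so one has to be built from a `Condition`. The following is a minimal sketch of the concept, assuming nothing about D-MemFS's real (and more involved) lock.

```python
import threading


class RWLock:
    """Minimal readers-writer lock: many readers OR one writer."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self) -> None:
        with self._cond:
            while self._writer:          # readers wait out any writer
                self._cond.wait()
            self._readers += 1

    def release_read(self) -> None:
        with self._cond:
            self._readers -= 1
            if self._readers == 0:       # last reader wakes waiting writers
                self._cond.notify_all()

    def acquire_write(self) -> None:
        with self._cond:
            while self._writer or self._readers:  # writer needs exclusivity
                self._cond.wait()
            self._writer = True

    def release_write(self) -> None:
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

Even this toy version raises the questions the article mentions: writer starvation, fairness, and whether it still behaves under free-threaded Python (PYTHON_GIL=0). That is exactly the kind of thing a design document has to pin down before the AI writes the real one.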

Before I knew it, what emerged was a library running purely on the standard library, supporting hierarchical directories, equipped with quota management, RW-locked, harboring an asynchronous wrapper, supporting free-threaded Python, with 369 tests and 97% coverage.

In v0.3.0, even a Memory Guard was introduced. While Hard Quotas manage the "budget within the virtual FS," the Memory Guard checks the host machine's physical memory and rejects writes proactively if space cannot be secured. Given that an OS will execute an OOM kill before reaching the quota if the set quota exceeds the machine's free memory, this feature became a logical necessity. Having written in the design document that "the quota will absolutely not be exceeded," I could no longer ignore the possibility of death outside the quota — this too is a consequence of "Spec-First." (The technical details will be discussed in the 3rd article.)
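To make the Memory Guard idea concrete, here is one way a stdlib-only check of the host's physical memory could look. This is NOT the v0.3.0 implementation; the names, the POSIX-only `os.sysconf` probe, and the safety margin are all my own assumptions.

```python
import os


class MemoryGuardError(MemoryError):
    """Raised when the host lacks physical headroom for a write."""


def available_physical_bytes() -> int:
    # POSIX-only approximation of free physical memory via sysconf;
    # other platforms would need their own probe.
    return os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def guard_write(size: int, *, reserve_ratio: float = 0.1) -> None:
    avail = available_physical_bytes()
    # Keep a safety margin so the process is rejected by the guard
    # long before the OS would reach for the OOM killer.
    if size > avail * (1 - reserve_ratio):
        raise MemoryGuardError(
            f"write of {size} B rejected: only {avail} B physically free"
        )
```

The point of the sketch is the ordering: the guard runs before the quota accounting ever matters, covering exactly the "quota larger than physical RAM" gap the design document exposed.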

The scary part of being Spec-First might be that if you try to do it right, you actually end up creating something truly solid.


🔗 Links & Resources

If you find this project interesting, a ⭐ on GitHub would be the best way to support my work!
