
YunHyeon Lee

Originally published at musgravte.hashnode.dev

The Unfiltered Log of Shipping Open-Source v2 with AI Agents

Six weeks, 146 commits, and every hallucination along the way


Sometime in 2024 I opened a GitHub notification from react-modern-audio-player, read the issue, started typing a reply, and closed the tab. I don't remember the issue. I remember the tab closing. That was the pattern for three years: someone would file a bug or ask about accessibility, and I would care about it for about ninety seconds before the weight of everything the library needed crushed the motivation to start.

The last real commit was v1.4.0-rc.2, February 2023. I had a full-time job, the library worked well enough, and nobody was furious enough to fork it. So it sat.

What changed

On March 1, 2026, I made the first commit of what became v2. The catalyst wasn't inspiration. It was Claude Code.

Not because Claude wrote the library for me. Because it lowered the activation energy enough that I could start. When the gap between "I should fix this" and "here is a concrete first step" shrinks from two hours of re-reading your own code to fifteen minutes of conversation, starting stops feeling impossible. That's the only honest way I can describe what happened.

react-modern-audio-player waveform with progress

The overhaul ran about six weeks. 146 commits, roughly 20 PRs numbered #31 through #53, a complete rewrite of the test infrastructure, the CSS layer, the bundle composition, and the accessibility surface. My primary tools were Claude (Opus and Sonnet) for code generation, refactoring, and review, and CodeRabbit for automated PR review. CodeRabbit runs a multi-model ensemble under the hood, selecting different frontier models per review task. I also consulted Gemini and GPT occasionally for documentation references and terminology checks, though they weren't part of the daily workflow.

Early on, I ran an experiment that changed how I worked with all of them.

Four models, four confessions

On 2026-03-18, I gave the same PR review task to four models side by side: Claude Sonnet, Gemini 3 Flash, GPT 5.3, and CodeRabbit. Every model got something wrong. Here is each one admitting it, in their own words:

Primary (daily drivers):

| Model | Role | Self-admission |
| --- | --- | --- |
| Claude Sonnet | Config file review | "I explained existing file content as if valid without verifying." |
| CodeRabbit (multi-model) | PR blocker | "My original blocker was wrong — your config is correct." |

Reference (doc checks, terminology):

| Model | Role | Self-admission |
| --- | --- | --- |
| Gemini 3 Flash | Doc consistency | "I mixed past data with current documentation." |
| GPT 5.3 | Bundle config | "I confused 'condition bundle' vs 'rule array' concept." |

CodeRabbit admitting its blocker was wrong after user pushback (PR #32)

On any given question, roughly two of four got it right. But the two that got it right rotated. Claude would nail a state management question and hallucinate about a Vite config. That experiment killed my trust in any single model's output. From that point on, my rule was: if Claude and CodeRabbit agree and the official docs confirm, proceed. Otherwise, run the code and find out. The validation stack wasn't overhead. It was the actual product of the overhaul.

What AI caught that I missed

The initial analysis of v1.4.0-rc.2 produced a scorecard I already half-knew but had never written down:

| Category | Score | Note |
| --- | --- | --- |
| Functionality | 9/10 | Feature-complete |
| Reliability | 4/10 | 0% test coverage |
| Performance | 6/10 | Degrades at scale |
| Accessibility | 1/10 | No WCAG support |
| Maintainability | 5/10 | Technical debt |
| Bundle Size | 5/10 | ~380 KB, 6 runtime deps |

Overall production-readiness: 5/10. I had shipped a feature-complete component that was nearly unusable for screen-reader users and had no safety net against regressions.

PR #41 addressed the accessibility gaps across player components. PR #42 split the React context and added memoization to fix a re-render storm I'd been ignoring. PR #43 replaced direct DOM mutations with React state. These weren't AI ideas, exactly. They were problems I already knew about, surfaced and organized by models that could scan the whole codebase in seconds instead of the hours it would have taken me to re-orient after three years away.
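To make PR #43's change concrete: moving from direct DOM mutation to React state means every transition goes through one pure function. Here is a minimal reducer sketch of that model; the names and shape are hypothetical, not the library's actual API:

```typescript
// Hypothetical player state reducer. All transitions are pure; an effect
// layer syncs this state to the <audio> element, never the reverse.
type PlayerState = {
  playing: boolean;
  currentTime: number;
  duration: number;
  volume: number;
};

type PlayerAction =
  | { type: "play" }
  | { type: "pause" }
  | { type: "seek"; time: number }
  | { type: "setVolume"; volume: number };

function playerReducer(state: PlayerState, action: PlayerAction): PlayerState {
  switch (action.type) {
    case "play":
      return { ...state, playing: true };
    case "pause":
      return { ...state, playing: false };
    case "seek":
      // Clamp so a bad drag can't seek outside the track bounds.
      return {
        ...state,
        currentTime: Math.min(Math.max(action.time, 0), state.duration),
      };
    case "setVolume":
      return { ...state, volume: Math.min(Math.max(action.volume, 0), 1) };
  }
}
```

Pure transitions like these are also what makes the context split in PR #42 pay off: components can subscribe to just the slice of state they render.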

The validation stack

Here is what I actually ran before tagging v2.0.0:

  • CodeRabbit on every PR, configured to block merge on unresolved findings
  • 32 test files (up from zero): unit tests via Vitest, integration with React Testing Library, end-to-end via Playwright
  • axe-core for automated accessibility checks
  • Manual VoiceOver testing on Safari
  • A docs-first workflow where I wrote the README change before the implementation
  • Cross-validation: no architectural decision accepted unless Claude and CodeRabbit agreed and official documentation confirmed
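For flavor, most of the unit layer covers small pure helpers. A sketch of the kind of thing those tests pin down, using a hypothetical `formatTime` helper and plain assertions (the real suite uses Vitest):

```typescript
// Hypothetical helper of the kind the unit tests cover:
// format a duration in seconds as m:ss for the time display.
function formatTime(totalSeconds: number): string {
  // Audio duration is NaN before metadata loads; treat that as 0:00.
  const safe = Number.isFinite(totalSeconds) && totalSeconds > 0 ? totalSeconds : 0;
  const minutes = Math.floor(safe / 60);
  const seconds = Math.floor(safe % 60);
  return `${minutes}:${String(seconds).padStart(2, "0")}`;
}

console.assert(formatTime(0) === "0:00");
console.assert(formatTime(61) === "1:01");
console.assert(formatTime(NaN) === "0:00"); // duration before metadata loads
```

The NaN case is exactly the sort of regression that zero test files let through for three years.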

That last one sounds paranoid. It saved me from shipping hallucinated configs at least three times.

Is it more robust now?

| Metric | v1.4.0 (Before) | v2.1.0 (After) |
| --- | --- | --- |
| Tests | 0 files | 32 files (unit + integration + e2e) |
| Bundle | ~380 KB, 6 runtime deps | ~79 KB unminified, 1 dep (wavesurfer.js) |
| Accessibility | 1/10, no ARIA | ARIA + keyboard + VoiceOver tested |
| Re-renders | All consumers on any state change | Split context + memoization |
| DOM control | Direct manipulation outside React | React state driven |
| Public API | None | useAudioPlayer() hook |

Bundle size: 380 KB to 79 KB

That works out to roughly 80% smaller, but I haven't verified it against gzipped production builds, so treat it as approximate.

I'm not going to claim a specific new accessibility score because I haven't run a formal audit against v2.1.0 yet. The keyboard interface covers play, pause, seek, and volume, and VoiceOver navigation works. Whether that's a 7 or an 8, I honestly don't know.
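The keyboard coverage described above reduces to a key-to-action table. A hedged sketch of that mapping, with hypothetical names rather than the shipped implementation:

```typescript
// Hypothetical key-to-action map for the keyboard interface:
// play/pause, seek, and volume, matching the coverage described above.
type KeyAction =
  | { type: "togglePlay" }
  | { type: "seekBy"; seconds: number }
  | { type: "volumeBy"; delta: number };

function actionForKey(key: string): KeyAction | null {
  switch (key) {
    case " ":
      return { type: "togglePlay" };
    case "ArrowRight":
      return { type: "seekBy", seconds: 5 };
    case "ArrowLeft":
      return { type: "seekBy", seconds: -5 };
    case "ArrowUp":
      return { type: "volumeBy", delta: 0.1 };
    case "ArrowDown":
      return { type: "volumeBy", delta: -0.1 };
    default:
      return null; // let unhandled keys bubble for normal browser behavior
  }
}
```

Returning `null` for unhandled keys matters for accessibility: swallowing every keydown breaks Tab navigation and screen-reader shortcuts.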

So yes, more robust. Also still a small library maintained by one person with a day job. The tests exist now, which means regressions will be caught. That is the win. Not "production-grade enterprise audio solution." Just: maintained.

"Just vibe-code your own player"

I've seen the argument. Why depend on a library at all when you can prompt an AI to generate a custom audio player in an afternoon? I've thought about it more than I'd like to admit, because if the answer is "no reason," then six weeks of my life just evaporated.

Here is what I think. A generated audio player works on the demo. Then you discover that Safari fires canplaythrough differently than Chrome, and your loading state breaks on iOS. Then you add wavesurfer.js for waveform rendering and find out its lifecycle hooks need careful cleanup or you leak memory on every track change. Then a user wants shuffle, repeat, and drag-to-reorder in the playlist, and suddenly you're maintaining a state machine. Then someone files an accessibility issue and you realize that ARIA attributes alone don't make a screen reader experience. Then you deploy on Next.js App Router and learn that half your hooks assume a browser environment.
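To make the "suddenly you're maintaining a state machine" point concrete, here is just the next-track decision once shuffle and repeat interact, as a pure TypeScript sketch (hypothetical, not the library's code):

```typescript
// Hypothetical next-track selection. Even this small matrix
// (repeat off/one/all × shuffle on/off) breeds edge cases.
type Repeat = "off" | "one" | "all";

function nextTrack(
  current: number,
  count: number,
  repeat: Repeat,
  shuffleOrder: number[] | null, // permutation of indices when shuffle is on
): number | null {
  if (repeat === "one") return current; // replay the same track
  const order = shuffleOrder ?? Array.from({ length: count }, (_, i) => i);
  const pos = order.indexOf(current);
  const nextPos = pos + 1;
  if (nextPos < order.length) return order[nextPos];
  return repeat === "all" ? order[0] : null; // null = stop at end of playlist
}
```

Now add drag-to-reorder and the shuffle permutation has to be rebuilt mid-playback without skipping or repeating a track. That is the compounding the paragraph above describes.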

Each one of these is a week. Not because any single problem is hard, but because they compound, and the generated version hasn't paid that tax yet. The question isn't whether you can build a player from scratch. Of course you can. The question is whether you want to maintain one from scratch, because that's what the six weeks actually bought: not a player, but the accumulated scar tissue of integration problems already solved, tested, and documented. Whether you'd rather spend those six weeks yourself or use a library that already has, I'll leave to you.

What I'm left with

react-modern-audio-player v2.1.0 shipped on April 14, 2026. It is still a small library. It is maintained now, which is more than I could say for three years. If you use it and something breaks, file an issue and I'll see it. I won't close the tab this time.

I don't know if AI-assisted maintenance scales to larger projects or longer timelines. I know it worked for this one, this time, with constant supervision. That's a narrower claim than I wanted to make, but it's the one the evidence supports.


P.S. This article was drafted with AI assistance (Claude) and then edited by hand. All metrics, commit references, and timeline claims were verified against the actual git history and project documentation.


Repo: https://github.com/slash9494/react-modern-audio-player
