I recently published Science Catch-Up — a 16-chapter essay examining the limits of the scientific method and proposing a framework for evaluating knowledge without waiting for consensus. The full essay was written with AI assistance: first with OpenAI Codex (GPT 5.2/5.3), then with Claude Code (Opus 4.5/4.6).
The difference in quality was striking. Not in the way you'd expect from a code generation comparison — but in the far more demanding territory of prose.
The essay is available on Payhip (English) and Payhip (Spanish).
The project
Science Catch-Up is not a light read. It's a combative, heavily referenced essay that critiques scientism as a power structure, traces the cost of institutional dogma, and proposes operational criteria for evaluating informal knowledge. The tone had to be precise: direct without being conspiratorial, critical without being anti-science, and provocative without turning into a pamphlet.
That level of nuance is exactly where AI writing gets tested — and where the differences between tools become impossible to ignore.
Phase 1: Codex (GPT 5.2/5.3) — the rough start
The first drafts were written using OpenAI Codex. The initial structure and chapters came together, but the problems piled up fast.
Verbose, repetitive commit messages tell the story. Compare an early GPT-era commit:
"Se amplía la sección sobre biohacking, clarificando las categorías de restauración, mitigación y deuda biológica. Se añaden ejemplos prácticos para cada tipo y se enfatiza la distinción entre restaurar, mitigar y endeudarse, mejorando la comprensión del impacto fisiológico de estas prácticas." (Roughly: "Expands the biohacking section, clarifying the categories of restoration, mitigation, and biological debt. Adds practical examples for each type and emphasizes the distinction between restoring, mitigating, and incurring debt, improving understanding of the physiological impact of these practices.")
with a later Opus-era commit:
"fix: QA polish — glossary term, font consistency, cover in PDF, new food pyramid"
Same repo, same author, different tool. The commit messages mirror the prose itself.
The specific problems with GPT's writing:
- Repetitive expressions everywhere. Words like "precisamente", "en el fondo", and repeated subject openings ("Science Catch-Up propone...", "El marco establece...") appeared in clusters. I eventually had to do a dedicated cleanup pass (commit 61684f4: "Limpiar patrones repetitivos ChatGPT: precisamente, sujetos repetidos, en el fondo", i.e. "Clean up repetitive ChatGPT patterns: 'precisamente', repeated subjects, 'en el fondo'").
- The "Ejemplo:" pattern. GPT consistently inserted the label "Ejemplo:" before illustrative cases, even when the editorial criteria explicitly said to integrate examples into flowing prose ("Por ejemplo...", "En la práctica..."). This rigid formatting was one of the hardest habits to stamp out because it kept reappearing.
- Bland, conciliatory tone. The essay needed to be combative and direct. GPT kept softening the edges, adding disclaimers ("this is not anti-science"), and producing what read like an academicized version of ideas that were originally sharp and provocative.
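The first of those problems can be partly mechanized. Here is a minimal sketch of the kind of phrase-frequency check that could flag such tics before a manual pass; the phrase list, threshold, and sample draft are illustrative, not taken from the actual repo:

```python
import re
from collections import Counter

# Illustrative tic phrases from the cleanup commit; extend with
# whatever patterns your own drafts accumulate.
TIC_PHRASES = ["precisamente", "en el fondo", "Science Catch-Up propone"]

def find_tics(text, phrases=TIC_PHRASES, threshold=2):
    """Count each tic phrase (case-insensitively) and return those
    appearing at least `threshold` times."""
    counts = Counter()
    lowered = text.lower()
    for phrase in phrases:
        counts[phrase] = len(re.findall(re.escape(phrase.lower()), lowered))
    return {p: n for p, n in counts.items() if n >= threshold}

draft = (
    "Precisamente por eso, el marco establece criterios. "
    "En el fondo, se trata de evidencia. Precisamente. En el fondo, no."
)
print(find_tics(draft))  # {'precisamente': 2, 'en el fondo': 2}
```

Nothing like this catches tone, of course; it only surfaces the mechanical repetitions so the human pass can focus on judgment calls.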
Phase 2: Claude Code (Opus 4.5/4.6) — a different league
When I switched to Claude Code with Agent Teams, the improvement was immediate. The prose was more natural, the tone closer to what I wanted, and the adherence to editorial guidelines much stronger.
After several iterations, I created a formal editorial criteria document — covering tone, argumentative structure, how to handle examples (three named criteria for narrative, schematic, and hybrid styles), referencing standards, and the essay's rhetorical direction. Claude Code followed these criteria consistently once they were established.
What Claude Code got right:
- Tone adherence. The combative, no-apologies voice came through naturally. Less defensive hedging, more direct argumentation.
- Structural intelligence. When given a chapter outline and editorial criteria, it produced content that respected the flow and built on previous sections.
- Research and references. Excellent at finding and integrating relevant sources, formatting bibliography entries, and maintaining consistency across chapters.
- Grammar and spelling. Essentially flawless in both Spanish and English — a non-trivial advantage for a bilingual publication.
- Programmatic figures. Most of the essay's diagrams and charts were generated by Claude Code using matplotlib scripts — the evidence pyramid, the Science Catch-Up cycle, the cascade of patches diagram. Only one figure and the cover itself were created with Gemini.
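For a sense of what those scripts look like, here is a minimal matplotlib sketch of a pyramid-style figure. The layer labels and proportions are invented for illustration and are not the essay's actual evidence pyramid:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted figure generation
import matplotlib.pyplot as plt

# Hypothetical layers, bottom to top; widths shrink toward the apex.
layers = ["Anecdote", "Observational", "Controlled trials", "Meta-analysis"]
widths = [1.0, 0.75, 0.5, 0.25]

fig, ax = plt.subplots(figsize=(5, 4))
for i, (label, w) in enumerate(zip(layers, widths)):
    # Center each bar horizontally so the stack forms a pyramid.
    ax.barh(i, w, left=(1 - w) / 2, height=0.9,
            color=plt.cm.Blues(0.3 + 0.15 * i))
    ax.text(0.5, i, label, ha="center", va="center")
ax.set_xlim(0, 1)
ax.axis("off")
fig.savefig("evidence_pyramid.png", dpi=150, bbox_inches="tight")
```

The appeal of this approach over image-generation tools is reproducibility: tweak a label or a proportion, rerun the script, and the figure regenerates identically in both language editions.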
Where it still fell short:
- Subagent context loss. When dispatching multiple subagents in parallel (a common pattern with Agent Teams), individual agents would sometimes write sections in isolation, losing the narrative thread or producing content that didn't flow with surrounding chapters. The result read like separate authors had written adjacent paragraphs.
- Still needs heavy review. Even with Claude Code's better output, I reviewed and revised every paragraph. The ideas and instructions were mine; the AI's role was transforming rough ideas into developed content, proposing structure, and doing research. I'd estimate that 95%+ of the text was written verbatim by the AI, but every sentence was validated or adjusted by me.
Prose vs code: why writing is harder
This project crystallized something I'd been sensing: AI-assisted prose requires more oversight than AI-assisted code.
In code, functionality is king. If a function works, passes tests, and handles edge cases, style matters but is secondary. Modularity, naming conventions, and code quality still count, but they're easier to achieve and verify than their prose equivalents.
In prose, what you say and how you say it are inseparable. A paragraph that communicates the right idea but in a bland, hedging, or repetitive way is a failure — even though the "functionality" (conveying information) works. There's no test suite for tone. No linter for rhetorical punch. No CI pipeline that catches "this sounds like it was written by ChatGPT."
I found myself spending far more time reviewing prose than I ever do reviewing AI-generated code. Every sentence had stakes that a line of code doesn't.
The translation test
The essay was originally written in Spanish. The English translation was done with Claude Code and it was remarkably fast — the structure, references, and formatting carried over cleanly.
The interesting challenges were cultural, not technical:
- Catchy phrases needed adaptation, not literal translation. Punchlines that worked in Spanish sometimes needed complete rethinking in English.
- Domain acronyms: the essay introduces CSV (Ciencias de Sistemas Vivos) and CSI (Ciencias de Sistemas Inertes) in Spanish. In English these became OSS (Organic System Sciences) and ISS (Inert System Sciences) — a deliberate choice that required discussion.
The iterative process
One thing worth noting: the writing was never a single-pass affair. The typical cycle was:
1. Draft — AI writes a reasonable first version from my outline and notes
2. Ideas emerge — reading the draft triggers new thoughts, missing angles, additional references
3. Expand — feed those back in, ask for specific additions or restructuring
4. Review — catch tone drift, repetitive patterns, weak arguments
5. Polish — final pass for consistency with editorial criteria
This loop happened per chapter and across the whole essay. AI is extraordinary at steps 1 and 3 — transforming raw ideas into developed content. But steps 2, 4, and 5 remain fundamentally human.
The output
Beyond the essay itself, the AI-assisted pipeline produced:
- PDF and ePub builds for both languages
- Amazon KDP formatting with programmatic cover generation
- Audiobooks in both languages using Google's Aoede TTS
- Publication metadata for multiple platforms
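The build side of that pipeline is easy to script. Here is a sketch of how the per-language, per-format builds might be assembled, assuming a pandoc-based toolchain; the file names and flags are illustrative, not the project's actual ones:

```python
from pathlib import Path

# Hypothetical manuscript sources per language.
SOURCES = {"en": Path("science-catch-up.en.md"),
           "es": Path("science-catch-up.es.md")}
# Extra pandoc flags per output format.
FORMATS = {"pdf": ["--pdf-engine=xelatex"], "epub": ["--toc"]}

def build_commands(sources=SOURCES, formats=FORMATS):
    """Assemble one pandoc invocation per (language, format) pair."""
    commands = []
    for lang, src in sources.items():
        for ext, extra in formats.items():
            out = src.with_suffix(f".{ext}")
            commands.append(["pandoc", str(src), "-o", str(out), *extra])
    return commands

for cmd in build_commands():
    print(" ".join(cmd))
```

Each command list can then be handed to subprocess.run(); keeping assembly separate from execution makes the language-by-format matrix easy to inspect or dry-run.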
The Spanish audiobook is already on Spotify. The English one is coming — I'll do separate posts for the audiobook editions.
Takeaways
- Codex (GPT) produced noticeably worse prose — repetitive, bland, and resistant to style guidelines
- Claude Code (Opus) was significantly better — closer to the desired tone, better at following editorial criteria, stronger structural awareness
- Neither replaces editorial judgment — the human loop of reviewing, rethinking, and refining is non-negotiable for quality prose
- AI shines at transformation — turning rough ideas into structured content, researching references, handling grammar and spelling, generating figures programmatically
- Prose requires more human oversight than code — because style, tone, and rhetorical effectiveness have no automated tests
- Translation was the easiest part — structure carries over cleanly; only cultural nuances and catchy phrases needed real thought
The essay is a serious, opinionated piece of work. Whether you agree with its thesis or not, the process of writing it taught me more about AI-assisted creation than any coding project has.
Science Catch-Up is available on Payhip (English edition) and Payhip (Spanish edition).
Interested in AI agent architectures? Get in touch.
Originally published on javieraguilar.ai
Want to see more AI agent projects? Check out my portfolio where I showcase multi-agent systems, MCP development, and compliance automation.
