Joshua Gutierrez

Posted on May 19

Why Most AI Writing Tools Quietly Fail

#ai #socialmedia #softwareengineering #softwaredevelopment

I spent a session this week tearing apart one of our own systems, and the post-mortem turned into a thesis I keep coming back to:

Most AI writing tools are optimized for transformation, not preservation.

Our adapter looked correct on the surface. You write a post, click “Auto Adapt,” and out come Twitter, LinkedIn, and Threads versions. Short. Clean. Under the character limit. Technically successful.

Semantically wrong.

I noticed it when I fed the adapter a post about real before-and-after SEO scores from our portfolio. The original had four specific score deltas, three domain names, Core Web Vitals data, and a thesis about why HTML-parsing audit tools miss what real-browser ones catch. Evidence-heavy.

The adapter compressed all of it into:

"Every one jumped."

That line bothered me. Not because it was inaccurate. Because it erased the proof. The post still had the same shape (problem, explanation, CTA) but it no longer had the thing that made it persuasive.

The adapter was doing exactly what I’d asked it to do.

The optimization target was wrong
The engineering was fine: parallel async adaptation, platform-specific character limits, fallback truncation, taxonomy-aware prompting, safe failure behavior. All clean.
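For readers who want a picture of that plumbing, here is a minimal sketch of parallel adaptation with per-platform limits and a truncation fallback. Everything in it is an assumption for illustration: the `LIMITS` values, the `echo_model` stand-in for the real model call, and the function names are hypothetical, not the actual system.

```python
import asyncio

# Assumed per-platform character limits; check each platform's current rules.
LIMITS = {"twitter": 280, "linkedin": 3000, "threads": 500}


async def echo_model(platform, text):
    """Hypothetical stand-in for the real model call; returns the text as-is."""
    await asyncio.sleep(0)
    return text


def truncate(text, limit):
    """Fallback truncation: cut at a word boundary and stay within the limit."""
    if len(text) <= limit:
        return text
    cut = text[: limit - 1].rsplit(" ", 1)[0]
    return cut + "…"


async def adapt_one(platform, text, model):
    """Adapt for one platform; on any model failure, fall back to the original."""
    try:
        candidate = await model(platform, text)
    except Exception:
        candidate = text  # safe failure: never lose the author's post
    return platform, truncate(candidate, LIMITS[platform])


async def adapt_all(text, model=echo_model):
    """Run every platform adaptation concurrently."""
    results = await asyncio.gather(*(adapt_one(p, text, model) for p in LIMITS))
    return dict(results)


if __name__ == "__main__":
    for platform, version in asyncio.run(adapt_all("word " * 100)).items():
        print(platform, len(version))
```

None of this touches what the model is told to do, which is why clean plumbing could coexist with wrong output.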

The prompt was the problem. It said:

"Rewrite this for Twitter."

Sounds harmless. Unpack what “rewrite” actually permits:

summarize
restructure
merge claims
drop specifics
abstract upward
replace evidence with implication
optimize for engagement over fidelity

The system was behaving correctly. The philosophy was wrong.

The shift
The whole architecture moved around one sentence:

The author’s wording, length, and specifics are correct unless they violate a rule.

That sentence flips the model’s role. Most AI writing tools assume the model should act like the creator. But creators don’t want replacement. They want assistance with distribution friction: formatting, platform constraints, length caps, pacing, thread splitting.

The moment the AI starts “improving” the substance, trust collapses. Because now the creator has to audit the AI instead of using it.

What actually changed
The model became a copy editor, not a writer. The instruction shifted from “rewrite this post” to “make the smallest possible change necessary.” The model can fix formatting, remove forbidden phrases, adjust syntax, handle structural constraints. It is no longer authorized to rewrite hooks, invent framing, replace evidence, or compress meaning.

Preservation became enforced, not requested. Before the model runs, the system extracts protected facts: numbers, score deltas, URLs, domains, quoted text, structured evidence blocks, CTAs. After generation, a validator checks that those facts survived. If they didn’t, the output is rejected and the system falls back to deterministic trimming.

That distinction is the actual breakthrough. A prompt saying “please preserve the numbers” is not a guarantee. LLMs are not deterministic semantic compressors. They abstract naturally. So you don’t rely on asking the model to preserve facts. You verify the output afterward and reject it when the facts didn’t survive.
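A minimal sketch of that extract-then-verify loop, under stated assumptions: the regex patterns, the function names, and the exact-substring check are all hypothetical, and only a subset of the protected fact types (URLs, domains, score deltas, numbers) is covered. The `generate` and `fallback` arguments stand in for the model call and the deterministic trimmer.

```python
import re

# Hypothetical patterns; the real extractor may match facts differently.
PROTECTED_PATTERNS = [
    re.compile(r"https?://[^\s\"”]+"),                                 # URLs
    re.compile(r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}\b", re.I),   # bare domains
    re.compile(r"\d+(?:\.\d+)?\s*(?:→|->)\s*\d+(?:\.\d+)?"),           # score deltas
    re.compile(r"(?<![\w.])\d+(?:\.\d+)?%?(?!\w)"),                    # standalone numbers
]


def extract_protected_facts(text):
    """Collect every protected fact found in the text, before the model runs."""
    facts = set()
    for pattern in PROTECTED_PATTERNS:
        for match in pattern.finditer(text):
            facts.add(match.group(0).rstrip(".,;:)"))
    return facts


def missing_facts(original, candidate):
    """Return the protected facts from the original that the candidate lost."""
    return {f for f in extract_protected_facts(original) if f not in candidate}


def adapt_with_validation(original, generate, fallback):
    """Accept the model output only if every protected fact survived."""
    candidate = generate(original)
    if missing_facts(original, candidate):
        return fallback(original)  # reject and trim deterministically instead
    return candidate


if __name__ == "__main__":
    post = ("axiondeep.com 91→96, axiondeepdigital.com 94→96, "
            "made4founders.com 90→97. Mobile Core Web Vitals all in the Good range.")
    print(sorted(missing_facts(post, "Every one jumped.")))
```

The exact-substring check is deliberately strict. A candidate that rewrites “91→96” as “91 to 96” would be rejected here, which may or may not be the behavior you want.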

The fallback became deterministic. Earlier versions had instructions like “drop the weakest sentence.” That sounds rigorous until you implement it. Weakest according to what? Without an explicit scoring policy, “deterministic fallback” is just another hidden heuristic.

The new fallback ranks sentences using weighted signals: protected fact presence, forbidden phrases, adjective density, duplication, position, CTA detection, sentence length. Same input, same scoring, same output. No hidden model mood swings.
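A rough sketch of what such a scoring policy could look like. The weights, word lists, and patterns below are invented for illustration, and a tiny filler-word list stands in for real adjective-density detection. The point is only that every signal is explicit and the tie-break is fixed.

```python
import re

# Hypothetical weights and word lists; the real values are not given in this post.
WEIGHTS = {"fact": 5.0, "cta": 3.0, "first": 2.0,
           "forbidden": -4.0, "duplicate": -3.0, "filler": -1.0, "long": -0.5}
FORBIDDEN = ("game-changer", "in today's world")
FILLER = {"very", "really", "incredibly", "amazing"}
CTA = re.compile(r"\b(?:sign up|subscribe|reply|let me know)\b", re.I)
FACT = re.compile(r"\d|https?://|\b[\w-]+\.[a-z]{2,}\b", re.I)


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentence(sentence, index, seen):
    """Higher score means more worth keeping."""
    lowered = sentence.lower()
    words = re.findall(r"[a-z']+", lowered)
    score = 0.0
    if FACT.search(sentence):
        score += WEIGHTS["fact"]
    if CTA.search(sentence):
        score += WEIGHTS["cta"]
    if index == 0:
        score += WEIGHTS["first"]
    if any(phrase in lowered for phrase in FORBIDDEN):
        score += WEIGHTS["forbidden"]
    if lowered in seen:
        score += WEIGHTS["duplicate"]
    score += WEIGHTS["filler"] * sum(w in FILLER for w in words)
    score += WEIGHTS["long"] * max(0, len(words) - 30)
    return score


def trim_to_limit(text, limit):
    """Drop the lowest-scoring sentence until the text fits; keep original order."""
    seen, kept = set(), []
    for i, sentence in enumerate(split_sentences(text)):
        kept.append((score_sentence(sentence, i, seen), i, sentence))
        seen.add(sentence.lower())
    while kept and len(" ".join(s for _, _, s in kept)) > limit:
        # Ties are broken by position: the later sentence goes first.
        kept.remove(min(kept, key=lambda item: (item[0], -item[1])))
    return " ".join(s for _, _, s in kept)
```

Because there is no randomness and no model call, running `trim_to_limit` twice on the same input returns the same string.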

What it looks like in practice
Three before/after pairs from the rebuild:

Original: “axiondeep.com 91→96, axiondeepdigital.com 94→96, made4founders.com 90→97. Mobile Core Web Vitals all in the Good range.”
Old adapter: “Every one jumped.”
New adapter: “axiondeep.com 91→96. axiondeepdigital.com 94→96. made4founders.com 90→97. All Mobile CWV: Good.”

Original: “We audited 292 small business websites. 96.9% failed Core Web Vitals on mobile.”
Old adapter: “Most websites are slow.”
New adapter: “Audited 292 small business sites. 96.9% failed mobile CWV.”

Original: “I’d actually like to hear it if the audit missed the mark for you.”
Old adapter: “Send feedback.”
New adapter: Unchanged. The line is short enough and structurally fine.

The third example is the point. Most posts need fewer changes than the model wants to make.

What I haven’t solved
The hardest case is the emotionally critical weak sentence. Example:

"My dad would've loved this."

There is no deterministic rule that recognizes why that sentence matters. An LLM cannot reliably infer it either. Structurally it’s removable. Compression-friendly. Emotionally, it might carry the entire post.

I did not find a clean answer. The system flags short sentences as candidates for removal but errs toward keeping anything that doesn’t match a forbidden pattern. The cost is occasionally bloated drafts. The benefit is never silently destroying the line that mattered most.

That was the trade I was willing to make. Engineering systems become dangerous when they pretend uncertainty doesn’t exist.

The real product was the feedback loop
The most strategic decision wasn’t in the adapter at all. It was capturing final_published_text after the user edits and posts.

That single field turns the adapter from a static feature into a measurable editorial system. We can now observe what users reverted, what they preserved, where they distrusted the AI, how aggressively they edited, which transformations survived. Most AI writing tools optimize against assumptions. This one will optimize against observed correction behavior.
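One way to turn that field into numbers is a word-level diff between what the adapter produced and what the user actually published. This is a sketch using Python's standard `difflib`. Only `final_published_text` comes from the system described here; the function name and the report keys are hypothetical.

```python
from difflib import SequenceMatcher


def correction_report(adapted_text, final_published_text):
    """Summarize how far the user moved away from the adapter's output."""
    adapted = adapted_text.split()
    published = final_published_text.split()
    matcher = SequenceMatcher(None, adapted, published, autojunk=False)

    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(adapted[i1:i2])    # adapter words the user threw out
        if tag in ("replace", "insert"):
            added.extend(published[j1:j2])    # words the user put in or restored
    return {
        "similarity": round(matcher.ratio(), 3),
        "removed_by_user": removed,
        "added_by_user": added,
        "accepted_unchanged": adapted == published,
    }


if __name__ == "__main__":
    report = correction_report(
        "Most websites are slow.",
        "Audited 292 small business sites. 96.9% failed mobile CWV.",
    )
    print(report)
```

A low similarity score with numbers reappearing in `added_by_user` is the signature of the old failure: the user had to put the evidence back by hand.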

The telemetry probably matters more than the adapter itself.

If you’re building anything in the AI writing space and you’ve fought the same problem, I’d genuinely like to compare notes. The hard part isn’t the model. It’s deciding what the model is allowed to optimize for.

— Joshua R. Gutierrez
