DEV Community: ur-grue

What 'quality-tested' actually means for a library of 394 AI skills

ur-grue — Tue, 30 Jun 2026 15:25:49 +0000

"Quality-tested" is the kind of phrase that usually means nothing. Every tool claims it. Most mean "we tried it once and it didn't crash." So when a library of 394 free Claude skills puts "all quality-tested, mean 4.38/5" on the tin, the fair response is: prove it.

Here's exactly what the claim means, including where it's soft.

A skill ships `stable` only if it clears two bars

Every skill carries a status. To reach stable — the only status the library promotes — it has to pass a seven-dimension evaluation:

Overall mean ≥ 4.0/5 across coherence, relevance, accuracy, completeness, usefulness, format-fit, and Editorial Naturalness.
Editorial Naturalness ≥ 4.0 as a hard floor — a skill can ace the other six and still fail here. This dimension scores the output against observable AI tells (lexical, structural, tonal, genre). It's the one that stops competent-sounding slop from shipping.
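
The two bars above can be sketched as a small gate. This is a minimal illustration, not the repo's actual code: the names `DIMENSIONS`, `THRESHOLD`, and `is_stable` are assumptions made for the example, and only the 4.0 thresholds and the seven dimension names come from the text.

```python
from statistics import mean

# Dimension names taken from the article; the identifiers are hypothetical.
DIMENSIONS = (
    "coherence", "relevance", "accuracy", "completeness",
    "usefulness", "format_fit", "editorial_naturalness",
)
THRESHOLD = 4.0  # the article uses 4.0 for both the mean and the floor


def is_stable(scores: dict[str, float]) -> bool:
    """Return True only if the overall mean AND the naturalness floor pass."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    overall = mean(scores[d] for d in DIMENSIONS)
    return overall >= THRESHOLD and scores["editorial_naturalness"] >= THRESHOLD


# Six perfect scores do not rescue a naturalness score below the floor.
scores = {d: 5.0 for d in DIMENSIONS}
scores["editorial_naturalness"] = 3.5
print(is_stable(scores))  # False
```

Keeping the floor as a separate condition, instead of weighting it into the mean, is what lets one dimension veto the rest.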

The library mean across all stable skills is 4.38. The whole framework — dimensions, thresholds, the banned-phrase list — is in the repo, not a marketing page.

Two stages, because prose isn't code

Code eval is binary: it runs or it doesn't. Prose has no green checkmark, so the library tests in two layers. First, binary assertions catch the mechanical failures — did it produce the required sections, did it refuse to fabricate a quote with no source. Across thousands of these the pass rate is high, and the few "failures" were skills correctly refusing to invent content on deliberately thin inputs — the behaviour you want. Second, the graded rubric above handles the judgment calls binary checks can't.

Where it's soft — said plainly

The graded scoring uses a model as judge, and models are generous: they tend to like fluent text, including fluent AI text. So the scores are treated as a filter, not a verdict. Three things keep them honest:

the bar is set high (≥ 4.0 with the naturalness floor), so borderline output doesn't pass;
the rubric is anchored on observable tells, not taste, so two runs roughly agree;
the worked example in every skill lets you check the output yourself, by eye, in seconds.

It is not a guarantee every output is perfect. It's a documented, repeatable bar that's a lot higher than "we tried it once."

Why bother for free skills

Because the audience is media professionals, and they detect generic instantly. A skill library for people who notice bad writing has to be testable on exactly that axis, or the whole premise collapses. The eval framework isn't a credential — it's the thing that makes "doesn't sound like AI" a claim you can check instead of a vibe.

Open the repo, open any skill, read its example, and judge for yourself. That's the test that matters.

→ github.com/ur-grue/autopunk-media-skills

Show notes in under a minute — without the AI tell

ur-grue — Tue, 23 Jun 2026 15:46:10 +0000

Show notes are the chore nobody books time for. The episode's done, you're tired, and now you owe a summary, timestamps, links, and a guest bio — for a page most listeners skim. So it gets rushed, or skipped, or handed to a chatbot that returns "In this captivating episode, we dive deep into the fascinating world of..."

That opening is the tell. A podcast audience — and the host re-reading it — clocks it instantly. It reads like nobody listened to the episode, because nobody did.

I maintain a free, MIT-licensed library of Claude skills for media producers. The show-notes skill is one of the most-used, because it removes the chore without leaving the fingerprint.

What it does

You paste the transcript or a rough outline. It returns:

A summary that sounds like the show — written in the register of your podcast, not the generic "join us as we explore" template.
Chapter timestamps pulled from the actual structure of the conversation, not invented.
A links/resources block for anything named in the episode.
A short guest bio when there's a guest, drawn from what was actually said — flagged for you to confirm, never fabricated.

Why it doesn't sound like AI

The skill runs against the same quality bar as everything in the library: a seven-dimension rubric whose hard floor is whether the output reads like a human who works in the medium. The banned-phrase list — "dive deep," "captivating," "in this episode we explore" — is enforced, not suggested. What you get back is a draft you can post, not one you have to launder first.

The honest limit

It's only as good as what you feed it. A clean transcript gives clean notes; a vague three-line outline gives a thinner summary and more placeholders for you to fill. And it won't invent a guest's credentials or a stat that wasn't in the episode — if the source doesn't say it, the notes flag the gap instead of guessing. That restraint is the point: wrong show notes are worse than late ones.

Paste an episode in and see. No install — open the skill, copy it into Claude, drop your transcript below it.

→ github.com/ur-grue/autopunk-media-skills

Why I built 394 narrow Claude skills instead of one big media prompt

ur-grue — Tue, 16 Jun 2026 18:03:52 +0000

The obvious way to build an AI media assistant is one big system prompt: "You are an expert media producer skilled in journalism, video, podcasting, PR..." Load it with everything. Let the model figure out what the user needs.

I went the other way. The library is 394 separate skills, each one narrow — a single task, a single format. A skill for ledes. A skill for show notes. A skill for FOIA letters. A skill that does nothing but strip AI tells out of a draft. Here's why narrow won.

One big prompt averages everything toward mush

A system prompt that has to be good at everything is good at nothing in particular. Asked for a lede, it gives you a competent-but-generic opening, because its instructions for "writing" are diluted across forty other jobs. The model can't carry the specific conventions of forty formats at full fidelity at once. You get the average. The average is mush.

A narrow skill carries the full weight of one format: what a lede actually is, the three styles worth trying, the trap of burying the news, the conventions of the outlet type. Nothing competes for that attention. The output is sharper because the instruction is sharper.

Narrow skills are testable

This is the part that matters and the part the "one big prompt" approach can't do. Because each skill has one job, you can write assertions for it: did it produce a one-sentence lede? Did it refuse to invent a quote with no source? You can run that across dozens of inputs and get a real pass rate.
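
As a rough sketch of what "a real pass rate" looks like in practice, assuming a deliberately naive one-sentence check (it will miscount abbreviations such as "U.S." and sentences that end inside quotation marks). The helper names `is_one_sentence` and `pass_rate` are hypothetical, not taken from the library.

```python
import re
from typing import Callable


def is_one_sentence(text: str) -> bool:
    """Naive check: exactly one run of terminal punctuation, at the very end."""
    stripped = text.strip()
    endings = re.findall(r"[.!?]+(?=\s|$)", stripped)
    return len(endings) == 1 and stripped[-1] in ".!?"


def pass_rate(outputs: list[str], check: Callable[[str], bool]) -> float:
    """Fraction of outputs that satisfy a single binary assertion."""
    if not outputs:
        raise ValueError("need at least one output")
    return sum(check(o) for o in outputs) / len(outputs)


ledes = [
    "The council voted 5-2 on Tuesday to close the library.",
    "Big news today. The council voted to close the library.",
    "The library closes in May after a 5-2 council vote.",
]
print(round(pass_rate(ledes, is_one_sentence), 2))  # 0.67
```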

You cannot meaningfully eval "be a great media assistant." There's no assertion for it. By splitting the work into narrow skills, every one becomes measurable — and every one in this library is scored against a seven-dimension rubric (with a hard floor on whether the output reads human) before it ships as stable.

Narrow skills compose

The objection writes itself: 394 skills sounds like more work than one prompt. In practice it's the opposite. You don't load all 394; you reach for the one the task needs. And they chain: story-angle-finder → reportage-structure → lede-writer → fact-check → libel-check is a real pipeline, each step's output feeding the next, each step independently good.
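
The chaining idea can be sketched as plain function composition. The skill names come from the pipeline above; everything else is an assumption for illustration. In particular, `make_stub` only stands in for a real model call, and `chain` is a hypothetical helper, not something the library ships.

```python
from functools import reduce
from typing import Callable

Step = Callable[[str], str]


def chain(*steps: Step) -> Step:
    """Compose steps left to right: each step's output feeds the next."""
    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)
    return run


def make_stub(name: str) -> Step:
    """Stand-in for invoking a skill; it only records that the step ran."""
    return lambda text: f"{text} -> {name}"


pipeline = chain(*(make_stub(name) for name in (
    "story-angle-finder", "reportage-structure", "lede-writer",
    "fact-check", "libel-check",
)))

print(pipeline("notes"))
# notes -> story-angle-finder -> reportage-structure -> lede-writer -> fact-check -> libel-check
```

Because each step is an ordinary function of text to text, any single step can be tested or swapped without touching the others.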

The trade-off, honestly

Narrow means discovery matters — you have to find the right skill. That's a real cost, and it's why the library has role-based guides and a one-command plugin install that lets the model auto-select. The bet is that "sharp and findable" beats "broad and average" for people who care whether the output is shippable. For media work, where the audience detects generic instantly, it's not close.

The whole thing is free and MIT-licensed. Take it apart, fork the structure, tell me where narrow was the wrong call.

→ github.com/ur-grue/autopunk-media-skills

The first 15 seconds decide your video. Here's how to make Claude write them.

ur-grue — Tue, 09 Jun 2026 15:53:51 +0000

Retention graphs don't lie. Most videos lose a third of their audience before the 30-second mark. Whatever you spent on the edit, the thumbnail, the script — none of it matters if the open doesn't hold. The hook is the highest-leverage fifteen seconds you'll write all week.

So why do most AI-written hooks sound like a corporate explainer? "In this video, we'll dive deep into the fascinating world of..." That's not a hook. That's a throat-clear. The viewer is already gone.

I build a free, MIT-licensed library of Claude skills for media producers, and the YouTube hook generator is one people come back to. Here's the difference between a hook and a throat-clear — and how the skill gets it right.

A hook does one of three jobs

It doesn't "introduce the topic." It creates a reason to stay:

The open loop — pose a question the viewer now needs answered. "I moved abroad to learn the language. Two years later I still couldn't order coffee. Here's why."
The contrarian claim — say the thing that contradicts what they assume. "Immersion is the most overrated advice in language learning."
The stakes — make the cost of clicking away concrete. "If you're doing this one thing in your first month, you're wasting the whole year."

What the skill does

You give it your topic, your audience, and your tone. It gives back three hook options — not one — each built on a different mechanism above, each with a one-line note on why it works. You pick the one that fits your voice, and you learn the pattern for next time instead of outsourcing it forever.

It also refuses the tells: no "dive deep," no "in this video," no manufactured "you won't believe." Those are the phrases that pattern-match to AI, and an audience that watches a lot of YouTube clocks them instantly.

Why three, with reasons

Because the best hook for a video isn't knowable in advance — it depends on your delivery and your channel. Giving you one "optimised" hook is a worse product than giving you three real options and the logic to choose. The skill treats you as the creative lead, not the prompt-typist.

The honest limit

A hook can't save a video the audience doesn't want. The skill makes a strong topic open strongly — it can't make a weak topic interesting. And it works best when you give it a real angle, not just "make a video about productivity." Garbage in, polite refusal out.

Every skill in the library is quality-tested against a rubric whose hard floor is "does this read like a human in the medium." The hook generator lives or dies by that bar.

→ github.com/ur-grue/autopunk-media-skills

FOIA letters are a format, not a vibe — so I made Claude write them properly

ur-grue — Fri, 29 May 2026 11:00:22 +0000

Ask a general-purpose chatbot for a public-records request and you get something that looks like a letter and works like a liability. It misses the statute. It asks for "any and all documents" — the phrasing agencies love to reject. It promises a fee waiver you didn't qualify for. It sounds confident and gets you nothing.

A records request is a format with rules. Cite the right law. Describe the records narrowly enough to be findable and broadly enough to catch what you want. State your fee-waiver basis correctly. Set the response clock. Miss any of those and you've burned weeks waiting for a "no."

I maintain a free, MIT-licensed library of Claude skills for media work, and the FOIA/records skills are some of the most-used. Here's what "properly" means.

What the skill actually does

You give it: the records you want, the agency that holds them, and your jurisdiction. It gives back a filing-ready letter that:

Cites the correct statute — US federal FOIA (5 U.S.C. § 552), a named state public-records act, or the EU/UK equivalent — not a generic "freedom of information" gesture.
Scopes the request so it's specific (date ranges, record types, named offices) instead of the "any and all" phrasing that invites a denial.
States the fee-waiver basis in the language the statute uses — public interest, news-media requester status — only when it actually applies.
Sets the response deadline the law requires, so you know when silence becomes appealable.
Leaves a clean appeal trail — structured so that if you're denied, your next letter writes itself.
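
The scoping rule in the list above lends itself to a pre-flight check. The sketch below is an assumption about how such a check might look, not the skill's implementation: `RecordsRequest` and `scope_problems` are hypothetical names, and the check covers only two of the points mentioned (the "any and all" phrasing and a missing date range). It encodes no statute or deadline.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RecordsRequest:
    records: str        # description of the records wanted
    agency: str         # body that holds them
    jurisdiction: str   # which law applies is left to the requester
    date_from: Optional[date] = None
    date_to: Optional[date] = None


def scope_problems(req: RecordsRequest) -> list[str]:
    """List reasons the request looks too broad to file as written."""
    problems = []
    if "any and all" in req.records.lower():
        problems.append('uses "any and all" phrasing')
    if req.date_from is None or req.date_to is None:
        problems.append("no date range")
    elif req.date_from > req.date_to:
        problems.append("date range is reversed")
    return problems


vague = RecordsRequest("Any and all documents about the bridge",
                       "City DOT", "US federal")
print(scope_problems(vague))
# ['uses "any and all" phrasing', 'no date range']
```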

Why this beats "write me a FOIA request"

The difference isn't eloquence. It's that the skill encodes the procedural knowledge a beat reporter accumulates over years: which exemptions agencies hide behind, how to pre-empt the "too broad" rejection, when a fee waiver is real. A blank-slate prompt doesn't carry any of that — so it produces a letter that reads well and fails procedurally.

The honest limits

It is not legal advice, and it does not know your specific agency's quirks. Statutes change; local rules vary. The skill gives you a strong, correctly-structured draft — you still confirm the citation for your jurisdiction and adjust for the body you're filing against. It will also tell you when your request is too vague to file, instead of inventing specifics.

The whole library is free. If you file records requests, start with the FOIA skill and the document-analysis skill that pairs with it — feed in what comes back, get a structured read of what's in it and what's missing.

→ github.com/ur-grue/autopunk-media-skills

How do you eval LLM output that isn't code?

ur-grue — Fri, 29 May 2026 10:44:27 +0000

Code has a luxury: it either runs or it doesn't. You write an assertion, you run it, you get a green check or a red one. Most LLM eval frameworks lean on exactly this — assert output contains X, assert valid JSON, assert no error.

Editorial output has no such luxury. A pitch treatment, a show-notes draft, a lede — there's no test that returns true for "a working producer would send this." So how do you evaluate it without a human reading every output, every time?

I had to answer this concretely, because I maintain a library of ~400 Claude skills for media work, and "trust me, it's good" is not a quality bar. Here's the approach.

Two stages: binary first, judgment second

Stage one is the cheap filter — binary assertions. Even for prose, a lot of failure is mechanical and testable: did it produce the required sections? Is the lede one sentence? Did it refuse to fabricate a quote when given no source? These catch the obvious breaks fast, run blind across many inputs, and cost nothing. The library runs thousands of these. The interesting result: the few "failures" were skills correctly refusing to invent content on deliberately thin inputs — the desired behaviour, not a bug.
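
Two of the stage-one checks named above could look roughly like this. It is a sketch under stated assumptions: section headings are assumed to sit on their own lines, and "did not fabricate a quote" is approximated as "every double-quoted span appears verbatim in the source". Both function names are hypothetical.

```python
import re


def has_required_sections(output: str, sections: list[str]) -> bool:
    """True if every required heading appears on its own line."""
    lines = {line.strip().lower() for line in output.splitlines()}
    return all(section.lower() in lines for section in sections)


def no_unsourced_quote(output: str, source: str) -> bool:
    """True if every double-quoted span in the output occurs in the source."""
    quotes = re.findall(r'"([^"]+)"', output)
    return all(quote in source for quote in quotes)


notes = 'Summary\nThe guest said "we are closing in May".\nTimestamps\n00:00 Intro'
transcript = "Host: so what's next? Guest: we are closing in May"

print(has_required_sections(notes, ["Summary", "Timestamps"]))  # True
print(no_unsourced_quote(notes, transcript))                    # True
print(no_unsourced_quote(notes, ""))                            # False
```

With an empty source, any quoted span fails the second check, which matches the thin-input case described above: the only passing output is one that declines to quote.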

Stage two is the part that matters for prose — graded judgment. A model scores each output 1–5 across seven dimensions: coherence, relevance, accuracy, completeness, usefulness, format-fit, and one more that does the real work.

The dimension that does the work: Editorial Naturalness

Six of the seven dimensions are standard. The seventh is a hard floor: does this read like a person who knows the medium, or like a model?

This is scored against observable tells, not vibes:

Lexical — the AI vocabulary (delve, leverage, robust, seamless, tapestry).
Structural — the false pivot ("not just X, but Y"), throat-clearing openers, rule-of-three on every line.
Tonal — manufactured enthusiasm, hedging stacks, the apology spiral.
Genre — does it honour the conventions of the format it claims to serve?

A skill can score 5/5 on the other six and still fail. If Editorial Naturalness is below the floor, it doesn't ship. That single constraint is what stops the library drifting into competent-sounding slop.
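
The lexical and structural tells are mechanical enough to sketch in code; the tonal and genre ones need judgment and are left out here. The word list is the one quoted above, the false-pivot pattern is a rough regular expression, and `find_tells` is a hypothetical name. None of this is the library's scoring code.

```python
import re

# Lexical tells quoted in the article.
LEXICAL = ("delve", "leverage", "robust", "seamless", "tapestry")
# Rough pattern for the false pivot: "not just X, but Y" within one sentence.
FALSE_PIVOT = re.compile(r"\bnot just\b[^.!?]*\bbut\b", re.IGNORECASE)


def find_tells(text: str) -> dict[str, list[str]]:
    """Return the lexical and structural tells found in the text."""
    lexical = [w for w in LEXICAL if re.search(rf"\b{w}", text, re.IGNORECASE)]
    structural = [m.group(0) for m in FALSE_PIVOT.finditer(text)]
    return {"lexical": lexical, "structural": structural}


sample = "We delve into a robust platform. It's not just fast, but seamless."
print(find_tells(sample))
# {'lexical': ['delve', 'robust', 'seamless'], 'structural': ['not just fast, but']}
```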

Why a hard floor and not an average

Averages hide the thing you care about. A draft that's accurate, complete, well-structured — and unmistakably machine-written — would pass an averaged score comfortably. For media work that draft is useless: the audience clocks it in a sentence. The floor forces the failure to surface instead of being averaged away.
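
A worked example makes the point concrete. The scores are hypothetical: assume a draft rated 5 on the six standard dimensions and 2 on Editorial Naturalness, written $s_{\text{nat}}$ here.

```latex
\bar{s} = \frac{6 \times 5 + 2}{7} = \frac{32}{7} \approx 4.57 \ge 4.0
\qquad\text{but}\qquad
s_{\text{nat}} = 2 < 4.0
```

On the average alone this draft clears the 4.0 bar with room to spare. The floor is the only condition that rejects it.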

The honest limitation

A model grading prose is generous — it tends to like fluent text, including fluent AI text. So the scores are treated as a filter, not a verdict: they catch the clear failures and rank candidates, but the bar for "stable" is deliberately set high (≥ 4.0 with the naturalness floor), and the rubric is anchored on observable tells rather than taste, so two runs roughly agree. It's not perfect. It's a lot better than shipping on feel.

The whole framework — dimensions, thresholds, the banned-phrase list — is open source. If you're evaluating non-code LLM output, take it apart and tell me where it's too soft.

→ github.com/ur-grue/autopunk-media-skills

How to tell if AI wrote it — and how to make it stop

ur-grue — Fri, 29 May 2026 10:10:25 +0000

You can spot it in a sentence. "In today's fast-paced landscape, we're thrilled to delve into game-changing solutions that empower teams to unlock their potential."

Grammatically fine. Recognisably synthetic. And for anyone whose readers notice — journalists, producers, newsletter writers — that register is a liability.

I build a free, MIT-licensed library of Claude skills for media producers. The whole project rests on one idea: AI output for media has to read like a person who knows the format, not like a chatbot. So I had to make "sounds like a human" something you can actually measure, not just feel.

Here is how.

Name the tells

Most AI prose fails in four observable ways:

Lexical — a vocabulary of words real writers rarely reach for: delve, leverage, robust, seamless, cutting-edge, tapestry, realm, navigate, unlock, empower.
Structural — the false pivot ("not just X, but Y"), the throat-clearing opener ("In this article, we will"), the rule-of-three rhythm on every sentence.
Tonal — manufactured enthusiasm, hedging stacks ("may potentially"), the customer-service apology spiral.
Genre — copy that ignores the conventions of the medium it claims to serve. A show-notes draft that reads like a press release. A lede that buries the news.

Once you can name the tell, you can cut it.

Make it a quality gate, not a vibe

In the skill library, every skill is scored on a seven-dimension rubric before it ships. Six dimensions are the usual suspects: coherence, relevance, accuracy, completeness, usefulness, and format-fit. The seventh is a hard floor: Editorial Naturalness. A skill can score well everywhere else and still fail if its output reads as machine-made. Below the floor, it does not ship.

That one constraint changes how the skills are written. They stop optimising for "complete and correct" and start optimising for "a working professional would send this without rewriting it."

The detox pass

There is also a skill whose only job is to strip the tells: feed it AI-flavoured copy, get back a publishable rewrite plus a before/after table so you learn the pattern. It carries the canonical banned-phrase list — the same list a runtime check uses to flag drafts before they go out.

Worked example. Before:

In today's fast-paced media landscape, our cutting-edge AI-powered solution seamlessly empowers newsrooms to unlock unprecedented efficiencies.

After:

Three regional dailies now use it for pre-publication checks. None of them call the change transformative. All three call it removing a class of small, repetitive edits from the subbing pass.

Shorter. Concrete. Verifiable. No tells.
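
A runtime check of this kind can be approximated in a few lines. The sketch below is an assumption about its shape, not the library's code: `BANNED_STEMS` holds only the ten lexical tells listed earlier in this post (the canonical list lives in the repo), and `flag_draft` is a hypothetical name. It is run against the before and after texts from the worked example.

```python
import re

# Stems from the "Lexical" list in this post; the full list is in the repo.
BANNED_STEMS = ("delve", "leverage", "robust", "seamless", "cutting-edge",
                "tapestry", "realm", "navigate", "unlock", "empower")


def flag_draft(text: str) -> list[str]:
    """Return the banned stems found in the text, in list order."""
    return [
        stem for stem in BANNED_STEMS
        if re.search(rf"\b{re.escape(stem)}\w*", text, re.IGNORECASE)
    ]


before = ("In today's fast-paced media landscape, our cutting-edge AI-powered "
          "solution seamlessly empowers newsrooms to unlock unprecedented "
          "efficiencies.")
after = ("Three regional dailies now use it for pre-publication checks. None of "
         "them call the change transformative. All three call it removing a "
         "class of small, repetitive edits from the subbing pass.")

print(flag_draft(before))  # ['seamless', 'cutting-edge', 'unlock', 'empower']
print(flag_draft(after))   # []
```

Matching on stems catches inflected forms such as "seamlessly" and "empowers". The cost is the occasional false positive on an innocent word that shares a stem.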

Why this matters for media work

Readers in media are trained to detect manufactured language — it is the job. Copy that pattern-matches to "AI wrote this" costs you credibility before the argument even lands. The fix is not "write a better prompt." It is treating naturalness as a measurable bar and refusing to ship below it.

The library is free and open source. Browse it, copy a skill into Claude, and watch what changes when the AI tells are gone.

→ github.com/ur-grue/autopunk-media-skills
