Nex Tools

Posted on Apr 22 • Originally published at nextools.hashnode.dev

Stress Testing a New Claude Code Skill: 7 Bugs in 2 Hours

#claudecode #debugging #testing #javascript


TL;DR

I got handed a new Claude Code skill yesterday called /slide-deck (v0.1.0, labeled "bootstrap"). I was skeptical, so I agreed to use it on a real high-value deliverable: a proposal deck for a warm sales lead, in two output formats (16:9 desktop + 9:16 mobile), with RTL Hebrew text and strict brand compliance.

If a bootstrap skill survives that, it's probably real. If it breaks, the break points tell you more than a clean demo would.

It survived. But it broke in seven specific places. Here's the taxonomy.

This post is mostly for anyone who writes or reviews Claude Code skills. The bugs aren't unique to this skill. They're the shape of bugs you find in any declarative-but-stateful system where templates meet runtime.

The setup (two minutes of context)

/slide-deck is meant to take a brief + a brand DESIGN.md file + a format (16:9, 9:16, 4:5) and emit a self-contained HTML deck with inline CSS, keyboard navigation, and jsPDF export.

A DESIGN.md file is basically a tokens file. Colors, fonts, spacing, iron rules. The skill is supposed to inject those tokens into a template and produce a deck that passes the brand's compliance gate.

My deliverable:

Audience: a real B2B sales lead (21-day-cold, needed reactivation)
Format A: 16:9 desktop for Zoom screen share
Format B: 9:16 mobile for WhatsApp/stories
Language: Hebrew, RTL
Slide count: 8
Brand: Vent (premium, mystical, dark theme, specific Heebo + Space Grotesk + Varela Round fonts)

The skill had never been stress-tested in the field. That's the whole point.

Bug 1: Template placeholders don't auto-inject from DESIGN.md

The skill ships with a template full of mustache-style placeholders:

<html lang="{{lang}}" dir="{{direction}}">
<style>
:root {
  --bg-primary: {{bg_primary}};
  --accent-1: {{accent_1}};
  --font-sans: {{font_sans}};
}
</style>

The SKILL.md says "inject tokens per brand." But there's no sub-script that reads DESIGN.md, maps fields to template placeholders, and performs the substitution. You do it by hand.

For 8 placeholders that's tolerable. For 30+ (brand variants, accessibility tokens, responsive breakpoints), it becomes the entire job.

Fix priority: High. Needs a tokens-inject.js CLI that consumes DESIGN.md frontmatter + body tables and emits a substituted template.
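
A minimal sketch of the substitution step such a CLI would need, assuming the tokens have already been parsed out of DESIGN.md into a flat object. The name injectTokens and the example values are illustrative, not part of the skill. It throws on any placeholder that has no token, so a missing brand value fails loudly instead of shipping as a literal {{accent_1}}.

```javascript
// Hypothetical helper: replaces {{name}} placeholders with values from a
// flat tokens object. Parsing DESIGN.md into that object is out of scope here.
function injectTokens(template, tokens) {
  const missing = new Set();
  const output = template.replace(/\{\{\s*([\w-]+)\s*\}\}/g, (match, name) => {
    if (Object.prototype.hasOwnProperty.call(tokens, name)) {
      return String(tokens[name]);
    }
    missing.add(name);
    return match; // keep the placeholder for now; we throw below
  });
  if (missing.size > 0) {
    throw new Error(`Missing tokens: ${[...missing].join(', ')}`);
  }
  return output;
}

// Example values are assumptions for illustration.
const html = injectTokens('<html lang="{{lang}}" dir="{{direction}}">', {
  lang: 'he',
  direction: 'rtl',
});
console.log(html); // <html lang="he" dir="rtl">
```

Collecting every missing name before throwing means one run reports all the gaps in a brand file, not just the first.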

Bug 2: No auto-4:5 handoff

The spec says 4:5 format should "hand off to /carousel-nex." In practice this handoff is undocumented. What JSON contract does the receiving skill expect? What format for brand tokens? What about font loading? No answer.

The result: if a user asks for 4:5, the skill either silently produces a wrong-sized deck or fails confusingly.

Fix priority: Medium. Needs either (a) explicit handoff contract in SKILL.md or (b) direct 4:5 support in the skill itself.
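
Either option starts with the same guard: resolve the requested format explicitly and fail on anything unknown. A rough sketch, where routeFormat, the table shape, and every dimension except 9:16 (1080x1920, from this post) are assumptions:

```javascript
// Hypothetical format table. Only the 9:16 size comes from this post; the
// other dimensions and the handler names' use here are assumptions.
const FORMATS = {
  '16:9': { width: 1920, height: 1080, handler: 'slide-deck' },
  '9:16': { width: 1080, height: 1920, handler: 'slide-deck' },
  '4:5': { width: 1080, height: 1350, handler: 'carousel-nex' },
};

function routeFormat(format) {
  if (!Object.prototype.hasOwnProperty.call(FORMATS, format)) {
    const known = Object.keys(FORMATS).join(', ');
    throw new Error(`Unsupported format "${format}". Expected one of: ${known}`);
  }
  const spec = FORMATS[format];
  // handoff is true when another skill has to take over the render.
  return { format, ...spec, handoff: spec.handler !== 'slide-deck' };
}

console.log(routeFormat('4:5').handoff); // true
```

The point is less the table than the throw: a 4:5 request either gets an explicit handoff record or an error, never a silently wrong-sized deck.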

Bug 3: html2canvas + RTL edge case

Exporting an RTL deck to PDF via html2canvas produces reversed text on about 1 in 5 pages. Not always. Not predictable. The fix that worked:

const canvas = await html2canvas(slides[i], {
  scale: 1.5,
  useCORS: true,
  backgroundColor: '#0D0D0F'  // explicit bg required
});

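Why this works is less clear than that it works. html2canvas documents backgroundColor as defaulting to white, so transparent slides get a white backdrop unless you set it, and the reversed-text pages stopped once I did. useCORS is documented as controlling cross-origin image loading, not fonts, so any effect on font fallback or text direction is an observation from this deck rather than documented behavior.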

The fix is fragile. html2canvas is not the right long-term tool for this job. The right tool is Puppeteer with page.pdf() or dom-to-image-more with explicit RTL support. But that requires a runtime the skill doesn't currently have.

Fix priority: Medium. Works for now, will bite again.

Bug 4: No automated compliance gate

SKILL.md says "run /ארט-דירקטור DESIGN.md compliance checklist, 10 items must pass, block if fail" (the skill name is Hebrew for "art director"). In practice, this is a manual prompt. There's no hook that automatically invokes the compliance skill after render.

That means the enforcement is social, not technical. Which in any org with a second developer means it gets skipped about a third of the time.

Fix priority: High. A PostToolUse hook or a skill-level post-render step would solve this.
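
Whatever triggers it, the gate itself can be a plain function that returns a list of failures. A sketch under loud assumptions: this post does not enumerate the 10 checklist items, so the three checks below are inferred from requirements mentioned elsewhere in it (resolved placeholders, no em-dashes, RTL), and both function names are hypothetical.

```javascript
// Hypothetical checks standing in for the real 10-item checklist.
function checkCompliance(html) {
  const failures = [];
  if (/\{\{[^}]*\}\}/.test(html)) {
    failures.push('unresolved template placeholder');
  }
  if (html.includes('\u2014')) {
    failures.push('em-dash found');
  }
  if (!/<html[^>]*\bdir="rtl"/i.test(html)) {
    failures.push('missing dir="rtl" on <html>');
  }
  return failures;
}

// A post-render step could call this and stop the pipeline on any failure.
function assertCompliant(html) {
  const failures = checkCompliance(html);
  if (failures.length > 0) {
    throw new Error(`Compliance gate failed: ${failures.join('; ')}`);
  }
}
```

Wired into a hook or a post-render step, a thrown error (or a nonzero exit from a wrapper script) is what turns the checklist from social into technical enforcement.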

Bug 5: No PDF pre-flight

If the Google Fonts CDN is slow or blocked, the slides are rasterized with a fallback font before jsPDF ever sees them. For RTL content this means broken glyphs, sometimes upside-down for certain ligatures.

There's no check that the fonts actually loaded before PDF generation starts.

async function downloadPDF() {
  // no font-ready check
  const pdf = new jsPDF({ ... });
  for (let i = 0; i < total; i++) {
    show(i);
    await new Promise(r => setTimeout(r, 300));
    const canvas = await html2canvas(slides[i], { scale: 2 });
    // ...
  }
}

The fix is simple:

async function downloadPDF() {
  await document.fonts.ready;  // one line
  const pdf = new jsPDF({ ... });
  // ...
}

Fix priority: Low effort, medium value. Ship it.
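
One caveat worth hedging: as far as I know, document.fonts.ready resolves once loading has settled, which does not by itself prove a given family loaded. A stricter sketch uses FontFaceSet.load(), which resolves with the font faces that matched. The function name, the timeout, and the sample text are assumptions; the sample text matters because subsetted web fonts only match when the text falls in their unicode-range.

```javascript
// Sketch: fail before export if a required family did not load.
// `fontSet` is a FontFaceSet (document.fonts in a browser); passing it in
// keeps the function testable. The 5000 ms timeout is an arbitrary default.
async function assertFontsLoaded(fontSet, families, sampleText = ' ', timeoutMs = 5000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Font loading timed out')), timeoutMs);
  });
  const check = async () => {
    for (const family of families) {
      // load() resolves with the FontFace objects matching this family and text.
      const faces = await fontSet.load(`16px "${family}"`, sampleText);
      if (faces.length === 0) {
        throw new Error(`Font not available: ${family}`);
      }
    }
  };
  try {
    await Promise.race([check(), timeout]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer once either side settles
  }
}

// Browser usage at the top of downloadPDF(), with a Hebrew sample:
// await assertFontsLoaded(document.fonts, ['Heebo'], 'שלום');
```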

Bug 6: No structured brief flow

The SKILL.md promises a 5-question structured intake:

1. Topic / purpose
2. Audience
3. Slide count
4. CTA
5. Length

In practice, this is a verbal prompt: "ask the user." There's no forcing function. When I ran the skill, I pulled the brief from memory (I already knew the audience, the CTA, the slide count). That's fine for me. For a new user or a subagent invoking this programmatically, the skill can't enforce a brief.

Worse: there's no way to check "did the user answer all 5?" before generation starts.

Fix priority: High. Ideally generate-from-brief.md as a structured YAML/JSON intake with schema validation.
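
The "did the user answer all 5?" check is small enough to sketch. The field names below are hypothetical (the skill does not define a schema); the five questions come from the list above.

```javascript
// Hypothetical field names, one per intake question.
const REQUIRED_BRIEF_FIELDS = ['topic', 'audience', 'slideCount', 'cta', 'length'];

function validateBrief(brief) {
  const errors = [];
  for (const field of REQUIRED_BRIEF_FIELDS) {
    const value = brief[field];
    if (value === undefined || value === null || String(value).trim() === '') {
      errors.push(`missing: ${field}`);
    }
  }
  // Only type-check slideCount when it was actually provided.
  if (brief.slideCount != null &&
      !(Number.isInteger(brief.slideCount) && brief.slideCount > 0)) {
    errors.push('slideCount must be a positive integer');
  }
  return errors;
}

console.log(validateBrief({ topic: 'Proposal', audience: 'B2B lead', slideCount: 8, length: 'short' }));
// [ 'missing: cta' ]
```

Generation would start only when the returned array is empty, which gives a subagent the same forcing function a human gets from being asked.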

Bug 7: 9:16 viewport math requires manual calc

The 9:16 format (1080x1920) has to fit on arbitrary viewports. The math that does it:

.frame {
  position: relative;
  width: min(100vw, calc(100vh * 9 / 16));
  height: min(100vh, calc(100vw * 16 / 9));
}

That math is correct, but the template doesn't include it. I had to add it by hand, and it's the kind of thing where one slip (inverting the ratio) produces a deck that looks fine on the designer's monitor and broken on a phone.

Fix priority: Medium. Templates should ship different viewport math per format.
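
To sanity-check the ratio direction, the same min() logic can be run as plain arithmetic. fitFrame is a hypothetical helper that mirrors the CSS above for any ratio; the viewport sizes are just example numbers.

```javascript
// Largest box with aspect ratioW:ratioH that fits inside the viewport.
// Mirrors the CSS min() expressions above, generalized to any ratio.
function fitFrame(viewportW, viewportH, ratioW, ratioH) {
  const width = Math.min(viewportW, (viewportH * ratioW) / ratioH);
  const height = Math.min(viewportH, (viewportW * ratioH) / ratioW);
  return { width, height };
}

console.log(fitFrame(1920, 1080, 9, 16)); // { width: 607.5, height: 1080 }
console.log(fitFrame(360, 800, 9, 16));   // { width: 360, height: 640 }
```

Swapping the ratio arguments on the phone case, fitFrame(360, 800, 16, 9), gives a 360 x 202.5 landscape strip, which is exactly the failure that goes unnoticed on a wide monitor.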

What the stress test proved

Despite seven bugs, /slide-deck produced a deliverable that:

Rendered correctly in desktop + mobile
Passed manual RTL compliance check
Exported to PDF (after the backgroundColor fix)
Hit keyboard navigation specs
Was em-dash-clean (the brand forbids them)

The bugs are the shape of a skill that's been tested in spec, not in the field. Most of them are "the docs describe what should happen but the enforcement is missing" bugs. With the workarounds in place, none left the final output wrong.

That's a specific class of skill maturity. Skills that describe behavior correctly but don't enforce it are at maturity level 2 out of 4:

1. Works once, for the author
2. Described in a SKILL.md, vaguely enforced
3. Has tests, hooks, and fail-loud gates
4. Fully automated, schema-validated, compliance-gated

/slide-deck v0.1.0 is a 2. Getting it to 3 is a clear roadmap.

The bug taxonomy, generalized

Looking at this list, a pattern emerges. Every bug maps to one of three classes:

Class A: "The spec is a prayer"

Bugs 1, 4, 6. The SKILL.md describes what should happen ("inject tokens", "run compliance", "ask 5 questions") but no code or hook enforces it. The spec is aspirational.

Class B: "The fix is fragile"

Bugs 3, 7. There's a workaround in place, but it depends on implementation details that could change (html2canvas options, specific viewport math).

Class C: "No check for a knowable fail"

Bugs 2, 5. There's a state we can detect (font not loaded, format not supported) but we don't detect it. These are the cheapest bugs to fix.

How to write a skill that doesn't need this post

Three practices that would've caught all seven bugs at review time:

1. Every SKILL.md promise must have an enforcement path

If you say "run X after Y," there must be a hook, a post-run check, or a schema that makes it fail loud when skipped. Otherwise it's a wish.

2. Produce a known-bad and a known-good fixture

For a skill like /slide-deck, there should be two fixtures checked into the skill dir:

fixtures/bad-brief.yaml (missing CTA) that the skill should reject
fixtures/good-brief.yaml that produces a known-correct HTML

Then a tests directory with test-reject-bad.sh + test-good-snapshot.sh. Nothing fancy. Just make it impossible to ship a regression.

3. Treat format differences as first-class

9:16 is not "16:9 but rotated." 4:5 is not "16:9 but cropped." Every format has its own viewport math, its own typography scale, its own safe zones. Templates per format beat one template with {{orientation}} placeholders.

What the next 48 hours look like

If you're shipping a v0.2 of a skill like this, my order would be:

1. Bug 4 (auto-compliance hook). Fastest. Highest-impact. 10 minutes.
2. Bug 5 (font preflight). One-line fix. Ship it.
3. Bug 1 (tokens-inject.js). Biggest lift, biggest payoff. Half a day.
4. Bug 6 (structured brief). Depends on Bug 1. Half a day.
5. Bugs 2, 3, 7. Formatting and edge cases. Week 2.

The meta-point

I work with a lot of Claude Code skills. Most of them are somewhere between 2 and 3 on the maturity scale. The jump from 2 to 3 is the boring engineering work that nobody wants to do: fixtures, hooks, validators, font preflights.

It's also the work that makes skills actually compound. A skill at level 2 is a clever prompt. A skill at level 3 is infrastructure.

/slide-deck v0.1.0 produced a real deliverable. And it produced seven concrete bugs that I can file and hand back to the skill's owner. That's exactly what a stress test is supposed to do.

About

I'm the founder of mynextools.com and run a Shopify brand. I build Claude Code workspaces for solo founders and small teams. Available for consulting on Upwork.
