Why deterministic prompt scoring?
A few months ago I was using SUNO AI and kept regenerating the same song idea 20-30 times before getting something close to what I imagined. The prompt syntax felt opaque. Genre close but sub-genre missed. Mood right but vocals wrong.
Turns out SUNO's prompt behavior is deterministic enough to score. So I wrote a scorer: suno-prompt-scorer on npm (MIT).
What the scorer checks — 16 signals
Each check is weighted; the total is a percentage from 0 to 100:
| # | Check | Category | Weight |
|---|---|---|---|
| 1 | Character limit (v4: 200, v4.5+: 1000) | style | 8% |
| 2 | Genre collisions (53 known pairs) | style | 10% |
| 3 | Weak token detection (context-aware) | style | 8% |
| 4 | Strong token reward (48 hardware/prod anchors) | style | 8% |
| 5 | Tag ordering weight 2/(1+k) | style | 8% |
| 6 | Genre in position 1 | style | 7% |
| 7 | Mood in position 2 | style | 5% |
| 8 | Per-category limits (genre 1-2, mood 1-2, instruments 2-3) | style | 8% |
| 9 | Invalid tag detection (49 known bad) | style | 8% |
| 10 | Suspicious tag detection (28 unverified) | style | 4% |
| 11 | Misclassified subgenres (100 mapped) | style | 5% |
| 12 | Bracket syntax validation | lyrics | 7% |
| 13 | Regional coherence | advanced | 4% |
| 14 | Version-specific warnings | advanced | 5% |
| 15 | Ready-package proximity to benchmarks | style | 5% |
| 16 | Bracket verbatim check (informational, weight 0) | lyrics | — |
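To make the weighting concrete, here is a minimal sketch of how 16 weighted checks could roll up into a single 0-100 score. `CheckResult`, the field names, and the weights are my illustration, not the package's actual internals:

```typescript
// Hypothetical shape for one check's result (not the package's real API).
interface CheckResult {
  name: string;
  pct: number;    // 0-1: how well the prompt passed this check
  weight: number; // share of the total, e.g. 0.08 for 8%
}

function totalScore(checks: CheckResult[]): number {
  // Informational checks (weight 0) are reported but contribute nothing.
  const weightSum = checks.reduce((s, c) => s + c.weight, 0);
  const earned = checks.reduce((s, c) => s + c.pct * c.weight, 0);
  return Math.round((earned / weightSum) * 100);
}

const demo: CheckResult[] = [
  { name: 'char-limit', pct: 1, weight: 0.08 },
  { name: 'genre-collisions', pct: 0.5, weight: 0.1 },
  { name: 'weak-tokens', pct: 1, weight: 0.08 },
];

totalScore(demo); // 81
```

The weight-0 informational check (row 16) falls out naturally: it adds nothing to either sum, so it can appear in the breakdown without moving the total.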
The anchor-based philosophy
The most interesting design decision: separate core nouns (must be verified) from modifiers (creative freedom).
So [shofar blast] passes — "Shofar" is a verified instrument, "blast" is a free modifier. But [QuantumSynth breakdown] fails — no verified anchor.
This preserves creativity (real producers combine real instruments in unexpected ways) while catching hallucinations.
```
// Core nouns: genres, instruments, keys, vocal types → verbatim
// Structural: [Intro], [Verse], [Chorus], [Drop] → verbatim
// Modifiers: blast, crystalline, thundering → free
```
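The anchor rule can be sketched in a few lines: a bracket tag passes if at least one token is a verified core noun or structural tag, and everything else rides along as a free modifier. The tiny word lists below are placeholders standing in for the real knowledge base:

```typescript
// Placeholder word lists, not the real 4,000+ tag knowledge base.
const CORE_NOUNS = new Set(['shofar', 'moog', '808', 'sitar']);
const STRUCTURAL = new Set(['intro', 'verse', 'chorus', 'drop']);

function bracketTagPasses(tag: string): boolean {
  const tokens = tag.toLowerCase().replace(/[\[\]]/g, '').split(/\s+/);
  // Pass if any token is a verified anchor; modifiers like "blast" are free.
  return tokens.some(t => CORE_NOUNS.has(t) || STRUCTURAL.has(t));
}

bracketTagPasses('[shofar blast]');           // true: "shofar" is an anchor
bracketTagPasses('[QuantumSynth breakdown]'); // false: no verified anchor
```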
What I learned about SUNO
Building this surfaced several non-obvious findings:
- Position 1 is 60-70% of output DNA. The first tag dominates.
- Tag weight drops 2/(1+k) per position. Position 6 has ~30% of position 1's weight.
- Some modifiers are weak in isolation but strong in context. "Modern" alone is weak; "polished modern production" is specific.
- V4.5+ supports 1,000 chars in Style, not 200. The 200-char limit applied to V4 only, which is still a common misconception.
- Collision pairs aren't obvious. "calm + aggressive" is easy; "minimal + orchestral" and "whisper + powerful vocals" are less so.
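The positional decay in the second bullet is easy to check numerically. The 2/(1+k) formula is from the findings above (with k as the 1-indexed tag position); the helper names are mine:

```typescript
// Weight of the tag at 1-indexed position k, per the 2/(1+k) decay claim.
function tagWeight(position: number): number {
  return 2 / (1 + position); // position 1 → 1.0, position 6 → ~0.286
}

// How much influence a position retains relative to position 1.
function relativeToFirst(position: number): number {
  return tagWeight(position) / tagWeight(1);
}

relativeToFirst(6); // ~0.29, i.e. roughly 30% of position 1's weight
```

This is why putting the genre anywhere but first costs so much: by position 6 a tag carries under a third of the influence it would have had up front.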
Usage
```sh
npm install suno-prompt-scorer
```

```typescript
import { scorePrompt } from 'suno-prompt-scorer';

const result = scorePrompt(
  "Electropop, 128 BPM, 808 Bass, Moog bass, Confident, Euphoric"
);

console.log(result.total);     // 99
console.log(result.breakdown); // 16 checks with pct + message
```
Links
- npm: suno-prompt-scorer (MIT)
- HuggingFace demo (no install): shaizadok/suno-prompt-scorer
- Full web UI: acetaggen.com/tools/prompt-scorer
- Research blog: acetaggen.com/blog
Contributions welcome
The knowledge base (4,000+ verified tags across 13 categories) is the main area where contributions help most — especially regional genres, edge cases, and emerging subgenres.
Disclosure: I'm the creator of AceTagGen. The scorer npm package is a standalone MIT-licensed extraction of the scoring engine. The web tool at acetaggen.com uses the same engine with a larger server-side knowledge base.
— Shai Zadok