How I use LLMs for structured classification without getting garbage output

#ai #typescript #github #programming

I've been using LLMs to classify GitHub pull requests into changelog categories. The goal: automatically decide if a PR is a feature, bugfix, breaking change, or internal noise.

It took several iterations to get consistent output. Here's what actually worked.

The problem with direct classification

The naive approach:

Classify this PR: feature / bugfix / breaking / internal.

PR title: "Update auth middleware"

Hit rate was ~70%. The model pattern-matched on title keywords instead of understanding intent. Every "update" became "improvement", every "fix" became "bugfix" — even when the PR body told a completely different story.

What worked: reason first, classify second

Read this PR title, labels, and body carefully.

Step 1: In one sentence, explain what this change does for the end user. 
If it has no end-user impact (internal refactor, dependency bump, CI change), say "no end-user impact" explicitly.

Step 2: Classify as one of: feature | improvement | bugfix | breaking | internal

Return as JSON: { "user_impact": "...", "category": "..." }

Hit rate: ~92% on the same test set.

The key insight: forcing the model to articulate user impact before labeling makes it actually read the PR body. A PR titled "refactor auth middleware" with a body describing a logout race condition gets correctly classified as "bugfix" — not "internal".

Batching for cost and speed

One request per PR is expensive and slow. Batching 20 PRs per request with structured JSON output works much better:

const prompt = `
Classify each of these ${prs.length} pull requests.

PRs:
${prs.map((pr, i) => `
[${i}]
Title: ${pr.title}
Labels: ${pr.labels.join(', ')}
Body: ${pr.body?.slice(0, 300) ?? 'none'}
`).join('\n')}

Return a JSON array with ${prs.length} objects, each with:
- index: number
- user_impact: one sentence or "no end-user impact"  
- category: "feature" | "improvement" | "bugfix" | "breaking" | "internal"
`;

Cost per analysis drops ~10x compared to one-request-per-PR. Accuracy doesn't change.

Making "internal" the safe default

When the model is uncertain, you want it to classify as "internal" (filtered out) rather than incorrectly labeling noise as a feature. Explicit instruction in the prompt:

If you're uncertain whether a PR is user-facing, classify it as "internal".
It's better to miss a minor change than to include irrelevant noise in the changelog.

This single line eliminated most false positives.

Full context helps

When available, include the linked issue title — not just the PR. Issues are usually written with user language ("users can't log in after token refresh") while PRs are written with technical language ("fix race condition in token refresh handler"). The issue title often contains the user-impact description you're trying to generate.

I built this for ReleaseHub — a CLI that generates release notes from merged PRs. The full classification prompt is here if you want to see it in production context.

What prompting patterns have worked for you in classification tasks?