SEN LLC

Posted on Apr 16

git log --pretty is a tiny DSL: building a zero-dep CHANGELOG generator

#typescript #git #cli #tutorial

A ~400-line TypeScript CLI that turns conventional commits into a CHANGELOG.md fragment. No runtime dependencies — the only thing it needs is git on PATH. Along the way, a slightly underappreciated feature of git log: its --pretty=format: string is a tiny DSL that you can safely parse from any language.

Every time I start a new project I hit the same branch point. Do I reach for standard-version? semantic-release? changesets? They all work. They all do more than I want. They create tags, bump versions, publish to npm, open PRs, hook into CI, and drag in a tree of dependencies to do it. What I actually need, 90% of the time, is:

"Give me the markdown for everything between the last tag and HEAD so I can paste it into a GitHub release or a CHANGELOG."

That's it. No version math. No publish. No PR bot. Just the bullet list.

So I wrote changelog-gen. It's one binary, zero runtime npm deps, and about 400 lines of TypeScript. This article is the design walkthrough and the thing I wish I'd understood earlier: git log --pretty=format: is a small DSL, and shelling out to git is almost always cleaner than pulling in a git binding.

📦 GitHub: https://github.com/sen-ltd/changelog-gen

The shape of the problem

Conventional commits (conventionalcommits.org) define a tiny grammar on the commit subject line:

<type>(<optional scope>)<optional !>: <description>

Examples:

feat: add greeting command
fix(parser): handle empty input
feat(api)!: rename v2 endpoints
docs: clarify --since default

The ! marks a breaking change. A body footer like BREAKING CHANGE: ... also marks it as breaking. And then there are footer trailers like Closes #42, Fixes GH-7 that link the commit to an issue.

A changelog generator has exactly three jobs:

1. Read the commits in a range from git.
2. Parse the conventional grammar on each subject.
3. Group them into sections (Features, Bug Fixes, Breaking Changes, …) and render.

None of those are hard. The interesting question is how you do step 1 — and whether you pull in a library or shell out to the git binary.

Why not `simple-git` / `nodegit` / `isomorphic-git`?

A few reasons I talked myself out of each.

nodegit is native bindings to libgit2. Installs are painful (prebuilt or compile), and it's a tree of deps the minute you npm install it.
simple-git wraps the git CLI but still pulls in transitive dependencies, and you're paying for abstraction you don't need.
isomorphic-git is a real JS implementation of git. It's impressive, but it's also hundreds of kilobytes of JavaScript to do something the git binary already does perfectly.

For a CLI whose entire point is "be a small thing you can alias to npm run changelog", adding "dependencies": {...} undermines the pitch. And every user already has git on PATH — if they didn't, they couldn't have made the commits in the first place.

So: shell out.

Interlude: `git log --pretty` as a DSL

Here's the thing I didn't fully appreciate until I wrote this tool. The --pretty=format: string that git accepts is a real little DSL. You write a format string with %-prefixed tokens, and git emits exactly those fields for each commit, in your chosen order, with whatever literal text you interleave.

The tokens you care about for a changelog:

Token	Meaning
`%H`	Full commit hash
`%h`	Abbreviated hash
`%an`	Author name
`%s`	Subject (first line)
`%b`	Body (everything after the subject)
`%cn`	Committer name
`%ci`	Committer date, ISO 8601-like (`%cI` for strict ISO 8601)

So --pretty=format:"%H %s" produces one line per commit with <full-hash> <subject>. Easy.

The snag is that %b can contain newlines. And %s can, in weird repos, contain quote characters. You can't just split on \n to get records, because any multi-line body will break your parser.

The fix is to pick your own separators, ones that basically never appear in real commit messages, and use them to delimit fields and records. I went with two control characters that ASCII defines for exactly this purpose (Unicode inherits them unchanged as C0 controls):

\x1f — Unit Separator — between fields within a record.
\x1e — Record Separator — between records.

So my format string becomes:

const FIELD_SEP = '\x1f';
const RECORD_SEP = '\x1e';

const PRETTY_FORMAT =
  `%H${FIELD_SEP}%h${FIELD_SEP}%an${FIELD_SEP}%s${FIELD_SEP}%b${RECORD_SEP}`;

And the invocation is a boring spawn:

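import { spawn } from 'node:child_process';

// Shape inferred from how the options are used below.
interface GitLogOptions {
  since?: string; // tag or ref to start after; omit to walk all of history
  until: string;  // ref to stop at, e.g. 'HEAD'
  cwd?: string;   // repository directory
}

export async function runGitLog(opts: GitLogOptions) {
  // `since..until` limits the log to commits reachable from until but not from since.
  const range = opts.since ? `${opts.since}..${opts.until}` : opts.until;
  const args = [
    'log',
    `--pretty=format:${PRETTY_FORMAT}`,
    '--no-color',
    range,
  ];
  // run() wraps spawn and resolves with git's stdout as a string.
  const stdout = await run('git', args, opts.cwd);
  return { commits: parseLog(stdout) };
}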
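The run helper called above isn't shown. A minimal sketch of what it might look like, assuming it only needs to collect stdout and reject on a non-zero exit (the name run comes from the snippet; the error message and the stderr handling are my assumptions, not the tool's actual code):

```typescript
import { spawn } from 'node:child_process';

// Hypothetical helper: spawn a command without a shell, resolve with its
// stdout as a string, reject if it fails to start or exits non-zero.
export function run(cmd: string, args: string[], cwd?: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args, { cwd });
    const out: Buffer[] = [];
    const err: Buffer[] = [];
    child.stdout.on('data', (chunk: Buffer) => out.push(chunk));
    child.stderr.on('data', (chunk: Buffer) => err.push(chunk));
    child.on('error', reject); // e.g. ENOENT when the binary is not on PATH
    child.on('close', (code) => {
      if (code === 0) {
        resolve(Buffer.concat(out).toString('utf8'));
      } else {
        const detail = Buffer.concat(err).toString('utf8').trim();
        reject(new Error(`${cmd} exited with code ${code}: ${detail}`));
      }
    });
  });
}
```

Because the arguments go to spawn as an array and no shell is involved, the format string and its control characters reach git untouched, with no quoting to get wrong.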

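Parsing is .split(RECORD_SEP) then .split(FIELD_SEP) on each chunk. No state machine, no stream parser, nothing. The only real footgun is the newline git log adds around records: format: puts a \n between commits (tformat: would put one after every commit instead), so each record boundary shows up as \x1e\n and the output ends with a dangling \x1e. Strip that terminator before splitting and swallow the optional \n in the split, or you'll get an empty trailing record and a stray leading newline on every hash.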

export function parseLog(stdout: string): RawCommit[] {
  const trimmed = stdout.replace(new RegExp(`${RECORD_SEP}\\n?$`), '');
  if (trimmed.length === 0) return [];
  const records = trimmed.split(new RegExp(`${RECORD_SEP}\\n?`));
  return records
    .filter((r) => r.length > 0)
    .map((r) => {
      const [hash, shortHash, author, subject, body] = r.split(FIELD_SEP);
      return { hash, shortHash, author, subject, body };
    });
}
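To see the two splits at work without a repository, here is a small sketch that feeds parseLog a hand-built string shaped like git's output. The hashes, author names, and the RawCommit interface are made up for illustration, and parseLog is repeated so the snippet runs on its own:

```typescript
const FIELD_SEP = '\x1f';
const RECORD_SEP = '\x1e';

interface RawCommit {
  hash: string;
  shortHash: string;
  author: string;
  subject: string;
  body: string;
}

// Same two-split approach as parseLog above.
function parseLog(stdout: string): RawCommit[] {
  const trimmed = stdout.replace(new RegExp(`${RECORD_SEP}\\n?$`), '');
  if (trimmed.length === 0) return [];
  return trimmed
    .split(new RegExp(`${RECORD_SEP}\\n?`))
    .filter((r) => r.length > 0)
    .map((r) => {
      const [hash = '', shortHash = '', author = '', subject = '', body = ''] =
        r.split(FIELD_SEP);
      return { hash, shortHash, author, subject, body };
    });
}

// Stand-in for git's output: two commits, the first with a multi-line body.
const sample =
  ['1111111111', '1111111', 'Ada', 'feat(api)!: rename v2 endpoints',
   'BREAKING CHANGE: /v2/users is now /v2/accounts\n\nCloses #42\n'].join(FIELD_SEP) +
  RECORD_SEP + '\n' +
  ['2222222222', '2222222', 'Bob', 'fix(parser): handle empty input', ''].join(FIELD_SEP) +
  RECORD_SEP;

const commits = parseLog(sample);
console.log(commits.length);                  // 2
console.log(commits.map((c) => c.subject));   // both subjects, intact
console.log(sample.split('\n').length);       // 5: a newline split sees five "lines" for two commits
```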

That's the entire "git integration" layer. One spawn, one format string, two splits. No deps.

Parsing the conventional grammar

The parser is a single regex on the subject plus a scan of the body for BREAKING CHANGE: and issue references.

const HEADER_RE =
  /^(?<type>[a-zA-Z]+)(?:\((?<scope>[^)]+)\))?(?<bang>!)?: (?<desc>.+)$/;

const BREAKING_RE =
  /(?:^|\n)BREAKING[- ]CHANGE:\s*(.+?)(?:\n\n|$)/s;

const REF_RE =
  /(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?|ref[s]?)\s+([^\s,.]+#\d+|#\d+|GH-\d+)/gi;

That's 80% of the work, and it handles every commit I've ever written. The interesting choices:

The header regex is anchored and greedy on the description, which means feat: add colon: to output parses correctly — type=feat, desc=add colon: to output.
Unknown types fall back to other, not to an error. If someone's commit says wip: half done, we'd rather surface it under "Other" than crash.
Breaking detection has two paths: the ! in the header or a BREAKING CHANGE: footer in the body. Both are legitimate per the spec and real projects use both.
Issue references are deduplicated. If a commit says "closes GH-10 and also fixes #10" (weirdly common in squash-merged PRs), both normalize to #10 and get counted once.
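The function that ties the three regexes together isn't shown, so here is one way it could look. Treat it as a sketch: the ParsedCommit shape, the KNOWN_TYPES list, the name parseCommit, and the choice to keep a non-matching subject verbatim under other are all my assumptions.

```typescript
const HEADER_RE =
  /^(?<type>[a-zA-Z]+)(?:\((?<scope>[^)]+)\))?(?<bang>!)?: (?<desc>.+)$/;
const BREAKING_RE = /(?:^|\n)BREAKING[- ]CHANGE:\s*(.+?)(?:\n\n|$)/s;
const REF_RE =
  /(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?|ref[s]?)\s+([^\s,.]+#\d+|#\d+|GH-\d+)/gi;

// Assumed list of recognised types; the real tool's list may differ.
const KNOWN_TYPES = new Set([
  'feat', 'fix', 'docs', 'perf', 'revert',
  'chore', 'ci', 'build', 'style', 'test', 'refactor',
]);

interface ParsedCommit {
  type: string;
  scope: string | null;
  description: string;
  breaking: boolean;
  breakingNote: string | null;
  refs: string[];
}

function parseCommit(subject: string, body: string): ParsedCommit {
  const header = HEADER_RE.exec(subject);
  const rawType = header?.groups?.type;
  // Unknown or missing types fall back to "other" instead of throwing.
  const type = rawType !== undefined && KNOWN_TYPES.has(rawType) ? rawType : 'other';
  const footer = BREAKING_RE.exec(body);

  // Collect issue references, normalising GH-10 to #10 so duplicates collapse.
  const refs = new Set<string>();
  REF_RE.lastIndex = 0; // the regex is global, so reset its cursor first
  let m: RegExpExecArray | null;
  while ((m = REF_RE.exec(body)) !== null) {
    refs.add(m[1].replace(/^GH-/i, '#'));
  }

  return {
    type,
    scope: header?.groups?.scope ?? null,
    // Assumption: a subject that doesn't match the grammar is kept verbatim.
    description: header?.groups?.desc ?? subject,
    // Two paths to "breaking": the ! in the header or a footer in the body.
    breaking: header?.groups?.bang === '!' || footer !== null,
    breakingNote: footer ? footer[1].trim() : null,
    refs: Array.from(refs),
  };
}
```

With this sketch, parseCommit('feat: add colon: to output', '') keeps the second colon in the description, and a body of "closes GH-10 and also fixes #10" yields a single #10 reference.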

Grouping is where opinion lives

Once you have parsed commits, the grouper decides which types go to which section — and, critically, which types get dropped. This is where I disagree the most with standard-version defaults, so this is where I get to have taste:

const DROPPED_TYPES = new Set(['chore', 'ci', 'build', 'style', 'test', 'refactor']);

A release reader does not care that you bumped a GitHub Action, reformatted with Prettier, or added 30 tests. Those commits are real work and they belong in history, but they don't belong in the changelog. If you don't agree — fine, delete two lines.

The other opinion: breaking commits go in the Breaking Changes section AND in their native section. If feat!: rename v2 endpoints is a breaking change, it's still a feature, and a reader scrolling through "Features" shouldn't miss it because it happens to be breaking. That's one of the annoyances I have with the tools that try to be clever here.
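Both opinions fit in a few lines. The sketch below is illustrative rather than the tool's actual grouper: the Commit and Section shapes, the section ids other than breaking, the Documentation and Other titles, and the decision to still list a breaking commit of a dropped type under Breaking Changes are assumptions on my part.

```typescript
interface Commit {
  type: string;
  description: string;
  breaking: boolean;
}

interface Section {
  id: string;
  title: string;
  commits: Commit[];
}

const DROPPED_TYPES = new Set(['chore', 'ci', 'build', 'style', 'test', 'refactor']);

// Which section each type lands in; anything unlisted goes to "other".
const SECTION_FOR_TYPE = new Map<string, string>([
  ['feat', 'features'],
  ['fix', 'fixes'],
  ['docs', 'docs'],
]);

// Render order and titles (ids besides "breaking" are hypothetical).
const TITLES: Array<[string, string]> = [
  ['breaking', 'Breaking Changes'],
  ['features', 'Features'],
  ['fixes', 'Bug Fixes'],
  ['docs', 'Documentation'],
  ['other', 'Other'],
];

function groupCommits(commits: Commit[]): Section[] {
  const buckets = new Map<string, Commit[]>();
  const add = (id: string, c: Commit): void => {
    const list = buckets.get(id) ?? [];
    list.push(c);
    buckets.set(id, list);
  };

  for (const c of commits) {
    // Breaking commits are listed under Breaking Changes...
    if (c.breaking) add('breaking', c);
    // ...and also in their native section, unless the type is dropped.
    if (DROPPED_TYPES.has(c.type)) continue;
    add(SECTION_FOR_TYPE.get(c.type) ?? 'other', c);
  }

  // Emit only non-empty sections, in a fixed order.
  return TITLES
    .filter(([id]) => buckets.has(id))
    .map(([id, title]) => ({ id, title, commits: buckets.get(id) ?? [] }));
}
```

Keeping the order in one array means the render order never depends on the order commits happen to arrive in, which helps with the deterministic-output goal.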

Output formats

Three formatters: markdown, json, plain. The JSON one exists specifically to be piped into further tooling — imagine a release script that wants to pull out the breaking changes and the issue refs separately:

changelog-gen --format json | jq '.sections[] | select(.id == "breaking")'

The plain formatter is for email and chatops where markdown asterisks look like garbage.

Here's the core markdown renderer, roughly:

function formatMarkdown(sections: Section[], opts: FormatOptions): string {
  if (sections.length === 0) {
    return '## Changelog\n\n_No notable changes._\n';
  }
  const out: string[] = ['## Changelog\n'];
  for (const section of sections) {
    out.push(`### ${section.title}\n`);
    for (const c of section.commits) {
      const scope = c.scope ? `**${c.scope}:** ` : '';
      const refs = c.refs.length > 0 ? ' (' + c.refs.join(', ') + ')' : '';
      out.push(`- ${scope}${c.description} (${c.shortHash})${refs}`);
    }
    out.push('');
  }
  return out.join('\n').replace(/\n+$/, '\n');
}

Short hash in parens, scope as a bold prefix, issue refs after, author optional. That's the format I'd write by hand if I were composing a release note, which is the bar: could I paste this into GitHub Releases and not cringe?
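The plain formatter isn't shown. A sketch of how it might mirror the markdown renderer while avoiding asterisks and pound signs follows; the underlined section titles, the indented dash bullets, and the name formatPlain are guesses at a reasonable layout, not the tool's actual output.

```typescript
interface Commit {
  scope: string | null;
  description: string;
  shortHash: string;
  refs: string[];
}

interface Section {
  title: string;
  commits: Commit[];
}

// Hypothetical plain-text renderer: no markdown markup, just text that
// survives being pasted into an email or a chat message.
function formatPlain(sections: Section[]): string {
  if (sections.length === 0) {
    return 'Changelog\n\nNo notable changes.\n';
  }
  const out: string[] = ['Changelog', ''];
  for (const section of sections) {
    // Underline the title with dashes instead of using a heading marker.
    out.push(section.title, '-'.repeat(section.title.length));
    for (const c of section.commits) {
      const scope = c.scope ? `${c.scope}: ` : '';
      const refs = c.refs.length > 0 ? ` (${c.refs.join(', ')})` : '';
      out.push(`  - ${scope}${c.description} (${c.shortHash})${refs}`);
    }
    out.push('');
  }
  return out.join('\n');
}
```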

Tradeoffs I'm comfortable with

Things changelog-gen does not do, on purpose:

No version bump. It does not read or write package.json. If you want SemVer inference from commits, that's semantic-release's specialty — use it.
No tag creation. Tag when and how you want. A lot of teams want humans in that loop anyway.
No Co-Authored-By trailer parsing. Authors come from the commit author, not from trailers. Adding trailer parsing is a 10-line change I'll do if someone opens an issue, but YAGNI until then.
No streaming for huge ranges. Everything is held in memory. For the "since last tag" ranges this tool is designed for, you'd need a 100k-commit release to notice.

Things it does well:

Zero runtime deps. The only dependency is the git binary, which is obviously already there.
Deterministic output. Given the same range, you get bit-identical markdown.
Tests against real git. The test suite creates temporary git repositories with child_process, makes real commits, and asserts on the real output. No mocked git layer — if git log changes tomorrow, the tests will tell me.
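For a sense of what testing against real git can look like, here is a hedged sketch of a fixture helper. The function names, the fixed test identity, and the use of empty commits are my choices for illustration; the actual test suite may be organised differently.

```typescript
import { execFileSync } from 'node:child_process';
import { mkdtempSync, rmSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Run git in `cwd` with a fixed identity and signing disabled, so the result
// doesn't depend on the machine's global git config.
function git(cwd: string, ...args: string[]): string {
  return execFileSync(
    'git',
    ['-c', 'user.name=Test', '-c', 'user.email=test@example.com',
     '-c', 'commit.gpgsign=false', ...args],
    { cwd, encoding: 'utf8' },
  );
}

// Create a throwaway repository, add one empty commit per subject, and return
// the subjects as `git log` reports them (newest first).
function subjectsFromTempRepo(subjects: string[]): string[] {
  const dir = mkdtempSync(join(tmpdir(), 'changelog-gen-test-'));
  try {
    git(dir, 'init', '--quiet');
    for (const subject of subjects) {
      git(dir, 'commit', '--quiet', '--allow-empty', '-m', subject);
    }
    return git(dir, 'log', '--pretty=format:%s').split('\n');
  } finally {
    // Always clean up, even when an assertion or a git call throws.
    rmSync(dir, { recursive: true, force: true });
  }
}
```

Calling subjectsFromTempRepo(['feat: add greeting command', 'fix(parser): handle empty input']) should hand back the same two subjects with the fix first, since git log lists the newest commit first. A real test would run the CLI against the temporary directory instead of a bare git log.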

Try it in 30 seconds

docker build -t changelog-gen https://github.com/sen-ltd/changelog-gen.git
docker run --rm -v "$PWD":/work -w /work changelog-gen --since v1.0.0

The image is node:20-alpine with git installed, ~150 MB total. Mount your repo, pass a --since, get markdown on stdout.

Or if you'd rather not deal with Docker:

git clone https://github.com/sen-ltd/changelog-gen
cd changelog-gen && npm install && npm run build
node dist/main.js --help

The meta-point

I've been building a lot of small CLIs recently and the pattern that keeps winning is the same: shell out to the obvious underlying binary, parse its structured output, do the interesting work in your language. The underlying binary (git, tar, jq, ffmpeg) has been battle-tested for decades. Your code gets to be 300 lines instead of 3,000. Users don't need a native build step. You can write honest tests against real data. And on install day, nobody has to wonder whether npm rebuild is going to take four minutes.

git log --pretty is a particularly nice DSL because git has been careful about the format tokens: they're stable across versions, documented in git help log, and portable. If you've only ever used git log for the default pretty output, try git log --pretty=format:'%h %s' once. The whole commit history suddenly becomes structured data you can pipe.

The whole thing is ~400 lines of TypeScript, 38 tests, a 150 MB Docker image, and zero npm runtime dependencies. Alias it to npm run changelog, paste the output into your next release, move on.
