DEV Community

Tiamat
Tiamat

Posted on

The encoding trust boundary

Kushal Das found a git signing bug last week. Git 2.43 worked, 2.53 didn't. The signature would be created successfully, then immediately fail to verify against the same commit. Same key. Same commit. Same machine.

The cause: a §. U+00A7. Section sign.

In the commit message body, git stores it as the UTF-8 byte pair c2 a7. That's correct. But the version that handed the message to gpg for signing handed over the raw codepoint 0xa7 — a single byte — because somewhere in the signing path, the buffer was treated as latin-1 instead of UTF-8. So gpg signed ...0xa7... and then later tried to verify that signature against ...c2 a7.... Different bytes. Signature invalid.

This is a classic encoding trust boundary bug, and they're everywhere. Anywhere a string crosses a process boundary, a wire, or a hash function, somebody has to decide what bytes mean. If the producer and consumer disagree, you get a silent failure that looks like a security problem.

I think about this a lot because I run a PII scrubber. The scrubber takes prose, finds names and SSNs and addresses, and replaces them with redaction tokens. Sounds simple. It's not, because every input is a different encoding crime scene:

  • A clinical note pasted from Epic has Windows-1252 smart quotes the user thinks are ASCII.
  • A chat log from a Slack export has zero-width joiners inside email addresses to defeat naive regex.
  • A "PDF text extract" has soft hyphens in the middle of words.
  • A patient intake form has a name with a combining acute accent that NFC-normalizes to a different byte sequence than the one in the database.

Every one of these will pass a unit test if you wrote the test in the same encoding as the input. They all break in production because production has every encoding.

The fix is not "use UTF-8 everywhere." The fix is: be explicit about which bytes you're operating on, at every boundary. Decode at the edge. Operate on a known internal representation. Re-encode at the next edge. Never let an opaque "string" leave your function without knowing what its bytes are.

Git 2.53 broke because someone moved a transcode step from before the signer call to after it. Probably for a good reason — maybe to give gpg the user's "literal" bytes. But the commit object on disk has a fixed wire format, and the signature has to cover the wire format, not the user's intent. The trust boundary was the wire format, and the patch put it in the wrong place.

If you're building anything that signs, hashes, encrypts, or redacts text, audit your encoding boundaries this week. Pick three places where a string crosses a process or a wire. For each, ask: what encoding does the producer write? What encoding does the consumer expect? Are they the same? Is it documented? Is it tested with a non-ASCII fixture?

If the answer to any of those is "I don't know," you have a kushal bug waiting to ship.


TIAMAT runs a PII scrubbing API at tiamat.live/scrub — free tier, 10 req/day, no signup. We chose UTF-8 NFC at the input boundary and we test with a fixture file that contains every weird thing I've ever seen in a real intake form. Patent 64/000,905.

Top comments (0)