DEV Community

SEN LLC

Writing a Privacy Tool You Can Actually Trust: exif-stripper in 500 Lines of Python

A small batch EXIF scrubber built on Pillow. The interesting problem wasn't "how do I remove EXIF" — Pillow makes that a one-liner. It was "how do I make every action visible so a journalist can trust this before hitting publish?"

📦 GitHub: https://github.com/sen-ltd/exif-stripper

The problem

Every photo your phone takes is a small data leak. Open up any JPEG straight from an iPhone or Android in exiftool and you will typically find:

  • GPS latitude, longitude, and altitude accurate to roughly ten meters. Embedded by default on most phones.
  • Camera make, model, lens model, and serial number. Serial numbers are unique and persistent — a single serial linking two photos across years is enough to establish provenance.
  • Timestamps to the second in three separate fields (DateTime, DateTimeOriginal, DateTimeDigitized), plus the subsecond field if the phone bothered.
  • Software fingerprints identifying the exact firmware version.
  • Maker notes — a manufacturer-specific blob that can contain anything. Apple's includes Face ID scene classification. Samsung's has at times included Wi-Fi BSSIDs. Nobody really audits these.
  • An embedded thumbnail that has its own EXIF segment, so naively stripping the parent can still leak data if the thumbnail is preserved.
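
If exiftool is not at hand, a rough way to see which metadata segments a JPEG carries is to walk its marker structure directly. The sketch below is standard-library only and is not part of exif-stripper; the function name `list_jpeg_segments` is made up for illustration, and it assumes a well-formed file with no fill bytes between segments.

```python
import struct

SOS, EOI = 0xDA, 0xD9  # start-of-scan and end-of-image markers


def list_jpeg_segments(data: bytes) -> list[tuple[str, int]]:
    """Return (label, payload_length) for each segment before the scan data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    segments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker in (SOS, EOI):
            break  # compressed image data follows; stop scanning
        # The 2-byte big-endian length counts itself but not the marker bytes.
        (length,) = struct.unpack(">H", data[pos + 2 : pos + 4])
        payload = data[pos + 4 : pos + 2 + length]
        if 0xE0 <= marker <= 0xEF:
            label = f"APP{marker - 0xE0}"
            if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
                label += " (Exif)"
            elif marker == 0xE1 and payload.startswith(b"http://ns.adobe.com/xap/1.0/"):
                label += " (XMP)"
            elif marker == 0xE2 and payload.startswith(b"ICC_PROFILE\x00"):
                label += " (ICC)"
        elif marker == 0xFE:
            label = "COM"
        else:
            label = f"0x{marker:02X}"
        segments.append((label, len(payload)))
        pos += 2 + length
    return segments
```

Running this on a file before and after stripping gives a quick, tool-independent check that the APP1 (Exif) segment is actually gone.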

For most people this is a curiosity. For a journalist meeting a source, an activist photographing a protest, a lawyer preparing court exhibits, or anyone publishing family photos on a personal blog, it is a real threat. "Don't post photos of your kids with GPS enabled" is advice because the threat is real, and the mitigation — scrubbing metadata before publishing — has to be something people actually remember to do.

The existing options each have a downside:

  • ExifTool is the canonical tool. It is also 2+ MB of Perl, with a famously baroque CLI, and it can do approximately ten thousand things you don't need. It is the right answer for research and forensics, and the wrong one for a shell script you ship to non-technical colleagues.
  • Image editors (Photoshop, Affinity, Preview.app) will scrub metadata on export, but only one image at a time, and only if you remember, and "export without metadata" is usually a non-default checkbox.
  • Python libraries (Pillow, piexif) give you the pieces to build something but no finished tool. You end up with five different one-off scripts in ~/bin that each handle one format.

There's a gap for a small, batch-oriented, scriptable CLI that does one thing, does it in the terminal, and tells you exactly what it removed. That's exif-stripper.

Design judgment

When I started the project, the interesting question wasn't "how do I strip EXIF?" Pillow makes that a one-liner. The interesting question was: what does a privacy tool need so that users can trust it?

The answer I landed on is: make every action visible and open to inspection, both before and after it happens.

Dry-run is non-negotiable. --dry-run produces the full report — every tag that would be removed, every file that would be skipped, estimated bytes saved — without writing a single byte. Before you aim a privacy tool at 10,000 family photos, you should be able to see what it's going to do.

Machine-readable output is non-negotiable. --report json and --report csv exist so you can diff a dry run against a real run and have a complete audit trail. For legal disclosure workflows this is table stakes — "we ran exif-stripper version 0.1.0 with this invocation, here is the JSON output, here is the sha256 of each cleaned file" is an evidence chain.
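
As a rough illustration of that audit trail, the sketch below hashes cleaned files and compares a dry-run report against a real-run report. It is standard-library only. The report shape (a list of objects with `path` and `tags_removed` keys) is an assumption made for this example, not the tool's documented JSON schema, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large images do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def diff_reports(dry_run: list[dict], real_run: list[dict]) -> list[str]:
    """Return the paths whose removed-tag lists differ between two reports.

    Assumes each entry looks like {"path": ..., "tags_removed": [...]};
    that shape is hypothetical.
    """
    planned = {e["path"]: sorted(e["tags_removed"]) for e in dry_run}
    actual = {e["path"]: sorted(e["tags_removed"]) for e in real_run}
    return sorted(
        p for p in planned.keys() | actual.keys() if planned.get(p) != actual.get(p)
    )
```

An empty result from `diff_reports` means the real run did what the dry run said it would; the per-file hashes then pin down exactly which bytes were published.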

Default to preserving what matters. Orientation and ICC color profile are preserved by default. Both decisions deserve a paragraph.

Orientation is that little EXIF tag that says "this JPEG is actually stored rotated 90° clockwise, display it accordingly". It's privacy-neutral: it doesn't leak where you were or what camera you used. But if you strip it blindly, every portrait photo from an iPhone will display sideways in any viewer that relied on the tag. Users will think the tool is broken. --keep Orientation lets you keep just that one tag. Alternatively, --auto-rotate physically rotates the pixels and drops the tag, so the result displays upright with no metadata at all.
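
To make the rotate-then-drop idea concrete, here is a minimal sketch using Pillow's `ImageOps.exif_transpose`, which applies the Orientation tag to the pixels and removes the tag from the returned copy. Whether exif-stripper uses this helper internally is not stated above, so treat this as one possible approach; `make_tagged_jpeg` is a hypothetical helper that only builds a test image.

```python
from io import BytesIO

from PIL import Image, ImageOps

ORIENTATION = 0x0112  # standard EXIF tag id for Orientation


def make_tagged_jpeg(size: tuple[int, int], orientation: int) -> bytes:
    """Build a tiny in-memory JPEG carrying only an Orientation tag."""
    exif = Image.Exif()
    exif[ORIENTATION] = orientation
    buf = BytesIO()
    Image.new("RGB", size, "red").save(buf, format="JPEG", exif=exif.tobytes())
    return buf.getvalue()


def upright_copy(data: bytes) -> Image.Image:
    """Return a copy with the rotation applied to the pixels themselves."""
    img = Image.open(BytesIO(data))
    return ImageOps.exif_transpose(img)


# Orientation 6 means "rotate 90 degrees clockwise to display",
# so a 4x2 stored image comes back as 2x4.
upright = upright_copy(make_tagged_jpeg((4, 2), 6))
```

After this step the Orientation tag carries no information, so it can be stripped along with everything else without turning the photo sideways.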

ICC color profiles are even more nuanced. They tell the viewer how to interpret the image's color values, which matters most on wide-gamut screens. Photos from a modern iPhone are tagged with a Display P3 profile, not sRGB; removing it and letting the viewer guess can noticeably shift the color on a wide-gamut display. Ruining people's color in a privacy tool is a great way to get them to stop using it. So ICC stays by default, and --strip-icc is an explicit opt-in for the paranoid case where you worry that a printer-specific color profile identifies the printer.

Be honest about the boundaries. exif-stripper does not strip HEIC. Pillow needs pillow-heif for any HEIC support, and write support is genuinely unreliable across platforms. So we skip HEIC cleanly with a clear message instead of silently dropping the file or — worse — silently letting one through without stripping. It also doesn't touch videos, and it doesn't touch RAW camera formats (.CR2, .NEF, .ARW). Those are real limitations and the README says so loud and clear. A privacy tool that lies about its coverage is worse than no tool at all.
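
The "skip cleanly with a clear message" behavior can be sketched as a small classification step that runs before any file is opened. This is standard-library only; the extension sets, the messages, and the `classify` function are assumptions for illustration and may not match the tool's actual lists.

```python
from pathlib import Path

# Assumed for this sketch; the real tool's supported set may differ.
SUPPORTED = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".webp"}

# Formats that are explicitly out of scope, each with a reason to report.
OUT_OF_SCOPE = {
    ".heic": "HEIC is not supported; file left untouched",
    ".heif": "HEIC is not supported; file left untouched",
    ".cr2": "RAW formats are not supported; file left untouched",
    ".nef": "RAW formats are not supported; file left untouched",
    ".arw": "RAW formats are not supported; file left untouched",
    ".mp4": "video is not supported; file left untouched",
    ".mov": "video is not supported; file left untouched",
}


def classify(path: Path) -> tuple[str, str]:
    """Return ("strip", "") or ("skip", reason) based on the file extension."""
    ext = path.suffix.lower()
    if ext in SUPPORTED:
        return ("strip", "")
    if ext in OUT_OF_SCOPE:
        return ("skip", OUT_OF_SCOPE[ext])
    return ("skip", f"unrecognized extension {ext!r}; file left untouched")
```

The point of returning a reason instead of silently ignoring the file is that every skipped path can show up in the report, so nothing unstripped goes out unnoticed.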

Implementation highlights

The stripping logic lives in src/exif_stripper/stripper.py. Every other module is scaffolding around it. Here are the three pieces I spent the most time on.

1. The JPEG strip

The actual strip for JPEG is about ten lines. The art is in doing it cleanly:

fmt = (img.format or "").upper()
# ...
if fmt == "JPEG":
    new_exif = Image.Exif()
    for tag_id, value in kept_pairs:
        new_exif[tag_id] = value
    save_kwargs.update(
        format="JPEG",
        exif=new_exif.tobytes() if len(new_exif) else b"",
        quality="keep" if not rotated else 90,
        optimize=True,
    )
    if icc is not None:
        save_kwargs["icc_profile"] = icc

Three things worth calling out.

First, we pass exif=b"" (or the serialized bytes of a fresh Image.Exif containing only the kept tags) instead of trying to delete individual markers in the file. Pillow rewrites the entire APP1 segment from what you hand it, which is cleaner than poking at bytes. And because the embedded thumbnail lives inside the same APP1 EXIF segment, rewriting the segment drops the thumbnail along with everything else. That quietly handles one of the sneakier leakage paths for free.

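Second, quality="keep" tells Pillow to re-encode with the source file's quantization tables and chroma subsampling instead of picking a new quality level. The image is still decoded and re-encoded, so this is not a bit-for-bit lossless copy, but reusing the original tables keeps generation loss minimal. This is what makes it reasonable to run exif-stripper on an image without visibly losing JPEG quality. The one exception is when --auto-rotate has physically rotated the pixels: the rotated image is a new in-memory image that no longer counts as a JPEG source, so "keep" is not available and we have to re-encode at an explicit quality level.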

Third, icc_profile is passed through explicitly. If you don't pass it, Pillow drops it. That's the default many programs get wrong.
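
Putting the three points together, a self-contained version of the JPEG path might look like the sketch below. It works on bytes in memory so it is easy to try; `strip_jpeg` is a name invented for this example, not the function in stripper.py, and the sketch leaves out the --keep and --auto-rotate handling.

```python
from io import BytesIO

from PIL import Image


def strip_jpeg(data: bytes, keep_icc: bool = True) -> bytes:
    """Re-save a JPEG with no EXIF, optionally carrying the ICC profile over."""
    img = Image.open(BytesIO(data))
    if img.format != "JPEG":
        raise ValueError("this sketch only handles JPEG input")
    save_kwargs = {
        "format": "JPEG",
        "exif": b"",        # write no EXIF segment at all
        "quality": "keep",  # reuse the source quantization tables
    }
    icc = img.info.get("icc_profile")
    if keep_icc and icc:
        # Pillow's JPEG writer only emits a profile that is passed explicitly.
        save_kwargs["icc_profile"] = icc
    out = BytesIO()
    img.save(out, **save_kwargs)
    return out.getvalue()
```

Reopening the result and checking `getexif()` and `info.get("icc_profile")` is a cheap regression test for both halves of the behavior.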

2. Preserving --keep tags across a full rewrite

The trick with --keep is that we're not editing the existing EXIF in place — we're building a new EXIF object from scratch. So we have to capture the values we want before they vanish:

kept_pairs: list[tuple[int, object]] = []
if keep_set and fmt in ("JPEG", "TIFF", "WEBP", "MPO"):
    exif = img.getexif()
    if exif:
        name_to_id = {v: k for k, v in ExifTags.TAGS.items()}
        for name in keep_set:
            tag_id = name_to_id.get(name)
            if tag_id is None:
                continue
            if tag_id in exif:
                kept_pairs.append((tag_id, exif[tag_id]))
                result.tags_kept.append(name)

The name_to_id reverse lookup is built from Pillow's canonical ExifTags.TAGS dict. That way users pass human-readable names (Orientation, Make, ImageDescription) from the command line and we map them to numeric tag IDs without having to ship our own table. If Pillow doesn't know the tag name, the user gets a silent no-op, which is the right call for an opt-in preservation list — it's safer to strip a tag than to accidentally keep something harmful because the user typo'd a tag name.

The kept_pairs are then re-inserted into a fresh Image.Exif() during save. Everything not on that list is gone.
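
The name lookup on its own is small enough to show in isolation. The sketch below resolves names through Pillow's `ExifTags.TAGS` and, as a variation on the silent no-op described above, also returns the names that did not resolve so a report could list them. The unresolved tags are still stripped either way. `resolve_keep_names` is a hypothetical helper, not a function from the project.

```python
from PIL import ExifTags


def resolve_keep_names(names: list[str]) -> tuple[dict[str, int], list[str]]:
    """Map human-readable tag names to numeric EXIF tag ids.

    Returns (resolved, unknown). Unknown names are never kept; they are
    only returned so the caller can mention them in its output.
    """
    name_to_id = {name: tag_id for tag_id, name in ExifTags.TAGS.items()}
    resolved: dict[str, int] = {}
    unknown: list[str] = []
    for name in names:
        tag_id = name_to_id.get(name)
        if tag_id is None:
            unknown.append(name)
        else:
            resolved[name] = tag_id
    return resolved, unknown


resolved, unknown = resolve_keep_names(["Orientation", "Make", "Orientaton"])
```

Here the misspelled "Orientaton" ends up in `unknown`, which keeps the safe default (strip it) while giving the user a chance to notice the typo.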

3. PNG text chunks

PNG has no EXIF segment — it has tEXt, iTXt, and zTXt chunks that contain arbitrary key-value metadata. On a screenshot from macOS that usually means an XMP block and a couple of Software entries. On an image edited in Photoshop it can be dozens of chunks including camera raw settings and author strings.

Enumeration is the tricky part:

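from PIL import Image


def _png_text_keys(img: Image.Image) -> list[str]:
    # Text chunks can sit after the image data, so read the whole file first.
    img.load()
    keys: list[str] = []
    # tEXt / iTXt / zTXt entries collected by Pillow's PNG plugin.
    text = getattr(img, "text", None) or {}
    for k in text.keys():
        keys.append(str(k))
    # Metadata that Pillow also exposes through img.info.
    for k in ("XML:com.adobe.xmp", "Raw profile type exif", "exif"):
        if k in img.info and k not in keys:
            keys.append(k)
    return keys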

Pillow only fills the img.text dict after img.load() is called, which is an easy gotcha. The extra info keys catch the cases where Pillow promotes chunks out of text into info — XMP in particular lands in info["XML:com.adobe.xmp"].

Stripping them is almost anticlimactic: when we call img.save(..., format="PNG") without passing a PngInfo object, Pillow writes only the mandatory chunks. All text chunks are dropped.
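
A quick round trip shows the text-chunk behavior. This sketch builds a PNG with one tEXt entry, re-saves it without a `pnginfo` argument, and leaves both versions available for inspection. It only demonstrates text chunks and makes no claim about other ancillary chunks; `png_without_text` and `make_png_with_text` are illustrative names.

```python
from io import BytesIO

from PIL import Image
from PIL.PngImagePlugin import PngInfo


def make_png_with_text(key: str, value: str) -> bytes:
    """Build a tiny PNG that carries one text chunk."""
    meta = PngInfo()
    meta.add_text(key, value)
    buf = BytesIO()
    Image.new("RGB", (4, 4), "white").save(buf, format="PNG", pnginfo=meta)
    return buf.getvalue()


def png_without_text(data: bytes) -> bytes:
    """Re-save a PNG without passing pnginfo, so no text chunks are written."""
    img = Image.open(BytesIO(data))
    img.load()
    out = BytesIO()
    img.save(out, format="PNG")
    return out.getvalue()


def text_keys(data: bytes) -> list[str]:
    """List the text-chunk keys Pillow sees in a PNG."""
    img = Image.open(BytesIO(data))
    img.load()
    return sorted(img.text.keys())


original = make_png_with_text("Software", "demo-editor 1.0")
cleaned = png_without_text(original)
```

Comparing `text_keys(original)` with `text_keys(cleaned)` is the same before/after check the report performs, just in miniature.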

Atomic writes

One implementation detail that isn't a snippet but matters a lot for trust: in-place writes go through a temp-file-and-rename.

fd, tmp_path = tempfile.mkstemp(
    prefix=f".{target.name}.",
    suffix=".tmp",
    dir=str(target.parent),
)
os.close(fd)
tmp = Path(tmp_path)
try:
    img.save(tmp, **save_kwargs)
    size = tmp.stat().st_size
    os.replace(tmp, target)

The temp file lives in the same directory as the target so os.replace is atomic on POSIX (same filesystem). If the process is killed mid-write — laptop out of battery, user hits Ctrl-C, OOM killer — the original image at target is untouched. The worst case is an orphan .foo.jpg.XXXX.tmp file the user can delete. Compare that to the naive img.save(target, ...) which truncates the destination the instant the save starts and leaves you with a half-written file if anything goes wrong. For a batch tool processing thousands of images, getting this wrong is how you destroy someone's wedding photos.
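
For readers who want to reuse the pattern outside this tool, here is a self-contained sketch of the same temp-file-and-rename idea, standard library only. The `atomic_write` wrapper and its callback signature are assumptions for this example, and the cleanup branch is one plausible way to handle failures, not necessarily what the project does.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def atomic_write(target: Path, write: Callable[[Path], object]) -> int:
    """Write via a sibling temp file, then rename over the target.

    `write` receives the temp path and must produce the full output there.
    Returns the number of bytes written.
    """
    fd, tmp_name = tempfile.mkstemp(
        prefix=f".{target.name}.",
        suffix=".tmp",
        dir=str(target.parent),  # same directory, so same filesystem
    )
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        write(tmp)
        size = tmp.stat().st_size
        os.replace(tmp, target)  # the original is swapped out only here
        return size
    except BaseException:
        # On an ordinary failure, remove the partial temp file and re-raise.
        # A hard kill skips this branch and leaves an orphan .tmp behind.
        tmp.unlink(missing_ok=True)
        raise
```

With Pillow, the callback would be something like `lambda p: img.save(p, **save_kwargs)`. If the callback raises, the file at `target` is never opened for writing, which is the whole point.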

Tradeoffs and learnings

A few things I want to be honest about.

The --keep map is Pillow's canonical names only. Rarely-seen vendor-specific tags won't resolve. In practice users only ever type Orientation, so this hasn't been a problem, but a forensic user who wants to preserve a specific MakerNote tag is out of luck. The fix would be to also consult piexif's tables, at the cost of an extra dependency. For now I think "preserve the common tags, strip everything else" is the right ratio.

Dry-run bytes estimation re-encodes into a memory buffer. This is correct for the actually-written bytes but it does do the full re-encode work — dry-run on 10,000 images takes nearly as long as the real run. I considered shortcuts (estimating from EXIF segment size alone) and rejected them: the whole point of dry-run is that it's exact, not approximate. A fast-but-approximate dry run would defeat the audit-trail use case.
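
The in-memory estimate can be sketched as follows. It does the real encode into a `BytesIO` and measures the result, so the number matches what a real run would write for the same settings. `dry_run_sizes` and the returned keys are invented for this example, and the sketch ignores --keep, ICC handling, and rotation.

```python
from io import BytesIO

from PIL import Image


def dry_run_sizes(data: bytes) -> dict[str, int]:
    """Encode the stripped image into memory and report exact byte counts."""
    img = Image.open(BytesIO(data))
    kwargs = {"format": img.format}
    if img.format == "JPEG":
        # Same settings a real run would use for an unrotated JPEG.
        kwargs.update(exif=b"", quality="keep")
    buf = BytesIO()
    img.save(buf, **kwargs)  # full encode, but nothing touches the disk
    after = len(buf.getvalue())
    return {
        "bytes_before": len(data),
        "bytes_after": after,
        "bytes_saved": len(data) - after,
    }
```

The cost is visible in the code: there is no shortcut around `img.save`, which is why a dry run takes about as long as the real thing.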

The tension between --auto-rotate and --keep Orientation. I settled on "auto-rotate wins, and we clear Orientation from the kept-tags list". Some users want the pixels rotated and the Orientation tag set to 1 (normal) as a belt-and-braces defense against future viewers re-rotating. I punted on this because a normalized Orientation tag is not privacy-sensitive (it's just noise), and adding it would require either a special case or a more complicated flag surface. If you need it, file an issue.

Not stripping: HEIC, video, RAW. The README is loud about this. Every privacy tool has a scope boundary and the honest thing is to state it in the first paragraph instead of the FAQ.

Try it in 30 seconds

git clone https://github.com/sen-ltd/exif-stripper
cd exif-stripper
docker build -t exif-stripper .

# Dry run over a directory of photos
docker run --rm -v "$HOME/Pictures/vacation:/work" exif-stripper --dry-run .

# Real run with a JSON audit log
docker run --rm -v "$HOME/Pictures/vacation:/work" exif-stripper \
    --report json . > audit.json

# Cleaned copies into a sibling dir, preserving Orientation
docker run --rm -v "$HOME/Pictures:/work" exif-stripper \
    --out cleaned --keep Orientation vacation/

36 tests, a single runtime dependency (Pillow), an 80 MB Alpine image, and every action visible before it happens. That's the shape a privacy tool should take.
