The Problem I Solved
I needed to capture my physical environment programmatically. Not a screenshot. An actual photo from my webcam.
Most people reach for Selenium or Playwright to automate browser screenshots. But what about the real world? My desk? My face? The chaos behind me?
I spent a month building visual-perception — a macOS skill that automates Photo Booth captures and processes them with privacy-first principles. And here's what I learned: the act of automating visibility reveals everything broken about how we think about desktop automation.
The Big Picture Problem
Modern AI agents need sensory input beyond text and files. We're building systems that:
- Monitor physical workspaces
- Verify human presence in secure environments
- Audit office conditions during remote work
- Create timestamped visual logs
But the moment you touch webcam automation, you hit a wall of complications:
- Photos.app sandbox hell — Big Sur locked away the Photos library behind ironclad permissions. AppleScript can nominally access it, but the export path is broken on newer systems.
- No native webcam APIs for macOS automation — Python has opencv, but you need to compile it. Deno has nothing native. AppleScript can only talk to Photo Booth.
- Privacy theater vs. real privacy — Camera indicators light up, but users don't know when or why. Automated captures feel invasive even when they're benign.
Most solutions take the easy route: ignore the problem, bundle a "privacy warning," and hope nobody notices.
I chose different.
The Solution: Flat-File Photo Booth Integration
Instead of fighting Photos.app, I went straight to Photo Booth's native storage:
~/Pictures/Photo\ Booth\ 图库/Pictures/
Photo Booth (the macOS app) stores recent captures here as JPEGs. When you take a photo via Photo Booth, it lands in this directory automatically.
My visual-perception skill does this:
- Trigger Photo Booth via AppleScript — Activate the app, click the shutter button
- Wait for the file — Poll the Photo Booth directory for new JPEG
- Extract + validate — Read EXIF, check timestamp, verify it's actually new
- Copy to safe location — Move to .workbuddy/visual/photos/ with metadata
- Optional processing — Face detection, blur PII, log privacy events
No Photos.app. No permissions popups. No framework bloat.
It's 97 lines of Python. It works on Big Sur through Sonoma.
The Code (Simplified)
import subprocess
import time
from pathlib import Path
from datetime import datetime

def take_photo():
    photo_dir = Path.home() / "Pictures" / "Photo Booth 图库" / "Pictures"
    before = set(photo_dir.glob("*.jpeg"))

    # Activate Photo Booth, then fire the shutter via System Events.
    # (AppleScript comments use "--", and key codes must be sent
    # through System Events, not to Photo Booth directly.)
    script = """
    tell application "Photo Booth" to activate
    delay 0.5
    tell application "System Events"
        key code 49 -- spacebar triggers the shutter
    end tell
    delay 2
    """
    subprocess.run(["osascript", "-e", script], check=True)

    # Poll for the new file
    for _ in range(10):
        time.sleep(0.5)
        new_files = set(photo_dir.glob("*.jpeg")) - before
        if new_files:
            photo_path = new_files.pop()

            # Move to safe location
            output_dir = Path.home() / ".workbuddy" / "visual" / "photos"
            output_dir.mkdir(parents=True, exist_ok=True)
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            output_path = output_dir / f"photo_{timestamp}.jpg"
            photo_path.rename(output_path)
            return str(output_path)
    return None
Dead simple. No framework. No dependencies beyond stdlib.
Why This Matters
Three reasons this approach is better than the alternative:
1. Privacy by Design, Not by Promise
When you automate camera access, you're establishing a contract. Every capture should be:
- Logged — What was captured, when, why
- Inspectable — User can review the output
- Revocable — Can be deleted immediately
- Transparent — The system announces "taking photo" before executing
My implementation logs each capture:
{
"timestamp": "2026-04-01T01:15:00Z",
"source": "visual-perception-skill",
"output": "/Users/malt/.workbuddy/visual/photos/photo_20260401_011500.jpg",
"reason": "scheduled-environment-audit",
"privacy_flags": ["no-faces-detected", "no-screens-visible"]
}
The user (me) can audit this log. I can revoke access. I can see exactly what my agent saw.
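The log itself is just an append-only JSON Lines file. Here is a sketch of the writer (the `log_capture` helper and its signature are illustrative, not the skill's published interface):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def log_capture(log_path: Path, output: str, reason: str, privacy_flags: list):
    """Append one capture record as a single JSON line.

    One line per capture: easy to grep, easy to tail, and appending
    never rewrites history."""
    entry = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "source": "visual-perception-skill",
        "output": output,
        "reason": reason,
        "privacy_flags": privacy_flags,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a") as f:
        f.write(json.dumps(entry) + "\n")
```

Revocation then means deleting the photo and its log line, both of which are plain files under the user's control.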
2. Flat Files Beat Databases for Audit Trails
Every photo becomes a timestamped artifact in the filesystem. This is:
- Queryable — find . -mtime -1 finds yesterday's photos
- Portable — Copy the directory, and the history goes with it
- Immutable — There's no query layer to massage the data into different results; what you captured is what you have
- Compliant — Easier to show auditors "here's the raw data"
A vector database or traditional photo library obscures the source. Files don't.
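The same "yesterday's photos" query works from Python with nothing but the filesystem's mtime as the index (a sketch; `photos_since` is my own naming):

```python
import time
from pathlib import Path

def photos_since(photo_dir: Path, hours: float = 24.0):
    """Return photos modified within the last `hours`, newest first.

    The filesystem mtime is the only index needed: no database, no schema."""
    cutoff = time.time() - hours * 3600
    recent = [p for p in photo_dir.glob("*.jpg") if p.stat().st_mtime >= cutoff]
    return sorted(recent, key=lambda p: p.stat().st_mtime, reverse=True)
```

Because the capture step names files `photo_YYYYMMDD_HHMMSS.jpg`, sorting by name gives the same ordering, which is another nice property of flat files.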
3. Frameworks Were Killing Productivity
Before I built this, I evaluated:
- OpenCV + Python (100MB compile, GPU support overkill)
- Electron-based automation (bloated, slow startup)
- Expensive cloud APIs (network-dependent, high latency)
Each of them solves a far bigger problem than the one I actually had.
I needed a skill, not a framework. A tool that does one thing and does it right.
What I Learned
- Big Sur's Photo Booth directory varies by language setting — My system names it Photo\ Booth\ 图库, but English systems use Photo\ Booth\ Library. The skill auto-detects this now.
- AppleScript's key code doesn't work consistently — Photo Booth's UI has changed. I ended up using activate + delay and confirming the shot by watching file creation, not by trusting keypresses.
- Privacy indicators are UI theater — macOS shows the camera light, but it doesn't tell you what was captured or where it went. My logging layer fixes this.
- Desktop automation isn't mainstream because it's platform-specific — But that's a feature, not a bug. I don't need Electron or containerization. I need POSIX utilities, AppleScript, and flat files.
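The locale auto-detection is straightforward: check the localized library names you know about and fall back to a glob. A sketch, assuming only the two names I've personally seen (`find_photo_booth_dir` and the candidate list are illustrative; other system languages will add more names):

```python
from pathlib import Path

# Localized Photo Booth library names observed so far;
# other system languages may require more candidates.
CANDIDATES = ["Photo Booth Library", "Photo Booth 图库"]

def find_photo_booth_dir(pictures_dir: Path):
    """Return the inner Pictures folder of whichever Photo Booth library exists."""
    for name in CANDIDATES:
        candidate = pictures_dir / name / "Pictures"
        if candidate.is_dir():
            return candidate
    # Last resort: any bundle whose name starts with "Photo Booth"
    for bundle in pictures_dir.glob("Photo Booth*"):
        inner = bundle / "Pictures"
        if inner.is_dir():
            return inner
    return None
```

Called as `find_photo_booth_dir(Path.home() / "Pictures")`, this replaces the hard-coded path in the capture function.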
The Broader Question
As AI agents gain more autonomy, visual input will become table stakes. The systems that survive will be the ones that:
- Make privacy visible and auditable
- Avoid framework bloat (flat files > databases)
- Treat the filesystem as a first-class data layer
- Let humans retain control, not just consent
I published visual-perception as an open-source skill in the claude-skills repository. It's MIT-licensed. Use it to build your own sensory pipeline.
But more importantly: if you're building agents, ask yourself this question:
"Am I automating for efficiency, or am I automating to avoid thinking about what automation means?"
Because sometimes, the hardest engineering problem isn't the code.
It's the ethics.
Resources
- GitHub: claude-skills/visual-perception
- Skills Repository: WorkBuddy Skills Marketplace
What automation problem have you solved by going against the grain of popular frameworks? I'm curious — share in the comments.