The Problem I Solved
I needed to capture my physical environment programmatically. Not a screenshot. An actual photo from my webcam.
Most people reach for Selenium or Playwright to automate browser screenshots. But what about the real world? My desk? My face? The chaos behind me?
I spent a month building visual-perception — a macOS skill that automates Photo Booth captures and processes them with privacy-first principles. And here's what I learned: the act of automating visibility reveals everything broken about how we think about desktop automation.
The Big Picture Problem
Modern AI agents need sensory input beyond text and files. We're building systems that:
- Monitor physical workspaces
- Verify human presence in secure environments
- Audit office conditions during remote work
- Create timestamped visual logs
But the moment you touch webcam automation, you hit a wall of complications:
- Photos.app sandbox hell — Big Sur locked away the Photos library behind ironclad permissions. AppleScript can nominally access it, but the export path is broken on newer systems.
- No native webcam APIs for macOS automation — Python has opencv, but you need to compile it. Deno has nothing native. AppleScript can only talk to Photo Booth.
- Privacy theater vs. real privacy — Camera indicators light up, but users don't know when or why. Automated captures feel invasive even when they're benign.
Most solutions take the easy route: ignore the problem, bundle a "privacy warning," and hope nobody notices.
I chose different.
The Solution: Flat-File Photo Booth Integration
Instead of fighting Photos.app, I went straight to Photo Booth's native storage:
~/Pictures/Photo\ Booth\ 图库/Pictures/
Photo Booth (the macOS app) stores recent captures here as JPEGs. When you take a photo via Photo Booth, it lands in this directory automatically.
My visual-perception skill does this:
- Trigger Photo Booth via AppleScript — Activate the app, click the shutter button
- Wait for the file — Poll the Photo Booth directory for new JPEG
- Extract + validate — Read EXIF, check timestamp, verify it's actually new
- Copy to safe location — Move to .workbuddy/visual/photos/ with metadata
- Optional processing — Face detection, blur PII, log privacy events
No Photos.app. No permissions popups. No framework bloat.
It's 97 lines of Python. It works on Big Sur through Sonoma.
The Code (Simplified)
import subprocess
import time
from pathlib import Path
from datetime import datetime

def take_photo():
    photo_dir = Path.home() / "Pictures" / "Photo Booth 图库" / "Pictures"
    before = set(photo_dir.glob("*.jpeg"))

    # Activate Photo Booth, then fire the shutter via System Events.
    # (AppleScript comments use "--", and key codes must be sent
    # through System Events, not to Photo Booth directly.)
    script = """
    tell application "Photo Booth" to activate
    delay 0.5
    tell application "System Events"
        key code 49 -- spacebar triggers the shutter
    end tell
    delay 2
    """
    subprocess.run(["osascript", "-e", script], check=True)

    # Poll for the new file
    for _ in range(10):
        time.sleep(0.5)
        new_files = set(photo_dir.glob("*.jpeg")) - before
        if new_files:
            photo_path = new_files.pop()

            # Move to safe location
            output_dir = Path.home() / ".workbuddy" / "visual" / "photos"
            output_dir.mkdir(parents=True, exist_ok=True)
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            output_path = output_dir / f"photo_{timestamp}.jpg"
            photo_path.rename(output_path)
            return str(output_path)
    return None
Dead simple. No framework. No dependencies beyond stdlib.
Why This Matters
Three reasons this approach is better than the alternative:
1. Privacy by Design, Not by Promise
When you automate camera access, you're establishing a contract. Every capture should be:
- Logged — What was captured, when, why
- Inspectable — User can review the output
- Revocable — Can be deleted immediately
- Transparent — The system announces "taking photo" before executing
My implementation logs each capture:
{
"timestamp": "2026-04-01T01:15:00Z",
"source": "visual-perception-skill",
"output": "/Users/malt/.workbuddy/visual/photos/photo_20260401_011500.jpg",
"reason": "scheduled-environment-audit",
"privacy_flags": ["no-faces-detected", "no-screens-visible"]
}
The user (me) can audit this log. I can revoke access. I can see exactly what my agent saw.
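The log itself is just an append-only JSON Lines file. Here is a sketch of the writer (the `log_capture` helper and its signature are illustrative, not the skill's published interface):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def log_capture(log_path: Path, output: str, reason: str, privacy_flags: list):
    """Append one capture record as a single JSON line.

    One line per capture: easy to grep, easy to tail, and appending
    never rewrites history."""
    entry = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "source": "visual-perception-skill",
        "output": output,
        "reason": reason,
        "privacy_flags": privacy_flags,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a") as f:
        f.write(json.dumps(entry) + "\n")
```

Revocation then means deleting the photo and its log line, both of which are plain files under the user's control.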
2. Flat Files Beat Databases for Audit Trails
Every photo becomes a timestamped artifact in the filesystem. This is:
- Queryable — find . -mtime -1 finds yesterday's photos
- Portable — Copy the directory, and the history goes with it
- Immutable — There's no query layer to massage the data into different results; what you captured is what you have
- Compliant — Easier to show auditors "here's the raw data"
A vector database or traditional photo library obscures the source. Files don't.
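The same "yesterday's photos" query works from Python with nothing but the filesystem's mtime as the index (a sketch; `photos_since` is my own naming):

```python
import time
from pathlib import Path

def photos_since(photo_dir: Path, hours: float = 24.0):
    """Return photos modified within the last `hours`, newest first.

    The filesystem mtime is the only index needed: no database, no schema."""
    cutoff = time.time() - hours * 3600
    recent = [p for p in photo_dir.glob("*.jpg") if p.stat().st_mtime >= cutoff]
    return sorted(recent, key=lambda p: p.stat().st_mtime, reverse=True)
```

Because the capture step names files `photo_YYYYMMDD_HHMMSS.jpg`, sorting by name gives the same ordering, which is another nice property of flat files.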
3. Frameworks Were Killing Productivity
Before I built this, I evaluated:
- OpenCV + Python (100MB compile, GPU support overkill)
- Electron-based automation (bloated, slow startup)
- Expensive cloud APIs (network-dependent, high latency)
Each of them solves a far bigger problem than the one I actually had.
I needed a skill, not a framework. A tool that does one thing and does it right.
What I Learned
- Big Sur's Photo Booth directory varies by language setting — My system names it Photo\ Booth\ 图库, but English systems use Photo\ Booth\ Library. The skill auto-detects this now.
- AppleScript's key code doesn't work consistently — Photo Booth's UI has changed. I ended up using activate + delay and confirming the shot by watching file creation, not by trusting keypresses.
- Privacy indicators are UI theater — macOS shows the camera light, but it doesn't tell you what was captured or where it went. My logging layer fixes this.
- Desktop automation isn't mainstream because it's platform-specific — But that's a feature, not a bug. I don't need Electron or containerization. I need POSIX utilities, AppleScript, and flat files.
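The locale auto-detection is straightforward: check the localized library names you know about and fall back to a glob. A sketch, assuming only the two names I've personally seen (`find_photo_booth_dir` and the candidate list are illustrative; other system languages will add more names):

```python
from pathlib import Path

# Localized Photo Booth library names observed so far;
# other system languages may require more candidates.
CANDIDATES = ["Photo Booth Library", "Photo Booth 图库"]

def find_photo_booth_dir(pictures_dir: Path):
    """Return the inner Pictures folder of whichever Photo Booth library exists."""
    for name in CANDIDATES:
        candidate = pictures_dir / name / "Pictures"
        if candidate.is_dir():
            return candidate
    # Last resort: any bundle whose name starts with "Photo Booth"
    for bundle in pictures_dir.glob("Photo Booth*"):
        inner = bundle / "Pictures"
        if inner.is_dir():
            return inner
    return None
```

Called as `find_photo_booth_dir(Path.home() / "Pictures")`, this replaces the hard-coded path in the capture function.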
The Broader Question
As AI agents gain more autonomy, visual input will become table stakes. The systems that survive will be the ones that:
- Make privacy visible and auditable
- Avoid framework bloat (flat files > databases)
- Treat the filesystem as a first-class data layer
- Let humans retain control, not just consent
I published visual-perception as an open-source skill in the claude-skills repository. It's MIT-licensed. Use it to build your own sensory pipeline.
But more importantly: if you're building agents, ask yourself this question:
"Am I automating for efficiency, or am I automating to avoid thinking about what automation means?"
Because sometimes, the hardest engineering problem isn't the code.
It's the ethics.
Resources
- GitHub: claude-skills/visual-perception
- Skills Repository: WorkBuddy Skills Marketplace
What automation problem have you solved by going against the grain of popular frameworks? I'm curious — share in the comments.