[Day 4] I Had a Local AI Sort Through 25,000 Photos on My iPhone
Intro
Day 4: I'm going to hand the 25,000 photos sitting on my iPhone over to a local AI for sorting.
This is experiment #4.
What I'm using today: DGX Spark + CLIP (image-understanding AI from OpenAI) + Qwen2-VL (a vision-language model that can chat about images, from Alibaba).
Today's setup
- Data: 25,382 photos and videos sitting on my iPhone (96 GB).
- Goal: Have AI find unnecessary photos so I can drop my phone storage subscription.
- Approach:
- Stage 1: Quickly classify all 25K with CLIP.
- Stage 2: Have Qwen2-VL (a VLM) grade CLIP's classifications.
- Comparison axis: Lightweight + fast classifier (CLIP) vs. heavyweight + smart conversational AI (VLM).
Bottom line: Overall agreement of 84.5% when the VLM grades CLIP's classifications. People detection: 99.2% — only 59 misses out of 7,195 photos. Documents and screenshots ended up wrong about half the time. Oh, and I gave up midway and just dumped everything into Amazon Photos because I'd just learned Prime members get unlimited photo storage. Five years a Prime member, never knew.
🔧 Steps
Big picture flow:
iPhone
↓ ① Sync via iCloud for Windows
myPC1 (Windows)
↓ ② scp transfer to DGX (96 GB)
DGX (Linux)
├─ ③ Split photos and videos by extension
│ └ Photos 24,497 / Videos 884
├─ ④ Classify with CLIP (~20 min)
│ └ Sorted into 8 categories
└─ ⑤ Have VLM grade "is this category right?" (~3 hours)
└ Overall agreement: 84.5%
Let's walk through each step.
Getting photos onto the DGX (the biggest hurdle)
iPhone → myPC1 (a Windows laptop I use day-to-day) → DGX, a two-leg relay.
The myPC1 → DGX leg started at 0.5 MB/s, with the ETA showing "6 days." After realizing Wi-Fi was the bottleneck, I switched to wired LAN and pointed the transfer at the DGX's wired IP instead of its hostname (which was still resolving to the Wi-Fi interface), which got it up to 80 MB/s, roughly 160x faster. Burned half a day. More technical details in the collapsible section below.
Splitting photos and videos
The 25,382 transferred files broke down like this:
| Extension | Count | Type |
|---|---|---|
| HEIC | 13,107 | Photo (Apple's format) |
| JPG / JPEG | 10,721 | Photo |
| PNG | 660 | Photo (mostly screenshots) |
| WEBP | 9 | Photo |
| MOV | 799 | Video |
| MP4 | 85 | Video |
| ini | 1 | System file (ignored) |
I had Claude write a small script that splits photos and videos into separate folders by extension (one command, takes a few minutes — details in the collapsible section).
Result:
- Photos: 24,497
- Videos: 884
- Photos are the focus from here.
What is CLIP?
CLIP = an image-understanding AI from OpenAI, apparently. You hand it a photo and ask "is this a cat? a landscape? a screenshot?" with multiple labels, and it returns a similarity score for each. Lightweight and fast is its specialty, supposedly.
Stage 1: Classifying all 25K photos with CLIP
I set up 8 categories:
- Trash candidates: screenshot / document / blank
- Keep: food / landscape / other
- Review: people / cat
For each category, I prepared multiple English captions (e.g., "a screenshot of an app", "a photo of a cat") and used the maximum similarity. Also: anything below 0.5 confidence goes into the uncertain bucket for manual review.
Batch size 64, ~20 minutes of GPU time, all done. Results in the next section!
The "How accurate is it?" question
CLIP did the classification, but how accurate is it really?
Normally you'd verify by manual inspection, but eyeballing 25,000 photos is not realistic.
So I decided to have a smarter AI grade CLIP's classifications.
What is a VLM?
A VLM (Vision-Language Model) is an AI that can hold a conversation about images, apparently.
How it differs from CLIP:
| | CLIP | VLM |
|---|---|---|
| What it does | Category classification (returns probabilities) | Can describe image content in natural language |
| Smartness | Lightweight, fast, coarse | Heavy, slow, smart |
| Size | ~400 MB | ~16 GB |
I picked Qwen2-VL 7B Instruct (Alibaba). Apache 2.0 licensed for commercial use, no Hugging Face authentication required for download — those were the selection criteria.
The plan: ask the VLM "is this a screenshot? answer yes or no" for each image and record the answer.
Stage 2: Grading all 25K photos with VLM
Started at 16 seconds per image (roughly 5 days for the full set). The culprit was image size: resizing images down to fit within 448px dropped it to 0.3 sec/image, about 54x faster. Even with one-image-at-a-time inference, the full set takes ~2-3 hours.
Started before bed, woke up to 24,496 graded results.
📊 Results
CLIP's classification results
After CLIP processed 24,496 photos, the distribution looked like this:
private-data/iphone-photos-classified/
├── _trash-candidate/ Trash candidates
│ ├── screenshots/ (981)
│ ├── documents/ (1,804)
│ └── blank/ (59)
├── _review/ Review
│ ├── people/ (7,195)
│ ├── cat/ (1,009)
│ └── uncertain/ (7,700)
└── _keep/ Keep
├── food/ (1,682)
├── landscape/ (1,991)
└── other/ (2,075)
| Category | Count | Share |
|---|---|---|
| people | 7,195 | 29.4% |
| uncertain (low confidence) | 7,700 | 31.4% |
| other | 2,075 | 8.5% |
| landscape | 1,991 | 8.1% |
| document | 1,804 | 7.4% |
| food | 1,682 | 6.9% |
| cat | 1,009 | 4.1% |
| screenshot | 981 | 4.0% |
| blank | 59 | 0.2% |
That's a lot of cat photos...
Let's see how CLIP actually judged some of these.
🎯 Big wins
| ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|
| My cat | A meal | App screenshot | Mountain (landscape) |
| cat 0.97 | food 0.999 | screenshot 0.74 | landscape 0.98 |
CLIP nailed the cat without hesitation, food at 0.999, screenshots and landscapes too. Reliable.
✨ Subtly impressive recognition
| ![]() | ![]() |
|---|---|
| Cat keychain | Close-up of coffee beans |
| cat 0.64 | food 0.53 |
Even the keychain got recognized as "cat." And coffee beans up close as "food." Quietly impressive.
🤔 Funny misclassifications (CLIP's quirks)
Browsing thumbnails by category, some interesting patterns emerged.
Food edition: "Trash sorting chart" beats "homemade cake" for being food-like
| ![]() | ![]() |
|---|---|
| My homemade cake | Trash sorting chart |
| food 0.57 | food 0.83 |
Both ended up in the "food" category. Apparently the trash sorting chart looks more food-like to CLIP than my homemade cake. Reacting to the text? The table layout? Mystery.
People edition: "A doodle" beats "Mona Lisa" for being people-like
| ![]() | ![]() |
|---|---|
| The Mona Lisa | A face I doodled myself |
| people 0.50 | people 0.52 |
Both in the "people" category. My crappy doodle edges out da Vinci's Mona Lisa for being more "people-like" (just barely).
CLIP's quirks — kind of charming.
VLM's grading results
I asked the VLM, one photo at a time, whether CLIP's category was correct. For example, photos in the cat folder got "is this a cat?", food folder got "is this food?" — yes/no answers.
Summary by final destination bucket:
| Final bucket | Count | VLM agreement |
|---|---|---|
| people | 7,195 | 99.2% 🎯 |
| food | 1,682 | 95.3% 🎯 |
| cat | 1,009 | 95.0% 🎯 |
| other | 2,075 | 93.6% 🎯 |
| landscape | 1,991 | 83.5% ⚠️ |
| screenshot | 981 | 75.2% ⚠️ |
| document | 1,804 | 67.4% ⚠️ |
| blank | 59 | 52.5% ⚠️ |
| OVERALL | 24,496 | 84.5% |
People detection at 99.2% is quietly amazing. Out of 7,195 photos, the VLM said "no" to only 59.
Documents and screenshots, on the other hand, came back "no" about half the time. CLIP-only confidence isn't enough for those. Out of 24,496 photos, 3,808 got a "no" from the VLM — that's the part CLIP alone wouldn't have caught.
💡 Today's discoveries
Multimodal AI runs at home
Both CLIP (400 MB, classifier) and Qwen2-VL (16 GB, conversational) ran fine on my home machine. Reassuring.
CLIP's confidence is a reliable signal
VLM agreement broken down by CLIP confidence:
| CLIP confidence | Count | VLM agreement |
|---|---|---|
| 0.9+ (super confident) | 3,555 | 96.5% |
| 0.7–0.9 | 6,285 | 93.5% |
| 0.5–0.7 | 6,956 | 86.0% |
| <0.5 (uncertain) | 7,700 | 70.1% |
Boring but important: when an AI says it's confident, you can trust it.
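If you want to reproduce this breakdown, it's a small script over the saved results. Here's a rough sketch, assuming a merged JSON keyed by filename with clip_confidence and vlm_agrees fields (the file name and field names are illustrative, not my exact schema):

```python
import json
from collections import defaultdict

# Assumed shape: {"IMG_0001.HEIC": {"clip_confidence": 0.97, "vlm_agrees": true}, ...}
with open("vlm_verification.json") as f:
    results = json.load(f)

BINS = [(0.9, "0.9+"), (0.7, "0.7-0.9"), (0.5, "0.5-0.7"), (0.0, "<0.5")]
counts = defaultdict(lambda: [0, 0])  # bin label -> [total, agreed]

for r in results.values():
    # First bin whose lower bound the confidence clears
    label = next(name for lo, name in BINS if r["clip_confidence"] >= lo)
    counts[label][0] += 1
    counts[label][1] += int(r["vlm_agrees"])

for _, name in BINS:
    total, agreed = counts[name]
    if total:
        print(f"{name:8s} {total:6d}  {100 * agreed / total:.1f}%")
```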
CLIP's weak spots
Things that appear concretely in photos (people, food, cats, objects) score 93% or better. Abstract or compound subjects (documents, screenshots, landscapes) drop to the 67-84% range.
Documents at 67.4% in particular. That's where VLM re-grading earns its keep.
Role split: lightweight model × smart model
Use CLIP to triage everything quickly, VLM to grade the suspicious cases — a two-layer setup. Best of both worlds in speed and accuracy.
Day 3 had the same pattern: "aggregation = tools, interpretation = AI." Today's variant: "rough sorting = CLIP, accuracy check = VLM." Picking the right AI for the right task pays off in both performance and cost.
"Input quality matters more than model size" struck again
In Day 3 (credit card analysis), I learned "input quality > model size." The same pattern showed up today:
- VLM with original-resolution images: 16 sec/image (5 days for full run)
- VLM with resized 448px images: 0.3 sec/image (2 hours)
Just by tidying up the input, 54x speedup — small change, huge impact.
Not "biggest model possible" or "raw original" — clean up the input before sending it to the AI. This worked in Day 3 and Day 4 in a row.
Heart broken, switched to Amazon Photos
I tried to verify the trash candidate folder, then realized I'd need to cross-reference VLM scores too, then realized I never set clear criteria for "what to delete" in the first place. Couldn't finalize the cleanup, and morale broke.
Right then I learned that Amazon Prime members get unlimited photo storage, so I just dumped everything into Amazon Photos. Lol.
That said, I really should have defined the deletion criteria before starting.
The classified data on the DGX is a useful resource for future Day experiments.
🛠️ How I actually did this
:::details Wi-Fi 0.5 MB/s → wired LAN 80 MB/s journey
myPC1 → DGX over 96 GB started at 236 KB/s via WinSCP (ETA: 6 days). The cause was myPC1 being on Wi-Fi.
I plugged the PC into the router with a LAN cable → ping dropped close to 0 ms. But WinSCP was still stuck at 500 KB/s.
PowerShell ping spark-XXXX.local revealed the address resolved to DGX's Wi-Fi-side IP. The DGX was dual-homed (wired + Wi-Fi), and mDNS was returning the old route.
# Failure (routes through Wi-Fi)
scp -r "C:\Users\[user]\Pictures\iCloud Photos\Photos" [user]@spark-XXXX.local:...
# Success (direct IP over wired LAN)
scp -r "C:\Users\[user]\Pictures\iCloud Photos\Photos" [user]@10.0.0.205:...
Switched from hostname to explicit IP and watched it scream:
IMG_0190.HEIC 100% 1812KB 84.3MB/s 00:00
IMG_0190.MOV 100% 17MB 102.4MB/s 00:00
IMG_0192.HEIC 100% 2256KB 81.6MB/s 00:00
Also discovered WinSCP (SFTP-based) struggles with many small files, while scp (stream transfer) is much faster. With 25,382 files, scp won by a landslide.
:::
:::details Splitting photos and videos by extension
from pathlib import Path
import shutil

# Example paths -- adjust to where the transferred files actually live
input_dir = Path("private-data/iphone-photos-raw")
photos_out = Path("private-data/photos")
videos_out = Path("private-data/videos")
for d in (photos_out, videos_out):
    d.mkdir(parents=True, exist_ok=True)

PHOTO_EXTS = {".jpg", ".jpeg", ".heic", ".heif", ".png", ".webp"}
VIDEO_EXTS = {".mov", ".mp4", ".m4v"}
for src in input_dir.rglob("*"):
    if not src.is_file():
        continue
    ext = src.suffix.lower()
    if ext in PHOTO_EXTS:
        shutil.move(str(src), str(photos_out / src.name))
    elif ext in VIDEO_EXTS:
        shutil.move(str(src), str(videos_out / src.name))
Simple. Caught one snag: right after transfer, the directory permission was dr-x------ (read-only), so the first shutil.move died with PermissionError. chmod u+w fixed it.
:::
:::details CLIP classification script
Used transformers to load openai/clip-vit-base-patch32. For each category, multiple captions are prepared, and the max softmax score is used:
CATEGORIES = {
    "screenshot": [
        "a screenshot of an app",
        "a phone screenshot",
        "a screenshot of a website or chat",
    ],
    "document": [
        "a photo of a document or paper",
        "a photo of a receipt",
        "a QR code or barcode",
        "a photo of an ID card or driver's license",
    ],
    "people": [
        "a photo of a person",
        "a photo of people",
        "a portrait of someone",
    ],
    "cat": ["a photo of a cat"],
    "food": ["a photo of food or a meal"],
    "landscape": [
        "a photo of a landscape or scenery",
        "a photo of a building or city",
    ],
    "other": ["a photo of an object or item"],
}

inputs = processor(text=text_prompts, images=images,
                   return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
Anything below 0.5 confidence goes into _review/uncertain/. Near-black/near-white images get caught by a brightness check and routed to _trash-candidate/blank/ before they reach CLIP.
All per-image category scores are also saved to JSON. That JSON is what the VLM evaluation step consumes later.
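Putting those pieces together, here's a minimal single-image sketch of the caption-max scoring and the 0.5 threshold routing. It assumes the CATEGORIES dict above; my real script batches 64 images at a time and writes the scores to JSON, so treat the function and variable names here as illustrative:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Flatten all captions and remember which category each caption belongs to
text_prompts = [cap for caps in CATEGORIES.values() for cap in caps]
prompt_cat = [cat for cat, caps in CATEGORIES.items() for _ in caps]

def classify(path):
    image = Image.open(path).convert("RGB")
    inputs = processor(text=text_prompts, images=image,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=1)[0]
    # Per-category score = max softmax score over that category's captions
    scores = {}
    for cat, p in zip(prompt_cat, probs.tolist()):
        scores[cat] = max(scores.get(cat, 0.0), p)
    best_cat, best_score = max(scores.items(), key=lambda kv: kv[1])
    # Anything under 0.5 goes to _review/uncertain/ instead of its top category
    if best_score < 0.5:
        return "uncertain", best_score, scores
    return best_cat, best_score, scores
```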
:::
:::details The 54x speedup from image resizing for VLM
Qwen2-VL's vision token count scales with input resolution. Original-size images (several thousand pixels) consume hundreds to thousands of tokens, slowing inference dramatically.
from PIL import Image, ImageOps
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=224 * 224,
    max_pixels=448 * 448,  # ← cap here
)

# Belt and suspenders -- also pre-resize each image before it hits the processor
img = Image.open(path).convert("RGB")  # path = the photo being checked
img = ImageOps.exif_transpose(img)
img.thumbnail((448, 448))
That took 16 sec/image → 0.3 sec/image.
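Why such a big jump? My rough mental model (a back-of-the-envelope estimate, not an exact formula; the processor also rounds dimensions to multiples of 28) is that Qwen2-VL turns roughly each 28x28 pixel block into one vision token (14px ViT patches merged 2x2), so token count grows with image area:

```python
def approx_vision_tokens(width: int, height: int) -> int:
    # Rough estimate: ~one vision token per 28x28 pixel block
    # (14px ViT patches merged 2x2 in Qwen2-VL).
    return (width // 28) * (height // 28)

print(approx_vision_tokens(1920, 1080))  # ~2,600 tokens for a big screenshot
print(approx_vision_tokens(448, 336))    # ~190 tokens after the 448px cap
```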
The verification prompt is dead simple:
CATEGORY_PROMPTS = {
    "screenshot": "Is this image a screenshot of a phone screen, an app, or a website? Answer with one word: yes or no.",
    "document": "Is this image primarily a document, receipt, ID card, or QR code? Answer with one word: yes or no.",
    "people": "Does this image clearly show one or more human persons? Answer with one word: yes or no.",
    # ...
}
max_new_tokens=5 means only yes/no comes back. Minimal design.
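For reference, here's a simplified sketch of one yes/no query end to end, combining the resize settings above with these prompts. The actual script wraps this in the loop and checkpointing shown in the next section, so the names here are illustrative:

```python
import torch
from PIL import Image, ImageOps
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=224 * 224, max_pixels=448 * 448
)

def vlm_agrees(path, category):
    # Pre-resize, then ask the category's yes/no question
    img = Image.open(path).convert("RGB")
    img = ImageOps.exif_transpose(img)
    img.thumbnail((448, 448))
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": CATEGORY_PROMPTS[category]},
    ]}]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[prompt], images=[img],
                       padding=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    new_tokens = out[:, inputs["input_ids"].shape[1]:]
    answer = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
    return answer.strip().lower().startswith("yes")
```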
:::
:::details Resumable checkpointing
Running 24,000 images for 3 hours straight, you really want recovery if something hiccups:
CHECKPOINT_INTERVAL = 100

for i, (name, r) in enumerate(todo):
    # ... inference ...
    if (i + 1) % CHECKPOINT_INTERVAL == 0:
        save_checkpoint(results, output_path)
And a --resume flag that picks up where the JSON left off:
if args.resume and args.output.is_file():
    with args.output.open() as f:
        results = json.load(f)
    print(f"Resumed from {len(results)} existing entries")
todo = [(name, r) for name, r in clip_data.items() if name not in results]
Essential for any overnight job.
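save_checkpoint itself isn't shown above; a minimal version in the same spirit (my sketch, not the exact code) writes to a temp file first and then renames it, so an interrupt mid-write can't corrupt the results JSON:

```python
import json
from pathlib import Path

def save_checkpoint(results: dict, output_path: Path) -> None:
    # Write to a sibling temp file, then atomically swap it into place.
    tmp = output_path.with_suffix(".tmp")
    with tmp.open("w") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    tmp.replace(output_path)
```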
:::
Next up: Day 5
Tomorrow: have an AI analyze a year of my Amazon purchase history.
Switching to Amazon Photos for storage made me realize Amazon also has my entire purchase history. What if I asked AI "what kind of person am I, based on this?" — see what patterns emerge that I never noticed myself.
To be continued >>>









