DEV Community

PEPPERCORN
[Day 4] I Had a Local AI Sort Through 25,000 Photos on My iPhone

Intro

Day 4: I'm going to hand the 25,000 photos sitting on my iPhone over to a local AI for sorting.

This is experiment #4.

What I'm using today: DGX Spark + CLIP (image-understanding AI from OpenAI) + Qwen2-VL (a vision-language model that can chat about images, from Alibaba).


Today's setup

  • Data: 25,382 photos and videos sitting on my iPhone (96 GB).
  • Goal: Have AI find unnecessary photos so I can drop my phone storage subscription.
  • Approach:
    • Stage 1: Quickly classify all 25K with CLIP.
    • Stage 2: Have Qwen2-VL (a VLM) grade CLIP's classifications.
  • Comparison axis: Lightweight + fast classifier (CLIP) vs. heavyweight + smart conversational AI (VLM).

Bottom line: Overall agreement of 84.5% when the VLM grades CLIP's classifications. People detection: 99.2% — only 59 misses out of 7,195 photos. Documents and screenshots ended up wrong about half the time. Oh, and I gave up midway and just dumped everything into Amazon Photos because I'd just learned Prime members get unlimited photo storage. Five years a Prime member, never knew.


🔧 Steps

Big picture flow:

```
iPhone
   ↓ ① Sync via iCloud for Windows
myPC1 (Windows)
   ↓ ② scp transfer to DGX (96 GB)
DGX (Linux)
   ├─ ③ Split photos and videos by extension
   │     └ Photos 24,497 / Videos 884
   ├─ ④ Classify with CLIP (~20 min)
   │     └ Sorted into 8 categories
   └─ ⑤ Have VLM grade "is this category right?" (~3 hours)
         └ Overall agreement: 84.5%
```

Let's walk through each step.

Getting photos onto the DGX (the biggest hurdle)

iPhone → myPC1 (a Windows laptop I use day-to-day) → DGX, a two-leg relay.

The first leg started at 0.5 MB/s, with the ETA showing "6 days." After realizing my Wi-Fi was the bottleneck, I switched to wired LAN, fixed the hostname-resolution path, and got it up to 80 MB/s (~160x faster). Burned half a day. More technical details in the collapsible section below.

Splitting photos and videos

The 25,382 transferred files broke down like this:

| Extension | Count | Type |
| --- | --- | --- |
| HEIC | 13,107 | Photo (Apple's format) |
| JPG / JPEG | 10,721 | Photo |
| PNG | 660 | Photo (mostly screenshots) |
| WEBP | 9 | Photo |
| MOV | 799 | Video |
| MP4 | 85 | Video |
| ini | 1 | System file (ignored) |

I had Claude write a small script that splits photos and videos into separate folders by extension (one command, takes a few minutes — details in the collapsible section).

Result:

  • Photos: 24,497
  • Videos: 884
  • Photos are the focus from here.

What is CLIP?

CLIP = an image-understanding AI from OpenAI, apparently. You hand it a photo and ask "is this a cat? a landscape? a screenshot?" with multiple labels, and it returns a similarity score for each. Lightweight and fast is its specialty, supposedly.

Stage 1: Classifying all 25K photos with CLIP

I set up 8 categories:

  • Trash candidates: screenshot / document / blank
  • Keep: food / landscape / other
  • Review: people / cat

For each category, I prepared multiple English captions (e.g., "a screenshot of an app", "a photo of a cat") and used the maximum similarity. Also: anything below 0.5 confidence goes into the uncertain bucket for manual review.

Batch size 64, ~20 minutes of GPU time, all done. Results in the next section!

The "How accurate is it?" question

CLIP did the classification, but how accurate is it really?

Normally you'd verify by manual inspection, but eyeballing 25,000 photos is not realistic.

So I decided to have a smarter AI grade CLIP's classifications.

What is a VLM?

A VLM (Vision-Language Model) is an AI that can hold a conversation about images, apparently.

How it differs from CLIP:

| | CLIP | VLM |
| --- | --- | --- |
| What it does | Category classification (returns probabilities) | Describes image content in natural language |
| Smartness | Lightweight, fast, coarse | Heavy, slow, smart |
| Size | ~400 MB | ~16 GB |

I picked Qwen2-VL 7B Instruct (Alibaba). Apache 2.0 licensed for commercial use, no Hugging Face authentication required for download — those were the selection criteria.

The plan: ask the VLM "is this a screenshot? answer yes or no" for each image and record the answer.

Stage 2: Grading all 25K photos with VLM

Started at 16 seconds per image (~5 days for the full set). The cause was image size — resizing to 448px on the short side dropped it to 0.3 sec/image (~54x faster). Even with one-image-at-a-time inference, the full set takes ~2-3 hours.

Started before bed, woke up to 24,496 graded results.


📊 Results

CLIP's classification results

After CLIP processed 24,496 photos, the distribution looked like this:

```
private-data/iphone-photos-classified/
├── _trash-candidate/      Trash candidates
│   ├── screenshots/    (981)
│   ├── documents/    (1,804)
│   └── blank/           (59)
├── _review/                Review
│   ├── people/       (7,195)
│   ├── cat/          (1,009)
│   └── uncertain/    (7,700)
└── _keep/                  Keep
    ├── food/         (1,682)
    ├── landscape/    (1,991)
    └── other/        (2,075)
```
| Category | Count | Share |
| --- | --- | --- |
| people | 7,195 | 29.4% |
| uncertain (low confidence) | 7,700 | 31.4% |
| other | 2,075 | 8.5% |
| landscape | 1,991 | 8.1% |
| document | 1,804 | 7.4% |
| food | 1,682 | 6.9% |
| cat | 1,009 | 4.1% |
| screenshot | 981 | 4.0% |
| blank | 59 | 0.2% |

That's a lot of cat photos...

Let's see how CLIP actually judged some of these.

🎯 Big wins

| cat | food | screenshot | landscape |
| --- | --- | --- | --- |
| My cat | A meal | App screenshot | Mountain (landscape) |
| cat 0.97 | food 0.999 | screenshot 0.74 | landscape 0.98 |

CLIP nailed the cat without hesitation, food at 0.999, screenshots and landscapes too. Reliable.

✨ Subtly impressive recognition

| keychain | coffee |
| --- | --- |
| Cat keychain | Close-up of coffee beans |
| cat 0.64 | food 0.53 |

Even the keychain got recognized as "cat." And coffee beans up close as "food." Quietly impressive.

🤔 Funny misclassifications (CLIP's quirks)

Browsing thumbnails by category, some interesting patterns emerged.

Food edition: "Trash sorting chart" beats "homemade cake" for being food-like
| cake | trash chart |
| --- | --- |
| My homemade cake | Trash sorting chart |
| food 0.57 | food 0.83 |

Both ended up in the "food" category. Apparently the trash sorting chart looks more food-like to CLIP than my homemade cake. Reacting to the text? The table layout? Mystery.

People edition: "A doodle" beats "Mona Lisa" for being people-like
| Mona Lisa | doodle |
| --- | --- |
| The Mona Lisa | A face I doodled myself |
| people 0.50 | people 0.52 |

Both in the "people" category. My crappy doodle edges out da Vinci's Mona Lisa for being more "people-like" (just barely).

CLIP's quirks — kind of charming.


VLM's grading results

I asked the VLM, one photo at a time, whether CLIP's category was correct. For example, photos in the cat folder got "is this a cat?", food folder got "is this food?" — yes/no answers.

Summary by final destination bucket:

| Final bucket | Count | VLM agreement |
| --- | --- | --- |
| people | 7,195 | 99.2% 🎯 |
| food | 1,682 | 95.3% 🎯 |
| cat | 1,009 | 95.0% 🎯 |
| other | 2,075 | 93.6% 🎯 |
| landscape | 1,991 | 83.5% ⚠️ |
| screenshot | 981 | 75.2% ⚠️ |
| document | 1,804 | 67.4% ⚠️ |
| blank | 59 | 52.5% ⚠️ |
| **OVERALL** | 24,496 | **84.5%** |

People detection at 99.2% is quietly amazing. Out of 7,195 photos, the VLM said "no" to only 59.

Documents and screenshots, on the other hand, came back "no" about half the time. CLIP-only confidence isn't enough for those. Out of 24,496 photos, 3,808 got a "no" from the VLM — that's the part CLIP alone wouldn't have caught.


💡 Today's discoveries

Multimodal AI runs at home

Both CLIP (400 MB, classifier) and Qwen2-VL (16 GB, conversational) ran fine on my home machine. Reassuring.

CLIP's confidence is a reliable signal

VLM agreement broken down by CLIP confidence:

| CLIP confidence | Count | VLM agreement |
| --- | --- | --- |
| 0.9+ (super confident) | 3,555 | 96.5% |
| 0.7–0.9 | 6,285 | 93.5% |
| 0.5–0.7 | 6,956 | 86.0% |
| <0.5 (uncertain) | 7,700 | 70.1% |
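This kind of breakdown is easy to compute from the saved JSON. A minimal sketch, assuming each record carries CLIP's top score and the VLM's yes/no verdict (the `confidence` and `vlm_agrees` field names are stand-ins, not the actual schema):

```python
from collections import defaultdict

# Hypothetical record shape: {"confidence": float, "vlm_agrees": bool}
BINS = [(0.9, "0.9+"), (0.7, "0.7-0.9"), (0.5, "0.5-0.7"), (0.0, "<0.5")]

def bucket(conf):
    # Map a CLIP confidence to its reporting bucket.
    for lo, label in BINS:
        if conf >= lo:
            return label
    return BINS[-1][1]

def agreement_by_bucket(records):
    # Tally total vs. VLM-agreed records per confidence bucket.
    totals, agreed = defaultdict(int), defaultdict(int)
    for r in records:
        b = bucket(r["confidence"])
        totals[b] += 1
        agreed[b] += r["vlm_agrees"]
    return {b: agreed[b] / totals[b] for b in totals}
```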

Boring but important: when an AI says it's confident, you can trust it.

CLIP's weak spots

Things that clearly appear in photos — people, food, cats, objects — score 95%+. Abstract or compound subjects — documents, screenshots, landscapes — drop to 60-80%.

Documents at 67.4% in particular. That's where VLM re-grading earns its keep.

Role split: lightweight model × smart model

Use CLIP to triage everything quickly, VLM to grade the suspicious cases — a two-layer setup. Best of both worlds in speed and accuracy.
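The two-layer idea fits in a few lines. A sketch with hypothetical stand-ins for both models (`clip_classify` returning a category plus confidence, `vlm_verify` returning a yes/no):

```python
SUSPECT = frozenset({"screenshot", "document", "blank"})

def triage(photo, clip_classify, vlm_verify, threshold=0.5):
    # Stage 1: cheap CLIP pass over everything.
    category, confidence = clip_classify(photo)
    if confidence < threshold:
        return "uncertain"  # too unsure, goes to the manual review pile
    # Stage 2: spend slow VLM time only on the trash candidates.
    if category in SUSPECT and not vlm_verify(photo, category):
        return "uncertain"  # the VLM vetoed the cheap classifier
    return category
```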

Day 3 had the same pattern: "aggregation = tools, interpretation = AI." Today's variant: "rough sorting = CLIP, accuracy check = VLM." Picking the right AI for the right task pays off in both performance and cost.

"Input quality matters more than model size" struck again

In Day 3 (credit card analysis), I learned "input quality > model size." The same pattern showed up today:

  • VLM with original-resolution images: 16 sec/image (5 days for full run)
  • VLM with resized 448px images: 0.3 sec/image (2 hours)

Just by tidying up the input, 54x speedup — small change, huge impact.

Not "biggest model possible" or "raw original" — clean up the input before sending it to the AI. This worked in Day 3 and Day 4 in a row.

Heart broken, switched to Amazon Photos

I tried to verify the trash candidate folder, then realized I'd need to cross-reference VLM scores too, then realized I never set clear criteria for "what to delete" in the first place. Couldn't finalize the cleanup, and morale broke.

Right then I learned that Amazon Prime members get unlimited photo storage, so I just dumped everything into Amazon Photos. Lol.

That said, I really should have defined the deletion criteria before starting.

The classified data on the DGX is a useful resource for future Day experiments.


🛠️ How I actually did this

:::details Wi-Fi 0.5 MB/s → wired LAN 80 MB/s journey

myPC1 → DGX over 96 GB started at 236 KB/s via WinSCP (ETA: 6 days). The cause was myPC1 being on Wi-Fi.

I plugged the PC into the router with a LAN cable → ping dropped close to 0 ms. But WinSCP was still stuck at 500 KB/s.

PowerShell ping spark-XXXX.local revealed the address resolved to DGX's Wi-Fi-side IP. The DGX was dual-homed (wired + Wi-Fi), and mDNS was returning the old route.

```shell
# Failure (routes through Wi-Fi)
scp -r "C:\Users\[user]\Pictures\iCloud Photos\Photos" [user]@spark-XXXX.local:...

# Success (direct IP over wired LAN)
scp -r "C:\Users\[user]\Pictures\iCloud Photos\Photos" [user]@10.0.0.205:...
```

Switched from hostname to explicit IP and watched it scream:

```
IMG_0190.HEIC                100% 1812KB  84.3MB/s   00:00
IMG_0190.MOV                 100%   17MB 102.4MB/s   00:00
IMG_0192.HEIC                100% 2256KB  81.6MB/s   00:00
```

Also discovered WinSCP (SFTP-based) struggles with many small files, while scp (stream transfer) is much faster. With 25,382 files, scp won by a landslide.

:::

:::details Splitting photos and videos by extension

```python
import shutil
from pathlib import Path

PHOTO_EXTS = {".jpg", ".jpeg", ".heic", ".heif", ".png", ".webp"}
VIDEO_EXTS = {".mov", ".mp4", ".m4v"}

input_dir = Path("Photos")  # paths illustrative
photos_out = Path("photos"); photos_out.mkdir(exist_ok=True)
videos_out = Path("videos"); videos_out.mkdir(exist_ok=True)

for src in input_dir.rglob("*"):
    if not src.is_file():
        continue
    ext = src.suffix.lower()
    if ext in PHOTO_EXTS:
        shutil.move(str(src), str(photos_out / src.name))
    elif ext in VIDEO_EXTS:
        shutil.move(str(src), str(videos_out / src.name))
```

Simple. Caught one snag: right after transfer, the directory permission was dr-x------ (read-only), so the first shutil.move died with PermissionError. chmod u+w fixed it.
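The same fix can also be done from Python instead of the shell. A small sketch (hypothetical helper, not part of the original script):

```python
import stat
from pathlib import Path

def make_writable(root):
    # Add the user-write bit across the tree so shutil.move can relocate
    # files (equivalent to `chmod -R u+w root`).
    for p in [Path(root), *Path(root).rglob("*")]:
        p.chmod(p.stat().st_mode | stat.S_IWUSR)
```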

:::

:::details CLIP classification script

Used transformers to load openai/clip-vit-base-patch32. For each category, multiple captions are prepared, and the max softmax score is used:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

CATEGORIES = {
    "screenshot": [
        "a screenshot of an app",
        "a phone screenshot",
        "a screenshot of a website or chat",
    ],
    "document": [
        "a photo of a document or paper",
        "a photo of a receipt",
        "a QR code or barcode",
        "a photo of an ID card or driver's license",
    ],
    "people": [
        "a photo of a person",
        "a photo of people",
        "a portrait of someone",
    ],
    "cat": ["a photo of a cat"],
    "food": ["a photo of food or a meal"],
    "landscape": [
        "a photo of a landscape or scenery",
        "a photo of a building or city",
    ],
    "other": ["a photo of an object or item"],
}

# Flatten every caption into one prompt list (order matters for slicing below)
text_prompts = [cap for caps in CATEGORIES.values() for cap in caps]

# `images` is the current batch of PIL images (batch size 64)
inputs = processor(text=text_prompts, images=images,
                   return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # (batch, num_captions)

# Per category, the score is the max similarity across its captions
start, scores = 0, {}
for name, caps in CATEGORIES.items():
    scores[name] = probs[:, start:start + len(caps)].max(dim=1).values
    start += len(caps)
```

Anything below 0.5 confidence goes into _review/uncertain/. Near-black/near-white images get caught by a brightness check and routed to _trash-candidate/blank/ before they reach CLIP.
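The brightness check itself is nearly a one-liner with Pillow. A sketch with illustrative thresholds (the actual cutoffs aren't in the post):

```python
from PIL import Image, ImageStat

def is_blank(path, dark=10, bright=245):
    # Mean grayscale brightness; 0 is pure black, 255 pure white.
    # The `dark`/`bright` thresholds are illustrative guesses.
    img = Image.open(path).convert("L")
    mean = ImageStat.Stat(img).mean[0]
    return mean < dark or mean > bright
```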

All per-image category scores are also saved to JSON. That JSON is what the VLM evaluation step consumes later.

:::

:::details The 54x speedup from image resizing for VLM

Qwen2-VL's vision token count scales with input resolution. Original-size images (several thousand pixels) consume hundreds to thousands of tokens, slowing inference dramatically.

```python
from PIL import Image, ImageOps
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=224 * 224,
    max_pixels=448 * 448,  # ← cap here
)

# Belt and suspenders — also pre-resize the image
img = Image.open(path).convert("RGB")
img = ImageOps.exif_transpose(img)  # respect EXIF rotation
img.thumbnail((448, 448))           # in-place downscale, keeps aspect ratio
```

That took 16 sec/image → 0.3 sec/image.

The verification prompt is dead simple:

```python
CATEGORY_PROMPTS = {
    "screenshot": "Is this image a screenshot of a phone screen, an app, or a website? Answer with one word: yes or no.",
    "document":   "Is this image primarily a document, receipt, ID card, or QR code? Answer with one word: yes or no.",
    "people":     "Does this image clearly show one or more human persons? Answer with one word: yes or no.",
    # ...
}
```

max_new_tokens=5 means only yes/no comes back. Minimal design.
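For completeness, here is roughly what the per-image yes/no call looks like with the standard transformers chat-template API. A sketch, not the post's exact script (the loaded `model` and `processor` are passed in):

```python
def parse_yes_no(text: str) -> bool:
    # Tolerate "Yes.", "yes,", stray whitespace, etc.
    return text.strip().lower().startswith("yes")

def verify(model, processor, image, question: str) -> bool:
    # One image, one yes/no question; max_new_tokens=5 keeps replies short.
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image],
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=5)
    # Decode only the newly generated tokens, not the prompt.
    reply = processor.batch_decode(out[:, inputs["input_ids"].shape[-1]:],
                                   skip_special_tokens=True)[0]
    return parse_yes_no(reply)
```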

:::

:::details Resumable checkpointing

Running 24,000 images for 3 hours straight, you really want recovery if something hiccups:

```python
CHECKPOINT_INTERVAL = 100

for i, (name, r) in enumerate(todo):
    # ... inference ...
    if (i + 1) % CHECKPOINT_INTERVAL == 0:
        save_checkpoint(results, output_path)
```

And a --resume flag that picks up where the JSON left off:

```python
results = {}  # fresh run by default
if args.resume and args.output.is_file():
    with args.output.open() as f:
        results = json.load(f)
    print(f"Resumed from {len(results)} existing entries")
todo = [(name, r) for name, r in clip_data.items() if name not in results]
```

Essential for any overnight job.
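`save_checkpoint` isn't shown in the post; a minimal crash-safe version would write to a temp file and atomically swap it in (a sketch, assuming the results dict is JSON-serializable):

```python
import json
import os
import tempfile

def save_checkpoint(results, output_path):
    # Write to a temp file in the same directory, then atomically replace,
    # so a crash mid-write never corrupts the previous checkpoint.
    directory = os.path.dirname(os.fspath(output_path)) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".json.tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(results, f, ensure_ascii=False)
        os.replace(tmp, output_path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp)  # clean up the partial temp file
        raise
```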

:::


Next up: Day 5

Tomorrow: have an AI analyze a year of my Amazon purchase history.

Switching to Amazon Photos for storage made me realize Amazon also has my entire purchase history. What if I asked AI "what kind of person am I, based on this?" — see what patterns emerge that I never noticed myself.

To be continued >>>


#100ExperimentsWithDGX #LocalLLM #ImageClassification #CLIP
