[Day 4] I Had a Local AI Sort Through 25,000 Photos on My iPhone
Intro
Day 4: I'm going to hand the 25,000 photos sitting on my iPhone over to a local AI for sorting.
This is experiment #4.
What I'm using today: DGX Spark + CLIP (image-understanding AI from OpenAI) + Qwen2-VL (a vision-language model that can chat about images, from Alibaba).
Today's setup
- Data: 25,382 photos and videos sitting on my iPhone (96 GB).
- Goal: Have AI find unnecessary photos so I can drop my phone storage subscription.
- Approach:
- Stage 1: Quickly classify all 25K with CLIP.
- Stage 2: Have Qwen2-VL (a VLM) grade CLIP's classifications.
- Comparison axis: Lightweight + fast classifier (CLIP) vs. heavyweight + smart conversational AI (VLM).
Bottom line: Overall agreement of 84.5% when the VLM grades CLIP's classifications. People detection: 99.2% — only 59 misses out of 7,195 photos. Documents and screenshots ended up wrong about half the time. Oh, and I gave up midway and just dumped everything into Amazon Photos because I'd just learned Prime members get unlimited photo storage. Five years a Prime member, never knew.
🔧 Steps
Big picture flow:
iPhone
↓ ① Sync via iCloud for Windows
myPC1 (Windows)
↓ ② scp transfer to DGX (96 GB)
DGX (Linux)
├─ ③ Split photos and videos by extension
│ └ Photos 24,497 / Videos 884
├─ ④ Classify with CLIP (~20 min)
│ └ Sorted into 8 categories
└─ ⑤ Have VLM grade "is this category right?" (~3 hours)
└ Overall agreement: 84.5%
Let's walk through each step.
Getting photos onto the DGX (the biggest hurdle)
iPhone → myPC1 (a Windows laptop I use day-to-day) → DGX, a two-leg relay.
The myPC1 → DGX leg started at 0.5 MB/s, with the ETA showing "6 days." After realizing Wi-Fi was the bottleneck, I switched to wired LAN and pointed the transfer at the DGX's wired IP instead of its hostname (which was still resolving to the Wi-Fi interface), which got it up to 80 MB/s, roughly 160x faster. Burned half a day. More technical details in the collapsible section below.
Splitting photos and videos
The 25,382 transferred files broke down like this:
| Extension | Count | Type |
|---|---|---|
| HEIC | 13,107 | Photo (Apple's format) |
| JPG / JPEG | 10,721 | Photo |
| PNG | 660 | Photo (mostly screenshots) |
| WEBP | 9 | Photo |
| MOV | 799 | Video |
| MP4 | 85 | Video |
| ini | 1 | System file (ignored) |
I had Claude write a small script that splits photos and videos into separate folders by extension (one command, takes a few minutes — details in the collapsible section).
Result:
- Photos: 24,497
- Videos: 884
- Photos are the focus from here.
What is CLIP?
CLIP = an image-understanding AI from OpenAI, apparently. You hand it a photo and ask "is this a cat? a landscape? a screenshot?" with multiple labels, and it returns a similarity score for each. Lightweight and fast is its specialty, supposedly.
Stage 1: Classifying all 25K photos with CLIP
I set up 8 categories:
- Trash candidates: screenshot / document / blank
- Keep: food / landscape / other
- Review: people / cat
For each category, I prepared multiple English captions (e.g., "a screenshot of an app", "a photo of a cat") and used the maximum similarity. Also: anything below 0.5 confidence goes into the uncertain bucket for manual review.
Batch size 64, ~20 minutes of GPU time, all done. Results in the next section!
The "How accurate is it?" question
CLIP did the classification, but how accurate is it really?
Normally you'd verify by manual inspection, but eyeballing 25,000 photos is not realistic.
So I decided to have a smarter AI grade CLIP's classifications.
What is a VLM?
A VLM (Vision-Language Model) is an AI that can hold a conversation about images, apparently.
How it differs from CLIP:
| | CLIP | VLM |
|---|---|---|
| What it does | Category classification (returns probabilities) | Can describe image content in natural language |
| Smartness | Lightweight, fast, coarse | Heavy, slow, smart |
| Size | ~400 MB | ~16 GB |
I picked Qwen2-VL 7B Instruct (Alibaba). Apache 2.0 licensed for commercial use, no Hugging Face authentication required for download — those were the selection criteria.
The plan: ask the VLM "is this a screenshot? answer yes or no" for each image and record the answer.
Stage 2: Grading all 25K photos with VLM
Started at 16 seconds per image (roughly 5 days for the full set). The culprit was image size: resizing images down to fit within 448px dropped it to 0.3 sec/image, about 54x faster. Even with one-image-at-a-time inference, the full set takes ~2-3 hours.
Started before bed, woke up to 24,496 graded results.
📊 Results
CLIP's classification results
After CLIP processed 24,496 photos, the distribution looked like this:
private-data/iphone-photos-classified/
├── _trash-candidate/ Trash candidates
│ ├── screenshots/ (981)
│ ├── documents/ (1,804)
│ └── blank/ (59)
├── _review/ Review
│ ├── people/ (7,195)
│ ├── cat/ (1,009)
│ └── uncertain/ (7,700)
└── _keep/ Keep
├── food/ (1,682)
├── landscape/ (1,991)
└── other/ (2,075)
| Category | Count | Share |
|---|---|---|
| people | 7,195 | 29.4% |
| uncertain (low confidence) | 7,700 | 31.4% |
| other | 2,075 | 8.5% |
| landscape | 1,991 | 8.1% |
| document | 1,804 | 7.4% |
| food | 1,682 | 6.9% |
| cat | 1,009 | 4.1% |
| screenshot | 981 | 4.0% |
| blank | 59 | 0.2% |
That's a lot of cat photos...
Let's see how CLIP actually judged some of these.
🎯 Big wins
| ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|
| My cat | A meal | App screenshot | Mountain (landscape) |
| cat 0.97 | food 0.999 | screenshot 0.74 | landscape 0.98 |
CLIP nailed the cat without hesitation, food at 0.999, screenshots and landscapes too. Reliable.
✨ Subtly impressive recognition
| ![]() | ![]() |
|---|---|
| Cat keychain | Close-up of coffee beans |
| cat 0.64 | food 0.53 |
Even the keychain got recognized as "cat." And coffee beans up close as "food." Quietly impressive.
🤔 Funny misclassifications (CLIP's quirks)
Browsing thumbnails by category, some interesting patterns emerged.
Food edition: "Trash sorting chart" beats "homemade cake" for being food-like
| ![]() | ![]() |
|---|---|
| My homemade cake | Trash sorting chart |
| food 0.57 | food 0.83 |
Both ended up in the "food" category. Apparently the trash sorting chart looks more food-like to CLIP than my homemade cake. Reacting to the text? The table layout? Mystery.
People edition: "A doodle" beats "Mona Lisa" for being people-like
| ![]() | ![]() |
|---|---|
| The Mona Lisa | A face I doodled myself |
| people 0.50 | people 0.52 |
Both in the "people" category. My crappy doodle edges out da Vinci's Mona Lisa for being more "people-like" (just barely).
CLIP's quirks — kind of charming.
VLM's grading results
I asked the VLM, one photo at a time, whether CLIP's category was correct. For example, photos in the cat folder got "is this a cat?", food folder got "is this food?" — yes/no answers.
Summary by final destination bucket:
| Final bucket | Count | VLM agreement |
|---|---|---|
| people | 7,195 | 99.2% 🎯 |
| food | 1,682 | 95.3% 🎯 |
| cat | 1,009 | 95.0% 🎯 |
| other | 2,075 | 93.6% 🎯 |
| landscape | 1,991 | 83.5% ⚠️ |
| screenshot | 981 | 75.2% ⚠️ |
| document | 1,804 | 67.4% ⚠️ |
| blank | 59 | 52.5% ⚠️ |
| OVERALL | 24,496 | 84.5% |
People detection at 99.2% is quietly amazing. Out of 7,195 photos, the VLM said "no" to only 59.
Documents and screenshots, on the other hand, came back "no" about half the time. CLIP-only confidence isn't enough for those. Out of 24,496 photos, 3,808 got a "no" from the VLM — that's the part CLIP alone wouldn't have caught.
💡 Today's discoveries
Multimodal AI runs at home
Both CLIP (400 MB, classifier) and Qwen2-VL (16 GB, conversational) ran fine on my home machine. Reassuring.
CLIP's confidence is a reliable signal
VLM agreement broken down by CLIP confidence:
| CLIP confidence | Count | VLM agreement |
|---|---|---|
| 0.9+ (super confident) | 3,555 | 96.5% |
| 0.7–0.9 | 6,285 | 93.5% |
| 0.5–0.7 | 6,956 | 86.0% |
| <0.5 (uncertain) | 7,700 | 70.1% |
Boring but important: when an AI says it's confident, you can trust it.
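If you want to reproduce this breakdown, it's a small script over the saved results. Here's a rough sketch, assuming a merged JSON keyed by filename with clip_confidence and vlm_agrees fields (the file name and field names are illustrative, not my exact schema):

```python
import json
from collections import defaultdict

# Assumed shape: {"IMG_0001.HEIC": {"clip_confidence": 0.97, "vlm_agrees": true}, ...}
with open("vlm_verification.json") as f:
    results = json.load(f)

BINS = [(0.9, "0.9+"), (0.7, "0.7-0.9"), (0.5, "0.5-0.7"), (0.0, "<0.5")]
counts = defaultdict(lambda: [0, 0])  # bin label -> [total, agreed]

for r in results.values():
    # First bin whose lower bound the confidence clears
    label = next(name for lo, name in BINS if r["clip_confidence"] >= lo)
    counts[label][0] += 1
    counts[label][1] += int(r["vlm_agrees"])

for _, name in BINS:
    total, agreed = counts[name]
    if total:
        print(f"{name:8s} {total:6d}  {100 * agreed / total:.1f}%")
```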
CLIP's weak spots
Things that appear concretely in photos (people, food, cats, objects) score 93% or better. Abstract or compound subjects (documents, screenshots, landscapes) drop to the 67-84% range.
Documents at 67.4% in particular. That's where VLM re-grading earns its keep.
Role split: lightweight model × smart model
Use CLIP to triage everything quickly, VLM to grade the suspicious cases — a two-layer setup. Best of both worlds in speed and accuracy.
Day 3 had the same pattern: "aggregation = tools, interpretation = AI." Today's variant: "rough sorting = CLIP, accuracy check = VLM." Picking the right AI for the right task pays off in both performance and cost.
"Input quality matters more than model size" struck again
In Day 3 (credit card analysis), I learned "input quality > model size." The same pattern showed up today:
- VLM with original-resolution images: 16 sec/image (5 days for full run)
- VLM with resized 448px images: 0.3 sec/image (2 hours)
Just by tidying up the input, 54x speedup — small change, huge impact.
Not "biggest model possible" or "raw original" — clean up the input before sending it to the AI. This worked in Day 3 and Day 4 in a row.
Heart broken, switched to Amazon Photos
I tried to verify the trash candidate folder, then realized I'd need to cross-reference VLM scores too, then realized I never set clear criteria for "what to delete" in the first place. Couldn't finalize the cleanup, and morale broke.
Right then I learned that Amazon Prime members get unlimited photo storage, so I just dumped everything into Amazon Photos. Lol.
That said, I really should have defined the deletion criteria before starting.
The classified data on the DGX is a useful resource for future Day experiments.
🛠️ How I actually did this
:::details Wi-Fi 0.5 MB/s → wired LAN 80 MB/s journey
myPC1 → DGX over 96 GB started at 236 KB/s via WinSCP (ETA: 6 days). The cause was myPC1 being on Wi-Fi.
I plugged the PC into the router with a LAN cable → ping dropped close to 0 ms. But WinSCP was still stuck at 500 KB/s.
PowerShell ping spark-XXXX.local revealed the address resolved to DGX's Wi-Fi-side IP. The DGX was dual-homed (wired + Wi-Fi), and mDNS was returning the old route.
# Failure (routes through Wi-Fi)
scp -r "C:\Users\[user]\Pictures\iCloud Photos\Photos" [user]@spark-XXXX.local:...
# Success (direct IP over wired LAN)
scp -r "C:\Users\[user]\Pictures\iCloud Photos\Photos" [user]@10.0.0.205:...
Switched from hostname to explicit IP and watched it scream:
IMG_0190.HEIC 100% 1812KB 84.3MB/s 00:00
IMG_0190.MOV 100% 17MB 102.4MB/s 00:00
IMG_0192.HEIC 100% 2256KB 81.6MB/s 00:00
Also discovered WinSCP (SFTP-based) struggles with many small files, while scp (stream transfer) is much faster. With 25,382 files, scp won by a landslide.
:::
:::details Splitting photos and videos by extension
from pathlib import Path
import shutil

# Example paths -- adjust to where the transferred files actually live
input_dir = Path("private-data/iphone-photos-raw")
photos_out = Path("private-data/photos")
videos_out = Path("private-data/videos")
for d in (photos_out, videos_out):
    d.mkdir(parents=True, exist_ok=True)

PHOTO_EXTS = {".jpg", ".jpeg", ".heic", ".heif", ".png", ".webp"}
VIDEO_EXTS = {".mov", ".mp4", ".m4v"}
for src in input_dir.rglob("*"):
    if not src.is_file():
        continue
    ext = src.suffix.lower()
    if ext in PHOTO_EXTS:
        shutil.move(str(src), str(photos_out / src.name))
    elif ext in VIDEO_EXTS:
        shutil.move(str(src), str(videos_out / src.name))
Simple. Caught one snag: right after transfer, the directory permission was dr-x------ (read-only), so the first shutil.move died with PermissionError. chmod u+w fixed it.
:::
:::details CLIP classification script
Used transformers to load openai/clip-vit-base-patch32. For each category, multiple captions are prepared, and the max softmax score is used:
CATEGORIES = {
    "screenshot": [
        "a screenshot of an app",
        "a phone screenshot",
        "a screenshot of a website or chat",
    ],
    "document": [
        "a photo of a document or paper",
        "a photo of a receipt",
        "a QR code or barcode",
        "a photo of an ID card or driver's license",
    ],
    "people": [
        "a photo of a person",
        "a photo of people",
        "a portrait of someone",
    ],
    "cat": ["a photo of a cat"],
    "food": ["a photo of food or a meal"],
    "landscape": [
        "a photo of a landscape or scenery",
        "a photo of a building or city",
    ],
    "other": ["a photo of an object or item"],
}

inputs = processor(text=text_prompts, images=images,
                   return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
Anything below 0.5 confidence goes into _review/uncertain/. Near-black/near-white images get caught by a brightness check and routed to _trash-candidate/blank/ before they reach CLIP.
All per-image category scores are also saved to JSON. That JSON is what the VLM evaluation step consumes later.
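Putting those pieces together, here's a minimal single-image sketch of the caption-max scoring and the 0.5 threshold routing. It assumes the CATEGORIES dict above; my real script batches 64 images at a time and writes the scores to JSON, so treat the function and variable names here as illustrative:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Flatten all captions and remember which category each caption belongs to
text_prompts = [cap for caps in CATEGORIES.values() for cap in caps]
prompt_cat = [cat for cat, caps in CATEGORIES.items() for _ in caps]

def classify(path):
    image = Image.open(path).convert("RGB")
    inputs = processor(text=text_prompts, images=image,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=1)[0]
    # Per-category score = max softmax score over that category's captions
    scores = {}
    for cat, p in zip(prompt_cat, probs.tolist()):
        scores[cat] = max(scores.get(cat, 0.0), p)
    best_cat, best_score = max(scores.items(), key=lambda kv: kv[1])
    # Anything under 0.5 goes to _review/uncertain/ instead of its top category
    if best_score < 0.5:
        return "uncertain", best_score, scores
    return best_cat, best_score, scores
```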
:::
:::details The 54x speedup from image resizing for VLM
Qwen2-VL's vision token count scales with input resolution. Original-size images (several thousand pixels) consume hundreds to thousands of tokens, slowing inference dramatically.
from PIL import Image, ImageOps
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=224 * 224,
    max_pixels=448 * 448,  # ← cap here
)

# Belt and suspenders -- also pre-resize each image before it hits the processor
img = Image.open(path).convert("RGB")  # path = the photo being checked
img = ImageOps.exif_transpose(img)
img.thumbnail((448, 448))
That took 16 sec/image → 0.3 sec/image.
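Why such a big jump? My rough mental model (a back-of-the-envelope estimate, not an exact formula; the processor also rounds dimensions to multiples of 28) is that Qwen2-VL turns roughly each 28x28 pixel block into one vision token (14px ViT patches merged 2x2), so token count grows with image area:

```python
def approx_vision_tokens(width: int, height: int) -> int:
    # Rough estimate: ~one vision token per 28x28 pixel block
    # (14px ViT patches merged 2x2 in Qwen2-VL).
    return (width // 28) * (height // 28)

print(approx_vision_tokens(1920, 1080))  # ~2,600 tokens for a big screenshot
print(approx_vision_tokens(448, 336))    # ~190 tokens after the 448px cap
```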
The verification prompt is dead simple:
CATEGORY_PROMPTS = {
    "screenshot": "Is this image a screenshot of a phone screen, an app, or a website? Answer with one word: yes or no.",
    "document": "Is this image primarily a document, receipt, ID card, or QR code? Answer with one word: yes or no.",
    "people": "Does this image clearly show one or more human persons? Answer with one word: yes or no.",
    # ...
}
max_new_tokens=5 means only yes/no comes back. Minimal design.
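For reference, here's a simplified sketch of one yes/no query end to end, combining the resize settings above with these prompts. The actual script wraps this in the loop and checkpointing shown in the next section, so the names here are illustrative:

```python
import torch
from PIL import Image, ImageOps
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=224 * 224, max_pixels=448 * 448
)

def vlm_agrees(path, category):
    # Pre-resize, then ask the category's yes/no question
    img = Image.open(path).convert("RGB")
    img = ImageOps.exif_transpose(img)
    img.thumbnail((448, 448))
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": CATEGORY_PROMPTS[category]},
    ]}]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[prompt], images=[img],
                       padding=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    new_tokens = out[:, inputs["input_ids"].shape[1]:]
    answer = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
    return answer.strip().lower().startswith("yes")
```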
:::
:::details Resumable checkpointing
Running 24,000 images for 3 hours straight, you really want recovery if something hiccups:
CHECKPOINT_INTERVAL = 100

for i, (name, r) in enumerate(todo):
    # ... inference ...
    if (i + 1) % CHECKPOINT_INTERVAL == 0:
        save_checkpoint(results, output_path)
And a --resume flag that picks up where the JSON left off:
if args.resume and args.output.is_file():
    with args.output.open() as f:
        results = json.load(f)
    print(f"Resumed from {len(results)} existing entries")
todo = [(name, r) for name, r in clip_data.items() if name not in results]
Essential for any overnight job.
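save_checkpoint itself isn't shown above; a minimal version in the same spirit (my sketch, not the exact code) writes to a temp file first and then renames it, so an interrupt mid-write can't corrupt the results JSON:

```python
import json
from pathlib import Path

def save_checkpoint(results: dict, output_path: Path) -> None:
    # Write to a sibling temp file, then atomically swap it into place.
    tmp = output_path.with_suffix(".tmp")
    with tmp.open("w") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    tmp.replace(output_path)
```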
:::
Next up: Day 5
Tomorrow: have an AI analyze a year of my Amazon purchase history.
Switching to Amazon Photos for storage made me realize Amazon also has my entire purchase history. What if I asked AI "what kind of person am I, based on this?" — see what patterns emerge that I never noticed myself.
To be continued >>>









