[Day 6] I Had an AI Look at 25,000 iPhone Photos and It Decided My Mom and I Are the Same Person


Intro

Day 6!

On Day 4, I had a local AI sort through 25,000 photos on my iPhone (Day 4 article). Today is the follow-up — I wanted to go one level deeper and have AI look at my behavioral patterns over time.

Tools used: my home AI machine (DGX Spark) + a face recognition AI (FaceNet) + a summarization LLM (Qwen2.5 72B running on Ollama).


Today's setup

What I actually did

Take 5 years of photos (25,000) and have an AI summarize my day-to-day life from them. Two phases:

  1. Phase 1: aggregate by capture date + camera model + photo category (cat, food, scenery, etc.), then ask the LLM to read it
  2. Phase 2: add face recognition AI to answer "who is in each photo," then ask the LLM again

The key bit of today

The face recognition AI treated me and my mom as the same person. The interesting part: every other misclassification was "different people with the same expression," whereas in our case the AI said "same person" despite completely different expressions (my mom straight-faced, me grinning with teeth showing).

The AI gets fooled by expressions, but it also seems to pick up on something beyond expressions (bone structure? face shape?). That's today's headline.


🔧 How I went about it

25,000 photos (already categorized on Day 4)
   ↓
Phase 1: aggregate "capture date / camera model / category" only
   → ask the LLM to summarize year by year
   ↓
Phase 2: add face recognition AI to label "who is in each photo"
   → ask the LLM to summarize again

GPS data ("where") had been stripped during the iCloud export, so I substituted camera model as a proxy (iPhone = daily life, Olympus TG = travel, DJI handheld = video shoots, etc.).
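In code this was nothing more than a lookup from the EXIF Model string to a rough context label. A minimal sketch; the model strings and labels here are illustrative placeholders, not my exact mapping:

# Map the EXIF camera model to a rough "what was I doing" label.
# Keys are placeholder substrings for illustration only.
CAMERA_CONTEXT = {
    "iPhone": "daily life",
    "TG-": "travel",          # Olympus Tough series
    "OSMO": "video shoots",   # DJI handheld
}

def context_for(model):
    for key, label in CAMERA_CONTEXT.items():
        if key.lower() in (model or "").lower():
            return label
    return "other"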

(The tools and detailed steps are in the "Technical details" section at the end.)


📊 Results

Phase 1: date / camera / category only

I pulled "capture date + camera model + category (sorted on Day 4)" out of the 25,000 photos and turned it into four heatmaps showing year-over-year patterns. Then handed those to the LLM.

What's a heatmap? = A table where the rows × columns are filled with color intensity based on count. Dense color = a hotspot of activity, visible at a glance.
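For reference, each heatmap is just a count pivot over the photo metadata. A minimal sketch with pandas and matplotlib, assuming a DataFrame df with one row per photo and year, month, camera, and category columns (my real script had more styling):

import pandas as pd
import matplotlib.pyplot as plt

# df: one row per photo; columns include 'path', 'year', 'month', 'camera', 'category'
pivot = df.pivot_table(index="year", columns="month",
                       values="path", aggfunc="count", fill_value=0)

plt.imshow(pivot, cmap="YlOrRd", aspect="auto")   # darker cell = more photos
plt.colorbar(label="photos")
plt.yticks(range(len(pivot.index)), pivot.index)
plt.xticks(range(len(pivot.columns)), pivot.columns)
plt.title("Year × Month")
plt.show()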

Photo count per year

2019 was a clear outlier at 4,931 photos — 2-3x the other years.

Year × Category

Year × Category heatmap

Cat photos exploded starting 2021 → matches the year my cat joined the household.

Year × Month

Year × Month activity heatmap

August 2019 was the single highest month at 1,082 photos.

Year × Camera model

Year × Camera heatmap

Olympus TG dropped off sharply in 2020 (matches the COVID period). The DJI handheld shows up starting 2025.

When I handed this to the LLM and asked for a yearly summary, the output was along the lines of "this might have been a busy year" or "looks like an active year." Well, of course — the only info I gave the LLM was "when, what camera, what kind of subject." That's the ceiling for what it can say.
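For reference, the LLM step is plain text in, plain text out. A minimal sketch with the ollama Python client; the prompt wording and summary_text are simplified stand-ins for my actual setup:

import ollama

# summary_text: the per-year aggregates rendered as plain text, one line per year
summary_text = (
    "2019: 4,931 photos, peak month August (1,082)\n"
    "2021: cat photos jump sharply\n"
    # ...built from the aggregation above
)

response = ollama.chat(
    model="qwen2.5:72b",   # model tag assumed; any local model works the same way
    messages=[{
        "role": "user",
        "content": "Here are per-year photo statistics. "
                   "Summarize what each year looked like:\n\n" + summary_text,
    }],
)
print(response["message"]["content"])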

So the next question: what happens if you add who is in each photo? That's Phase 2.


Phase 2: adding "who's in the photo"

I ran face recognition AI over the 25,000 photos, detected 21,000 faces, and grouped similar-looking ones into 209 groups (C1, C2, …, C209). Plotting those over time:

What's a "similar-face group"? = a group the face recognition AI thinks contains "the same person" (technically called a "cluster"). The AI only manages them as numbered IDs, so a human still has to look at each group and label "this is person X."

Person cluster × Year heatmap

Person cluster × Year

This heatmap turned out to be interesting:

  • Long-spanning groups (C1, C2, C3) → likely family or myself
  • Short-spanning groups → likely acquaintances from a specific period

…which gives you a working guess. When I fed this back to the LLM, the summary turned much more concrete: "C3 is a new appearance," "C2 is decreasing in frequency," etc.


💡 Today's biggest finding

I went through the face clusters one by one and saw that the AI's groupings landed in a mix of "worked great," "fair enough," and "failed":

  • ✅ Worked great: grouped the same person across different angles and expressions (one group had all 4 photos of the same family member)
  • 🤔 Fair enough: burst shots end up grouped (multiple groups were just consecutive frames of the same moment)
  • ⚠️ Failure pattern A: grouped different people who happened to share a similar smile (happened in several groups)
  • 😳 Failure pattern B: grouped me and my mom despite our totally different expressions

The most striking one was failure pattern B.

Failure patterns A and B are misclassifications for opposite reasons

Failure pattern A: different people, same expression

Different people grouped together because of a similar smile.

Different people grouped due to similar smile (illustration)

Three different people — but when smiles are similar, the AI calls them "same person."

※The actual experiment used real photos. The illustrations here are AI-generated stand-ins for privacy.

Failure pattern B: parent and child, different expressions

My mom and I in the same group — despite the expression difference (I'm grinning with teeth showing, she's neutral).

Parent and child grouped despite different expressions (illustration)

Parent and child with clearly different expressions — but the AI still says "same person."

※The actual experiment used real photos. The illustrations here are AI-generated stand-ins for privacy.

Across all the groups I eyeballed, getting lumped together despite different expressions happened only in our case. Every other misclassification was "same expression, different people."

So my mom and I are a different kind of mistake. Either the AI is picking up on genetic facial similarity, or there's some other mechanism at work (I'll touch on this in the technical details). Hard to be definitive, but a fascinating case.


Summary: how the AI "sees" faces

| Situation | AI judgment | Likely reason |
| --- | --- | --- |
| Same person, different angle & expression | ◯ to △ | Bone structure matches well |
| Different people, same expression | ✕ (often grouped) | Pulled in by expression noise |
| Parent & child, different expressions | ✕ (sometimes grouped) | Bone structure similarity outweighs expression difference |

The AI gets fooled by expressions, but seems to actually pick up on something beyond expressions (bone structure? face shape?) — that was the most interesting observation of the day.


🛠️ Technical details

:::details Tools used

  • EXIF extraction: Python pillow_heif + PIL.Image.getexif() (HEIC-aware)
  • Face recognition: facenet-pytorch (InceptionResnetV1, vggface2-pretrained)
  • Clustering: scikit-learn DBSCAN
  • LLM summarization: Qwen2.5 72B via Ollama
  • Compute: DGX Spark (lots of GPU memory)

What's EXIF? = the camera metadata embedded in each photo file (capture time, camera model, GPS, etc.).

What's FaceNet? = an AI that converts a face photo into a 512-dimensional vector. Same person's faces are close vectors, different people are far apart.

What's DBSCAN? = a classic ML clustering method that automatically groups similar items. You don't need to specify the number of groups in advance.
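To make the FaceNet step concrete, here is a minimal sketch of turning one photo into a 512-dimensional embedding with facenet-pytorch (the filename is a placeholder; batching, HEIC registration, and error handling omitted):

import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

device = "cuda" if torch.cuda.is_available() else "cpu"
mtcnn = MTCNN(image_size=160, device=device)                        # detect + align one face
resnet = InceptionResnetV1(pretrained="vggface2").eval().to(device)

img = Image.open("photo.jpg").convert("RGB")
face = mtcnn(img)                      # (3, 160, 160) face tensor, or None if no face found
if face is not None:
    with torch.no_grad():
        emb = resnet(face.unsqueeze(0).to(device))   # (1, 512) embedding vector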

:::

:::details EXIF extraction script (parallelized, 6 seconds total)

I used pillow_heif to open HEIC files and PIL.Image.getexif() to read the EXIF, parallelized with concurrent.futures.ProcessPoolExecutor (12 processes).

from datetime import datetime

import pillow_heif
pillow_heif.register_heif_opener()  # lets PIL open HEIC files

from PIL import Image

def extract_one(path):
    with Image.open(path) as img:
        exif = img.getexif()
        inner = exif.get_ifd(34665)  # ExifIFD (nested tags)
        # DateTimeOriginal (36867) lives inside ExifIFD, not the top level
        dt = None
        if 36867 in inner:
            # EXIF datetime looks like "2019:08:15 13:45:02"
            dt = datetime.strptime(inner[36867], "%Y:%m:%d %H:%M:%S")
        return {
            "path": str(path),
            "datetime": dt,
            "make": exif.get(271),             # camera maker
            "model": exif.get(272),            # camera model
            "gps": dict(exif.get_ifd(34853)),  # GPS IFD (empty after the iCloud export)
        }

Photos with no EXIF date (screenshots, etc.) fall back to file mtime, but that's just "the day I copied the file," so I excluded those from year-level aggregation.
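And the parallel driver mentioned above, as a sketch assuming extract_one as defined earlier (the "photos" directory is a placeholder):

from pathlib import Path
from concurrent.futures import ProcessPoolExecutor

# gather the ~25,000 photo paths (hypothetical directory name)
photo_paths = [p for p in Path("photos").rglob("*")
               if p.suffix.lower() in {".heic", ".jpg", ".jpeg", ".png"}]

with ProcessPoolExecutor(max_workers=12) as pool:
    records = list(pool.map(extract_one, photo_paths, chunksize=200))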

:::

:::details Tuning DBSCAN's eps

Distance between embeddings is cosine distance (1 - dot product of the L2-normalized embeddings).

| eps | Clusters | Largest cluster size | Verdict |
| --- | --- | --- | --- |
| 0.4 (loose) | 3 | 21,310 | Everyone in one group |
| 0.3 | 73 | 17,234 | Still big lumps |
| 0.25 | 146 | 12,905 | Still too big |
| 0.2 | 209 | 4,582 | ◎ chosen |
| 0.18 | 216 | 3,131 | Too tight: single people split into multiple clusters |

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

# embeds: (n_faces, 512) array of FaceNet embeddings
embeds_n = normalize(embeds, norm="l2")
db = DBSCAN(eps=0.2, min_samples=5, metric="cosine", n_jobs=-1)
labels = db.fit_predict(embeds_n)   # cluster ID per face; -1 = noise

min_samples=5 means only people who show up 5+ times across photos get clustered.
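The cluster counts in the table come straight from the label array; DBSCAN marks unassigned faces as -1 (noise), so those get excluded when counting. A small sketch:

from collections import Counter

sizes = Counter(l for l in labels if l != -1)   # -1 = noise, faces not assigned to anyone
print(len(sizes))              # number of clusters (209 at eps=0.2)
print(sizes.most_common(1))    # largest cluster ID and its face count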

:::

:::details Why parent and child tend to land in the same cluster

facenet-pytorch's InceptionResnetV1 (vggface2-pretrained) produces 512-dim embeddings that are designed to capture geometric (bone structure) features. Lighting, angle, and expression noise also leak in.

Parent and child share genetic bone structure, so their embeddings can be closer than you'd get between random different people. This is a known phenomenon in face recognition research — several papers have demonstrated it.

DBSCAN is density-based: if "A→B is close" and "B→C is close," then A and C end up in the same cluster even if A and C aren't directly close. If there's one photo of me looking especially like my mom that sits in between, that single bridge photo can connect us into one cluster.
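To make the bridge-photo mechanism concrete, here is the distance DBSCAN is working with and the chaining condition spelled out (the three embedding names in the comment are hypothetical):

import numpy as np

EPS = 0.2  # the threshold chosen above

def cos_dist(a, b):
    # embeddings are L2-normalized, so cosine distance = 1 - dot product
    return 1.0 - float(np.dot(a, b))

# Density-based chaining: hypothetical embeddings me_smiling, me_neutral, mom
# all end up in one cluster as soon as
#   cos_dist(me_smiling, me_neutral) <= EPS  and  cos_dist(me_neutral, mom) <= EPS
# even if cos_dist(me_smiling, mom) > EPS. The neutral photo is the bridge.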

:::

:::details Generating representative face thumbnails for manual labeling

Clusters are just IDs (C0, C1, …), so a human has to look at them and label "this is person X."

I wrote a script that crops the largest face from each cluster's representative photos and lays them out as a diagnostic image:

from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(keep_all=True, min_face_size=40, device='cuda')

img = Image.open(path).convert('RGB')   # path: one representative photo per cluster
boxes, _ = mtcnn.detect(img)            # boxes: (N, 4) array of [x1, y1, x2, y2], or None
if boxes is not None:
    # keep the largest detected face by bounding-box area
    biggest = max(boxes, key=lambda b: (b[2]-b[0]) * (b[3]-b[1]))
    crop = img.crop(tuple(biggest))      # PIL rounds the float coordinates

This image contains real faces of family and friends, so I kept it strictly local in private-data/day06-timeline/ (gitignored). Opened it via VS Code Remote-SSH to label by eye.

:::


📝 Today's takeaways

  • Handing the LLM only "when / what camera / what category" yields a blurry overview
  • Adding "who is in the photo" jumps the resolution of the analysis up several notches
  • Face recognition AI is sensitive to expression noise but does pick up something beyond expressions (bone structure / face shape)
  • Because of that, parent-child being grouped "despite different expressions" became the one unique case in my dataset
  • Keeping sensitive face data off the cloud is a big advantage of running this locally
  • Processing 25,000 photos in one go is also realistic on a local setup — no API costs to worry about

Tomorrow's preview: Day 7

Day 7 plan: local AI vs cloud AI, 5-round showdown.

Going to take the tasks I usually do with local AI (photo classification, credit card analysis, code completion, etc.), run them on both sides, and build a head-to-head matrix.

To be continued >>>


#100ExperimentsWithDGX #LocalLLM #ImageAnalysis #FaceNet
