<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: PEPPERCORN</title>
    <description>The latest articles on DEV Community by PEPPERCORN (@peppercorn_llm).</description>
    <link>https://dev.to/peppercorn_llm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3910738%2F8084bbca-3641-4d19-85b2-f53a184e1f84.jpg</url>
      <title>DEV Community: PEPPERCORN</title>
      <link>https://dev.to/peppercorn_llm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/peppercorn_llm"/>
    <language>en</language>
    <item>
      <title>[Day 7] Does Giving an AI More 'Thinking Time' Really Make It Smarter? Training an OpenMythos-Style Mini Model on DGX</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Tue, 19 May 2026 03:17:51 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-7-does-giving-an-ai-more-thinking-time-really-make-it-smarter-training-an-openmythos-style-1epk</link>
      <guid>https://dev.to/peppercorn_llm/day-7-does-giving-an-ai-more-thinking-time-really-make-it-smarter-training-an-openmythos-style-1epk</guid>
      <description>&lt;h1&gt;
  
  
  [Day 7] Does Giving an AI More "Thinking Time" Really Make It Smarter? Training an OpenMythos-Style Mini Model on DGX
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Day 7!&lt;/p&gt;

&lt;p&gt;Reddit kept surfacing this new project called &lt;strong&gt;OpenMythos&lt;/strong&gt; in my feed with "12 days to replicate frontier AI, ASI is near" headlines, and I got curious enough to dig in.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tools used: my home AI machine (DGX Spark) + &lt;a href="https://github.com/kyegomez/OpenMythos" rel="noopener noreferrer"&gt;OpenMythos&lt;/a&gt; (PyTorch reconstruction of the rumored Claude Mythos architecture) + synthetic multi-digit addition.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The question: &lt;strong&gt;does giving an AI more "thinking time" (= more recurrent loops at inference) actually make it smarter?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Today's setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The hype
&lt;/h3&gt;

&lt;p&gt;On 2026-04-07, Anthropic announced &lt;strong&gt;Claude Mythos&lt;/strong&gt;. Press coverage highlights zero-day discovery capabilities — reportedly 271 zero-days in Firefox and a 27-year-old bug in OpenBSD — but the model's architecture and weights remain unreleased. Anthropic kept Mythos itself behind a limited-access coalition (&lt;strong&gt;Project Glasswing&lt;/strong&gt; — AWS, Apple, Microsoft, Google, CrowdStrike, Palo Alto, ~40 organizations) rather than releasing it publicly.&lt;/p&gt;

&lt;p&gt;Twelve days later, Kye Gomez (Swarms) released &lt;strong&gt;OpenMythos&lt;/strong&gt;, a PyTorch reconstruction of the &lt;em&gt;suspected&lt;/em&gt; architecture. The repo is explicit upfront:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"an independent, community-driven theoretical reconstruction based solely on publicly available research and speculation. It is not affiliated with, endorsed by, or connected to Anthropic"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So OpenMythos is &lt;strong&gt;not&lt;/strong&gt; Mythos. It's a hypothesis-in-code: a Recurrent-Depth Transformer (RDT) with mixture-of-experts (MoE) FFNs and MLA/GQA attention that can be trained from scratch on standard text data. No leaked weights, no distillation.&lt;/p&gt;

&lt;p&gt;Reddit's "ASI is near" framing skips this critical distinction. The interesting question, once you set the hype aside, is whether the &lt;strong&gt;architectural idea&lt;/strong&gt; — recurrent depth — actually works.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note for this article&lt;/strong&gt;: OpenMythos is not Claude Mythos — it's a theoretical reconstruction inspired by looped-transformer research. The experiments below are not "Claude Mythos capability tests" but rather "how does a looped / recurrent-depth structure behave on a small synthetic task."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Three perspectives on looped transformers
&lt;/h3&gt;

&lt;p&gt;Browsing the literature, I found three different studies giving different pictures of how looped transformers behave:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Claim&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Saunshi et al. 2025&lt;/strong&gt; (ICLR, research paper)&lt;/td&gt;
&lt;td&gt;tens of M params, synthetic&lt;/td&gt;
&lt;td&gt;Loops work: k layers looped L times approximately matches kL-layer fixed-depth, on addition / p-hop induction / math&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Geiping et al. 2025&lt;/strong&gt; (Huginn, research paper)&lt;/td&gt;
&lt;td&gt;3.5B params, 800B tokens&lt;/td&gt;
&lt;td&gt;Task-dependent: at scale on natural-language benchmarks, gains can be marginal (T=4 → T=32 only +1.82 points on GSM8K), though effects vary by task and compute regime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Micheal Bee 2026-04&lt;/strong&gt; (Medium, independent experiment blog)&lt;/td&gt;
&lt;td&gt;17M params, 12 GPU-hours on RTX 5070 Ti&lt;/td&gt;
&lt;td&gt;Loops plateau at T=2 in this small-scale setup: hidden state reaches a fixed-point that subsequent iterations cannot escape&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Theory, large-scale empirics, and an independent solo replication give different pictures. I wanted to add a fourth data point from my own DGX Spark on a clean, controlled task — multi-digit addition.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I'd hoped to see
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Does training-time accuracy phase-transition (grok) at some step? (Saunshi 3-stage prediction)&lt;/li&gt;
&lt;li&gt;Does test-time loop count matter? At what point does it stop helping?&lt;/li&gt;
&lt;li&gt;Does the hidden state actually keep evolving across loops, or does it hit a fixed-point early? (the Bee question)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Headline finding
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Loops help, but only within a narrow window centered on the training loop count.&lt;/strong&gt; With training-time &lt;code&gt;max_loop_iters=4&lt;/code&gt;, accuracy peaks at exactly T=4 (100% across all digit counts) and decays in &lt;em&gt;both&lt;/em&gt; directions — fewer loops underthink, more loops overthink.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bee's "T=2 fixed-point" reproduced.&lt;/strong&gt; Cosine similarity between consecutive hidden states jumps from ~0.72 to ~0.95 at T=2, then climbs slowly to ~0.99 by T=4 and stays flat through T=32.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Striking per-seed grokking variance.&lt;/strong&gt; Same hyperparameters, four seeds: seeds 1 and 3 solve 5-digit addition by step 4,000; seed 2 takes 10,000; seed 0 stalls at &amp;lt;10% until step 16,000, then jumps to 100%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No depth extrapolation in this setup.&lt;/strong&gt; Saunshi's claim that training at T=4 should generalize to deeper T at inference does &lt;em&gt;not&lt;/em&gt; reproduce here — instead, T&amp;gt;4 hurts.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🌀 What is a "looped" transformer?
&lt;/h2&gt;

&lt;p&gt;A standard transformer (GPT-4, Llama, most local LLMs) routes input tokens through a stack of distinct layers, each used exactly once per forward pass. To make it "think deeper," you stack more layers — increasing parameter count.&lt;/p&gt;

&lt;p&gt;A looped transformer reuses &lt;strong&gt;the same&lt;/strong&gt; parameters across multiple iterations. The model has a &lt;code&gt;Prelude → Recurrent Block × T → Coda&lt;/code&gt; structure: a few standard layers up front, then one block iterated T times with input injection at every step, then a few more standard layers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input tokens
   ↓
[Prelude P]          — standard layers, run once
   ↓
[Recurrent Block R]  — one block looped T times
   ↑_______↓          h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
   ↓
[Coda C]             — standard layers, run once
   ↓
Output logits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At each loop iteration &lt;code&gt;t&lt;/code&gt;, the hidden state updates via the linear time-invariant (LTI) injection rule in the diagram above, and the encoded input &lt;code&gt;e&lt;/code&gt; (Prelude output) is re-injected to keep the original signal alive across arbitrary depth. The injection matrix A is constrained so that its spectral radius ρ(A) &amp;lt; 1, which keeps the linear part of the update from diverging over many loops (Parcae stability framework).&lt;/p&gt;
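&lt;p&gt;A minimal pure-Python sketch of that update rule, under loud assumptions: A and B are taken to be diagonal (so ρ(A) is just the largest |a_i|), the state starts at zero, and &lt;code&gt;toy_block&lt;/code&gt; is a made-up bounded stand-in for the transformer block. None of these names come from OpenMythos; this only illustrates the shape of the recurrence.&lt;/p&gt;

```python
import math


def loop_hidden_state(e, a_diag, b_diag, block, n_loops):
    """Iterate h_{t+1} = A*h_t + B*e + block(h_t, e) with diagonal A and B."""
    # For a diagonal A the spectral radius is max|a_i|, so this enforces rho(A) below 1.
    assert 1.0 > max(abs(a) for a in a_diag)
    h = [0.0] * len(e)  # assumption: start from a zero hidden state
    states = [h]
    for _ in range(n_loops):
        update = block(h, e)
        h = [a * hi + b * ei + ui
             for a, hi, b, ei, ui in zip(a_diag, h, b_diag, e, update)]
        states.append(h)
    return states


def toy_block(h, e):
    # Hypothetical stand-in for the transformer block: a small bounded nonlinearity.
    return [0.1 * math.tanh(hi + ei) for hi, ei in zip(h, e)]


states = loop_hidden_state(
    e=[1.0, -0.5, 0.25],
    a_diag=[0.5, 0.5, 0.5],
    b_diag=[1.0, 1.0, 1.0],
    block=toy_block,
    n_loops=32,
)
# Largest per-coordinate change between consecutive loop states.
step_sizes = [max(abs(x - y) for x, y in zip(s1, s0))
              for s0, s1 in zip(states, states[1:])]
print(step_sizes[0], step_sizes[-1])
```

&lt;p&gt;With these toy numbers the update is a contraction, so the per-loop change shrinks toward zero. A real transformer block is not guaranteed to behave this way; ρ(A) &amp;lt; 1 only bounds the linear term.&lt;/p&gt;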

&lt;p&gt;The key claim: &lt;strong&gt;more loops at inference = deeper reasoning, without adding parameters&lt;/strong&gt;. This is conceptually analogous to chain-of-thought scaling — except the "thinking" happens in continuous latent space rather than discrete token space.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔧 Experimental setup
&lt;/h2&gt;

&lt;p&gt;I trained a deliberately tiny OpenMythos variant on multi-digit addition. The model is small enough to run 4 seeds in parallel on a single GPU but large enough to exhibit the looped-transformer phenomena.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OpenMythos tiny (3.4M params)
  ↓
Train 4 seeds in parallel, 30k steps each, fp32 on DGX Spark (GB10)
  ↓
Experiment A: greedy autoregressive accuracy
              loops ∈ {1, 2, 4, 8, 16, 32}  ×  digits ∈ {2, 3, 4, 5}
  ↓
Experiment B: cosine similarity between consecutive hidden states
              ⇒ does the recurrent block reach a fixed-point?
  ↓
Compare against Saunshi / Huginn / Bee
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Model config
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;MythosConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# digits 0-9 + '+', '=', pad, eos
&lt;/span&gt;    &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_heads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_kv_heads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# GQA
&lt;/span&gt;    &lt;span class="n"&gt;max_seq_len&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_loop_iters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# training depth; inference varies
&lt;/span&gt;    &lt;span class="n"&gt;prelude_layers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;coda_layers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;attn_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gqa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_experts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# MoE FFN inside recurrent block
&lt;/span&gt;    &lt;span class="n"&gt;n_shared_experts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_experts_per_tok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expert_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lora_rank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# depth-wise LoRA per loop step
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total parameters: &lt;strong&gt;3,386,658&lt;/strong&gt; (~3.4M).&lt;/p&gt;

&lt;h3&gt;
  
  
  Data
&lt;/h3&gt;

&lt;p&gt;On-the-fly synthetic addition. Operands are uniformly sampled from &lt;code&gt;[10^(d-1), 10^d - 1]&lt;/code&gt; for digit count &lt;code&gt;d ∈ {2, 3, 4, 5}&lt;/code&gt;. Sequence format &lt;code&gt;"A+B=R$"&lt;/code&gt;, where &lt;code&gt;R = str(A+B)[::-1]&lt;/code&gt; (reverse-order answer, following Saunshi's convention so left-to-right autoregressive generation can carry digits naturally).&lt;/p&gt;

&lt;p&gt;Loss is applied only at positions following the &lt;code&gt;=&lt;/code&gt; token (i.e., on the answer tokens).&lt;/p&gt;
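&lt;p&gt;The data format and loss mask described above can be sketched roughly as follows. The helper names &lt;code&gt;make_example&lt;/code&gt; and &lt;code&gt;answer_mask&lt;/code&gt; are hypothetical, and treating the trailing &lt;code&gt;$&lt;/code&gt; as part of the supervised answer is my reading of the format rather than something taken from the training script.&lt;/p&gt;

```python
import random


def make_example(d, rng):
    """Build one 'A+B=R$' string with d-digit operands and a reversed sum."""
    lo, hi = 10 ** (d - 1), 10 ** d - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)  # uniform, inclusive bounds
    return f"{a}+{b}={str(a + b)[::-1]}$"


def answer_mask(seq):
    """1 for positions after '=' (answer digits and '$'), 0 elsewhere."""
    eq = seq.index("=")
    return [1 if i > eq else 0 for i in range(len(seq))]


rng = random.Random(0)
for d in (2, 3, 4, 5):
    seq = make_example(d, rng)
    print(d, seq, answer_mask(seq))
```

&lt;p&gt;Reversing the sum puts the least-significant digit first, so each generated digit depends only on operand digits and a carry that has already been resolved.&lt;/p&gt;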

&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Optimizer: AdamW, betas (0.9, 0.95), wd 0.1&lt;/li&gt;
&lt;li&gt;LR: max 3e-4, warmup 2000 steps, cosine decay to 1e-5&lt;/li&gt;
&lt;li&gt;Grad clip: 1.0&lt;/li&gt;
&lt;li&gt;Batch size: 128&lt;/li&gt;
&lt;li&gt;Max steps: 30000&lt;/li&gt;
&lt;li&gt;dtype: &lt;strong&gt;fp32&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
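&lt;p&gt;As a rough sketch of the learning-rate schedule listed above: the list only says "warmup 2000 steps", so the linear warmup shape is an assumption, and &lt;code&gt;lr_at&lt;/code&gt; is a hypothetical helper name rather than anything from the training script.&lt;/p&gt;

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=1e-5, warmup=2000, max_steps=30000):
    """Linear warmup to max_lr, then cosine decay to min_lr at max_steps."""
    if warmup > step:
        # Assumption: warmup ramps linearly and reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / (max_steps - warmup))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1000, 2000, 16000, 30000):
    print(step, lr_at(step))
```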

&lt;p&gt;Initially I tried bf16 to use the GB10 efficiently, but OpenMythos stores RoPE frequencies as &lt;code&gt;complex64&lt;/code&gt; buffers, and &lt;code&gt;model.to(bfloat16)&lt;/code&gt; silently drops the imaginary part, breaking attention. For a 3.4M-param model on 128 GB of unified memory, fp32 is fine — the bottleneck is not memory but parallel scheduling.&lt;/p&gt;

&lt;p&gt;Four seeds {0, 1, 2, 3} run in parallel on the same GPU. Per-seed throughput drops to ~12K tok/s (vs ~50K solo), so aggregate throughput (~48K tok/s) is about the same as a single solo run. In other words, running the four seeds in parallel takes roughly as long as running them back to back.&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Experiment A: accuracy heatmap
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62vulnao2x15psie2uv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62vulnao2x15psie2uv8.png" alt="accuracy heatmap of OpenMythos addition across loop counts and digit counts" width="800" height="682"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mean fully-correct rate across 4 seeds, 500 eval samples per condition:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Inference loops&lt;/th&gt;
&lt;th&gt;d=2&lt;/th&gt;
&lt;th&gt;d=3&lt;/th&gt;
&lt;th&gt;d=4&lt;/th&gt;
&lt;th&gt;d=5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.38 ± 0.12&lt;/td&gt;
&lt;td&gt;0.19 ± 0.09&lt;/td&gt;
&lt;td&gt;0.09 ± 0.07&lt;/td&gt;
&lt;td&gt;0.02 ± 0.02&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.53 ± 0.17&lt;/td&gt;
&lt;td&gt;0.50 ± 0.12&lt;/td&gt;
&lt;td&gt;0.16 ± 0.08&lt;/td&gt;
&lt;td&gt;0.21 ± 0.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4 (train)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.00&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0.98 ± 0.01&lt;/td&gt;
&lt;td&gt;0.98 ± 0.01&lt;/td&gt;
&lt;td&gt;0.94 ± 0.03&lt;/td&gt;
&lt;td&gt;0.86 ± 0.08&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;0.91 ± 0.04&lt;/td&gt;
&lt;td&gt;0.91 ± 0.05&lt;/td&gt;
&lt;td&gt;0.75 ± 0.10&lt;/td&gt;
&lt;td&gt;0.56 ± 0.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;0.62 ± 0.12&lt;/td&gt;
&lt;td&gt;0.65 ± 0.13&lt;/td&gt;
&lt;td&gt;0.45 ± 0.13&lt;/td&gt;
&lt;td&gt;0.26 ± 0.17&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Peak is exactly at training-time loop count (T=4), 100% across all digit counts.&lt;/li&gt;
&lt;li&gt;Doubling the loop count past the training depth (T=8) stays near peak but already shows degradation at d=5 (86%).&lt;/li&gt;
&lt;li&gt;Beyond T=8, accuracy collapses monotonically. At T=32, even 2-digit addition drops to 62%.&lt;/li&gt;
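&lt;li&gt;Under-looping (T=1, T=2) generally hurts more at higher digit counts (T=1 falls from 38% at d=2 to 2% at d=5), consistent with depth being needed to chain carries. The T=2 row is not strictly monotonic (16% at d=4 vs. 21% at d=5), but that gap is within the seed-to-seed spread.&lt;/li&gt;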
&lt;/ul&gt;
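&lt;p&gt;For reference, a "fully-correct rate" can be scored along these lines. This is a hedged sketch, not the evaluation script: the strings below are hand-written toy values rather than model outputs, and using the sample standard deviation across seeds for the ± column is an assumption.&lt;/p&gt;

```python
from statistics import mean, stdev


def fully_correct_rate(predictions, targets):
    """Fraction of samples whose whole answer string matches the target."""
    hits = sum(1 for p, t in zip(predictions, targets) if p == t)
    return hits / len(targets)


# Toy data only: reversed sums followed by '$', plus two made-up "seeds".
targets = ["64$", "579$", "0011$"]
per_seed_predictions = {
    0: ["64$", "579$", "0011$"],
    1: ["64$", "578$", "0011$"],  # one wrong digit makes the whole sample wrong
}
rates = [fully_correct_rate(p, targets) for p in per_seed_predictions.values()]
print(f"{mean(rates):.2f} ± {stdev(rates):.2f}")
```

&lt;p&gt;Exact-match scoring is strict: a single wrong digit counts the whole answer as wrong, which is part of why longer sums score lower away from T=4.&lt;/p&gt;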

&lt;h3&gt;
  
  
  Experiment B: fixed-point analysis
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lcx3dpp8c8s4e4dab63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lcx3dpp8c8s4e4dab63.png" alt="fixed-point cosine similarity curve across loop steps" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mean cosine similarity between consecutive hidden states &lt;code&gt;cos(h_t, h_{t-1})&lt;/code&gt; over answer positions, averaged across 4 seeds, 200 samples per digit:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;t&lt;/th&gt;
&lt;th&gt;d=2&lt;/th&gt;
&lt;th&gt;d=3&lt;/th&gt;
&lt;th&gt;d=4&lt;/th&gt;
&lt;th&gt;d=5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.711&lt;/td&gt;
&lt;td&gt;0.726&lt;/td&gt;
&lt;td&gt;0.745&lt;/td&gt;
&lt;td&gt;0.744&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.961&lt;/td&gt;
&lt;td&gt;0.967&lt;/td&gt;
&lt;td&gt;0.957&lt;/td&gt;
&lt;td&gt;0.946&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0.985&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;td&gt;0.977&lt;/td&gt;
&lt;td&gt;0.971&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0.993&lt;/td&gt;
&lt;td&gt;0.992&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;td&gt;0.983&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0.999&lt;/td&gt;
&lt;td&gt;0.999&lt;/td&gt;
&lt;td&gt;0.998&lt;/td&gt;
&lt;td&gt;0.996&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;0.9995&lt;/td&gt;
&lt;td&gt;0.9996&lt;/td&gt;
&lt;td&gt;0.9992&lt;/td&gt;
&lt;td&gt;0.998&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;0.9995&lt;/td&gt;
&lt;td&gt;0.9996&lt;/td&gt;
&lt;td&gt;0.999&lt;/td&gt;
&lt;td&gt;0.998&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bee's T=2 fixed-point claim is reproduced in spirit but not literally: cosine similarity jumps to ~0.95 at T=2 (vs. Bee's near-1.0), then asymptotes to ~0.99 by T=4 and stays flat through T=32.&lt;/p&gt;

&lt;p&gt;The contrast with the accuracy results is telling: &lt;strong&gt;the hidden state is effectively static (by cosine similarity) from T=4 onwards, yet accuracy collapses at T=16-32&lt;/strong&gt;. Two non-exclusive interpretations: (a) overthinking, where late loops drift away from a converged solution; (b) distribution shift, where training used T=4, so T&amp;gt;&amp;gt;4 is simply an out-of-distribution use of the model. Note that cosine similarity ≈ 1 doesn't prove the hidden state is doing nothing: small logit-relevant deltas may still accumulate.&lt;/p&gt;

&lt;p&gt;Digit-count dependence on fixed-point timing is small (d=5 lags d=2 by ~0.01 in cosine sim). "Harder problems take more loops to converge" is &lt;em&gt;not&lt;/em&gt; observed here: all digit counts converge at about the same rate, but away from T=4 the converged state is less accurate at higher digit counts.&lt;/p&gt;
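&lt;p&gt;To make the measurement and its caveat concrete, here is a small sketch of the consecutive-state cosine, plus a hand-built case where two states have cosine similarity above 0.999 yet a linear readout picks different tokens. The vectors and the &lt;code&gt;readout&lt;/code&gt; weights are toy numbers chosen by hand, not values from the trained model, and the real analysis averages over answer positions and samples.&lt;/p&gt;

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consecutive_cosines(states):
    """cos(h_t, h_{t-1}) for t = 1 .. len(states) - 1."""
    return [cosine(cur, prev) for prev, cur in zip(states, states[1:])]


def argmax_logit(h, readout):
    # Hypothetical linear readout: one weight row per output token.
    logits = [sum(w * x for w, x in zip(row, h)) for row in readout]
    return logits.index(max(logits))


# Toy numbers: the second state differs only by a small nudge in one coordinate.
states = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.05]]
readout = [[1.0, 0.0, 0.0], [0.99, 0.0, 1.0]]
print(consecutive_cosines(states))
print([argmax_logit(h, readout) for h in states])
```

&lt;p&gt;The point is only that cosine similarity measures direction in the full hidden space, while the output depends on a few readout directions, so a nearly static state can still change the prediction.&lt;/p&gt;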

&lt;h3&gt;
  
  
  Bonus: training dynamics
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z2j3kgsk8jpcdm3iq3i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z2j3kgsk8jpcdm3iq3i.png" alt="training loss and teacher-forced accuracy curves per seed" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most striking thing in the training curves is &lt;strong&gt;seed-dependent grokking timing&lt;/strong&gt;. Four runs of identical hyperparameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;seed 1: loss → 0 by step 3,000, all digits ≥88% by step 4,000&lt;/li&gt;
&lt;li&gt;seed 3: loss → 0 by step 4,000, all digits ≥87% by step 4,000&lt;/li&gt;
&lt;li&gt;seed 2: stuck at loss ~0.35 plateau until step 8,000, then collapses to 0 by step 10,000; d=4/5 jump from &amp;lt;10% to 99% in 2,000 steps&lt;/li&gt;
&lt;li&gt;seed 0: stuck at loss ~0.30 plateau until step 15,000, then collapses; d=4 groks at step 12,000-14,000, d=5 groks at step 16,000&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is textbook Saunshi-style three-stage grokking (memorization → in-distribution → systematic), with the third-stage trigger varying by a factor of &lt;strong&gt;4x in step count&lt;/strong&gt; (step 4,000 for seeds 1 and 3 vs. step 16,000 for seed 0) purely because of the random seed. The largest seed gap (seed 0 vs. seed 1) is ~12,000 steps, roughly 1 hour of wall-clock on this DGX.&lt;/p&gt;

&lt;p&gt;If you trained a single seed and stopped early, you might conclude "OpenMythos can't generalize beyond d=3" — which would be wrong. The architecture &lt;em&gt;can&lt;/em&gt; solve all 4 digit buckets; some random seeds just need much longer to find the systematic-generalization solution.&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 What this means for the three perspectives
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Where my data point lands
&lt;/h3&gt;

&lt;p&gt;My single-DGX small-scale result lands somewhere between Bee and a partial refutation of Saunshi:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bee's fixed-point at small T is reproduced.&lt;/strong&gt; Hidden state effectively stops evolving by T=4 (cosine sim ≥ 0.98 for every digit count) and certainly by T=8 (≥ 0.996).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Saunshi's depth-extrapolation does NOT reproduce.&lt;/strong&gt; Inference at T &amp;gt; train_T does &lt;em&gt;not&lt;/em&gt; improve accuracy — it harms it. T=8 is already at 86% on d=5 (vs. 100% at T=4), and T=32 collapses to 26%. The "train at depth k, infer at depth k·L" recipe assumes the recurrent block has learned to keep refining; in my setup it has not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Huginn's limited-gain finding holds, in a more extreme form, at small scale.&lt;/strong&gt; Past the training depth, extra inference loops give negative returns rather than diminishing positive ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New observation: seed-dependent grokking, with gaps of up to 12K steps between seeds.&lt;/strong&gt; This is an under-emphasized variable in the public looped-transformer discourse: single-seed studies (Bee's solo replication, individual rows in Saunshi's tables) may substantially under- or over-estimate typical behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reconciliation attempt
&lt;/h3&gt;

&lt;p&gt;Theory (Saunshi), large-scale empirics (Huginn), and independent replication (Bee) may not actually be in contradiction — they may be measuring different facets of the same phenomenon at different scales:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Saunshi&lt;/strong&gt;: shows loops &lt;em&gt;can&lt;/em&gt; work on the right kind of problem (algorithmic, depth-bounded reasoning) at the right kind of scale (small synthetic).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Huginn&lt;/strong&gt;: shows that loops trained at 3.5B / 800B token scale on natural-language data give only marginal gains on a benchmark (GSM8K) that already favors CoT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bee&lt;/strong&gt;: shows that within a particular small-scale training recipe, the recurrent block's hidden state stops evolving very early in inference.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These three findings are compatible with a unified picture: &lt;strong&gt;loops carry compute, but only up to a depth bounded by the task's algorithmic complexity and the model's expressive capacity&lt;/strong&gt;. Beyond that depth, the hidden state stops moving meaningfully, and additional loops are computation without information.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I'd watch next
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Increase loop count during training (here I used 4) and see if the inference-time scaling extends further&lt;/li&gt;
&lt;li&gt;Try adaptive computation time (ACT) halting more aggressively to see how the model self-regulates loop depth per token&lt;/li&gt;
&lt;li&gt;Add task heterogeneity (mix p-hop induction or parity) to test whether the fixed-point timing varies by problem class&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🛠️ Technical details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Reproducing this experiment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kyegomez/OpenMythos
&lt;span class="nb"&gt;cd &lt;/span&gt;OpenMythos
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

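&lt;span class="c"&gt;# The scripts below live in the dgx-100-experiments repo (days/day07-openmythos-loop-debate), not in OpenMythos; run them from that folder.&lt;/span&gt;
&lt;span class="c"&gt;# Repeat the training command for seeds 1, 2, and 3 before running the eval scripts.&lt;/span&gt;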
python scripts/train.py &lt;span class="nt"&gt;--seed&lt;/span&gt; 0 &lt;span class="nt"&gt;--max_steps&lt;/span&gt; 30000
python scripts/eval_accuracy.py &lt;span class="nt"&gt;--seeds&lt;/span&gt; 0 1 2 3
python scripts/eval_fixedpoint.py &lt;span class="nt"&gt;--seeds&lt;/span&gt; 0 1 2 3
python scripts/plot.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The training and evaluation scripts are at &lt;a href="https://github.com/SAETAG/dgx-100-experiments/tree/main/days/day07-openmythos-loop-debate/scripts" rel="noopener noreferrer"&gt;https://github.com/SAETAG/dgx-100-experiments/tree/main/days/day07-openmythos-loop-debate/scripts&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What went wrong (and was fixed)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;bf16 broke complex RoPE buffer&lt;/strong&gt;: switched to fp32; fine at 3.4M parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Initial training-time max_loop_iters too small&lt;/strong&gt;: kept at 4 per Saunshi's recipe; future experiments could vary this&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Greedy generation is slow at high loop counts&lt;/strong&gt;: each batch repeats &lt;code&gt;n_loops&lt;/code&gt; forward passes through the recurrent block; for loops=32 this is non-trivial&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Hyperparameter choices: why these
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
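&lt;code&gt;dim=256, expert_dim=512, 1 prelude / 1 coda layer&lt;/code&gt;: smallest config that still exhibits looping behavior; same small synthetic-task regime as Saunshi, though at 3.4M parameters it is below the tens of millions listed for that paper above&lt;/li&gt;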
&lt;li&gt;
&lt;code&gt;n_experts=4&lt;/code&gt;: enough to demonstrate MoE routing without bloating params&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lora_rank=8&lt;/code&gt;: depth-wise LoRA lets each loop iteration adapt slightly without breaking weight-sharing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_seq_len=32&lt;/code&gt;: tight bound; the longest d=5 example is 19 characters (5 + 1 + 5 + 1 + 6 + 1, counting the trailing &lt;code&gt;$&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kyegomez/OpenMythos" rel="noopener noreferrer"&gt;OpenMythos GitHub (Kye Gomez)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://red.anthropic.com/2026/mythos-preview/" rel="noopener noreferrer"&gt;Claude Mythos Preview (Anthropic, 2026-04-07)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/glasswing" rel="noopener noreferrer"&gt;Project Glasswing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2502.17416" rel="noopener noreferrer"&gt;Reasoning with Latent Thoughts (Saunshi et al., ICLR 2025)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2502.05171" rel="noopener noreferrer"&gt;Scaling up Test-Time Compute with Latent Reasoning (Geiping et al., Huginn)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@mbonsign/testing-the-openmythos-hypothesis-emergent-subspace-selectivity-in-looped-transformers-711f8ca0236c" rel="noopener noreferrer"&gt;Testing the OpenMythos Hypothesis (Micheal Bee)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.12946" rel="noopener noreferrer"&gt;Parcae — Scaling Laws for Stable Looped Language Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.07822" rel="noopener noreferrer"&gt;Loop, Think, &amp;amp; Generalize (Implicit Reasoning in Recurrent-Depth Transformers)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Tomorrow: Day 8
&lt;/h2&gt;

&lt;p&gt;A follow-up to Day 7, pushing looped thinking one step further into something harder…!&lt;/p&gt;

&lt;p&gt;#100ExperimentsWithDGX #LocalLLM&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>transformers</category>
    </item>
    <item>
      <title>[Day 6] I Had an AI Look at 25,000 iPhone Photos and It Decided My Mom and I Are the Same Person</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Tue, 12 May 2026 18:10:12 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-6-i-had-an-ai-look-at-25000-iphone-photos-and-it-decided-my-mom-and-i-are-the-same-person-1epo</link>
      <guid>https://dev.to/peppercorn_llm/day-6-i-had-an-ai-look-at-25000-iphone-photos-and-it-decided-my-mom-and-i-are-the-same-person-1epo</guid>
      <description>&lt;h1&gt;
  
  
  [Day 6] I Had an AI Look at 25,000 iPhone Photos and It Decided My Mom and I Are the Same Person
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Day 6!&lt;/p&gt;

&lt;p&gt;On Day 4, I had a local AI sort through 25,000 photos on my iPhone (&lt;a href="https://dev.to/peppercorn_llm/day-4-i-had-a-local-ai-sort-through-25000-photos-on-my-iphone-545p"&gt;Day 4 article&lt;/a&gt;). Today is the follow-up — I wanted to go one level deeper and have AI look at my &lt;strong&gt;behavioral patterns&lt;/strong&gt; over time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tools used: my home AI machine (DGX Spark) + a face recognition AI (&lt;a href="https://github.com/timesler/facenet-pytorch" rel="noopener noreferrer"&gt;FaceNet&lt;/a&gt;) + a summarization LLM (&lt;a href="https://qwenlm.github.io/" rel="noopener noreferrer"&gt;Qwen2.5 72B&lt;/a&gt; running on Ollama).&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Today's setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I actually did
&lt;/h3&gt;

&lt;p&gt;Take 5 years of photos (25,000) and have an AI summarize my day-to-day life from them. Two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1&lt;/strong&gt;: aggregate by capture date + camera model + photo category (cat, food, scenery, etc.), then ask the LLM to read it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2&lt;/strong&gt;: add face recognition AI to answer "who is in each photo," then ask the LLM again&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The key bit of today
&lt;/h3&gt;

&lt;p&gt;The face recognition AI treated me and my mom as &lt;strong&gt;the same person&lt;/strong&gt;. The interesting part: every other misclassification was "different people with the same expression," whereas ours was "&lt;strong&gt;different expressions, still judged the same person&lt;/strong&gt;," even though my mom was straight-faced and I was grinning with teeth showing.&lt;/p&gt;

&lt;p&gt;The AI gets fooled by expressions, but it also seems to pick up on something &lt;strong&gt;beyond expressions&lt;/strong&gt; (bone structure? face shape?). That's today's headline.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔧 How I went about it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;25,000 photos (already categorized on Day 4)
   ↓
Phase 1: aggregate "capture date / camera model / category" only
   → ask the LLM to summarize year by year
   ↓
Phase 2: add face recognition AI to label "who is in each photo"
   → ask the LLM to summarize again
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPS data ("where") had been stripped during the iCloud export, so I substituted &lt;strong&gt;camera model&lt;/strong&gt; as a proxy (iPhone = daily life, Olympus TG = travel, DJI handheld = video shoots, etc.).&lt;/p&gt;

&lt;p&gt;(The tools and detailed steps are in the "Technical details" section at the end.)&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: date / camera / category only
&lt;/h3&gt;

&lt;p&gt;I pulled "capture date + camera model + category (sorted on Day 4)" out of the 25,000 photos and turned it into &lt;strong&gt;four heatmaps&lt;/strong&gt; showing year-over-year patterns. Then handed those to the LLM.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What's a heatmap?&lt;/strong&gt; = A grid of rows × columns where each cell's color intensity reflects its count. Dense color = a hotspot of activity, visible at a glance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Photo count per year
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygvu2bt3oc4fr0yptwhi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygvu2bt3oc4fr0yptwhi.png" alt="Photo count per year" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2019 was a clear outlier at 4,931 photos — 2-3x the other years.&lt;/p&gt;

&lt;h4&gt;
  
  
  Year × Category
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72jxhygv36szmklu6we2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72jxhygv36szmklu6we2.png" alt="Year × Category heatmap" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cat photos exploded starting 2021 → matches the year my cat joined the household.&lt;/p&gt;

&lt;h4&gt;
  
  
  Year × Month
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9wp7ew6mhgh8jsnj6zl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9wp7ew6mhgh8jsnj6zl.png" alt="Year × Month activity heatmap" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;August 2019 was the single highest month at 1,082 photos.&lt;/p&gt;

&lt;h4&gt;
  
  
  Year × Camera model
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttxshy9hvrzp4x7rp598.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttxshy9hvrzp4x7rp598.png" alt="Year × Camera heatmap" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Olympus TG dropped off sharply in 2020 (matches the COVID period). The DJI handheld shows up starting 2025.&lt;/p&gt;

&lt;p&gt;When I handed this to the LLM and asked for a yearly summary, the output was along the lines of "this might have been a busy year" or "looks like an active year." Well, of course — the only info I gave the LLM was "when, what camera, what kind of subject." That's the ceiling for what it can say.&lt;/p&gt;

&lt;p&gt;So the next question: what happens if you add &lt;strong&gt;who is in each photo&lt;/strong&gt;? That's Phase 2.&lt;/p&gt;




&lt;h3&gt;
  
  
  Phase 2: adding "who's in the photo"
&lt;/h3&gt;

&lt;p&gt;I ran face recognition AI over the 25,000 photos, detected 21,000 faces, and grouped similar-looking ones into 209 groups (&lt;code&gt;C1, C2, …, C209&lt;/code&gt;). Plotting those over time:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What's a "similar-face group"?&lt;/strong&gt; = a group the face recognition AI thinks contains "the same person" (technically called a "cluster"). The AI only manages them as numbered IDs, so a human still has to look at each group and label "this is person X."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Person cluster × Year heatmap
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ht5lmqaq6183qaq75qh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ht5lmqaq6183qaq75qh.png" alt="Person cluster × Year" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This heatmap turned out to be interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Long-spanning groups&lt;/strong&gt; (C1, C2, C3) → likely family or myself&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short-spanning groups&lt;/strong&gt; → likely acquaintances from a specific period&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…which gives you a working guess. When I fed this back to the LLM, the summary turned much more concrete: "C3 is a new appearance," "C2 is decreasing in frequency," etc.&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 Today's biggest finding
&lt;/h2&gt;

&lt;p&gt;I went through the face clusters one by one and saw that the AI's groupings landed in a mix of "worked great," "fair enough," and "failed":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;✅ Worked great&lt;/strong&gt;: grouped the same person across different angles and expressions (one group had all 4 photos of the same family member)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🤔 Fair enough&lt;/strong&gt;: burst shots end up grouped (multiple groups were just consecutive frames of the same moment)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚠️ Failure pattern A&lt;/strong&gt;: grouped different people who happened to share a similar smile (happened in several groups)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;😳 Failure pattern B&lt;/strong&gt;: grouped me and my mom despite our totally different expressions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most striking one was &lt;strong&gt;failure pattern B&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure patterns A and B are misclassifications for opposite reasons
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Failure pattern A: different people, same expression
&lt;/h4&gt;

&lt;p&gt;Different people grouped together because of a &lt;strong&gt;similar smile&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9sgadvv259yhrvqq2v34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9sgadvv259yhrvqq2v34.png" alt="Different people grouped due to similar smile (illustration)" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Three different people — but when smiles are similar, the AI calls them "same person."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;※The actual experiment used real photos. The illustrations here are AI-generated stand-ins for privacy.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Failure pattern B: parent and child, different expressions
&lt;/h4&gt;

&lt;p&gt;My mom and I in the same group — despite the expression difference (I'm grinning with teeth showing, she's neutral).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0un4gou9kejrh38y116.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0un4gou9kejrh38y116.png" alt="Parent and child grouped despite different expressions (illustration)" width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Parent and child with clearly different expressions — but the AI still says "same person."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;※The actual experiment used real photos. The illustrations here are AI-generated stand-ins for privacy.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the groups I eyeballed, "different expressions but same person" only happened in our case. Every other misclassification was "same expression, different people."&lt;/p&gt;

&lt;p&gt;So my mom and I are a different kind of mistake. Either the AI is picking up on &lt;strong&gt;genetic facial similarity&lt;/strong&gt;, or there's some other mechanism at work (I'll touch on this in the technical details). Hard to be definitive, but a fascinating case.&lt;/p&gt;




&lt;h3&gt;
  
  
  Summary: how the AI "sees" faces
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;AI judgment&lt;/th&gt;
&lt;th&gt;Likely reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Same person, different angle &amp;amp; expression&lt;/td&gt;
&lt;td&gt;◯ to △&lt;/td&gt;
&lt;td&gt;Bone structure matches well&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Different people, same expression&lt;/td&gt;
&lt;td&gt;✕ (often grouped)&lt;/td&gt;
&lt;td&gt;Pulled in by expression noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parent &amp;amp; child, different expressions&lt;/td&gt;
&lt;td&gt;✕ (sometimes grouped)&lt;/td&gt;
&lt;td&gt;Bone structure similarity outweighs expression difference&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The AI gets fooled by expressions, but seems to actually pick up on something &lt;strong&gt;beyond expressions&lt;/strong&gt; (bone structure? face shape?) — that was the most interesting observation of the day.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Technical details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tools used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EXIF extraction&lt;/strong&gt;: Python &lt;code&gt;pillow_heif&lt;/code&gt; + &lt;code&gt;PIL.Image.getexif()&lt;/code&gt; (HEIC-aware)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Face recognition&lt;/strong&gt;: &lt;code&gt;facenet-pytorch&lt;/code&gt; (&lt;code&gt;InceptionResnetV1&lt;/code&gt;, vggface2-pretrained)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clustering&lt;/strong&gt;: scikit-learn &lt;code&gt;DBSCAN&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM summarization&lt;/strong&gt;: Qwen2.5 72B via Ollama&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute&lt;/strong&gt;: DGX Spark (lots of GPU memory)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What's EXIF?&lt;/strong&gt; = the camera metadata embedded in each photo file (capture time, camera model, GPS, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's FaceNet?&lt;/strong&gt; = an AI that converts a face photo into a 512-dimensional vector. Faces of the same person map to nearby vectors; faces of different people land far apart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's DBSCAN?&lt;/strong&gt; = a classic ML clustering method that automatically groups similar items. You don't need to specify the number of groups in advance.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h3&gt;
  
  
  EXIF extraction script (parallelized, 6 seconds total)
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;pillow_heif&lt;/code&gt; to support HEIC, &lt;code&gt;PIL.Image.getexif()&lt;/code&gt; to read EXIF. Parallelized with &lt;code&gt;concurrent.futures.ProcessPoolExecutor&lt;/code&gt; (12 processes).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pillow_heif&lt;/span&gt;
&lt;span class="n"&gt;pillow_heif&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_heif_opener&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_exif_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Assumed helper: EXIF stores dates as YYYY:MM:DD HH:MM:SS
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y:%m:%d %H:%M:%S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;exif&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getexif&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;inner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exif&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_ifd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;34665&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ExifIFD
&lt;/span&gt;        &lt;span class="c1"&gt;# DateTimeOriginal (36867) lives inside ExifIFD; dt stays None when the tag is missing
&lt;/span&gt;        &lt;span class="n"&gt;dt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="mi"&gt;36867&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;inner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;dt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_exif_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inner&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;36867&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;make&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exif&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;271&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# Make
&lt;/span&gt;        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exif&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;272&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Model
&lt;/span&gt;        &lt;span class="n"&gt;gps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exif&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_ifd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;34853&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# GPS IFD
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;make&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gps&lt;/span&gt;  &lt;span class="c1"&gt;# return shape assumed for illustration
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Photos with no EXIF date (screenshots, etc.) fall back to file mtime, but that's just "the day I copied the file," so I excluded those from year-level aggregation.&lt;/p&gt;
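&lt;p&gt;To make the exclusion concrete, here is a minimal sketch of the year × month count with undated photos skipped. The record layout (a &lt;code&gt;datetime&lt;/code&gt; key that is &lt;code&gt;None&lt;/code&gt; when no EXIF date exists), the file names, and the helper name &lt;code&gt;count_by_year_month&lt;/code&gt; are assumptions for illustration, not the script I actually ran.&lt;/p&gt;

```python
from collections import Counter
from datetime import datetime

# Hypothetical records in the shape an EXIF pass might produce;
# "datetime" is None when the photo had no DateTimeOriginal tag.
records = [
    {"path": "a.heic", "datetime": datetime(2019, 8, 3, 10, 15)},
    {"path": "b.heic", "datetime": datetime(2019, 8, 21, 18, 2)},
    {"path": "c.jpg", "datetime": datetime(2021, 1, 5, 9, 30)},
    {"path": "screenshot.png", "datetime": None},
]

def count_by_year_month(rows):
    """Count photos per (year, month), skipping rows without an EXIF date."""
    counts = Counter()
    for row in rows:
        dt = row["datetime"]
        if dt is None:
            continue  # the mtime fallback is unreliable, so leave these out
        counts[(dt.year, dt.month)] += 1
    return counts

year_month_counts = count_by_year_month(records)
dated = sum(year_month_counts.values())
```

&lt;p&gt;The resulting counter is keyed by &lt;code&gt;(year, month)&lt;/code&gt;, which is the shape a Year × Month heatmap needs; the screenshot never enters the totals.&lt;/p&gt;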


&lt;h3&gt;
  
  
  Tuning DBSCAN's eps
&lt;/h3&gt;

&lt;p&gt;Distance between embeddings is cosine distance (1 - dot product of the L2-normalized vectors).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;eps&lt;/th&gt;
&lt;th&gt;Clusters&lt;/th&gt;
&lt;th&gt;Largest cluster size&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.4 (loose)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;21,310&lt;/td&gt;
&lt;td&gt;Everyone in one group&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;td&gt;73&lt;/td&gt;
&lt;td&gt;17,234&lt;/td&gt;
&lt;td&gt;Still big lumps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;146&lt;/td&gt;
&lt;td&gt;12,905&lt;/td&gt;
&lt;td&gt;Still too big&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;0.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;209&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4,582&lt;/td&gt;
&lt;td&gt;◎ chosen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.18&lt;/td&gt;
&lt;td&gt;216&lt;/td&gt;
&lt;td&gt;3,131&lt;/td&gt;
&lt;td&gt;Too tight — single people split into multiple clusters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.cluster&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DBSCAN&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;

&lt;span class="n"&gt;embeds_n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;l2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DBSCAN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cosine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeds_n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



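&lt;p&gt;&lt;code&gt;min_samples=5&lt;/code&gt; means a face only seeds a cluster when at least 5 faces (itself included) fall within &lt;code&gt;eps&lt;/code&gt; of it, so in practice only people who show up 5+ times across photos get clustered. Faces that fit no cluster are labeled &lt;code&gt;-1&lt;/code&gt; (noise).&lt;/p&gt;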
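&lt;p&gt;For a feel of what an &lt;code&gt;eps&lt;/code&gt; of 0.2 means in cosine distance, here is a small standard-library sketch. The 3-dim vectors are made-up stand-ins for the real 512-dim embeddings, and the helper names are hypothetical.&lt;/p&gt;

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_distance(a, b):
    """1 - dot product of the L2-normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a_n, b_n))

# Toy 3-dim vectors standing in for 512-dim face embeddings (made-up values).
face_a = [1.0, 0.0, 0.0]
face_b = [0.9, 0.1, 0.0]   # nearly the same direction as face_a
face_c = [0.0, 1.0, 0.0]   # orthogonal to face_a

eps = 0.2
d_ab = cosine_distance(face_a, face_b)
d_ac = cosine_distance(face_a, face_c)
neighbors_ab = eps >= d_ab   # True: face_b falls inside the eps radius of face_a
neighbors_ac = eps >= d_ac   # False: face_c is too far away
```

&lt;p&gt;Loosening &lt;code&gt;eps&lt;/code&gt; lets more and more faces count as neighbors, which matches the table above: at 0.4 almost everything merges into one group.&lt;/p&gt;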


&lt;h3&gt;
  
  
  Why parent and child tend to land in the same cluster
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;facenet-pytorch&lt;/code&gt;'s &lt;code&gt;InceptionResnetV1&lt;/code&gt; (vggface2-pretrained) produces 512-dim embeddings. The network is trained to tell identities apart, so the embeddings tend to encode stable traits such as face geometry (bone structure), though lighting, angle, and expression noise also leak in.&lt;/p&gt;

&lt;p&gt;Parent and child share inherited bone structure, so their embeddings can sit closer together than those of two unrelated people. This is a known effect in face recognition research, where kinship verification is studied as a task in its own right.&lt;/p&gt;

&lt;p&gt;DBSCAN is density-based: if "A→B is close" and "B→C is close," and B has enough neighbors to count as a core point, then A and C end up in the same cluster even if A and C aren't directly close. If there's one photo of me looking especially like my mom that sits in between and has enough neighbors on both sides, that single bridge photo can connect us into one cluster.&lt;/p&gt;
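&lt;p&gt;The bridging effect can be reproduced with a toy example. The sketch below is a deliberately tiny DBSCAN over 1-D numbers, written only to show the chaining; it is not scikit-learn's implementation, and the positions, &lt;code&gt;eps&lt;/code&gt;, and &lt;code&gt;min_samples&lt;/code&gt; values are made up.&lt;/p&gt;

```python
def dbscan_1d(points, eps, min_samples):
    """Minimal DBSCAN for 1-D points; returns one label per point (-1 = noise)."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if eps >= abs(points[i] - q)]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if min_samples > len(seeds):
            labels[i] = -1           # provisional noise; may turn into a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)   # j is a core point, so keep expanding through it
    return labels

# Made-up 1-D positions standing in for face embeddings.
me = [0.0, 0.1, 0.2]
mom = [0.6, 0.7, 0.8]
bridge = [0.4]               # one photo of me that happens to look like my mom

without_bridge = dbscan_1d(me + mom, eps=0.25, min_samples=3)
with_bridge = dbscan_1d(me + bridge + mom, eps=0.25, min_samples=3)
```

&lt;p&gt;Without the bridge point the two groups get separate labels; with it, every point ends up sharing one label, even though the outermost "me" and "mom" points are far more than &lt;code&gt;eps&lt;/code&gt; apart.&lt;/p&gt;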


&lt;h3&gt;
  
  
  Generating representative face thumbnails for manual labeling
&lt;/h3&gt;

&lt;p&gt;Clusters are just IDs (C0, C1, …), so a human has to look at them and label "this is person X."&lt;/p&gt;

&lt;p&gt;I wrote a script that crops the largest face from each cluster's representative photos and lays them out as a diagnostic image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;facenet_pytorch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MTCNN&lt;/span&gt;

&lt;span class="n"&gt;mtcnn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MTCNN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keep_all&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_face_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;boxes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mtcnn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;detect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;boxes&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;biggest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boxes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="n"&gt;crop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;crop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;biggest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This image contains real faces of family and friends, so I kept it strictly local in &lt;code&gt;private-data/day06-timeline/&lt;/code&gt; (gitignored). Opened it via VS Code Remote-SSH to label by eye.&lt;/p&gt;





&lt;h2&gt;
  
  
  📝 Today's takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Handing the LLM only "when / what camera / what category" yields a blurry overview&lt;/li&gt;
&lt;li&gt;Adding "who is in the photo" sharpens the analysis by several notches&lt;/li&gt;
&lt;li&gt;Face recognition AI is sensitive to expression noise but does pick up something beyond expressions (bone structure / face shape)&lt;/li&gt;
&lt;li&gt;That would explain why my mom and I were the only pair grouped "despite different expressions," at least among the groups I checked&lt;/li&gt;
&lt;li&gt;Keeping sensitive face data off the cloud is a big advantage of running this locally&lt;/li&gt;
&lt;li&gt;Processing 25,000 photos in one go is also realistic on a local setup — no API costs to worry about&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Tomorrow's preview: Day 7
&lt;/h2&gt;

&lt;p&gt;Day 7 plan: &lt;strong&gt;local AI vs cloud AI, 5-round showdown&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Going to take the tasks I usually do with local AI (photo classification, credit card analysis, code completion, etc.), run them on both sides, and build a head-to-head matrix.&lt;/p&gt;

&lt;p&gt;To be continued &amp;gt;&amp;gt;&amp;gt;&lt;/p&gt;




&lt;p&gt;#100ExperimentsWithDGX #LocalLLM #ImageAnalysis #FaceNet&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>facenet</category>
    </item>
    <item>
      <title>[Day 5] I Trained My Cat-LoRA on 22 vs 213 Photos and the Results Were Basically Identical</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Mon, 11 May 2026 01:23:26 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-5-my-cat-lora-got-worse-with-45x-more-photos-so-i-figured-out-why-and-fixed-it-i6m</link>
      <guid>https://dev.to/peppercorn_llm/day-5-my-cat-lora-got-worse-with-45x-more-photos-so-i-figured-out-why-and-fixed-it-i6m</guid>
      <description>&lt;h1&gt;
  
  
  [Day 5] I Trained My Cat-LoRA on 22 vs 213 Photos and the Results Were Basically Identical
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Day 5!&lt;/p&gt;

&lt;p&gt;Today was originally going to be "have AI analyze a year of my Amazon order history," but downloading the Amazon purchase history just wouldn't work no matter what I tried. So that was a bust.&lt;/p&gt;

&lt;p&gt;Pivoted.&lt;/p&gt;

&lt;p&gt;On Day 2, I trained an AI to memorize my cat from 22 photos (&lt;a href="https://dev.to/peppercorn_llm/day-2-i-trained-an-ai-on-22-photos-of-my-cat-now-it-draws-her-in-any-scene-3a92"&gt;Day 2 article&lt;/a&gt;). That thing is called a "LoRA."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What's a LoRA?&lt;/strong&gt; = A small add-on that teaches an AI to recognize a specific subject. Pair photos with a trigger word like &lt;code&gt;ohwx cat&lt;/code&gt;, train, and then writing &lt;code&gt;ohwx cat&lt;/code&gt; in any prompt makes the AI draw my cat.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On Day 4, I had AI sort through 25,000 photos on my iPhone (&lt;a href="https://dev.to/peppercorn_llm/day-4-i-had-a-local-ai-sort-through-25000-photos-on-my-iphone-545p"&gt;Day 4 article&lt;/a&gt;). It found &lt;strong&gt;999 photos&lt;/strong&gt; it identified as cats.&lt;/p&gt;

&lt;p&gt;Today's experiment: &lt;strong&gt;Will using those 999 photos make my cat-LoRA stronger?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A simple expectation, really. 22 photos → 999 photos is &lt;strong&gt;45x more data&lt;/strong&gt;. Surely the LoRA gets stronger, right?&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;The short version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training with 999 photos made things &lt;strong&gt;worse&lt;/strong&gt;, not better&lt;/li&gt;
&lt;li&gt;After removing "other people's cats" from the dataset (down to &lt;strong&gt;213 photos&lt;/strong&gt;), I got LoRA quality matching my original 22-photo version&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;22 photos and 213 photos produced basically the same quality&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I came in thinking "more photos = stronger LoRA." Turns out &lt;strong&gt;that's not really how it works&lt;/strong&gt;, and today I learned why.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I actually did
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Trained on 999 photos → got worse (v2)
&lt;/h3&gt;

&lt;p&gt;Same base model and trigger word (&lt;code&gt;ohwx cat&lt;/code&gt;) as Day 2. Just bumped the photo count from 22 to 999. Kohya_ss training, 14 minutes. Calling this &lt;strong&gt;v2&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Generated test images and…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftwv75qe21hy51ishsaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftwv75qe21hy51ishsaz.png" alt="No LoRA / v1 / v2 comparison" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photorealistic scene (left: no LoRA, center: v1=22 photos, right: v2=999 photos). &lt;strong&gt;v2 looks barely different from no-LoRA.&lt;/strong&gt; 45x more data, but the cat identity is gone.&lt;/p&gt;

&lt;p&gt;Creative prompts were worse:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbkxtftcyq7hedns58r1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbkxtftcyq7hedns58r1.png" alt="Chef v1 vs v2" width="800" height="628"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prompt: "ohwx cat as a cute chef." v2 produced &lt;strong&gt;a human woman&lt;/strong&gt; as the chef, with the cat reduced to a tiny illustration on her apron.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxm6ineeqblhoyid8p0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxm6ineeqblhoyid8p0p.png" alt="Astronaut v1 vs v2" width="800" height="628"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prompt: "ohwx cat as an astronaut." v2 produced &lt;strong&gt;a tabby (orange-striped) cat&lt;/strong&gt; — the fur color is straight up wrong. My cat is black and white.&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;More data made the LoRA broadly worse&lt;/strong&gt;, across both photorealistic and creative prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cause: "other cats" had snuck into the dataset
&lt;/h3&gt;

&lt;p&gt;Once I thought about it, it was obvious.&lt;/p&gt;

&lt;p&gt;Day 4's classifier labels images as &lt;strong&gt;"contains a cat or not"&lt;/strong&gt; — it does NOT verify "is this MY cat." So the 999-photo "cat" folder included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;My cat&lt;/li&gt;
&lt;li&gt;Friends' and family's cats&lt;/li&gt;
&lt;li&gt;Stray cats from around town&lt;/li&gt;
&lt;li&gt;Cats at pet stores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All mixed together. When I trained with the label &lt;code&gt;ohwx cat = my cat&lt;/code&gt;, the model basically learned &lt;code&gt;ohwx cat ≈ generic cat-shape&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pulled out just my cat → 213 photos (v3)
&lt;/h3&gt;

&lt;p&gt;To curate, I borrowed another AI — &lt;strong&gt;CLIP&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What's CLIP?&lt;/strong&gt; = An OpenAI image-understanding model. It turns each image into an embedding vector, and comparing two of those vectors gives a similarity score.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I used the 22 confirmed-my-cat photos from Day 2 as a reference set, then asked CLIP to score how similar each of the 999 candidates was. Sorted by score, threw the thumbnails into a single HTML page, and went through visually — checking "this one's a different cat", "this has a person in it", and so on, marking exclusions as I went.&lt;/p&gt;

&lt;p&gt;Final cut: &lt;strong&gt;213 photos, all confirmed to be my cat&lt;/strong&gt;. Re-trained → &lt;strong&gt;v3&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbflgk6b8xso3b7rt6vwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbflgk6b8xso3b7rt6vwq.png" alt="No LoRA / v1 / v2 / v3" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;v3 is &lt;strong&gt;as sharp as v1&lt;/strong&gt;. Tuxedo pattern, white chest, the works.&lt;/p&gt;

&lt;p&gt;Creative prompts came back too:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbp038vlynrt2w9x48nhr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbp038vlynrt2w9x48nhr.png" alt="Chef v1 / v2 / v3" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The human chef from v2 is gone, replaced by my cat. The astronaut and forest cat similarly snapped back (more comparisons in the collapsible section below).&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;Cleaning the data was enough to fix everything.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus: also tried natural-language captions (v4)
&lt;/h3&gt;

&lt;p&gt;One more thing I wanted to test.&lt;/p&gt;

&lt;p&gt;v1 (Day 2) and v3 (today) differ in their &lt;strong&gt;captions&lt;/strong&gt; — the text labels paired with each training photo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;v1: hand-written natural sentences (&lt;code&gt;ohwx cat, walking on a metal kitchen counter, side profile, indoor kitchen with spice bottles...&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;v3: just the trigger word (&lt;code&gt;ohwx cat&lt;/code&gt;) repeated for every image&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What's a caption?&lt;/strong&gt; = A short English text describing what's in each photo, paired with that photo during training.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Would adding richer captions on top of clean data push v3 further? Hand-writing 213 captions wasn't realistic, so I had &lt;strong&gt;another AI (Qwen2-VL) auto-generate them&lt;/strong&gt;. Calling this &lt;strong&gt;v4&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;v4 looked basically identical to v3.&lt;/strong&gt; Small differences here and there but nothing substantial.&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;Caption granularity barely matters once the data is clean.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The actual question: does more data make a stronger LoRA?
&lt;/h2&gt;

&lt;p&gt;Now for the real comparison. &lt;strong&gt;v1 (22 photos)&lt;/strong&gt; vs &lt;strong&gt;v4 (213 photos)&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Photos&lt;/th&gt;
&lt;th&gt;Data purity&lt;/th&gt;
&lt;th&gt;Captions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;v1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;22&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;My cat only&lt;/td&gt;
&lt;td&gt;Hand-written natural language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;v4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;213&lt;/strong&gt; (10x!)&lt;/td&gt;
&lt;td&gt;My cat only&lt;/td&gt;
&lt;td&gt;VLM natural language (same style)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The only meaningful difference is &lt;strong&gt;photo count&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Five-way comparison:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jqdzppduw2zva026hv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jqdzppduw2zva026hv7.png" alt="No LoRA / v1 / v2 / v3 / v4" width="800" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Left to right: no LoRA, &lt;strong&gt;v1 (22)&lt;/strong&gt;, v2 (999, contaminated), v3 (213, trigger-only), &lt;strong&gt;v4 (213, natural captions)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v1 and v4 are essentially the same quality.&lt;/strong&gt; To my eye, v1 has a slightly more painterly feel on the chef prompt, but otherwise — same.&lt;/p&gt;

&lt;p&gt;Same pattern across all the other prompts:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwssthpa295likl4hflsz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwssthpa295likl4hflsz.png" alt="Chef v1 / v2 / v3 / v4" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;10x more photos. No visible improvement.&lt;/strong&gt; This was today's main finding.&lt;/p&gt;




&lt;h2&gt;
  
  
  After the fact, I looked it up. Turns out this is common knowledge.
&lt;/h2&gt;

&lt;p&gt;I found "more photos doesn't help" interesting enough to look up afterward, and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Character LoRAs are typically trained on &lt;strong&gt;25–40 images&lt;/strong&gt;, with 40–80 as a soft cap&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;"Over 30 images shows diminishing returns; dataset quality matters more than dataset size"&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;"15–20 well-curated images beat 50 mediocre ones"&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Too many images can actually overfit and degrade the result&lt;/li&gt;
&lt;li&gt;DreamBooth (a closely related technique) was designed around &lt;strong&gt;3–5 images&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ It's a &lt;strong&gt;widely repeated rule of thumb among LoRA trainers&lt;/strong&gt;: photo count saturates fast, and dataset purity is the real lever.&lt;/p&gt;

&lt;p&gt;Day 2's 22 photos? Turns out that was already a healthy amount.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I learned today
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Quality &amp;gt; Quantity, apparently
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;22 photos (v1) ≈ 213 photos (v4): photo count doesn't push quality much&lt;/li&gt;
&lt;li&gt;999 photos (v2): contamination made things worse&lt;/li&gt;
&lt;li&gt;213 photos (v3): cleaning brought everything back&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;"More photos = better LoRA" runs out of road fast. What actually moves the needle is &lt;strong&gt;the right photos, not more photos&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  A working playbook (so far)
&lt;/h3&gt;

&lt;p&gt;From today's experiments:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Source photos that match the goal&lt;/strong&gt; (photos of MY cat, not "any cat")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aim for 20–30 photos&lt;/strong&gt; — past that, diminishing returns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Captions help, but don't sweat the wording&lt;/strong&gt; — auto-generated is fine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you must use a big dataset, curate aggressively first&lt;/strong&gt; — contamination is brutal&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  💡 Tip: when you want to use a big dataset anyway
&lt;/h3&gt;

&lt;p&gt;If you're starting from a large unfiltered pile and still want to train on as much of it as possible, &lt;strong&gt;pre-curation is essential&lt;/strong&gt;. The approach that worked today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick a small "ground truth" set (~20 confirmed examples)&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;CLIP image similarity&lt;/strong&gt; to score the big pile against the ground truth&lt;/li&gt;
&lt;li&gt;Browse thumbnails sorted by score, eyeball-exclude the misses&lt;/li&gt;
&lt;li&gt;Train on what's left&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Details in the collapsible section below.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technical details (the AI explains)
&lt;/h2&gt;

&lt;p&gt;The implementation details, walked through by Claude.&lt;/p&gt;

&lt;p&gt;:::details 1. More v2 failure examples&lt;/p&gt;

&lt;p&gt;Skipped from the main body for length, but worth seeing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu953mrzwnoiwb63zrig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu953mrzwnoiwb63zrig.png" alt="Fantasy forest v1 vs v2" width="800" height="628"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prompt: "ohwx cat in a magical forest." v2 produced &lt;strong&gt;a black-bear-style illustration&lt;/strong&gt; — the cat identity is completely gone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tfqvxk2cudqz5n056kl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tfqvxk2cudqz5n056kl.png" alt="Balcony v1 vs v2" width="800" height="628"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The one photorealistic-ish prompt where v2 sort-of held it together.&lt;/p&gt;

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details 2. Data prep and CLIP similarity ranking&lt;/p&gt;

&lt;p&gt;Day 4's &lt;code&gt;_review/cat/&lt;/code&gt; had 1,009 symlinks (503 HEIC, 505 JPG, 1 other). Resized to short-side 512px:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 shared/utils/resize-shortside.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--src&lt;/span&gt; private-data/iphone-photos-classified/_review/cat &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dst&lt;/span&gt; private-data/cat-lora-v2/images-512 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--short-side&lt;/span&gt; 512
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1,009 → 999 after collisions (9 stem collisions where &lt;code&gt;IMG_XXXX.HEIC&lt;/code&gt; and &lt;code&gt;IMG_XXXX.JPG&lt;/code&gt; produced the same &lt;code&gt;.jpg&lt;/code&gt; name) and 1 resize failure.&lt;/p&gt;
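&lt;p&gt;A minimal sketch of how those stem collisions could be spotted before resizing. It assumes every source file is written out as &lt;code&gt;stem.jpg&lt;/code&gt;; the function name &lt;code&gt;find_stem_collisions&lt;/code&gt; is hypothetical and is not part of &lt;code&gt;resize-shortside.py&lt;/code&gt;.&lt;/p&gt;

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_stem_collisions(filenames):
    """Group source files whose converted .jpg output name would be identical."""
    by_output = defaultdict(list)
    for name in filenames:
        # Assumption: every source file is saved as "stem.jpg", so only the stem matters.
        output_name = PurePosixPath(name).stem + ".jpg"
        by_output[output_name].append(name)
    # Keep only output names claimed by more than one source file.
    return {out: sorted(srcs) for out, srcs in by_output.items() if len(srcs) != 1}


collisions = find_stem_collisions(["IMG_0001.HEIC", "IMG_0001.JPG", "IMG_0002.JPG"])
```

&lt;p&gt;Running a check like this up front makes it a deliberate choice which of the two files survives, instead of whichever one happens to be written last.&lt;/p&gt;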

&lt;p&gt;CLIP similarity scoring with &lt;code&gt;openai/clip-vit-base-patch32&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CLIPModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CLIPProcessor&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CLIPModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/clip-vit-base-patch32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ref_feats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ref_paths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# 22 refs
&lt;/span&gt;&lt;span class="n"&gt;cand_feats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cand_paths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# 999 candidates
&lt;/span&gt;&lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cand_feats&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;ref_feats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;                     &lt;span class="c1"&gt;# (999, 22)
&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                            &lt;span class="c1"&gt;# (999,)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
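&lt;p&gt;To make the scoring step concrete without loading CLIP, here is a pure-Python sketch of what the last two lines compute: the mean cosine similarity between each candidate and every reference, then a ranking. The toy 2-D vectors stand in for real CLIP features, and the helper names (&lt;code&gt;l2_normalize&lt;/code&gt;, &lt;code&gt;mean_similarity&lt;/code&gt;, &lt;code&gt;rank_candidates&lt;/code&gt;) are hypothetical.&lt;/p&gt;

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_similarity(candidate, references):
    """Average cosine similarity between one candidate and every reference."""
    cand = l2_normalize(candidate)
    sims = [
        sum(a * b for a, b in zip(cand, l2_normalize(ref)))
        for ref in references
    ]
    return sum(sims) / len(sims)


def rank_candidates(candidates, references):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, mean_similarity(vec, references)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-D "embeddings" standing in for real CLIP features (assumption for illustration).
refs = [[1.0, 0.0], [0.9, 0.1]]
cands = {"my_cat.jpg": [1.0, 0.05], "other_cat.jpg": [0.2, 1.0]}
ranking = rank_candidates(cands, refs)
```

&lt;p&gt;Averaging over all references rewards candidates that resemble the reference set as a whole. Taking the maximum instead would reward a close match to any single reference photo, which is a different trade-off.&lt;/p&gt;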



&lt;p&gt;Score distribution:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score band&lt;/th&gt;
&lt;th&gt;Contents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;≥ 0.85&lt;/td&gt;
&lt;td&gt;Almost all solo shots of my cat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.76 – 0.85&lt;/td&gt;
&lt;td&gt;Mostly my cat, with occasional other-cat or human contamination&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 0.76&lt;/td&gt;
&lt;td&gt;Mostly other cats or photos with people&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cut at 0.76 and reviewed everything above visually. 312 manual exclusions later: &lt;strong&gt;213 photos&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details 3. Browser-based curation UI&lt;/p&gt;

&lt;p&gt;A single HTML page laying out all 999 thumbnails in score order, served via &lt;code&gt;python3 -m http.server&lt;/code&gt;. Each thumbnail has a checkbox:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"cell"&lt;/span&gt; &lt;span class="na"&gt;data-name=&lt;/span&gt;&lt;span class="s"&gt;"IMG_2906.jpg"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;img&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"thumbs-256/IMG_2906.jpg"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"meta"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;#1 0.871&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"checkbox"&lt;/span&gt; &lt;span class="na"&gt;onchange=&lt;/span&gt;&lt;span class="s"&gt;"toggleExclude(this)"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;exportExcluded&lt;/span&gt;&lt;span class="p"&gt;(){&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelectorAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.cell.excluded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;excluded.txt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Click "Export excluded list" to download &lt;code&gt;excluded.txt&lt;/code&gt;, then use that to filter the training dir.&lt;/p&gt;

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details 4. Training configs (Kohya_ss / TOML)&lt;/p&gt;

&lt;p&gt;The core training settings are the same across v1/v2/v3/v4. Only the dataset, the output name, and the repeat/epoch counts (tuned to keep total steps comparable) change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;output_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ohwx_cat_v3"&lt;/span&gt;   &lt;span class="c"&gt;# or v4&lt;/span&gt;
&lt;span class="py"&gt;max_train_epochs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="py"&gt;network_dim&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;
&lt;span class="py"&gt;network_alpha&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;span class="py"&gt;unet_lr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1e-4&lt;/span&gt;
&lt;span class="py"&gt;text_encoder_lr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;5e-5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step count is also matched:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Photos × repeats × epochs ÷ batch size&lt;/th&gt;
&lt;th&gt;Steps&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v1&lt;/td&gt;
&lt;td&gt;22 × 10 × 10 ÷ 2&lt;/td&gt;
&lt;td&gt;1,100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v2&lt;/td&gt;
&lt;td&gt;999 × 1 × 2 ÷ 2&lt;/td&gt;
&lt;td&gt;999&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v3 / v4&lt;/td&gt;
&lt;td&gt;213 × 5 × 2 ÷ 2&lt;/td&gt;
&lt;td&gt;1,065&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
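&lt;p&gt;The arithmetic in the table can be checked with a few lines of Python. This sketch assumes the four factors are photos, repeats per photo, epochs, and batch size, in that order; &lt;code&gt;total_steps&lt;/code&gt; is a hypothetical helper, not a Kohya_ss function.&lt;/p&gt;

```python
def total_steps(photos, repeats, epochs, batch_size):
    """Optimizer steps for one run: photos x repeats x epochs / batch size."""
    # Integer division is a simplification; a trainer may round partial batches up.
    return photos * repeats * epochs // batch_size


runs = {
    "v1": total_steps(22, 10, 10, 2),
    "v2": total_steps(999, 1, 2, 2),
    "v3/v4": total_steps(213, 5, 2, 2),
}
```

&lt;p&gt;The point of the different repeat and epoch values is that each photo in the small set is seen many more times than each photo in the large set, while the total number of updates stays about the same.&lt;/p&gt;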

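&lt;p&gt;All four runs land at roughly 1,000 steps (999 to 1,100), so training length is effectively held constant and the variables in play are the dataset (photo count and purity) and caption granularity.&lt;br&gt;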
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/Kohya_ss &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
accelerate launch &lt;span class="nt"&gt;--num_cpu_threads_per_process&lt;/span&gt; 8 train_network.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config_file&lt;/span&gt; configs/train_v3.toml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dataset_config&lt;/span&gt; configs/dataset_v3.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DGX Spark, 1.4 it/s, ~14 minutes per training run.&lt;/p&gt;

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details 5. Qwen2-VL caption auto-generation&lt;/p&gt;

&lt;p&gt;Reusing Day 4's Qwen2-VL 7B Instruct setup. The prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Describe what is happening in this cat photo using short comma-separated
phrases. Cover: (1) the cat's pose or action, (2) the view angle,
(3) the setting and notable background details. Keep it under 25 words.
Do NOT describe the cat's appearance (color, breed, fur, markings) — focus
only on the scene. Output the description directly without any preamble.
Example: walking on a metal kitchen counter, side profile, indoor kitchen
with spice bottles and shelves in the background
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "do not describe the cat's appearance" line is intentional: identity is supposed to come from the trigger word &lt;code&gt;ohwx cat&lt;/code&gt;, so captions should only describe context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;desc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;vlm_caption&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;caption&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ohwx cat, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;txt_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caption&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
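&lt;p&gt;VLM output is not always a clean single line, so a little tidying before writing the caption file can help. The sketch below is an assumption about how that could be done, not the code used here: &lt;code&gt;build_caption&lt;/code&gt; and &lt;code&gt;write_caption&lt;/code&gt; are hypothetical helpers, and the caption is assumed to live in a &lt;code&gt;.txt&lt;/code&gt; file with the same stem as the image.&lt;/p&gt;

```python
from pathlib import Path

TRIGGER = "ohwx cat"


def build_caption(description, trigger=TRIGGER):
    """Prefix the trigger word and tidy the VLM text into one line."""
    # Collapse newlines and repeated spaces, then drop a trailing period.
    cleaned = " ".join(description.split()).rstrip(".")
    return f"{trigger}, {cleaned}"


def write_caption(image_path, description):
    """Write the caption to a .txt file next to the image (same stem)."""
    txt_path = Path(image_path).with_suffix(".txt")
    txt_path.write_text(build_caption(description) + "\n", encoding="utf-8")
    return txt_path
```

&lt;p&gt;Keeping the trigger word as a fixed prefix means every caption starts the same way, and the only thing that varies from photo to photo is the scene description.&lt;/p&gt;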



&lt;p&gt;213 captions in 6 minutes. Sample output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ohwx cat, sitting, side view, indoor setting, wooden floor,
folding chair, curtain, air conditioner
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stylistically very close to Day 2's hand-written captions.&lt;/p&gt;

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details 6. Version summary&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;v1 (Day 2)&lt;/th&gt;
&lt;th&gt;v2&lt;/th&gt;
&lt;th&gt;v3&lt;/th&gt;
&lt;th&gt;v4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Photos&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;999&lt;/td&gt;
&lt;td&gt;213&lt;/td&gt;
&lt;td&gt;213&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cat content&lt;/td&gt;
&lt;td&gt;My cat only&lt;/td&gt;
&lt;td&gt;My cat + many others&lt;/td&gt;
&lt;td&gt;My cat only&lt;/td&gt;
&lt;td&gt;My cat only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Captions&lt;/td&gt;
&lt;td&gt;Hand-written natural&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ohwx cat&lt;/code&gt; only&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ohwx cat&lt;/code&gt; only&lt;/td&gt;
&lt;td&gt;VLM natural&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total steps&lt;/td&gt;
&lt;td&gt;1,100&lt;/td&gt;
&lt;td&gt;999&lt;/td&gt;
&lt;td&gt;1,065&lt;/td&gt;
&lt;td&gt;1,065&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training time&lt;/td&gt;
&lt;td&gt;13m 3s&lt;/td&gt;
&lt;td&gt;14m 0s&lt;/td&gt;
&lt;td&gt;14m 0s&lt;/td&gt;
&lt;td&gt;14m 0s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What each pair isolates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v2 vs v3&lt;/strong&gt; → effect of data purity (same captions; note that photo count also drops from 999 to 213)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v3 vs v4&lt;/strong&gt; → effect of caption granularity (same data, only captions differ)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v1 vs v4&lt;/strong&gt; → effect of photo count (clean data and natural-language captions in both; v1's are hand-written, v4's are VLM-generated)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details 7. References on LoRA training dataset size&lt;/p&gt;

&lt;p&gt;The "diminishing returns past ~30 photos" claim has multiple sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;20–30 photos saturates; dataset quality &amp;gt; dataset size&lt;/strong&gt; (&lt;a href="https://civitai.com/articles/699/large-dataset-lora-tips-and-tricks-google-colab-sd-15-optimized" rel="noopener noreferrer"&gt;Civitai: Large Dataset LoRA Tips&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15–20 well-curated images beat 50 mediocre ones&lt;/strong&gt; (same)&lt;/li&gt;
&lt;li&gt;Over-training and "overcooked" LoRAs from too much data (&lt;a href="https://huggingface.co/blog/FPHam/lora-secrets-1" rel="noopener noreferrer"&gt;Hugging Face Blog: After 500+ LoRAs&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;DreamBooth (the original subject-finetuning technique) was designed around 3–5 images (&lt;a href="https://dreambooth.github.io/" rel="noopener noreferrer"&gt;DreamBooth project page&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;:::&lt;/p&gt;




&lt;h2&gt;
  
  
  Tomorrow's preview: Day 6
&lt;/h2&gt;

&lt;p&gt;Day 6: still undecided. Decision tomorrow morning.&lt;/p&gt;




&lt;p&gt;#100ExperimentsWithDGX #LocalLLM #LoRA #StableDiffusion&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>lora</category>
    </item>
    <item>
      <title>[Day 4] I Had a Local AI Sort Through 25,000 Photos on My iPhone</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Thu, 07 May 2026 23:58:45 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-4-i-had-a-local-ai-sort-through-25000-photos-on-my-iphone-545p</link>
      <guid>https://dev.to/peppercorn_llm/day-4-i-had-a-local-ai-sort-through-25000-photos-on-my-iphone-545p</guid>
      <description>&lt;h1&gt;
  
  
  [Day 4] I Had a Local AI Sort Through 25,000 Photos on My iPhone
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Day 4: I'm going to hand the 25,000 photos sitting on my iPhone over to a local AI for sorting.&lt;/p&gt;

&lt;p&gt;This is experiment #4.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What I'm using today: DGX Spark + &lt;a href="https://github.com/openai/CLIP" rel="noopener noreferrer"&gt;CLIP&lt;/a&gt; (image-understanding AI from OpenAI) + &lt;a href="https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct" rel="noopener noreferrer"&gt;Qwen2-VL&lt;/a&gt; (a vision-language model that can chat about images, from Alibaba).&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Today's setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt;: 25,382 photos and videos sitting on my iPhone (96 GB).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Have AI find unnecessary photos so I can drop my phone storage subscription.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approach&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stage 1&lt;/strong&gt;: Quickly classify all 25K with CLIP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage 2&lt;/strong&gt;: Have Qwen2-VL (a VLM) grade CLIP's classifications.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Comparison axis&lt;/strong&gt;: Lightweight + fast classifier (CLIP) vs. heavyweight + smart conversational AI (VLM).&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom line&lt;/strong&gt;: Overall agreement of &lt;strong&gt;84.5%&lt;/strong&gt; when the VLM grades CLIP's classifications. &lt;strong&gt;People detection: 99.2%&lt;/strong&gt; — only 59 misses out of 7,195 photos. Documents and screenshots ended up wrong about half the time. Oh, and I gave up midway and just dumped everything into Amazon Photos because I'd just learned Prime members get unlimited photo storage. Five years a Prime member, never knew.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔧 Steps
&lt;/h2&gt;

&lt;p&gt;Big picture flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iPhone
   ↓ ① Sync via iCloud for Windows
myPC1 (Windows)
   ↓ ② scp transfer to DGX (96 GB)
DGX (Linux)
   ├─ ③ Split photos and videos by extension
   │     └ Photos 24,497 / Videos 884
   ├─ ④ Classify with CLIP (~20 min)
   │     └ Sorted into 8 categories
   └─ ⑤ Have VLM grade "is this category right?" (~3 hours)
         └ Overall agreement: 84.5%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's walk through each step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting photos onto the DGX (the biggest hurdle)
&lt;/h3&gt;

&lt;p&gt;iPhone → myPC1 (a Windows laptop I use day-to-day) → DGX, a two-leg relay.&lt;/p&gt;

&lt;p&gt;The first leg started at &lt;strong&gt;0.5 MB/s&lt;/strong&gt;, with the ETA showing "6 days." After realizing my Wi-Fi was the bottleneck, I switched to wired LAN, fixed the hostname-resolution path, and got it up to &lt;strong&gt;80 MB/s (~160x faster)&lt;/strong&gt;. Burned half a day. More technical details in the collapsible section below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Splitting photos and videos
&lt;/h3&gt;

&lt;p&gt;The 25,382 transferred files broke down like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Extension&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HEIC&lt;/td&gt;
&lt;td&gt;13,107&lt;/td&gt;
&lt;td&gt;Photo (Apple's format)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JPG / JPEG&lt;/td&gt;
&lt;td&gt;10,721&lt;/td&gt;
&lt;td&gt;Photo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PNG&lt;/td&gt;
&lt;td&gt;660&lt;/td&gt;
&lt;td&gt;Photo (mostly screenshots)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WEBP&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Photo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MOV&lt;/td&gt;
&lt;td&gt;799&lt;/td&gt;
&lt;td&gt;Video&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MP4&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;Video&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ini&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;System file (ignored)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I had Claude write a small script that splits photos and videos into separate folders by extension (one command, takes a few minutes — details in the collapsible section).&lt;/p&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Photos: &lt;strong&gt;24,497&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Videos: 884&lt;/li&gt;
&lt;li&gt;Photos are the focus from here.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What is CLIP?
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;CLIP&lt;/strong&gt; is an image-understanding AI from OpenAI, apparently. You hand it a photo along with several text labels ("a cat", "a landscape", "a screenshot"), and it returns a &lt;strong&gt;similarity score&lt;/strong&gt; for each. Lightweight and fast is its specialty, supposedly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Stage 1: Classifying all 25K photos with CLIP
&lt;/h3&gt;

&lt;p&gt;I set up 8 categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trash candidates&lt;/strong&gt;: screenshot / document / blank&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep&lt;/strong&gt;: food / landscape / other&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review&lt;/strong&gt;: people / cat&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each category, I prepared multiple English captions (e.g., "a screenshot of an app", "a photo of a cat") and used the maximum similarity. Also: &lt;strong&gt;anything below 0.5 confidence goes into the uncertain bucket&lt;/strong&gt; for manual review.&lt;/p&gt;

&lt;p&gt;Batch size 64, ~20 minutes of GPU time, all done. Results in the next section!&lt;/p&gt;

&lt;h3&gt;
  
  
  The "How accurate is it?" question
&lt;/h3&gt;

&lt;p&gt;CLIP did the classification, but &lt;strong&gt;how accurate is it really?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Normally you'd verify by manual inspection, but &lt;strong&gt;eyeballing 25,000 photos is not realistic&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I decided to have &lt;strong&gt;a smarter AI grade CLIP's classifications&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a VLM?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;VLM (Vision-Language Model)&lt;/strong&gt; is an AI that can hold a conversation about images, apparently.&lt;/p&gt;

&lt;p&gt;How it differs from CLIP:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;CLIP&lt;/th&gt;
&lt;th&gt;VLM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it does&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Category classification (returns probabilities)&lt;/td&gt;
&lt;td&gt;Can &lt;strong&gt;describe&lt;/strong&gt; image content in natural language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Smartness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lightweight, fast, coarse&lt;/td&gt;
&lt;td&gt;Heavy, slow, smart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~400 MB&lt;/td&gt;
&lt;td&gt;~16 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I picked &lt;strong&gt;Qwen2-VL 7B Instruct&lt;/strong&gt; (Alibaba). Apache 2.0 licensed for commercial use, no Hugging Face authentication required for download — those were the selection criteria.&lt;/p&gt;

&lt;p&gt;The plan: ask the VLM "is this a screenshot? answer yes or no" for each image and record the answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2: Grading all 25K photos with VLM
&lt;/h3&gt;

&lt;p&gt;Started at &lt;strong&gt;16 seconds per image&lt;/strong&gt; (~5 days for the full set). The cause was image size — resizing to 448px on the short side dropped it to &lt;strong&gt;0.3 sec/image (~54x faster)&lt;/strong&gt;. Even with one-image-at-a-time inference, the full set takes ~2-3 hours.&lt;/p&gt;

&lt;p&gt;Started before bed, woke up to 24,496 graded results.&lt;/p&gt;
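&lt;p&gt;The resize step and the run-time estimates above can be sketched with plain arithmetic. This is a minimal sketch, not the script I actually ran: the function names are made up for illustration, and the 4032x3024 input is an assumed size for a typical phone photo.&lt;/p&gt;

```python
def short_side_target(width, height, short_side=448):
    """Return (new_width, new_height) with the shorter edge scaled to short_side.

    Aspect ratio is preserved; images already at or below the target are left alone.
    """
    shorter = min(width, height)
    if short_side >= shorter:
        return width, height  # never upscale
    scale = short_side / shorter
    return round(width * scale), round(height * scale)


def full_run_hours(num_images, seconds_per_image):
    """Estimated wall-clock hours for one-image-at-a-time inference."""
    return num_images * seconds_per_image / 3600


print(short_side_target(4032, 3024))  # (597, 448), assumed 12 MP input
print(round(full_run_hours(24496, 16) / 24, 1), "days at 16 s/image")  # 4.5
print(round(full_run_hours(24496, 0.3), 1), "hours at 0.3 s/image")    # 2.0
```

&lt;p&gt;The resulting size can then be handed to whatever image library does the actual resize before the image goes to the VLM.&lt;/p&gt;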




&lt;h2&gt;
  
  
  📊 Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CLIP's classification results
&lt;/h3&gt;

&lt;p&gt;After CLIP processed 24,496 photos, the distribution looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;private-data/iphone-photos-classified/
├── _trash-candidate/      Trash candidates
│   ├── screenshots/    (981)
│   ├── documents/    (1,804)
│   └── blank/           (59)
├── _review/                Review
│   ├── people/       (7,195)
│   ├── cat/          (1,009)
│   └── uncertain/    (7,700)
└── _keep/                  Keep
    ├── food/         (1,682)
    ├── landscape/    (1,991)
    └── other/        (2,075)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Share&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;people&lt;/td&gt;
&lt;td&gt;7,195&lt;/td&gt;
&lt;td&gt;29.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;uncertain (low confidence)&lt;/td&gt;
&lt;td&gt;7,700&lt;/td&gt;
&lt;td&gt;31.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;other&lt;/td&gt;
&lt;td&gt;2,075&lt;/td&gt;
&lt;td&gt;8.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;landscape&lt;/td&gt;
&lt;td&gt;1,991&lt;/td&gt;
&lt;td&gt;8.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;document&lt;/td&gt;
&lt;td&gt;1,804&lt;/td&gt;
&lt;td&gt;7.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;food&lt;/td&gt;
&lt;td&gt;1,682&lt;/td&gt;
&lt;td&gt;6.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cat&lt;/td&gt;
&lt;td&gt;1,009&lt;/td&gt;
&lt;td&gt;4.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;screenshot&lt;/td&gt;
&lt;td&gt;981&lt;/td&gt;
&lt;td&gt;4.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;blank&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;0.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's a lot of cat photos...&lt;/p&gt;

&lt;p&gt;Let's see how CLIP actually judged some of these.&lt;/p&gt;

&lt;h4&gt;
  
  
  🎯 Big wins
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmanel40irqlut2qjmg29.jpg" alt="cat" width="800" height="1067"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8wj5qdxday12val3ffc.jpg" alt="food" width="800" height="600"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhhxkvszzn2qau4moc4w.jpg" alt="screenshot" width="800" height="1067"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh02jazjunc37kcau34z9.jpg" alt="landscape" width="800" height="600"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;My cat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;A meal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;App screenshot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Mountain (landscape)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cat &lt;strong&gt;0.97&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;food &lt;strong&gt;0.999&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;screenshot &lt;strong&gt;0.74&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;landscape &lt;strong&gt;0.98&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CLIP nailed the cat without hesitation, food at 0.999, screenshots and landscapes too. Reliable.&lt;/p&gt;

&lt;h4&gt;
  
  
  ✨ Subtly impressive recognition
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv4k8lfjqi4w8q5dpif7.jpg" alt="keychain" width="800" height="1422"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6iyz231ir399ggft9tsc.jpg" alt="coffee" width="800" height="600"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cat keychain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Close-up of coffee beans&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cat &lt;strong&gt;0.64&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;food &lt;strong&gt;0.53&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Even the keychain got recognized as "cat." And coffee beans up close as "food." Quietly impressive.&lt;/p&gt;

&lt;h4&gt;
  
  
  🤔 Funny misclassifications (CLIP's quirks)
&lt;/h4&gt;

&lt;p&gt;Browsing thumbnails by category, some interesting patterns emerged.&lt;/p&gt;

&lt;h5&gt;
  
  
  Food edition: "Trash sorting chart" beats "homemade cake" for being food-like
&lt;/h5&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2n0aw55oh1vtxogal5g.jpg" alt="cake" width="800" height="1067"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddkhzur3sde5y4nhbd57.jpg" alt="trash chart" width="800" height="600"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;My homemade cake&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Trash sorting chart&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;food &lt;strong&gt;0.57&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;food &lt;strong&gt;0.83&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both ended up in the "food" category. Apparently &lt;strong&gt;the trash sorting chart looks more food-like to CLIP than my homemade cake&lt;/strong&gt;. Reacting to the text? The table layout? Mystery.&lt;/p&gt;

&lt;h5&gt;
  
  
  People edition: "A doodle" beats "Mona Lisa" for being people-like
&lt;/h5&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqgqjays89q846vhw966.jpg" alt="Mona Lisa" width="800" height="1067"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhwg26apqs72z6ni99k9.jpg" alt="doodle" width="800" height="1067"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The Mona Lisa&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;A face I doodled myself&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;people &lt;strong&gt;0.50&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;people &lt;strong&gt;0.52&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both in the "people" category. &lt;strong&gt;My crappy doodle edges out da Vinci's Mona Lisa for being more "people-like"&lt;/strong&gt; (just barely).&lt;/p&gt;

&lt;p&gt;CLIP's quirks — kind of charming.&lt;/p&gt;




&lt;h3&gt;
  
  
  VLM's grading results
&lt;/h3&gt;

&lt;p&gt;I asked the VLM, one photo at a time, whether CLIP's category was correct. For example, photos in the cat folder got "is this a cat?", food folder got "is this food?" — yes/no answers.&lt;/p&gt;
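&lt;p&gt;A rough sketch of the question-and-answer handling around the VLM call. The question templates, the fallback wording, and both function names are hypothetical (the post only quotes the "is this a cat?" style), and the actual model call is left out.&lt;/p&gt;

```python
# Hypothetical question templates; only the "is this a cat?" style comes from the post.
QUESTIONS = {
    "screenshot": "Is this a screenshot?",
    "document": "Is this a document?",
    "people": "Does this photo show a person or people?",
    "cat": "Is this a cat?",
    "food": "Is this food?",
    "landscape": "Is this a landscape?",
}


def build_question(category):
    """Build the yes/no question for the folder CLIP put the photo in."""
    base = QUESTIONS.get(category, f"Is this best described as '{category}'?")
    return base + " Answer yes or no."


def parse_yes_no(answer):
    """Map a free-text reply to True/False, or None when it is neither."""
    words = answer.strip().lower().replace(".", " ").replace(",", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None


print(build_question("cat"))           # Is this a cat? Answer yes or no.
print(parse_yes_no("Yes."))            # True
print(parse_yes_no("no, it is a dog")) # False
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; for anything that is not a clear yes or no keeps odd replies out of the agreement numbers instead of silently counting them as misses.&lt;/p&gt;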

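&lt;p&gt;Summary by final destination bucket (the OVERALL row also counts the 7,700 &lt;code&gt;uncertain&lt;/code&gt; photos, which have no row of their own here):&lt;/p&gt;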

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Final bucket&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;VLM agreement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;people&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7,195&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;99.2%&lt;/strong&gt; 🎯&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;food&lt;/td&gt;
&lt;td&gt;1,682&lt;/td&gt;
&lt;td&gt;95.3% 🎯&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cat&lt;/td&gt;
&lt;td&gt;1,009&lt;/td&gt;
&lt;td&gt;95.0% 🎯&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;other&lt;/td&gt;
&lt;td&gt;2,075&lt;/td&gt;
&lt;td&gt;93.6% 🎯&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;landscape&lt;/td&gt;
&lt;td&gt;1,991&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;83.5%&lt;/strong&gt; ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;screenshot&lt;/td&gt;
&lt;td&gt;981&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;75.2%&lt;/strong&gt; ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;document&lt;/td&gt;
&lt;td&gt;1,804&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;67.4%&lt;/strong&gt; ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;blank&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;52.5% ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OVERALL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24,496&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.5%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;People detection at &lt;strong&gt;99.2%&lt;/strong&gt; is quietly amazing. Out of 7,195 photos, the VLM said "no" to only &lt;strong&gt;59&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Documents and screenshots, on the other hand, came back "no" far more often: about a third of the documents and a quarter of the screenshots. CLIP-only confidence isn't enough for those. Out of 24,496 photos, &lt;strong&gt;3,808&lt;/strong&gt; got a "no" from the VLM. That's the part CLIP alone wouldn't have caught.&lt;/p&gt;
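&lt;p&gt;As a sanity check, the overall figure can be rebuilt from the per-bucket numbers. The sketch below copies the counts and percentages from the tables in this post, including the 7,700 uncertain photos at 70.1% from the confidence breakdown. Because the percentages are already rounded, the weighted average is only expected to land within a rounding error of 84.5%.&lt;/p&gt;

```python
# (count, VLM agreement in percent) per bucket, copied from the tables in this post.
BUCKETS = {
    "people": (7195, 99.2),
    "food": (1682, 95.3),
    "cat": (1009, 95.0),
    "other": (2075, 93.6),
    "landscape": (1991, 83.5),
    "screenshot": (981, 75.2),
    "document": (1804, 67.4),
    "blank": (59, 52.5),
    "uncertain": (7700, 70.1),  # taken from the confidence breakdown
}

total = sum(count for count, _ in BUCKETS.values())
agreed = sum(count * pct / 100 for count, pct in BUCKETS.values())
overall = 100 * agreed / total

print(total)  # 24496
print(f"weighted agreement: {overall:.2f}%")
# Cross-check from the raw "no" count quoted in the post.
print(round(100 * (1 - 3808 / total), 1))  # 84.5
```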




&lt;h2&gt;
  
  
  💡 Today's discoveries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Multimodal AI runs at home
&lt;/h3&gt;

&lt;p&gt;Both CLIP (400 MB, classifier) and Qwen2-VL (16 GB, conversational) ran fine on my home machine. Reassuring.&lt;/p&gt;

&lt;h3&gt;
  
  
  CLIP's confidence is a reliable signal
&lt;/h3&gt;

&lt;p&gt;VLM agreement broken down by CLIP confidence:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CLIP confidence&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;VLM agreement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.9+ (super confident)&lt;/td&gt;
&lt;td&gt;3,555&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;96.5%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.7–0.9&lt;/td&gt;
&lt;td&gt;6,285&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.5–0.7&lt;/td&gt;
&lt;td&gt;6,956&lt;/td&gt;
&lt;td&gt;86.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt;0.5 (uncertain)&lt;/td&gt;
&lt;td&gt;7,700&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;70.1%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Boring but important: &lt;strong&gt;the more confident CLIP is, the more often the VLM agrees&lt;/strong&gt;. Agreement climbs steadily from 70.1% in the uncertain bucket to 96.5% at 0.9+.&lt;/p&gt;
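&lt;p&gt;A breakdown like the table above can be produced by binning each photo's CLIP confidence and averaging the VLM's yes/no answers per bin. This is a minimal sketch: the record layout and function names are assumptions, and the toy records are made up for illustration, not taken from my data.&lt;/p&gt;

```python
from collections import defaultdict

# Lower bin edges matching the table; each record is (clip_confidence, vlm_said_yes).
BINS = [(0.9, "0.9+"), (0.7, "0.7-0.9"), (0.5, "0.5-0.7"), (0.0, "below 0.5")]


def bin_label(confidence):
    """Return the label of the first bin whose lower edge the confidence reaches."""
    for lower, label in BINS:
        if confidence >= lower:
            return label
    return BINS[-1][1]


def agreement_by_bin(records):
    """Fraction of 'yes' answers from the VLM within each confidence bin."""
    yes = defaultdict(int)
    total = defaultdict(int)
    for confidence, vlm_yes in records:
        label = bin_label(confidence)
        total[label] += 1
        yes[label] += int(vlm_yes)
    return {label: yes[label] / total[label] for label in total}


# Made-up records purely for illustration.
toy = [(0.97, True), (0.95, True), (0.74, True), (0.72, False), (0.41, False), (0.45, True)]
print(agreement_by_bin(toy))
```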

&lt;h3&gt;
  
  
  CLIP's weak spots
&lt;/h3&gt;

&lt;p&gt;Things that &lt;strong&gt;clearly appear in photos&lt;/strong&gt; (people, food, cats, objects) score 93-99%. Abstract or compound subjects (documents, screenshots, landscapes) drop to 67-84%.&lt;/p&gt;

&lt;p&gt;Documents at 67.4% in particular. That's where VLM re-grading earns its keep.&lt;/p&gt;

&lt;h3&gt;
  
  
  Role split: lightweight model × smart model
&lt;/h3&gt;

&lt;p&gt;Use CLIP to triage everything quickly, VLM to grade the suspicious cases — a two-layer setup. &lt;strong&gt;Best of both worlds in speed and accuracy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Day 3 had the same pattern: &lt;strong&gt;"aggregation = tools, interpretation = AI."&lt;/strong&gt; Today's variant: &lt;strong&gt;"rough sorting = CLIP, accuracy check = VLM."&lt;/strong&gt; Picking the right AI for the right task pays off in both performance and cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Input quality matters more than model size" struck again
&lt;/h3&gt;

&lt;p&gt;In Day 3 (credit card analysis), I learned &lt;strong&gt;"input quality &amp;gt; model size."&lt;/strong&gt; The same pattern showed up today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VLM with &lt;strong&gt;original-resolution images&lt;/strong&gt;: 16 sec/image (5 days for full run)&lt;/li&gt;
&lt;li&gt;VLM with &lt;strong&gt;resized 448px images&lt;/strong&gt;: 0.3 sec/image (2 hours)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just by tidying up the input, &lt;strong&gt;54x speedup&lt;/strong&gt; — small change, huge impact.&lt;/p&gt;

&lt;p&gt;Not "biggest model possible" or "raw original" — &lt;strong&gt;clean up the input before sending it to the AI&lt;/strong&gt;. This worked in Day 3 and Day 4 in a row.&lt;/p&gt;

&lt;h3&gt;
  
  
  Heart broken, switched to Amazon Photos
&lt;/h3&gt;

&lt;p&gt;I tried to verify the trash candidate folder, then realized I'd need to cross-reference VLM scores too, then realized &lt;strong&gt;I never set clear criteria for "what to delete" in the first place&lt;/strong&gt;. Couldn't finalize the cleanup, and morale broke.&lt;/p&gt;

&lt;p&gt;Right then I learned that &lt;strong&gt;Amazon Prime members get unlimited photo storage&lt;/strong&gt;, so I just dumped everything into Amazon Photos. Lol.&lt;/p&gt;

&lt;p&gt;That said, I really should have &lt;strong&gt;defined the deletion criteria&lt;/strong&gt; before starting.&lt;/p&gt;

&lt;p&gt;The classified data on the DGX is a useful resource for future Day experiments.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ How I actually did this
&lt;/h2&gt;

&lt;p&gt;:::details Wi-Fi 0.5 MB/s → wired LAN 80 MB/s journey&lt;/p&gt;

&lt;p&gt;Copying the 96 GB from myPC1 to the DGX started at &lt;strong&gt;236 KB/s&lt;/strong&gt; via WinSCP (ETA: 6 days). The first culprit was myPC1 being on Wi-Fi.&lt;/p&gt;

&lt;p&gt;I plugged the PC into the router with a LAN cable → ping dropped close to 0 ms. But WinSCP was still stuck at &lt;strong&gt;500 KB/s&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;ping spark-XXXX.local&lt;/code&gt; in PowerShell revealed that the hostname resolved to the &lt;strong&gt;DGX's Wi-Fi-side IP&lt;/strong&gt;. The DGX was dual-homed (wired + Wi-Fi), and mDNS was still handing out the Wi-Fi address, so traffic kept going over the slow path.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Failure (routes through Wi-Fi)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;scp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-r&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C:\Users\[user]\Pictures\iCloud Photos\Photos"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="nx"&gt;spark-XXXX.local:...&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Success (direct IP over wired LAN)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;scp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-r&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C:\Users\[user]\Pictures\iCloud Photos\Photos"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="nx"&gt;10.0.0.205:...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Switched from hostname to &lt;strong&gt;explicit IP&lt;/strong&gt; and watched it scream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;IMG_0190.HEIC                100% 1812KB  84.3MB/s   00:00
IMG_0190.MOV                 100%   17MB 102.4MB/s   00:00
IMG_0192.HEIC                100% 2256KB  81.6MB/s   00:00
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also discovered WinSCP (SFTP-based) struggles with many small files, while &lt;strong&gt;scp (stream transfer) is much faster&lt;/strong&gt;. With 25,382 files, scp won by a landslide.&lt;/p&gt;
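&lt;p&gt;The same "which address does this hostname actually resolve to?" check can be scripted instead of eyeballing &lt;code&gt;ping&lt;/code&gt;. A small sketch using only the standard library; the function names are made up, and in my case the arguments would be the &lt;code&gt;spark-XXXX.local&lt;/code&gt; hostname and the wired IP &lt;code&gt;10.0.0.205&lt;/code&gt;. The demo line uses the loopback address so it runs anywhere.&lt;/p&gt;

```python
import socket


def resolved_addresses(hostname):
    """Return the distinct IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def resolves_to(hostname, expected_ip):
    """True when expected_ip is among the addresses the hostname resolves to."""
    return expected_ip in resolved_addresses(hostname)


# Real usage would look like: resolves_to("spark-XXXX.local", "10.0.0.205")
print(resolves_to("127.0.0.1", "127.0.0.1"))  # True
```

&lt;p&gt;If the wired address is missing from the list, passing the IP explicitly to &lt;code&gt;scp&lt;/code&gt; (as above) sidesteps the resolver entirely.&lt;/p&gt;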

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details Splitting photos and videos by extension&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PHOTO_EXTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.heic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.heif&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.webp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;VIDEO_EXTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.mov&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.m4v&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;input_dir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rglob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_file&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="n"&gt;ext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ext&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;PHOTO_EXTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;move&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;photos_out&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;ext&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;VIDEO_EXTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;move&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;videos_out&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple. Caught one snag: right after transfer, the directory permission was &lt;code&gt;dr-x------&lt;/code&gt; (read-only), so the first &lt;code&gt;shutil.move&lt;/code&gt; died with &lt;code&gt;PermissionError&lt;/code&gt;. &lt;code&gt;chmod u+w&lt;/code&gt; fixed it.&lt;/p&gt;
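&lt;p&gt;For anyone who wants to run the excerpt above as is, here is a self-contained variant. It is a sketch rather than the exact script: the function name &lt;code&gt;split_media&lt;/code&gt;, the returned counts, and the skip-on-name-collision behavior are my additions, and the demo runs against a throwaway temp folder with made-up file names.&lt;/p&gt;

```python
import shutil
import tempfile
from pathlib import Path

PHOTO_EXTS = {".jpg", ".jpeg", ".heic", ".heif", ".png", ".webp"}
VIDEO_EXTS = {".mov", ".mp4", ".m4v"}


def split_media(input_dir, photos_out, videos_out):
    """Move photos and videos under input_dir into two flat folders; return counts."""
    input_dir, photos_out, videos_out = Path(input_dir), Path(photos_out), Path(videos_out)
    photos_out.mkdir(parents=True, exist_ok=True)
    videos_out.mkdir(parents=True, exist_ok=True)
    counts = {"photo": 0, "video": 0, "skipped": 0}
    # Snapshot the listing first so moving files does not disturb the walk.
    for src in sorted(p for p in input_dir.rglob("*") if p.is_file()):
        ext = src.suffix.lower()
        if ext in PHOTO_EXTS:
            kind, dest_dir = "photo", photos_out
        elif ext in VIDEO_EXTS:
            kind, dest_dir = "video", videos_out
        else:
            counts["skipped"] += 1  # e.g. the lone .ini system file
            continue
        dest = dest_dir / src.name
        if dest.exists():
            counts["skipped"] += 1  # same file name from another subfolder; keep both
            continue
        shutil.move(str(src), str(dest))
        counts[kind] += 1
    return counts


# Demo on a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "in" / "sub").mkdir(parents=True)
    for name in ("IMG_0001.HEIC", "IMG_0001.MOV", "desktop.ini"):
        (root / "in" / "sub" / name).write_bytes(b"")
    demo_counts = split_media(root / "in", root / "photos", root / "videos")
print(demo_counts)  # {'photo': 1, 'video': 1, 'skipped': 1}
```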

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details CLIP classification script&lt;/p&gt;

&lt;p&gt;Used &lt;code&gt;transformers&lt;/code&gt; to load &lt;code&gt;openai/clip-vit-base-patch32&lt;/code&gt;. For each category, multiple captions are prepared, and the max softmax score is used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CATEGORIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a screenshot of an app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a phone screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a screenshot of a website or chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a document or paper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a receipt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a QR code or barcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of an ID card or driver&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s license&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;people&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a person&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of people&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a portrait of someone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a cat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;food&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of food or a meal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;landscape&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a landscape or scenery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a building or city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;other&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of an object or item&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text_prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logits_per_image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anything below 0.5 confidence goes into &lt;code&gt;_review/uncertain/&lt;/code&gt;. Near-black/near-white images get caught by a brightness check and routed to &lt;code&gt;_trash-candidate/blank/&lt;/code&gt; before they reach CLIP.&lt;/p&gt;

&lt;p&gt;All per-image category scores are also saved to JSON. That JSON is what the VLM evaluation step consumes later.&lt;/p&gt;
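&lt;p&gt;The routing rule described above can be sketched roughly like this. It is a minimal illustration, not the actual script: the function name &lt;code&gt;route&lt;/code&gt;, the two brightness cutoffs, and the per-category destination folder are all assumptions; only the 0.5 threshold and the two folder names come from the text.&lt;/p&gt;

```python
from statistics import mean

CONFIDENCE_THRESHOLD = 0.5  # from the text above
DARK_LIMIT = 10             # assumed cutoff on a 0-255 grayscale scale
BRIGHT_LIMIT = 245          # assumed cutoff on a 0-255 grayscale scale


def route(gray_pixels, category_probs):
    """Pick a destination folder for one image (hypothetical helper).

    gray_pixels: sequence of 0-255 grayscale values for the image.
    category_probs: dict mapping category name to its CLIP probability.
    """
    brightness = mean(gray_pixels)
    # Blank check comes first; in a real pipeline it would run before CLIP.
    if DARK_LIMIT > brightness or brightness > BRIGHT_LIMIT:
        return "_trash-candidate/blank/"
    best = max(category_probs, key=category_probs.get)
    # Low-confidence winners go to manual review instead of a category folder.
    if CONFIDENCE_THRESHOLD > category_probs[best]:
        return "_review/uncertain/"
    return best + "/"  # assumed layout: one folder per category
```

&lt;p&gt;Keeping the thresholds as named constants makes it easy to tune them after eyeballing what lands in the review folder.&lt;/p&gt;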

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details The 54x speedup from image resizing for VLM&lt;/p&gt;

&lt;p&gt;Qwen2-VL's vision token count scales with input resolution. Original-size images (several thousand pixels on a side) consume hundreds to thousands of tokens, slowing inference dramatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen2-VL-7B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;min_pixels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;224&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_pixels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;448&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;448&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ← cap here
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Belt and suspenders — also pre-resize the image
&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ImageOps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exif_transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;thumbnail&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;448&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;448&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That cut inference time from 16 sec/image to &lt;strong&gt;0.3 sec/image&lt;/strong&gt;.&lt;/p&gt;
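&lt;p&gt;To get a feel for why the cap matters, here is a rough back-of-the-envelope estimate. It assumes roughly one vision token per 28x28-pixel area, which is my reading of the Qwen2-VL model card; it ignores resizing to multiples of 28 and any special tokens, so treat the numbers as approximate. The function name is made up for illustration.&lt;/p&gt;

```python
# Assumption: about one vision token per 28x28-pixel area (approximation only).
PIXELS_PER_TOKEN = 28 * 28


def approx_vision_tokens(width, height):
    """Rough vision-token estimate for an image of the given pixel size."""
    return (width * height) // PIXELS_PER_TOKEN


if __name__ == "__main__":
    # The capped size from the snippet above vs. a typical 12-megapixel photo.
    print("448 x 448  :", approx_vision_tokens(448, 448))
    print("4032 x 3024:", approx_vision_tokens(4032, 3024))
```

&lt;p&gt;Under that assumption the 448x448 cap works out to a few hundred tokens, while an uncapped phone photo would be in the thousands.&lt;/p&gt;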

&lt;p&gt;The verification prompt is dead simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CATEGORY_PROMPTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Is this image a screenshot of a phone screen, an app, or a website? Answer with one word: yes or no.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Is this image primarily a document, receipt, ID card, or QR code? Answer with one word: yes or no.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;people&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Does this image clearly show one or more human persons? Answer with one word: yes or no.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;max_new_tokens=5&lt;/code&gt; caps generation at five tokens, so combined with the one-word instruction in the prompt, only a yes/no comes back. Minimal design.&lt;/p&gt;
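&lt;p&gt;Even with a five-token cap, the reply can come back as "Yes.", " no", or something unexpected, so it helps to normalize it before trusting it. A minimal sketch follows; &lt;code&gt;parse_yes_no&lt;/code&gt; is a hypothetical helper, not part of the original script.&lt;/p&gt;

```python
def parse_yes_no(generated_text):
    """Return True for yes, False for no, None when the reply is unusable."""
    words = generated_text.strip().lower().split()
    if not words:
        return None
    # Drop trailing punctuation such as "Yes." or "no," before comparing.
    first = words[0].strip(".,!\"'")
    if first == "yes":
        return True
    if first == "no":
        return False
    return None
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; for anything else lets the caller send those images to manual review rather than guessing.&lt;/p&gt;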

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details Resumable checkpointing&lt;/p&gt;

&lt;p&gt;Running 24,000 images for 3 hours straight, you really want recovery if something hiccups:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CHECKPOINT_INTERVAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;todo&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ... inference ...
&lt;/span&gt;    &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;CHECKPOINT_INTERVAL&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;save_checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a &lt;code&gt;--resume&lt;/code&gt; flag that picks up where the JSON left off:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resume&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_file&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Resumed from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; existing entries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;todo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clip_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Essential for any overnight job.&lt;/p&gt;
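&lt;p&gt;The snippets above call &lt;code&gt;save_checkpoint(results, output_path)&lt;/code&gt; without showing it. One way it could be written is sketched below. This is an assumption, not the original helper: it writes to a temporary file and then swaps it in with &lt;code&gt;os.replace&lt;/code&gt;, so a crash mid-write should not leave a half-written JSON behind.&lt;/p&gt;

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(results, output_path):
    """Write results as JSON, replacing the old file in a single step."""
    output_path = Path(output_path)
    # Temp file lives in the same directory so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False)
    os.replace(tmp_name, output_path)
```

&lt;p&gt;Since the resume logic trusts whatever JSON is on disk, keeping that file always loadable is the main design goal here.&lt;/p&gt;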

&lt;p&gt;:::&lt;/p&gt;




&lt;h2&gt;
  
  
  Next up: Day 5
&lt;/h2&gt;

&lt;p&gt;Tomorrow: &lt;strong&gt;have an AI analyze a year of my Amazon purchase history&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Switching to Amazon Photos for storage made me realize Amazon also has my entire purchase history. &lt;strong&gt;What if I asked AI "what kind of person am I, based on this?"&lt;/strong&gt; — see what patterns emerge that I never noticed myself.&lt;/p&gt;

&lt;p&gt;To be continued ＞＞＞&lt;/p&gt;




&lt;p&gt;#100ExperimentsWithDGX #LocalLLM #ImageClassification #CLIP&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>clip</category>
    </item>
    <item>
      <title>[Day 3] I Had a Local LLM Analyze a Year of My Credit Card Statements</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Tue, 05 May 2026 22:52:50 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-3-i-had-a-local-llm-analyze-a-year-of-my-credit-card-statements-4eab</link>
      <guid>https://dev.to/peppercorn_llm/day-3-i-had-a-local-llm-analyze-a-year-of-my-credit-card-statements-4eab</guid>
      <description>&lt;h1&gt;
  
  
  [Day 3] I Had a Local LLM Analyze a Year of My Credit Card Statements
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Day 3: I'm going to hand a year of credit card statements over to a local LLM and see what it can do.&lt;/p&gt;

&lt;p&gt;This is experiment #3.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What I'm using today: DGX Spark + &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; + &lt;a href="https://qwenlm.github.io/" rel="noopener noreferrer"&gt;Qwen2.5&lt;/a&gt; (comparing 7B vs 72B). Ollama is the de facto local-LLM runtime, and Qwen2.5 is a multilingual model from Alibaba (China) that apparently handles Japanese reasonably well.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Today's setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt;: 12 months of credit card statements from a single card.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: 383 transactions, ¥2,761,555 in total spend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: get the AI to spot waste patterns and propose savings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comparison axes&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model size&lt;/strong&gt;: 7B (light) vs 72B (heavy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input format&lt;/strong&gt;: raw CSV vs pandas-aggregated summary&lt;/li&gt;
&lt;li&gt;→ &lt;strong&gt;4 patterns&lt;/strong&gt; total&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: "If you ask an AI to aggregate raw data, the numbers come out way off." / "If you pre-aggregate with a spreadsheet tool first and then feed the AI, you get fast and accurate results." A small but practical finding.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Get the CSVs onto the DGX
&lt;/h2&gt;

&lt;p&gt;Log into the credit card company's web statements page on myPC1 (my Windows laptop), download 12 months of CSVs, then push them to the DGX.&lt;/p&gt;

&lt;p&gt;I deliberately skipped GitHub for the transfer this time: once you push something, it's in the history forever, and credit card data shouldn't be there even briefly. Instead, I used &lt;strong&gt;direct PC-to-PC transfer over SSH&lt;/strong&gt; (one command, finishes in seconds; details in the collapsibles at the end). The &lt;code&gt;.gitignore&lt;/code&gt; excludes &lt;code&gt;private-data/&lt;/code&gt; too, as a guard against accidental commits.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Install Ollama
&lt;/h2&gt;

&lt;p&gt;Ollama is the de facto runtime for local LLMs. One command should be enough.&lt;/p&gt;

&lt;p&gt;There was a small password hiccup during install (details below), but eventually it was up and running.&lt;/p&gt;

&lt;p&gt;The DGX Spark specs really show through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Memory: 121 GB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Default context window: ~262,144 tokens&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words: "throw a whole book at it, no problem" territory. Reassuring.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Two model sizes: Qwen2.5 7B vs 72B
&lt;/h2&gt;

&lt;p&gt;The strategy: &lt;strong&gt;same model family, different sizes&lt;/strong&gt;. That way the differences come from size, not architecture.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7B (light)&lt;/strong&gt;: ~4.7 GB, downloads in 5 minutes. Fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;72B (heavy)&lt;/strong&gt;: ~47 GB, 25 minutes to download. Slow but smart.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What does "B" mean?&lt;/strong&gt; Short for &lt;em&gt;Billion&lt;/em&gt;. It's the number of "weights" (parameters) inside the AI: the more weights, the more it can remember, basically. So &lt;strong&gt;7B has about 7 billion weights, 72B about 72 billion&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With both loaded onto the DGX at once, memory usage looks like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AI model&lt;/th&gt;
&lt;th&gt;Memory occupied&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:72b&lt;/td&gt;
&lt;td&gt;61 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:7b&lt;/td&gt;
&lt;td&gt;8.2 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;69 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;69 GB. Spacious!&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Prepping the CSVs
&lt;/h2&gt;

&lt;p&gt;Once I had the CSVs in hand, there were &lt;strong&gt;three small headaches&lt;/strong&gt; to sort out before they were ready for the AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Headache 1&lt;/strong&gt;: An older encoding (Windows Japanese flavor) → needs converting to modern UTF-8&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headache 2&lt;/strong&gt;: Some merchant names contain commas, which breaks naive CSV parsing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headache 3&lt;/strong&gt;: Each file has a "monthly total" line at the end that isn't actually data&lt;/li&gt;
&lt;/ul&gt;
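&lt;p&gt;The three fixes above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the actual cleanup script: it assumes each data row starts with a date and ends with the amount, that the legacy encoding is &lt;code&gt;cp932&lt;/code&gt;, and that header and "monthly total" lines do not start with a four-digit year. The function name is made up.&lt;/p&gt;

```python
def clean_statement(raw_bytes):
    """Turn one downloaded statement into (date, merchant, amount) tuples."""
    # Headache 1: decode the older Windows Japanese encoding (assumed cp932).
    text = raw_bytes.decode("cp932")
    rows = []
    for line in text.splitlines():
        parts = line.split(",")
        if 3 > len(parts):
            continue
        date, amount = parts[0].strip(), parts[-1].strip()
        # Headache 3: skip the header and the trailing total line (no leading year).
        if not date[:4].isdigit():
            continue
        # Headache 2: anything between date and amount is the merchant, commas included.
        merchant = ",".join(parts[1:-1]).strip()
        rows.append((date, merchant, int(amount)))
    return rows
```

&lt;p&gt;Re-joining the middle fields is a blunt fix for unquoted commas; if the real export has more columns, the slice would need adjusting.&lt;/p&gt;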

&lt;p&gt;Details in the collapsible. After cleanup, the 12 files merge into a single dataset:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transactions&lt;/td&gt;
&lt;td&gt;383&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Period&lt;/td&gt;
&lt;td&gt;12 months (1 year)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total spend&lt;/td&gt;
&lt;td&gt;¥2,761,555&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg per tx&lt;/td&gt;
&lt;td&gt;¥7,210&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Median per tx&lt;/td&gt;
&lt;td&gt;¥3,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Largest single tx&lt;/td&gt;
&lt;td&gt;¥209,283 (overseas flight)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smallest&lt;/td&gt;
&lt;td&gt;¥-3,980 (refund)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now to feed this to 7B and 72B and see what each of them says.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Experiment 1: Throw the raw CSV at the AI
&lt;/h2&gt;

&lt;p&gt;No tricks: &lt;strong&gt;all 383 rows, straight at the AI&lt;/strong&gt;. Prompt is the full ask: "As a household budget consultant, output category breakdown / monthly trend / waste patterns / savings suggestions / lifestyle hypothesis."&lt;/p&gt;

&lt;h3&gt;
  
  
  7B's answer (75 seconds)
&lt;/h3&gt;

&lt;p&gt;...this is where &lt;strong&gt;the numbers go wildly off&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;What 7B said&lt;/th&gt;
&lt;th&gt;Real data&lt;/th&gt;
&lt;th&gt;Match?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amazon total&lt;/td&gt;
&lt;td&gt;¥2,014,386 (257 tx)&lt;/td&gt;
&lt;td&gt;¥693,663 (166 tx)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Downloads&lt;/td&gt;
&lt;td&gt;¥2,014,386 (257 tx)&lt;/td&gt;
&lt;td&gt;¥80,323 (50 tx)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outdoor brand&lt;/td&gt;
&lt;td&gt;¥495,740&lt;/td&gt;
&lt;td&gt;¥154,820&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A local recreation venue&lt;/td&gt;
&lt;td&gt;"¥49,574" cited&lt;/td&gt;
&lt;td&gt;(a different small charge actually exists)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of the numbers line up. Amazon total is roughly 3× off, Amazon Downloads about 25× off, and the cited venue context is a different charge entirely.&lt;/p&gt;

&lt;p&gt;Reading 383 rows of CSV and computing totals turned out to be a heavy lift for the 7B model.&lt;/p&gt;

&lt;h3&gt;
  
  
  72B's answer (12m 9s)
&lt;/h3&gt;

&lt;p&gt;What if we throw size at the problem? After 12 minutes of patience:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;What 72B said&lt;/th&gt;
&lt;th&gt;Real data&lt;/th&gt;
&lt;th&gt;Match?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amazon total&lt;/td&gt;
&lt;td&gt;¥635,792 (104 tx)&lt;/td&gt;
&lt;td&gt;¥693,663 (166 tx)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI/dev tools&lt;/td&gt;
&lt;td&gt;¥193,629 (21 tx)&lt;/td&gt;
&lt;td&gt;¥176,850 (24 tx)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Travel&lt;/td&gt;
&lt;td&gt;¥487,555 (43 tx)&lt;/td&gt;
&lt;td&gt;¥416,268 (8 tx)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

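&lt;p&gt;Not exact, but the amounts are much closer: Amazon and AI/dev tools land within about 10% of the real totals and Travel is about 17% high, though the transaction counts are still well off. There are no fabricated venues, either. A real improvement.&lt;/p&gt;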
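&lt;p&gt;For anyone who wants to check how far off 72B was, the relative error for each row of the table above takes only a few lines to compute. The figures are copied from the table; the helper name is just for illustration.&lt;/p&gt;

```python
# (what 72B said, real total) in yen, copied from the table above
reported_vs_real = {
    "Amazon total": (635_792, 693_663),
    "AI/dev tools": (193_629, 176_850),
    "Travel": (487_555, 416_268),
}


def relative_error(said, real):
    """Signed error of the model's figure relative to the real total."""
    return (said - real) / real


if __name__ == "__main__":
    for item, (said, real) in reported_vs_real.items():
        print(f"{item}: {relative_error(said, real):+.1%}")
```

&lt;p&gt;Running this on the amounts gives errors of roughly 8% to 17%; the transaction counts are not covered here and are further off.&lt;/p&gt;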

&lt;p&gt;&lt;strong&gt;However — when asked about the monthly trend, here's what 72B said:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Month 1: ¥316,789 → Month 2: ¥229,600 → Month 3: ¥237,500 → ... → Month 12: ¥291,500&lt;br&gt;
(Gradually increasing.)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The actual range is ¥69,961 (low) to ¥493,072 (high) — a chaotic up-and-down waveform. "Gradually increasing" isn't quite right. Even 72B isn't great at aggregating distributed data over a long CSV.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Experiment 2: Aggregate first, then feed the AI
&lt;/h2&gt;

&lt;p&gt;If the AI struggles with aggregation, do the aggregation in a different tool first and only hand the AI the result.&lt;/p&gt;

&lt;p&gt;The flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📥 Raw CSV (22,132 chars, 383 rows)
       ↓
🔧 Pre-aggregate with a spreadsheet tool (Python's pandas)
       ↓
📋 Aggregate summary (1,884 chars, ~90% smaller)
       ↓
🤖 Hand it to the AI (let it interpret and propose)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Python's &lt;strong&gt;pandas&lt;/strong&gt; = a spreadsheet-like library for tabular data analysis, scriptable and far more capable than Excel functions for this kind of aggregation.&lt;/p&gt;
&lt;/blockquote&gt;
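&lt;p&gt;The pre-aggregation step boils down to "group, sum, and print a short summary". The real run used pandas; the sketch below swaps in the standard library so it runs anywhere, and the row layout (ISO date, category, amount) and the summary wording are assumptions for illustration.&lt;/p&gt;

```python
from collections import defaultdict


def summarize(transactions):
    """Build a compact text summary from (date, category, amount) rows."""
    monthly = defaultdict(int)
    by_category = defaultdict(int)
    for date, category, amount in transactions:
        monthly[date[:7]] += amount      # "YYYY-MM" prefix of an ISO date
        by_category[category] += amount
    total = sum(monthly.values())
    lines = [f"total: {total}"]
    lines += [f"month {m}: {v}" for m, v in sorted(monthly.items())]
    # Largest categories first, each with its share of the total.
    ranked = sorted(by_category.items(), key=lambda kv: -kv[1])
    lines += [f"category {c}: {v} ({v / total:.1%})" for c, v in ranked]
    return "\n".join(lines)
```

&lt;p&gt;The returned string is what would go into the prompt in place of the raw rows, so the model only has to quote numbers, not compute them.&lt;/p&gt;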

&lt;h3&gt;
  
  
  7B + pre-aggregated input (50 seconds)
&lt;/h3&gt;

&lt;p&gt;Numbers are &lt;strong&gt;fully accurate&lt;/strong&gt; now.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;What 7B said&lt;/th&gt;
&lt;th&gt;Real data&lt;/th&gt;
&lt;th&gt;Match?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amazon total&lt;/td&gt;
&lt;td&gt;¥693,663&lt;/td&gt;
&lt;td&gt;¥693,663&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI/dev tools&lt;/td&gt;
&lt;td&gt;¥176,850&lt;/td&gt;
&lt;td&gt;¥176,850&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly max&lt;/td&gt;
&lt;td&gt;¥493,072&lt;/td&gt;
&lt;td&gt;¥493,072&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly min&lt;/td&gt;
&lt;td&gt;¥69,961&lt;/td&gt;
&lt;td&gt;¥69,961&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Because the model is now quoting straight from the pre-aggregated numbers, the hallucinations vanished.&lt;/p&gt;

&lt;p&gt;And 7B did this in 50 seconds — better quality than the 72B + raw CSV at 12 minutes. Quietly remarkable.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Before (raw CSV)&lt;/th&gt;
&lt;th&gt;After (aggregated)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time&lt;/td&gt;
&lt;td&gt;75s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Numbers&lt;/td&gt;
&lt;td&gt;wildly off&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;exact&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verdict&lt;/td&gt;
&lt;td&gt;not usable as-is&lt;/td&gt;
&lt;td&gt;quote directly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  72B + pre-aggregated input (12m 13s)
&lt;/h3&gt;

&lt;p&gt;72B's numbers also match exactly (well, since they're being quoted from pre-aggregated data, that's expected). The proposal quality was the strongest of the four patterns:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reduce Amazon dependency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: online shopping (Amazon family) is 25.1% of total (¥693,663).&lt;/li&gt;
&lt;li&gt;Suggestion: stick to essentials only, regular review, avoid impulse buys.&lt;/li&gt;
&lt;li&gt;Expected savings: ¥57,805/month average (25% reduction) → ¥693,660/year&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;...wait, hold on. Annual Amazon spend was ¥693,663. The "savings" 72B suggests is ¥693,660. That's basically the &lt;strong&gt;same number&lt;/strong&gt;. So the proposal is effectively "stop buying on Amazon entirely (100%)" — definitely not 25%. Apparently 72B's percentage arithmetic isn't bulletproof either.&lt;/p&gt;
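&lt;p&gt;The slip is easy to confirm with a few lines of arithmetic. The figures below are the ones quoted above; the variable names are just for illustration.&lt;/p&gt;

```python
annual_amazon = 693_663          # real annual Amazon spend, in yen
claimed_monthly_saving = 57_805  # what 72B proposed saving per month
claimed_rate = 0.25              # the reduction 72B said this represents

monthly_average = annual_amazon / 12
# Share of the monthly average that the proposed saving actually removes.
actual_rate = claimed_monthly_saving / monthly_average
# What a genuine 25% cut would save per month and per year.
intended_monthly_saving = monthly_average * claimed_rate
intended_annual_saving = annual_amazon * claimed_rate

if __name__ == "__main__":
    print(f"actual reduction implied: {actual_rate:.0%}")
    print(f"a real 25% cut: {intended_monthly_saving:,.0f}/month, "
          f"{intended_annual_saving:,.0f}/year")
```

&lt;p&gt;In other words, 72B reported the full monthly average as the saving; a real 25% cut would be about a quarter of that.&lt;/p&gt;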

&lt;p&gt;That aside, the &lt;strong&gt;lifestyle hypothesis&lt;/strong&gt; section was kind of striking. Here's what 72B observed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Heavy reliance on apps and subscriptions&lt;/strong&gt;: "App/subscription" category is 10.5% of total&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frequent international travel&lt;/strong&gt;: "Travel/airline" is 15.1%, with notable overseas charges&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frequent online shopping&lt;/strong&gt;: "Online (Amazon)" is 25.1% of total&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's just one card's data, so this isn't a complete picture — but if I fed an AI my full household financials, &lt;strong&gt;the analysis and advice would probably go a lot deeper&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary: 4 patterns
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Numerical accuracy&lt;/th&gt;
&lt;th&gt;Proposal quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;Raw CSV&lt;/td&gt;
&lt;td&gt;75s&lt;/td&gt;
&lt;td&gt;❌ Numbers way off&lt;/td&gt;
&lt;td&gt;△&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;72B&lt;/td&gt;
&lt;td&gt;Raw CSV&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;12m 9s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;△ Misread monthly trend&lt;/td&gt;
&lt;td&gt;○&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;Aggregated&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Exact&lt;/td&gt;
&lt;td&gt;○ Some repetition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;72B&lt;/td&gt;
&lt;td&gt;Aggregated&lt;/td&gt;
&lt;td&gt;12m 13s&lt;/td&gt;
&lt;td&gt;✅ Exact&lt;/td&gt;
&lt;td&gt;◎ Best (mind the % math)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Quietly notable: &lt;strong&gt;72B takes ~12 minutes regardless of input size&lt;/strong&gt; (shrinking the prompt by ~90% didn't change wall-clock time much), which suggests output generation is the bottleneck. That strengthens the case for "small model + pre-aggregate" as the cost-effective default.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Cross-check: the actual graphs
&lt;/h2&gt;

&lt;p&gt;Before trusting any of the AI output, let me put the real numbers on charts using the spreadsheet tool (pandas).&lt;/p&gt;

&lt;h3&gt;
  
  
  Monthly spending
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wvfzqh0st6qv1323fgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wvfzqh0st6qv1323fgr.png" alt="Monthly spending"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Average ¥230,130/month, but the range is ¥69,961 (lowest) to ¥493,072 (highest) — about a 7× spread. The 72B's "gradually increasing" claim was a bit off the mark; the reality is bouncy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Category share
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wepa7rudozlx1igsp4o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wepa7rudozlx1igsp4o.png" alt="Categories"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"Other" being 32% is because my categorization rule is sloppy. I just wrote a simple "if the merchant name contains keyword X, bucket Y" rule, and lots of merchants didn't match any keyword and ended up in "Other." &lt;strong&gt;Reading meaning from a merchant name&lt;/strong&gt; is exactly the kind of thing AI is good at, so next time I'll let the AI do the categorization itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Top 15 merchants
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqynqrvxdlol28s3mr63m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqynqrvxdlol28s3mr63m.png" alt="Top merchants"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon at ¥421,978 (105 tx) is far and away #1. Amazon really is too convenient...&lt;/p&gt;

&lt;h3&gt;
  
  
  Weekday rhythm
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmwt0stf6hralf5vl8kp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmwt0stf6hralf5vl8kp.png" alt="Weekday pattern"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tuesday alone is ¥692,549 — way above the rest. Probably because that's when most of the subscription auto-charges land.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Today's takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Separate "aggregation" from "interpretation"
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AI is bad at&lt;/th&gt;
&lt;th&gt;AI is good at&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-row sum/average (numbers go wildly off)&lt;/td&gt;
&lt;td&gt;Categorization (interpreting fuzzy meaning)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Percentage math (saw "25% off → 100% off")&lt;/td&gt;
&lt;td&gt;Pattern recognition / hypothesis generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distributed aggregation like monthly totals&lt;/td&gt;
&lt;td&gt;Narrative interpretation, savings proposals&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;→ &lt;strong&gt;Aggregation is the spreadsheet tool's job; interpretation is the AI's.&lt;/strong&gt; When you split the work, things go fast and accurate. "Data prep matters before analysis" — yeah, that old saying really is true. Note to self.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sometimes input quality beats raw size
&lt;/h3&gt;

&lt;p&gt;"7B + pre-aggregated input in 50 seconds" outperformed "72B + raw CSV in 12 minutes". &lt;strong&gt;Sometimes you don't need a bigger model — you need cleaner input.&lt;/strong&gt; Felt that one today.&lt;/p&gt;

&lt;h3&gt;
  
  
  The local-LLM angle
&lt;/h3&gt;

&lt;p&gt;Feeding 12 months of raw credit card data to an AI without a single byte going to the cloud was surprisingly stress-free. This is one of the spots where local LLMs really shine. If you have personal info, or anything you'd rather not send to the cloud, this is the place for it.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Tech details (Claude explains)
&lt;/h2&gt;

&lt;p&gt;The technical bits, written up by my AI pair.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;SCP transfer to the DGX (mDNS, no IP needed)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;NVIDIA Sync auto-configures a Host alias in &lt;code&gt;~/AppData/Local/NVIDIA Corporation/Sync/config/ssh_config&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ssh"&gt;&lt;code&gt;&lt;span class="k"&gt;Host&lt;/span&gt; spark-XXXX.local
  &lt;span class="k"&gt;Hostname&lt;/span&gt; spark-XXXX.local
  &lt;span class="k"&gt;User&lt;/span&gt; [user]
  &lt;span class="k"&gt;Port&lt;/span&gt; &lt;span class="m"&gt;22&lt;/span&gt;
  &lt;span class="k"&gt;IdentityFile&lt;/span&gt; "...&lt;span class="err"&gt;\\&lt;/span&gt;nvsync.key"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Which means I can SSH/SCP using &lt;code&gt;spark-XXXX.local&lt;/code&gt; without ever looking up an IP. The &lt;code&gt;.local&lt;/code&gt; suffix uses mDNS (Multicast DNS) for hostname resolution within the LAN.&lt;/p&gt;
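
&lt;p&gt;A minimal sketch of checking what a name resolves to from Python, assuming the operating system's resolver has mDNS support (for example Avahi on Linux or Bonjour on macOS). The helper name &lt;code&gt;resolve&lt;/code&gt; is hypothetical, and &lt;code&gt;localhost&lt;/code&gt; stands in for the &lt;code&gt;spark-XXXX.local&lt;/code&gt; alias so the sketch runs anywhere:&lt;/p&gt;

```python
import socket


def resolve(hostname, port=22):
    """Return the sorted unique IP addresses the OS resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


# On the LAN you would pass the spark-XXXX.local alias instead of localhost.
print(resolve("localhost"))
```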

&lt;p&gt;Transfer command (one line, from PowerShell on the Windows side):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;scp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-r&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C:\Users\[user]\Desktop\docs\dgx\csv"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;spark-XXXX.local:/home/&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="nx"&gt;/personal/dgx-100-experiments/private-data/credit-card-csv&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;ol start="2"&gt;
&lt;li&gt;Ollama install + the sudo-TTY catch + GPU detection log&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ollama install:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;When this ran through Claude Code's Bash tool, it failed at the sudo password prompt, because sudo needs an interactive TTY to read the password:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;sudo: a terminal is required to read the password
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Reopened a separate SSH session, ran the same command manually, and it went through.&lt;/p&gt;
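
&lt;p&gt;A rough way to spot this situation up front is to check whether standard input is attached to a terminal. This is a sketch under that assumption only: sudo's own check is more involved, and the helper name &lt;code&gt;has_interactive_tty&lt;/code&gt; is hypothetical:&lt;/p&gt;

```python
import sys


def has_interactive_tty(stream=None):
    """True when the given stream (default: stdin) is attached to a terminal."""
    stream = sys.stdin if stream is None else stream
    return stream.isatty()


if not has_interactive_tty():
    # Typical for commands launched by an agent or a script rather than a person.
    print("No TTY: run the installer from a real SSH session instead.")
```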

&lt;p&gt;Once installed, systemd auto-starts the service. The GPU detection log via &lt;code&gt;journalctl -u ollama&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;inference&lt;/span&gt; &lt;span class="err"&gt;compute&lt;/span&gt; &lt;span class="py"&gt;id&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;GPU-986c194b... name=CUDA0 description="NVIDIA GB10"&lt;/span&gt;
&lt;span class="py"&gt;total&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"121.7 GiB"&lt;/span&gt; &lt;span class="s"&gt;available="79.0 GiB"&lt;/span&gt;
&lt;span class="py"&gt;default_num_ctx&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;262144&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;VRAM (DGX Spark unified memory): &lt;strong&gt;121.7 GiB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Default context: &lt;strong&gt;262,144 tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compared with a typical RTX 4090 (24 GB VRAM, 8K–32K default context), the gap is significant.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Loading both models simultaneously&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen2.5:7b   &lt;span class="c"&gt;# 4.7 GB&lt;/span&gt;
ollama pull qwen2.5:72b  &lt;span class="c"&gt;# 47 GB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;After loading both, &lt;code&gt;ollama ps&lt;/code&gt; shows:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME           SIZE      PROCESSOR    CONTEXT    
qwen2.5:72b    61 GB     100% GPU     32768
qwen2.5:7b     8.2 GB    100% GPU     32768
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Total ~69 GB used, against the 79.0 GiB the log reported as available. Both models stay resident, so switching between them is instant.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Custom CSV parser for the credit card data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Three quirks needed handling: CP932 encoding, unquoted fields (so commas inside some merchant names break a naive split), and a trailing summary row in each file.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# COLUMNS: the list of 7 column names, not shown here.
&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# skip blank/summary rows
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# unquoted commas in the merchant name: glue the surplus pieces back together
&lt;/span&gt;        &lt;span class="n"&gt;merchant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;merchant&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cp932&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# skip header (cardholder metadata)
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLUMNS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;利用日&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;利用日&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y/%m/%d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;利用金額&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;利用金額&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
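&lt;p&gt;To see the comma re-join in isolation, here is a small sketch of the same slicing idea. The sample line and its placeholder columns are made up, and &lt;code&gt;rejoin_merchant&lt;/code&gt; is a hypothetical helper name; it assumes, as the parser above does, that the merchant name is column 1 and that exactly five columns follow it:&lt;/p&gt;

```python
def rejoin_merchant(fields, tail=5):
    """Glue everything between the date and the last `tail` columns into one field."""
    merchant = ",".join(fields[1:-tail])
    return [fields[0], merchant] + fields[-tail:]


# Made-up line: the merchant name "ACME, INC" contains an unquoted comma.
line = "2025/04/17,ACME, INC,a,b,c,1200,e"
fields = line.split(",")          # 8 pieces instead of the expected 7
fixed = rejoin_merchant(fields)
print(len(fields), len(fixed))    # prints 8 7
print(fixed[1])                   # prints ACME, INC
```

&lt;p&gt;Counting from the right works because only the merchant column can contain stray commas; a row that already has 7 fields passes through unchanged.&lt;/p&gt;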
&lt;ol start="5"&gt;
&lt;li&gt;Japanese fonts in matplotlib&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;japanize-matplotlib&lt;/code&gt; doesn't work on Python 3.12 — it imports &lt;code&gt;distutils&lt;/code&gt;, which was removed from the standard library.&lt;/p&gt;

&lt;p&gt;The modern replacement is &lt;code&gt;matplotlib-fontja&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;matplotlib-fontja
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib_fontja&lt;/span&gt;  &lt;span class="c1"&gt;# noqa: F401  ← just importing it sets up IPAexGothic
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
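&lt;p&gt;If a script has to run on machines with different setups, one option is to check which font package is installed before importing it. A minimal sketch using only the standard library; the helper name &lt;code&gt;pick_font_package&lt;/code&gt; is hypothetical, and finding a package installed does not guarantee that it imports cleanly on Python 3.12:&lt;/p&gt;

```python
import importlib.util


def pick_font_package(candidates=("matplotlib_fontja", "japanize_matplotlib")):
    """Return the first candidate module that is installed, or None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


chosen = pick_font_package()
print(chosen or "no Japanese font helper installed")
```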
&lt;ol start="6"&gt;
&lt;li&gt;Calling Ollama from Python&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The official &lt;code&gt;ollama&lt;/code&gt; Python client is straightforward:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5:72b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Streaming makes long generation easier to watch unfold.&lt;/p&gt;
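
&lt;p&gt;If the full reply is also needed afterwards (for saving to a file, say), the chunks can be collected while they are printed. A minimal sketch, assuming chunks shaped like the ones in the loop above; &lt;code&gt;collect_stream&lt;/code&gt; is a hypothetical helper, and a made-up list stands in for the real stream so it runs without a server:&lt;/p&gt;

```python
def collect_stream(stream):
    """Print each chunk as it arrives and return the full reply text."""
    parts = []
    for chunk in stream:
        text = chunk["message"]["content"]
        print(text, end="", flush=True)
        parts.append(text)
    return "".join(parts)


# Stand-in for client.chat(..., stream=True); these chunks are made up.
fake_stream = [
    {"message": {"content": "Hello"}},
    {"message": {"content": ", world"}},
]
reply = collect_stream(fake_stream)
```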




&lt;h2&gt;
  
  
  Tomorrow: Day 4
&lt;/h2&gt;

&lt;p&gt;Day 4 plan: &lt;strong&gt;let a local AI sort 20,000 iPhone photos&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The actual goal is to have a local image-recognition model (CLIP family?) clean up my photo library so I can stop paying iCloud for storage upgrades...!&lt;/p&gt;




&lt;p&gt;#100ExperimentsWithDGX #LocalLLM #Ollama&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>ollama</category>
    </item>
    <item>
      <title>[Day 2] I Trained an AI on 22 Photos of My Cat — Now It Draws Her in Any Scene</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Tue, 05 May 2026 00:06:00 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-2-i-trained-an-ai-on-22-photos-of-my-cat-now-it-draws-her-in-any-scene-3a92</link>
      <guid>https://dev.to/peppercorn_llm/day-2-i-trained-an-ai-on-22-photos-of-my-cat-now-it-draws-her-in-any-scene-3a92</guid>
      <description>&lt;h1&gt;
  
  
  [Day 2] I Trained an AI on 22 Photos of My Cat — Now It Draws Her in Any Scene
&lt;/h1&gt;

&lt;h2&gt;
  
  
  So, yesterday I generated "some cat"
&lt;/h2&gt;

&lt;p&gt;Day 1 ended with "I made my DGX draw a cat" — but the cat that came out was just "a cat from somewhere". Today, the goal is to teach the AI about my actual cat (who's currently being looked after at my parents' place back in Japan).&lt;/p&gt;

&lt;p&gt;This is what people call LoRA training.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;LoRA: A technique that teaches an AI model "specific features" using a small set of images, without touching the base model itself. Apparently. The output is a small "diff" file (tens of MB).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is experiment #2.&lt;/p&gt;




&lt;h2&gt;
  
  
  The training data
&lt;/h2&gt;

&lt;p&gt;Source material: 22 photos of my cat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9tmru213ymne73f61pv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9tmru213ymne73f61pv.jpg" alt="Training photo collage" width="800" height="1058"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I picked a mix of angles — front-facing, full body, sleepy poses, varying lighting — to give the AI a fair shot at recognizing the cat's defining features (tuxedo black-and-white pattern, white socks, the black smudge on the nose).&lt;/p&gt;




&lt;h2&gt;
  
  
  Training pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Pre-processing
&lt;/h3&gt;

&lt;p&gt;iPhone HEIC files don't work directly with most AI tools, so the first step is converting them to JPG. 10 of the 22 were HEIC.&lt;/p&gt;

&lt;p&gt;Then resize to 512px on the short side for training. &lt;strong&gt;This is where I tripped over a sneaky bug&lt;/strong&gt;; details are in the tech details section below.&lt;/p&gt;
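
&lt;p&gt;"512px on the short side" means scaling so the smaller dimension lands on 512 while the aspect ratio stays put. A minimal sketch of that calculation; the helper name &lt;code&gt;short_side_resize&lt;/code&gt; is hypothetical, and 3024x4032 is just an assumed example size for a portrait photo:&lt;/p&gt;

```python
def short_side_resize(width, height, target=512):
    """Scale (width, height) so the shorter side equals target, keeping aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


print(short_side_resize(3024, 4032))  # prints (512, 683)
```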

&lt;h3&gt;
  
  
  2. Captions
&lt;/h3&gt;

&lt;p&gt;Every image gets a text description like "ohwx cat, sitting on a wooden floor, indoor, soft lighting". The four-letter &lt;code&gt;ohwx&lt;/code&gt; is a meaningless token that becomes the trigger word for "my specific cat" after training.&lt;/p&gt;
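
&lt;p&gt;The caption format is simple enough to assemble in code: trigger word, subject, then the scene description. A minimal sketch, where &lt;code&gt;build_caption&lt;/code&gt; is a hypothetical helper and not part of any training tool:&lt;/p&gt;

```python
def build_caption(description, trigger="ohwx", subject="cat"):
    """Prefix a scene description with the trigger word and subject."""
    return f"{trigger} {subject}, {description.strip()}"


print(build_caption("sitting on a wooden floor, indoor, soft lighting"))
```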

&lt;p&gt;Drafting 22 captions by hand would be tedious — but Claude can read images directly, so it drafted them while I just reviewed. The accuracy was uncanny. For example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmzgmz033je98xlpyilb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmzgmz033je98xlpyilb.jpg" alt="Cat on a kitchen counter" width="512" height="683"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ohwx cat, walking on a metal kitchen counter, side profile, indoor kitchen with spice bottles and shelves in the background&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nm6622l9qfb0py7z2ag.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nm6622l9qfb0py7z2ag.jpg" alt="Mid-yawn cat" width="512" height="683"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ohwx cat, in a loaf pose on a gray carpet, mouth open showing teeth, mid-yawn, indoor with shelves and warm lights in the background&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisxhfub5fudklbriga1p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisxhfub5fudklbriga1p.jpg" alt="Cat by a window" width="683" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ohwx cat, sitting on a wooden floor by a balcony window, viewed from behind, sharp sunlight casting long shadows, indoor&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;SUGOI.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Kohya_ss training
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Kohya_ss&lt;/code&gt; is the de facto LoRA training tool. Set up a TOML config, then run one command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;accelerate launch train_network.py &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--config_file&lt;/span&gt; configs/train.toml &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--dataset_config&lt;/span&gt; configs/dataset.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Training logs scroll by, and the loss value gradually drops. Lower loss = the model is learning, apparently.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Done
&lt;/h3&gt;

&lt;p&gt;1100 steps in 13 minutes 3 seconds on the DGX Spark.&lt;/p&gt;
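
&lt;p&gt;As a throughput figure, that works out as follows. This is plain arithmetic on the two numbers above, with made-up variable names:&lt;/p&gt;

```python
steps = 1100
elapsed_seconds = 13 * 60 + 3   # 13 min 3 s = 783 s

seconds_per_step = elapsed_seconds / steps
steps_per_second = steps / elapsed_seconds
print(f"{seconds_per_step:.2f} s/step, {steps_per_second:.2f} steps/s")
# prints 0.71 s/step, 1.40 steps/s
```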




&lt;h2&gt;
  
  
  Result 1: just typing "ohwx cat" gives me my cat
&lt;/h2&gt;

&lt;p&gt;The first thing I tried was a "without LoRA vs with LoRA" comparison. Same prompt — "ohwx cat as a chef in a kitchen, ..." — first without the LoRA, then with it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzfjjk84dv92xns5nt3v.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzfjjk84dv92xns5nt3v.jpg" alt="Without (left) vs With (right) LoRA" width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Left: no LoRA. Right: with LoRA.&lt;/p&gt;

&lt;p&gt;Without LoRA, &lt;code&gt;ohwx&lt;/code&gt; is gibberish to the model, so it's ignored and only "a chef in a kitchen" carries weight. Result: a human chef. A nice woman cooking in a pink kitchen.&lt;/p&gt;

&lt;p&gt;With LoRA, &lt;code&gt;ohwx&lt;/code&gt; becomes a trigger word that points at my cat. Same prompt, but now my cat is the chef.&lt;/p&gt;

&lt;p&gt;This was the moment that hit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Result 2: novel scene reproduction
&lt;/h2&gt;

&lt;p&gt;The training set has no photo of the cat sitting on a wooden floor in this exact composition. So I tried it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngsfgj9etl9pv39axg2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngsfgj9etl9pv39axg2z.png" alt="My cat sitting on a wooden floor" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;White socks: present. Nose smudge: present.&lt;/p&gt;




&lt;h2&gt;
  
  
  My cat, in places she's never been
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;ohwx cat&lt;/code&gt; in various scenes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sunny balcony
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbhlskto67vvbgmx2hdm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbhlskto67vvbgmx2hdm.png" alt="Cat on a sunny balcony" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cozy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chef (reprise)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2awj3mdsj8u788bedxhl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2awj3mdsj8u788bedxhl.png" alt="Cat as a chef" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chef hat fits suspiciously well. Cooking ability unverified.&lt;/p&gt;

&lt;h3&gt;
  
  
  Autumn forest
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1l3r9hwnvh3kqk8qc1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1l3r9hwnvh3kqk8qc1n.png" alt="Cat in an autumn forest" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A painterly take.&lt;/p&gt;

&lt;h3&gt;
  
  
  Astronaut
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frps74rdecajbrews1tz4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frps74rdecajbrews1tz4.png" alt="Cat as an astronaut" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A doppelgänger via the helmet glass — but sci-fi all the same.&lt;/p&gt;




&lt;h2&gt;
  
  
  Today's takeaway
&lt;/h2&gt;

&lt;p&gt;"Build your own AI from your own data" turned out to be way more accessible than I'd assumed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech details (Claude explains)
&lt;/h2&gt;

&lt;p&gt;The technical bits, written up by my AI pair.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;HEIC → JPG conversion and the EXIF orientation trap&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Reading iPhone HEIC files in Python is straightforward with &lt;code&gt;pillow-heif&lt;/code&gt;. JPG conversion is a few lines:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ImageOps&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pillow_heif&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;register_heif_opener&lt;/span&gt;
&lt;span class="nf"&gt;register_heif_opener&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IMG_1234.HEIC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;oriented&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ImageOps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exif_transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ← critical line
&lt;/span&gt;    &lt;span class="n"&gt;rgb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;oriented&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IMG_1234.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  What I tripped on
&lt;/h3&gt;

&lt;p&gt;My first version skipped &lt;code&gt;ImageOps.exif_transpose()&lt;/code&gt;. Result: 8 of 22 photos came out rotated 90° in the resized output.&lt;/p&gt;

&lt;p&gt;iPhones save portrait shots with the pixels stored in landscape orientation, plus an EXIF Orientation tag saying "rotate 90° on display". Pillow's &lt;code&gt;Image.open()&lt;/code&gt; does not apply that tag when loading, so you have to call &lt;code&gt;ImageOps.exif_transpose()&lt;/code&gt; explicitly.&lt;/p&gt;

&lt;p&gt;Caught it before training started. If I hadn't, the LoRA would have learned "sideways cat" from those 8 photos, and the generated images would have come out weird.&lt;/p&gt;
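
&lt;p&gt;To make the orientation issue concrete, here is a minimal stdlib-only sketch of what the EXIF Orientation values (1 to 8) imply for the displayed size. The helper name &lt;code&gt;displayed_size&lt;/code&gt; is made up for illustration and is not part of Pillow, and the 4032×3024 figure is just an example photo size.&lt;/p&gt;

```python
# EXIF Orientation values 5-8 involve a 90-degree turn, so the image that
# should be displayed has its width and height swapped relative to storage.
QUARTER_TURN_ORIENTATIONS = {5, 6, 7, 8}


def displayed_size(stored_width, stored_height, orientation=1):
    """Return (width, height) as a viewer that honors EXIF would show it.

    Hypothetical helper for illustration; Pillow does the real pixel work
    in ImageOps.exif_transpose().
    """
    if orientation not in range(1, 9):
        raise ValueError(f"EXIF Orientation must be 1-8, got {orientation}")
    if orientation in QUARTER_TURN_ORIENTATIONS:
        return stored_height, stored_width
    return stored_width, stored_height


# A portrait shot stored landscape-wise with Orientation=6 (rotate 90 degrees clockwise):
print(displayed_size(4032, 3024, orientation=6))  # (3024, 4032)
```

&lt;p&gt;If the size a script sees does not match what a photo viewer shows, the orientation tag has probably not been applied yet.&lt;/p&gt;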

&lt;ol&gt;
&lt;li&gt;Kohya_ss setup on ARM64 (DGX Spark)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are two repos commonly referred to as "Kohya_ss":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;bmaltais/kohya_ss&lt;/code&gt; — GUI wrapper, xformers dependency (clashes with ARM64)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kohya-ss/sd-scripts&lt;/code&gt; — the actual training engine, CLI/TOML driven&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DGX Spark is ARM64, so I went with the latter:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 https://github.com/kohya-ss/sd-scripts.git ~/Kohya_ss
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/Kohya_ss
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv &amp;amp;&amp;amp; &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip
pip &lt;span class="nb"&gt;install &lt;/span&gt;torch torchvision &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cu128
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;DGX Spark uses CUDA 12.8 + ARM64 (sbsa), so the PyTorch &lt;code&gt;cu128&lt;/code&gt; channel works directly. Surprisingly painless.&lt;/p&gt;
&lt;h3&gt;
  
  
  Training config (TOML)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# train.toml (excerpt)&lt;/span&gt;
&lt;span class="py"&gt;pretrained_model_name_or_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;".../Realistic_Vision_V6.0_NV_B1.safetensors"&lt;/span&gt;
&lt;span class="py"&gt;vae&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;".../vae-ft-mse-840000-ema-pruned.safetensors"&lt;/span&gt;

&lt;span class="py"&gt;network_module&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"networks.lora"&lt;/span&gt;
&lt;span class="py"&gt;network_dim&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;
&lt;span class="py"&gt;network_alpha&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;

&lt;span class="py"&gt;optimizer_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"AdamW8bit"&lt;/span&gt;
&lt;span class="py"&gt;unet_lr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1e-4&lt;/span&gt;
&lt;span class="py"&gt;text_encoder_lr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;5e-5&lt;/span&gt;
&lt;span class="py"&gt;lr_scheduler&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"cosine_with_restarts"&lt;/span&gt;

&lt;span class="py"&gt;max_train_epochs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="py"&gt;save_every_n_epochs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="py"&gt;mixed_precision&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"bf16"&lt;/span&gt;
&lt;span class="py"&gt;sdpa&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;cache_latents&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# dataset.toml&lt;/span&gt;
&lt;span class="nn"&gt;[general]&lt;/span&gt;
&lt;span class="py"&gt;shuffle_caption&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="py"&gt;caption_extension&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;".txt"&lt;/span&gt;
&lt;span class="py"&gt;keep_tokens&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="nn"&gt;[[datasets]]&lt;/span&gt;
&lt;span class="py"&gt;resolution&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;
&lt;span class="py"&gt;batch_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="py"&gt;enable_bucket&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="nn"&gt;[[datasets.subsets]]&lt;/span&gt;
  &lt;span class="py"&gt;image_dir&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/path/to/cat-photos-512"&lt;/span&gt;
  &lt;span class="py"&gt;num_repeats&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;22 photos × 10 repeats × 10 epochs ÷ batch size 2 = 1,100 steps. Training took 13 minutes.&lt;/p&gt;
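
&lt;p&gt;The step count above is plain arithmetic, sketched here with a hypothetical &lt;code&gt;total_steps&lt;/code&gt; helper. It assumes a single bucket and no gradient accumulation; with &lt;code&gt;enable_bucket = true&lt;/code&gt; images are batched per resolution bucket, so the count sd-scripts reports can come out slightly higher.&lt;/p&gt;

```python
import math


def total_steps(num_images, num_repeats, epochs, batch_size):
    """Estimate training steps for a LoRA run (hypothetical helper).

    Assumes a single bucket, so each epoch is ceil(samples / batch_size) steps.
    """
    samples_per_epoch = num_images * num_repeats
    steps_per_epoch = math.ceil(samples_per_epoch / batch_size)
    return steps_per_epoch * epochs


print(total_steps(num_images=22, num_repeats=10, epochs=10, batch_size=2))  # 1100
```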

&lt;p&gt;Base model: Realistic Vision V6.0 B1 noVAE (a photo-realistic SD 1.5 derivative). External VAE: sd-vae-ft-mse-original. The combination is good at fur detail.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hitting the ComfyUI HTTP API for batch generation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Clicking through the GUI for one image at a time gets old fast. ComfyUI exposes an HTTP API that's easy to drive from Python — &lt;code&gt;urllib.request&lt;/code&gt; from the standard library is enough (no extra deps).&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;COMFY_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:8188&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;queue_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COMFY_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wait_for_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COMFY_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/history/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prompt_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
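        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Fail loudly instead of silently returning None on timeout&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No history for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; after &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;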
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;The workflow is ComfyUI's API format (a dict of node IDs with their connections). To use a LoRA, insert a &lt;code&gt;LoraLoader&lt;/code&gt; node between the checkpoint loader and KSampler.&lt;/p&gt;
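
&lt;p&gt;As a rough sketch of that format: each key is a node ID, each value names a &lt;code&gt;class_type&lt;/code&gt; and its &lt;code&gt;inputs&lt;/code&gt;, and a connection is written as &lt;code&gt;[source_node_id, output_index]&lt;/code&gt;. The node IDs, file names, sampler settings, and the &lt;code&gt;build_workflow&lt;/code&gt; helper below are placeholders made up for illustration, not the values from my run. The node and input names are those of ComfyUI's built-in nodes as far as I know, so export your own workflow in API format to double-check.&lt;/p&gt;

```python
def build_workflow(prompt, seed, lora_strength=0.8):
    """Build a txt2img workflow dict in ComfyUI API format (illustrative sketch).

    File names and node IDs are placeholders, not the author's actual values.
    """
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": "base_model.safetensors"}},
        # LoraLoader sits between the checkpoint and everything downstream:
        # it takes MODEL (output 0) and CLIP (output 1) and returns patched copies.
        "2": {"class_type": "LoraLoader",
              "inputs": {"model": ["1", 0], "clip": ["1", 1],
                         "lora_name": "my_cat.safetensors",
                         "strength_model": lora_strength,
                         "strength_clip": lora_strength}},
        "3": {"class_type": "CLIPTextEncode",
              "inputs": {"clip": ["2", 1], "text": prompt}},
        "4": {"class_type": "CLIPTextEncode",
              "inputs": {"clip": ["2", 1], "text": "blurry, low quality"}},
        "5": {"class_type": "EmptyLatentImage",
              "inputs": {"width": 512, "height": 768, "batch_size": 1}},
        "6": {"class_type": "KSampler",
              "inputs": {"model": ["2", 0], "positive": ["3", 0],
                         "negative": ["4", 0], "latent_image": ["5", 0],
                         "seed": seed, "steps": 25, "cfg": 7.0,
                         "sampler_name": "euler", "scheduler": "normal",
                         "denoise": 1.0}},
        # The VAE comes from the checkpoint here (output 2); an external VAE
        # would be wired in through a VAELoader node instead.
        "7": {"class_type": "VAEDecode",
              "inputs": {"samples": ["6", 0], "vae": ["1", 2]}},
        "8": {"class_type": "SaveImage",
              "inputs": {"images": ["7", 0], "filename_prefix": "cat_lora"}},
    }


workflow = build_workflow("a photo of a cat on a windowsill", seed=42)
print(workflow["6"]["inputs"]["model"])  # ['2', 0]
```

&lt;p&gt;The returned dict can be passed straight to &lt;code&gt;queue_prompt()&lt;/code&gt; from the snippet above.&lt;/p&gt;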

&lt;p&gt;DGX Spark generates one 512×768 image in about 3 seconds. With seed/strength/prompt parametrized in a script, all 12 grid images came out in under a minute.&lt;/p&gt;
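
&lt;p&gt;Once a run finishes, the history entry returned by &lt;code&gt;wait_for_history()&lt;/code&gt; lists the files each output node saved. Here is a small sketch for turning that into download URLs. &lt;code&gt;output_image_urls&lt;/code&gt; is a hypothetical helper, and both the entry shape and the &lt;code&gt;/view&lt;/code&gt; endpoint are assumptions based on how ComfyUI's built-in server behaves as far as I know, so check them against your version.&lt;/p&gt;

```python
from urllib.parse import urlencode


def output_image_urls(history_entry, base_url="http://127.0.0.1:8188"):
    """Collect /view URLs for every image recorded in one history entry.

    Hypothetical helper; assumes the entry has the shape
    {"outputs": {node_id: {"images": [{"filename", "subfolder", "type"}]}}}.
    """
    urls = []
    for node_output in history_entry.get("outputs", {}).values():
        for image in node_output.get("images", []):
            query = urlencode({
                "filename": image["filename"],
                "subfolder": image.get("subfolder", ""),
                "type": image.get("type", "output"),
            })
            urls.append(f"{base_url}/view?{query}")
    return urls


# Made-up history entry for illustration:
example = {"outputs": {"8": {"images": [
    {"filename": "cat_lora_00001_.png", "subfolder": "", "type": "output"},
]}}}
print(output_image_urls(example))
```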




&lt;h2&gt;
  
  
  Tomorrow: Day 3
&lt;/h2&gt;

&lt;p&gt;Day 3 plan: have a local AI analyze my credit card history.&lt;/p&gt;

&lt;p&gt;The kind of data I'd rather not send to a cloud AI, but absolutely want to understand. Quintessential local-AI territory.&lt;/p&gt;




&lt;p&gt;#100ExperimentsWithDGX #LocalLLM&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>lora</category>
    </item>
    <item>
      <title>[Day 1] DGX Spark Came Home — I Made It Draw a Cat</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Mon, 04 May 2026 03:20:48 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-1-dgx-spark-came-home-i-made-it-draw-a-cat-30f7</link>
      <guid>https://dev.to/peppercorn_llm/day-1-dgx-spark-came-home-i-made-it-draw-a-cat-30f7</guid>
      <description>&lt;h1&gt;
  
  
  [Day 1] DGX Spark Came Home — I Made It Draw a Cat
&lt;/h1&gt;

&lt;h2&gt;
  
  
  So... what is "local LLM" again?
&lt;/h2&gt;

&lt;p&gt;Honestly, I'm still figuring out what "local LLM" even means. But somehow, through a series of decisions I won't fully justify here, I ended up buying an NVIDIA DGX Spark — and now it's sitting in my house.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;DGX Spark: NVIDIA's "supercomputer for the home" — a small but seriously expensive box with the latest-gen AI chip inside. Apparently.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What I really want to figure out is: when should I use local AI vs. cloud AI? Reading articles about it doesn't seem to help, so I'm going full hands-on. Goal: 100 experiments, one per day-ish, until I have an evidence-based answer.&lt;/p&gt;

&lt;p&gt;This is experiment #1.&lt;/p&gt;




&lt;h2&gt;
  
  
  First, the hardware
&lt;/h2&gt;

&lt;p&gt;So this is what showed up at my door — solidly packed in a sturdy cardboard box.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetwrk2l7jv4q4qg6387t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetwrk2l7jv4q4qg6387t.jpg" alt="DGX Spark box"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I opened it, I was surprised at how small it actually is. "This is the AI machine?" kind of small.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqq8k08mii298z3qtorrr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqq8k08mii298z3qtorrr.jpg" alt="DGX Spark hardware (mesh sides)"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Boot up → Initial OS setup
&lt;/h2&gt;

&lt;p&gt;Power on, and an Ubuntu-based DGX OS 7.5.0 boots up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Welcome screen
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn3ibxlwj605xrrk9zfi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn3ibxlwj605xrrk9zfi.jpg" alt="Get started screen"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"Get started" — yes, please.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language and timezone
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkivl91tfq5u5wv55omq2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkivl91tfq5u5wv55omq2.jpg" alt="Language and timezone"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Standard Linux installer territory — same as Ubuntu?&lt;/p&gt;

&lt;h3&gt;
  
  
  Privacy settings
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v6rcegdhblxy68puwzv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v6rcegdhblxy68puwzv.jpg" alt="Privacy settings"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Diagnostic data sharing prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  System update
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaogr37pvykiys42avjz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaogr37pvykiys42avjz.jpg" alt="Update started"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The moment I plugged it in, it started updating itself. Modern Linux being Linux.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup complete
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho3zwtbliqbvvabytjwt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho3zwtbliqbvvabytjwt.jpg" alt="Setup complete"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I picked a username and let the hostname auto-assign. DGX-side prep done.&lt;/p&gt;




&lt;h2&gt;
  
  
  Connecting from my Windows PC
&lt;/h2&gt;

&lt;p&gt;Plugging a monitor into the DGX every time would be tedious, so I want to SSH in from my regular Windows machine (which I've nicknamed "myPC1").&lt;/p&gt;

&lt;p&gt;NVIDIA provides a desktop app called NVIDIA Sync that's supposed to make SSH setup painless, so I installed it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5uezyis10h8hluz2xb0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5uezyis10h8hluz2xb0.jpg" alt="NVIDIA Sync install"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;…and that's where I fell into a trap big-time. Windows OpenSSH refused to connect with a "your SSH config has weird permissions, can't trust it" error.&lt;/p&gt;

&lt;p&gt;Full troubleshooting steps are in the collapsible "Tech details" section below.&lt;/p&gt;




&lt;h2&gt;
  
  
  Inside the DGX, finally
&lt;/h2&gt;

&lt;p&gt;After much wrestling, I made it inside. Here's the rough lay of the land:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;NVIDIA GB10 Grace Blackwell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;128GB (unified between CPU and GPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;4TB SSD (basically empty)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;20 cores (perf + efficiency combo)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idle power&lt;/td&gt;
&lt;td&gt;4W (yes, four)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;128GB of memory is apparently 8–16x what's in a typical laptop.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting up image generation → 🐱
&lt;/h2&gt;

&lt;p&gt;This is the main event. I'm setting up ComfyUI to generate the first cat from this DGX.&lt;/p&gt;

&lt;p&gt;The ComfyUI interface looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53sxhno991z6dni6ozdv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53sxhno991z6dni6ozdv.jpg" alt="ComfyUI connected"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The "boxes connected by cables" view is intimidating at first, but the default workflow is pre-wired. You just type a prompt and hit Queue Prompt.&lt;/p&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a cute fluffy cat sitting on a sunny windowsill, photorealistic, high detail, beautiful lighting, soft fur, cinematic, masterpiece, best quality&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A few seconds later...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj7bnj1taf0oplft2itq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj7bnj1taf0oplft2itq.png" alt="ComfyUI cat 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🐱 There it is — the very first cat my DGX has ever drawn!&lt;/p&gt;

&lt;p&gt;Tweaked the prompt and made some more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5lb3ptrj1ue8cdejtno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5lb3ptrj1ue8cdejtno.png" alt="ComfyUI cat 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Eyes a bit unsettling but yeah, fluffy cat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrkiqc4px9sjeys77cms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrkiqc4px9sjeys77cms.png" alt="ComfyUI cat 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Going a touch dark there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6e77vf9oc2i6hem4fdm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6e77vf9oc2i6hem4fdm.png" alt="ComfyUI cat 4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;…is this a cat? It feels artistic though.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwdpwdcz28b6dsmhsz51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwdpwdcz28b6dsmhsz51.png" alt="ComfyUI cat 5"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Distinctive composition.&lt;/p&gt;

&lt;p&gt;Each masterpiece takes anywhere from a few seconds to about a dozen. That speed means I can iterate on prompts without thinking about cost, which turned out to be quite addictive.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech details (let the AI explain it)
&lt;/h2&gt;

&lt;p&gt;The rest is the technical stuff. Read on if you're curious.&lt;/p&gt;

&lt;p&gt;I'm a non-engineer poking at this stuff for the first time, so I had Claude (my AI pair programmer for this challenge) write up the technical details. Hopefully useful for anyone walking the same path.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How to actually get SSH working on Windows&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;NVIDIA Sync should generate an SSH keypair, register the public key on the DGX side at &lt;code&gt;~/.ssh/authorized_keys&lt;/code&gt;, and let you connect without a password.&lt;/p&gt;

&lt;p&gt;If it doesn't work, the cause is usually permissions on Windows SSH config files.&lt;/p&gt;
&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ssh spark-XXXX.local
&lt;span class="go"&gt;Bad permissions. Try removing permissions for user: [PC]\CodexSandboxUsers
on file C:/Users/[user]/.ssh/config.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;If you've installed Codex CLI or similar sandboxing tools in the past, the &lt;code&gt;[PC]\CodexSandboxUsers&lt;/code&gt; group may have inherited permissions on &lt;code&gt;~/.ssh/&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Fix (run from an elevated PowerShell)
&lt;/h3&gt;

&lt;p&gt;Use environment variables to avoid hard-coding your username/PC name.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Take ownership&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;takeown&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/f&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh\config"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh\config"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/grant:r&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERNAME&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;:F"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Disable inheritance and remove the bad user&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh\config"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/inheritance:d&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh\config"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/remove&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;COMPUTERNAME&lt;/span&gt;&lt;span class="s2"&gt;\CodexSandboxUsers"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
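&lt;p&gt;Use &lt;code&gt;/inheritance:d&lt;/code&gt; rather than &lt;code&gt;/inheritance:r&lt;/code&gt;: &lt;code&gt;:d&lt;/code&gt; turns off inheritance but keeps the inherited entries as explicit copies, while &lt;code&gt;:r&lt;/code&gt; removes every inherited entry, which can lock you out of the file.&lt;/p&gt;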
&lt;h3&gt;
  
  
  NVIDIA Sync's internal config files need the same treatment
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;~/.ssh/config&lt;/code&gt; &lt;code&gt;Include&lt;/code&gt;s an NVIDIA Sync config file, and that one inherits the same problem.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;LOCALAPPDATA&lt;/span&gt;&lt;span class="s2"&gt;\NVIDIA Corporation\Sync\config\ssh_config"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/inheritance:d&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/remove&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;COMPUTERNAME&lt;/span&gt;&lt;span class="s2"&gt;\CodexSandboxUsers"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;LOCALAPPDATA&lt;/span&gt;&lt;span class="s2"&gt;\NVIDIA Corporation\Sync\config\nvsync.key"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/inheritance:d&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/remove&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;COMPUTERNAME&lt;/span&gt;&lt;span class="s2"&gt;\CodexSandboxUsers"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Ghost SIDs that icacls can't remove
&lt;/h3&gt;

&lt;p&gt;If SIDs from deleted user accounts are still lingering in the ACL, &lt;code&gt;icacls /remove&lt;/code&gt; with an account name won't touch them, because the name no longer resolves. You need PowerShell ACL manipulation:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;LOCALAPPDATA&lt;/span&gt;&lt;span class="s2"&gt;\NVIDIA Corporation\Sync\config\ssh_config"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$acl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Get-Acl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt;
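&lt;/span&gt;&lt;span class="c"&gt;# Translate() throws for SIDs that no longer map to an account, so catch that instead of testing the result&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$badRules&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$acl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Access&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Where-Object&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nv"&gt;$id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IdentityReference&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="kr"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$id&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-like&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"S-1-5-*"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="kr"&gt;try&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$null&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$id&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Translate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;System.Security.Principal.NTAccount&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$false&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="kr"&gt;catch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="bp"&gt;$true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;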
&lt;/span&gt;&lt;span class="nv"&gt;$badRules&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ForEach-Object&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$acl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;RemoveAccessRule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Out-Null&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Set-Acl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-AclObject&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$acl&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;After this, &lt;code&gt;ssh spark-XXXX.local&lt;/code&gt; connects on the first try (replace XXXX with your hostname).&lt;/p&gt;

&lt;h3&gt;
  
  
  Commands to check DGX specs
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# GPU&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;nvidia-smi
NVIDIA-SMI 580.142    Driver Version: 580.142    CUDA Version: 13.0
GPU 0: NVIDIA GB10    36C    P8    4W / N/A

&lt;span class="c"&gt;# OS&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt;
Linux spark-XXXX 6.17.0-1014-nvidia ... aarch64 GNU/Linux

&lt;span class="c"&gt;# Memory&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;free &lt;span class="nt"&gt;-h&lt;/span&gt;
Mem: 121Gi  2.6Gi  118Gi

&lt;span class="c"&gt;# Storage&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt;
/dev/nvme0n1p2  3.7T  47G  3.5T  2%  /

&lt;span class="c"&gt;# CPU&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;lscpu
Architecture:  aarch64
CPU&lt;span class="o"&gt;(&lt;/span&gt;s&lt;span class="o"&gt;)&lt;/span&gt;:        20
Model name:    Cortex-X925 + Cortex-A725
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Notable bits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CUDA 13.0 (the latest release when I set this up)&lt;/li&gt;
&lt;li&gt;aarch64 (ARM64) architecture — yes, the DGX is ARM&lt;/li&gt;
&lt;li&gt;121Gi reported by &lt;code&gt;free&lt;/code&gt; (this is the machine's 128GB unified memory)&lt;/li&gt;
&lt;li&gt;20 cores in big.LITTLE layout (10 performance Cortex-X925 + 10 efficiency Cortex-A725)&lt;/li&gt;
&lt;/ul&gt;
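
&lt;p&gt;If you'd rather not run five commands every time, the checks above can be rolled into one helper. This is only a sketch: &lt;code&gt;spec_summary&lt;/code&gt; is a name I made up, it assumes a Linux host with &lt;code&gt;/proc/meminfo&lt;/code&gt;, and it reads the memory total from there instead of parsing &lt;code&gt;free -h&lt;/code&gt;.&lt;/p&gt;

```shell
# One-line hardware summary built from the same kind of checks as above.
# Assumes a Linux host with /proc/meminfo; nvidia-smi is optional.
spec_summary() {
    arch="$(uname -m)"
    cpus="$(nproc)"
    # MemTotal is reported in kB; 1048576 kB = 1 GiB
    mem="$(awk '/^MemTotal:/ {printf "%.0fGi", $2 / 1048576}' /proc/meminfo)"
    gpu="n/a"
    # command -v prints the path only when the tool exists
    if command -v nvidia-smi | grep -q .; then
        gpu="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)"
    fi
    printf 'arch=%s cpus=%s mem=%s gpu=%s\n' "$arch" "$cpus" "$mem" "$gpu"
}

spec_summary
```

&lt;p&gt;On a machine without an NVIDIA driver the GPU field simply reads &lt;code&gt;n/a&lt;/code&gt;, so the same helper works on a laptop for comparison.&lt;/p&gt;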

&lt;h3&gt;
  
  
  ComfyUI installation steps
&lt;/h3&gt;

&lt;p&gt;I followed the official NVIDIA &lt;a href="https://build.nvidia.com/spark" rel="noopener noreferrer"&gt;ComfyUI playbook&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Virtual environment&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv comfyui-env
&lt;span class="nb"&gt;source &lt;/span&gt;comfyui-env/bin/activate

&lt;span class="c"&gt;# PyTorch with CUDA 13.0&lt;/span&gt;
pip3 &lt;span class="nb"&gt;install &lt;/span&gt;torch torchvision &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cu130

&lt;span class="c"&gt;# ComfyUI itself&lt;/span&gt;
git clone https://github.com/comfyanonymous/ComfyUI.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ComfyUI
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Model (SD 1.5, ~2GB)&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;models/checkpoints/
wget https://huggingface.co/Comfy-Org/stable-diffusion-v1-5-archive/resolve/main/v1-5-pruned-emaonly-fp16.safetensors

&lt;span class="c"&gt;# Launch server&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/ComfyUI
python main.py &lt;span class="nt"&gt;--listen&lt;/span&gt; 0.0.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Key packages installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;torch 2.11.0+cu130&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cuDNN 9.19&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;NCCL 2.28&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;transformers 5.7.0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;comfyui-frontend-package 1.42.15&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open &lt;code&gt;http://spark-XXXX.local:8188&lt;/code&gt; from your Windows PC's browser to access ComfyUI (XXXX is your hostname).&lt;/p&gt;
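
&lt;p&gt;If the page doesn't load, it helps to check from a terminal whether the server is answering at all. A minimal sketch, assuming &lt;code&gt;curl&lt;/code&gt; is available and that ComfyUI's root page answers with HTTP 200; &lt;code&gt;comfy_url&lt;/code&gt; and &lt;code&gt;comfy_up&lt;/code&gt; are hypothetical helper names, not part of ComfyUI.&lt;/p&gt;

```shell
# Build the ComfyUI URL for a host; 8188 is the port used in this post.
comfy_url() {
    printf 'http://%s:%s/\n' "$1" "${2:-8188}"
}

# Succeed only if the server answers with HTTP 200 within 5 seconds.
# curl reports 000 when it cannot connect, so that case fails too.
comfy_up() {
    code="$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$(comfy_url "$@")")"
    [ "$code" = "200" ]
}

# Usage (replace XXXX with your hostname):
#   comfy_up spark-XXXX.local || echo "ComfyUI is not reachable yet"
```

&lt;p&gt;A failure here while the server log looks healthy usually points at name resolution or a firewall, not at ComfyUI itself.&lt;/p&gt;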
&lt;h3&gt;
  
  
  Download speed
&lt;/h3&gt;

&lt;p&gt;The 2GB model came down from HuggingFace's CDN in 50 seconds at 40.6 MB/s. That's about 325 Mbps, roughly a third of what my home 1Gbps LAN can carry.&lt;/p&gt;
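
&lt;p&gt;To sanity-check how a download rate compares with the link speed, here is a small conversion sketch. &lt;code&gt;rate_report&lt;/code&gt; is a hypothetical helper; it assumes decimal units (1 MB/s = 8 Mbit/s) and defaults to a 1000 Mbit/s link.&lt;/p&gt;

```shell
# Convert a transfer rate in MB/s to Mbit/s and to a share of a link speed.
# The second argument is the link speed in Mbit/s (default 1000, i.e. 1 Gbps).
rate_report() {
    rate_mb_s="$1"
    link_mbps="${2:-1000}"
    awk -v r="$rate_mb_s" -v l="$link_mbps" 'BEGIN {
        mbps = r * 8
        printf "%.1f Mbit/s (%.0f%% of a %d Mbit/s link)\n", mbps, mbps / l * 100, l
    }'
}

rate_report 40.6    # the rate measured for the model download
```

&lt;p&gt;Pass a second argument to compare against a different link, for example &lt;code&gt;rate_report 40.6 500&lt;/code&gt;.&lt;/p&gt;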




&lt;h2&gt;
  
  
  Tomorrow: Day 2
&lt;/h2&gt;

&lt;p&gt;Day 2 plan: Train a LoRA on photos of my actual cat.&lt;/p&gt;

&lt;p&gt;Today's SD 1.5 only knows "some cat from somewhere". With LoRA fine-tuning, I should be able to teach it about my specific cat. That kind of personalization feels like the killer feature of running locally.&lt;/p&gt;




&lt;p&gt;#100ExperimentsWithDGX #LocalLLM&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>comfyui</category>
    </item>
  </channel>
</rss>
