<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: PEPPERCORN</title>
    <description>The latest articles on DEV Community by PEPPERCORN (@peppercorn_llm).</description>
    <link>https://dev.to/peppercorn_llm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3910738%2F8084bbca-3641-4d19-85b2-f53a184e1f84.jpg</url>
      <title>DEV Community: PEPPERCORN</title>
      <link>https://dev.to/peppercorn_llm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/peppercorn_llm"/>
    <language>en</language>
    <item>
      <title>[Day 11] I turned my cat into anime art — and the AI drew a human girl instead. One photo through IPAdapter pulls it back to a cat</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Thu, 04 Jun 2026 04:13:23 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-11-i-turned-my-cat-into-anime-art-and-the-ai-drew-a-human-girl-instead-one-photo-through-4dkp</link>
      <guid>https://dev.to/peppercorn_llm/day-11-i-turned-my-cat-into-anime-art-and-the-ai-drew-a-human-girl-instead-one-photo-through-4dkp</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Day 11! Back to cats 🐱&lt;/p&gt;

&lt;p&gt;I took one photo of my cat (a black-and-white tuxedo boy) as a reference and had AI restyle him into anime, ukiyo-e, oil painting, and more.&lt;/p&gt;

&lt;p&gt;The goal: change only the style while keeping "my cat" recognizable. But left alone, the AI started drawing humans instead of a cat. Here's what I did, step by step.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What I used: my home AI machine (DGX Spark) + an image-generation tool (ComfyUI) + one photo of my cat.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The reference is this one photo
&lt;/h2&gt;

&lt;p&gt;A tomcat my family looks after for me, with yellow eyes and a slightly grumpy look.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8jxkykpjz4zow2i2hh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8jxkykpjz4zow2i2hh0.png" alt="The reference photo of my cat" width="768" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Love that face. I'll turn him into various styles while keeping him recognizable as "my cat."&lt;/p&gt;




&lt;h2&gt;
  
  
  First, anime from text alone → a human
&lt;/h2&gt;

&lt;p&gt;I started with no photo, just text: "a tuxedo cat, anime key visual." I clearly said &lt;em&gt;cat&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52m7gp8s46ajvka5ha2x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52m7gp8s46ajvka5ha2x.png" alt="Anime from text alone, no reference photo" width="768" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what came out. …A human girl.&lt;/p&gt;

&lt;p&gt;Black hair, white collar. My cat's tuxedo pattern (black body, white chest) turned straight into clothing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Next, I added the reference photo → still human
&lt;/h2&gt;

&lt;p&gt;So I handed over the cat photo as a &lt;em&gt;visual reference&lt;/em&gt;. The tool that applies it is IPAdapter.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What's the reference-photo trick (IPAdapter)?&lt;/strong&gt; A tool that lets you pass a reference &lt;em&gt;image&lt;/em&gt;, separate from the text prompt, and say "make it look like this." It's what preserves my cat's colors and face.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Surely &lt;em&gt;this&lt;/em&gt; makes it a cat… nope. Still human.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far610w7ixrcand0etspy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far610w7ixrcand0etspy.png" alt="Even with the reference photo added, still a human" width="768" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this habit wasn't limited to anime. Ask the same anime-style model for ukiyo-e or oil painting, and you still get anime-ish humans. It hijacks not just the subject (the cat), but the art style too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ccueo6n78a1bm7x221m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ccueo6n78a1bm7x221m.png" alt="Ask for ukiyo-e or oil painting, you still get anime-style humans" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Left: an "ukiyo-e" that's really an anime woman in a kimono. Right: an "oil painting" that's an anime woman in a tuxedo. Both are "humans painted in the cat's colors."&lt;/p&gt;




&lt;h2&gt;
  
  
  I tuned the settings → finally a cat
&lt;/h2&gt;

&lt;p&gt;On top of the photo, I turned up its strength and added "don't draw humans" to the negatives (details below). That's when it finally became a sitting cat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsge68cvgiaxp3hcaf9z9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsge68cvgiaxp3hcaf9z9.png" alt="Photo plus tuned settings finally gives a cat" width="768" height="768"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why does it turn into a human?
&lt;/h2&gt;

&lt;p&gt;Two reasons, as far as I can tell.&lt;/p&gt;

&lt;p&gt;One: anime-savvy models tend to draw people, girls especially. Even with "cat" in the prompt, they drift toward a human if you let them.&lt;/p&gt;

&lt;p&gt;Two: my cat's pose. He sits bolt upright, almost like a person, so the harder you push the reference, the more that upright posture rides along, tipping the result toward an "anthropomorphized" cat. The pop-art piece in the gallery below shows exactly that leftover.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cyberpunk flipped to a cat with the photo alone
&lt;/h2&gt;

&lt;p&gt;The interesting part: whether the photo alone was enough depended on the model. Anime was stubborn and needed tuning, but cyberpunk became a cat just by adding the photo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4or2x14ierhix5j9bc6c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4or2x14ierhix5j9bc6c.png" alt="Cyberpunk: same prompt, with vs. without the reference photo" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Left (no reference): a human man in a neon city. Right (with reference): a cat with glowing ears.&lt;/p&gt;

&lt;p&gt;I didn't change a single character of the prompt — the photo being there or not is the only difference between human and cat.&lt;/p&gt;




&lt;h2&gt;
  
  
  The styles that came out
&lt;/h2&gt;

&lt;p&gt;Here's the gallery after the human problem was fixed — all with the reference photo, my cat as the base.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9glp3evvd2l40s34qbxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9glp3evvd2l40s34qbxy.png" alt="Gallery of 7 styles" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Top row, left to right: anime, ukiyo-e, oil painting (Van Gogh-ish), stained glass. Bottom row: cyberpunk, 3D (Pixar-ish), pop art.&lt;/p&gt;




&lt;h2&gt;
  
  
  "Likeness" and "style" are a tug-of-war
&lt;/h2&gt;

&lt;p&gt;The oddly realistic 3D (Pixar-ish) result shows this trade-off nicely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F043tslc6hn8dqxw4awbc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F043tslc6hn8dqxw4awbc.png" alt="3D style: without (left) and with (right) the reference" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Left (no reference): a cute 3D cat, but "some cat." Right (with reference): it becomes my cat's face, but the 3D look washes out into basically a real photo.&lt;/p&gt;

&lt;p&gt;Weaken the reference and the style shows, but it's a different cat; strengthen it and it's my cat, but the style fades. Finding the right balance for each style is what the tuning really comes down to.&lt;/p&gt;




&lt;h2&gt;
  
  
  The boss I couldn't beat: storybook watercolor
&lt;/h2&gt;

&lt;p&gt;"Gentle storybook watercolor" was the one style I never got to be a cat. Here's the result of seven retries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtbx4k3etvbyoxdl385o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtbx4k3etvbyoxdl385o.png" alt="Storybook failures: a person, two cats, a cat-girl" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A human, then somehow two cats, then a cat-eared girl holding a cat. "Single + watercolor + cat" just wouldn't line up: lower the reference strength and I got a human; raise it and I got two cats. My guess is that "storybook" is soaked in human imagery. I'm carrying this one over to a later attempt.&lt;/p&gt;




&lt;h2&gt;
  
  
  The details
&lt;/h2&gt;

&lt;p&gt;Here are the details.&lt;/p&gt;

&lt;h3&gt;
  
  
  The reference-photo mechanism (IPAdapter)
&lt;/h3&gt;

&lt;p&gt;I added a custom node called &lt;code&gt;ComfyUI_IPAdapter_plus&lt;/code&gt; to ComfyUI. It lets you hand over a reference image as a "visual guide," separate from the text prompt.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model used: &lt;code&gt;ip-adapter_sd15&lt;/code&gt; (44.6MB, from h94/IP-Adapter)&lt;/li&gt;
&lt;li&gt;The part that reads the image features: &lt;code&gt;CLIP-ViT-H&lt;/code&gt; (reused an existing one)&lt;/li&gt;
&lt;li&gt;The reference photo is cropped to a 768px square before handing it over&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A number called the "reference strength (weight)" controls how closely it mimics. I moved between roughly 0.7 and 0.85 depending on the style.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I did to suppress the "human" problem
&lt;/h3&gt;

&lt;p&gt;My first attempts used weight 0.7 together with prompt words like "key visual" and "big eyes," a combination that strongly invited humans. Three fixes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Raise the reference strength to 0.85&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;human, girl, person, 1girl, humanoid&lt;/code&gt; to the "things I don't want drawn" list&lt;/li&gt;
&lt;li&gt;Strip human-summoning words from the request and emphasize &lt;code&gt;tuxedo cat, full body, animal&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That corrected anime, ukiyo-e, and oil painting into cats. One catch: the phrase "tuxedo cat" itself tends to put an actual tuxedo (a suit) on the cat, so it cut both ways.&lt;/p&gt;
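&lt;p&gt;The three fixes above can be sketched as a small prompt-cleaning helper. This is only an illustration: the word lists come from this article, but the function names are hypothetical, and putting the cat terms first is my stand-in for "emphasize" (the real workflow may apply emphasis differently).&lt;/p&gt;

```python
# Hypothetical helpers; only the word lists are taken from the article.
HUMAN_SUMMONING = ("key visual", "big eyes")
HUMAN_NEGATIVES = ("human", "girl", "person", "1girl", "humanoid")
CAT_EMPHASIS = ("tuxedo cat", "full body", "animal")


def clean_prompt(prompt: str) -> str:
    """Strip human-summoning phrases and lead with the cat terms."""
    kept = []
    for part in prompt.split(","):
        text = part.strip()
        for phrase in HUMAN_SUMMONING:
            text = text.replace(phrase, "").strip()
        # Skip empty leftovers and anything that repeats an emphasis term.
        if text and not any(term in text for term in CAT_EMPHASIS):
            kept.append(text)
    return ", ".join(list(CAT_EMPHASIS) + kept)


def add_human_negatives(negative: str) -> str:
    """Append the human-related terms to an existing negative prompt."""
    parts = [p.strip() for p in negative.split(",") if p.strip()]
    for term in HUMAN_NEGATIVES:
        if term not in parts:
            parts.append(term)
    return ", ".join(parts)


print(clean_prompt("a tuxedo cat, anime key visual, big eyes"))
print(add_human_negatives("lowres, blurry"))
```

&lt;p&gt;The reference strength itself (0.7 raised to 0.85) is a separate number on the IPAdapter node, so it is not part of this text-only sketch.&lt;/p&gt;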

&lt;h3&gt;
  
  
  The base models I used
&lt;/h3&gt;

&lt;p&gt;I switched the underlying image model by style.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anime / illustration: &lt;code&gt;AnythingV5&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Realistic / 3D: &lt;code&gt;Realistic Vision V6&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Plain base: &lt;code&gt;SD 1.5&lt;/code&gt; (base)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When storybook failed, switching to the plain base gave a real cat but weak watercolor feel, and raising the strength split it into two cats — a real bind. The base model's "habits" matter a lot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common generation settings
&lt;/h3&gt;

&lt;p&gt;Across all styles: 768px, 30 steps, sampler &lt;code&gt;dpmpp_2m karras&lt;/code&gt;, cfg 7, seed fixed at 110011. I varied only the text prompt and the reference strength and kept everything else equal for a fair comparison. A small script I wrote sends each generation job to ComfyUI.&lt;/p&gt;
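&lt;p&gt;One way to keep a comparison like this fair is to freeze the shared settings in one place and let only the prompt and the reference strength vary. The sketch below assumes that design; the &lt;code&gt;Job&lt;/code&gt; class and &lt;code&gt;make_jobs&lt;/code&gt; are hypothetical names, and it does not talk to ComfyUI at all.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """One generation request. Defaults mirror the shared settings above."""
    prompt: str
    ip_weight: float          # reference strength passed to IPAdapter
    width: int = 768
    height: int = 768
    steps: int = 30
    sampler: str = "dpmpp_2m"
    scheduler: str = "karras"
    cfg: float = 7.0
    seed: int = 110011


def make_jobs(prompts, weights):
    """Cross every prompt with every reference strength; nothing else changes."""
    return [Job(prompt=p, ip_weight=w) for p in prompts for w in weights]


for job in make_jobs(["anime", "ukiyo-e"], [0.7, 0.85]):
    print(job)
```

&lt;p&gt;Because the dataclass is frozen, a job cannot quietly drift away from the shared seed or step count after it has been created.&lt;/p&gt;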




&lt;h2&gt;
  
  
  Next up
&lt;/h2&gt;

&lt;p&gt;Next time it's cats again — and this time I'm planning video generation 🐱&lt;/p&gt;

&lt;p&gt;#100ExperimentsWithDGX #LocalLLM&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>stablediffusion</category>
    </item>
    <item>
      <title>[Day 10] Building my own personal weather officer AI, and teaching it my body's sense of cold over the next 100 days</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Mon, 01 Jun 2026 01:31:58 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-10-building-my-own-personal-weather-officer-ai-and-teaching-it-my-bodys-sense-of-cold-over-26d3</link>
      <guid>https://dev.to/peppercorn_llm/day-10-building-my-own-personal-weather-officer-ai-and-teaching-it-my-bodys-sense-of-cold-over-26d3</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Day 10!&lt;/p&gt;

&lt;p&gt;This time I'm starting a longer-running experiment. Meet the "weather officer AI" — a bot that texts me every morning saying "wear this today." The plan is to build a weather assistant that's tuned to &lt;em&gt;me&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;What I'm building today is just v0.1 (the very first version). From here through Day 100, I'll keep teaching it "too cold / just right / too warm" every morning, so it gradually learns my preferences. The experiment is: how smart does it get after 100 days?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What I used: my home AI machine (DGX Spark) + free weather data + a phone messaging app (Telegram)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Today's task
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I wanted
&lt;/h3&gt;

&lt;p&gt;I live somewhere with a big daily temperature swing, and "what do I wear today?" is a small but real daily headache. Weather apps tell you the temperature, but whether &lt;em&gt;I&lt;/em&gt; feel cold is a different question.&lt;/p&gt;

&lt;p&gt;So the starting point was: can I build a clothing AI that's tuned to "how I feel," not just "the temperature"?&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach
&lt;/h3&gt;

&lt;p&gt;I kept the design dead simple.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every morning at 7, automatically fetch today's weather&lt;/li&gt;
&lt;li&gt;Decide "this morning's outfit" from the apparent temperature and push it to my phone&lt;/li&gt;
&lt;li&gt;I just tap back "cold / just right / warm"&lt;/li&gt;
&lt;li&gt;As these feelings pile up, the AI learns "this person runs cold" and corrects its suggestions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The goal this time
&lt;/h3&gt;

&lt;p&gt;Not a "perfect forecast AI," but a "routine I can actually keep up every day." The smarts get grown over the next 100 days. Today is just laying the rails.&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 How much does the temperature actually move in a day?
&lt;/h2&gt;

&lt;p&gt;Before building anything, I pulled a week of apparent temperatures for where I live and graphed it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgciz2ogzi4dvc7meb2cl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgciz2ogzi4dvc7meb2cl.png" alt="Apparent temperature through the day" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A beautiful zigzag. Every day repeats "cold in the morning → way up by midday → down again at night." The average daily swing is 13°C, and on the biggest day it moved 20°C.&lt;/p&gt;

&lt;p&gt;So I narrowed the suggestion down to "one outfit, at 7 a.m., matched to the apparent temperature at that hour." I record just once in the morning too. To keep something up for 100 days, simplicity matters most.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The AI doesn't learn the temperature swing itself. But by deciding to "focus on the morning," the suggestion and the feedback line up in time, so later I can cleanly check "was the morning suggestion right?"&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🔧 How it works (the morning loop)
&lt;/h2&gt;

&lt;p&gt;The finished weather officer runs on this loop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9b01emkwxk15l19ksyh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9b01emkwxk15l19ksyh6.png" alt="The weather officer's morning loop" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At 7 a.m. it grabs the weather, decides the outfit, and pushes a notification. I tap back my feeling, and that gets recorded. Once those records pile up, step 5 — learning "you run cold / warm" — kicks in, and the suggestions gradually become mine.&lt;/p&gt;

&lt;p&gt;Here's what the actual notification looks like.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;☀️ Weather Officer AI — good morning

👕 This morning's outfit: long sleeves

This morning feels like: 13°C (highs up to 20°C today)
Rain: 4%  /  Wind: 13 km/h

How does it feel this morning? ↓ tap to tell me
   [🥶 cold]  [😊 just right]  [🥵 warm]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It suggests one outfit, but adds "highs up to 20°C today" so I can decide whether to throw on a layer myself. Tapping a button changes it to "✅ recorded," and the feeling is saved to my home AI.&lt;/p&gt;

&lt;p&gt;The notifications go through the messaging app I already use (Telegram).&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ The details
&lt;/h2&gt;

&lt;p&gt;Below are the specifics.&lt;/p&gt;

&lt;h3&gt;
  
  
  The weather data
&lt;/h3&gt;

&lt;p&gt;Weather comes from &lt;a href="https://open-meteo.com/" rel="noopener noreferrer"&gt;Open-Meteo&lt;/a&gt;, a free weather API. No API key needed, historical data available, and commercial use is OK (CC BY 4.0) — very generous.&lt;/p&gt;

&lt;p&gt;I mostly use "apparent temperature" — not the raw air temperature, but a number adjusted for wind and humidity to reflect how it actually feels, which is better for deciding what to wear. I take the average apparent temperature from 7–9 a.m. as "this morning's feel." The coordinates stay only on my home machine; I don't write the specific location in the article or the code.&lt;/p&gt;
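&lt;p&gt;A minimal sketch of the "average apparent temperature from 7 to 9 a.m." step, assuming the hourly response layout of Open-Meteo's forecast endpoint (parallel &lt;code&gt;time&lt;/code&gt; and &lt;code&gt;apparent_temperature&lt;/code&gt; lists). The function names are hypothetical, and treating the window as the 07:00, 08:00 and 09:00 readings is my assumption.&lt;/p&gt;

```python
from statistics import mean
from urllib.parse import urlencode

API = "https://api.open-meteo.com/v1/forecast"


def forecast_url(lat: float, lon: float) -> str:
    """Build the request URL for one day of hourly apparent temperature."""
    query = urlencode({
        "latitude": lat,
        "longitude": lon,
        "hourly": "apparent_temperature",
        "timezone": "auto",
        "forecast_days": 1,
    })
    return f"{API}?{query}"


def morning_feel(hourly: dict, hours=(7, 8, 9)) -> float:
    """Average the apparent temperature over the chosen morning hours."""
    temps = [
        temp
        for stamp, temp in zip(hourly["time"], hourly["apparent_temperature"])
        if int(stamp[11:13]) in hours  # stamps look like 2026-06-01T07:00
    ]
    return round(mean(temps), 1)
```

&lt;p&gt;In use, the &lt;code&gt;hourly&lt;/code&gt; part of the JSON response would be passed to &lt;code&gt;morning_feel&lt;/code&gt;, with the real coordinates read from a local file so they never appear in the code.&lt;/p&gt;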

&lt;h3&gt;
  
  
  The clothing rule
&lt;/h3&gt;

&lt;p&gt;A plain rule that splits apparent temperature into 7 bands, each mapped to an outfit (e.g. 13–20°C → long sleeves, 20–26°C → short sleeves).&lt;/p&gt;

&lt;p&gt;There's one "personal offset" number baked in. It's zero for now (v0.1). Going forward, if "cold" keeps coming back, I'll push the offset negative so the same temperature suggests warmer clothes — growing it from the feedback.&lt;/p&gt;
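&lt;p&gt;The band rule plus the personal offset might look roughly like this. Only the 13 to 20°C (long sleeves) and 20 to 26°C (short sleeves) bands come from this article; every other edge and outfit name is a made-up placeholder, and &lt;code&gt;suggest&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
from bisect import bisect_right

# Six edges give seven bands. Only 13, 20 and 26 are from the article;
# the remaining edges and outfit names are placeholders.
EDGES = [0, 6, 13, 20, 26, 31]
OUTFITS = [
    "heavy coat", "coat", "jacket", "long sleeves",
    "short sleeves", "light short sleeves", "tank top",
]


def suggest(apparent_c: float, offset_c: float = 0.0) -> str:
    """Pick an outfit; a negative offset makes the same reading count as colder."""
    return OUTFITS[bisect_right(EDGES, apparent_c + offset_c)]


print(suggest(13))               # v0.1: offset is zero
print(suggest(13, offset_c=-2))  # after repeated "cold" feedback
```

&lt;p&gt;With a zero offset, 13°C falls in the long-sleeves band, matching the sample notification. An offset of -2 shifts the same reading one band colder, which is how repeated "cold" taps would turn into warmer suggestions.&lt;/p&gt;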

&lt;h3&gt;
  
  
  The notification and buttons
&lt;/h3&gt;

&lt;p&gt;The notification uses a Telegram "Bot" (a thing that sends messages automatically), built with the &lt;code&gt;python-telegram-bot&lt;/code&gt; library to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;send a message every morning at 7 a.m.&lt;/li&gt;
&lt;li&gt;attach three buttons below it and record which one is pressed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This bot sits waiting inside my home AI machine and fires in the morning.&lt;/p&gt;
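&lt;p&gt;For reference, here is roughly what a message with three buttons looks like at the Telegram Bot API level: a &lt;code&gt;sendMessage&lt;/code&gt; payload whose &lt;code&gt;reply_markup&lt;/code&gt; holds an &lt;code&gt;inline_keyboard&lt;/code&gt;. The &lt;code&gt;python-telegram-bot&lt;/code&gt; library wraps this in its own classes, so this plain-dict version is a sketch of the shape, not the bot's actual code. The &lt;code&gt;callback_data&lt;/code&gt; values and the function name are assumptions.&lt;/p&gt;

```python
# (button label, callback_data) pairs; the callback values are placeholders.
FEELINGS = [("🥶 cold", "cold"), ("😊 just right", "ok"), ("🥵 warm", "warm")]


def build_payload(chat_id, outfit, feels_like_c, high_c, rain_pct, wind_kmh):
    """Return a dict shaped like a Bot API sendMessage request body."""
    text = (
        f"👕 This morning's outfit: {outfit}\n\n"
        f"This morning feels like: {feels_like_c}°C "
        f"(highs up to {high_c}°C today)\n"
        f"Rain: {rain_pct}%  /  Wind: {wind_kmh} km/h\n\n"
        "How does it feel this morning?"
    )
    # One row containing three buttons.
    row = [{"text": label, "callback_data": value} for label, value in FEELINGS]
    return {
        "chat_id": chat_id,
        "text": text,
        "reply_markup": {"inline_keyboard": [row]},
    }


payload = build_payload(1, "long sleeves", 13, 20, 4, 13)
print(payload["text"])
```

&lt;p&gt;When a button is tapped, Telegram sends back a callback query carrying the matching &lt;code&gt;callback_data&lt;/code&gt;, which is the value that would get written to the log.&lt;/p&gt;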

&lt;h3&gt;
  
  
  The shape of the feeling log (the 100-day foundation)
&lt;/h3&gt;

&lt;p&gt;Each record is one line per day, with these fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;date&lt;/li&gt;
&lt;li&gt;that day's forecast (morning/daytime apparent temperature, wind, rain chance)&lt;/li&gt;
&lt;li&gt;the outfit the AI suggested&lt;/li&gt;
&lt;li&gt;my feeling (cold / just right / warm)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With every record kept in this consistent shape, I can later graph the "monthly hit rate" or "my personal bias." I'll report this trend in the Day 21 / 36 / 54 / 74 / 87 check-ins.&lt;/p&gt;
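&lt;p&gt;A one-line-per-day log maps naturally onto JSON Lines. The sketch below assumes that format; the field names, the file layout, and the definition of "hit rate" as the share of "just right" answers are all my assumptions, not details from the actual bot.&lt;/p&gt;

```python
import json
from collections import Counter
from pathlib import Path


def append_record(path, date, forecast, outfit, feeling):
    """Append one day's record as a single JSON line."""
    record = {
        "date": date,
        "forecast": forecast,   # e.g. morning/daytime feel, wind, rain chance
        "outfit": outfit,
        "feeling": feeling,     # "cold", "just right" or "warm"
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def hit_rate(path):
    """Share of days answered "just right" (assumed meaning of a hit)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    feelings = Counter(json.loads(line)["feeling"] for line in lines if line)
    total = sum(feelings.values())
    return feelings["just right"] / total if total else 0.0
```

&lt;p&gt;Appending instead of rewriting the file means a crash in the middle of a run can cost at most the current day's line.&lt;/p&gt;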

&lt;h3&gt;
  
  
  Keeping it running
&lt;/h3&gt;

&lt;p&gt;To make sure the 7 a.m. notification fires reliably, I set the bot to launch automatically when the machine boots (via systemd). Even after a reboot it comes back on its own, and the morning notification keeps going.&lt;/p&gt;

&lt;p&gt;The feeling log and the notification settings (the Telegram token, etc.) are all stored only on my home machine — none of it goes anywhere external like GitHub.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 100-day growth plan
&lt;/h2&gt;

&lt;p&gt;I'll keep growing this weather officer across the series — it'll pop up here and there.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Milestone&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Day 10 (now)&lt;/td&gt;
&lt;td&gt;v0.1 done; the recording loop starts turning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Day 21&lt;/td&gt;
&lt;td&gt;First "my personal bias" report from 11 days of data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Day 36&lt;/td&gt;
&lt;td&gt;Graph the monthly hit rate and take a look&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Day 54 / 74 / 87&lt;/td&gt;
&lt;td&gt;Mid-reviews: seasonal changes in feel, how the correction works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Day 100&lt;/td&gt;
&lt;td&gt;The 100-day accuracy trend and the finished version&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Just tap a button every morning. I'm curious how "mine" it'll feel after 100 days.&lt;/p&gt;




&lt;h2&gt;
  
  
  Next time: Day 11
&lt;/h2&gt;

&lt;p&gt;Next time it's a hard pivot back to cats 🐱 I'll convert photos of my cat into picture-book, anime, and photorealistic styles — the theme being whether I can keep "that's-our-cat-ness" while changing only the style.&lt;/p&gt;

&lt;p&gt;#LocalLLM #100ExperimentsWithDGX&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>python</category>
    </item>
    <item>
      <title>[Day 9] A local Japanese sentiment AI (BERT) read 8 years of a LINE chat, and the ups and downs surfaced from numbers alone</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Fri, 29 May 2026 22:39:10 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-9-a-local-japanese-sentiment-ai-bert-read-8-years-of-a-line-chat-and-the-ups-and-downs-4951</link>
      <guid>https://dev.to/peppercorn_llm/day-9-a-local-japanese-sentiment-ai-bert-read-8-years-of-a-line-chat-and-the-ups-and-downs-4951</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Day 9. Today is less about model internals and more of a personal experiment: have a local AI analyze the entire chat history with one LINE friend. (LINE is the dominant messaging app in Japan.)&lt;/p&gt;

&lt;p&gt;When I exported it, 8 years were sitting there — from the very first message to today. It started, we talked a lot, it went quiet for a while, then picked up again. That whole arc is in there.&lt;/p&gt;

&lt;p&gt;Because the content is what it is, nothing left my machine: everything ran locally on my DGX Spark.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What I used: my home AI box (DGX Spark) + a Japanese sentiment model (for tone) + a bigger local model (to guess events from numbers).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Today's setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I wanted to do
&lt;/h3&gt;

&lt;p&gt;Re-reading 8 years of messages one by one isn't realistic. So instead of reading the content, I looked only at the "shape" of the conversation — when, how much, and in what tone we talked.&lt;/p&gt;

&lt;p&gt;Concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;monthly message volume&lt;/li&gt;
&lt;li&gt;the trend of tone (positive / negative)&lt;/li&gt;
&lt;li&gt;then asking an AI to find "when something big happened"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Heads-up (the result)
&lt;/h3&gt;

&lt;p&gt;From message counts and tone alone, the 8-year arc came out clearly on a chart. Started, went quiet, came back — the flow was visible without me re-reading a thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔧 Pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LINE chat export (text)
        │
        ▼
 1. Parse: split each message into {datetime, who, type, text}
        │   (from here on, message text never leaves the machine)
        ▼
 2. Aggregate: monthly counts, time-of-day, reply gaps
        │
        ▼
 3. Tone scoring: classify each of 66k messages pos/neu/neg
        │
        ▼
 4. Turning-point detection: from sudden changes in the numbers
        │   + also show ONLY the numbers to a bigger AI and ask it to guess
        ▼
 5. Answer check: compare against the real timeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can export a LINE chat as text from the chat screen ("send chat history").&lt;/p&gt;
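&lt;p&gt;Step 1 (parse) could be sketched like this. It assumes an export layout where a date line such as &lt;code&gt;2018/04/01(日)&lt;/code&gt; is followed by tab-separated &lt;code&gt;HH:MM&lt;/code&gt;, sender, and body lines, and where stickers and photos appear as bracketed markers. The real layout can differ by app version and language, so treat the pattern and the marker strings as assumptions; multi-line messages are simply skipped here.&lt;/p&gt;

```python
import re
from datetime import datetime

# Assumed date-header format, e.g. 2018/04/01(日)
DATE_LINE = re.compile(r"^(\d{4}/\d{1,2}/\d{1,2})\(.+\)$")
# Assumed placeholders the export uses for non-text messages.
MARKERS = {"[スタンプ]": "sticker", "[写真]": "photo"}


def parse_export(text):
    """Split an exported chat into {datetime, who, type, text} records."""
    messages, day = [], None
    for line in text.splitlines():
        found = DATE_LINE.match(line.strip())
        if found:
            day = found.group(1)
            continue
        fields = line.split("\t", 2)
        if day is None or len(fields) != 3:
            continue  # header, blank, or continuation line
        clock, who, body = fields
        try:
            when = datetime.strptime(f"{day} {clock}", "%Y/%m/%d %H:%M")
        except ValueError:
            continue  # first field was not a time
        messages.append({
            "datetime": when,
            "who": who,
            "type": MARKERS.get(body, "text"),
            "text": body,
        })
    return messages
```

&lt;p&gt;Everything after this step works on these records, so the raw export never needs to be opened again.&lt;/p&gt;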

&lt;p&gt;Data size:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Span&lt;/td&gt;
&lt;td&gt;~8 years 2 months&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total messages&lt;/td&gt;
&lt;td&gt;87,621&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text messages&lt;/td&gt;
&lt;td&gt;66,329&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stickers&lt;/td&gt;
&lt;td&gt;15,605&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Photos&lt;/td&gt;
&lt;td&gt;3,982&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;15,605 stickers… that's a lot.&lt;/p&gt;

&lt;h3&gt;
  
  
  The two AIs
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;What it sees&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3. Tone&lt;/td&gt;
&lt;td&gt;Japanese sentiment model (&lt;code&gt;koheiduck/bert-japanese-finetuned-sentiment&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;scores each message pos/neu/neg&lt;/td&gt;
&lt;td&gt;66k message texts (scores averaged per month)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4. Turning points&lt;/td&gt;
&lt;td&gt;a bigger local model (&lt;code&gt;Qwen2.5&lt;/code&gt; 72B)&lt;/td&gt;
&lt;td&gt;guesses "what happened to these two?"&lt;/td&gt;
&lt;td&gt;only the per-month table of counts + tone scores (no conversation, no words)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both run locally on my own machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  📊 Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The 8-year arc of volume and tone
&lt;/h3&gt;

&lt;p&gt;This chart is the highlight. Top: monthly message count. Bottom: tone (up = positive, down = negative). The x-axis is months since the conversation started. (Axis labels are in Japanese.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvlmf9tb0ah7rwakmqo3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvlmf9tb0ah7rwakmqo3.png" alt="8-year message volume and tone"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Plotted, it isn't a steady climb or a flat line — it splits cleanly into "chapters": ramp-up → an 8-month silence → a second peak → a stable plateau. Four phases, at a glance.&lt;/p&gt;

&lt;p&gt;Tone has two peaks of about +0.6, around the start and around when things resumed (overall mean ≈ 0, slightly negative in the later years). The interesting part: in the month &lt;em&gt;before&lt;/em&gt; the silence, tone had already dropped to −0.1. The mood dimmed before the volume did.&lt;/p&gt;

&lt;p&gt;There are two dips into negative tone. The one before the silence was an "omen." The other spans the recent years and is not an omen but the effect of logistics-y messages ("what time are you home?") piling up.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Mini-note: how is "tone" turned into a number?&lt;br&gt;
The scoring is done by a Japanese sentiment model. Roughly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fine-tuned on lots of Japanese text labeled positive / negative&lt;/li&gt;
&lt;li&gt;judges with context, not just by spotting keywords&lt;/li&gt;
&lt;li&gt;returns a probability of "positive-ness" / "negative-ness" per message&lt;/li&gt;
&lt;li&gt;I used the difference (positive minus negative) as the per-message score&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
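&lt;p&gt;As a small illustration of that scoring, the sketch below takes per-message class probabilities, computes positive minus negative, and averages per month. The probability values in the example are invented, the dictionary keys are assumed label names, and the model call itself is left out.&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean


def tone_score(probs: dict) -> float:
    """Positive-ness minus negative-ness, so the score runs from -1 to +1."""
    return probs.get("positive", 0.0) - probs.get("negative", 0.0)


def monthly_tone(scored):
    """Average the per-message scores for each month key."""
    buckets = defaultdict(list)
    for month, probs in scored:
        buckets[month].append(tone_score(probs))
    return {m: round(mean(v), 3) for m, v in sorted(buckets.items())}


# Invented example values, not real data from the chat.
example = [
    ("2018-04", {"positive": 0.9, "neutral": 0.05, "negative": 0.05}),
    ("2018-04", {"positive": 0.1, "neutral": 0.8, "negative": 0.1}),
    ("2018-05", {"positive": 0.1, "neutral": 0.2, "negative": 0.7}),
]
print(monthly_tone(example))
```

&lt;p&gt;A mostly neutral message scores close to zero, which is why a run of logistics-style messages pulls a month's average toward the middle.&lt;/p&gt;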

&lt;h3&gt;
  
  
  What kinds of messages scored how?
&lt;/h3&gt;

&lt;p&gt;A few actual judgments (short, name- and place-free one-liners):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Message&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;「楽しかったね！」 (that was fun!)&lt;/td&gt;
&lt;td&gt;Positive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;「これめちゃうまい」 (this is so good)&lt;/td&gt;
&lt;td&gt;Positive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;「おはようございます」 (good morning)&lt;/td&gt;
&lt;td&gt;Neutral&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;「もうお家？」 (home already?)&lt;/td&gt;
&lt;td&gt;Neutral&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;「全く集中できない」 (can't focus at all)&lt;/td&gt;
&lt;td&gt;Negative&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;「それは悔しいな、、」 (that's frustrating…)&lt;/td&gt;
&lt;td&gt;Negative&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(a long trip-planning message)&lt;/td&gt;
&lt;td&gt;Neutral&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(a snappy one-liner sent in a huff)&lt;/td&gt;
&lt;td&gt;Negative&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Plain happy lines score positive; greetings and logistics ("good morning", "home already?") score neutral; tiredness or irritation scores negative. Even long, businesslike planning messages lean neutral.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mornings are when we talk
&lt;/h3&gt;

&lt;p&gt;Message density by weekday × hour (brighter = more).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdnp6m5ynqfxirqyvw9i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdnp6m5ynqfxirqyvw9i.png" alt="weekday × hour density"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A clear concentration at 7–9 a.m.!&lt;/p&gt;

&lt;h3&gt;
  
  
  Could the AI guess the turning points?
&lt;/h3&gt;

&lt;p&gt;First, the simple method: mechanically pick the points where message volume jumped or dropped, then check against the real timeline.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Real event&lt;/th&gt;
&lt;th&gt;Auto-detected timing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;When it started&lt;/td&gt;
&lt;td&gt;exact match&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;When it went quiet&lt;/td&gt;
&lt;td&gt;exact match&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;When it resumed&lt;/td&gt;
&lt;td&gt;exact match&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;When it got lively again&lt;/td&gt;
&lt;td&gt;a few months off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A big life milestone&lt;/td&gt;
&lt;td&gt;hard to detect (barely shows in counts)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sharp volume changes were nailed. But "a big life milestone" got missed. So I showed the &lt;em&gt;same numbers&lt;/em&gt; to the bigger local model and asked "what happened?" — and got back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"around when it started" → roughly matches&lt;/li&gt;
&lt;li&gt;"a stretch of going silent" → matches the quiet period&lt;/li&gt;
&lt;li&gt;"a major life change" → placed just before the real milestone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rather than hunting for a single spike, it reads the whole sequence of numbers as a "flow," so it could pick up even an event that barely moves the counts.&lt;/p&gt;

&lt;h2&gt;
  
  
  💡 Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Volume + tone alone reveal the arc
&lt;/h3&gt;

&lt;p&gt;Counts and tone were enough to see the 8-year shape. Silence marks the quiet stretch; a surge marks the resumption — straight off the chart.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. A local model reads a story out of numbers
&lt;/h3&gt;

&lt;p&gt;Given only monthly numbers, the model inferred even a barely-visible event ("something big around here"), and it lined up with reality. It connects scattered points into one flow.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. A "negative" tone doesn't mean a bad relationship
&lt;/h3&gt;

&lt;p&gt;The slight negative lean in later years isn't about getting along badly. Logistics messages ("what time are you home?") just don't score high. Low score ≠ trouble. It isn't that sentiment analysis is poor — the scores need to be read together with context.&lt;/p&gt;

&lt;h2&gt;
  
  
  🛠️ Technical details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Parsing &amp;amp; aggregation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;LINE export format is a date header plus &lt;code&gt;time&amp;lt;TAB&amp;gt;name&amp;lt;TAB&amp;gt;text&lt;/code&gt;. Multi-line messages (4,987 of them) are merged back into the previous message.&lt;/li&gt;
&lt;li&gt;Speakers normalized to "A / B" by message count (no real names in anything public). Temporary group members and system lines excluded.&lt;/li&gt;
&lt;li&gt;Messages tagged by type (text / sticker / photo / call / unsent…). Tone uses text only; volume counts use all types.&lt;/li&gt;
&lt;li&gt;Aggregation and plotting in Python (pandas / matplotlib).&lt;/li&gt;
&lt;/ul&gt;
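
&lt;p&gt;A minimal sketch of the merge step described above, not the actual script. The date-header pattern (&lt;code&gt;YYYY/MM/DD&lt;/code&gt; at the start of a line), the &lt;code&gt;HH:MM&lt;/code&gt; time pattern, and the name &lt;code&gt;parse_line_export&lt;/code&gt; are assumptions for illustration; real LINE exports vary by app version and locale.&lt;/p&gt;

```python
import re

# Assumed patterns: real LINE exports differ by app version and locale.
DATE_HEADER = re.compile(r"^\d{4}/\d{1,2}/\d{1,2}")
TIME_FIELD = re.compile(r"^\d{1,2}:\d{2}$")


def parse_line_export(lines):
    """Turn raw export lines into message dicts, merging continuation lines."""
    messages = []
    current_date = None
    for raw in lines:
        line = raw.rstrip("\n")
        parts = line.split("\t", 2)
        if len(parts) == 3 and TIME_FIELD.match(parts[0]):
            # A new message: time, speaker, text separated by tabs.
            messages.append({"date": current_date, "time": parts[0],
                             "name": parts[1], "text": parts[2]})
        elif DATE_HEADER.match(line):
            # A date header applies to every message until the next header.
            current_date = DATE_HEADER.match(line).group(0)
        elif messages and line:
            # Anything else is treated as the tail of a multi-line message.
            messages[-1]["text"] += "\n" + line
    return messages


sample = [
    "2024/01/05(Fri)",
    "07:30\tA\tgood morning",
    "07:31\tB\tfirst line",
    "second line",
]
parsed = parse_line_export(sample)
print(len(parsed), repr(parsed[1]["text"]))
```

&lt;p&gt;The sketch skips blank lines and would misread a continuation line that happens to start with a date, so a real parser likely needs stricter header checks.&lt;/p&gt;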

&lt;h3&gt;
  
  
  Tone (sentiment)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;koheiduck/bert-japanese-finetuned-sentiment&lt;/code&gt;, a 3-class (pos / neu / neg) Japanese model.&lt;/li&gt;
&lt;li&gt;66,329 texts scored on GPU in batches; per message I take P(pos) − P(neg) in [−1, +1], then average per month.&lt;/li&gt;
&lt;/ul&gt;
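
&lt;p&gt;The scoring arithmetic above can be sketched as follows. This assumes the class probabilities have already been produced by the model; the function names and the sample numbers are made up for illustration, and months are keyed as "months since the conversation started".&lt;/p&gt;

```python
from collections import defaultdict


def tone_score(p_pos, p_neg):
    """Per-message tone: P(pos) - P(neg), which lies in [-1, +1]."""
    return p_pos - p_neg


def monthly_tone(scored):
    """Average tone per month from (month, p_pos, p_neg) tuples."""
    buckets = defaultdict(list)
    for month, p_pos, p_neg in scored:
        buckets[month].append(tone_score(p_pos, p_neg))
    return {m: sum(v) / len(v) for m, v in sorted(buckets.items())}


# Made-up probabilities; the neutral class only matters through what is
# left over for the positive and negative classes.
scored = [
    (0, 0.75, 0.25),    # clearly positive
    (0, 0.25, 0.25),    # mostly neutral
    (1, 0.125, 0.625),  # leaning negative
]
print(monthly_tone(scored))
```

&lt;p&gt;A mostly neutral message contributes a score near 0, which is why a month full of logistics messages pulls the monthly mean toward zero rather than toward either extreme.&lt;/p&gt;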

&lt;h3&gt;
  
  
  Turning-point detection
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Rule-based: long near-zero stretches (silence), large month-over-month surges, and tone peaks — all from numbers only.&lt;/li&gt;
&lt;li&gt;Plus: the per-month table of counts + tone scores fed to a bigger local model (Qwen2.5-72B via ollama) to guess events. No message text was given.&lt;/li&gt;
&lt;li&gt;Real event dates were kept in a local note only, used for annotation and the answer check.&lt;/li&gt;
&lt;/ul&gt;
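
&lt;p&gt;A rough sketch of two of the rule-based checks (longest near-zero stretch, largest month-over-month surge). The cutoff &lt;code&gt;QUIET_MAX&lt;/code&gt;, the function names, and the monthly counts are all assumptions for illustration; the rules actually used in the experiment may differ.&lt;/p&gt;

```python
from itertools import groupby

QUIET_MAX = 5  # assumed cutoff: a month with at most 5 messages counts as quiet


def longest_quiet_stretch(counts):
    """Return (start_month, length) of the longest run of near-zero months."""
    best = (None, 0)
    month = 0
    for is_quiet, run in groupby(counts, key=lambda c: c in range(QUIET_MAX + 1)):
        length = len(list(run))
        if is_quiet:
            # Keep whichever run is longer; ties keep the earlier run.
            best = max(best, (month, length), key=lambda r: r[1])
        month += length
    return best


def biggest_surge(counts):
    """Return the month with the largest month-over-month ratio."""
    return max(range(1, len(counts)),
               key=lambda m: counts[m] / max(counts[m - 1], 1))


# Made-up monthly counts: ramp-up, silence, resumption.
counts = [40, 300, 900, 700, 2, 0, 0, 1, 0, 650, 800]
print(longest_quiet_stretch(counts), biggest_surge(counts))
```

&lt;p&gt;Both checks look only at counts, which is consistent with the result above: sharp volume changes are easy to catch, while an event that barely moves the counts is invisible to rules like these.&lt;/p&gt;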

&lt;h3&gt;
  
  
  Privacy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Every file containing message text (raw export, parsed data, scores) stays in a non-public folder.&lt;/li&gt;
&lt;li&gt;Only aggregate numbers and charts are published. The chart x-axis is relativized to "months since the conversation started," hiding actual dates.&lt;/li&gt;
&lt;li&gt;Apart from a few short, name- and place-free one-liners shown as scoring examples, no conversation content, real names, specific dates, or long text appears in the article or charts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tomorrow: Day 10
&lt;/h2&gt;

&lt;p&gt;Weather forecasts say one temperature, but everyone &lt;em&gt;feels&lt;/em&gt; it differently. Same degrees, different "do I need a coat?" So next I'm building my own personal "weather officer" AI: from past weather data, it'll tell me each morning something like "coat + beanie today." Over the next 100 days I'll teach it my own sense of cold — the start of a longer project.&lt;/p&gt;

&lt;p&gt;#100ExperimentsWithDGX #LocalLLM&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>privacy</category>
    </item>
    <item>
      <title>[Day 8] Pushing Looped Transformers Beyond Addition: OpenMythos on Bracket-Matching Depth</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Fri, 29 May 2026 06:27:04 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-8-pushing-looped-transformers-beyond-addition-openmythos-on-bracket-matching-depth-4bgd</link>
      <guid>https://dev.to/peppercorn_llm/day-8-pushing-looped-transformers-beyond-addition-openmythos-on-bracket-matching-depth-4bgd</guid>
      <description>&lt;h1&gt;
  
  
  [Day 8] Pushing Looped Transformers Beyond Addition: OpenMythos on Bracket-Matching Depth
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Day 8!&lt;/p&gt;

&lt;p&gt;A direct follow-up to &lt;a href="https://dev.to/peppercorn_llm/day-7-openmythos-loop-debate"&gt;Day 7&lt;/a&gt;: same OpenMythos-style mini model (3.4M params), same training pipeline, &lt;strong&gt;one task change&lt;/strong&gt; — multi-digit addition swapped for nested-bracket parsing. The goal was to ask two follow-up questions Day 7 left open:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Does the "training-time loop count is the peak" finding generalize across tasks?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;If we increase the structural complexity of the input (deeper nesting), does inference-time loop count start to matter?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Tools used: my home AI machine (DGX Spark, GB10) + &lt;a href="https://github.com/kyegomez/OpenMythos" rel="noopener noreferrer"&gt;OpenMythos&lt;/a&gt; (PyTorch reconstruction of the rumored Claude Mythos architecture) + synthetic bracket sequences.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Today's setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why bracket matching?
&lt;/h3&gt;

&lt;p&gt;Day 7's task was 2-5 digit addition. Addition tests "carry propagation from low to high digit" — a fundamentally local, left-to-right state update. To probe whether looped depth helps with a different kind of structural reasoning, I wanted a task where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The output depends on &lt;em&gt;left-to-right state tracking&lt;/em&gt; (rules out attention-based global aggregation shortcuts).&lt;/li&gt;
&lt;li&gt;The task admits an explicit notion of &lt;em&gt;depth&lt;/em&gt; I can vary as a controlled difficulty knob.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bracket matching fits both. The standard linear-time algorithm is push-on-open / pop-on-close with a stack. A model that has internalized that algorithm should scale gracefully with depth — and one that hasn't will visibly fall over.&lt;/p&gt;

&lt;h3&gt;
  
  
  Task: first-break-position prediction
&lt;/h3&gt;

&lt;p&gt;Input: a string of &lt;code&gt;( ) [ ] { }&lt;/code&gt; characters, terminated by &lt;code&gt;=&lt;/code&gt;.&lt;br&gt;
Output: the &lt;strong&gt;left-most position at which the bracket structure breaks&lt;/strong&gt;, as 2 digits, terminated by &lt;code&gt;$&lt;/code&gt;. If the sequence is balanced, output &lt;code&gt;--$&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;((()))=          → --$       (balanced)
([{}])=          → --$       (balanced)
([)]=            → 02$       (')' at position 2 doesn't match preceding '[')
(()(=            → 04$       (stack non-empty at end of string, position = len)
))=              → 00$       (close on empty stack at position 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "break position" is defined by a stack parser scanning left-to-right:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Close bracket whose type ≠ stack top → return that close position.&lt;/li&gt;
&lt;li&gt;Close bracket on empty stack → return that close position.&lt;/li&gt;
&lt;li&gt;End of string with non-empty stack → return &lt;code&gt;len(s)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Otherwise balanced → return &lt;code&gt;-1&lt;/code&gt; (output &lt;code&gt;--&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;
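
&lt;p&gt;The four rules above can be sketched as a small reference parser. This is an illustration rather than the code used in the experiment; the names &lt;code&gt;first_break&lt;/code&gt; and &lt;code&gt;format_answer&lt;/code&gt; are assumptions, and the input is assumed to contain only bracket characters (without the trailing &lt;code&gt;=&lt;/code&gt;).&lt;/p&gt;

```python
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def first_break(s):
    """Return the left-most break position, or -1 if s is balanced.

    Assumes s contains only the six bracket characters.
    """
    stack = []
    for pos, ch in enumerate(s):
        if ch in OPENERS:
            stack.append(ch)
        elif not stack or stack[-1] != PAIRS[ch]:
            # Rules 1 and 2: close on empty stack, or wrong bracket type.
            return pos
        else:
            stack.pop()
    # Rule 3: leftover opens break at end of string; rule 4: balanced.
    return len(s) if stack else -1


def format_answer(pos):
    """Render the target string: two digits plus '$', or '--$'."""
    return "--$" if pos == -1 else f"{pos:02d}$"


for s in ["((()))", "([{}])", "([)]", "(()(", "))"]:
    print(s + "=", format_answer(first_break(s)))
```

&lt;p&gt;Running it on the five examples above reproduces their targets, which makes it a handy ground-truth labeler for generated sequences.&lt;/p&gt;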

&lt;h3&gt;
  
  
  Why not just binary balanced / imbalanced?
&lt;/h3&gt;

&lt;p&gt;That was the original plan. A first smoke run with &lt;code&gt;T&lt;/code&gt;/&lt;code&gt;F&lt;/code&gt; output saturated to 100% accuracy across all depths (up to 10) by step 4,000. Binary output leaves too many shortcut signals (length parity, open/close counts, and so on), so the transformer can reach 100% without ever learning the actual stack algorithm.&lt;/p&gt;

&lt;p&gt;The first-break-position output forces the model to commit to a specific character position, which can only be answered by tracking state left-to-right. After this change, smoke results at 5,000 steps showed clean depth-dependent difficulty (d=2: 100%, d=20: 71%) and the loss had room to keep dropping. That's the signal I needed to study loop-count behavior meaningfully.&lt;/p&gt;

&lt;h3&gt;
  
  
  Difficulty knob: depth
&lt;/h3&gt;

&lt;p&gt;I trained and evaluated across depths &lt;code&gt;{2, 4, 6, 8, 10, 12, 16, 20}&lt;/code&gt;, with pair count capped at &lt;code&gt;min(2 * depth, 50)&lt;/code&gt; so the 2-digit position output stays in range. Balanced and imbalanced sequences mixed 50/50; imbalanced sequences generated by deleting a close (30%), deleting an open (30%), or substituting a bracket (40%).&lt;/p&gt;

&lt;h3&gt;
  
  
  Architectural changes from Day 7
&lt;/h3&gt;

&lt;p&gt;Minimal — only what the new vocab and longer sequences required:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Day 7 (addition)&lt;/th&gt;
&lt;th&gt;Day 8 (brackets)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;vocab_size&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;max_seq_len&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;max_loop_iters (train)&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Difficulty axis&lt;/td&gt;
&lt;td&gt;2-5 digits&lt;/td&gt;
&lt;td&gt;depth 2-20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer tokens&lt;/td&gt;
&lt;td&gt;1-6 (digits + &lt;code&gt;$&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;2 + &lt;code&gt;$&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total params&lt;/td&gt;
&lt;td&gt;3.39M&lt;/td&gt;
&lt;td&gt;3.39M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same &lt;code&gt;MythosConfig&lt;/code&gt; template otherwise. Same hyperparameters (AdamW, max LR 3e-4, warmup 2000, cosine decay, 30k steps, fp32, 4 seeds in parallel).&lt;/p&gt;

&lt;h3&gt;
  
  
  Headline finding
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Day 7 "peak at training loop count" finding generalizes.&lt;/strong&gt; With training &lt;code&gt;max_loop_iters=4&lt;/code&gt;, accuracy again peaks at or right next to T=4 at every depth I tested: T=4 is best at the shallowest and deepest depths, T=8 is 1-3 points ahead at d=6 through d=12, and accuracy falls off on both sides of that window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;But the peak height is much lower.&lt;/strong&gt; Best accuracy was 66% at depth 2; depth 20 caps at ~36%. Day 7 hit 100% at every digit count up to 5; brackets at the same parameter budget plateau 34 or more points short.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inference-time loop extrapolation does NOT improve deep-nesting performance.&lt;/strong&gt; The hypothesis "deeper inputs benefit from more loops" did not reproduce: at d=16 and d=20, T=4 is still the best setting, and T=16 and T=32 are worse than T=4 at every depth, just as in Day 7.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fixed-point reproduced, slightly later.&lt;/strong&gt; Cosine similarity between consecutive hidden states sits at roughly 0.92-0.95 at T=3 and 0.93-0.97 at T=4, and passes 0.99 by T=8. That is later than addition, which reached ~0.95 by T=2.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🪢 The task in pictures
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input:  ( ( [ ) ] ) =
Pos:    0 1 2 3 4 5

stack walk:
  pos 0: '(' → push '('             stack: ( 
  pos 1: '(' → push '('             stack: ( (
  pos 2: '[' → push '['             stack: ( ( [
  pos 3: ')' → top is '[', mismatch!  → first break at position 3

Expected output: 03$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interesting thing about this task vs. addition: the answer can be &lt;strong&gt;anywhere from 0 to ~40&lt;/strong&gt; depending on the input, and the model has to &lt;em&gt;commit to a specific integer&lt;/em&gt;. There's no global-aggregation shortcut — you have to walk left-to-right and remember what you've seen.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔧 Pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OpenMythos tiny (3.4M params, same as Day 7 modulo vocab + max_seq_len)
  ↓
Train 4 seeds in parallel, 30k steps, fp32 on DGX Spark (GB10)
  ↓
Experiment A: greedy autoregressive accuracy
              loops ∈ {1, 2, 4, 8, 16, 32}  ×  depth ∈ {2, 4, 6, 8, 10, 12, 16, 20}
  ↓
Experiment B: cosine similarity between consecutive hidden states
              ⇒ does the recurrent block reach a fixed-point?
              ⇒ does the fixed-point timing depend on depth?
  ↓
Compare against Day 7 (digits) along the same axes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Training throughput note (vs Day 7)
&lt;/h3&gt;

&lt;p&gt;Day 7's 4-seed parallel training was fast because &lt;code&gt;max_seq_len=32&lt;/code&gt; left the GPU underutilized per process. With &lt;code&gt;max_seq_len=128&lt;/code&gt;, a single process already saturates the GB10: running 4 seeds in parallel drops per-process throughput from ~60K tok/s to ~12.8K tok/s (a 79% per-process drop). Aggregate parallel throughput (4 × 12.8K ≈ 51K tok/s) is actually ~15% &lt;em&gt;slower&lt;/em&gt; than running the 4 seeds sequentially at ~60K tok/s.&lt;/p&gt;

&lt;p&gt;I let it run in parallel anyway because it was overnight and I had no other DGX usage scheduled. Worth noting for anyone planning similar replications: longer sequences kill the "free" benefit of multi-seed parallelism on a single GPU.&lt;/p&gt;

&lt;p&gt;GPU draw stayed at 51W / 72°C / 95% utilization throughout — comfortable enough to leave running.&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Experiment A: accuracy heatmap
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F356tfullz8sf5zb06gc5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F356tfullz8sf5zb06gc5.png" alt="accuracy heatmap of bracket-matching across loop counts and depths"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mean exact-match accuracy across 4 seeds, 500 eval samples per condition:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Inference loops&lt;/th&gt;
&lt;th&gt;d=2&lt;/th&gt;
&lt;th&gt;d=4&lt;/th&gt;
&lt;th&gt;d=6&lt;/th&gt;
&lt;th&gt;d=8&lt;/th&gt;
&lt;th&gt;d=10&lt;/th&gt;
&lt;th&gt;d=12&lt;/th&gt;
&lt;th&gt;d=16&lt;/th&gt;
&lt;th&gt;d=20&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.11&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;td&gt;0.02&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;0.02&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.32&lt;/td&gt;
&lt;td&gt;0.20&lt;/td&gt;
&lt;td&gt;0.13&lt;/td&gt;
&lt;td&gt;0.08&lt;/td&gt;
&lt;td&gt;0.08&lt;/td&gt;
&lt;td&gt;0.08&lt;/td&gt;
&lt;td&gt;0.07&lt;/td&gt;
&lt;td&gt;0.07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4 (train)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.66&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.56&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.45&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.44&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.41&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.41&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.36&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0.58&lt;/td&gt;
&lt;td&gt;0.56&lt;/td&gt;
&lt;td&gt;0.51&lt;/td&gt;
&lt;td&gt;0.47&lt;/td&gt;
&lt;td&gt;0.46&lt;/td&gt;
&lt;td&gt;0.44&lt;/td&gt;
&lt;td&gt;0.39&lt;/td&gt;
&lt;td&gt;0.34&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;0.55&lt;/td&gt;
&lt;td&gt;0.51&lt;/td&gt;
&lt;td&gt;0.44&lt;/td&gt;
&lt;td&gt;0.41&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;0.38&lt;/td&gt;
&lt;td&gt;0.36&lt;/td&gt;
&lt;td&gt;0.32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;0.55&lt;/td&gt;
&lt;td&gt;0.48&lt;/td&gt;
&lt;td&gt;0.42&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;0.39&lt;/td&gt;
&lt;td&gt;0.37&lt;/td&gt;
&lt;td&gt;0.36&lt;/td&gt;
&lt;td&gt;0.31&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Peak at or next to T=4 in every depth column.&lt;/strong&gt; Day 7's "loops help only in a narrow window centered on training" finding generalizes: the best cell in every column is T=4 or T=8, and no depth I tested has its best accuracy at T=1, 2, 16, or 32.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Depth scaling is graceful but the ceiling is low.&lt;/strong&gt; Going from d=2 to d=20 at T=4, accuracy degrades smoothly (0.66 → 0.36), but the absolute numbers stay far from saturation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "deeper input ⇒ more loops" hypothesis does not hold.&lt;/strong&gt; I'd hoped to see T=8 or T=16 begin to dominate at d=20, indicating inference-time scaling could rescue depth. Instead, T=4 stays on top at d=16 and d=20, and T=16 and T=32 trail T=4 in every column: the same shape as Day 7's digit-count columns, just stretched lower.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T=8 is unusually competitive at mid-depths.&lt;/strong&gt; At d=4 through d=12, T=8 ties or beats T=4 by up to 3 points (0.56 vs 0.56 at d=4, 0.44 vs 0.41 at d=12). Possibly two adjacent settings of test-time depth around the training value are both near-optimal.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Experiment B: fixed-point analysis
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1e79hbd96ki1x96wt0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1e79hbd96ki1x96wt0l.png" alt="fixed-point cosine similarity curve across loop steps and depths"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mean cosine similarity between consecutive hidden states &lt;code&gt;cos(h_t, h_{t-1})&lt;/code&gt; measured at the first-answer-token position, averaged across 4 seeds, 200 samples per depth:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;t&lt;/th&gt;
&lt;th&gt;d=2&lt;/th&gt;
&lt;th&gt;d=4&lt;/th&gt;
&lt;th&gt;d=8&lt;/th&gt;
&lt;th&gt;d=12&lt;/th&gt;
&lt;th&gt;d=16&lt;/th&gt;
&lt;th&gt;d=20&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.85&lt;/td&gt;
&lt;td&gt;0.89&lt;/td&gt;
&lt;td&gt;0.92&lt;/td&gt;
&lt;td&gt;0.92&lt;/td&gt;
&lt;td&gt;0.93&lt;/td&gt;
&lt;td&gt;0.91&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.91&lt;/td&gt;
&lt;td&gt;0.91&lt;/td&gt;
&lt;td&gt;0.94&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.97&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0.94&lt;/td&gt;
&lt;td&gt;0.94&lt;/td&gt;
&lt;td&gt;0.92&lt;/td&gt;
&lt;td&gt;0.92&lt;/td&gt;
&lt;td&gt;0.94&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.97&lt;/td&gt;
&lt;td&gt;0.96&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.93&lt;/td&gt;
&lt;td&gt;0.93&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0.998&lt;/td&gt;
&lt;td&gt;0.995&lt;/td&gt;
&lt;td&gt;0.998&lt;/td&gt;
&lt;td&gt;0.996&lt;/td&gt;
&lt;td&gt;0.996&lt;/td&gt;
&lt;td&gt;0.992&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;0.9994&lt;/td&gt;
&lt;td&gt;0.9996&lt;/td&gt;
&lt;td&gt;0.9989&lt;/td&gt;
&lt;td&gt;0.9985&lt;/td&gt;
&lt;td&gt;0.9976&lt;/td&gt;
&lt;td&gt;0.9979&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;0.9998&lt;/td&gt;
&lt;td&gt;0.9998&lt;/td&gt;
&lt;td&gt;0.9998&lt;/td&gt;
&lt;td&gt;0.9997&lt;/td&gt;
&lt;td&gt;0.9995&lt;/td&gt;
&lt;td&gt;0.9996&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three things to note:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fixed-point timing is slightly later than Day 7.&lt;/strong&gt; Day 7 reached ~0.95 by T=2; brackets hover around 0.91-0.97 from T=2 through T=4 and only pass 0.99 by T=8. Possibly the more complex left-to-right state needs a beat longer to settle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Depth dependence is small.&lt;/strong&gt; d=20 traces almost on top of d=2, again echoing Day 7 (where digit-count had only marginal effect on fixed-point timing). "Harder problem ⇒ slower fixed-point" did not appear.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hidden state has almost stopped moving by T=8 (cosine above 0.99), yet accuracy keeps falling at T=16 and T=32.&lt;/strong&gt; Same paradox as Day 7: extra loops are computation without information. Either the late-loop perturbations, though small, shift the logits away from a converged answer, or this is purely a distribution-shift artifact of training only at T=4.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
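
&lt;p&gt;For reference, the metric in the table is plain cosine similarity between the hidden state after loop t and the one after loop t-1. A minimal sketch, using toy 2-d vectors in place of the real hidden state at the first answer-token position (the helper names and the trajectory are assumptions for illustration):&lt;/p&gt;

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def consecutive_cosines(states):
    """cos(h_t, h_{t-1}) for t = 1 .. len(states) - 1."""
    return [cosine(states[t], states[t - 1]) for t in range(1, len(states))]


# Toy trajectory that converges toward [1, 1]; not real model states.
states = [[1.0, 0.0], [1.0, 0.5], [1.0, 0.9], [1.0, 1.0], [1.0, 1.0]]
sims = consecutive_cosines(states)
print([round(s, 3) for s in sims])
```

&lt;p&gt;One caveat worth keeping in mind: cosine similarity ignores changes in vector norm, so a value near 1 says the direction has settled, not necessarily that every logit-relevant component has.&lt;/p&gt;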

&lt;h3&gt;
  
  
  Comparison with Day 7
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;Day 7 (addition)&lt;/th&gt;
&lt;th&gt;Day 8 (brackets)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Loop-count peak at T=train (=4)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best accuracy at peak&lt;/td&gt;
&lt;td&gt;100% (all digits)&lt;/td&gt;
&lt;td&gt;66% (d=2), 36% (d=20)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference-time loop extrapolation&lt;/td&gt;
&lt;td&gt;Hurts&lt;/td&gt;
&lt;td&gt;Hurts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cosine fixed-point arrival&lt;/td&gt;
&lt;td&gt;~T=2&lt;/td&gt;
&lt;td&gt;~T=3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Depth/digit dependence on fixed-point&lt;/td&gt;
&lt;td&gt;Small&lt;/td&gt;
&lt;td&gt;Small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training dynamics&lt;/td&gt;
&lt;td&gt;Grokking (sudden phase transition)&lt;/td&gt;
&lt;td&gt;Smooth slow climb&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Day 8 reproduces the &lt;strong&gt;qualitative&lt;/strong&gt; loop-count and fixed-point findings of Day 7. What changes is the training dynamics (a smooth climb instead of grokking) and, above all, the &lt;strong&gt;quantitative ceiling&lt;/strong&gt;: at the same parameter budget and the same training compute, structure-tracking caps far below saturation while addition saturates.&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 Tying back to the three perspectives
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/peppercorn_llm/day-7-openmythos-loop-debate"&gt;Day 7&lt;/a&gt; tested looped transformers against three published views:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Saunshi et al.&lt;/strong&gt; — loops can match deeper fixed-depth networks on algorithmic tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geiping et al. (Huginn)&lt;/strong&gt; — at scale, extra loops give marginal gains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Micheal Bee&lt;/strong&gt; — loops plateau early at small scale (T=2 fixed-point)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Day 8 adds three more data points to the picture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The "peak at training loop count" pattern persists across qualitatively different algorithmic tasks&lt;/strong&gt; (addition vs. bracket parsing). This is consistent with Saunshi's framing but argues against naive depth-extrapolation at inference.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The fixed-point arrives at slightly different times for different tasks.&lt;/strong&gt; Bee's "T=2" appears to be a property of the specific task and training recipe, not a universal property of looped transformers. Brackets need ~T=3-4 to plateau, addition needs ~T=2.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Task structural complexity matters more than loop count.&lt;/strong&gt; At a fixed budget, the ceiling on accuracy is set by something else (model capacity? loss landscape? data efficiency?), not by the number of inference loops. Adding more loops can't compensate.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A useful refinement: &lt;strong&gt;looped transformers carry compute up to a depth bounded by the task's algorithmic complexity and the model's expressive capacity. Beyond that, the hidden state stops moving meaningfully and additional loops are computation without information.&lt;/strong&gt; Day 7 showed this for a task within capacity (addition saturates); Day 8 shows it for a task that bumps against capacity (bracket parsing caps short).&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Technical details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Smoke history (why the task definition changed)
&lt;/h3&gt;

&lt;p&gt;Initial smoke: balanced/imbalanced binary classification, depths 2-10.&lt;br&gt;
Result: 100% accuracy across all depths by step 4,000.&lt;br&gt;
Diagnosis: too many shortcut signals (length parity, open/close count) let the model skip the stack algorithm, even with mutations that should defeat counting shortcuts. The 1-bit output gives the model no incentive to track position-by-position state.&lt;/p&gt;

&lt;p&gt;Second smoke: first-break-position output, depths 2-20.&lt;br&gt;
Result at 5,000 steps: d=2 100%, d=20 71%, with loss still trending down (0.32 → still falling).&lt;br&gt;
Diagnosis: depth-dependent difficulty visible, room to scale training to expose loop-count effects.&lt;/p&gt;

&lt;p&gt;Lesson worth recording: &lt;strong&gt;output information density matters as much as task structure for studying loop behavior&lt;/strong&gt;. A binary classifier with global-aggregation shortcuts is a weak probe of recurrent depth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Config and hyperparameters
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;MythosConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# 6 brackets + '=' + '$' + space + '-' + '0'-'9'
&lt;/span&gt;    &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_heads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_kv_heads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# GQA
&lt;/span&gt;    &lt;span class="n"&gt;max_seq_len&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# Day 7 was 32
&lt;/span&gt;    &lt;span class="n"&gt;max_loop_iters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prelude_layers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;coda_layers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;attn_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gqa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_experts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# MoE FFN inside recurrent block
&lt;/span&gt;    &lt;span class="n"&gt;n_shared_experts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_experts_per_tok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expert_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lora_rank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rope_theta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;10000.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total parameters: &lt;strong&gt;3,394,338&lt;/strong&gt; (~3.4M, matches Day 7 to within rounding).&lt;/p&gt;

&lt;p&gt;Training:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimizer: AdamW, betas (0.9, 0.95), wd 0.1&lt;/li&gt;
&lt;li&gt;LR: max 3e-4, warmup 2000 steps, cosine decay to 1e-5&lt;/li&gt;
&lt;li&gt;Grad clip: 1.0&lt;/li&gt;
&lt;li&gt;Batch size: 128&lt;/li&gt;
&lt;li&gt;Max steps: 30000&lt;/li&gt;
&lt;li&gt;dtype: fp32 (same RoPE-complex-buffer reason as Day 7)&lt;/li&gt;
&lt;li&gt;4 seeds {0, 1, 2, 3} in parallel&lt;/li&gt;
&lt;/ul&gt;
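
&lt;p&gt;The learning-rate bullet above can be read as the schedule below. This is a minimal sketch rather than the actual training script: the linear shape of the warmup, the function name &lt;code&gt;lr_at&lt;/code&gt;, and ending the cosine decay exactly at step 30000 are all assumptions for illustration.&lt;/p&gt;

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=1e-5, warmup_steps=2000, max_steps=30000):
    """Linear warmup to max_lr, then cosine decay to min_lr at max_steps.

    The linear warmup shape is an assumption; the post only states
    "warmup 2000 steps, cosine decay to 1e-5".
    """
    if step in range(warmup_steps):
        # Linear ramp; (step + 1) avoids a zero learning rate at step 0.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


if __name__ == "__main__":
    for step in (0, 1000, 2000, 16000, 30000):
        print(f"step {step:5d}: lr = {lr_at(step):.3e}")
```

&lt;p&gt;With these assumptions the rate reaches 3e-4 at the end of warmup and lands on 1e-5 at the final step.&lt;/p&gt;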

&lt;h3&gt;
  
  
  Data generation
&lt;/h3&gt;

&lt;p&gt;On-the-fly synthetic. For each sample:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sample depth &lt;code&gt;d ∈ {2, 4, 6, 8, 10, 12, 16, 20}&lt;/code&gt; uniformly&lt;/li&gt;
&lt;li&gt;Sample pair count &lt;code&gt;n_pairs ~ U[max(1, d-1), min(2*d, 50)]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Generate balanced parenthesization (random bracket types, nested or sequential)&lt;/li&gt;
&lt;li&gt;With prob 0.5, apply a mutation: delete close (30%), delete open (30%), substitute (40%)&lt;/li&gt;
&lt;li&gt;Compute first-break position with the stack parser; format output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Loss is applied only at positions following &lt;code&gt;=&lt;/code&gt; (i.e., on the 2-digit answer + &lt;code&gt;$&lt;/code&gt;).&lt;/p&gt;
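
&lt;p&gt;The "first-break position with the stack parser" step can be sketched as follows. This is an illustrative sketch, not the code from the linked scripts: the bracket alphabet, the helper name &lt;code&gt;first_break&lt;/code&gt;, and the return convention (index of the offending closer, sequence length for unclosed openers, &lt;code&gt;None&lt;/code&gt; for a balanced sequence) are assumptions.&lt;/p&gt;

```python
# Hypothetical helper: the bracket alphabet and return convention are
# assumptions for illustration, not taken from the author's scripts.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def first_break(seq):
    """Return the index where `seq` stops being a valid prefix of a
    balanced string, len(seq) if it ends with unclosed openers, or
    None if it is fully balanced."""
    stack = []
    for i, ch in enumerate(seq):
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recent unclosed opener.
            if not stack or stack[-1] != PAIRS[ch]:
                return i
            stack.pop()
        else:
            raise ValueError(f"unexpected character {ch!r} at index {i}")
    return len(seq) if stack else None


if __name__ == "__main__":
    print(first_break("([]{})"))  # None: balanced
    print(first_break("([)]"))    # 2: mismatched closer
    print(first_break("(()"))     # 3: unclosed opener at end of input
```

&lt;p&gt;The three mutation types in the list map onto these cases: a deleted opener or a substitution tends to surface as a mismatched closer, while a deleted closer shows up as leftover openers at the end.&lt;/p&gt;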

&lt;h3&gt;
  
  
  Evaluation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Experiment A: greedy autoregressive generation, exact 3-token match (position digits + &lt;code&gt;$&lt;/code&gt;). 500 samples per &lt;code&gt;(seed, n_loops, depth)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Experiment B: re-implementation of OpenMythos forward to expose per-loop hidden states. Cosine similarity at the first answer-token position. 200 samples per &lt;code&gt;(seed, depth)&lt;/code&gt;, 32 loop iterations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What I'd want to try next
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Increase training-time loop count and re-measure.&lt;/strong&gt; Does the peak track with training depth (suggesting it's purely a distribution-shift artifact) or does extrapolation stay broken?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale model dim while keeping loops fixed.&lt;/strong&gt; Does a 10x bigger model break through the ~66% / ~36% bracket ceiling, or does the structure-tracking task itself need a different inductive bias?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mix tasks in training.&lt;/strong&gt; Train on addition + brackets jointly and see if there's interference or transfer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inject explicit halting (ACT).&lt;/strong&gt; Let the model choose how many loops per token. Does it match the empirical optimum or settle elsewhere?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kyegomez/OpenMythos" rel="noopener noreferrer"&gt;OpenMythos GitHub (Kye Gomez)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://red.anthropic.com/2026/mythos-preview/" rel="noopener noreferrer"&gt;Claude Mythos Preview (Anthropic, 2026-04-07)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2502.17416" rel="noopener noreferrer"&gt;Reasoning with Latent Thoughts (Saunshi et al., ICLR 2025)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2502.05171" rel="noopener noreferrer"&gt;Scaling up Test-Time Compute with Latent Reasoning (Geiping et al., Huginn)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@mbonsign/testing-the-openmythos-hypothesis-emergent-subspace-selectivity-in-looped-transformers-711f8ca0236c" rel="noopener noreferrer"&gt;Testing the OpenMythos Hypothesis (Micheal Bee)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.12946" rel="noopener noreferrer"&gt;Parcae — Scaling Laws for Stable Looped Language Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Training and evaluation scripts: &lt;a href="https://github.com/SAETAG/dgx-100-experiments/tree/main/days/day08-bracket-matching/scripts" rel="noopener noreferrer"&gt;https://github.com/SAETAG/dgx-100-experiments/tree/main/days/day08-bracket-matching/scripts&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tomorrow: Day 9
&lt;/h2&gt;

&lt;p&gt;Switching gears to something much more personal — handing private chat data to a local model and seeing what it surfaces…!&lt;/p&gt;

&lt;p&gt;#100ExperimentsWithDGX #LocalLLM&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>transformers</category>
    </item>
    <item>
      <title>[Day 7] Does Giving an AI More 'Thinking Time' Really Make It Smarter? Training an OpenMythos-Style Mini Model on DGX</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Tue, 19 May 2026 03:17:51 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-7-does-giving-an-ai-more-thinking-time-really-make-it-smarter-training-an-openmythos-style-1epk</link>
      <guid>https://dev.to/peppercorn_llm/day-7-does-giving-an-ai-more-thinking-time-really-make-it-smarter-training-an-openmythos-style-1epk</guid>
      <description>&lt;h1&gt;
  
  
  [Day 7] Does Giving an AI More "Thinking Time" Really Make It Smarter? Training an OpenMythos-Style Mini Model on DGX
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Day 7!&lt;/p&gt;

&lt;p&gt;Reddit kept surfacing this new project called &lt;strong&gt;OpenMythos&lt;/strong&gt; in my feed with "12 days to replicate frontier AI, ASI is near" headlines, and I got curious enough to dig in.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tools used: my home AI machine (DGX Spark) + &lt;a href="https://github.com/kyegomez/OpenMythos" rel="noopener noreferrer"&gt;OpenMythos&lt;/a&gt; (PyTorch reconstruction of the rumored Claude Mythos architecture) + synthetic multi-digit addition.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The question: &lt;strong&gt;does giving an AI more "thinking time" (= more recurrent loops at inference) actually make it smarter?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Today's setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The hype
&lt;/h3&gt;

&lt;p&gt;On 2026-04-07, Anthropic announced &lt;strong&gt;Claude Mythos&lt;/strong&gt;. Press coverage highlights zero-day discovery capabilities — reportedly 271 zero-days in Firefox and a 27-year-old bug in OpenBSD — but the model's architecture and weights remain unreleased. Anthropic kept Mythos itself behind a limited-access coalition (&lt;strong&gt;Project Glasswing&lt;/strong&gt; — AWS, Apple, Microsoft, Google, CrowdStrike, Palo Alto, ~40 organizations) rather than releasing it publicly.&lt;/p&gt;

&lt;p&gt;Twelve days later, Kye Gomez (Swarms) released &lt;strong&gt;OpenMythos&lt;/strong&gt;, a PyTorch reconstruction of the &lt;em&gt;suspected&lt;/em&gt; architecture. The repo is explicit upfront:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"an independent, community-driven theoretical reconstruction based solely on publicly available research and speculation. It is not affiliated with, endorsed by, or connected to Anthropic"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So OpenMythos is &lt;strong&gt;not&lt;/strong&gt; Mythos. It's a hypothesis-in-code: a Recurrent-Depth Transformer (RDT) with MoE FFNs and MLA/GQA attention, capable of being trained from scratch on standard text data. No leaked weights, no distillation.&lt;/p&gt;

&lt;p&gt;Reddit's "ASI is near" framing skips this critical distinction. The interesting question, once you set the hype aside, is whether the &lt;strong&gt;architectural idea&lt;/strong&gt; — recurrent depth — actually works.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note for this article&lt;/strong&gt;: OpenMythos is not Claude Mythos — it's a theoretical reconstruction inspired by looped-transformer research. The experiments below are not "Claude Mythos capability tests" but rather "how does a looped / recurrent-depth structure behave on a small synthetic task."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Three perspectives on looped transformers
&lt;/h3&gt;

&lt;p&gt;Browsing the literature, I found three different studies giving different pictures of how looped transformers behave:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Claim&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Saunshi et al. 2025&lt;/strong&gt; (ICLR, research paper)&lt;/td&gt;
&lt;td&gt;tens of M params, synthetic&lt;/td&gt;
&lt;td&gt;Loops work: k layers looped L times approximately matches kL-layer fixed-depth, on addition / p-hop induction / math&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Geiping et al. 2025&lt;/strong&gt; (Huginn, research paper)&lt;/td&gt;
&lt;td&gt;3.5B params, 800B tokens&lt;/td&gt;
&lt;td&gt;Task-dependent: at scale on natural-language benchmarks, gains can be marginal (T=4 → T=32 only +1.82 points on GSM8K), though effects vary by task and compute regime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Micheal Bee 2026-04&lt;/strong&gt; (Medium, independent experiment blog)&lt;/td&gt;
&lt;td&gt;17M params, 12 GPU-hours on RTX 5070 Ti&lt;/td&gt;
&lt;td&gt;Loops plateau at T=2 in this small-scale setup: hidden state reaches a fixed-point that subsequent iterations cannot escape&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Theory, large-scale empirics, and an independent solo replication give different pictures. I wanted to add a fourth data point from my own DGX Spark on a clean, controlled task — multi-digit addition.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I wanted to find out
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Does training-time accuracy phase-transition (grok) at some step? (Saunshi 3-stage prediction)&lt;/li&gt;
&lt;li&gt;Does test-time loop count matter? At what point does it stop helping?&lt;/li&gt;
&lt;li&gt;Does the hidden state actually keep evolving across loops, or does it hit a fixed-point early? (the Bee question)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Headline finding
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Loops help, but only within a narrow window centered on the training loop count.&lt;/strong&gt; With training-time &lt;code&gt;max_loop_iters=4&lt;/code&gt;, accuracy peaks at exactly T=4 (100% across all digit counts) and decays in &lt;em&gt;both&lt;/em&gt; directions — fewer loops underthink, more loops overthink.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bee's "T=2 fixed-point" reproduced in spirit.&lt;/strong&gt; Cosine similarity between consecutive hidden states jumps from ~0.72 to ~0.95 at T=2, climbs slowly to ~0.99 by T=4, and creeps toward ~0.999 by T=32 (Bee reported near-1.0 already at T=2).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Striking per-seed grokking variance.&lt;/strong&gt; Same hyperparameters, four seeds: seeds 1 and 3 solve 5-digit addition by step 4,000; seed 2 takes 10,000; seed 0 stalls at &amp;lt;10% until step 16,000, then jumps to 100%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No depth extrapolation in this setup.&lt;/strong&gt; Saunshi's claim that training at T=4 should generalize to deeper T at inference does &lt;em&gt;not&lt;/em&gt; reproduce here — instead, T&amp;gt;4 hurts.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🌀 What is a "looped" transformer?
&lt;/h2&gt;

&lt;p&gt;A standard transformer (GPT-4, Llama, most local LLMs) routes input tokens through a stack of distinct layers, each used exactly once per forward pass. To make it "think deeper," you stack more layers — increasing parameter count.&lt;/p&gt;

&lt;p&gt;A looped transformer reuses &lt;strong&gt;the same&lt;/strong&gt; parameters across multiple iterations. The model has a &lt;code&gt;Prelude → Recurrent Block × T → Coda&lt;/code&gt; structure: a few standard layers up front, then one block iterated T times with input injection at every step, then a few more standard layers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input tokens
   ↓
[Prelude P]          — standard layers, run once
   ↓
[Recurrent Block R]  — one block looped T times
   ↑_______↓          h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
   ↓
[Coda C]             — standard layers, run once
   ↓
Output logits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At each loop iteration &lt;code&gt;t&lt;/code&gt;, the hidden state updates via the LTI injection rule, and the encoded input &lt;code&gt;e&lt;/code&gt; (Prelude output) is re-injected to keep the original signal alive across arbitrary depth. The injection parameters are constrained so that spectral radius ρ(A) &amp;lt; 1, which prevents divergence over many loops (Parcae stability framework).&lt;/p&gt;
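
&lt;p&gt;To see why the ρ(A) &amp;lt; 1 constraint matters, here is a toy scalar version of the injection rule. It is a sketch under strong assumptions: the state is a single number, &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; stand in for A and B, and the &lt;code&gt;Transformer(h_t, e)&lt;/code&gt; term is dropped entirely, so it shows only the linear part of the update and not OpenMythos itself.&lt;/p&gt;

```python
def iterate_injection(a, b, e, n_loops, h0=0.0):
    """Run h_{t+1} = a*h_t + b*e for n_loops steps on a scalar state.

    The Transformer(h_t, e) term of the full update is deliberately
    omitted so that only the linear injection part is visible.
    """
    h = h0
    trajectory = []
    for _ in range(n_loops):
        h = a * h + b * e
        trajectory.append(h)
    return trajectory


if __name__ == "__main__":
    stable = iterate_injection(a=0.5, b=1.0, e=1.0, n_loops=32)
    unstable = iterate_injection(a=1.1, b=1.0, e=1.0, n_loops=32)
    # With |a| below 1 the state settles near b*e / (1 - a) = 2.0.
    print(f"stable   after 32 loops: {stable[-1]:.6f}")
    # With |a| above 1 the same rule keeps growing with every loop.
    print(f"unstable after 32 loops: {unstable[-1]:.2f}")
```

&lt;p&gt;In the stable case the re-injected input &lt;code&gt;e&lt;/code&gt; determines where the state settles, which is the sense in which the original signal stays alive across depth.&lt;/p&gt;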

&lt;p&gt;The key claim: &lt;strong&gt;more loops at inference = deeper reasoning, without adding parameters&lt;/strong&gt;. This is conceptually analogous to chain-of-thought scaling — except the "thinking" happens in continuous latent space rather than discrete token space.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔧 Experimental setup
&lt;/h2&gt;

&lt;p&gt;I trained a deliberately tiny OpenMythos variant on multi-digit addition. The model is small enough to run 4 seeds in parallel on a single GPU but large enough to exhibit the looped-transformer phenomena.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OpenMythos tiny (3.4M params)
  ↓
Train 4 seeds in parallel, 30k steps each, fp32 on DGX Spark (GB10)
  ↓
Experiment A: greedy autoregressive accuracy
              loops ∈ {1, 2, 4, 8, 16, 32}  ×  digits ∈ {2, 3, 4, 5}
  ↓
Experiment B: cosine similarity between consecutive hidden states
              ⇒ does the recurrent block reach a fixed-point?
  ↓
Compare against Saunshi / Huginn / Bee
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Model config
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;MythosConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# digits 0-9 + '+', '=', pad, eos
&lt;/span&gt;    &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_heads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_kv_heads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# GQA
&lt;/span&gt;    &lt;span class="n"&gt;max_seq_len&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_loop_iters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# training depth; inference varies
&lt;/span&gt;    &lt;span class="n"&gt;prelude_layers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;coda_layers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;attn_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gqa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_experts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# MoE FFN inside recurrent block
&lt;/span&gt;    &lt;span class="n"&gt;n_shared_experts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_experts_per_tok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expert_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lora_rank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# depth-wise LoRA per loop step
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total parameters: &lt;strong&gt;3,386,658&lt;/strong&gt; (~3.4M).&lt;/p&gt;

&lt;h3&gt;
  
  
  Data
&lt;/h3&gt;

&lt;p&gt;On-the-fly synthetic addition. Operands are uniformly sampled from &lt;code&gt;[10^(d-1), 10^d - 1]&lt;/code&gt; for digit count &lt;code&gt;d ∈ {2, 3, 4, 5}&lt;/code&gt;. Sequence format &lt;code&gt;"A+B=R$"&lt;/code&gt;, where &lt;code&gt;R = str(A+B)[::-1]&lt;/code&gt; (reverse-order answer, following Saunshi's convention so left-to-right autoregressive generation can carry digits naturally).&lt;/p&gt;

&lt;p&gt;Loss is applied only at positions following the &lt;code&gt;=&lt;/code&gt; token (i.e., on the answer tokens).&lt;/p&gt;
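
&lt;p&gt;A minimal sketch of the sample format and loss mask described above. The function names are hypothetical, tokens are assumed to be single characters, and the mask is indexed by the position of the token being predicted; the actual scripts may organize this differently.&lt;/p&gt;

```python
import random


def format_sample(a, b):
    """Format one example as "A+B=R$" with the sum written in reverse."""
    answer = str(a + b)[::-1]  # least-significant digit first
    return f"{a}+{b}={answer}$"


def make_sample(d, rng):
    """Draw two d-digit operands uniformly and format them."""
    lo, hi = 10 ** (d - 1), 10 ** d - 1
    return format_sample(rng.randint(lo, hi), rng.randint(lo, hi))


def loss_mask(sample):
    """True for every token after '=', i.e. the answer digits and '$'."""
    eq = sample.index("=")
    return [False] * (eq + 1) + [True] * (len(sample) - eq - 1)


if __name__ == "__main__":
    s = format_sample(57, 68)
    print(s)             # 57+68=521$
    print(loss_mask(s))
    print(make_sample(5, random.Random(0)))
```

&lt;p&gt;Writing the answer least-significant digit first means each generated digit depends only on operand digits and a carry that has already been resolved.&lt;/p&gt;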

&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Optimizer: AdamW, betas (0.9, 0.95), wd 0.1&lt;/li&gt;
&lt;li&gt;LR: max 3e-4, warmup 2000 steps, cosine decay to 1e-5&lt;/li&gt;
&lt;li&gt;Grad clip: 1.0&lt;/li&gt;
&lt;li&gt;Batch size: 128&lt;/li&gt;
&lt;li&gt;Max steps: 30000&lt;/li&gt;
&lt;li&gt;dtype: &lt;strong&gt;fp32&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Initially I tried bf16 to use the GB10 efficiently, but OpenMythos stores RoPE frequencies as &lt;code&gt;complex64&lt;/code&gt; buffers, and &lt;code&gt;model.to(bfloat16)&lt;/code&gt; silently drops the imaginary part, breaking attention. For a 3.4M-param model on 128 GB of unified memory, fp32 is fine — the bottleneck is not memory but parallel scheduling.&lt;/p&gt;

&lt;p&gt;Four seeds {0, 1, 2, 3} run in parallel on the same GPU. Per-seed throughput drops to ~12K tok/s (vs ~50K solo), so aggregate throughput (~48K tok/s) is about the same as a solo run: the four parallel runs finish in roughly the time four back-to-back solo runs would take.&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Experiment A: accuracy heatmap
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62vulnao2x15psie2uv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62vulnao2x15psie2uv8.png" alt="accuracy heatmap of OpenMythos addition across loop counts and digit counts" width="800" height="682"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mean fully-correct rate across 4 seeds, 500 eval samples per condition:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Inference loops&lt;/th&gt;
&lt;th&gt;d=2&lt;/th&gt;
&lt;th&gt;d=3&lt;/th&gt;
&lt;th&gt;d=4&lt;/th&gt;
&lt;th&gt;d=5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.38 ± 0.12&lt;/td&gt;
&lt;td&gt;0.19 ± 0.09&lt;/td&gt;
&lt;td&gt;0.09 ± 0.07&lt;/td&gt;
&lt;td&gt;0.02 ± 0.02&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.53 ± 0.17&lt;/td&gt;
&lt;td&gt;0.50 ± 0.12&lt;/td&gt;
&lt;td&gt;0.16 ± 0.08&lt;/td&gt;
&lt;td&gt;0.21 ± 0.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4 (train)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.00&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0.98 ± 0.01&lt;/td&gt;
&lt;td&gt;0.98 ± 0.01&lt;/td&gt;
&lt;td&gt;0.94 ± 0.03&lt;/td&gt;
&lt;td&gt;0.86 ± 0.08&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;0.91 ± 0.04&lt;/td&gt;
&lt;td&gt;0.91 ± 0.05&lt;/td&gt;
&lt;td&gt;0.75 ± 0.10&lt;/td&gt;
&lt;td&gt;0.56 ± 0.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;0.62 ± 0.12&lt;/td&gt;
&lt;td&gt;0.65 ± 0.13&lt;/td&gt;
&lt;td&gt;0.45 ± 0.13&lt;/td&gt;
&lt;td&gt;0.26 ± 0.17&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Peak is exactly at training-time loop count (T=4), 100% across all digit counts.&lt;/li&gt;
&lt;li&gt;Doubling the loop count at inference (T=8) stays near peak for d=2 and d=3 (98%) but already degrades at d=4 (94%) and d=5 (86%).&lt;/li&gt;
&lt;li&gt;Beyond T=8, accuracy collapses monotonically. At T=32, even 2-digit addition drops to 62%.&lt;/li&gt;
&lt;li&gt;Under-looping (T=1, T=2) hurts more at higher digit counts, consistent with depth being needed to chain carries.&lt;/li&gt;
&lt;/ul&gt;
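
&lt;p&gt;For reference, each "mean ± spread" cell in the table can be computed along these lines. This is a sketch with hypothetical function names; whether the reported spread is a population or sample standard deviation is not stated, so the use of &lt;code&gt;pstdev&lt;/code&gt; here is an assumption, and the numbers in the demo are made up.&lt;/p&gt;

```python
from statistics import mean, pstdev


def exact_match_rate(predictions, targets):
    """Fraction of samples whose generated answer equals the target exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)


def summarize_over_seeds(per_seed_rates):
    """Return (mean, spread) across seeds; population std is an assumption."""
    return mean(per_seed_rates), pstdev(per_seed_rates)


if __name__ == "__main__":
    # Made-up per-seed rates for one (n_loops, digits) cell.
    rates = [0.80, 0.90, 0.90, 1.00]
    m, s = summarize_over_seeds(rates)
    print(f"{m:.2f} ± {s:.2f}")
```

&lt;p&gt;Exact match is a strict metric: one wrong digit or a missing terminator counts the whole sample as a miss.&lt;/p&gt;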

&lt;h3&gt;
  
  
  Experiment B: fixed-point analysis
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lcx3dpp8c8s4e4dab63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2lcx3dpp8c8s4e4dab63.png" alt="fixed-point cosine similarity curve across loop steps" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mean cosine similarity between consecutive hidden states &lt;code&gt;cos(h_t, h_{t-1})&lt;/code&gt; over answer positions, averaged across 4 seeds, 200 samples per digit:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;t&lt;/th&gt;
&lt;th&gt;d=2&lt;/th&gt;
&lt;th&gt;d=3&lt;/th&gt;
&lt;th&gt;d=4&lt;/th&gt;
&lt;th&gt;d=5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.711&lt;/td&gt;
&lt;td&gt;0.726&lt;/td&gt;
&lt;td&gt;0.745&lt;/td&gt;
&lt;td&gt;0.744&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.961&lt;/td&gt;
&lt;td&gt;0.967&lt;/td&gt;
&lt;td&gt;0.957&lt;/td&gt;
&lt;td&gt;0.946&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0.985&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;td&gt;0.977&lt;/td&gt;
&lt;td&gt;0.971&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0.993&lt;/td&gt;
&lt;td&gt;0.992&lt;/td&gt;
&lt;td&gt;0.986&lt;/td&gt;
&lt;td&gt;0.983&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;0.999&lt;/td&gt;
&lt;td&gt;0.999&lt;/td&gt;
&lt;td&gt;0.998&lt;/td&gt;
&lt;td&gt;0.996&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;0.9995&lt;/td&gt;
&lt;td&gt;0.9996&lt;/td&gt;
&lt;td&gt;0.9992&lt;/td&gt;
&lt;td&gt;0.998&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;0.9995&lt;/td&gt;
&lt;td&gt;0.9996&lt;/td&gt;
&lt;td&gt;0.999&lt;/td&gt;
&lt;td&gt;0.998&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bee's T=2 fixed-point claim is reproduced in spirit but not literally: cosine similarity jumps to ~0.95 at T=2 (vs. Bee's near-1.0), then asymptotes to ~0.99 by T=4 and stays flat through T=32.&lt;/p&gt;

&lt;p&gt;The contrast with accuracy is telling: &lt;strong&gt;the hidden state is nearly static (by cosine similarity) from T=4 onwards, yet accuracy collapses at T=16-32&lt;/strong&gt;. Two non-exclusive interpretations: (a) overthinking, where late loops drift away from a converged solution; (b) distribution shift, where training used T=4, so T&amp;gt;&amp;gt;4 is simply an out-of-distribution use of the model. Note that cosine similarity ≈ 1 does not prove the hidden state is doing nothing: cosine ignores changes in norm, and small logit-relevant deltas may still accumulate over many loops.&lt;/p&gt;

&lt;p&gt;Digit-count dependence of fixed-point timing is small (d=5 lags d=2 by ~0.01 in cosine sim between T=2 and T=4). "Harder problems take more loops to converge" is &lt;em&gt;not&lt;/em&gt; observed here: all digit counts converge at about the same rate, and what differs is how quickly accuracy degrades once the loop count moves away from T=4.&lt;/p&gt;
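
&lt;p&gt;The measurement in the table above reduces to the following computation on a list of per-loop hidden states. This is a plain-Python sketch with hypothetical helper names and toy 2-dimensional vectors; the real evaluation works on the model's hidden states at the first answer-token position.&lt;/p&gt;

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def consecutive_cosines(states):
    """Given [h_0, h_1, ..., h_T], return [cos(h_1, h_0), ..., cos(h_T, h_{T-1})]."""
    return [cosine(cur, prev) for prev, cur in zip(states, states[1:])]


if __name__ == "__main__":
    # Toy trajectory: a large first move, then a small refinement.
    toy = [[1.0, 0.0], [1.0, 1.0], [1.0, 1.1]]
    print([round(c, 4) for c in consecutive_cosines(toy)])
    # Doubling the norm leaves the cosine at 1 even though the state changed.
    print(round(cosine([1.0, 1.0], [2.0, 2.0]), 6))
```

&lt;p&gt;The last line illustrates one limit of the metric: a state that only changes in magnitude registers as perfectly "static" under cosine similarity.&lt;/p&gt;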

&lt;h3&gt;
  
  
  Bonus: training dynamics
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z2j3kgsk8jpcdm3iq3i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z2j3kgsk8jpcdm3iq3i.png" alt="training loss and teacher-forced accuracy curves per seed" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most striking thing in the training curves is &lt;strong&gt;seed-dependent grokking timing&lt;/strong&gt;. Four runs of identical hyperparameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;seed 1: loss → 0 by step 3,000, all digits ≥88% by step 4,000&lt;/li&gt;
&lt;li&gt;seed 3: loss → 0 by step 4,000, all digits ≥87% by step 4,000&lt;/li&gt;
&lt;li&gt;seed 2: stuck at loss ~0.35 plateau until step 8,000, then collapses to 0 by step 10,000; d=4/5 jump from &amp;lt;10% to 99% in 2,000 steps&lt;/li&gt;
&lt;li&gt;seed 0: stuck at loss ~0.30 plateau until step 15,000, then collapses; d=4 groks at step 12,000-14,000, d=5 groks at step 16,000&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is textbook Saunshi-style three-stage grokking (memorization → in-distribution → systematic), with the third-stage trigger varying by a factor of &lt;strong&gt;4x in step count&lt;/strong&gt; (step 4,000 for seed 1 vs. step 16,000 for seed 0) depending only on the random seed. That gap of ~12,000 steps is roughly 1 hour of wall-clock on this DGX.&lt;/p&gt;

&lt;p&gt;If you trained a single seed and stopped early, you might conclude "OpenMythos can't generalize beyond d=3" — which would be wrong. The architecture &lt;em&gt;can&lt;/em&gt; solve all 4 digit buckets; some random seeds just need much longer to find the systematic-generalization solution.&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 What this means for the three perspectives
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Where my data point lands
&lt;/h3&gt;

&lt;p&gt;My single-DGX small-scale result agrees with Bee and partly contradicts Saunshi:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bee's fixed-point at small T is reproduced.&lt;/strong&gt; Hidden state effectively stops evolving by T=4 (cosine sim ≥ 0.98) and certainly by T=8 (≥ 0.996).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Saunshi's depth-extrapolation does NOT reproduce.&lt;/strong&gt; Inference at T &amp;gt; train_T does &lt;em&gt;not&lt;/em&gt; improve accuracy — it harms it. T=8 is already at 86% on d=5 (vs. 100% at T=4), and T=32 collapses to 26%. The "train at depth k, infer at depth k·L" recipe assumes the recurrent block has learned to keep refining; in my setup it has not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Huginn's limited-gain finding is consistent in direction, but the small-scale result is harsher.&lt;/strong&gt; Huginn reports marginal positive gains from extra inference loops; here, loops beyond the training count give negative returns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New observation: seed-dependent grokking, with onset differing by up to ~12K steps.&lt;/strong&gt; This is an under-emphasized variable in the public looped-transformer discourse: single-seed studies (Bee's solo replication, individual rows in Saunshi's tables) may substantially under- or over-estimate typical behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reconciliation attempt
&lt;/h3&gt;

&lt;p&gt;Theory (Saunshi), large-scale empirics (Huginn), and independent replication (Bee) may not actually be in contradiction — they may be measuring different facets of the same phenomenon at different scales:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Saunshi&lt;/strong&gt;: shows loops &lt;em&gt;can&lt;/em&gt; work on the right kind of problem (algorithmic, depth-bounded reasoning) at the right kind of scale (small synthetic).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Huginn&lt;/strong&gt;: shows that loops trained at 3.5B / 800B token scale on natural-language data give only marginal gains on a benchmark (GSM8K) that already favors CoT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bee&lt;/strong&gt;: shows that within a particular small-scale training recipe, the recurrent block's hidden state stops evolving very early in inference.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These three findings are compatible with a unified picture: &lt;strong&gt;loops carry compute, but only up to a depth bounded by the task's algorithmic complexity and the model's expressive capacity&lt;/strong&gt;. Beyond that depth, the hidden state stops moving meaningfully, and additional loops are computation without information.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I'd watch next
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Increase loop count during training (here I used 4) and see if the inference-time scaling extends further&lt;/li&gt;
&lt;li&gt;Try ACT halting more aggressively to see how the model self-regulates loop depth per token&lt;/li&gt;
&lt;li&gt;Add task heterogeneity (mix p-hop induction or parity) to test whether the fixed-point timing varies by problem class&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🛠️ Technical details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Reproducing this experiment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kyegomez/OpenMythos
&lt;span class="nb"&gt;cd &lt;/span&gt;OpenMythos
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Data, training, evaluation scripts (this Day 7 folder):&lt;/span&gt;
python scripts/train.py &lt;span class="nt"&gt;--seed&lt;/span&gt; 0 &lt;span class="nt"&gt;--max_steps&lt;/span&gt; 30000
python scripts/eval_accuracy.py &lt;span class="nt"&gt;--seeds&lt;/span&gt; 0 1 2 3
python scripts/eval_fixedpoint.py &lt;span class="nt"&gt;--seeds&lt;/span&gt; 0 1 2 3
python scripts/plot.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The training and evaluation scripts are at &lt;a href="https://github.com/SAETAG/dgx-100-experiments/tree/main/days/day07-openmythos-loop-debate/scripts" rel="noopener noreferrer"&gt;https://github.com/SAETAG/dgx-100-experiments/tree/main/days/day07-openmythos-loop-debate/scripts&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What went wrong (and was fixed)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;bf16 broke complex RoPE buffer&lt;/strong&gt;: switched to fp32; fine at 3.4M parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training-time max_loop_iters may be too small&lt;/strong&gt;: not fixed here; kept at 4 per Saunshi's recipe, and future experiments could vary this&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Greedy generation is slow at high loop counts&lt;/strong&gt;: every generated token runs the recurrent block &lt;code&gt;n_loops&lt;/code&gt; times, so at loops=32 the evaluation cost is non-trivial&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Hyperparameter choices: why these
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dim=256, expert_dim=512, 1 prelude / 1 coda layer&lt;/code&gt;: smallest config that still exhibits looping behavior; matches Saunshi's scale&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;n_experts=4&lt;/code&gt;: enough to demonstrate MoE routing without bloating params&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lora_rank=8&lt;/code&gt;: depth-wise LoRA lets each loop iteration adapt slightly without breaking weight-sharing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_seq_len=32&lt;/code&gt;: tight bound — d=5 addition fits in ~18 chars&lt;/li&gt;
&lt;/ul&gt;
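&lt;p&gt;The ~18-character figure behind &lt;code&gt;max_seq_len=32&lt;/code&gt; can be sanity-checked with a quick sketch. It assumes each sample is written as &lt;code&gt;a+b=c&lt;/code&gt; with d-digit operands; the real sample format in the training script may differ, and &lt;code&gt;max_sample_len&lt;/code&gt; is a made-up helper name.&lt;/p&gt;

```python
# Assumed sample format: "a+b=c" with d-digit operands (illustrative only).
def max_sample_len(d):
    largest = 10 ** d - 1                       # largest d-digit operand
    sample = f"{largest}+{largest}={largest + largest}"
    return len(sample)                          # worst-case character count

print(max_sample_len(5))
```

&lt;p&gt;Under that assumption the longest d=5 sample leaves comfortable headroom below 32 tokens at one token per character.&lt;/p&gt;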

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kyegomez/OpenMythos" rel="noopener noreferrer"&gt;OpenMythos GitHub (Kye Gomez)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://red.anthropic.com/2026/mythos-preview/" rel="noopener noreferrer"&gt;Claude Mythos Preview (Anthropic, 2026-04-07)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/glasswing" rel="noopener noreferrer"&gt;Project Glasswing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2502.17416" rel="noopener noreferrer"&gt;Reasoning with Latent Thoughts (Saunshi et al., ICLR 2025)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2502.05171" rel="noopener noreferrer"&gt;Scaling up Test-Time Compute with Latent Reasoning (Geiping et al., Huginn)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@mbonsign/testing-the-openmythos-hypothesis-emergent-subspace-selectivity-in-looped-transformers-711f8ca0236c" rel="noopener noreferrer"&gt;Testing the OpenMythos Hypothesis (Micheal Bee)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.12946" rel="noopener noreferrer"&gt;Parcae — Scaling Laws for Stable Looped Language Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2604.07822" rel="noopener noreferrer"&gt;Loop, Think, &amp;amp; Generalize (Implicit Reasoning in Recurrent-Depth Transformers)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Tomorrow: Day 8
&lt;/h2&gt;

&lt;p&gt;A follow-up to Day 7, pushing looped thinking one step further into something harder…!&lt;/p&gt;

&lt;p&gt;#100ExperimentsWithDGX #LocalLLM&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>transformers</category>
    </item>
    <item>
      <title>[Day 6] I Had an AI Look at 25,000 iPhone Photos and It Decided My Mom and I Are the Same Person</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Tue, 12 May 2026 18:10:12 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-6-i-had-an-ai-look-at-25000-iphone-photos-and-it-decided-my-mom-and-i-are-the-same-person-1epo</link>
      <guid>https://dev.to/peppercorn_llm/day-6-i-had-an-ai-look-at-25000-iphone-photos-and-it-decided-my-mom-and-i-are-the-same-person-1epo</guid>
      <description>&lt;h1&gt;
  
  
  [Day 6] I Had an AI Look at 25,000 iPhone Photos and It Decided My Mom and I Are the Same Person
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Day 6!&lt;/p&gt;

&lt;p&gt;On Day 4, I had a local AI sort through 25,000 photos on my iPhone (&lt;a href="https://dev.to/peppercorn_llm/day-4-i-had-a-local-ai-sort-through-25000-photos-on-my-iphone-545p"&gt;Day 4 article&lt;/a&gt;). Today is the follow-up — I wanted to go one level deeper and have AI look at my &lt;strong&gt;behavioral patterns&lt;/strong&gt; over time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tools used: my home AI machine (DGX Spark) + a face recognition AI (&lt;a href="https://github.com/timesler/facenet-pytorch" rel="noopener noreferrer"&gt;FaceNet&lt;/a&gt;) + a summarization LLM (&lt;a href="https://qwenlm.github.io/" rel="noopener noreferrer"&gt;Qwen2.5 72B&lt;/a&gt; running on Ollama).&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Today's setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I actually did
&lt;/h3&gt;

&lt;p&gt;Take 5 years of photos (25,000) and have an AI summarize my day-to-day life from them. Two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1&lt;/strong&gt;: aggregate by capture date + camera model + photo category (cat, food, scenery, etc.), then ask the LLM to read it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2&lt;/strong&gt;: add face recognition AI to answer "who is in each photo," then ask the LLM again&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The key bit of today
&lt;/h3&gt;

&lt;p&gt;The face recognition AI treated me and my mom as &lt;strong&gt;the same person&lt;/strong&gt;. The interesting part: every other misclassification was "different people with the same expression," whereas ours was "&lt;strong&gt;different expressions, still grouped as one person&lt;/strong&gt;," even though my mom was straight-faced and I was grinning with teeth showing.&lt;/p&gt;

&lt;p&gt;The AI gets fooled by expressions, but it also seems to pick up on something &lt;strong&gt;beyond expressions&lt;/strong&gt; (bone structure? face shape?). That's today's headline.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔧 How I went about it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;25,000 photos (already categorized on Day 4)
   ↓
Phase 1: aggregate "capture date / camera model / category" only
   → ask the LLM to summarize year by year
   ↓
Phase 2: add face recognition AI to label "who is in each photo"
   → ask the LLM to summarize again
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPS data ("where") had been stripped during the iCloud export, so I substituted &lt;strong&gt;camera model&lt;/strong&gt; as a proxy (iPhone = daily life, Olympus TG = travel, DJI handheld = video shoots, etc.).&lt;/p&gt;

&lt;p&gt;(The tools and detailed steps are in the "Technical details" section at the end.)&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: date / camera / category only
&lt;/h3&gt;

&lt;p&gt;I pulled "capture date + camera model + category (sorted on Day 4)" out of the 25,000 photos and turned it into &lt;strong&gt;four heatmaps&lt;/strong&gt; showing year-over-year patterns. Then handed those to the LLM.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What's a heatmap?&lt;/strong&gt; = A table where the rows × columns are filled with color intensity based on count. Dense color = a hotspot of activity, visible at a glance.&lt;/p&gt;
&lt;/blockquote&gt;
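&lt;p&gt;As a rough sketch of the counting step behind a heatmap like this: each cell is just "how many photos fall into this (year, category) pair." The records below are made up for illustration, and &lt;code&gt;year_category_counts&lt;/code&gt; is a hypothetical helper, not the script I actually ran.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical records: one (capture year, category) pair per photo.
photos = [(2019, "scenery"), (2019, "food"), (2021, "cat"), (2021, "cat"), (2022, "cat")]

def year_category_counts(records):
    """Count photos per (year, category) cell; each cell becomes one heatmap square."""
    return Counter(records)

counts = year_category_counts(photos)
years = sorted({year for year, _ in photos})
categories = sorted({cat for _, cat in photos})

# Print one row per year, one column per category (missing cells count as 0).
print("year", categories)
for year in years:
    print(year, [counts[(year, cat)] for cat in categories])
```

&lt;p&gt;A plotting library can then color each cell by its count; the counting itself needs nothing beyond the standard library.&lt;/p&gt;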

&lt;h4&gt;
  
  
  Photo count per year
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygvu2bt3oc4fr0yptwhi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygvu2bt3oc4fr0yptwhi.png" alt="Photo count per year" width="799" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2019 was a clear outlier at 4,931 photos — 2-3x the other years.&lt;/p&gt;

&lt;h4&gt;
  
  
  Year × Category
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72jxhygv36szmklu6we2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72jxhygv36szmklu6we2.png" alt="Year × Category heatmap" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cat photos exploded starting 2021 → matches the year my cat joined the household.&lt;/p&gt;

&lt;h4&gt;
  
  
  Year × Month
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9wp7ew6mhgh8jsnj6zl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9wp7ew6mhgh8jsnj6zl.png" alt="Year × Month activity heatmap" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;August 2019 was the single highest month at 1,082 photos.&lt;/p&gt;

&lt;h4&gt;
  
  
  Year × Camera model
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttxshy9hvrzp4x7rp598.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttxshy9hvrzp4x7rp598.png" alt="Year × Camera heatmap" width="799" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Olympus TG dropped off sharply in 2020 (matches the COVID period). The DJI handheld shows up starting 2025.&lt;/p&gt;

&lt;p&gt;When I handed this to the LLM and asked for a yearly summary, the output was along the lines of "this might have been a busy year" or "looks like an active year." Well, of course — the only info I gave the LLM was "when, what camera, what kind of subject." That's the ceiling for what it can say.&lt;/p&gt;

&lt;p&gt;So the next question: what happens if you add &lt;strong&gt;who is in each photo&lt;/strong&gt;? That's Phase 2.&lt;/p&gt;




&lt;h3&gt;
  
  
  Phase 2: adding "who's in the photo"
&lt;/h3&gt;

&lt;p&gt;I ran face recognition AI over the 25,000 photos, detected over 21,000 faces, and grouped similar-looking ones into 209 groups (&lt;code&gt;C1, C2, …, C209&lt;/code&gt;). Plotting those over time:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What's a "similar-face group"?&lt;/strong&gt; = a group the face recognition AI thinks contains "the same person" (technically called a "cluster"). The AI only manages them as numbered IDs, so a human still has to look at each group and label "this is person X."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Person cluster × Year heatmap
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ht5lmqaq6183qaq75qh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ht5lmqaq6183qaq75qh.png" alt="Person cluster × Year" width="799" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This heatmap turned out to be interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Long-spanning groups&lt;/strong&gt; (C1, C2, C3) → likely family or myself&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short-spanning groups&lt;/strong&gt; → likely acquaintances from a specific period&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…which gives you a working guess. When I fed this back to the LLM, the summary turned much more concrete: "C3 is a new appearance," "C2 is decreasing in frequency," etc.&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 Today's biggest finding
&lt;/h2&gt;

&lt;p&gt;I went through the face clusters one by one and saw that the AI's groupings landed in a mix of "worked great," "fair enough," and "failed":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;✅ Worked great&lt;/strong&gt;: grouped the same person across different angles and expressions (one group had all 4 photos of the same family member)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🤔 Fair enough&lt;/strong&gt;: burst shots end up grouped (multiple groups were just consecutive frames of the same moment)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚠️ Failure pattern A&lt;/strong&gt;: grouped different people who happened to share a similar smile (happened in several groups)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;😳 Failure pattern B&lt;/strong&gt;: grouped me and my mom despite our totally different expressions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most striking one was &lt;strong&gt;failure pattern B&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure patterns A and B are misclassifications for opposite reasons
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Failure pattern A: different people, same expression
&lt;/h4&gt;

&lt;p&gt;Different people grouped together because of a &lt;strong&gt;similar smile&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9sgadvv259yhrvqq2v34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9sgadvv259yhrvqq2v34.png" alt="Different people grouped due to similar smile (illustration)" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Three different people — but when smiles are similar, the AI calls them "same person."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;※The actual experiment used real photos. The illustrations here are AI-generated stand-ins for privacy.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Failure pattern B: parent and child, different expressions
&lt;/h4&gt;

&lt;p&gt;My mom and I in the same group — despite the expression difference (I'm grinning with teeth showing, she's neutral).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0un4gou9kejrh38y116.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0un4gou9kejrh38y116.png" alt="Parent and child grouped despite different expressions (illustration)" width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Parent and child with clearly different expressions — but the AI still says "same person."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;※The actual experiment used real photos. The illustrations here are AI-generated stand-ins for privacy.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the groups I eyeballed, "different expressions, yet grouped as the same person" only happened in our case. Every other misclassification was "same expression, different people."&lt;/p&gt;

&lt;p&gt;So my mom and I are a different kind of mistake. Either the AI is picking up on &lt;strong&gt;genetic facial similarity&lt;/strong&gt;, or there's some other mechanism at work (I'll touch on this in the technical details). Hard to be definitive, but a fascinating case.&lt;/p&gt;




&lt;h3&gt;
  
  
  Summary: how the AI "sees" faces
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;AI judgment&lt;/th&gt;
&lt;th&gt;Likely reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Same person, different angle &amp;amp; expression&lt;/td&gt;
&lt;td&gt;◯ to △&lt;/td&gt;
&lt;td&gt;Bone structure matches well&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Different people, same expression&lt;/td&gt;
&lt;td&gt;✕ (often grouped)&lt;/td&gt;
&lt;td&gt;Pulled in by expression noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parent &amp;amp; child, different expressions&lt;/td&gt;
&lt;td&gt;✕ (sometimes grouped)&lt;/td&gt;
&lt;td&gt;Bone structure similarity outweighs expression difference&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The AI gets fooled by expressions, but seems to actually pick up on something &lt;strong&gt;beyond expressions&lt;/strong&gt; (bone structure? face shape?) — that was the most interesting observation of the day.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Technical details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tools used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EXIF extraction&lt;/strong&gt;: Python &lt;code&gt;pillow_heif&lt;/code&gt; + &lt;code&gt;PIL.Image.getexif()&lt;/code&gt; (HEIC-aware)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Face recognition&lt;/strong&gt;: &lt;code&gt;facenet-pytorch&lt;/code&gt; (&lt;code&gt;InceptionResnetV1&lt;/code&gt;, vggface2-pretrained)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clustering&lt;/strong&gt;: scikit-learn &lt;code&gt;DBSCAN&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM summarization&lt;/strong&gt;: Qwen2.5 72B via Ollama&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute&lt;/strong&gt;: DGX Spark (lots of GPU memory)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What's EXIF?&lt;/strong&gt; = the camera metadata embedded in each photo file (capture time, camera model, GPS, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's FaceNet?&lt;/strong&gt; = an AI that converts a face photo into a 512-dimensional vector. Same person's faces are close vectors, different people are far apart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's DBSCAN?&lt;/strong&gt; = a classic ML clustering method that automatically groups similar items. You don't need to specify the number of groups in advance.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h3&gt;
  
  
  EXIF extraction script (parallelized, 6 seconds total)
&lt;/h3&gt;

&lt;p&gt;I used &lt;code&gt;pillow_heif&lt;/code&gt; for HEIC support and &lt;code&gt;PIL.Image.getexif()&lt;/code&gt; to read EXIF, parallelized with &lt;code&gt;concurrent.futures.ProcessPoolExecutor&lt;/code&gt; (12 processes).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pillow_heif&lt;/span&gt;
&lt;span class="n"&gt;pillow_heif&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_heif_opener&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ExifTags&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;exif&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getexif&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;inner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exif&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_ifd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;34665&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ExifIFD
&lt;/span&gt;        &lt;span class="c1"&gt;# DateTimeOriginal lives inside ExifIFD; dt stays None when the tag is missing
&lt;/span&gt;        &lt;span class="n"&gt;dt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="mi"&gt;36867&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;inner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;dt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_exif_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inner&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;36867&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# helper not shown in this excerpt
&lt;/span&gt;        &lt;span class="n"&gt;make&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exif&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;271&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Make
&lt;/span&gt;        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exif&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;272&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Model
&lt;/span&gt;        &lt;span class="n"&gt;gps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exif&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_ifd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;34853&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# GPSInfo IFD
&lt;/span&gt;    &lt;span class="c1"&gt;# Return shape is illustrative; the full script may package these differently
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;make&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gps&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Photos with no EXIF date (screenshots, etc.) fall back to file mtime, but that's just "the day I copied the file," so I excluded those from year-level aggregation.&lt;/p&gt;
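&lt;p&gt;The excerpt above calls &lt;code&gt;parse_exif_datetime&lt;/code&gt; without showing it. Here is a minimal sketch of what such a helper could look like, assuming the standard EXIF timestamp layout &lt;code&gt;YYYY:MM:DD HH:MM:SS&lt;/code&gt;. &lt;code&gt;capture_year&lt;/code&gt; is a hypothetical name for the "skip undated photos" rule and is not taken from the original script.&lt;/p&gt;

```python
from datetime import datetime

def parse_exif_datetime(value):
    """Parse an EXIF timestamp such as '2019:08:14 10:32:05'; None if malformed."""
    try:
        return datetime.strptime(value.strip(), "%Y:%m:%d %H:%M:%S")
    except (AttributeError, TypeError, ValueError):
        return None

def capture_year(exif_value):
    """Year used for aggregation, or None so the photo is left out.

    Deliberately no fallback to file mtime: that only records when
    the file was copied, not when the photo was taken.
    """
    dt = parse_exif_datetime(exif_value) if exif_value else None
    return dt.year if dt else None

print(capture_year("2019:08:14 10:32:05"))
print(capture_year(None))
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; instead of guessing keeps screenshots and other undated files out of the yearly counts.&lt;/p&gt;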


&lt;h3&gt;
  
  
  Tuning DBSCAN's eps
&lt;/h3&gt;

&lt;p&gt;Distance between embeddings is cosine distance, which for L2-normalized vectors equals 1 minus the dot product.&lt;/p&gt;
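&lt;p&gt;To make that distance concrete, here is a small standard-library sketch. The vectors are toy values, not real face embeddings, and the function names are my own.&lt;/p&gt;

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - dot product of the unit vectors (0 = same direction, 1 = orthogonal)."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a_n, b_n))

# Toy vectors standing in for embeddings
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

&lt;p&gt;On this scale, the &lt;code&gt;eps&lt;/code&gt; values in the table below are thresholds on how far apart two faces may be and still count as neighbors.&lt;/p&gt;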

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;eps&lt;/th&gt;
&lt;th&gt;Clusters&lt;/th&gt;
&lt;th&gt;Largest cluster size&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.4 (loose)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;21,310&lt;/td&gt;
&lt;td&gt;Everyone in one group&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;td&gt;73&lt;/td&gt;
&lt;td&gt;17,234&lt;/td&gt;
&lt;td&gt;Still big lumps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;146&lt;/td&gt;
&lt;td&gt;12,905&lt;/td&gt;
&lt;td&gt;Still too big&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;0.2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;209&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4,582&lt;/td&gt;
&lt;td&gt;◎ chosen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.18&lt;/td&gt;
&lt;td&gt;216&lt;/td&gt;
&lt;td&gt;3,131&lt;/td&gt;
&lt;td&gt;Too tight — single people split into multiple clusters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.cluster&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DBSCAN&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;

&lt;span class="n"&gt;embeds_n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;l2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DBSCAN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cosine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeds_n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



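&lt;p&gt;&lt;code&gt;min_samples=5&lt;/code&gt; means a face needs at least 5 faces (itself included) within &lt;code&gt;eps&lt;/code&gt; to seed a cluster, so in practice only people who show up about 5+ times across photos get clustered. Everything else is labeled as noise (&lt;code&gt;-1&lt;/code&gt;).&lt;/p&gt;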
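&lt;p&gt;The "Clusters" and "Largest cluster size" columns in the table above can be derived from the label array alone. A minimal sketch, relying on scikit-learn's convention that noise points get the label &lt;code&gt;-1&lt;/code&gt;; the labels here are made up and &lt;code&gt;summarize_labels&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
from collections import Counter

def summarize_labels(labels):
    """Return (n_clusters, largest_cluster_size, n_noise) for DBSCAN-style labels."""
    counts = Counter(labels)
    n_noise = counts.pop(-1, 0)                 # -1 marks noise, not a cluster
    largest = max(counts.values(), default=0)
    return len(counts), largest, n_noise

# Made-up labels, not the output of the real run
print(summarize_labels([0, 0, 0, 1, 1, -1, -1, 2]))
```

&lt;p&gt;Running this for each candidate &lt;code&gt;eps&lt;/code&gt; gives a quick way to spot the "everyone in one group" failure without looking at any faces.&lt;/p&gt;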


&lt;h3&gt;
  
  
  Why parent and child tend to land in the same cluster
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;facenet-pytorch&lt;/code&gt;'s &lt;code&gt;InceptionResnetV1&lt;/code&gt; (vggface2-pretrained) produces 512-dim embeddings trained to tell identities apart, so they tend to reflect stable geometric features such as bone structure and face shape. Lighting, angle, and expression noise also leak in.&lt;/p&gt;

&lt;p&gt;Parent and child share genetic bone structure, so their embeddings can be closer than you'd get between random different people. This is a known phenomenon in face recognition research: telling relatives from strangers using face images alone (kinship verification) is a research topic in its own right.&lt;/p&gt;

&lt;p&gt;DBSCAN is density-based: if "A→B is close" and "B→C is close," then A and C end up in the same cluster even if A and C aren't directly close. If there's one photo of me looking especially like my mom that sits in between, that single bridge photo can connect us into one cluster.&lt;/p&gt;
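&lt;p&gt;The bridge effect is easy to reproduce on toy data. The sketch below is a simplification, not DBSCAN itself: it links any two points no more than &lt;code&gt;eps&lt;/code&gt; apart and follows those links transitively, which is roughly what DBSCAN does when every point counts as a core point. The 1-D values are made-up stand-ins for embeddings.&lt;/p&gt;

```python
def chain_clusters(points, eps):
    """Group 1-D points by linking any pair no more than eps apart, transitively."""
    labels = [None] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue                      # already swept into an earlier group
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j, p in enumerate(points):
                # Link unlabeled points within eps of a point already in the group
                if labels[j] is None and eps >= abs(points[i] - p):
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Toy positions: me, a bridge photo, my mom (made-up values)
with_bridge = chain_clusters([0.0, 0.15, 0.3], eps=0.2)
without_bridge = chain_clusters([0.0, 0.3], eps=0.2)
print(with_bridge, without_bridge)
```

&lt;p&gt;With the middle point present, the two ends share a label even though they are 0.3 apart; remove it and they split into separate groups.&lt;/p&gt;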


&lt;h3&gt;
  
  
  Generating representative face thumbnails for manual labeling
&lt;/h3&gt;

&lt;p&gt;Clusters are just IDs (C0, C1, …), so a human has to look at them and label "this is person X."&lt;/p&gt;

&lt;p&gt;I wrote a script that crops the largest face from each cluster's representative photos and lays them out as a diagnostic image:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;facenet_pytorch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MTCNN&lt;/span&gt;

&lt;span class="n"&gt;mtcnn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MTCNN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keep_all&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_face_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;boxes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mtcnn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;detect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;boxes&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;biggest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;boxes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="n"&gt;crop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;crop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;biggest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This image contains real faces of family and friends, so I kept it strictly local in &lt;code&gt;private-data/day06-timeline/&lt;/code&gt; (gitignored). Opened it via VS Code Remote-SSH to label by eye.&lt;/p&gt;





&lt;h2&gt;
  
  
  📝 Today's takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Handing the LLM only "when / what camera / what category" yields a blurry overview&lt;/li&gt;
&lt;li&gt;Adding "who is in the photo" makes the analysis noticeably more concrete&lt;/li&gt;
&lt;li&gt;Face recognition AI is sensitive to expression noise but does pick up something beyond expressions (bone structure / face shape)&lt;/li&gt;
&lt;li&gt;That is probably why my mom and I were the one case in my dataset grouped together "despite different expressions"&lt;/li&gt;
&lt;li&gt;Keeping sensitive face data off the cloud is a big advantage of running this locally&lt;/li&gt;
&lt;li&gt;Processing 25,000 photos in one go is also realistic on a local setup — no API costs to worry about&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Tomorrow's preview: Day 7
&lt;/h2&gt;

&lt;p&gt;Day 7 plan: &lt;strong&gt;local AI vs cloud AI, 5-round showdown&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Going to take the tasks I usually do with local AI (photo classification, credit card analysis, code completion, etc.), run them on both sides, and build a head-to-head matrix.&lt;/p&gt;

&lt;p&gt;To be continued &amp;gt;&amp;gt;&amp;gt;&lt;/p&gt;




&lt;p&gt;#100ExperimentsWithDGX #LocalLLM #ImageAnalysis #FaceNet&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>facenet</category>
    </item>
    <item>
      <title>[Day 5] I Trained My Cat-LoRA on 22 vs 213 Photos and the Results Were Basically Identical</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Mon, 11 May 2026 01:23:26 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-5-my-cat-lora-got-worse-with-45x-more-photos-so-i-figured-out-why-and-fixed-it-i6m</link>
      <guid>https://dev.to/peppercorn_llm/day-5-my-cat-lora-got-worse-with-45x-more-photos-so-i-figured-out-why-and-fixed-it-i6m</guid>
      <description>&lt;h1&gt;
  
  
  [Day 5] I Trained My Cat-LoRA on 22 vs 213 Photos and the Results Were Basically Identical
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Day 5!&lt;/p&gt;

&lt;p&gt;Today was originally going to be "have AI analyze a year of my Amazon order history," but downloading the Amazon purchase history just wouldn't work no matter what I tried. So that was a bust.&lt;/p&gt;

&lt;p&gt;Pivoted.&lt;/p&gt;

&lt;p&gt;On Day 2, I trained an AI to memorize my cat from 22 photos (&lt;a href="https://dev.to/peppercorn_llm/day-2-i-trained-an-ai-on-22-photos-of-my-cat-now-it-draws-her-in-any-scene-3a92"&gt;Day 2 article&lt;/a&gt;). That thing is called a "LoRA."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What's a LoRA?&lt;/strong&gt; = A small add-on that teaches an AI to recognize a specific subject. Pair photos with a trigger word like &lt;code&gt;ohwx cat&lt;/code&gt;, train, and then writing &lt;code&gt;ohwx cat&lt;/code&gt; in any prompt makes the AI draw my cat.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On Day 4, I had AI sort through 25,000 photos on my iPhone (&lt;a href="https://dev.to/peppercorn_llm/day-4-i-had-a-local-ai-sort-through-25000-photos-on-my-iphone-545p"&gt;Day 4 article&lt;/a&gt;). It found &lt;strong&gt;999 photos&lt;/strong&gt; it identified as cats.&lt;/p&gt;

&lt;p&gt;Today's experiment: &lt;strong&gt;Will using those 999 photos make my cat-LoRA stronger?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A simple expectation, really. 22 photos → 999 photos is &lt;strong&gt;45x more data&lt;/strong&gt;. Surely the LoRA gets stronger, right?&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;The short version (spoilers ahead):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training with 999 photos made things &lt;strong&gt;worse&lt;/strong&gt;, not better&lt;/li&gt;
&lt;li&gt;After removing "other people's cats" from the dataset (down to &lt;strong&gt;213 photos&lt;/strong&gt;), I got LoRA quality matching my original 22-photo version&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;22 photos and 213 photos produced basically the same quality&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I came in thinking "more photos = stronger LoRA." Turns out &lt;strong&gt;that's not really how it works&lt;/strong&gt;, and today I learned why.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I actually did
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Trained on 999 photos → got worse (v2)
&lt;/h3&gt;

&lt;p&gt;Same base model and trigger word (&lt;code&gt;ohwx cat&lt;/code&gt;) as Day 2. The big change is the photo count, from 22 to 999, with each photo captioned by the trigger word alone. Kohya_ss training, 14 minutes. Calling this &lt;strong&gt;v2&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Generated test images and…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftwv75qe21hy51ishsaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftwv75qe21hy51ishsaz.png" alt="No LoRA / v1 / v2 comparison" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photorealistic scene (left: no LoRA, center: v1=22 photos, right: v2=999 photos). &lt;strong&gt;v2 looks barely different from no-LoRA.&lt;/strong&gt; 45x more data, but the cat identity is gone.&lt;/p&gt;

&lt;p&gt;Creative prompts were worse:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbkxtftcyq7hedns58r1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbkxtftcyq7hedns58r1.png" alt="Chef v1 vs v2" width="800" height="628"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prompt: "ohwx cat as a cute chef." v2 produced &lt;strong&gt;a human woman&lt;/strong&gt; as the chef, with the cat reduced to a tiny illustration on her apron.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxm6ineeqblhoyid8p0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxm6ineeqblhoyid8p0p.png" alt="Astronaut v1 vs v2" width="800" height="628"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prompt: "ohwx cat as an astronaut." v2 produced &lt;strong&gt;a tabby (orange-striped) cat&lt;/strong&gt; — the fur color is straight up wrong. My cat is black and white.&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;More data made the LoRA broadly worse&lt;/strong&gt;, across both photorealistic and creative prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cause: "other cats" had snuck into the dataset
&lt;/h3&gt;

&lt;p&gt;Once I thought about it, it was obvious.&lt;/p&gt;

&lt;p&gt;Day 4's classifier labels images as &lt;strong&gt;"contains a cat or not"&lt;/strong&gt; — it does NOT verify "is this MY cat." So the 999-photo "cat" folder included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;My cat&lt;/li&gt;
&lt;li&gt;Friends' and family's cats&lt;/li&gt;
&lt;li&gt;Stray cats from around town&lt;/li&gt;
&lt;li&gt;Cats at pet stores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All mixed together. When I trained with the label &lt;code&gt;ohwx cat = my cat&lt;/code&gt;, the model basically learned &lt;code&gt;ohwx cat ≈ generic cat-shape&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pulled out just my cat → 213 photos (v3)
&lt;/h3&gt;

&lt;p&gt;To curate, I borrowed another AI — &lt;strong&gt;CLIP&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What's CLIP?&lt;/strong&gt; = An image-understanding model from OpenAI. It turns each image into a vector (an embedding), and comparing two images' vectors gives a similarity score.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I used the 22 confirmed-my-cat photos from Day 2 as a reference set, then asked CLIP to score how similar each of the 999 candidates was. Sorted by score, threw the thumbnails into a single HTML page, and went through visually — checking "this one's a different cat", "this has a person in it", and so on, marking exclusions as I went.&lt;/p&gt;

&lt;p&gt;Final cut: &lt;strong&gt;213 photos, all confirmed to be my cat&lt;/strong&gt;. Re-trained → &lt;strong&gt;v3&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbflgk6b8xso3b7rt6vwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbflgk6b8xso3b7rt6vwq.png" alt="No LoRA / v1 / v2 / v3" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;v3 is &lt;strong&gt;as sharp as v1&lt;/strong&gt;. Tuxedo pattern, white chest, the works.&lt;/p&gt;

&lt;p&gt;Creative prompts came back too:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbp038vlynrt2w9x48nhr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbp038vlynrt2w9x48nhr.png" alt="Chef v1 / v2 / v3" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The human chef from v2 is gone, replaced by my cat. The astronaut and forest cat similarly snapped back (more comparisons in the collapsible section below).&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;Cleaning the data was enough to fix everything.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus: also tried natural-language captions (v4)
&lt;/h3&gt;

&lt;p&gt;One more thing I wanted to test.&lt;/p&gt;

&lt;p&gt;v1 (Day 2) and v3 (today) differ in their &lt;strong&gt;captions&lt;/strong&gt; — the text labels paired with each training photo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;v1: hand-written natural sentences (&lt;code&gt;ohwx cat, walking on a metal kitchen counter, side profile, indoor kitchen with spice bottles...&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;v3: just the trigger word (&lt;code&gt;ohwx cat&lt;/code&gt;) repeated for every image&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What's a caption?&lt;/strong&gt; = A short English text describing what's in each photo, paired with that photo during training.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Would adding richer captions on top of clean data push v3 further? Hand-writing 213 captions wasn't realistic, so I had &lt;strong&gt;another AI (Qwen2-VL) auto-generate them&lt;/strong&gt;. Calling this &lt;strong&gt;v4&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;v4 looked basically identical to v3.&lt;/strong&gt; Small differences here and there but nothing substantial.&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;Caption granularity barely matters once the data is clean.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The actual question: does more data make a stronger LoRA?
&lt;/h2&gt;

&lt;p&gt;Now for the real comparison. &lt;strong&gt;v1 (22 photos)&lt;/strong&gt; vs &lt;strong&gt;v4 (213 photos)&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Photos&lt;/th&gt;
&lt;th&gt;Data purity&lt;/th&gt;
&lt;th&gt;Captions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;v1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;22&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;My cat only&lt;/td&gt;
&lt;td&gt;Hand-written natural language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;v4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;213&lt;/strong&gt; (10x!)&lt;/td&gt;
&lt;td&gt;My cat only&lt;/td&gt;
&lt;td&gt;VLM natural language (same style)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The only meaningful difference is &lt;strong&gt;photo count&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Five-way comparison:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jqdzppduw2zva026hv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jqdzppduw2zva026hv7.png" alt="No LoRA / v1 / v2 / v3 / v4" width="799" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Left to right: no LoRA, &lt;strong&gt;v1 (22)&lt;/strong&gt;, v2 (999, contaminated), v3 (213, trigger-only), &lt;strong&gt;v4 (213, natural captions)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v1 and v4 are essentially the same quality.&lt;/strong&gt; To my eye, v1 has a slightly more painterly feel on the chef prompt, but otherwise — same.&lt;/p&gt;

&lt;p&gt;Same pattern across all the other prompts:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwssthpa295likl4hflsz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwssthpa295likl4hflsz.png" alt="Chef v1 / v2 / v3 / v4" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;10x more photos. No visible improvement.&lt;/strong&gt; This was today's main finding.&lt;/p&gt;




&lt;h2&gt;
  
  
  After the fact, I looked it up. Turns out this is common knowledge.
&lt;/h2&gt;

&lt;p&gt;I found "more photos doesn't help" interesting enough to look up afterward, and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Character LoRAs are typically trained on &lt;strong&gt;25–40 images&lt;/strong&gt;, with 40–80 as a soft cap&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;"Over 30 images shows diminishing returns; dataset quality matters more than dataset size"&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;"15–20 well-curated images beat 50 mediocre ones"&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Too many images can actually overfit and degrade the result&lt;/li&gt;
&lt;li&gt;DreamBooth (a closely related technique) was designed around &lt;strong&gt;3–5 images&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ It's a &lt;strong&gt;widely repeated rule of thumb&lt;/strong&gt; among people who train LoRAs: photo count saturates fast, and dataset purity is the real lever.&lt;/p&gt;

&lt;p&gt;Day 2's 22 photos? Turns out that was already a healthy amount.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I learned today
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Quality &amp;gt; Quantity, apparently
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;22 photos (v1) ≈ 213 photos (v4): photo count doesn't push quality much&lt;/li&gt;
&lt;li&gt;999 photos (v2): contamination made things worse&lt;/li&gt;
&lt;li&gt;213 photos (v3): cleaning brought everything back&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;"More photos = better LoRA" runs out of road fast. What actually moves the needle is &lt;strong&gt;the right photos, not more photos&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  A working playbook (so far)
&lt;/h3&gt;

&lt;p&gt;From today's experiments:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Source photos that match the goal&lt;/strong&gt; (photos of MY cat, not "any cat")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aim for 20–30 photos&lt;/strong&gt; — past that, diminishing returns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Captions help, but don't sweat the wording&lt;/strong&gt; — auto-generated is fine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you must use a big dataset, curate aggressively first&lt;/strong&gt; — contamination is brutal&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  💡 Tip: when you want to use a big dataset anyway
&lt;/h3&gt;

&lt;p&gt;If you're starting from a large unfiltered pile and want to use as much of it as possible, &lt;strong&gt;pre-curation is essential&lt;/strong&gt;. The approach that worked today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick a small "ground truth" set (~20 confirmed examples)&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;CLIP image similarity&lt;/strong&gt; to score the big pile against the ground truth&lt;/li&gt;
&lt;li&gt;Browse thumbnails sorted by score, eyeball-exclude the misses&lt;/li&gt;
&lt;li&gt;Train on what's left&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Details in the collapsible section below.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technical details (the AI explains)
&lt;/h2&gt;

&lt;p&gt;The implementation details, walked through by Claude.&lt;/p&gt;

&lt;p&gt;:::details 1. More v2 failure examples&lt;/p&gt;

&lt;p&gt;Skipped from the main body for length, but worth seeing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu953mrzwnoiwb63zrig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu953mrzwnoiwb63zrig.png" alt="Fantasy forest v1 vs v2" width="800" height="628"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prompt: "ohwx cat in a magical forest." v2 produced &lt;strong&gt;a black-bear-style illustration&lt;/strong&gt; — the cat identity is completely gone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tfqvxk2cudqz5n056kl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tfqvxk2cudqz5n056kl.png" alt="Balcony v1 vs v2" width="800" height="628"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The one photorealistic-ish prompt where v2 sort-of held it together.&lt;/p&gt;

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details 2. Data prep and CLIP similarity ranking&lt;/p&gt;

&lt;p&gt;Day 4's &lt;code&gt;_review/cat/&lt;/code&gt; had 1,009 symlinks (503 HEIC, 505 JPG, 1 other). Resized to short-side 512px:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 shared/utils/resize-shortside.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--src&lt;/span&gt; private-data/iphone-photos-classified/_review/cat &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dst&lt;/span&gt; private-data/cat-lora-v2/images-512 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--short-side&lt;/span&gt; 512
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1,009 → 999 after collisions (9 stem collisions where &lt;code&gt;IMG_XXXX.HEIC&lt;/code&gt; and &lt;code&gt;IMG_XXXX.JPG&lt;/code&gt; produced the same &lt;code&gt;.jpg&lt;/code&gt; name) and 1 resize failure.&lt;/p&gt;
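&lt;p&gt;The stem collisions above are easy to spot before resizing. A minimal sketch, assuming the output name is simply the input stem plus &lt;code&gt;.jpg&lt;/code&gt;; the helper name &lt;code&gt;find_stem_collisions&lt;/code&gt; and the sample file names are made up for illustration and are not part of the original resize script:&lt;/p&gt;

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_stem_collisions(filenames):
    """Return {stem: [names]} for files that would share one output .jpg name."""
    by_stem = defaultdict(list)
    for name in filenames:
        # IMG_0001.HEIC and IMG_0001.JPG both reduce to the stem "IMG_0001".
        by_stem[PurePosixPath(name).stem].append(name)
    return {stem: names for stem, names in by_stem.items() if len(names) > 1}


# Hypothetical file names, for illustration only.
sample = ["IMG_0001.HEIC", "IMG_0001.JPG", "IMG_0002.HEIC", "IMG_0003.JPG"]
collisions = find_stem_collisions(sample)
print(collisions)
```

&lt;p&gt;Running a check like this first makes it a deliberate choice which copy of a colliding pair gets kept.&lt;/p&gt;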

&lt;p&gt;CLIP similarity scoring with &lt;code&gt;openai/clip-vit-base-patch32&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CLIPModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CLIPProcessor&lt;/span&gt;

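&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CLIPModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/clip-vit-base-patch32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CLIPProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/clip-vit-base-patch32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# embed() is a helper not shown here; it is assumed to return L2-normalized image features&lt;/span&gt;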
&lt;span class="n"&gt;ref_feats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ref_paths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# 22 refs
&lt;/span&gt;&lt;span class="n"&gt;cand_feats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cand_paths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# 999 candidates
&lt;/span&gt;&lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cand_feats&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;ref_feats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;                     &lt;span class="c1"&gt;# (999, 22)
&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                            &lt;span class="c1"&gt;# (999,)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Score distribution:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score band&lt;/th&gt;
&lt;th&gt;Contents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;≥ 0.85&lt;/td&gt;
&lt;td&gt;Almost all solo shots of my cat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.76 – 0.85&lt;/td&gt;
&lt;td&gt;Mostly my cat, with occasional other-cat or human contamination&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 0.76&lt;/td&gt;
&lt;td&gt;Mostly other cats or photos with people&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cut at 0.76 and reviewed everything above visually. 312 manual exclusions later: &lt;strong&gt;213 photos&lt;/strong&gt;.&lt;/p&gt;
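&lt;p&gt;For readers who want to see the scoring and the cutoff without a GPU, here is a rough pure-Python sketch of the same idea: L2-normalize each vector, average the cosine similarity against every reference, then keep and sort what clears 0.76. It assumes the embeddings are plain lists of floats; the function names and the toy 2-D vectors are illustrative and not taken from the original script:&lt;/p&gt;

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_cosine_scores(cand_vecs, ref_vecs):
    """For each candidate, average its cosine similarity to all reference vectors."""
    refs = [l2_normalize(r) for r in ref_vecs]
    scores = []
    for cand in cand_vecs:
        unit = l2_normalize(cand)
        sims = [sum(a * b for a, b in zip(unit, ref)) for ref in refs]
        scores.append(sum(sims) / len(sims))
    return scores


def rank_above_cutoff(names, scores, cutoff=0.76):
    """Keep candidates scoring at or above the cutoff, highest score first."""
    kept = [(n, s) for n, s in zip(names, scores) if s >= cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy 2-D "embeddings"; real CLIP features have hundreds of dimensions.
names = ["a.jpg", "b.jpg", "c.jpg"]
cands = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
refs = [[1.0, 0.0], [0.8, 0.6]]

scores = mean_cosine_scores(cands, refs)
ranked = rank_above_cutoff(names, scores)
print(ranked)
```

&lt;p&gt;Averaging over all references rewards photos that resemble the reference set as a whole. Taking the maximum instead would favor near-duplicates of a single reference, which is a different trade-off.&lt;/p&gt;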

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details 3. Browser-based curation UI&lt;/p&gt;

&lt;p&gt;A single HTML page laying out all 999 thumbnails in score order, served via &lt;code&gt;python3 -m http.server&lt;/code&gt;. Each thumbnail has a checkbox:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"cell"&lt;/span&gt; &lt;span class="na"&gt;data-name=&lt;/span&gt;&lt;span class="s"&gt;"IMG_2906.jpg"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;img&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"thumbs-256/IMG_2906.jpg"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"meta"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;#1 0.871&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"checkbox"&lt;/span&gt; &lt;span class="na"&gt;onchange=&lt;/span&gt;&lt;span class="s"&gt;"toggleExclude(this)"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;exportExcluded&lt;/span&gt;&lt;span class="p"&gt;(){&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelectorAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.cell.excluded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;excluded.txt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Click "Export excluded list" to download &lt;code&gt;excluded.txt&lt;/code&gt;, then use that to filter the training dir.&lt;/p&gt;
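&lt;p&gt;Applying &lt;code&gt;excluded.txt&lt;/code&gt; can be as simple as a set lookup. A minimal sketch, assuming one file name per line in the export and &lt;code&gt;.jpg&lt;/code&gt; images in a flat directory; &lt;code&gt;kept_images&lt;/code&gt; is a hypothetical helper and the demo runs against a temporary folder with made-up names:&lt;/p&gt;

```python
import tempfile
from pathlib import Path


def kept_images(image_dir, excluded_list):
    """List .jpg file names in image_dir that are not named in excluded_list."""
    text = Path(excluded_list).read_text(encoding="utf-8")
    excluded = {line.strip() for line in text.splitlines() if line.strip()}
    return sorted(
        p.name for p in Path(image_dir).glob("*.jpg") if p.name not in excluded
    )


# Tiny self-contained demo with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["IMG_0001.jpg", "IMG_0002.jpg", "IMG_0003.jpg"]:
        (root / name).touch()
    (root / "excluded.txt").write_text("IMG_0002.jpg\n", encoding="utf-8")
    kept = kept_images(root, root / "excluded.txt")
    print(kept)
```

&lt;p&gt;Returning the list of survivors, instead of deleting files in place, keeps the original folder intact in case the exclusion list needs another pass.&lt;/p&gt;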

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details 4. Training configs (Kohya_ss / TOML)&lt;/p&gt;

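&lt;p&gt;The hyperparameters below are shared across v1/v2/v3/v4. What changes per run is the dataset, the output name and, judging by the step math further down, the repeat and epoch settings used to keep step counts close:&lt;br&gt;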
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;output_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ohwx_cat_v3"&lt;/span&gt;   &lt;span class="c"&gt;# or v4&lt;/span&gt;
&lt;span class="py"&gt;max_train_epochs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="py"&gt;network_dim&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;
&lt;span class="py"&gt;network_alpha&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;span class="py"&gt;unet_lr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1e-4&lt;/span&gt;
&lt;span class="py"&gt;text_encoder_lr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;5e-5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step counts are roughly matched:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Math (images × repeats × epochs ÷ batch size)&lt;/th&gt;
&lt;th&gt;Steps&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v1&lt;/td&gt;
&lt;td&gt;22 × 10 × 10 ÷ 2&lt;/td&gt;
&lt;td&gt;1,100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v2&lt;/td&gt;
&lt;td&gt;999 × 1 × 2 ÷ 2&lt;/td&gt;
&lt;td&gt;999&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v3 / v4&lt;/td&gt;
&lt;td&gt;213 × 5 × 2 ÷ 2&lt;/td&gt;
&lt;td&gt;1,065&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
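&lt;p&gt;The step math in the table can be reproduced in a few lines. This sketch assumes the four factors are images, repeats, epochs and batch size, in that order, which is the usual way Kohya-style step counts are estimated; the table itself does not label them, so treat that reading as an assumption:&lt;/p&gt;

```python
def total_steps(images, repeats, epochs, batch_size):
    """Images x repeats x epochs, divided by batch size (integer division)."""
    return images * repeats * epochs // batch_size


# Factor values copied from the table above.
runs = {
    "v1": total_steps(22, 10, 10, 2),
    "v2": total_steps(999, 1, 2, 2),
    "v3/v4": total_steps(213, 5, 2, 2),
}
for name, steps in runs.items():
    print(name, steps)
```

&lt;p&gt;Holding this product roughly constant is what lets a 22-photo run and a 999-photo run be compared at similar training length.&lt;/p&gt;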

&lt;p&gt;All runs land at roughly 1,000–1,100 steps, so training length is not what separates them. The variables in play are the dataset (size and purity) and caption granularity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/Kohya_ss &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
accelerate launch &lt;span class="nt"&gt;--num_cpu_threads_per_process&lt;/span&gt; 8 train_network.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config_file&lt;/span&gt; configs/train_v3.toml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dataset_config&lt;/span&gt; configs/dataset_v3.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DGX Spark, 1.4 it/s, ~14 minutes per training run.&lt;/p&gt;

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details 5. Qwen2-VL caption auto-generation&lt;/p&gt;

&lt;p&gt;Reusing Day 4's Qwen2-VL 7B Instruct setup. The prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Describe what is happening in this cat photo using short comma-separated
phrases. Cover: (1) the cat's pose or action, (2) the view angle,
(3) the setting and notable background details. Keep it under 25 words.
Do NOT describe the cat's appearance (color, breed, fur, markings) — focus
only on the scene. Output the description directly without any preamble.
Example: walking on a metal kitchen counter, side profile, indoor kitchen
with spice bottles and shelves in the background
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "do not describe the cat's appearance" line is intentional: identity is supposed to come from the trigger word &lt;code&gt;ohwx cat&lt;/code&gt;, so captions should only describe context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;desc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;vlm_caption&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;caption&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ohwx cat, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;txt_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caption&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
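&lt;p&gt;VLM output is not always tidy, so a small cleanup step before writing the caption file can help. A sketch under assumptions: &lt;code&gt;build_caption&lt;/code&gt; is a hypothetical helper, and the 25-word cap simply mirrors the limit stated in the prompt. The original script may not do any of this:&lt;/p&gt;

```python
def build_caption(desc, trigger="ohwx cat", max_words=25):
    """Prefix the trigger word and tidy a VLM description into one caption line."""
    # Collapse newlines and repeated spaces, then drop a trailing period.
    words = " ".join(desc.split()).rstrip(".").split(" ")
    # Enforce the word cap and avoid ending on a dangling comma.
    body = " ".join(words[:max_words]).rstrip(",")
    return f"{trigger}, {body}"


caption = build_caption("  sitting, side view,\nindoor setting, wooden floor. ")
print(caption)
```

&lt;p&gt;Keeping the trigger word first and the scene description second matches the caption layout shown in the sample output.&lt;/p&gt;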



&lt;p&gt;213 captions in 6 minutes. Sample output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ohwx cat, sitting, side view, indoor setting, wooden floor,
folding chair, curtain, air conditioner
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stylistically very close to Day 2's hand-written captions.&lt;/p&gt;

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details 6. Version summary&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;v1 (Day 2)&lt;/th&gt;
&lt;th&gt;v2&lt;/th&gt;
&lt;th&gt;v3&lt;/th&gt;
&lt;th&gt;v4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Photos&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;999&lt;/td&gt;
&lt;td&gt;213&lt;/td&gt;
&lt;td&gt;213&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cat content&lt;/td&gt;
&lt;td&gt;My cat only&lt;/td&gt;
&lt;td&gt;My cat + many others&lt;/td&gt;
&lt;td&gt;My cat only&lt;/td&gt;
&lt;td&gt;My cat only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Captions&lt;/td&gt;
&lt;td&gt;Hand-written natural&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ohwx cat&lt;/code&gt; only&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ohwx cat&lt;/code&gt; only&lt;/td&gt;
&lt;td&gt;VLM natural&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total steps&lt;/td&gt;
&lt;td&gt;1,100&lt;/td&gt;
&lt;td&gt;999&lt;/td&gt;
&lt;td&gt;1,065&lt;/td&gt;
&lt;td&gt;1,065&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training time&lt;/td&gt;
&lt;td&gt;13m 3s&lt;/td&gt;
&lt;td&gt;14m 0s&lt;/td&gt;
&lt;td&gt;14m 0s&lt;/td&gt;
&lt;td&gt;14m 0s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What each pair isolates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v2 vs v3&lt;/strong&gt; → effect of data purity (same trigger-only captions; note that photo count also drops from 999 to 213)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v3 vs v4&lt;/strong&gt; → effect of caption granularity (same data, only captions differ)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v1 vs v4&lt;/strong&gt; → effect of photo count (clean data and natural-language captions on both sides; captions are hand-written for v1 and VLM-generated for v4)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details 7. References on LoRA training dataset size&lt;/p&gt;

&lt;p&gt;The "diminishing returns past ~30 photos" claim has multiple sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;20–30 photos saturates; dataset quality &amp;gt; dataset size&lt;/strong&gt; (&lt;a href="https://civitai.com/articles/699/large-dataset-lora-tips-and-tricks-google-colab-sd-15-optimized" rel="noopener noreferrer"&gt;Civitai: Large Dataset LoRA Tips&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15–20 well-curated images beat 50 mediocre ones&lt;/strong&gt; (same)&lt;/li&gt;
&lt;li&gt;Over-training and "overcooked" LoRAs from too much data (&lt;a href="https://huggingface.co/blog/FPHam/lora-secrets-1" rel="noopener noreferrer"&gt;Hugging Face Blog: After 500+ LoRAs&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;DreamBooth (the original subject-finetuning technique) was designed around 3–5 images (&lt;a href="https://dreambooth.github.io/" rel="noopener noreferrer"&gt;DreamBooth project page&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;:::&lt;/p&gt;




&lt;h2&gt;
  
  
  Tomorrow's preview: Day 6
&lt;/h2&gt;

&lt;p&gt;Day 6: still undecided. Decision tomorrow morning.&lt;/p&gt;




&lt;p&gt;#100ExperimentsWithDGX #LocalLLM #LoRA #StableDiffusion&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>lora</category>
    </item>
    <item>
      <title>[Day 4] I Had a Local AI Sort Through 25,000 Photos on My iPhone</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Thu, 07 May 2026 23:58:45 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-4-i-had-a-local-ai-sort-through-25000-photos-on-my-iphone-545p</link>
      <guid>https://dev.to/peppercorn_llm/day-4-i-had-a-local-ai-sort-through-25000-photos-on-my-iphone-545p</guid>
      <description>&lt;h1&gt;
  
  
  [Day 4] I Had a Local AI Sort Through 25,000 Photos on My iPhone
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Day 4: I'm going to hand the 25,000 photos sitting on my iPhone over to a local AI for sorting.&lt;/p&gt;

&lt;p&gt;This is experiment #4.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What I'm using today: DGX Spark + &lt;a href="https://github.com/openai/CLIP" rel="noopener noreferrer"&gt;CLIP&lt;/a&gt; (image-understanding AI from OpenAI) + &lt;a href="https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct" rel="noopener noreferrer"&gt;Qwen2-VL&lt;/a&gt; (a vision-language model that can chat about images, from Alibaba).&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Today's setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt;: 25,382 photos and videos sitting on my iPhone (96 GB).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: Have AI find unnecessary photos so I can drop my phone storage subscription.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approach&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stage 1&lt;/strong&gt;: Quickly classify all 25K with CLIP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage 2&lt;/strong&gt;: Have Qwen2-VL (a VLM) grade CLIP's classifications.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Comparison axis&lt;/strong&gt;: Lightweight + fast classifier (CLIP) vs. heavyweight + smart conversational AI (VLM).&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom line&lt;/strong&gt;: Overall agreement of &lt;strong&gt;84.5%&lt;/strong&gt; when the VLM grades CLIP's classifications. &lt;strong&gt;People detection: 99.2%&lt;/strong&gt;, with only 59 misses out of 7,195 photos. Documents and screenshots were the weak spot: the VLM disagreed on roughly a third of the documents and a quarter of the screenshots. Oh, and I gave up midway and just dumped everything into Amazon Photos because I'd just learned Prime members get unlimited photo storage. Five years a Prime member, never knew.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔧 Steps
&lt;/h2&gt;

&lt;p&gt;Big picture flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iPhone
   ↓ ① Sync via iCloud for Windows
myPC1 (Windows)
   ↓ ② scp transfer to DGX (96 GB)
DGX (Linux)
   ├─ ③ Split photos and videos by extension
   │     └ Photos 24,497 / Videos 884
   ├─ ④ Classify with CLIP (~20 min)
   │     └ Sorted into 8 categories
   └─ ⑤ Have VLM grade "is this category right?" (~3 hours)
         └ Overall agreement: 84.5%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's walk through each step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting photos onto the DGX (the biggest hurdle)
&lt;/h3&gt;

&lt;p&gt;iPhone → myPC1 (a Windows laptop I use day-to-day) → DGX, a two-leg relay.&lt;/p&gt;

&lt;p&gt;The myPC1 → DGX leg started at &lt;strong&gt;0.5 MB/s&lt;/strong&gt;, with the ETA showing "6 days." After realizing my Wi-Fi was the bottleneck, I switched to wired LAN, pointed the transfer at the DGX's wired IP instead of its hostname, and got it up to &lt;strong&gt;80 MB/s (~160x faster)&lt;/strong&gt;. Burned half a day. More technical details in the collapsible section below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Splitting photos and videos
&lt;/h3&gt;

&lt;p&gt;The 25,382 transferred files broke down like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Extension&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HEIC&lt;/td&gt;
&lt;td&gt;13,107&lt;/td&gt;
&lt;td&gt;Photo (Apple's format)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JPG / JPEG&lt;/td&gt;
&lt;td&gt;10,721&lt;/td&gt;
&lt;td&gt;Photo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PNG&lt;/td&gt;
&lt;td&gt;660&lt;/td&gt;
&lt;td&gt;Photo (mostly screenshots)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WEBP&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Photo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MOV&lt;/td&gt;
&lt;td&gt;799&lt;/td&gt;
&lt;td&gt;Video&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MP4&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;Video&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ini&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;System file (ignored)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I had Claude write a small script that splits photos and videos into separate folders by extension (one command, takes a few minutes — details in the collapsible section).&lt;/p&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Photos: &lt;strong&gt;24,497&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Videos: 884&lt;/li&gt;
&lt;li&gt;Photos are the focus from here.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What is CLIP?
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;CLIP&lt;/strong&gt; = an image-understanding AI from OpenAI, apparently. You hand it a photo along with several candidate labels ("is this a cat? a landscape? a screenshot?"), and it returns a &lt;strong&gt;similarity score&lt;/strong&gt; for each. Being lightweight and fast is its specialty, supposedly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Stage 1: Classifying all 25K photos with CLIP
&lt;/h3&gt;

&lt;p&gt;I set up 8 categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trash candidates&lt;/strong&gt;: screenshot / document / blank&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep&lt;/strong&gt;: food / landscape / other&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review&lt;/strong&gt;: people / cat&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each category, I prepared multiple English captions (e.g., "a screenshot of an app", "a photo of a cat") and took the highest-scoring caption as that category's score. Also: &lt;strong&gt;anything below 0.5 confidence goes into the uncertain bucket&lt;/strong&gt; for manual review.&lt;/p&gt;

&lt;p&gt;Batch size 64, ~20 minutes of GPU time, all done. Results in the next section!&lt;/p&gt;
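&lt;p&gt;The scoring rule can be sketched in plain Python. This is a minimal sketch, not the script I actually ran: it assumes CLIP has already produced one logit per caption, that the softmax runs across every caption of every category, and the names &lt;code&gt;pick_category&lt;/code&gt; and &lt;code&gt;caption_logits&lt;/code&gt; plus the sample numbers are made up for illustration.&lt;/p&gt;

```python
import math

UNCERTAIN_THRESHOLD = 0.5  # from the article: below 0.5 goes to "uncertain"


def softmax(logits):
    """Convert raw similarity logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_category(caption_logits, threshold=UNCERTAIN_THRESHOLD):
    """caption_logits maps category -> list of logits, one per caption.

    Assumption: the softmax runs over every caption of every category,
    and a category's score is the maximum over its own captions.
    """
    categories = list(caption_logits)
    flat = [x for cat in categories for x in caption_logits[cat]]
    probs = softmax(flat)

    scores = {}
    start = 0
    for cat in categories:
        count = len(caption_logits[cat])
        scores[cat] = max(probs[start:start + count])
        start += count

    best = max(scores, key=scores.get)
    if threshold > scores[best]:
        return "uncertain", scores[best]
    return best, scores[best]


# Hypothetical logits for one photo (not real CLIP output).
example = {
    "cat": [27.0],
    "food": [20.0, 19.5],
    "screenshot": [18.0, 17.0, 16.5],
}
label, score = pick_category(example)
```

&lt;p&gt;With a clear winner the photo lands in that category; when the captions score about the same, nothing clears 0.5 and the photo falls into &lt;code&gt;uncertain&lt;/code&gt;.&lt;/p&gt;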

&lt;h3&gt;
  
  
  The "How accurate is it?" question
&lt;/h3&gt;

&lt;p&gt;CLIP did the classification, but &lt;strong&gt;how accurate is it really?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Normally you'd verify by manual inspection, but &lt;strong&gt;eyeballing 25,000 photos is not realistic&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I decided to have &lt;strong&gt;a smarter AI grade CLIP's classifications&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a VLM?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;VLM (Vision-Language Model)&lt;/strong&gt; is an AI that can hold a conversation about images, apparently.&lt;/p&gt;

&lt;p&gt;How it differs from CLIP:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;CLIP&lt;/th&gt;
&lt;th&gt;VLM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it does&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Category classification (returns probabilities)&lt;/td&gt;
&lt;td&gt;Can &lt;strong&gt;describe&lt;/strong&gt; image content in natural language&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Smartness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lightweight, fast, coarse&lt;/td&gt;
&lt;td&gt;Heavy, slow, smart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~400 MB&lt;/td&gt;
&lt;td&gt;~16 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I picked &lt;strong&gt;Qwen2-VL 7B Instruct&lt;/strong&gt; (Alibaba). Apache 2.0 licensed for commercial use, no Hugging Face authentication required for download — those were the selection criteria.&lt;/p&gt;

&lt;p&gt;The plan: ask the VLM "is this a screenshot? answer yes or no" for each image and record the answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2: Grading all 25K photos with VLM
&lt;/h3&gt;

&lt;p&gt;Started at &lt;strong&gt;16 seconds per image&lt;/strong&gt; (~5 days for the full set). The cause was image size — resizing to 448px on the short side dropped it to &lt;strong&gt;0.3 sec/image (~54x faster)&lt;/strong&gt;. Even with one-image-at-a-time inference, the full set takes ~2-3 hours.&lt;/p&gt;
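&lt;p&gt;For a feel of how much the resize shrinks the input, here is a minimal sketch of the size calculation. The function name &lt;code&gt;short_side_resize&lt;/code&gt; is hypothetical, the 4032 x 3024 input is an assumed typical 12-megapixel iPhone photo, and leaving already-small images untouched is my assumption too.&lt;/p&gt;

```python
def short_side_resize(width, height, short_side=448):
    """Return (new_width, new_height) so the shorter edge equals short_side.

    Images already at or below the target are left unchanged (an assumption).
    """
    shortest = min(width, height)
    if short_side >= shortest:
        return width, height
    scale = short_side / shortest
    return round(width * scale), round(height * scale)


# Assumed example input: a 12-megapixel photo, 4032 x 3024.
new_size = short_side_resize(4032, 3024)

# How many times fewer pixels the VLM has to look at after resizing.
pixel_ratio = (4032 * 3024) / (new_size[0] * new_size[1])
```

&lt;p&gt;With Pillow, the resulting tuple could be passed to &lt;code&gt;Image.resize&lt;/code&gt; before handing the image to the VLM.&lt;/p&gt;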

&lt;p&gt;Started before bed, woke up to 24,496 graded results.&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CLIP's classification results
&lt;/h3&gt;

&lt;p&gt;After CLIP processed 24,496 photos, the distribution looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;private-data/iphone-photos-classified/
├── _trash-candidate/      Trash candidates
│   ├── screenshots/    (981)
│   ├── documents/    (1,804)
│   └── blank/           (59)
├── _review/                Review
│   ├── people/       (7,195)
│   ├── cat/          (1,009)
│   └── uncertain/    (7,700)
└── _keep/                  Keep
    ├── food/         (1,682)
    ├── landscape/    (1,991)
    └── other/        (2,075)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Share&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;uncertain (low confidence)&lt;/td&gt;
&lt;td&gt;7,700&lt;/td&gt;
&lt;td&gt;31.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;people&lt;/td&gt;
&lt;td&gt;7,195&lt;/td&gt;
&lt;td&gt;29.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;other&lt;/td&gt;
&lt;td&gt;2,075&lt;/td&gt;
&lt;td&gt;8.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;landscape&lt;/td&gt;
&lt;td&gt;1,991&lt;/td&gt;
&lt;td&gt;8.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;document&lt;/td&gt;
&lt;td&gt;1,804&lt;/td&gt;
&lt;td&gt;7.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;food&lt;/td&gt;
&lt;td&gt;1,682&lt;/td&gt;
&lt;td&gt;6.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cat&lt;/td&gt;
&lt;td&gt;1,009&lt;/td&gt;
&lt;td&gt;4.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;screenshot&lt;/td&gt;
&lt;td&gt;981&lt;/td&gt;
&lt;td&gt;4.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;blank&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;0.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's a lot of cat photos...&lt;/p&gt;

&lt;p&gt;Let's see how CLIP actually judged some of these.&lt;/p&gt;

&lt;h4&gt;
  
  
  🎯 Big wins
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmanel40irqlut2qjmg29.jpg" alt="cat" width="800" height="1067"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8wj5qdxday12val3ffc.jpg" alt="food" width="800" height="600"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhhxkvszzn2qau4moc4w.jpg" alt="screenshot" width="800" height="1067"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh02jazjunc37kcau34z9.jpg" alt="landscape" width="800" height="600"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;My cat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;A meal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;App screenshot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Mountain (landscape)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cat &lt;strong&gt;0.97&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;food &lt;strong&gt;0.999&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;screenshot &lt;strong&gt;0.74&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;landscape &lt;strong&gt;0.98&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CLIP nailed the cat without hesitation, food at 0.999, screenshots and landscapes too. Reliable.&lt;/p&gt;

&lt;h4&gt;
  
  
  ✨ Subtly impressive recognition
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv4k8lfjqi4w8q5dpif7.jpg" alt="keychain" width="800" height="1422"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6iyz231ir399ggft9tsc.jpg" alt="coffee" width="800" height="600"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cat keychain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Close-up of coffee beans&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cat &lt;strong&gt;0.64&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;food &lt;strong&gt;0.53&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Even the keychain got recognized as "cat." And coffee beans up close as "food." Quietly impressive.&lt;/p&gt;

&lt;h4&gt;
  
  
  🤔 Funny misclassifications (CLIP's quirks)
&lt;/h4&gt;

&lt;p&gt;Browsing thumbnails by category, some interesting patterns emerged.&lt;/p&gt;

&lt;h5&gt;
  
  
  Food edition: "Trash sorting chart" beats "homemade cake" for being food-like
&lt;/h5&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2n0aw55oh1vtxogal5g.jpg" alt="cake" width="800" height="1067"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddkhzur3sde5y4nhbd57.jpg" alt="trash chart" width="800" height="600"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;My homemade cake&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Trash sorting chart&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;food &lt;strong&gt;0.57&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;food &lt;strong&gt;0.83&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both ended up in the "food" category. Apparently &lt;strong&gt;the trash sorting chart looks more food-like to CLIP than my homemade cake&lt;/strong&gt;. Reacting to the text? The table layout? Mystery.&lt;/p&gt;

&lt;h5&gt;
  
  
  People edition: "A doodle" beats "Mona Lisa" for being people-like
&lt;/h5&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqgqjays89q846vhw966.jpg" alt="Mona Lisa" width="800" height="1067"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhwg26apqs72z6ni99k9.jpg" alt="doodle" width="800" height="1067"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The Mona Lisa&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;A face I doodled myself&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;people &lt;strong&gt;0.50&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;people &lt;strong&gt;0.52&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both in the "people" category. &lt;strong&gt;My crappy doodle edges out da Vinci's Mona Lisa for being more "people-like"&lt;/strong&gt; (just barely).&lt;/p&gt;

&lt;p&gt;CLIP's quirks — kind of charming.&lt;/p&gt;




&lt;h3&gt;
  
  
  VLM's grading results
&lt;/h3&gt;

&lt;p&gt;I asked the VLM, one photo at a time, whether CLIP's category was correct. For example, photos in the cat folder got "is this a cat?", food folder got "is this food?" — yes/no answers.&lt;/p&gt;
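&lt;p&gt;The question-and-answer handling might look something like this. A minimal sketch under assumptions: the exact prompt wording, the &lt;code&gt;QUESTIONS&lt;/code&gt; table, and both function names are hypothetical, and the call to the VLM itself is left out.&lt;/p&gt;

```python
# Hypothetical question per category, modeled on the examples in the text.
QUESTIONS = {
    "cat": "Is this a cat?",
    "food": "Is this food?",
    "screenshot": "Is this a screenshot?",
}


def build_question(category):
    """Build the yes/no question for the category CLIP assigned."""
    return QUESTIONS[category] + " Answer yes or no."


def parse_yes_no(answer):
    """Map a free-form model reply to True (yes), False (no), or None."""
    words = answer.strip().lower().replace(".", " ").replace(",", " ").split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None  # unparseable replies can be set aside for manual review
```

&lt;p&gt;Keeping a third "could not parse" outcome avoids silently counting a rambling answer as a "no".&lt;/p&gt;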

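&lt;p&gt;Summary by final destination bucket (the OVERALL row also counts the 7,700 &lt;code&gt;uncertain&lt;/code&gt; photos, which have no row of their own here):&lt;/p&gt;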

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Final bucket&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;VLM agreement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;people&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7,195&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;99.2%&lt;/strong&gt; 🎯&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;food&lt;/td&gt;
&lt;td&gt;1,682&lt;/td&gt;
&lt;td&gt;95.3% 🎯&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cat&lt;/td&gt;
&lt;td&gt;1,009&lt;/td&gt;
&lt;td&gt;95.0% 🎯&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;other&lt;/td&gt;
&lt;td&gt;2,075&lt;/td&gt;
&lt;td&gt;93.6% 🎯&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;landscape&lt;/td&gt;
&lt;td&gt;1,991&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;83.5%&lt;/strong&gt; ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;screenshot&lt;/td&gt;
&lt;td&gt;981&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;75.2%&lt;/strong&gt; ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;document&lt;/td&gt;
&lt;td&gt;1,804&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;67.4%&lt;/strong&gt; ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;blank&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;52.5% ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OVERALL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24,496&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.5%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;People detection at &lt;strong&gt;99.2%&lt;/strong&gt; is quietly amazing. Out of 7,195 photos, the VLM said "no" to only &lt;strong&gt;59&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Documents and screenshots, on the other hand, came back "no" for roughly a third and a quarter of their photos, respectively. CLIP-only confidence isn't enough for those. Out of 24,496 photos, &lt;strong&gt;3,808&lt;/strong&gt; got a "no" from the VLM. That's the part CLIP alone wouldn't have caught.&lt;/p&gt;
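&lt;p&gt;The headline percentages follow directly from the counts quoted above. A quick arithmetic check (the helper name &lt;code&gt;agreement_rate&lt;/code&gt; is just for illustration):&lt;/p&gt;

```python
def agreement_rate(total, disagreements):
    """Share of photos where the VLM answered "yes" to CLIP's category."""
    return (total - disagreements) / total


overall = agreement_rate(24_496, 3_808)  # all graded photos
people = agreement_rate(7_195, 59)       # the people bucket only
print(f"overall: {overall:.1%}, people: {people:.1%}")
# prints "overall: 84.5%, people: 99.2%"
```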




&lt;h2&gt;
  
  
  💡 Today's discoveries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Multimodal AI runs at home
&lt;/h3&gt;

&lt;p&gt;Both CLIP (400 MB, classifier) and Qwen2-VL (16 GB, conversational) ran fine on my home machine. Reassuring.&lt;/p&gt;

&lt;h3&gt;
  
  
  CLIP's confidence is a reliable signal
&lt;/h3&gt;

&lt;p&gt;VLM agreement broken down by CLIP confidence:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CLIP confidence&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;VLM agreement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.9+ (super confident)&lt;/td&gt;
&lt;td&gt;3,555&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;96.5%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.7–0.9&lt;/td&gt;
&lt;td&gt;6,285&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.5–0.7&lt;/td&gt;
&lt;td&gt;6,956&lt;/td&gt;
&lt;td&gt;86.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt;0.5 (uncertain)&lt;/td&gt;
&lt;td&gt;7,700&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;70.1%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Boring but important: &lt;strong&gt;when CLIP says it's confident, you can mostly trust it&lt;/strong&gt;. At 0.9+ the VLM agreed 96.5% of the time, versus 70.1% in the uncertain bucket.&lt;/p&gt;
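&lt;p&gt;As a sanity check, the four confidence bands should average out to the overall figure when weighted by photo count. A small sketch using only the numbers from the confidence table (the variable names are mine):&lt;/p&gt;

```python
# (label, photo count, VLM agreement) rows copied from the confidence table.
BUCKETS = [
    ("0.9+", 3_555, 0.965),
    ("0.7-0.9", 6_285, 0.935),
    ("0.5-0.7", 6_956, 0.860),
    ("under 0.5", 7_700, 0.701),
]

total_photos = sum(count for _, count, _ in BUCKETS)
agreed = sum(count * rate for _, count, rate in BUCKETS)

# Count-weighted average agreement across all four bands.
weighted_agreement = agreed / total_photos
```

&lt;p&gt;The counts add up to the 24,496 graded photos, and the weighted average comes out at about 84.5%, matching the overall agreement.&lt;/p&gt;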

&lt;h3&gt;
  
  
  CLIP's weak spots
&lt;/h3&gt;

&lt;p&gt;Things that &lt;strong&gt;clearly appear in photos&lt;/strong&gt; (people, food, cats, other objects) score roughly 94–99%. Abstract or compound subjects (documents, screenshots, landscapes) drop to about 67–84%.&lt;/p&gt;

&lt;p&gt;Documents at 67.4% in particular. That's where VLM re-grading earns its keep.&lt;/p&gt;

&lt;h3&gt;
  
  
  Role split: lightweight model × smart model
&lt;/h3&gt;

&lt;p&gt;Use CLIP to triage everything quickly, VLM to grade the suspicious cases — a two-layer setup. &lt;strong&gt;Best of both worlds in speed and accuracy&lt;/strong&gt;.&lt;/p&gt;
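&lt;p&gt;If I wanted to send only the suspicious cases to the VLM instead of everything, the routing rule could be as small as this. It's a hypothetical sketch, not what I ran today: the 0.7 cutoff and the &lt;code&gt;WEAK_CATEGORIES&lt;/code&gt; set (the buckets that scored under 90% agreement) are my assumptions.&lt;/p&gt;

```python
# Assumption: categories where VLM agreement was under 90% in today's run.
WEAK_CATEGORIES = {"document", "screenshot", "landscape", "blank"}


def needs_vlm_check(category, confidence, cutoff=0.7):
    """Decide whether a CLIP result should be re-graded by the VLM.

    Weak categories are always re-checked; everything else is re-checked
    only when CLIP's confidence is below the (assumed) cutoff.
    """
    if category in WEAK_CATEGORIES:
        return True
    return cutoff > confidence
```

&lt;p&gt;Under this rule a confident "people" photo skips the slow model, while every "document" gets a second opinion.&lt;/p&gt;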

&lt;p&gt;Day 3 had the same pattern: &lt;strong&gt;"aggregation = tools, interpretation = AI."&lt;/strong&gt; Today's variant: &lt;strong&gt;"rough sorting = CLIP, accuracy check = VLM."&lt;/strong&gt; Picking the right AI for the right task pays off in both performance and cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Input quality matters more than model size" struck again
&lt;/h3&gt;

&lt;p&gt;On Day 3 (credit card analysis), I learned &lt;strong&gt;"input quality &amp;gt; model size."&lt;/strong&gt; The same pattern showed up today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VLM with &lt;strong&gt;original-resolution images&lt;/strong&gt;: 16 sec/image (5 days for full run)&lt;/li&gt;
&lt;li&gt;VLM with &lt;strong&gt;resized 448px images&lt;/strong&gt;: 0.3 sec/image (2 hours)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just by tidying up the input, &lt;strong&gt;54x speedup&lt;/strong&gt; — small change, huge impact.&lt;/p&gt;
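&lt;p&gt;The run-time estimates are simple multiplication. A quick sketch using the rounded per-image timings from the list above (the helper name &lt;code&gt;total_hours&lt;/code&gt; is mine); with these rounded inputs the ratio lands at roughly 53x, in line with the ~54x quoted:&lt;/p&gt;

```python
PHOTOS = 24_496  # graded photos, from the results section


def total_hours(seconds_per_image, count=PHOTOS):
    """Wall-clock hours for one-image-at-a-time inference."""
    return seconds_per_image * count / 3600


before_hours = total_hours(16)   # original-resolution images
after_hours = total_hours(0.3)   # resized to 448px on the short side
speedup = before_hours / after_hours
print(f"{before_hours / 24:.1f} days -> {after_hours:.1f} hours")
# prints "4.5 days -> 2.0 hours"
```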

&lt;p&gt;Not "biggest model possible" or "raw original": &lt;strong&gt;clean up the input before sending it to the AI&lt;/strong&gt;. This has now worked two days in a row, on Day 3 and Day 4.&lt;/p&gt;

&lt;h3&gt;
  
  
  Heart broken, switched to Amazon Photos
&lt;/h3&gt;

&lt;p&gt;I tried to verify the trash candidate folder, then realized I'd need to cross-reference VLM scores too, then realized &lt;strong&gt;I never set clear criteria for "what to delete" in the first place&lt;/strong&gt;. Couldn't finalize the cleanup, and morale broke.&lt;/p&gt;

&lt;p&gt;Right then I learned that &lt;strong&gt;Amazon Prime members get unlimited photo storage&lt;/strong&gt;, so I just dumped everything into Amazon Photos. Lol.&lt;/p&gt;

&lt;p&gt;That said, I really should have &lt;strong&gt;defined the deletion criteria&lt;/strong&gt; before starting.&lt;/p&gt;

&lt;p&gt;The classified data on the DGX should be a useful resource for future experiments in this series.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ How I actually did this
&lt;/h2&gt;

&lt;p&gt;:::details Wi-Fi 0.5 MB/s → wired LAN 80 MB/s journey&lt;/p&gt;

&lt;p&gt;The 96 GB myPC1 → DGX transfer started at &lt;strong&gt;236 KB/s&lt;/strong&gt; via WinSCP (ETA: 6 days). The cause was myPC1 being on Wi-Fi.&lt;/p&gt;

&lt;p&gt;I plugged the PC into the router with a LAN cable → ping dropped close to 0 ms. But WinSCP was still stuck at &lt;strong&gt;500 KB/s&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;ping spark-XXXX.local&lt;/code&gt; in PowerShell revealed that the hostname resolved to the &lt;strong&gt;DGX's Wi-Fi-side IP&lt;/strong&gt;. The DGX was dual-homed (wired + Wi-Fi), and mDNS was still handing back the Wi-Fi address, so traffic kept taking the slow path.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Failure (routes through Wi-Fi)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;scp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-r&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C:\Users\[user]\Pictures\iCloud Photos\Photos"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="nx"&gt;spark-XXXX.local:...&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Success (direct IP over wired LAN)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;scp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-r&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C:\Users\[user]\Pictures\iCloud Photos\Photos"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="err"&gt;@&lt;/span&gt;&lt;span class="nx"&gt;10.0.0.205:...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Switched from hostname to &lt;strong&gt;explicit IP&lt;/strong&gt; and watched it scream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IMG_0190.HEIC                100% 1812KB  84.3MB/s   00:00
IMG_0190.MOV                 100%   17MB 102.4MB/s   00:00
IMG_0192.HEIC                100% 2256KB  81.6MB/s   00:00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also discovered WinSCP (SFTP-based) struggles with many small files, while &lt;strong&gt;scp (stream transfer) is much faster&lt;/strong&gt;. With 25,382 files, scp won by a landslide.&lt;/p&gt;
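&lt;p&gt;The same "which address does this hostname actually resolve to?" check can be scripted. A minimal sketch using only the Python standard library; the &lt;code&gt;10.0.0.0/24&lt;/code&gt; wired subnet is an assumption based on the &lt;code&gt;10.0.0.205&lt;/code&gt; address above, and both function names are hypothetical.&lt;/p&gt;

```python
import ipaddress
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses a hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


def on_wired_subnet(address, wired_subnet="10.0.0.0/24"):
    """True if the address belongs to the (assumed) wired LAN subnet."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(wired_subnet)
```

&lt;p&gt;Calling &lt;code&gt;resolve_ipv4&lt;/code&gt; on the DGX hostname and passing each result to &lt;code&gt;on_wired_subnet&lt;/code&gt; would flag a Wi-Fi-side address before a 96 GB transfer starts.&lt;/p&gt;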

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details Splitting photos and videos by extension&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PHOTO_EXTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.heic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.heif&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.webp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;VIDEO_EXTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.mov&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.m4v&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;input_dir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rglob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_file&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="n"&gt;ext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ext&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;PHOTO_EXTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;move&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;photos_out&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;ext&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;VIDEO_EXTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;move&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;videos_out&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple. Caught one snag: right after transfer, the directory permission was &lt;code&gt;dr-x------&lt;/code&gt; (read-only), so the first &lt;code&gt;shutil.move&lt;/code&gt; died with &lt;code&gt;PermissionError&lt;/code&gt;. &lt;code&gt;chmod u+w&lt;/code&gt; fixed it.&lt;/p&gt;
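&lt;p&gt;One more thing worth guarding against: the script flattens every subfolder into one output folder, so two files with the same name would collide. I don't know whether my library actually had such duplicates, so treat this as a hypothetical safeguard. The helper name &lt;code&gt;unique_destination&lt;/code&gt; and the &lt;code&gt;_1&lt;/code&gt;, &lt;code&gt;_2&lt;/code&gt; suffix scheme are assumptions.&lt;/p&gt;

```python
from pathlib import Path
import tempfile


def unique_destination(out_dir, name):
    """Return a path in out_dir that does not exist yet.

    If "IMG_0001.JPG" is taken, try "IMG_0001_1.JPG", "IMG_0001_2.JPG", ...
    """
    candidate = Path(out_dir) / name
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = Path(out_dir) / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "IMG_0001.JPG").touch()
    first_free = unique_destination(tmp, "IMG_0001.JPG").name
    untouched = unique_destination(tmp, "IMG_0002.JPG").name
```

&lt;p&gt;The returned path could replace &lt;code&gt;photos_out / src.name&lt;/code&gt; as the destination passed to &lt;code&gt;shutil.move&lt;/code&gt;.&lt;/p&gt;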

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details CLIP classification script&lt;/p&gt;

&lt;p&gt;Used &lt;code&gt;transformers&lt;/code&gt; to load &lt;code&gt;openai/clip-vit-base-patch32&lt;/code&gt;. For each category, multiple captions are prepared, and the max softmax score is used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CATEGORIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a screenshot of an app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a phone screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a screenshot of a website or chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a document or paper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a receipt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a QR code or barcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of an ID card or driver&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s license&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;people&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a person&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of people&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a portrait of someone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a cat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;food&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of food or a meal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;landscape&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a landscape or scenery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of a building or city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;other&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a photo of an object or item&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text_prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logits_per_image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anything below 0.5 confidence goes into &lt;code&gt;_review/uncertain/&lt;/code&gt;. Near-black/near-white images get caught by a brightness check and routed to &lt;code&gt;_trash-candidate/blank/&lt;/code&gt; before they reach CLIP.&lt;/p&gt;

&lt;p&gt;All per-image category scores are also saved to JSON. That JSON is what the VLM evaluation step consumes later.&lt;/p&gt;
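
&lt;p&gt;The max-over-captions step and the 0.5 cutoff can be sketched as follows. This is a minimal illustration, not the actual script: &lt;code&gt;flatten_prompts&lt;/code&gt; and &lt;code&gt;pick_category&lt;/code&gt; are hypothetical helper names, the category table is shortened, and the probabilities are made up.&lt;/p&gt;

```python
# Hypothetical helpers: collapse per-caption probabilities into per-category scores.
CATEGORIES = {
    "screenshot": ["a screenshot of an app", "a phone screenshot"],
    "cat": ["a photo of a cat"],
    "other": ["a photo of an object or item"],
}
UNCERTAIN_THRESHOLD = 0.5  # cutoff mentioned in the text


def flatten_prompts(categories):
    """Return parallel lists: every caption, and the category that owns it."""
    prompts, owners = [], []
    for category, captions in categories.items():
        for caption in captions:
            prompts.append(caption)
            owners.append(category)
    return prompts, owners


def pick_category(caption_probs, owners, threshold=UNCERTAIN_THRESHOLD):
    """Score each category by its best caption, then apply the cutoff."""
    scores = {}
    for prob, category in zip(caption_probs, owners):
        scores[category] = max(prob, scores.get(category, 0.0))
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best, scores
    return "_review/uncertain", scores


text_prompts, owners = flatten_prompts(CATEGORIES)
# Made-up softmax row for one image, in caption order.
label, scores = pick_category([0.05, 0.10, 0.80, 0.05], owners)
print(label, scores)
```

&lt;p&gt;In a real run, the list passed to &lt;code&gt;pick_category&lt;/code&gt; would be one row of &lt;code&gt;probs&lt;/code&gt; converted with &lt;code&gt;.tolist()&lt;/code&gt;.&lt;/p&gt;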

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details The 54x speedup from image resizing for VLM&lt;/p&gt;

&lt;p&gt;Qwen2-VL's vision token count scales with input resolution. Original-size images (several thousand pixels per side) consume hundreds to thousands of tokens, which slows inference dramatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen2-VL-7B-Instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;min_pixels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;224&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_pixels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;448&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;448&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ← cap here
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Belt and suspenders — also pre-resize the image
&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ImageOps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exif_transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;thumbnail&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;448&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;448&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That cut inference time from 16 sec/image to &lt;strong&gt;0.3 sec/image&lt;/strong&gt;.&lt;/p&gt;
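
&lt;p&gt;A rough sketch of why the cap helps, assuming Qwen2-VL spends about one vision token per 28×28 pixel area (an assumption based on the model's published pixel budgets, not something measured in this post). The function names and the photo size are hypothetical, and the real processor applies its own rounding and min/max clamping.&lt;/p&gt;

```python
# Assumption: one vision token per 28x28 pixel area (14px patches merged 2x2).
PIXELS_PER_TOKEN_SIDE = 28


def estimate_vision_tokens(width, height):
    """Rough token estimate for one image; ignores the processor's rounding rules."""
    cols = max(1, round(width / PIXELS_PER_TOKEN_SIDE))
    rows = max(1, round(height / PIXELS_PER_TOKEN_SIDE))
    return cols * rows


def thumbnail_size(width, height, box=448):
    """Aspect-preserving shrink to fit a box, approximating Image.thumbnail."""
    scale = min(box / width, box / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


original = (4032, 3024)  # hypothetical phone photo size
small = thumbnail_size(*original)
print(estimate_vision_tokens(*original), "tokens at full size")
print(estimate_vision_tokens(*small), "tokens after shrinking to", small)
```

&lt;p&gt;Under that assumption, the token count drops by well over an order of magnitude, which points the same way as the measured speedup.&lt;/p&gt;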

&lt;p&gt;The verification prompt is dead simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CATEGORY_PROMPTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Is this image a screenshot of a phone screen, an app, or a website? Answer with one word: yes or no.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Is this image primarily a document, receipt, ID card, or QR code? Answer with one word: yes or no.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;people&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Does this image clearly show one or more human persons? Answer with one word: yes or no.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;max_new_tokens=5&lt;/code&gt; caps the reply at five tokens, so combined with the "one word" instruction, only a yes or no comes back. Minimal design.&lt;/p&gt;
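
&lt;p&gt;The token cap limits length but does not guarantee the exact wording, so the reply still has to be mapped to a boolean. A minimal sketch of that mapping, where &lt;code&gt;parse_yes_no&lt;/code&gt; is a hypothetical helper and not part of the original script:&lt;/p&gt;

```python
def parse_yes_no(generated_text):
    """Map a short model reply to True/False, or None when it is neither."""
    cleaned = generated_text.strip().lower().replace(".", " ").replace(",", " ")
    words = cleaned.split()
    if not words:
        return None
    if words[0] == "yes":
        return True
    if words[0] == "no":
        return False
    return None  # unexpected reply: better to flag it than to guess


for reply in ["Yes.", " no", "The image shows"]:
    print(repr(reply), parse_yes_no(reply))
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; for anything else lets unexpected replies go to manual review instead of being counted as a "no".&lt;/p&gt;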

&lt;p&gt;:::&lt;/p&gt;

&lt;p&gt;:::details Resumable checkpointing&lt;/p&gt;

&lt;p&gt;When a run covers 24,000 images and takes 3 hours straight, you really want to be able to recover if something hiccups:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CHECKPOINT_INTERVAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;todo&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ... inference ...
&lt;/span&gt;    &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;CHECKPOINT_INTERVAL&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;save_checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a &lt;code&gt;--resume&lt;/code&gt; flag that picks up where the JSON left off:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resume&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_file&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Resumed from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; existing entries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;todo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clip_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Essential for any overnight job.&lt;/p&gt;
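
&lt;p&gt;The body of &lt;code&gt;save_checkpoint&lt;/code&gt; isn't shown above. One possible shape, as a sketch under the assumption that &lt;code&gt;results&lt;/code&gt; is a JSON-serializable dict: write to a temporary file first and then swap it into place, so a crash in the middle of a save cannot leave a half-written JSON behind.&lt;/p&gt;

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(results, output_path):
    """Write results as JSON via a temp file, then swap it into place."""
    output_path = Path(output_path)
    # The temp file lives in the same directory so the final rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    os.replace(tmp_name, output_path)  # replaces any previous checkpoint


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "vlm_results.json"  # hypothetical file name
    save_checkpoint({"IMG_0001.jpg": "yes"}, target)
    print(target.read_text(encoding="utf-8"))
```

&lt;p&gt;Calling it once more after the loop ends also covers the last partial batch of fewer than &lt;code&gt;CHECKPOINT_INTERVAL&lt;/code&gt; images.&lt;/p&gt;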

&lt;p&gt;:::&lt;/p&gt;




&lt;h2&gt;
  
  
  Next up: Day 5
&lt;/h2&gt;

&lt;p&gt;Tomorrow: &lt;strong&gt;have an AI analyze a year of my Amazon purchase history&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Switching to Amazon Photos for storage made me realize Amazon also has my entire purchase history. &lt;strong&gt;What if I asked AI "what kind of person am I, based on this?"&lt;/strong&gt; — see what patterns emerge that I never noticed myself.&lt;/p&gt;

&lt;p&gt;To be continued ＞＞＞&lt;/p&gt;




&lt;p&gt;#100ExperimentsWithDGX #LocalLLM #ImageClassification #CLIP&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>clip</category>
    </item>
    <item>
      <title>[Day 3] I Had a Local LLM Analyze a Year of My Credit Card Statements</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Tue, 05 May 2026 22:52:50 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-3-i-had-a-local-llm-analyze-a-year-of-my-credit-card-statements-4eab</link>
      <guid>https://dev.to/peppercorn_llm/day-3-i-had-a-local-llm-analyze-a-year-of-my-credit-card-statements-4eab</guid>
      <description>&lt;h1&gt;
  
  
  [Day 3] I Had a Local LLM Analyze a Year of My Credit Card Statements
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Day 3: I'm going to hand a year of credit card statements over to a local LLM and see what it can do.&lt;/p&gt;

&lt;p&gt;This is experiment #3.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What I'm using today: DGX Spark + &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; + &lt;a href="https://qwenlm.github.io/" rel="noopener noreferrer"&gt;Qwen2.5&lt;/a&gt; (comparing 7B vs 72B). Ollama is the de-facto local-LLM runtime, and Qwen2.5 is a multilingual model from Alibaba (China) that handles Japanese reasonably well, apparently.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Today's setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt;: 12 months of credit card statements from a single card.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: 383 transactions, ¥2,761,555 in total spend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: get the AI to spot waste patterns and propose savings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comparison axes&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model size&lt;/strong&gt;: 7B (light) vs 72B (heavy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input format&lt;/strong&gt;: raw CSV vs pandas-aggregated summary&lt;/li&gt;
&lt;li&gt;→ &lt;strong&gt;4 patterns&lt;/strong&gt; total&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: "If you ask an AI to aggregate raw data, the numbers come out way off." / "If you pre-aggregate with a spreadsheet tool first and then feed the AI, you get fast and accurate results." A small but practical finding.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Get the CSVs onto the DGX
&lt;/h2&gt;

&lt;p&gt;Log into the credit card company's web statements page on myPC1 (my Windows laptop), download 12 months of CSVs, then push them to the DGX.&lt;/p&gt;

&lt;p&gt;I deliberately skipped GitHub for the transfer this time — once you push something, it's in the history forever, and credit card data shouldn't be there even briefly. Instead, I used &lt;strong&gt;direct PC-to-PC transfer over SSH&lt;/strong&gt; (one command, finishes in seconds; details in the collapsibles at the end). The &lt;code&gt;.gitignore&lt;/code&gt; excludes &lt;code&gt;private-data/&lt;/code&gt; too, so accidental commits are ruled out.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Install Ollama
&lt;/h2&gt;

&lt;p&gt;Installing Ollama should take just one command.&lt;/p&gt;

&lt;p&gt;There was a small password hiccup during install (details below), but eventually it was up and running.&lt;/p&gt;

&lt;p&gt;The DGX Spark specs really show through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Memory: 121 GB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Default context window: ~262,144 tokens&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words: "throw a whole book at it, no problem" territory. Reassuring.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Two model sizes: Qwen2.5 7B vs 72B
&lt;/h2&gt;

&lt;p&gt;The strategy: &lt;strong&gt;same model family, different sizes&lt;/strong&gt;. That way the differences come from size, not architecture.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7B (light)&lt;/strong&gt;: ~4.7 GB, downloads in 5 minutes. Fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;72B (heavy)&lt;/strong&gt;: ~47 GB, 25 minutes to download. Slow but smart.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What does "B" mean?&lt;/strong&gt; Short for &lt;em&gt;Billion&lt;/em&gt;. It's the number of "weights" inside the AI: the more weights, the more it can remember, basically. So &lt;strong&gt;7B has 7 billion weights, 72B has 72 billion&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With both loaded onto the DGX at the same time, memory usage looks like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AI model&lt;/th&gt;
&lt;th&gt;Memory occupied&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:72b&lt;/td&gt;
&lt;td&gt;61 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:7b&lt;/td&gt;
&lt;td&gt;8.2 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;69 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;69 GB. Spacious!&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Prepping the CSVs
&lt;/h2&gt;

&lt;p&gt;Once I had the CSVs in hand, I hit &lt;strong&gt;three small headaches&lt;/strong&gt; before they were ready for the AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Headache 1&lt;/strong&gt;: An older encoding (Windows Japanese flavor) → needs converting to modern UTF-8&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headache 2&lt;/strong&gt;: Some merchant names contain commas, which breaks naive CSV parsing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headache 3&lt;/strong&gt;: Each file has a "monthly total" line at the end that isn't actually data&lt;/li&gt;
&lt;/ul&gt;
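
&lt;p&gt;The three fixes above can be sketched in a few lines. This is an illustration under stated assumptions, not the actual cleanup script: the column layout (date first, amount last, merchant in between), the &lt;code&gt;YYYY/MM/DD&lt;/code&gt; date format, plain-integer amounts, and the &lt;code&gt;cp932&lt;/code&gt; codec are all guesses, and the sample rows are invented.&lt;/p&gt;

```python
import re

DATE_PATTERN = re.compile(r"\d{4}/\d{2}/\d{2}$")  # assumed date format


def parse_statement(raw_bytes, encoding="cp932"):
    """Decode one statement file and return (date, merchant, amount) rows."""
    rows = []
    for line in raw_bytes.decode(encoding).splitlines():
        fields = line.split(",")
        # Header, blank lines, and the trailing monthly-total line do not
        # start with a date, so they are skipped here.
        is_data_row = len(fields) >= 3 and DATE_PATTERN.match(fields[0].strip())
        if not is_data_row:
            continue
        date = fields[0].strip()
        amount = int(fields[-1].strip())
        # Re-join the middle fields so merchant names containing commas survive.
        merchant = ",".join(fields[1:-1]).strip()
        rows.append((date, merchant, amount))
    return rows


# Invented sample in the assumed layout, encoded the way the bank file would be.
sample = (
    "利用日,利用店名,金額\n"
    "2025/04/01,コーヒー店,480\n"
    "2025/04/03,ACME, INC.,12000\n"
    ",合計,12480\n"
).encode("cp932")

for row in parse_statement(sample):
    print(row)
```

&lt;p&gt;If the bank quotes fields that contain commas, Python's &lt;code&gt;csv&lt;/code&gt; module handles that directly and the manual re-join becomes unnecessary.&lt;/p&gt;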

&lt;p&gt;Details in the collapsible. After cleanup, the 12 files merge into a single dataset:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transactions&lt;/td&gt;
&lt;td&gt;383&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Period&lt;/td&gt;
&lt;td&gt;12 months (1 year)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total spend&lt;/td&gt;
&lt;td&gt;¥2,761,555&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg per tx&lt;/td&gt;
&lt;td&gt;¥7,210&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Median per tx&lt;/td&gt;
&lt;td&gt;¥3,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Largest single tx&lt;/td&gt;
&lt;td&gt;¥209,283 (overseas flight)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smallest&lt;/td&gt;
&lt;td&gt;¥-3,980 (refund)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now to feed this to 7B and 72B and see what each of them says.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Experiment 1: Throw the raw CSV at the AI
&lt;/h2&gt;

&lt;p&gt;No tricks: &lt;strong&gt;all 383 rows, straight at the AI&lt;/strong&gt;. Prompt is the full ask: "As a household budget consultant, output category breakdown / monthly trend / waste patterns / savings suggestions / lifestyle hypothesis."&lt;/p&gt;

&lt;h3&gt;
  
  
  7B's answer (75 seconds)
&lt;/h3&gt;

&lt;p&gt;...this is where &lt;strong&gt;the numbers go wildly off&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;What 7B said&lt;/th&gt;
&lt;th&gt;Real data&lt;/th&gt;
&lt;th&gt;Match?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amazon total&lt;/td&gt;
&lt;td&gt;¥2,014,386 (257 tx)&lt;/td&gt;
&lt;td&gt;¥693,663 (166 tx)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Downloads&lt;/td&gt;
&lt;td&gt;¥2,014,386 (257 tx)&lt;/td&gt;
&lt;td&gt;¥80,323 (50 tx)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outdoor brand&lt;/td&gt;
&lt;td&gt;¥495,740&lt;/td&gt;
&lt;td&gt;¥154,820&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A local recreation venue&lt;/td&gt;
&lt;td&gt;"¥49,574" cited&lt;/td&gt;
&lt;td&gt;(a different small charge actually exists)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of the numbers line up. The Amazon total is roughly 3× too high, Amazon Downloads about 25× too high, and the amount cited for the recreation venue doesn't match the small charge that actually exists.&lt;/p&gt;

&lt;p&gt;Reading 383 rows of CSV and computing totals turned out to be a heavy lift for the 7B model.&lt;/p&gt;

&lt;h3&gt;
  
  
  72B's answer (12m 9s)
&lt;/h3&gt;

&lt;p&gt;What if we throw size at the problem? After 12 minutes of patience:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;What 72B said&lt;/th&gt;
&lt;th&gt;Real data&lt;/th&gt;
&lt;th&gt;Match?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amazon total&lt;/td&gt;
&lt;td&gt;¥635,792 (104 tx)&lt;/td&gt;
&lt;td&gt;¥693,663 (166 tx)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI/dev tools&lt;/td&gt;
&lt;td&gt;¥193,629 (21 tx)&lt;/td&gt;
&lt;td&gt;¥176,850 (24 tx)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Travel&lt;/td&gt;
&lt;td&gt;¥487,555 (43 tx)&lt;/td&gt;
&lt;td&gt;¥416,268 (8 tx)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

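&lt;p&gt;Not exact, but a real improvement: the Amazon and AI/dev tools totals are within ~10% of the real figures, travel is about 17% too high, and there are no fabricated venues. The transaction counts are still well off, though (43 vs 8 for travel).&lt;/p&gt;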

&lt;p&gt;&lt;strong&gt;However — when asked about the monthly trend, here's what 72B said:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Month 1: ¥316,789 → Month 2: ¥229,600 → Month 3: ¥237,500 → ... → Month 12: ¥291,500&lt;br&gt;
(Gradually increasing.)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The actual range is ¥69,961 (low) to ¥493,072 (high), a chaotic up-and-down waveform. "Gradually increasing" isn't quite right. Even 72B isn't great at aggregating values scattered across a long CSV.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Experiment 2: Aggregate first, then feed the AI
&lt;/h2&gt;

&lt;p&gt;If the AI struggles with aggregation, do the aggregation in a different tool first and only hand the AI the result.&lt;/p&gt;

&lt;p&gt;The flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📥 Raw CSV (22,132 chars, 383 rows)
       ↓
🔧 Pre-aggregate with a spreadsheet tool (Python's pandas)
       ↓
📋 Aggregate summary (1,884 chars, ~90% smaller)
       ↓
🤖 Hand it to the AI (let it interpret and propose)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Python's &lt;strong&gt;pandas&lt;/strong&gt; = a spreadsheet-like library for tabular data analysis. Think Excel functions, but scriptable and far more powerful.&lt;/p&gt;
&lt;/blockquote&gt;
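
&lt;p&gt;A minimal sketch of the pre-aggregation idea. The experiment used pandas for this step; the version below swaps in the standard library only so that it runs without extra installs. The rows are invented, and &lt;code&gt;build_summary&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
from collections import defaultdict

# Made-up rows in (date, merchant, amount) form; not the real statement data.
rows = [
    ("2025/04/01", "AMAZON", 3000),
    ("2025/04/15", "AIRLINE", 50000),
    ("2025/05/02", "AMAZON", 1200),
    ("2025/05/20", "AMAZON", -1200),  # refund
]


def build_summary(rows):
    """Aggregate per month and per merchant, then render compact prompt text."""
    by_month = defaultdict(int)
    by_merchant = defaultdict(lambda: [0, 0])  # merchant: [total, count]
    for date, merchant, amount in rows:
        by_month[date[:7]] += amount  # "YYYY/MM" prefix is the month key
        by_merchant[merchant][0] += amount
        by_merchant[merchant][1] += 1
    lines = [f"transactions: {len(rows)}", f"total: {sum(by_month.values())}"]
    lines += [f"month {month}: {total}" for month, total in sorted(by_month.items())]
    ranked = sorted(by_merchant.items(), key=lambda kv: kv[1][0], reverse=True)
    lines += [f"merchant {name}: {total} ({count} tx)" for name, (total, count) in ranked]
    return "\n".join(lines)


summary = build_summary(rows)
print(summary)
```

&lt;p&gt;The model then receives only &lt;code&gt;summary&lt;/code&gt;, so every figure it quotes was computed by code, not by the model.&lt;/p&gt;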

&lt;h3&gt;
  
  
  7B + pre-aggregated input (50 seconds)
&lt;/h3&gt;

&lt;p&gt;Numbers are &lt;strong&gt;fully accurate&lt;/strong&gt; now.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;What 7B said&lt;/th&gt;
&lt;th&gt;Real data&lt;/th&gt;
&lt;th&gt;Match?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amazon total&lt;/td&gt;
&lt;td&gt;¥693,663&lt;/td&gt;
&lt;td&gt;¥693,663&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI/dev tools&lt;/td&gt;
&lt;td&gt;¥176,850&lt;/td&gt;
&lt;td&gt;¥176,850&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly max&lt;/td&gt;
&lt;td&gt;¥493,072&lt;/td&gt;
&lt;td&gt;¥493,072&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly min&lt;/td&gt;
&lt;td&gt;¥69,961&lt;/td&gt;
&lt;td&gt;¥69,961&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Because 7B now quotes straight from the pre-aggregated numbers, the hallucinations vanished.&lt;/p&gt;

&lt;p&gt;And 7B did this in 50 seconds — better quality than the 72B + raw CSV at 12 minutes. Quietly remarkable.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Before (raw CSV)&lt;/th&gt;
&lt;th&gt;After (aggregated)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time&lt;/td&gt;
&lt;td&gt;75s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Numbers&lt;/td&gt;
&lt;td&gt;wildly off&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;exact&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verdict&lt;/td&gt;
&lt;td&gt;not usable as-is&lt;/td&gt;
&lt;td&gt;quote directly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  72B + pre-aggregated input (12m 13s)
&lt;/h3&gt;

&lt;p&gt;72B's numbers also match exactly (well, since they're being quoted from pre-aggregated data, that's expected). The proposal quality was the strongest of the four patterns:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reduce Amazon dependency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: online shopping (Amazon family) is 25.1% of total (¥693,663).&lt;/li&gt;
&lt;li&gt;Suggestion: stick to essentials only, regular review, avoid impulse buys.&lt;/li&gt;
&lt;li&gt;Expected savings: ¥57,805/month average (25% reduction) → ¥693,660/year&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;...wait, hold on. Annual Amazon spend was ¥693,663. The "savings" 72B suggests is ¥693,660. That's basically the &lt;strong&gt;same number&lt;/strong&gt;. So the proposal is effectively "stop buying on Amazon entirely (100%)" — definitely not 25%. Apparently 72B's percentage arithmetic isn't bulletproof either.&lt;/p&gt;
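
&lt;p&gt;The mismatch is easy to confirm with a few lines of arithmetic, using only the two figures quoted above. A genuine 25% cut would come to roughly ¥173,400 a year, or about ¥14,450 a month.&lt;/p&gt;

```python
AMAZON_ANNUAL = 693_663         # annual Amazon spend from the aggregated data
QUOTED_MONTHLY_SAVING = 57_805  # monthly saving quoted in 72B's proposal

# What the proposal actually implies.
quoted_annual_saving = QUOTED_MONTHLY_SAVING * 12
implied_cut_percent = round(100 * quoted_annual_saving / AMAZON_ANNUAL)

# What a real 25% reduction would be (whole yen, rounded down).
real_25_percent_annual = AMAZON_ANNUAL * 25 // 100
real_25_percent_monthly = real_25_percent_annual // 12

print("quoted:", quoted_annual_saving, "per year, i.e.", implied_cut_percent, "percent")
print("real 25 percent:", real_25_percent_annual, "per year,", real_25_percent_monthly, "per month")
```

&lt;p&gt;In other words, the quoted monthly "saving" is simply the annual spend divided by 12.&lt;/p&gt;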

&lt;p&gt;That aside, the &lt;strong&gt;lifestyle hypothesis&lt;/strong&gt; section was kind of striking. Here's what 72B observed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Heavy reliance on apps and subscriptions&lt;/strong&gt;: "App/subscription" category is 10.5% of total&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frequent international travel&lt;/strong&gt;: "Travel/airline" is 15.1%, with notable overseas charges&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frequent online shopping&lt;/strong&gt;: "Online (Amazon)" is 25.1% of total&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's just one card's data, so this isn't a complete picture — but if I fed an AI my full household financials, &lt;strong&gt;the analysis and advice would probably go a lot deeper&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary: 4 patterns
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Numerical accuracy&lt;/th&gt;
&lt;th&gt;Proposal quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;Raw CSV&lt;/td&gt;
&lt;td&gt;75s&lt;/td&gt;
&lt;td&gt;❌ Numbers way off&lt;/td&gt;
&lt;td&gt;△&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;72B&lt;/td&gt;
&lt;td&gt;Raw CSV&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;12m 9s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;△ Misread monthly trend&lt;/td&gt;
&lt;td&gt;○&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;Aggregated&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Exact&lt;/td&gt;
&lt;td&gt;○ Some repetition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;72B&lt;/td&gt;
&lt;td&gt;Aggregated&lt;/td&gt;
&lt;td&gt;12m 13s&lt;/td&gt;
&lt;td&gt;✅ Exact&lt;/td&gt;
&lt;td&gt;◎ Best (mind the % math)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Quietly notable: &lt;strong&gt;72B takes ~12 minutes regardless of input size&lt;/strong&gt; (shrinking the prompt by ~90% changed wall-clock time by only a few seconds). Output generation appears to be the bottleneck, which strengthens the case for "small model + pre-aggregate" as the cost-effective default.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Cross-check: the actual graphs
&lt;/h2&gt;

&lt;p&gt;Before trusting any of the AI output, let me put the real numbers on charts using the spreadsheet tool (pandas).&lt;/p&gt;

&lt;h3&gt;
  
  
  Monthly spending
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wvfzqh0st6qv1323fgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wvfzqh0st6qv1323fgr.png" alt="Monthly spending" width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Average ¥230,130/month, but the range is ¥69,961 (lowest) to ¥493,072 (highest) — about a 7× spread. The 72B's "gradually increasing" claim was a bit off the mark; the reality is bouncy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Category share
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wepa7rudozlx1igsp4o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wepa7rudozlx1igsp4o.png" alt="Categories" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"Other" being 32% is because my categorization rule is sloppy. I just wrote a simple "if the merchant name contains keyword X, bucket Y" rule, and lots of merchants didn't match any keyword and ended up in "Other." &lt;strong&gt;Reading meaning from a merchant name&lt;/strong&gt; is exactly the kind of thing AI is good at, so next time I'll let the AI do the categorization itself.&lt;/p&gt;
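&lt;p&gt;For reference, the keyword rule described above might look roughly like this. It is a minimal sketch, not the actual rule set: the &lt;code&gt;RULES&lt;/code&gt; table, the bucket names, and both function names are made up for illustration.&lt;/p&gt;

```python
# Hypothetical keyword table; the article's real rules are not shown.
RULES = {
    "Shopping": ["amazon", "rakuten"],
    "Subscriptions": ["netflix", "spotify"],
}


def categorize(merchant, rules=RULES):
    """Return the first bucket whose keyword appears in the merchant name."""
    name = merchant.lower()
    for bucket, keywords in rules.items():
        if any(keyword in name for keyword in keywords):
            return bucket
    return "Other"  # nothing matched, which is how "Other" grows


def other_share(merchants, rules=RULES):
    """Fraction of merchants that fall through to "Other"."""
    if not merchants:
        return 0.0
    misses = sum(1 for m in merchants if categorize(m, rules) == "Other")
    return misses / len(merchants)
```

&lt;p&gt;Any merchant whose name contains none of the listed keywords falls straight through to "Other", so the share of "Other" mostly measures how complete the keyword table is.&lt;/p&gt;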

&lt;h3&gt;
  
  
  Top 15 merchants
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqynqrvxdlol28s3mr63m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqynqrvxdlol28s3mr63m.png" alt="Top merchants" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon at ¥421,978 (105 tx) is far and away #1. Amazon really is too convenient...&lt;/p&gt;

&lt;h3&gt;
  
  
  Weekday rhythm
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmwt0stf6hralf5vl8kp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmwt0stf6hralf5vl8kp.png" alt="Weekday pattern" width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tuesday alone is ¥692,549 — way above the rest. Probably because that's when most of the subscription auto-charges land.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Today's takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Separate "aggregation" from "interpretation"
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AI is bad at&lt;/th&gt;
&lt;th&gt;AI is good at&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-row sum/average (numbers go wildly off)&lt;/td&gt;
&lt;td&gt;Categorization (interpreting fuzzy meaning)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Percentage math (saw "25% off → 100% off")&lt;/td&gt;
&lt;td&gt;Pattern recognition / hypothesis generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distributed aggregation like monthly totals&lt;/td&gt;
&lt;td&gt;Narrative interpretation, savings proposals&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;→ &lt;strong&gt;Aggregation is the spreadsheet tool's job; interpretation is the AI's.&lt;/strong&gt; Split the work that way and things get both faster and more accurate. "Data prep matters before analysis": yeah, that old saying really is true. Note to self.&lt;/p&gt;
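&lt;p&gt;As a rough illustration of that split, the sums can be computed in code and handed to the model as a short text block. This sketch uses only the standard library rather than pandas, and the function names, the (date, amount) row shape, and the wording of the summary are all assumptions, not what the experiment actually ran.&lt;/p&gt;

```python
from collections import defaultdict


def monthly_totals(rows):
    """Sum amounts per YYYY/MM from (date, amount) pairs like ("2025/04/03", 1200)."""
    totals = defaultdict(int)
    for date, amount in rows:
        totals[date[:7]] += amount  # "2025/04/03" becomes the key "2025/04"
    return dict(sorted(totals.items()))


def summary_for_prompt(rows):
    """Render exact totals as short text so the model only has to interpret them."""
    totals = monthly_totals(rows)
    lines = [f"{month}: {total} yen" for month, total in totals.items()]
    return "Monthly totals (computed in code, treat as exact):\n" + "\n".join(lines)
```

&lt;p&gt;The model then sees a dozen exact lines instead of hundreds of raw rows, and never has to add anything up itself.&lt;/p&gt;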

&lt;h3&gt;
  
  
  Sometimes input quality beats raw size
&lt;/h3&gt;

&lt;p&gt;"7B + pre-aggregated input in 50 seconds" outperformed "72B + raw CSV in 12 minutes". &lt;strong&gt;Sometimes you don't need a bigger model — you need cleaner input.&lt;/strong&gt; Felt that one today.&lt;/p&gt;

&lt;h3&gt;
  
  
  The local-LLM angle
&lt;/h3&gt;

&lt;p&gt;Feeding 12 months of raw credit card data to an AI without a single byte going to the cloud was surprisingly stress-free. This is one of the spots where local LLMs really shine. Got personal info, or anything you'd be uncomfortable sending to the cloud? This is the place for it.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Tech details (Claude explains)
&lt;/h2&gt;

&lt;p&gt;The technical bits, written up by my AI pair.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;SCP transfer to the DGX (mDNS, no IP needed)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;NVIDIA Sync auto-configures a Host alias in &lt;code&gt;~/AppData/Local/NVIDIA Corporation/Sync/config/ssh_config&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host spark-XXXX.local
  Hostname spark-XXXX.local
  User [user]
  Port 22
  IdentityFile "...\\nvsync.key"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Which means I can SSH/SCP using &lt;code&gt;spark-XXXX.local&lt;/code&gt; without ever looking up an IP. The &lt;code&gt;.local&lt;/code&gt; suffix uses mDNS (Multicast DNS) for hostname resolution within the LAN.&lt;/p&gt;

&lt;p&gt;Transfer command (one line, from PowerShell on the Windows side):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;scp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-r&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C:\Users\[user]\Desktop\docs\dgx\csv"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;spark-XXXX.local:/home/&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="nx"&gt;/personal/dgx-100-experiments/private-data/credit-card-csv&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;ol start="2"&gt;
&lt;li&gt;Ollama install + the sudo-TTY catch + GPU detection log&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ollama install:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;When I ran this through Claude Code's Bash tool, it failed at the sudo password prompt, because sudo needs an interactive TTY there:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo: a terminal is required to read the password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Reopened a separate SSH session, ran the same command manually, and it went through.&lt;/p&gt;

&lt;p&gt;Once installed, systemd auto-starts the service. The GPU detection log via &lt;code&gt;journalctl -u ollama&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;inference compute id=GPU-986c194b... name=CUDA0 description="NVIDIA GB10"
total="121.7 GiB" available="79.0 GiB"
default_num_ctx=262144
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;VRAM (DGX Spark unified memory): &lt;strong&gt;121.7 GiB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Default context: &lt;strong&gt;262,144 tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compared with a typical RTX 4090 (24 GB VRAM, 8K–32K default context), the gap is significant.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Loading both models simultaneously&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen2.5:7b   &lt;span class="c"&gt;# 4.7 GB&lt;/span&gt;
ollama pull qwen2.5:72b  &lt;span class="c"&gt;# 47 GB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;After loading both, &lt;code&gt;ollama ps&lt;/code&gt; shows:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME           SIZE      PROCESSOR    CONTEXT    
qwen2.5:72b    61 GB     100% GPU     32768
qwen2.5:7b     8.2 GB    100% GPU     32768
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Total ~69 GB used out of 79 GB available. Both models stay resident, so switching between them is instant.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Custom CSV parser for the credit card data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Three quirks needed handling: CP932 encoding, no quotes (commas in some merchant names break parsing), and a trailing summary row in each file.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# skip blank/summary rows
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;merchant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;merchant&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cp932&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# skip header (cardholder metadata)
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLUMNS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;利用日&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;利用日&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y/%m/%d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;利用金額&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;利用金額&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
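&lt;p&gt;The merchant-rejoining step can also be written by splitting from the right, since the last five columns never contain commas. This is an alternative sketch, not the parser used above: &lt;code&gt;parse_line_rsplit&lt;/code&gt; and its &lt;code&gt;trailing&lt;/code&gt; parameter are hypothetical names, and the sample row in the usage lines is invented.&lt;/p&gt;

```python
def parse_line_rsplit(line, trailing=5):
    """Split one unquoted CSV row whose merchant name may contain commas."""
    # Peel the fixed trailing columns off the right-hand side first.
    parts = line.rstrip("\r\n").rsplit(",", trailing)
    if len(parts) != trailing + 1:
        return None  # too few columns: blank or summary row
    # What is left is "date,merchant"; only the first comma is a separator.
    date, sep, merchant = parts[0].partition(",")
    if not sep or not date:
        return None  # no merchant column, or an empty date field
    return [date, merchant] + parts[1:]


# Invented sample row: the merchant "ACME, INC." contains a comma.
print(parse_line_rsplit("2025/04/03,ACME, INC.,1,1,1200,0,note\n"))
print(parse_line_rsplit("\n"))
```

&lt;p&gt;The first call keeps "ACME, INC." together as one field and returns seven columns; the second returns &lt;code&gt;None&lt;/code&gt; for the blank line.&lt;/p&gt;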
&lt;ol start="5"&gt;
&lt;li&gt;Japanese fonts in matplotlib&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;japanize-matplotlib&lt;/code&gt; doesn't work on Python 3.12 — it imports &lt;code&gt;distutils&lt;/code&gt;, which was removed from the standard library.&lt;/p&gt;

&lt;p&gt;The modern replacement is &lt;code&gt;matplotlib-fontja&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;matplotlib-fontja
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib_fontja&lt;/span&gt;  &lt;span class="c1"&gt;# noqa: F401  ← just importing it sets up IPAexGothic
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;ol start="6"&gt;
&lt;li&gt;Calling Ollama from Python&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The official &lt;code&gt;ollama&lt;/code&gt; Python client is straightforward:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5:72b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Streaming makes long generation easier to watch unfold.&lt;/p&gt;
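&lt;p&gt;If you also want the full reply and a wall-clock time, like the timings in the results table, the chunks can be collected as they arrive. This is only a sketch: &lt;code&gt;collect_stream&lt;/code&gt; is a hypothetical helper, the article does not show how its timings were taken, and the stand-in chunks below just mimic the &lt;code&gt;chunk["message"]["content"]&lt;/code&gt; shape used above so the example runs without Ollama.&lt;/p&gt;

```python
import time


def collect_stream(chunks):
    """Join streamed chat chunks into one string and time the generation."""
    started = time.perf_counter()
    pieces = []
    for chunk in chunks:
        pieces.append(chunk["message"]["content"])
    elapsed = time.perf_counter() - started
    return "".join(pieces), elapsed


# Stand-in chunks shaped like the ones printed above.
fake_stream = [{"message": {"content": "Hello"}}, {"message": {"content": ", world"}}]
text, seconds = collect_stream(fake_stream)
print(text)
```

&lt;p&gt;Passing the real &lt;code&gt;stream&lt;/code&gt; object in place of &lt;code&gt;fake_stream&lt;/code&gt; should work the same way, because the helper only iterates and indexes each chunk.&lt;/p&gt;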




&lt;h2&gt;
  
  
  Tomorrow: Day 4
&lt;/h2&gt;

&lt;p&gt;Day 4 plan: &lt;strong&gt;let a local AI sort 20,000 iPhone photos&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The actual goal is to have a local image-recognition model (CLIP family?) clean up my photo library so I can stop paying iCloud for storage upgrades...!&lt;/p&gt;




&lt;p&gt;#100ExperimentsWithDGX #LocalLLM #Ollama&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>ollama</category>
    </item>
    <item>
      <title>[Day 2] I Trained an AI on 22 Photos of My Cat — Now It Draws Her in Any Scene</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Tue, 05 May 2026 00:06:00 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-2-i-trained-an-ai-on-22-photos-of-my-cat-now-it-draws-her-in-any-scene-3a92</link>
      <guid>https://dev.to/peppercorn_llm/day-2-i-trained-an-ai-on-22-photos-of-my-cat-now-it-draws-her-in-any-scene-3a92</guid>
      <description>&lt;h1&gt;
  
  
  [Day 2] I Trained an AI on 22 Photos of My Cat — Now It Draws Her in Any Scene
&lt;/h1&gt;

&lt;h2&gt;
  
  
  So, yesterday I generated "some cat"
&lt;/h2&gt;

&lt;p&gt;Day 1 ended with "I made my DGX draw a cat" — but the cat that came out was just "a cat from somewhere". Today, the goal is to teach the AI about my actual cat (who's currently being looked after at my parents' place back in Japan).&lt;/p&gt;

&lt;p&gt;This is what people call LoRA training.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;LoRA: A technique that teaches an AI model "specific features" using a small set of images, without touching the base model itself. Apparently. The output is a small "diff" file (tens of MB).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is experiment #2.&lt;/p&gt;




&lt;h2&gt;
  
  
  The training data
&lt;/h2&gt;

&lt;p&gt;Source material: 22 photos of my cat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9tmru213ymne73f61pv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9tmru213ymne73f61pv.jpg" alt="Training photo collage" width="800" height="1058"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I picked a mix of angles — front-facing, full body, sleepy poses, varying lighting — to give the AI a fair shot at recognizing the cat's defining features (tuxedo black-and-white pattern, white socks, the black smudge on the nose).&lt;/p&gt;




&lt;h2&gt;
  
  
  Training pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Pre-processing
&lt;/h3&gt;

&lt;p&gt;iPhone HEIC files don't work directly with most AI tools, so the first step is converting them to JPG. 10 of the 22 were HEIC.&lt;/p&gt;

&lt;p&gt;Then resize to 512px on the short side for training. &lt;strong&gt;This is where I tripped over a sneaky bug&lt;/strong&gt; — details in the collapsible section below.&lt;/p&gt;
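&lt;p&gt;The "512px on the short side" arithmetic is simple enough to sketch. The function name is hypothetical, and the 3024×4032 input is an assumed camera resolution rather than a figure from this post.&lt;/p&gt;

```python
def short_side_size(width, height, target=512):
    """Scale (width, height) so the shorter side equals target, keeping aspect ratio."""
    short = min(width, height)
    return round(width * target / short), round(height * target / short)


print(short_side_size(3024, 4032))  # portrait photo at a 3:4 aspect ratio
print(short_side_size(4032, 3024))  # the same photo in landscape
```

&lt;p&gt;These come out as (512, 683) and (683, 512), which matches the dimensions of the sample training images shown in the captions section.&lt;/p&gt;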

&lt;h3&gt;
  
  
  2. Captions
&lt;/h3&gt;

&lt;p&gt;Every image gets a text description like "ohwx cat, sitting on a wooden floor, indoor, soft lighting". The four-letter &lt;code&gt;ohwx&lt;/code&gt; is a meaningless token that becomes the trigger word for "my specific cat" after training.&lt;/p&gt;
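&lt;p&gt;Since the trigger word only works if every caption carries it, a quick check before training can be worthwhile. This is a minimal sketch: the helper name and the file names are hypothetical, and the second caption is deliberately written without the trigger to show a miss.&lt;/p&gt;

```python
TRIGGER = "ohwx cat"  # trigger phrase used in the article's captions


def missing_trigger(captions, trigger=TRIGGER):
    """Return the image names whose caption does not start with the trigger phrase."""
    return sorted(
        name for name, text in captions.items()
        if not text.strip().startswith(trigger)
    )


# Hypothetical file names; the second caption omits the trigger on purpose.
drafts = {
    "cat_01.jpg": "ohwx cat, sitting on a wooden floor, indoor, soft lighting",
    "cat_02.jpg": "a cat walking on a metal kitchen counter, side profile",
}
print(missing_trigger(drafts))
```

&lt;p&gt;Here only &lt;code&gt;cat_02.jpg&lt;/code&gt; is reported, so that caption would need the trigger added before training.&lt;/p&gt;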

&lt;p&gt;Drafting 22 captions by hand would be tedious — but Claude can read images directly, so it drafted them while I just reviewed. The accuracy was uncanny. For example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmzgmz033je98xlpyilb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmzgmz033je98xlpyilb.jpg" alt="Cat on a kitchen counter" width="512" height="683"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ohwx cat, walking on a metal kitchen counter, side profile, indoor kitchen with spice bottles and shelves in the background&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nm6622l9qfb0py7z2ag.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nm6622l9qfb0py7z2ag.jpg" alt="Mid-yawn cat" width="512" height="683"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ohwx cat, in a loaf pose on a gray carpet, mouth open showing teeth, mid-yawn, indoor with shelves and warm lights in the background&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisxhfub5fudklbriga1p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisxhfub5fudklbriga1p.jpg" alt="Cat by a window" width="683" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ohwx cat, sitting on a wooden floor by a balcony window, viewed from behind, sharp sunlight casting long shadows, indoor&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;SUGOI.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Kohya_ss training
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Kohya_ss&lt;/code&gt; is the de-facto LoRA training tool. Set up a TOML config, run one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;accelerate launch train_network.py &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--config_file&lt;/span&gt; configs/train.toml &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--dataset_config&lt;/span&gt; configs/dataset.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Training logs scroll by, and the loss value gradually drops. Lower loss = the model is learning, apparently.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Done
&lt;/h3&gt;

&lt;p&gt;1100 steps in 13 minutes 3 seconds on the DGX Spark.&lt;/p&gt;




&lt;h2&gt;
  
  
  Result 1: just typing "ohwx cat" gives me my cat
&lt;/h2&gt;

&lt;p&gt;The first thing I tried was a "without LoRA vs with LoRA" comparison. Same prompt — "ohwx cat as a chef in a kitchen, ..." — first without the LoRA, then with it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzfjjk84dv92xns5nt3v.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzfjjk84dv92xns5nt3v.jpg" alt="Without (left) vs With (right) LoRA" width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Left: no LoRA. Right: with LoRA.&lt;/p&gt;

&lt;p&gt;Without LoRA, &lt;code&gt;ohwx&lt;/code&gt; is gibberish to the model, so it's ignored and only "a chef in a kitchen" carries weight. Result: a human chef. A nice woman cooking in a pink kitchen.&lt;/p&gt;

&lt;p&gt;With LoRA, &lt;code&gt;ohwx&lt;/code&gt; becomes a real token that points at my cat. Same prompt, but now my cat is the chef.&lt;/p&gt;

&lt;p&gt;This was the moment that hit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Result 2: novel scene reproduction
&lt;/h2&gt;

&lt;p&gt;The training set has no photo of the cat sitting on a wooden floor in this exact composition. So I tried it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngsfgj9etl9pv39axg2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngsfgj9etl9pv39axg2z.png" alt="My cat sitting on a wooden floor" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;White socks: present. Nose smudge: present.&lt;/p&gt;




&lt;h2&gt;
  
  
  My cat, in places she's never been
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;ohwx cat&lt;/code&gt; in various scenes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sunny balcony
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbhlskto67vvbgmx2hdm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbhlskto67vvbgmx2hdm.png" alt="Cat on a sunny balcony" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cozy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chef (reprise)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2awj3mdsj8u788bedxhl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2awj3mdsj8u788bedxhl.png" alt="Cat as a chef" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chef hat fits suspiciously well. Cooking ability unverified.&lt;/p&gt;

&lt;h3&gt;
  
  
  Autumn forest
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1l3r9hwnvh3kqk8qc1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1l3r9hwnvh3kqk8qc1n.png" alt="Cat in an autumn forest" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A painterly take.&lt;/p&gt;

&lt;h3&gt;
  
  
  Astronaut
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frps74rdecajbrews1tz4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frps74rdecajbrews1tz4.png" alt="Cat as an astronaut" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A doppelgänger shows up in the helmet glass, but it still reads as sci-fi.&lt;/p&gt;




&lt;h2&gt;
  
  
  Today's takeaway
&lt;/h2&gt;

&lt;p&gt;"Build your own AI from your own data" turned out to be way more accessible than I'd assumed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech details (Claude explains)
&lt;/h2&gt;

&lt;p&gt;The technical bits, written up by my AI pair.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;HEIC → JPG conversion and the EXIF orientation trap&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Reading iPhone HEIC files in Python is straightforward with &lt;code&gt;pillow-heif&lt;/code&gt;. JPG conversion is a few lines:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ImageOps&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pillow_heif&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;register_heif_opener&lt;/span&gt;
&lt;span class="nf"&gt;register_heif_opener&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IMG_1234.HEIC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;oriented&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ImageOps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exif_transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ← critical line
&lt;/span&gt;    &lt;span class="n"&gt;rgb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;oriented&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IMG_1234.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  What I tripped on
&lt;/h3&gt;

&lt;p&gt;My first version skipped &lt;code&gt;ImageOps.exif_transpose()&lt;/code&gt;. Result: 8 of 22 photos came out rotated 90° in the resized output.&lt;/p&gt;

&lt;p&gt;iPhones store portrait shots with the pixels in landscape orientation, plus an EXIF Orientation tag that says "rotate 90° on display". Pillow's &lt;code&gt;Image.open()&lt;/code&gt; does not apply that tag on its own, so you have to call &lt;code&gt;exif_transpose()&lt;/code&gt; explicitly.&lt;/p&gt;

&lt;p&gt;Caught it before training started. If I hadn't, the LoRA would have learned "sideways cat" and generation would be weird.&lt;/p&gt;
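&lt;p&gt;To see the trap in isolation, here is a minimal sketch that needs no iPhone photo. It builds a tiny in-memory JPEG whose pixels are stored landscape with Orientation set to 6, then compares the size before and after &lt;code&gt;exif_transpose()&lt;/code&gt;. The helper name &lt;code&gt;make_sideways_jpeg&lt;/code&gt; and the 40×20 test image are made up for illustration; they are not part of my actual pipeline.&lt;/p&gt;

```python
from io import BytesIO

from PIL import Image, ImageOps

ORIENTATION_TAG = 0x0112  # standard EXIF tag ID for Orientation


def make_sideways_jpeg():
    """Hypothetical helper: a 40x20 JPEG tagged 'rotate 90 degrees on display'."""
    img = Image.new("RGB", (40, 20), "white")
    exif = Image.Exif()
    exif[ORIENTATION_TAG] = 6  # 6 = rotate 90 degrees clockwise when displaying
    buf = BytesIO()
    img.save(buf, format="JPEG", exif=exif)
    buf.seek(0)
    return buf


raw = Image.open(make_sideways_jpeg())   # pixels as stored, tag not applied
fixed = ImageOps.exif_transpose(raw)     # tag applied, pixels rotated upright

print("as stored:", raw.size)
print("after exif_transpose:", fixed.size)
```

&lt;p&gt;If the two sizes differ, the file was relying on the tag, and any resize done without &lt;code&gt;exif_transpose()&lt;/code&gt; would bake the sideways pixels into the training set.&lt;/p&gt;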

&lt;ol start="2"&gt;
&lt;li&gt;Kohya_ss setup on ARM64 (DGX Spark)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are two repos commonly referred to as "Kohya_ss":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;bmaltais/kohya_ss&lt;/code&gt; — GUI wrapper, xformers dependency (clashes with ARM64)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kohya-ss/sd-scripts&lt;/code&gt; — the actual training engine, CLI/TOML driven&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DGX Spark is ARM64, so I went with the latter:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 https://github.com/kohya-ss/sd-scripts.git ~/Kohya_ss
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/Kohya_ss
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv &amp;amp;&amp;amp; &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip
pip &lt;span class="nb"&gt;install &lt;/span&gt;torch torchvision &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cu128
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;DGX Spark uses CUDA 12.8 + ARM64 (sbsa), so the PyTorch &lt;code&gt;cu128&lt;/code&gt; channel works directly. Surprisingly painless.&lt;/p&gt;
&lt;h3&gt;
  
  
  Training config (TOML)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# train.toml (excerpt)&lt;/span&gt;
&lt;span class="py"&gt;pretrained_model_name_or_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;".../Realistic_Vision_V6.0_NV_B1.safetensors"&lt;/span&gt;
&lt;span class="py"&gt;vae&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;".../vae-ft-mse-840000-ema-pruned.safetensors"&lt;/span&gt;

&lt;span class="py"&gt;network_module&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"networks.lora"&lt;/span&gt;
&lt;span class="py"&gt;network_dim&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;
&lt;span class="py"&gt;network_alpha&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;

&lt;span class="py"&gt;optimizer_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"AdamW8bit"&lt;/span&gt;
&lt;span class="py"&gt;unet_lr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1e-4&lt;/span&gt;
&lt;span class="py"&gt;text_encoder_lr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;5e-5&lt;/span&gt;
&lt;span class="py"&gt;lr_scheduler&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"cosine_with_restarts"&lt;/span&gt;

&lt;span class="py"&gt;max_train_epochs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="py"&gt;save_every_n_epochs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="py"&gt;mixed_precision&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"bf16"&lt;/span&gt;
&lt;span class="py"&gt;sdpa&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;cache_latents&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# dataset.toml&lt;/span&gt;
&lt;span class="nn"&gt;[general]&lt;/span&gt;
&lt;span class="py"&gt;shuffle_caption&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="py"&gt;caption_extension&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;".txt"&lt;/span&gt;
&lt;span class="py"&gt;keep_tokens&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="nn"&gt;[[datasets]]&lt;/span&gt;
&lt;span class="py"&gt;resolution&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;
&lt;span class="py"&gt;batch_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="py"&gt;enable_bucket&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="nn"&gt;[[datasets.subsets]]&lt;/span&gt;
  &lt;span class="py"&gt;image_dir&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/path/to/cat-photos-512"&lt;/span&gt;
  &lt;span class="py"&gt;num_repeats&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;22 photos × 10 repeats × 10 epochs ÷ batch 2 = 1100 steps. Training took 13 minutes.&lt;/p&gt;
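&lt;p&gt;The step count can be double-checked with a few lines of Python. This is a sketch of the arithmetic only: &lt;code&gt;lora_steps&lt;/code&gt; is a hypothetical helper, not part of sd-scripts, and with &lt;code&gt;enable_bucket = true&lt;/code&gt; batches are formed per resolution bucket, so the count the trainer reports may differ slightly.&lt;/p&gt;

```python
import math


def lora_steps(num_images, num_repeats, epochs, batch_size):
    """Estimate optimizer steps for a single-subset LoRA run."""
    samples_per_epoch = num_images * num_repeats
    # A final partial batch still costs one step, hence the ceiling.
    steps_per_epoch = math.ceil(samples_per_epoch / batch_size)
    return steps_per_epoch * epochs


print(lora_steps(num_images=22, num_repeats=10, epochs=10, batch_size=2))
```

&lt;p&gt;Changing any one knob (say, 5 repeats instead of 10) shows right away how much training you are signing up for before you start a run.&lt;/p&gt;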

&lt;p&gt;Base model: Realistic Vision V6.0 B1 noVAE (a photo-realistic SD 1.5 derivative). External VAE: sd-vae-ft-mse-original. The combination is good at fur detail.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hitting the ComfyUI HTTP API for batch generation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Clicking through the GUI for one image at a time gets old fast. ComfyUI exposes an HTTP API that's easy to drive from Python — &lt;code&gt;urllib.request&lt;/code&gt; from the standard library is enough (no extra deps).&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;COMFY_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:8188&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;queue_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COMFY_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wait_for_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COMFY_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/history/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prompt_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
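        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Fail loudly instead of silently returning None on timeout
&lt;/span&gt;    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; not finished after &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;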
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;The workflow is ComfyUI's API format (a dict of node IDs with their connections). To use a LoRA, insert a &lt;code&gt;LoraLoader&lt;/code&gt; node between the checkpoint loader and KSampler.&lt;/p&gt;
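&lt;p&gt;For reference, a workflow in that format can look roughly like the sketch below. It uses only stock ComfyUI nodes. The file names, sampler settings, negative prompt and the &lt;code&gt;build_workflow&lt;/code&gt; helper are placeholders I made up for illustration, and to keep it short it takes the VAE from the checkpoint rather than loading an external one.&lt;/p&gt;

```python
import json


def build_workflow(prompt, seed, lora_strength=0.8):
    """Return a text-to-image graph in ComfyUI API format with one LoRA.

    Each key is a node ID. A link is written as [source_node_id, output_index].
    """
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": "base_model.safetensors"}},  # hypothetical file
        # LoraLoader sits between the checkpoint and everything downstream:
        # it takes MODEL (output 0) and CLIP (output 1) and returns patched copies.
        "2": {"class_type": "LoraLoader",
              "inputs": {"model": ["1", 0], "clip": ["1", 1],
                         "lora_name": "my_cat_lora.safetensors",  # hypothetical file
                         "strength_model": lora_strength,
                         "strength_clip": lora_strength}},
        "3": {"class_type": "CLIPTextEncode",
              "inputs": {"text": prompt, "clip": ["2", 1]}},
        "4": {"class_type": "CLIPTextEncode",
              "inputs": {"text": "blurry, low quality", "clip": ["2", 1]}},
        "5": {"class_type": "EmptyLatentImage",
              "inputs": {"width": 512, "height": 768, "batch_size": 1}},
        "6": {"class_type": "KSampler",
              "inputs": {"model": ["2", 0], "positive": ["3", 0],
                         "negative": ["4", 0], "latent_image": ["5", 0],
                         "seed": seed, "steps": 25, "cfg": 7.0,
                         "sampler_name": "euler", "scheduler": "normal",
                         "denoise": 1.0}},
        "7": {"class_type": "VAEDecode",
              "inputs": {"samples": ["6", 0], "vae": ["1", 2]}},
        "8": {"class_type": "SaveImage",
              "inputs": {"images": ["7", 0], "filename_prefix": "ohwx_cat"}},
    }


workflow = build_workflow("photo of ohwx cat on a sunny balcony", seed=42)
print(json.dumps(workflow["2"], indent=2))
```

&lt;p&gt;The detail that matters is that both the KSampler and the two text encoders read from node 2, not node 1. If they stay wired to the checkpoint loader, the LoRA is loaded but never used.&lt;/p&gt;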

&lt;p&gt;DGX Spark generates one 512×768 image in about 3 seconds. With seed/strength/prompt parametrized in a script, all 12 grid images came out in under a minute.&lt;/p&gt;
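&lt;p&gt;One simple way to parametrize a run like this is a plain product over the values you want to sweep. The seeds, strengths and prompt below are placeholders, not the values I actually used; each job would then be turned into a workflow dict and sent with &lt;code&gt;queue_prompt()&lt;/code&gt;.&lt;/p&gt;

```python
import itertools

SEEDS = [1, 2, 3]                 # hypothetical values
STRENGTHS = [0.6, 0.8, 1.0, 1.2]  # hypothetical LoRA strengths
PROMPT = "photo of ohwx cat wearing a chef hat"  # hypothetical prompt

# One job per (seed, strength) pair: 3 seeds x 4 strengths = 12 images.
jobs = [
    {"prompt": PROMPT, "seed": seed, "lora_strength": strength}
    for seed, strength in itertools.product(SEEDS, STRENGTHS)
]

for job in jobs:
    print(job["seed"], job["lora_strength"])
```

&lt;p&gt;At about 3 seconds per image, 12 jobs is roughly 36 seconds of generation, which fits the under-a-minute figure.&lt;/p&gt;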




&lt;h2&gt;
  
  
  Tomorrow: Day 3
&lt;/h2&gt;

&lt;p&gt;Day 3 plan: have a local AI analyze my credit card history.&lt;/p&gt;

&lt;p&gt;The kind of data I'd rather not send to a cloud AI, but absolutely want to understand. Quintessential local-AI territory.&lt;/p&gt;




&lt;p&gt;#100ExperimentsWithDGX #LocalLLM&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>lora</category>
    </item>
    <item>
      <title>[Day 1] DGX Spark Came Home — I Made It Draw a Cat</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Mon, 04 May 2026 03:20:48 +0000</pubDate>
      <link>https://dev.to/peppercorn_llm/day-1-dgx-spark-came-home-i-made-it-draw-a-cat-30f7</link>
      <guid>https://dev.to/peppercorn_llm/day-1-dgx-spark-came-home-i-made-it-draw-a-cat-30f7</guid>
      <description>&lt;h1&gt;
  
  
  [Day 1] DGX Spark Came Home — I Made It Draw a Cat
&lt;/h1&gt;

&lt;h2&gt;
  
  
  So... what is "local LLM" again?
&lt;/h2&gt;

&lt;p&gt;Honestly, I'm still figuring out what "local LLM" even means. But somehow, through a series of decisions I won't fully justify here, I ended up buying an NVIDIA DGX Spark — and now it's sitting in my house.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;DGX Spark: NVIDIA's "supercomputer for the home" — a small but seriously expensive box with the latest-gen AI chip inside. Apparently.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What I really want to figure out is: when should I use local AI vs. cloud AI? Reading articles about it doesn't seem to help, so I'm going full hands-on. Goal: 100 experiments, one per day-ish, until I have an evidence-based answer.&lt;/p&gt;

&lt;p&gt;This is experiment #1.&lt;/p&gt;




&lt;h2&gt;
  
  
  First, the hardware
&lt;/h2&gt;

&lt;p&gt;So this is what showed up at my door — solidly packed in a sturdy cardboard box.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetwrk2l7jv4q4qg6387t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetwrk2l7jv4q4qg6387t.jpg" alt="DGX Spark box" width="800" height="1067"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I opened it, I was surprised at how small it actually is. "This is the AI machine?" kind of small.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqq8k08mii298z3qtorrr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqq8k08mii298z3qtorrr.jpg" alt="DGX Spark hardware (mesh sides)" width="800" height="1067"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Boot up → Initial OS setup
&lt;/h2&gt;

&lt;p&gt;Power on, and an Ubuntu-based DGX OS 7.5.0 boots up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Welcome screen
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn3ibxlwj605xrrk9zfi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn3ibxlwj605xrrk9zfi.jpg" alt="Get started screen" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"Get started" — yes, please.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language and timezone
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkivl91tfq5u5wv55omq2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkivl91tfq5u5wv55omq2.jpg" alt="Language and timezone" width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Standard Linux installer territory — same as Ubuntu?&lt;/p&gt;

&lt;h3&gt;
  
  
  Privacy settings
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v6rcegdhblxy68puwzv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v6rcegdhblxy68puwzv.jpg" alt="Privacy settings" width="800" height="755"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Diagnostic data sharing prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  System update
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaogr37pvykiys42avjz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaogr37pvykiys42avjz.jpg" alt="Update started" width="799" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The moment I plugged it in, it started updating itself. Modern Linux being Linux.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup complete
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho3zwtbliqbvvabytjwt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho3zwtbliqbvvabytjwt.jpg" alt="Setup complete" width="712" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I picked a username and let the hostname auto-assign. DGX-side prep done.&lt;/p&gt;




&lt;h2&gt;
  
  
  Connecting from my Windows PC
&lt;/h2&gt;

&lt;p&gt;Plugging a monitor into the DGX every time would be tedious, so I want to SSH in from my regular Windows machine (which I've nicknamed "myPC1").&lt;/p&gt;

&lt;p&gt;NVIDIA provides a desktop app called NVIDIA Sync that's supposed to make SSH setup painless. So I install it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5uezyis10h8hluz2xb0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5uezyis10h8hluz2xb0.jpg" alt="NVIDIA Sync install" width="643" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;…and that's where I fell into a trap big-time. Windows OpenSSH refused to connect with a "your SSH config has weird permissions, can't trust it" error.&lt;/p&gt;

&lt;p&gt;Full troubleshooting steps are in the "Tech details" section below.&lt;/p&gt;




&lt;h2&gt;
  
  
  Inside the DGX, finally
&lt;/h2&gt;

&lt;p&gt;After much wrestling, I made it inside. Here's the rough lay of the land:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;NVIDIA GB10 Grace Blackwell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;128GB (unified between CPU and GPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;4TB SSD (basically empty)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;20 cores (perf + efficiency combo)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idle power&lt;/td&gt;
&lt;td&gt;4W (yes, four)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;128GB of memory is apparently 8–16x what's in a typical laptop.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting up image generation → 🐱
&lt;/h2&gt;

&lt;p&gt;This is the main event. I'm setting up ComfyUI to generate the first cat from this DGX.&lt;/p&gt;

&lt;p&gt;The ComfyUI interface looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53sxhno991z6dni6ozdv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53sxhno991z6dni6ozdv.jpg" alt="ComfyUI connected" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The "boxes connected by cables" view is intimidating at first, but the default workflow is pre-wired. You just type a prompt and hit Queue Prompt.&lt;/p&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a cute fluffy cat sitting on a sunny windowsill, photorealistic, high detail, beautiful lighting, soft fur, cinematic, masterpiece, best quality&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A few seconds later...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj7bnj1taf0oplft2itq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj7bnj1taf0oplft2itq.png" alt="ComfyUI cat 1" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🐱 There it is — the very first cat my DGX has ever drawn!&lt;/p&gt;

&lt;p&gt;Tweaked the prompt and made some more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5lb3ptrj1ue8cdejtno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5lb3ptrj1ue8cdejtno.png" alt="ComfyUI cat 2" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Eyes a bit unsettling but yeah, fluffy cat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrkiqc4px9sjeys77cms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrkiqc4px9sjeys77cms.png" alt="ComfyUI cat 3" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Going a touch dark there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6e77vf9oc2i6hem4fdm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6e77vf9oc2i6hem4fdm.png" alt="ComfyUI cat 4" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;…is this a cat? It feels artistic though.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwdpwdcz28b6dsmhsz51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwdpwdcz28b6dsmhsz51.png" alt="ComfyUI cat 5" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Distinctive composition.&lt;/p&gt;

&lt;p&gt;Each masterpiece takes a few to a dozen seconds. That speed means I can iterate on prompts without thinking about cost — which turned out to be quite addictive.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech details (let the AI explain it)
&lt;/h2&gt;

&lt;p&gt;The rest is the technical stuff. Read on if you're curious.&lt;/p&gt;

&lt;p&gt;I'm a non-engineer poking at this stuff for the first time, so I had Claude (my AI pair programmer for this challenge) write up the technical details. Hopefully useful for anyone walking the same path.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How to actually get SSH working on Windows&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;NVIDIA Sync should generate an SSH keypair, register the public key on the DGX side at &lt;code&gt;~/.ssh/authorized_keys&lt;/code&gt;, and let you connect without a password.&lt;/p&gt;

&lt;p&gt;If it doesn't work, the cause is usually permissions on Windows SSH config files.&lt;/p&gt;
&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ssh spark-XXXX.local
Bad permissions. Try removing permissions for user: [PC]\CodexSandboxUsers
on file C:/Users/[user]/.ssh/config.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;If you've installed Codex CLI or similar sandboxing tools in the past, the &lt;code&gt;[PC]\CodexSandboxUsers&lt;/code&gt; group may have inherited permissions on &lt;code&gt;~/.ssh/&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Fix (run from an elevated PowerShell)
&lt;/h3&gt;

&lt;p&gt;Use environment variables to avoid hard-coding your username/PC name.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Take ownership&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;takeown&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/f&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh\config"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh\config"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/grant:r&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERNAME&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;:F"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Disable inheritance (keeping a copy of the inherited entries), then remove the sandbox group&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh\config"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/inheritance:d&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh\config"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/remove&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;COMPUTERNAME&lt;/span&gt;&lt;span class="s2"&gt;\CodexSandboxUsers"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
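&lt;p&gt;Use &lt;code&gt;/inheritance:d&lt;/code&gt; rather than &lt;code&gt;/inheritance:r&lt;/code&gt;: &lt;code&gt;:d&lt;/code&gt; copies the inherited entries before disabling inheritance, while &lt;code&gt;:r&lt;/code&gt; removes them all, which can lock you out of the file.&lt;/p&gt;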
&lt;h3&gt;
  
  
  NVIDIA Sync's internal config files need the same treatment
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;~/.ssh/config&lt;/code&gt; &lt;code&gt;Include&lt;/code&gt;s an NVIDIA Sync config file, and that one inherits the same problem.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;LOCALAPPDATA&lt;/span&gt;&lt;span class="s2"&gt;\NVIDIA Corporation\Sync\config\ssh_config"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/inheritance:d&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/remove&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;COMPUTERNAME&lt;/span&gt;&lt;span class="s2"&gt;\CodexSandboxUsers"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;LOCALAPPDATA&lt;/span&gt;&lt;span class="s2"&gt;\NVIDIA Corporation\Sync\config\nvsync.key"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/inheritance:d&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/remove&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;COMPUTERNAME&lt;/span&gt;&lt;span class="s2"&gt;\CodexSandboxUsers"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Ghost SIDs that icacls can't remove
&lt;/h3&gt;

&lt;p&gt;If SIDs from deleted user accounts are still lingering in the ACL, &lt;code&gt;icacls /remove&lt;/code&gt; with an account name won't match them, because there is no name left to match. One way to clear them is PowerShell ACL manipulation:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;LOCALAPPDATA&lt;/span&gt;&lt;span class="s2"&gt;\NVIDIA Corporation\Sync\config\ssh_config"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$acl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Get-Acl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$badRules&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$acl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Access&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Where-Object&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="c"&gt;# Get-Acl leaves identities it cannot resolve as raw SID strings&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IdentityReference&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-like&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"S-1-5-*"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$badRules&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ForEach-Object&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$acl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;RemoveAccessRule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Out-Null&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Set-Acl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-AclObject&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$acl&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;After this, &lt;code&gt;ssh spark-XXXX.local&lt;/code&gt; connects on the first try (replace XXXX with your hostname).&lt;/p&gt;

&lt;h3&gt;
  
  
  Commands to check DGX specs
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# GPU&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;nvidia-smi
NVIDIA-SMI 580.142    Driver Version: 580.142    CUDA Version: 13.0
GPU 0: NVIDIA GB10    36C    P8    4W / N/A

&lt;span class="c"&gt;# OS&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt;
Linux spark-XXXX 6.17.0-1014-nvidia ... aarch64 GNU/Linux

&lt;span class="c"&gt;# Memory&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;free &lt;span class="nt"&gt;-h&lt;/span&gt;
Mem: 121Gi  2.6Gi  118Gi

&lt;span class="c"&gt;# Storage&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt;
/dev/nvme0n1p2  3.7T  47G  3.5T  2%  /

&lt;span class="c"&gt;# CPU&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;lscpu
Architecture:  aarch64
CPU&lt;span class="o"&gt;(&lt;/span&gt;s&lt;span class="o"&gt;)&lt;/span&gt;:        20
Model name:    Cortex-X925 + Cortex-A725
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Notable bits:&lt;/p&gt;

&lt;ul&gt;
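&lt;li&gt;CUDA 13.0 (the highest CUDA version the installed driver supports, as reported by &lt;code&gt;nvidia-smi&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;aarch64 (ARM64) architecture: yes, the DGX Spark is ARM&lt;/li&gt;
&lt;li&gt;121Gi of memory visible to the OS (the machine's unified memory is nominally 128GB)&lt;/li&gt;
&lt;li&gt;20 cores in a big.LITTLE-style layout (10 Cortex-X925 performance + 10 Cortex-A725 efficiency)&lt;/li&gt;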
&lt;/ul&gt;
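&lt;p&gt;Since the box is ARM64, it can help to confirm the architecture before picking wheels or prebuilt binaries. Here is a minimal sketch, assuming a POSIX-style shell; &lt;code&gt;arch_family&lt;/code&gt; is a hypothetical helper name I made up for illustration, not part of any tool mentioned here.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a machine string (as printed by `uname -m`)
# to a short architecture family name.
arch_family() {
  case "$1" in
    aarch64|arm64) echo "arm64" ;;
    x86_64|amd64)  echo "x86_64" ;;
    *)             echo "unknown" ;;
  esac
}

# Report what the current machine says about itself.
machine="$(uname -m)"
echo "uname -m reports: $machine ($(arch_family "$machine"))"
```

&lt;p&gt;On the DGX Spark this should report the &lt;code&gt;arm64&lt;/code&gt; family, matching the &lt;code&gt;aarch64&lt;/code&gt; shown by &lt;code&gt;uname -a&lt;/code&gt; and &lt;code&gt;lscpu&lt;/code&gt;.&lt;/p&gt;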

&lt;h3&gt;
  
  
  ComfyUI installation steps
&lt;/h3&gt;

&lt;p&gt;I followed NVIDIA's official &lt;a href="https://build.nvidia.com/spark" rel="noopener noreferrer"&gt;ComfyUI playbook&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Virtual environment&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv comfyui-env
&lt;span class="nb"&gt;source &lt;/span&gt;comfyui-env/bin/activate

&lt;span class="c"&gt;# PyTorch with CUDA 13.0&lt;/span&gt;
pip3 &lt;span class="nb"&gt;install &lt;/span&gt;torch torchvision &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cu130

&lt;span class="c"&gt;# ComfyUI itself&lt;/span&gt;
git clone https://github.com/comfyanonymous/ComfyUI.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ComfyUI
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Model (SD 1.5, ~2GB)&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;models/checkpoints/
wget https://huggingface.co/Comfy-Org/stable-diffusion-v1-5-archive/resolve/main/v1-5-pruned-emaonly-fp16.safetensors

&lt;span class="c"&gt;# Launch server&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/ComfyUI
python main.py &lt;span class="nt"&gt;--listen&lt;/span&gt; 0.0.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
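&lt;p&gt;The &lt;code&gt;cu130&lt;/code&gt; at the end of the PyTorch index URL is the CUDA version with the dot removed. A small sketch of that mapping follows, assuming the URL pattern used in the install command; &lt;code&gt;cuda_tag&lt;/code&gt; and &lt;code&gt;torch_index_url&lt;/code&gt; are hypothetical helper names, and not every CUDA version necessarily has a matching index, so treat the result as something to check rather than a guarantee.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a CUDA version such as "13.0"
# into a wheel tag such as "cu130" by dropping the dot.
cuda_tag() {
  printf 'cu%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Hypothetical helper: build the wheel index URL for that tag,
# following the pattern of the URL in the install command.
torch_index_url() {
  printf 'https://download.pytorch.org/whl/%s\n' "$(cuda_tag "$1")"
}

torch_index_url 13.0
# prints: https://download.pytorch.org/whl/cu130
```

&lt;p&gt;The version to pass in is the one from the &lt;code&gt;nvidia-smi&lt;/code&gt; header, here 13.0.&lt;/p&gt;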
&lt;p&gt;Key packages installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;torch 2.11.0+cu130&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cuDNN 9.19&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;NCCL 2.28&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;transformers 5.7.0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;comfyui-frontend-package 1.42.15&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open &lt;code&gt;http://spark-XXXX.local:8188&lt;/code&gt; from your Windows PC's browser to access ComfyUI (XXXX is your hostname).&lt;/p&gt;
&lt;h3&gt;
  
  
  Download speed
&lt;/h3&gt;

&lt;p&gt;The 2GB model came down from Hugging Face's CDN in 50 seconds at 40.6 MB/s. That's roughly 325 Mbps, or about a third of my home 1Gbps LAN.&lt;/p&gt;
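&lt;p&gt;To compare a download rate in MB/s with a link speed in Mbps, multiply by 8. The sketch below runs the numbers from this download; it assumes decimal megabytes and a steady rate, and the function names are hypothetical helpers for illustration only.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Convert a transfer rate in MB/s (decimal megabytes) to Mbit/s.
to_mbps() {
  awk -v r="$1" 'BEGIN { printf "%.1f\n", r * 8 }'
}

# Percentage of a link (given in Mbit/s) used by a rate in MB/s.
link_share() {
  awk -v r="$1" -v link="$2" 'BEGIN { printf "%.1f\n", r * 8 / link * 100 }'
}

# Total megabytes moved at a steady rate over a number of seconds.
total_mb() {
  awk -v r="$1" -v s="$2" 'BEGIN { printf "%.0f\n", r * s }'
}

echo "rate:  $(to_mbps 40.6) Mbit/s"
echo "share: $(link_share 40.6 1000) % of a 1 Gbps link"
echo "size:  $(total_mb 40.6 50) MB"
```

&lt;p&gt;With these inputs the rate works out to 324.8 Mbit/s, the share to 32.5 %, and the total to 2030 MB, which lines up with the roughly 2GB checkpoint file.&lt;/p&gt;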




&lt;h2&gt;
  
  
  Tomorrow: Day 2
&lt;/h2&gt;

&lt;p&gt;Day 2 plan: Train a LoRA on photos of my actual cat.&lt;/p&gt;

&lt;p&gt;Today's SD 1.5 only knows "some cat from somewhere". With LoRA fine-tuning, I should be able to teach it about my specific cat. That kind of personalization feels like the killer feature of running locally.&lt;/p&gt;




&lt;p&gt;#100ExperimentsWithDGX #LocalLLM&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>comfyui</category>
    </item>
  </channel>
</rss>
