DEV Community: PEPPERCORN

[Day 18] I set up a company of AI agents with Claude Code — and a local LLM (qwen) joined as the caretaker

PEPPERCORN — Sat, 25 Jul 2026 00:16:28 +0000

Intro

Day 18!

On Day 17 I audited 33,469 of my own AI conversations. Building on that, today I set up a company of cats to run my task management 🐱

What I used: Claude Code / DGX Spark / Slack & Notion.

Today's plan

What I want: hand the annoying parts of task management to a company of cats
Approach: build in order — the org, the task flow, the ledger, AI collaboration rules, then deploy to both machines
Done means: every morning, Necco and I pick "today's three tasks"
Result:
- A company of six cats. Inbox = Slack, source of truth = Notion, memory = the ledger
- I inventoried my projects: 80 of them. All in the ledger now
- One page of collaboration rules for Claude Code / Codex / qwen

① Built the org

Role-based staff (subagents), all named "___-neko" (neko = cat).

Name	Role
Necco	Chief secretary: morning meeting, task triage, routing, ledger upkeep
Shirabe-neko	Research
Tsukuri-neko	Implementation
Mekiki-neko	Review (read-only)
Soroban-neko	Number crunching
Rusuban-neko	Caretaker: a local LLM (qwen) living on the DGX. Summarizes, sorts, and tags incoming files

Before building, I looked at prior art. Having an individual delegate their own task management to an AI appears to be an established pattern, known as the "AI Chief of Staff".

The staff internals come tomorrow in Day 19.

② Built the task flow

A task pops up → one line into Slack from my phone (3 seconds)
             → [morning meeting] Necco sweeps the inbox
             → cleans it up, adds a due date, registers it in Notion
             → proposes "today's three"

Think of something, drop one line in Slack. By morning it's in Notion. An evening "done: ___" one-liner feeds the next morning's ledger update and task close-out.
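A minimal sketch of the sorting step inside the morning sweep, assuming the inbox arrives as a plain list of one-line strings. The `triage` function, the `Triage` type, and the exact `done:` prefix check are illustrative assumptions; fetching from Slack and registering in Notion are left out on purpose.

```python
from dataclasses import dataclass

DONE_PREFIX = "done:"  # assumed marker, modeled on the evening "done: ___" one-liner


@dataclass
class Triage:
    new_tasks: list[str]
    done: list[str]


def triage(lines: list[str]) -> Triage:
    """Split raw inbox one-liners into new tasks and completion reports."""
    result = Triage(new_tasks=[], done=[])
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # ignore blank messages
        if text.lower().startswith(DONE_PREFIX):
            # keep only what follows the prefix, e.g. "renew domain"
            result.done.append(text[len(DONE_PREFIX):].strip())
        else:
            result.new_tasks.append(text)
    return result
```

Completion reports would feed the ledger update and close-out; everything else becomes a candidate task.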

③ Built the ledger

What's the ledger (LEDGER.md)? One file listing every project, so Necco can answer "where was that project again?" instantly.

For the initial data, I inventoried both of my machines. The count: 80 projects. I had no idea it was that many...

④ Set the AI collaboration rules

Instead of deciding "which AI gets this job" every time, it's now one page:

AI	Role
Claude Code	Lead (design, planning, review, conversation)
Codex CLI	Routine subcontractor (implementation)
qwen (local LLM)	Batch pre-processing (summarize, sort, tag — stays on the DGX)

Rusuban-neko is the DGX side of this rule. Zero API cost, and the data never leaves the house.

⑤ Deployed to both machines

The company is a git repository. I gave it a private GitHub remote and cloned it onto the DGX. An install script places the staff and skills into ~/.claude/.

Now either machine can summon the same company.

The first morning meeting

I opened Claude Code in the office and ran the first morning meeting.

The one-liner I had tossed into Slack the night before had been cleaned up and registered in Notion by Necco. It feels genuinely comfortable.

The full picture


Chief secretary	Necco (Claude Code)
Staff	Shirabe / Tsukuri / Mekiki / Soroban (subagents)
Caretaker	Rusuban-neko (local qwen on the DGX)
Inbox	Slack
Source of truth	Notion
Memory	the ledger file (80 projects)

Layer	Role
Slack	The inbox — ephemeral, free-form, 3 seconds to post
Notion	The source of truth — same as before
Ledger	The company memory — what exists, where, in what state

secretary/                   ← the office (cloned on both machines)
  CLAUDE.md                  ← Necco's persona, house rules, meeting runbook
  LEDGER.md                  ← the ledger (80 projects)
  input/                     ← file inbox (pre-processed by Rusuban-neko)
  agents/                    ← staff definitions (source of truth)
  skills/                    ← skills (source of truth)
  install.sh / install.ps1   ← places staff & skills into ~/.claude/

Notion stays; only the entrance changed. Necco does the formatting and registering.

Today's takeaways

Separate the inbox from the source of truth: Notion felt tedious because it was doing both jobs
Keep the company's memory in git: AI auto-memory splits across two machines
Default patterns are fine: borrow the shape from prior art, fit the details to your own setup

The details

The morning meeting runbook (8 steps)

1. git pull (bring in the other machine's updates)
2. Rusuban-neko pre-processes anything new in input/
3. Close out yesterday's three (update Notion statuses)
4. Sweep the Slack inbox → turn into TODOs → register in Notion
5. Notion + calendar + inbox + ledger → propose "today's three"
6. Reflect the evening one-liners into the ledger
7. Post the final digest to Slack
8. git push

Git sync (1 and 8) and close-out (3) are baked into the runbook. The human only "posts" and "answers".

Pitfalls I mapped out beforehand

Pitfall	Countermeasure
The ledger goes stale	The evening one-liner habit
AI memory splits across two machines	Permanent knowledge goes into git-tracked files; auto-memory is treated as cache
Forgetting to sync the two machines	pull / push are steps 1 and 8 of the runbook
Work calendar isn't visible	Fill in verbally at the morning meeting, for now
No closed loop on task completion	The meeting starts with closing out yesterday's three

Two-machine sync

Staff definitions (agents/) and skills (skills/) live in the repository as the source of truth
install.sh (Linux) / install.ps1 (Windows) place them into ~/.claude/
Adding a staff member = commit → pull on the other machine → run install
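A rough Python equivalent of what the install step does, for illustration only. The real scripts are install.sh and install.ps1 and are not shown here; the `install` function and the `agents` / `skills` subfolder names under `~/.claude/` are assumptions.

```python
import shutil
from pathlib import Path


def install(repo: Path, claude_home: Path) -> list[Path]:
    """Copy agents/ and skills/ from the repo into the Claude config directory."""
    installed = []
    for name in ("agents", "skills"):
        src = repo / name
        if not src.is_dir():
            continue  # nothing of this kind to install
        dst = claude_home / name
        # dirs_exist_ok=True lets a re-run overwrite earlier copies in place
        shutil.copytree(src, dst, dirs_exist_ok=True)
        installed.append(dst)
    return installed


# Hypothetical usage from the repository root:
#   install(Path.cwd(), Path.home() / ".claude")
```

Because the repository stays the source of truth, re-running the copy after every pull is safe: the installed files are only ever a mirror.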

Prior-art notes

mimurchison/claude-chief-of-staff: CLAUDE.md at the core; goals filter every priority; humans keep the final say
Claudia: remembers promises and relationships; a messaging gateway from your phone
loganhc-09/claude-chief-of-staff: resident operation via scheduled scripts

The common rule: always-on rules → CLAUDE.md / occasional methods → skills / context-heavy work → subagents. This build follows it too.

Tomorrow: Day 19

Staff day — the four subagent definition files, plus getting Rusuban-neko (the local LLM) running as a resident service.

Thanks for reading!

[Day 17] I analyzed 33,469 of my own AI conversations to audit how I actually use AI

PEPPERCORN — Thu, 23 Jul 2026 04:34:59 +0000

Intro

Day 17!

Today I collected everything I've ever said to an AI and had a local model read it back to me. Seventeen months of history — then a self-audit of how I actually use these tools, and where the easy wins are.

What I collected	Messages	Mine only
ChatGPT (browser)	10,002	4,964
Claude (browser)	10,937	5,482
Terminal (Claude Code, Codex CLI)	12,530	3,518
Total	33,469	13,964

What I used: DGX Spark (my home AI machine) / qwen2.5 (the local model doing the analysis).

Usage drifted from browser to terminal over time.

Where the history lives

ChatGPT (browser)

1. Open Settings from the account icon (top right)
2. Go to Data controls
3. Hit Export data → confirm
4. A link arrives by email; download the zip from there

Claude (browser)

1. Open Settings from your name (bottom left)
2. Go to Privacy
3. Hit Export data
4. A link arrives by email; download the zip from there

Neither one is instant — you wait a bit for the email.

Terminal (Claude Code, Codex CLI)

No request needed. The logs are already on your machine — just copy them.

# Claude Code
~/.claude/projects/<per-project>/*.jsonl

# Codex CLI
~/.codex/sessions/<year>/<month>/<day>/*.jsonl
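A small sketch for taking stock of those log folders before analyzing them, assuming each `.jsonl` file holds one JSON record per line. The `count_records` helper is hypothetical, and it makes no assumption about the fields inside each record.

```python
import json
from pathlib import Path


def count_records(root: Path) -> dict[str, int]:
    """Return {relative file path: number of parseable JSON lines} under root."""
    counts = {}
    for path in sorted(root.rglob("*.jsonl")):
        n = 0
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                try:
                    json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip truncated or corrupt lines
                n += 1
        counts[path.relative_to(root).as_posix()] = n
    return counts


# Hypothetical usage:
#   count_records(Path.home() / ".claude" / "projects")
```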

What you get from each

	Browser export	Terminal logs
How to get it	Request, wait for email	Just copy the files
Conversation text	Yes	Yes
Image / attachment files	Yes (563 for me)	No
Shared-link conversations	Yes	—
Project settings & docs	Yes	Yes
Record of what the AI actually did	No	Yes

Note that the browser export does not contain your terminal history, and vice versa. If you work across two machines, collect from both.

Audit 1: what I use it for

First, what was I actually using AI for? I had qwen2.5 read each conversation and tag it one by one ("this is research," "this is coding").

The big ones were research & learning (291), writing & editing (236), coding (193), and setup & troubleshooting (151) — two-thirds of everything.

Split by tool, it separates cleanly:

	Browser 1,267 convos	Terminal 75 convos
Coding	12.0%	54.7%
Setup & troubleshooting	10.9%	17.3%
Research & learning	22.7%	5.3%
Writing & editing	18.1%	9.3%

Terminal is about 72% code-related (coding plus setup & troubleshooting); browser is about 41% research and writing. The topics themselves split neatly between the two tools.
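The two headline shares are just row sums from the table. A quick check, with the percentages copied from the table and the dictionary keys chosen for illustration:

```python
# Share of conversations per topic, in percent, copied from the table.
browser = {"coding": 12.0, "setup": 10.9, "research": 22.7, "writing": 18.1}
terminal = {"coding": 54.7, "setup": 17.3, "research": 5.3, "writing": 9.3}

# "Code-related" here means coding plus setup & troubleshooting.
terminal_code_related = round(terminal["coding"] + terminal["setup"], 1)
browser_research_writing = round(browser["research"] + browser["writing"], 1)

print(terminal_code_related)     # 72.0
print(browser_research_writing)  # 40.8
```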

One caveat: the terminal side is only 75 conversations, so read it as a rough tendency, not a precise number. (Why so few? See "Not keeping work data in the first place" below.)

Audit 2: how I ask

Next, how I phrase requests. I counted phrasing patterns as a proxy for care. The yardstick is Anthropic's AI Fluency Framework.

Phrasing I counted	Browser	Terminal
Specify steps or format	21%	40%
Reject the output	11%	22%
Ask for the reasoning	8%	24%
Ask it to verify	6%	35%

In the terminal, my instructions are much more detailed.

(A fair reading: the terminal is where I do code work, and code work naturally invites step-by-step instructions and verification — so this may be a difference in task, not in skill.)

Audit 3: which features I use

The terminal logs record what the AI actually did. Here's the operation history:

Most-used operations	Count
Bash (run a command)	4,053
Edit (change a file)	2,844
Read (read a file)	2,443
PowerShell	1,001
Write (create a file)	612
WebSearch	460

So the reality is: read, edit, run a command. Nothing fancy.

And here's how often I reached for the fancier stuff. Turns out I have never written a skill of my own...!

Feature	Usage
Agents (the AI spawns helpers)	78 times, 5,542 helper messages
MCP (connect to outside services)	80 times
Skills (built-in)	6 times
Skills (my own)	0
Plan mode (approve a plan before it runs)	1 time
`/model` (switch model)	82 times

Agents and MCP I already lean on. Skills and plan mode? Almost untouched.

The easy wins

Three obvious places to improve:

1. Turn routine work into a Skill.

Zero of my own. My article pre-publish checklist runs the same way every time — I'd rather call it by name than re-ask by hand each time.

2. Use plan mode before big jobs.

Once in 17 months. I keep course-correcting mid-run; I'd rather see the plan first.

3. Move all coding into the terminal.

12.0% of my browser conversations are still coding. There the AI can't touch files directly, so I waste round-trips copy-pasting.

The details

What I counted, and what I left out

I counted only human↔AI text exchanges. I dropped:

agent-*.jsonl / journal.jsonl — internal work logs the AI keeps
Sub-agent (AI helper) messages, 5,542 of them — not typed by a human
Auto-inserted text like <system-reminder> — same reason
Image/attachment contents (563 .dat files) — text only this time

One thing I missed: Claude's design_chats (3 conversations) got skipped because my loader only reads conversations.json. It's just one message, so the numbers don't move — but I can't claim I read the entire export.

What counts as a "phrasing"

I picked these up with regular expressions:

What I counted	Example words (Japanese source)
Specify steps or format	first / next / finally / "in a table" / bullet points / JSON
Reject the output	wrong / not that / that's off / not working
Ask for the reasoning	really? / is that right / evidence / source / why
Ask it to verify	test / run it / verify / double-check / reproduce

It's counting surface phrasing, not actual competence — treat it as a rough proxy. The comparison used conversations from May 2026 onward, where I use both tools: 348 browser vs 102 terminal.
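A minimal sketch of this kind of count. The patterns below are English stand-ins built from the example words in the table; the actual Japanese patterns are not shown in this post. Counting per conversation (at least one match anywhere in my messages) is also an assumption.

```python
import re

# English stand-ins for the Japanese keywords (illustrative only).
PATTERNS = {
    "specify": re.compile(r"\b(first|next|finally|in a table|bullet points|json)\b", re.I),
    "reject": re.compile(r"\b(wrong|not that|that's off|not working)\b", re.I),
    "reasoning": re.compile(r"\b(really|is that right|evidence|source|why)\b", re.I),
    "verify": re.compile(r"\b(test|run it|verify|double-check|reproduce)\b", re.I),
}


def phrasing_rates(conversations: list[list[str]]) -> dict[str, float]:
    """Share of conversations where at least one user message matches each pattern."""
    hits = {name: 0 for name in PATTERNS}
    for messages in conversations:
        text = "\n".join(messages)
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                hits[name] += 1
    total = len(conversations) or 1  # avoid dividing by zero on empty input
    return {name: hits[name] / total for name in PATTERNS}
```

Running it separately on the browser and terminal sets gives the two columns to compare.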

Not keeping work data in the first place

Nearly half the history was work. That isn't even really my data — it belongs to the other party.

At load time I classify by folder name; for work, I throw away the body and keep only a character count.

cat = classify(project)
# keep the body only when it's confirmed personal
body = (text or None) if cat in ("personal", "web") else None

The classification keywords are literally client/partner names, so they don't live in the code — they're read from a config file that never enters Git. The classifier itself is more sensitive than the data it classifies — an obvious point I only noticed during the pre-publish check.
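A fuller sketch of the same idea, under stated assumptions: the keyword file is JSON shaped like `{"work": ["..."], "personal": ["..."]}`, matching is a case-insensitive substring test on the folder name, and anything unmatched is treated as not confirmed personal. The function names and the `"unknown"` category are hypothetical.

```python
import json
from pathlib import Path


def load_keywords(config_path: Path) -> dict[str, list[str]]:
    """Read {category: [keyword, ...]} from a JSON file kept out of version control."""
    return json.loads(config_path.read_text(encoding="utf-8"))


def classify(project: str, keywords: dict[str, list[str]]) -> str:
    """Return the first category whose keyword appears in the folder name."""
    name = project.lower()
    for category, words in keywords.items():
        if any(word.lower() in name for word in words):
            return category
    return "unknown"  # unmatched projects are not confirmed personal


def load_message(project: str, text: str, keywords: dict[str, list[str]]) -> dict:
    """Keep the character count always; keep the body only when confirmed personal."""
    category = classify(project, keywords)
    keep_body = category in ("personal", "web")
    return {
        "category": category,
        "chars": len(text),
        "body": (text or None) if keep_body else None,
    }
```

Defaulting unmatched folders to "drop the body" errs on the side of losing analysis data rather than keeping something that belongs to someone else.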

Because of this design, I could read only 75 terminal conversations out of a terminal history of 12,530 messages. A design that protects me also trimmed my own analysis.

The measurement flipped three times

My first pass showed every metric dropping. Read naively: "I got sloppy." But it was the denominator's fault.

How I measured	What it showed	The problem
Rate per message	All metrics fall	Shorter messages drag it down automatically
Rate per conversation	Still falls	Picking up the shifting browser/terminal mix
Split browser vs terminal	Terminal higher on everything	← adopted this

I couldn't compare past-me to present-me directly: the tools swapped underneath me, so a skill difference and a place difference get tangled together.

A junked experiment: blind-judging old vs new requests

I hid the dates and had qwen2.5 compare an old request against a new one. Result: old won 10, new won 3. But it was browser-vs-browser, so I was really comparing "browser when I used it seriously" against "browser after it became an afterthought." It says nothing about whether I improved. (I judged each pair twice with the order flipped, counting a win only when both agreed — a fix for the position bias I found in the Day 14 cat-meow quiz.)
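The order-flip rule can be sketched like this. The `judge` callable stands in for the local model call and is an assumption: it receives two texts and answers `"first"` or `"second"`.

```python
from typing import Callable, Optional

# judge(first_text, second_text) returns "first" or "second".
Judge = Callable[[str, str], str]


def debiased_winner(old: str, new: str, judge: Judge) -> Optional[str]:
    """Judge the pair in both orders; return "old" or "new" only if both runs agree."""
    forward = judge(old, new)    # old is shown first
    backward = judge(new, old)   # new is shown first
    first_pick = "old" if forward == "first" else "new"
    second_pick = "new" if backward == "first" else "old"
    return first_pick if first_pick == second_pick else None


def tally(pairs: list[tuple[str, str]], judge: Judge) -> dict[str, int]:
    """Count wins per side, discarding pairs where the two orders disagree."""
    counts = {"old": 0, "new": 0, "discarded": 0}
    for old, new in pairs:
        winner = debiased_winner(old, new, judge)
        counts[winner or "discarded"] += 1
    return counts
```

A judge that always prefers whatever comes first disagrees with itself on every pair, so all of its verdicts are discarded rather than counted.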

Speed notes, since I got stuck:

Snag	What happened
Big model won't fit	qwen2.5 72B: only 39.2 of 62.9 GB on the GPU → 7 min per judgment
Model switching	The previous model lingers; timeouts right after a switch
Prompt length	621 chars = 8s, 918 chars = 22s, 1,500 chars didn't finish in 60s

Switching to 32B (19 GB) fit entirely on the GPU → a few seconds each. Fitting or not fitting changes the order of magnitude. And the scoring isn't very reliable: of 25 browser-vs-browser pairs only 16 were usable, and for browser-old vs terminal-new, 17 of 25 timed out. No cloud API was used, so this analysis cost $0 (electricity aside).

Sources

The AI Fluency Framework — Rick Dakan, Joseph Feller, and Anthropic (CC BY-NC-SA 4.0): https://aifluencyframework.org/

Outro

Line up 17 months and you can see exactly what you were doing each month. Re-uploading all of that somewhere to analyze it feels like a bit much. Keeping it on my own machine is well suited to moments like this.

Soon, I'd like to build a Skill of my own.

Thanks for reading!

[Day 16] I made a theme song for my cat — lyrics, melody, and singing voice, all AI, all local

PEPPERCORN — Sat, 18 Jul 2026 14:34:28 +0000

Intro

Day 16!

Today's experiment: make a theme song for my family's cat — and see if it actually feels like their song, not a generic AI tune.

I used ACE-Step 1.5, a local music-generation AI (think "a Suno you can run at home"). It produced the whole package: lyrics, melody, and a singing voice. Everything here was generated locally on my own machine.

What I used: DGX Spark (my home AI machine) / ACE-Step 1.5 XL (music-generation AI).

Today's experiment

What I wanted to do

One theme song for my cat (Japanese vocals, ~3 minutes). Plus a few background tracks on the side, to measure how fast generation really is.

The approach

First, the AI interviews me about my cat
It turns my answers into lyrics (there's a "template" for which answer goes where)
Hand the lyrics and a style prompt to ACE-Step 1.5 to generate the song

The goal

Not "huh, the AI made something," but a song I'd actually recognize as my cat's.

STEP 1: the AI interviewed me about my cat

Writing the lyrics started with me answering six questions.

#	Question	My answer
1	What do you call them in the song?	"Our Nekko" (our little cat)
2	Personality in a word?	Super tsundere, super timid
3	Favorite gesture or habit?	Getting the base of their tail tapped
4	One memorable episode?	It takes at least six months to truly bond
5	How do you feel about them?	Calm — when they're nearby
6	Any song requests?	Bright and upbeat, female vocal

"Tsundere" = prickly on the outside, secretly sweet on the inside. A very cat thing.

STEP 2: the template that turns answers into lyrics

Each answer lands in a fixed place in the song.

Interview answer	Where it goes
The name	Chorus (repeated in the catchiest spot)
Personality	Verse 1 (introduce the character with everyday scenes)
Habit	Verse 2 (sharpen the picture with specifics)
Memory	Bridge (the emotional beat late in the song)
Feeling	The chorus's tone + the overall mood
Song request	Not lyrics — the style prompt

With that template, my cat's chorus came out like this (Japanese, with a rough English gloss):

Uchi-no-Nekko wa tsun-tsun-tsundere      (Our Nekko is tsun-tsun-tsundere)
Kokoro no tobira wa katai kedo           (The door to their heart stays shut, but)
Kiiroi hitomi de chiratto ichibetsu      (Those yellow eyes shoot one quick glance)
Dere wa kibun de nen ni suukai           (The sweet side? A few times a year, if you're lucky)
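The answer-to-section template from STEP 2 is essentially a lookup table. A minimal sketch, assuming the interview answers arrive as a dictionary; the key names and the `build_brief` helper are illustrative, and the lyric writing itself is left to the model.

```python
# Which interview answer feeds which song section (from the STEP 2 table).
TEMPLATE = {
    "name": "Chorus",
    "personality": "Verse 1",
    "habit": "Verse 2",
    "memory": "Bridge",
}


def build_brief(answers: dict[str, str]) -> dict[str, object]:
    """Turn interview answers into per-section notes plus a style caption."""
    sections = {
        section: answers[key] for key, section in TEMPLATE.items() if key in answers
    }
    return {
        "sections": sections,                    # material for the lyric writer
        "mood": answers.get("feeling", ""),      # colors the chorus and overall tone
        "caption": answers.get("request", ""),   # goes to the style prompt, not the lyrics
    }
```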

STEP 3: generate. A 3-minute Japanese-vocal song

Hand ACE-Step 1.5 XL the lyrics and a style prompt (upbeat J-pop, female vocal, Japanese), and out came a full 3-minute song with Japanese singing.

Speed: the audio renders in ~10 seconds, but "composing" takes 8 minutes

The interesting part was the time breakdown.

What I made	Mode	Time
BGM (90s, no vocals) ×4	quick	40s total (~10s each)
Theme song (3min, vocals)	quick	~10–15s each
Theme song (3min, vocals)	with composition planner	~8 min (first run ~23 min)

What's the "composition planner"? It's the language model (LM) built into ACE-Step 1.5. It thinks through the song's blueprint (structure, metadata) before rendering audio. Better quality, but the thinking costs time.

Bonus: I made an MV too

How it's built: Whisper transcribes the lyric timings → an image AI (AnythingV5) makes a character and backgrounds → a video AI (LTX-2.3) animates the character → rembg cuts it out and composites it onto the background with subtitles (finer steps in the fold below).

Honestly, the video came out pretty cursed — the lower body suddenly morphs into a tail, the eyes go a little feral — so it's mildly unsettling. That's on the "improve later" list.

Today's takeaways

The slow part isn't the sound, it's the thinking: the audio itself renders in ~10s; the long wait is the AI planning the song's structure
Old tools converged: anime-style image (Day 11) + Whisper (Day 14) + video generation (Day 15) came together into a single MV

The details

:::details Environment and models

Machine: DGX Spark (128GB unified memory)
ACE-Step 1.5: set up from the official repo with uv. DGX Spark (ARM64 + CUDA 13) is listed as an officially supported target, and it just ran
Model setup: DiT is acestep-v15-xl-turbo (4B, 8 steps), planner LM is the 4B version. ~36GB of downloads total
Peak memory during generation ~27GB. Generated via the REST API server (easier to reproduce)
MV side: ComfyUI (LTX-2.3 22B distilled fp8) + AnythingV5 (character & backgrounds) + rembg 2.0.76 (cutout, anime model isnet-anime) + Whisper large-v3 (timings) + ffmpeg (assembly)
:::

:::details Full lyrics and the style prompt
Style prompt (Caption):

upbeat J-pop, energetic and heartwarming, female vocal, bright synth,
acoustic guitar, catchy chorus, 128 bpm, Japanese lyrics

Lyrics are passed with structural tags like [Verse] and [Chorus] (the words are Japanese; the （にゃー） bits are meow ad-libs sung as backing vocals):

[Intro]
（にゃー）

[Verse 1]
目が合った瞬間 ぷいっとそっぽ向く（にゃっ）
呼んでも来ないのに 気づけばそばにいる
物音ひとつで ロケットダッシュ（にゃー！）
ビビリなくせに 顔は堂々

[Pre-Chorus]
ツンとすまして 知らんぷり
それでもしっぽは 正直もの（にゃ？）

[Chorus - catchy]
うちのネッコは ツンツンツンデレ（にゃにゃ！）
心の扉は かたいけど
黄色い瞳で ちらっと一瞥
デレは気分で 年に数回（にゃー）

[Verse 2]
撫でられるのは 好きじゃないくせに
しっぽの付け根を とんとん叩けば
目を細めて 喉を鳴らして
もっと続けてと 視線で命令（にゃっ）

[Bridge - emotional]
心が通じるまで 半年かかった
ゆっくりゆっくり 縮めた距離は
いまでは世界で いちばん近い
気づけば隣が 定位置になった

[Chorus - anthemic]
うちのネッコは ツンツンツンデレ（にゃにゃ！）
ビビリなところも ご愛嬌
とんとんのリズムで しっぽが揺れる
今日も我が家の 王様です（にゃー！）

[Outro - fade out]
うちのネッコ（にゃー）
うちのネッコ（にゃー にゃー）

A tip from the official docs: put the fine-grained style control in the Caption, and keep the lyric tags simple.
:::

:::details How the MV was built (the finer steps)

Uses the intro through the second chorus (~70s). Per-line timings from Whisper large-v3 (since I already know the lyrics, a mis-hear here and there is fine — I only use the timestamps)
20+ cuts. One lyric line = one cut, each assigned a clip whose motion fits the words
The character image is a single frame: my "anime-fied cat" from Day 11, redrawn into a mascot with an image AI (AnythingV5). That's handed to the video AI (LTX-2.3) to animate. ~30s per clip; motion is described in plain requests ("turn away," "walk cycle," etc.)
Cutout compositing is frame-by-frame: rembg (isnet-anime) cuts out the character → composite onto a slowly zooming/panning background → draw subtitles → re-encode at 24fps
Keeping the character's scale fixed within a cut mattered. Matching to the per-frame silhouette makes it shrink the instant it stretches (learned the hard way)
:::
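To make the scale point from the MV notes concrete, here is a toy comparison. The silhouette heights, the target height, and the choice of "fit the tallest pose" for the per-cut scale are all illustrative assumptions, not the actual compositing code.

```python
def per_frame_scales(heights: list[int], target: int) -> list[float]:
    """Naive approach: rescale every frame so its silhouette matches the target height."""
    return [target / h for h in heights]


def per_cut_scale(heights: list[int], target: int) -> float:
    """One scale for the whole cut, chosen so the tallest pose still fits the target."""
    return target / max(heights)


# The character stretches upward in the third frame (silhouette heights in pixels).
heights = [400, 400, 500, 400]
print(per_frame_scales(heights, 300))  # [0.75, 0.75, 0.6, 0.75]
print(per_cut_scale(heights, 300))     # 0.6
```

With per-frame scaling the character visibly shrinks on the stretch frame; with one scale per cut its size stays constant and only the pose changes.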

:::details Gotchas

The dash comes last: when I asked LTX-2.3 to "suddenly bolt away," the motion tended to land in the clip's final 0.5s. Worked around it by shifting the segment I use
Feed the input image in the aspect ratio you want out: at first I fed a portrait (512×768) character image into a landscape (768×512) generation, and every clip came out as an upper-body zoom that cut off the ears. Rebuilt the input as a landscape full-body image to fix it. Video AIs inherit the input image's composition quite strongly
The planner may have been running slower than it should: digging through the logs, the LM's fast-execution component (vLLM) seems to have failed to compile due to a missing Python.h, so it may have quietly fallen back to a slow path (I didn't fully chase down the cause on Day 16). I hit the same root cause on Day 15 (python3.12-dev not installed); this could be a recurring Spark gotcha
:::

:::details License notes

ACE-Step 1.5: code and weights both MIT. Commercial use OK
Planner LM (4B, Qwen3-4B based): also MIT
The generated music is explicitly cleared for commercial use in the official docs (training data is licensed + royalty-free + synthetic)
MV-side tools (LTX-2.3, AnythingV5, rembg, Whisper) are the ones I already vetted on Days 11–15
:::

Outro

Lyrics, melody, singing, and a music video — all AI, all on my own machine. See you next time!
Thanks for reading!

#100ExperimentsWithDGX #LocalLLM

[Day 15] A cat photo became a video — with a meow. Two local video AIs compared: 30s vs 475s

PEPPERCORN — Mon, 13 Jul 2026 06:14:04 +0000

Intro

Day 15!

Today's experiment: turn one photo of my family's cat into a video that moves — and meows.

An image-to-video AI takes a single photo and imagines how the scene continues. I ran two of them and compared the results.

What I used: DGX Spark (my home AI machine) / two local video AIs (LTX-2.3 and Wan 2.2) / one photo of my family's cat.

Today's experiment

What I wanted to find out

Give both AIs the same cat photo and the same request — "look at the camera and meow once, about 4–5 seconds" — and compare them.

The two models

Wan 2.2: a well-established favorite for local video. Video only, no audio
LTX-2.3: a newer model (March 2026) that generates video and audio together in one pass
Same photo, same kind of prompt. Measured: generation time / memory / quality / sound

The goal

Decide which one to use when — with actual numbers from my own machine, not vibes.

The result first: 30 seconds vs 475 seconds

For a ~4–5 second video, LTX-2.3 took 30 seconds. Wan 2.2 took 475 seconds. Roughly a 16x gap.

What's a "step"? The number of refinement passes the AI makes while drawing the video. More steps = more careful but slower.

Time wasn't the only difference, though. Here is each result in turn.

Wan 2.2 — nearly photoreal, but 8 minutes per video

Wan 2.2's output (475s, silent).

The fur, markings, and face are almost indistinguishable from the real photo, and it meows exactly as asked. But one video takes 475 seconds (~8 minutes), with no sound (by design).

A speed-up add-on cut it to 70 seconds, at the cost of tamer motion — more of a tongue flick than a proper meow.

LTX-2.3 — 30 seconds, and it came with a meow

Here is LTX-2.3's output, sound included (volume on 🔊):

Done in 30 seconds, and along with the video it generated an audio track with actual meows. I asked for one meow; it enthusiastically gave two: short, cat-like "myah" sounds roughly in sync with the mouth. (And it sounds startlingly like my family cat's real meow.)

Image quality looks about the same as Wan's on screen. The difference is in the motion: Wan moved more, and more realistically; LTX was a bit more subdued.

Bonus: give it the first and last frame, and you can design the camera work

LTX-2.3 can take a first frame and a last frame, and fill in everything between.

I gave it the wide shot as the first frame, and a face close-up (a crop of the same photo) as the last.

Generated in 35 seconds.

The video starts wide, meows while zooming in, and ends on the close-up I specified.

The numbers

	Time	Peak memory	Audio	Quality impression
Wan 2.2 (standard)	475s	~49GB	none	near-photoreal, faithful motion
Wan 2.2 + speed LoRA	70s	~46GB	none	clean but tamer motion
LTX-2.3	30s	~57GB	yes	on par with Wan; tamer motion
LTX-2.3 (designed zoom)	35s	—	yes	same as above

Resolutions and frame counts follow each model's recommended settings, so this isn't a strictly identical-conditions benchmark (details below).

Today's takeaways

Distinct characters: Wan for faithful motion, LTX for speed and sound. Not "which is better" but "which for what"
Specifying the first and last frame let me design the camera work
The time gap is really a retry-count gap: at 30 seconds, "one more try" is easy

The details

Environment and models

Machine: DGX Spark (128GB unified memory)
ComfyUI v0.24.0 (native LTX-2.3 support) + Lightricks' official custom nodes
Both models in their fp8 (weight-reduced) versions; ~91GB of downloads total
- Wan 2.2 I2V-A14B (high-noise/low-noise pair, 14GB each) + its text encoder
- LTX-2.3 22B distilled (28GB) + Gemma 3 12B as its text encoder (13GB)
The speed add-on for Wan is the lightx2v 4-step LoRA
Attention backend unified on PyTorch SDPA for both models
Runs were headless ComfyUI (API calls) driven by a small runner script that logs time and memory for every run

Comparison-condition fine print

Wan 2.2: 480×640, 81 frames, 16fps (~5.1s), 20 steps
LTX-2.3: 512×768, 97 frames, 24fps (~4.0s), 8 steps
Each model ran its official template's representative settings, so resolution/frames/steps differ — it's a comparison of each model's everyday settings
Per-step time: ~23s for Wan, ~2.6s for LTX
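The clip lengths and the headline gap follow directly from the numbers above. A quick check (the `clip_seconds` helper is just for illustration):

```python
def clip_seconds(frames: int, fps: int) -> float:
    """Length of a clip in seconds."""
    return frames / fps


wan_clip = clip_seconds(81, 16)   # Wan 2.2: 81 frames at 16fps
ltx_clip = clip_seconds(97, 24)   # LTX-2.3: 97 frames at 24fps
time_gap = 475 / 30               # generation time, Wan vs LTX

print(f"Wan 2.2 clip: {wan_clip:.1f}s")   # Wan 2.2 clip: 5.1s
print(f"LTX-2.3 clip: {ltx_clip:.1f}s")   # LTX-2.3 clip: 4.0s
print(f"time gap: {time_gap:.1f}x")       # time gap: 15.8x
```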

The one gotcha I hit (LTX)

My hand-built workflow crashed inside an LTX helper node (LTXVCropGuides).

Cause: with audio+video generation, the latent is a special combined tensor (a NestedTensor), and this node calls an operation it doesn't support
Fix #1: for plain image-to-video, the node isn't needed at all — removing it solved the crash
Fix #2: for the first/last-frame trick the node is required, so I moved it to run after the audio/video split, where the tensor is ordinary again
The known audio-VAE NaN bug I had braced for never appeared

License notes

Wan 2.2: Apache 2.0. Commercial use OK, generated content is yours
LTX-2.3: LTX-2 Community License. Free commercial use under $10M annual revenue. Disclosing that content is AI-generated is mandatory. Also has an unusual remote-access restriction clause
Gemma 3 (LTX's text encoder): Gemma Terms of Use. Commercial use OK, outputs belong to the user, subject to the prohibited-use policy

Outro

One photo, and 30 seconds later my family's cat was meowing on screen. See you next time!
Thanks for reading!

#100ExperimentsWithDGX #LocalLLM

[Day 14] I quizzed an AI on cat meows. It scored worse than random guessing

PEPPERCORN — Thu, 09 Jul 2026 04:57:52 +0000

Intro

Day 14!

Today's experiment: play cat meows to an AI and have it guess what the cat wants.

I really wanted to use my own cat's meows, but couldn't get recordings — so I used a public research dataset (meows from 21 cats).

What I used: DGX Spark (my home AI machine) / Whisper (speech-to-text AI) / Qwen2-Audio (an AI that can listen to audio directly) / 440 cat meows (public dataset).

Today's experiment

What I wanted to find out

Can an AI tell how a cat feels, from its meow alone?

Approach

A research dataset of meows with situation labels (21 cats, 440 clips)
Hide the labels → 90-question 3-choice quiz → grade it
Detour: see what Whisper makes of raw meows

The goal

Find out, by experiment, whether a general-purpose audio AI can understand cat voices — judged by an actual score, not a feeling.

Result first: this AI could not read cat feelings

On a 3-choice quiz, its accuracy was 23.3% — below random guessing (33.3%).

As I dug in, it turned out the AI wasn't really listening to the meows in the first place. Here's what happened.

Today's material: 440 meows from 21 cats

I used CatMeows, a research dataset of cat vocalizations (credit at the end of this post). Every meow comes with an answer label — the situation it was recorded in.

Label	Situation	Count
Brushing	Being brushed by the owner	127
Waiting for food	Food is being prepared	92
Isolation	Left alone in an unfamiliar room	221

Experiment 1: first, a speech-to-text AI

First, a detour: what happens if you hand raw cat meows to Whisper, the speech-to-text AI?

In Japanese mode, 10 of the 12 meows I fed it came back as 「ご視聴ありがとうございました」("Thank you for watching!"), plus one 「チャンネル登録をお願いいたします。」("Please subscribe to my channel.").

Whisper was trained on large amounts of web audio paired with transcripts, which very likely includes subtitled videos, so unrecognizable sounds tend to come back as the stock phrases that end videos.

Experiment 2: a 90-question quiz for the AI

Next, I handed meows one at a time to Qwen2-Audio (an AI that listens to audio directly) and asked:

This is a recording of a domestic cat meowing.
In which situation was this meow most likely recorded?
(A) The cat is being brushed by its owner.
(B) The cat is waiting for food.
(C) The cat is isolated alone in an unfamiliar room.

90 questions total — 30 meows from each situation. Random guessing would score 33.3%.

The result: 23.3% (21 out of 90). It lost to random guessing.

Here's the per-situation breakdown:

Situation	Correct
Brushing	0 / 30
Waiting for food	9 / 30
Isolation	12 / 30

Brushing: zero. The AI never once picked "(A) brushing" in all 90 questions.

Experiment 3: shuffle the option order

"Never picks A" is odd. So I checked: same meows, same question, only the option order swapped — another 90 questions.

If the AI answers by listening, changing the order shouldn't change its answers.

In fact, the answers flipped completely.

	The AI's answers
Before the swap	"isolation" ×50 / "food" ×40
After the swap	almost all "food" (×89)
Option A (first)	0 in both runs (0 out of 180)

The AI was answering by option order, not by the meow.

(The post-swap accuracy of 32.2%, or 29 of 90, only looks better because answering "food" almost every time gets nearly all of the 30 food questions right.)
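A small sketch of the check that exposes this: report the answer distribution next to the accuracy. The `summarize` helper is hypothetical, and the example in the usage note is synthetic, not the actual quiz log.

```python
from collections import Counter


def summarize(truth: list[str], answers: list[str]) -> dict:
    """Accuracy plus the spread of answers, to spot a model stuck on one option."""
    correct = sum(t == a for t, a in zip(truth, answers))
    distribution = Counter(answers)
    top_answer, top_count = distribution.most_common(1)[0]
    return {
        "accuracy": correct / len(truth),
        "distribution": dict(distribution),
        "top_answer": top_answer,
        "top_share": top_count / len(answers),  # near 1.0 means one answer for everything
    }
```

For a synthetic model that answers "food" on all 90 balanced questions, accuracy comes out at one third while `top_share` is 1.0, which is the warning sign.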

Meanwhile, a meow-only AI scored 96%

The research team behind this dataset built a purpose-built meow classifier (2019) that scores 95.9% on the same 3-way task (trained on this data, so not the same conditions as my zero-shot run).

General-purpose audio AI seems to be strong at words, still weak at what non-word sounds mean.

Today's takeaways

Quizzes get gamed: an audio LLM may answer by option position instead of listening. Shuffle the options and rerun to catch it
Don't trust accuracy alone: the 32.2% run was all-in on one answer. Check the answer distribution
Whisper hallucinates: non-speech sounds come back as stock phrases from its training data ("Thank you for watching!")
Niche tasks want specialists: for reading meows, a purpose-built model is the right tool

The details

Models and dataset

Speech-to-text: openai/whisper-large-v3-turbo (MIT license)
Quiz: Qwen/Qwen2-Audio-7B-Instruct (Apache 2.0), run locally via transformers, greedy decoding (do_sample=False) so answers are reproducible
Dataset: CatMeows (CC BY 4.0). 21 cats, 440 clips, 8kHz mono. The first letter of each filename is the ground-truth label (B=brushing, F=food, I=isolation)
Machine: DGX Spark. A 7B audio model fits in memory with plenty of room; a few seconds per question

How I asked the quiz (prompt)

This is a recording of a domestic cat meowing.
In which situation was this meow most likely recorded?
Choose exactly one:
(A) The cat is being brushed by its owner.
(B) The cat is waiting for food.
(C) The cat is isolated alone in an unfamiliar room.
Answer with only the single letter A, B, or C.

30 clips per label, stratified sampling with a fixed random seed → 90 questions
The control run swapped the wording of (A) and (C) only; everything else identical
Grading: extract the first A/B/C that appears in the model's reply
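The sampling and grading steps above can be sketched as follows. This is not the original script: the seed value, the file names, and the function names are placeholders. One deliberate choice is the word boundary in the grading regex, since a bare search for the first A, B, or C would read "Answer: B" as "A".

```python
import random
import re
from collections import defaultdict

LABELS = {"B": "brushing", "F": "food", "I": "isolation"}

def label_of(filename):
    """Ground truth is the first letter of the file name."""
    return LABELS[filename[0]]

def stratified_sample(filenames, per_label=30, seed=0):
    """Pick the same number of clips per label, reproducibly."""
    by_label = defaultdict(list)
    for name in sorted(filenames):   # sort so the seed alone decides the draw
        by_label[label_of(name)].append(name)
    rng = random.Random(seed)
    picked = []
    for label in sorted(by_label):
        # Raises ValueError if a label has fewer than per_label clips.
        picked.extend(rng.sample(by_label[label], per_label))
    return picked

def grade(reply):
    """Return the first standalone A/B/C in the reply, or None."""
    match = re.search(r"\b([ABC])\b", reply)
    return match.group(1) if match else None

# Hypothetical file names; only the first letter matters here.
files = [f"{letter}_clip{i:03d}.wav" for letter in "BFI" for i in range(100)]
sample = stratified_sample(files, per_label=30, seed=0)

print(len(sample))          # 90
print(grade("Answer: B"))   # B
```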

Environment notes

I reused an existing Python environment and added librosa for audio loading, which bumped numpy past what another library wanted (the core libraries kept working). Lesson re-learned: a separate environment per experiment is the safer way
The recordings are 8kHz (phone quality); the models expect 16kHz, so audio is resampled on load. Could that hurt? Maybe — but the specialized model hit 96% on the same 8kHz clips, so it's not much of an excuse

Sources

Meow dataset: CatMeows: A Publicly-Available Dataset of Cat Vocalizations (Zenodo, DOI: 10.5281/zenodo.4008297, CC BY 4.0)
Original paper: Ntalampiras et al., "Automatic Classification of Cat Vocalizations Emitted in Different Contexts," Animals 9(8), MDPI, 2019 (the 95.9% figure is from this paper)

Outro

I set out to have an AI listen to cats' feelings, and instead found an AI that answers by option order. The day it understands my own cat's meows still seems a way off.

Next up: a completely different experiment. Thanks for reading!

#100ExperimentsWithDGX #LocalLLM

[Day 13] I got a cat to "talk." The biggest wall: the AI couldn't recognize the cat's face

PEPPERCORN — Fri, 03 Jul 2026 07:06:10 +0000

Intro

Day 13!

Today's experiment: take a single cat image, lay human facial motion on top of it, and make a "talking cat." It's the usual "talking avatar" idea, except I use a cat instead of my own face. The tool is LivePortrait (still image + a "driving video" of motion → it transfers the video's expressions onto the still).

The result: a cat that properly talks. The hard part wasn't the animation — it was the step before it, getting the AI to recognize the cat's face. Here's where it snagged, and how I got past it.

What I used: DGX Spark (my home AI machine) / LivePortrait / one cat image (AI-generated) / a driving video (ships with LivePortrait).

Result first: a talking cat

The mouth and eye movement from a human talking-video landed on the cat's face. But there were a few snags on the way here.

Snag #1: it won't recognize a "face" at all

To make a cat talk, there's a first step: find where the face is in the image (the position of eyes, nose, mouth) — the face detector.

The tool I reached for first didn't have a single one installed, so it stopped with an error every time. I added detectors and tried again, but the answer didn't change:

Detector I added	Commercial use	Recognized the cat's face?
MediaPipe	OK	❌ no
InsightFace	Not allowed (non-commercial)	❌ no

None could recognize the cat's face. They're not broken — they're all built to find human faces, so a cat's face doesn't register as a "face."

Snag #2: why it couldn't recognize the face

The "animal mode" of the tool I started with only swaps the motion part for an animal version — the crucial "find the animal's face" detector was never bundled in. All that's left is the human one.

That was why work had stalled here last time, too. Not disk space, not the GPU — just a tool that couldn't look for an animal's face.

The fix: use the original tool, and skip the build

The upstream (original) LivePortrait does ship an animal-specific face detector, called XPose. So I set it up in a separate folder and used that.

The catch: XPose normally needs you to compile (build) a part on your own machine, and on this new-generation machine there was no guarantee the build would go through. So, reading the code, I found a slower but no-build spare part tucked inside. I rewrote three files to route to it, dodging compilation entirely. For a short clip, the slowness doesn't matter.

The exact files and edits are in "The details" at the bottom.

It worked — but at first it just looked like the cat stuck its tongue out

The detector ran, the cat's face was recognized, and a video generated. But the first result looked like the cat just gave a little tongue-out blep. The expression transfer was clearly working — so why?

The cause was the driving video (= the input you feed the AI). My first one was a short "just open the mouth" sample — and if the reference only opens its mouth, the cat only opens its mouth. Swapping in a video of someone actually talking gave me a cat whose eyes and mouth both moved.

The quality of the driving video pretty much decides the result.

Today's takeaway

The hard part isn't the motion engine, it's recognizing the animal's face.
A tool's "animal support" can be a label with the actual part (the detector) missing. When it won't run, tracking down which part is missing is the fast path.
If you hit a "must compile" part on a too-new machine, look first for a no-build fallback route.
The quality of the result is mostly decided by the quality of the input (the driving motion).

A note on licensing

The detectors in the setup that finally worked (XPose / InsightFace) both come with non-commercial licenses. So I avoid commercial use of the footage itself, and this article keeps the focus on the method and the gotchas.
The commercially-OK detector (MediaPipe) couldn't recognize the cat this time.

The details

What was missing, and how it was solved

The "animal mode" of the ComfyUI node I used first only swaps in the animal motion model; the animal face detector (XPose) is not bundled. Human detectors (InsightFace / MediaPipe / FaceAlignment) can't detect a cat's face, so it stops at No face detected.
The fix: set up upstream KwaiVGI/LivePortrait in a separate folder, fetch the official weight set (including xpose.pth), and use inference_animals.py.
The InsightFace and landmark models I already had could be reused.

The no-compile patch for XPose

XPose is built to compile its own CUDA custom op called MultiScaleDeformableAttention. On the newest GPU/CUDA generation there's no guarantee that build succeeds, so I routed it to the bundled pure-PyTorch fallback instead.

Three files edited (under XPose's ops/):

functions/ms_deform_attn_func.py: wrap the compiled-version import in try/except, set the flag to False on failure.
modules/ms_deform_attn.py: when that flag is False, branch forward through the pure-PyTorch ms_deform_attn_core_pytorch.
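(if needed) add weights_only=False to the torch.load call in animal_landmark_runner.py. Recent PyTorch releases (2.6 and later) default to weights_only=True, which can refuse to load older pickled checkpoints.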

Now animal detection runs with no compilation at all. It's slower, but for a short clip it's fine (one generation finished in ~8 seconds).
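The "try the compiled part, fall back if it is missing" idea looks roughly like this in general. This is a generic sketch of the pattern, not XPose's actual code: the module name hypothetical_compiled_op and all function names are invented for illustration.

```python
import importlib

def load_optional(module_name):
    """Import a module if possible; return None instead of raising."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None

def weighted_sum_fallback(values, weights):
    """Stand-in for the slower code path that needs no compilation."""
    return sum(v * w for v, w in zip(values, weights))

def make_op(module_name):
    """Choose the compiled op when it imports, otherwise the fallback."""
    compiled = load_optional(module_name)
    if compiled is not None and hasattr(compiled, "weighted_sum"):
        return compiled.weighted_sum, True
    return weighted_sum_fallback, False

# "hypothetical_compiled_op" does not exist, so the fallback is selected.
op, uses_compiled = make_op("hypothetical_compiled_op")
print(uses_compiled, op([1.0, 2.0], [0.5, 0.25]))  # False 1.0
```

The caller never needs to know which path it got, which is what makes a patch like this small: one flag set at import time, one branch where the op is called.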

Environment gotchas

Each time I swapped detectors, the base library (numpy) version see-sawed (mediapipe wants numpy<2, insightface wants 2.x). The existing core (cv2/torch) survived, but the clean approach is to keep upstream LivePortrait in its own isolated environment.
When stopping the server, killing by process name took out my own command too, presumably because the name pattern also matched the command I was running. Stopping by port number was safe.
Disk and GPU had plenty of headroom the whole time — not once was the snag a resource shortage.

Next up

Next time I'm switching things up again with a different kind of experiment 🎬

#100ExperimentsWithDGX #LocalLLM

[Day 12] I tried to build a line-art LoRA from video frames, and the characters' heads fused together

PEPPERCORN — Tue, 23 Jun 2026 00:41:38 +0000

Intro

Day 12!

This time I tried to take my own hand-drawn animation (a short video) and build a
line-art LoRA that learns its art style and characters.

The plan was a little lazy, honestly. The usual way to train this is to prepare
character stills one by one, by hand. But I thought:
"I already have the video — why not just rip frames out of it and collect the
training material the easy way?"

Short version: the lines got clean, but the one thing that mattered — actually
reproducing my characters — failed completely. And the reason it failed is what I
actually took away from today.

What I used: my home AI machine (DGX Spark) + a training tool (Kohya) + my own
hand-drawn animation (two characters).
Note: everything shown here is LoRA-generated line art only. I'm not showing the
source video itself or where it's published.

Result first: failure on the left, and the right is also a failure

Left is the first attempt. A second body grows upside-down out of the top of the head.
Right is after I tracked down the causes and rebuilt the data — the lines came out
clean. But as you'll see, it's also a failure: it looks nothing like my original
characters — it's a totally different person.

How can "the breakage got fixed" still be a failure? Let me walk through it.

What I did: rip frames from the video and train (v1)

Simple steps:

Extract still frames from my hand-drawn animation video
Roughly select ~300 of them as training material
Train a LoRA on them

The training itself took 17 minutes on the DGX. Lightning fast.
"Oh, this is easy," I thought — for about five minutes.

Then I generated, and it was a mess

Asking the finished LoRA for "single character" and "two-person scenes" gave me this:

Symptom	How it broke
Fused heads	A second body sprouts from the head / multiple faces merge into one
Backgrounds won't go away	Even asking for "white background," blue/pink backgrounds show up
Thick, muddy lines	No clean line art, everything is heavy and blurry
Ghost text	Meaningless characters (leftover captions) get baked into the image

It could tell the characters apart (A and B were recognized as different people).
But the looks were wrecked. Lined up, the actual outputs were quite the horror show:

▲ When two characters show up, you can't tell what's what anymore

▲ Faces fuse and multiply

▲ Leftover captions bake in as "ghost text" all over the frame

▲ Train on high-motion frames and everything just melts

Why did it break? (the real point)

The culprit was using video frames as the source itself.

A video, if you think about it, is footage with many things happening at once. Rip a
frame out of it and you don't just learn the character's shape — you learn all the
surrounding noise too.

Symptom	Cause
Fused heads	Video has lots of frames where two people move in the same shot. The model learns an instant where bodies overlap as "one single body"
Backgrounds stick	Tons of background-laden frames get in. It learns "character = with this colored background" as a set, and you can't override it later
Thick lines	Mid-motion frames are blurred; that blur bakes in as a "thick-line style"
Ghost text	Caption text sitting on the frames sneaks into the material and gets learned

In one sentence: video is noise-laden material — motion, overlap, backgrounds, text
all baked in — and it's a poor way to cleanly extract just a character's shape.

▲ Even asking for "white background," the training background color (pink) won't peel off — and the character multiplies for good measure

Fixing it (v2)

Now that I knew the causes, I rebuilt the data side and trained again.

Auto-remove frames with caption text
Drop "no character" frames — pure backgrounds, transition frames (~300 → 141 frames)
Split into three groups — "A only," "B only," "the two together" — to stop the characters from bleeding into each other
Switch the base model to an anime one (good at line art) and tune the settings
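The data-side steps in this list (drop text frames, drop empty frames, split into three groups) can be sketched as a small filter. The post does not say which tagger produced the per-frame labels, so the tag names "text", "char_a", and "char_b" and the sample frames are hypothetical.

```python
def assign_group(tags):
    """Return 'A only', 'B only', 'both', or None if the frame is dropped."""
    if "text" in tags:   # caption text on screen
        return None
    has_a, has_b = "char_a" in tags, "char_b" in tags
    if has_a and has_b:
        return "both"
    if has_a:
        return "A only"
    if has_b:
        return "B only"
    return None          # pure background or transition frame

def split_frames(frame_tags):
    """Sort kept frames into the three training groups."""
    groups = {"A only": [], "B only": [], "both": []}
    for frame, tags in frame_tags.items():
        group = assign_group(tags)
        if group is not None:
            groups[group].append(frame)
    return groups

# Hypothetical tagger output: frame name -> set of detected tags.
frame_tags = {
    "f001.png": {"char_a"},
    "f002.png": {"char_a", "text"},
    "f003.png": {"char_b"},
    "f004.png": {"char_a", "char_b"},
    "f005.png": set(),
}
groups = split_frames(frame_tags)
print({name: len(frames) for name, frames in groups.items()})
# {'A only': 1, 'B only': 1, 'both': 1}
```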

The result:

Aspect	v1 (first)	v2 (rebuilt)
Fused heads	✗ frequent	✓ gone
Thick/muddy lines	✗	✓ thin and clean
Line-art look	△	✓✓ clearly line art
Backgrounds	✗	△ white now, but color bleeds onto clothes
Ghost text	✗	△ far less, a little remains
Resemblance to my characters (the whole point)	✗	✗ a different person — sometimes missing an arm

Look at just the top of that table and you think "oh, it's fixed!" I did too, for a second.
The noise problems were largely fixed: fused heads and thick lines are gone, and the
background is white now (though color still bleeds onto clothes).

But look closer and there's no trace of my original characters. The lines are clean,
but what comes out is "some vaguely anime-style stranger." The worst ones are even missing
an arm.

▲ The lines are clean, sure — but it looks nothing like my original character. A stranger.

▲ Stable as a drawing. Still zero resemblance (and color still bleeds onto the clothes)

The real wall I never got past

Even with the noise gone, the one thing I actually wanted — reproducing my own
characters — was completely out of reach. All I got was a clean-looking stranger.

And since even a single character came out this much of a different person, "two of them
together, in a scene with a relationship" was even more hopeless. Ask for the two of them
and only one shows up, or it falls apart.

I chased down why two-person was especially bad, too:

There were only 9 real frames in the entire set where the two were naturally side by side
I tried to get more by re-checking another 478 frames, but nearly every "two-person" hit was a false positive (the detector reacting to on-screen text or body fragments)
→ In other words, you cannot grow "two-person scenes" out of video material

If the composition you want (the two of them cleanly together) doesn't happen to exist in
the video, you can't extract it after the fact. Obvious in hindsight — but it really sank
in once I'd hit the wall.

Today's takeaway

Ripping frames from a video teaches the model "vaguely anime-ish" at best.
It couldn't reproduce my characters even for a single figure (let alone two / a
relationship). In the end, you have to hand-draw the stills you want it to learn.

This isn't a sour-grapes conclusion — it's the answer after exhausting every way to grow
the data. I took the long way around trying to be lazy, but because of it I now
understand, first-hand, why hand-drawn stills are necessary.

Someday I'll act on this conclusion and prepare the compositions by hand — but that's
a project for another day.

The details

Training settings (v1 → v2)

Item	v1	v2
Base model	SD1.5 (plain)	anime-style (good at line art)
clip_skip	1	2
Data	~300 lumped together	split into 3 groups (68 / 64 / 9)
Epochs	2	3
LoRA dim / alpha	32 / 16	32 / 16 (kept)
Training time (DGX)	~17 min	~14 min

In v2 I pinned a required character-name tag to the front of each group's captions
(keep_tokens) to suppress the characters bleeding into each other.
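As I understand Kohya's keep_tokens setting, it protects the first N comma-separated caption tokens when captions are shuffled during training. The sketch below only illustrates that idea and is not Kohya's implementation; the tag names are hypothetical, and the post does not say whether caption shuffling was enabled.

```python
import random

def build_caption(character_tag, other_tags):
    """Put the required character tag first so keep_tokens can protect it."""
    return ", ".join([character_tag] + list(other_tags))

def shuffle_caption(caption, keep_tokens, rng):
    """Shuffle comma-separated tags but leave the first keep_tokens in place."""
    tokens = [token.strip() for token in caption.split(",")]
    fixed, rest = tokens[:keep_tokens], tokens[keep_tokens:]
    rng.shuffle(rest)
    return ", ".join(fixed + rest)

rng = random.Random(0)
caption = build_caption("chara_a", ["lineart", "monochrome", "white background"])
shuffled = shuffle_caption(caption, keep_tokens=1, rng=rng)
print(shuffled)  # the character tag always stays first
```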

Why "two-person scenes" couldn't be grown

I re-tagged another 478 video frames looking for "two people in frame." Co-occurrence
flagged 25, but on full-resolution inspection almost all of them held only one person —
the tagger was misfiring on on-screen text labels and body fragments. The real "two
together" frames were the 9 I'd hand-picked at the start, basically the whole supply.

What's still left (the homework)

Color bleeding onto clothes (likely from color tags in some groups)
Leftover ghost text (a little text-like noise remains)
And the big one: reproducing the characters at all. Neither single figures nor pairs actually look like "my" characters → hand-draw the compositions I need

Next up

Next time I'm switching things up with a completely different experiment 🎬

#100ExperimentsWithDGX #LocalLLM

[Day 11] I turned my cat into anime art — and the AI drew a human girl instead. One photo through IPAdapter pulls it back to a cat

PEPPERCORN — Thu, 04 Jun 2026 04:13:23 +0000

Intro

Day 11! Back to cats 🐱

I took one photo of my cat (a black-and-white tuxedo boy) as a reference and had AI restyle him into anime, ukiyo-e, oil painting, and more.

The goal: change only the style while keeping "my cat" recognizable. But left alone, the AI started drawing humans instead of a cat. Here's what I did, step by step.

What I used: my home AI machine (DGX Spark) + an image-generation tool (ComfyUI) + one photo of my cat.

The reference is this one photo

A tomcat my family looks after for me, with yellow eyes and a slightly grumpy look.

Love that face. I'll turn him into various styles while keeping him recognizable as "my cat."

First, anime from text alone → a human

I started with no photo, just text: "a tuxedo cat, anime key visual." I clearly said cat.

Here's what came out. …A human girl.

Black hair, white collar. My cat's tuxedo pattern (black body, white chest) turned straight into clothing.

Next, I added the reference photo → still human

So I hand over the cat photo as a visual reference. The tool that applies it is IPAdapter.

What's the reference-photo trick (IPAdapter)? A tool that lets you pass a reference image, separate from the text prompt, and say "make it look like this." It's what preserves my cat's colors and face.

Surely this makes it a cat… nope. Still human.

And this habit wasn't limited to anime. Ask the same anime-style model for ukiyo-e or oil painting, and you still get anime-ish humans. It hijacks not just the subject (the cat), but the art style too.

Left: an "ukiyo-e" that's really an anime woman in a kimono. Right: an "oil painting" that's an anime woman in a tuxedo. Both are "humans painted in the cat's colors."

I tuned the settings → finally a cat

On top of the photo, I turned up its strength and added "don't draw humans" to the negatives (details below). That's when it finally became a sitting cat.

Why does it turn into a human?

Two reasons, as far as I can tell.

One: anime-savvy models tend to draw people, girls especially. Even with "cat" in the prompt, they drift toward a human if you let them.

Two: my cat's pose. He sits bolt upright, almost like a person, so the harder you push the reference, the more that upright posture rides along — tipping toward an "anthropomorphized" cat. The pop-art piece later is exactly that leftover.

Cyberpunk flipped to a cat with the photo alone

The interesting part: whether the photo alone was enough depended on the model. Anime was stubborn and needed tuning, but cyberpunk became a cat just by adding the photo.

Left (no reference): a human man in a neon city. Right (with reference): a cat with glowing ears.

I didn't change a single character of the prompt — the photo being there or not is the only difference between human and cat.

The styles that came out

Here's the gallery after the human problem was fixed — all with the reference photo, my cat as the base.

Top row, left to right: anime, ukiyo-e, oil painting (Van Gogh-ish), stained glass. Bottom row: cyberpunk, 3D (Pixar-ish), pop art.

"Likeness" and "style" are a tug-of-war

The oddly real 3D Pixar one shows this little trade-off nicely.

Left (no reference): a cute 3D cat, but "some cat." Right (with reference): it becomes my cat's face, but the 3D look washes out into basically a real photo.

Weaken the reference and the style shows but it's a different cat; strengthen it and it's my cat but the style fades. Finding the balance point for each style is what the tuning really is.

The boss I couldn't beat: storybook watercolor

"Gentle storybook watercolor" was the one style I never got to be a cat. Here's the result of seven retries.

A human, then somehow two cats, then a cat-eared girl holding a cat. "Single + watercolor + cat" wouldn't line up. Lower the reference → human; raise it → two cats. My guess is that "storybook" is soaked in human imagery. I'm carrying this one over.

The details

Here are the details.

The reference-photo mechanism (IPAdapter)

I added a custom node called ComfyUI_IPAdapter_plus to ComfyUI. It lets you hand over a reference image as a "visual guide," separate from the text prompt.

Model used: ip-adapter_sd15 (44.6MB, from h94/IP-Adapter)
The part that reads the image features: CLIP-ViT-H (reused an existing one)
The reference photo is cropped to a 768px square before handing it over

A number called the "reference strength (weight)" controls how closely it mimics. I moved between roughly 0.7 and 0.85 depending on the style.

What I did to suppress the "human" problem

I started at weight 0.7 plus words like "key visual" and "big eyes," which strongly invited humans. Three fixes:

Raise the reference strength to 0.85
Add human, girl, person, 1girl, humanoid to the "things I don't want drawn" list
Strip human-summoning words from the request and emphasize tuxedo cat, full body, animal

That corrected anime, ukiyo-e, and oil painting into cats. One catch: the phrase "tuxedo cat" itself tends to put an actual tuxedo (a suit) on the cat, so it cut both ways.
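Put together, the three fixes amount to a small rewrite of each request. Here is a rough sketch; the word lists and the 0.85 weight come from the text above, while the function name and the dictionary keys (such as ipadapter_weight) are made up and are not ComfyUI field names.

```python
HUMAN_WORDS = ("key visual", "big eyes")   # words that invited humans
NEGATIVE_TAGS = ["human", "girl", "person", "1girl", "humanoid"]
EMPHASIS = ["tuxedo cat", "full body", "animal"]

def strip_words(part):
    """Remove the human-summoning words from one comma-separated chunk."""
    for word in HUMAN_WORDS:
        part = part.replace(word, "")
    return " ".join(part.split())

def fix_request(prompt, weight=0.7):
    """Apply the three fixes: stronger reference, negatives, cleaner prompt."""
    kept = [strip_words(part) for part in prompt.split(",")]
    kept = [part for part in kept if part]
    for tag in EMPHASIS:
        if not any(tag in part for part in kept):
            kept.append(tag)
    return {
        "positive": ", ".join(kept),
        "negative": ", ".join(NEGATIVE_TAGS),
        "ipadapter_weight": max(weight, 0.85),
    }

request = fix_request("a tuxedo cat, anime key visual, big eyes")
print(request["positive"])  # a tuxedo cat, anime, full body, animal
```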

The base models I used

I switched the underlying image model by style.

Anime / illustration: AnythingV5
Realistic / 3D: Realistic Vision V6
Plain base: SD 1.5 (base)

When storybook failed, switching to the plain base gave a real cat but weak watercolor feel, and raising the strength split it into two cats — a real bind. The base model's "habits" matter a lot.

Common generation settings

Across all styles: 768px, 30 steps, sampler dpmpp_2m karras, cfg 7, seed fixed at 110011. I only varied the text request and the reference strength, keeping everything else equal for a fair comparison. Generation is fired at ComfyUI from a small script I wrote.

Next up

Next time it's cats again — and this time I'm planning video generation 🐱

#100ExperimentsWithDGX #LocalLLM

[Day 10] Building my own personal weather officer AI, and teaching it my body's sense of cold over the next 100 days

PEPPERCORN — Mon, 01 Jun 2026 01:31:58 +0000

Intro

Day 10!

This time I'm starting a longer-running experiment. Meet the "weather officer AI" — a bot that texts me every morning saying "wear this today." The plan is to build a weather assistant that's tuned to me.

What I'm building today is just v0.1 (the very first version). From here through Day 100, I'll keep teaching it "too cold / just right / too warm" every morning, so it gradually learns my preferences. The experiment is: how smart does it get after 100 days?

What I used: my home AI machine (DGX Spark) + free weather data + a phone messaging app (Telegram)

Today's task

What I wanted

I live somewhere with a big daily temperature swing, and "what do I wear today?" is a small but real daily headache. Weather apps tell you the temperature, but whether I feel cold is a different question.

So the starting point was: can I build a clothing AI that's tuned to "how I feel," not just "the temperature"?

Approach

I kept the design dead simple.

Every morning at 7, automatically fetch today's weather
Decide "this morning's outfit" from the apparent temperature and push it to my phone
I just tap back "cold / just right / warm"
As these feelings pile up, the AI learns "this person runs cold" and corrects its suggestions

The goal this time

Not a "perfect forecast AI," but a "routine I can actually keep up every day." The smarts get grown over the next 100 days. Today is just laying the rails.

📊 How much does the temperature actually move in a day?

Before building anything, I pulled a week of apparent temperatures for where I live and graphed it.

A beautiful zigzag. Every day repeats "cold in the morning → way up by midday → down again at night." The average daily swing is 13°C, and on the biggest day it moved 20°C.

So I narrowed the suggestion down to "one outfit, at 7 a.m., matched to the apparent temperature at that hour." I record just once in the morning too. To keep something up for 100 days, simplicity matters most.

The AI doesn't learn the temperature swing itself. But by deciding to "focus on the morning," the suggestion and the feedback line up in time, so later I can cleanly check "was the morning suggestion right?"

🔧 How it works (the morning loop)

The finished weather officer runs on this loop.

At 7 a.m. it grabs the weather, decides the outfit, and pushes a notification. I tap back my feeling, and that gets recorded. Once those records pile up, step 5 — learning "you run cold / warm" — kicks in, and the suggestions gradually become mine.

Here's what the actual notification looks like.

☀️ Weather Officer AI — good morning

👕 This morning's outfit: long sleeves

This morning feels like: 13°C (highs up to 20°C today)
Rain: 4%  /  Wind: 13 km/h

How does it feel this morning? ↓ tap to tell me
   [🥶 cold]  [😊 just right]  [🥵 warm]

It suggests one outfit, but adds "highs up to 20°C today" so I can decide whether to throw on a layer myself. Tapping a button changes it to "✅ recorded," and the feeling is saved to my home AI.

The notifications go through the messaging app I already use (Telegram).

🛠️ The details

Below are the specifics.

The weather data

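Weather comes from Open-Meteo, a free weather API. No API key is needed, historical data is available, and the data is licensed under CC BY 4.0. As far as I can tell, the free tier is meant for non-commercial use and commercial use goes through a paid plan, which is fine for a personal bot like this.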

I mostly use "apparent temperature" — not the raw air temperature, but a number adjusted for wind and humidity to reflect how it actually feels, which is better for deciding what to wear. I take the average apparent temperature from 7–9 a.m. as "this morning's feel." The coordinates stay only on my home machine; I don't write the specific location in the article or the code.
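Averaging the morning hours might look like this. The sketch assumes the response carries parallel "time" and "apparent_temperature" lists under "hourly", which is how Open-Meteo returns hourly variables, and that the timestamps are already in local time (that depends on the timezone parameter of the request). Treating "7–9 a.m." as the hours 7, 8, and 9 is my reading, and the sample values are made up.

```python
from datetime import datetime
from statistics import mean

def morning_feel(hourly, day, hours=(7, 8, 9)):
    """Average apparent temperature over the given local hours of one day."""
    values = []
    for stamp, temp in zip(hourly["time"], hourly["apparent_temperature"]):
        moment = datetime.fromisoformat(stamp)
        if temp is not None and moment.date().isoformat() == day and moment.hour in hours:
            values.append(temp)
    if not values:
        raise ValueError(f"no morning data for {day}")
    return mean(values)

# Made-up sample in the shape of the "hourly" block.
sample = {
    "time": ["2026-06-01T06:00", "2026-06-01T07:00", "2026-06-01T08:00",
             "2026-06-01T09:00", "2026-06-01T10:00"],
    "apparent_temperature": [10.0, 12.0, 13.0, 14.0, 17.0],
}
print(morning_feel(sample, "2026-06-01"))  # 13.0
```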

The clothing rule

A plain rule that splits apparent temperature into 7 bands, each mapped to an outfit (e.g. 13–20°C → long sleeves, 20–26°C → short sleeves).

There's one "personal offset" number baked in. It's zero for now (v0.1). Going forward, if "cold" keeps coming back, I'll push the offset negative so the same temperature suggests warmer clothes — growing it from the feedback.
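A band rule with a personal offset can be sketched like this. Only the 13–20°C and 20–26°C bands come from the post; every other boundary and outfit name is a placeholder, and which band owns an exact boundary value such as 20°C is my own choice.

```python
# (upper bound in °C, outfit). Seven bands, checked from coldest to warmest.
BANDS = [
    (0, "heavy coat"),                     # placeholder
    (7, "winter jacket"),                  # placeholder
    (13, "light jacket"),                  # placeholder
    (20, "long sleeves"),                  # from the post: 13-20
    (26, "short sleeves"),                 # from the post: 20-26
    (30, "short sleeves, light fabric"),   # placeholder
    (float("inf"), "as light as possible"),  # placeholder
]

def suggest_outfit(apparent_temp, personal_offset=0.0):
    """Map an apparent temperature to an outfit.

    A negative offset makes the same temperature count as colder,
    so the suggestion shifts toward warmer clothes.
    """
    adjusted = apparent_temp + personal_offset
    for upper, outfit in BANDS:
        if adjusted < upper:
            return outfit
    return BANDS[-1][1]

print(suggest_outfit(13))                      # long sleeves
print(suggest_outfit(21))                      # short sleeves
print(suggest_outfit(21, personal_offset=-2))  # long sleeves
```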

The notification and buttons

The notification uses a Telegram "Bot" (a thing that sends messages automatically), built with the python-telegram-bot library to:

send a message every morning at 7
attach three buttons below it and record which one is pressed

This bot sits waiting inside my home AI machine and fires in the morning.

The shape of the feeling log (the 100-day foundation)

Each record is one line per day, with these fields:

date
that day's forecast (morning/daytime apparent temperature, wind, rain chance)
the outfit the AI suggested
my feeling (cold / just right / warm)

Because every record keeps this same shape, I can later graph "monthly hit rate" or "my personal bias." I'll report this trend in the Day 21 / 36 / 54 / 74 / 87 check-ins.
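One simple way to keep "one line per day" is a JSON Lines file. The sketch below is an assumption about the storage format, which the post does not specify; the field names, file name, and the cold_bias helper are illustrative. The forecast numbers are the ones from the sample notification.

```python
import json
import tempfile
from pathlib import Path

FEELINGS = ("cold", "just right", "warm")

def append_record(path, date, forecast, outfit, feeling):
    """Append one day's record as a single JSON line."""
    if feeling not in FEELINGS:
        raise ValueError(f"unknown feeling: {feeling!r}")
    record = {"date": date, "forecast": forecast, "outfit": outfit, "feeling": feeling}
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record

def load_records(path):
    """Read every non-empty line back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

def cold_bias(records):
    """Share of 'cold' taps minus share of 'warm' taps (positive = runs cold)."""
    if not records:
        return 0.0
    cold = sum(r["feeling"] == "cold" for r in records)
    warm = sum(r["feeling"] == "warm" for r in records)
    return (cold - warm) / len(records)

# Demo in a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as folder:
    log = Path(folder) / "feeling_log.jsonl"
    forecast = {"morning_feel_c": 13, "daytime_high_c": 20, "wind_kmh": 13, "rain_pct": 4}
    append_record(log, "2026-06-01", forecast, "long sleeves", "cold")
    append_record(log, "2026-06-02", forecast, "long sleeves", "just right")
    records = load_records(log)

print(len(records), cold_bias(records))  # 2 0.5
```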

Keeping it running

So that the 7 a.m. notification fires reliably, I set the bot up as a systemd service that launches automatically when the machine boots. Even after a reboot it comes back on its own, and the morning notification keeps going.

The feeling log and the notification settings (the Telegram token, etc.) are all stored only on my home machine — none of it goes anywhere external like GitHub.

The 100-day growth plan

I'll keep growing this weather officer across the series — it'll pop up here and there.

Milestone	What happens
Day 10 (now)	v0.1 done; the recording loop starts turning
Day 21	First "my personal bias" report from 11 days of data
Day 36	Graph the monthly hit rate and take a look
Day 54 / 74 / 87	Mid-reviews: seasonal changes in feel, how the correction works
Day 100	The 100-day accuracy trend and the finished version

Just tap a button every morning. I'm curious how "mine" it'll feel after 100 days.

Next time: Day 11

Next time it's a hard pivot back to cats 🐱 I'll convert photos of my cat into picture-book, anime, and photorealistic styles — the theme being whether I can keep "that's-our-cat-ness" while changing only the style.

#LocalLLM #100ExperimentsWithDGX

[Day 9] A local Japanese sentiment AI (BERT) read 8 years of a LINE chat, and the ups and downs surfaced from numbers alone

PEPPERCORN — Fri, 29 May 2026 22:39:10 +0000

Intro

Day 9. Today is less about model internals and more of a personal experiment: have a local AI analyze the entire chat history with one LINE friend. (LINE is the dominant messaging app in Japan.)

When I exported it, 8 years were sitting there — from the very first message to today. It started, we talked a lot, it went quiet for a while, then picked up again. That whole arc is in there.

Because the content is what it is, nothing left my machine: everything ran locally on my DGX Spark.

What I used: my home AI box (DGX Spark) + a Japanese sentiment model (for tone) + a bigger local model (to guess events from numbers).

Today's setup

What I wanted to do

Re-reading 8 years of messages one by one isn't realistic. So instead of reading the content, I looked only at the "shape" of the conversation — when, how much, and in what tone we talked.

Concretely:

monthly message volume
the trend of tone (positive / negative)
then asking an AI to find "when something big happened"

Heads-up (the result)

From message counts and tone alone, the 8-year arc came out clearly on a chart. Started, went quiet, came back — the flow was visible without me re-reading a thing.

🔧 Pipeline

LINE chat export (text)
        │
        ▼
 1. Parse: split each message into {datetime, who, type, text}
        │   (from here on, message text never leaves the machine)
        ▼
 2. Aggregate: monthly counts, time-of-day, reply gaps
        │
        ▼
 3. Tone scoring: classify each of 66k messages pos/neu/neg
        │
        ▼
 4. Turning-point detection: from sudden changes in the numbers
        │   + also show ONLY the numbers to a bigger AI and ask it to guess
        ▼
 5. Answer check: compare against the real timeline

You can export a LINE chat as text from the chat screen ("send chat history").
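Steps 1 and 2 of the pipeline can be sketched as below. The export layout is an assumption: a date line such as 2018/01/05(Fri), followed by tab-separated "time, name, message" lines, with stickers and photos shown as bracketed markers. The real layout varies with app language and version, so the patterns would need adjusting.

```python
import re
from collections import Counter
from datetime import datetime

DATE_LINE = re.compile(r"^(\d{4})/(\d{1,2})/(\d{1,2})")
MESSAGE_LINE = re.compile(r"^(\d{1,2}):(\d{2})\t([^\t]*)\t(.*)$")
TYPE_MARKERS = {"[Sticker]": "sticker", "[Photo]": "photo"}  # assumed markers

def parse_export(text):
    """Split an exported chat into {datetime, who, type, text} records."""
    messages, current_date = [], None
    for line in text.splitlines():
        date_match = DATE_LINE.match(line)
        if date_match:
            current_date = tuple(int(g) for g in date_match.groups())
            continue
        msg_match = MESSAGE_LINE.match(line)
        if msg_match and current_date:
            hour, minute, who, body = msg_match.groups()
            messages.append({
                "datetime": datetime(*current_date, int(hour), int(minute)),
                "who": who,
                "type": TYPE_MARKERS.get(body, "text"),
                "text": body,
            })
        elif messages and line:
            messages[-1]["text"] += "\n" + line  # continuation of a multi-line message
    return messages

def monthly_counts(messages):
    """Count messages per YYYY-MM."""
    return Counter(m["datetime"].strftime("%Y-%m") for m in messages)

sample = (
    "2018/01/05(Fri)\n"
    "08:12\tA\tGood morning\n"
    "08:13\tB\t[Sticker]\n"
    "2018/02/01(Thu)\n"
    "07:45\tA\tline one\n"
    "line two\n"
)
messages = parse_export(sample)
print(dict(monthly_counts(messages)))  # {'2018-01': 2, '2018-02': 1}
```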

Data size:

Item	Value
Span	~8 years 2 months
Total messages	87,621
Text messages	66,329
Stickers	15,605
Photos	3,982

15,605 stickers… that's a lot.

The two AIs

Step	Model	What it does	What it sees
3. Tone	Japanese sentiment model (`koheiduck/bert-japanese-finetuned-sentiment`)	scores each message pos/neu/neg	66k message texts (scores averaged per month)
4. Turning points	a bigger local model (`Qwen2.5` 72B)	guesses "what happened to these two?"	only the per-month table of counts + tone scores (no conversation, no words)

Both run locally on my own machine.

📊 Results

The 8-year arc of volume and tone

This chart is the highlight. Top: monthly message count. Bottom: tone (up = positive, down = negative). The x-axis is months since the conversation started. (Axis labels are in Japanese.)

Plotted, it isn't a steady climb or a flat line — it splits cleanly into "chapters": ramp-up → an 8-month silence → a second peak → a stable plateau. Four phases, at a glance.

Tone has two peaks of about +0.6, around the start and around when things resumed (overall mean ≈ 0, slightly negative in the later years). The interesting part: in the month before the silence, tone had already dropped to −0.1. The mood dimmed before the volume did.

There are two dips into negative tone. The one before the silence was an "omen." The other is the recent years — not an omen, but the effect of logistics-y messages ("what time are you home?") piling up.

💡 Mini-note: how is "tone" turned into a number?
The scoring is done by a Japanese sentiment model. Roughly:

already trained on lots of Japanese text labeled positive / negative
judges with context, not just by spotting keywords
returns a probability of "positive-ness" / "negative-ness" per message
I used the difference between the two as a per-message score
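In code, the per-message score and the monthly average could look like this. Reading "the difference" as positive probability minus negative probability is my assumption (it matches "up = positive" on the chart), and the probabilities below are invented, not real model output.

```python
from collections import defaultdict
from statistics import mean

def tone_score(probs):
    """Positive probability minus negative probability, in [-1, 1]."""
    return probs["positive"] - probs["negative"]

def monthly_tone(scored):
    """scored: iterable of (YYYY-MM, probs). Returns month -> mean score."""
    buckets = defaultdict(list)
    for month, probs in scored:
        buckets[month].append(tone_score(probs))
    return {month: mean(values) for month, values in sorted(buckets.items())}

# Invented probabilities for three messages.
scored = [
    ("2018-01", {"positive": 0.9, "neutral": 0.08, "negative": 0.02}),
    ("2018-01", {"positive": 0.1, "neutral": 0.8, "negative": 0.1}),
    ("2018-02", {"positive": 0.05, "neutral": 0.25, "negative": 0.7}),
]
for month, score in monthly_tone(scored).items():
    print(month, round(score, 2))
```

A message that is mostly neutral lands near zero, which is why months full of logistics messages drift toward the middle or slightly below it.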

What kinds of messages scored how?

A few actual judgments (short, name- and place-free one-liners):

Message	Verdict
「楽しかったね！」 (that was fun!)	Positive
「これめちゃうまい」 (this is so good)	Positive
「おはようございます」 (good morning)	Neutral
「もうお家？」 (home already?)	Neutral
「全く集中できない」 (can't focus at all)	Negative
「それは悔しいな、、」 (that's frustrating…)	Negative
(a long trip-planning message)	Neutral
(a snappy one-liner sent in a huff)	Negative

Plain happy lines score positive; logistics ("good morning", "home already?") score neutral; tiredness or irritation scores negative. Even long, businesslike planning messages lean neutral.

Mornings are when we talk

Message density by weekday × hour (brighter = more).

A clear concentration at 7–9 a.m.!

Could the AI guess the turning points?

First, the simple method: mechanically pick the points where message volume jumped or dropped, then check against the real timeline.

Real event	Auto-detected timing
When it started	exact match
When it went quiet	exact match
When it resumed	exact match
When it got lively again	a few months off
A big life milestone	hard to detect (barely shows in counts)

Sharp volume changes were nailed. But "a big life milestone" got missed. So I showed the same numbers to the bigger local model and asked "what happened?" Its guesses:

"around when it started" → roughly matches
"a stretch of going silent" → matches the quiet period
"a major life change" → lands just before the real milestone

Rather than hunting for a single spike, it reads the whole sequence of numbers as a "flow," so it could pick up even an event that barely moves the counts.

💡 Takeaways

1. Volume + tone alone reveal the arc

Counts and tone were enough to see the 8-year shape. Silence marks the quiet stretch; a surge marks the resumption — straight off the chart.

2. A local model reads a story out of numbers

Given only monthly numbers, the model inferred even a barely-visible event ("something big around here"), and it lined up with reality. It connects scattered points into one flow.

3. A "negative" tone doesn't mean a bad relationship

The slight negative lean in later years isn't about getting along badly. Logistics messages ("what time are you home?") just don't score high. Low score ≠ trouble. It isn't that sentiment analysis is poor — the scores need to be read together with context.

🛠️ Technical details

Parsing & aggregation

LINE export format is a date header plus time<TAB>name<TAB>text. Multi-line messages (4,987 of them) are merged back into the previous message.
Speakers anonymized as "A" and "B", assigned by message count (no real names in anything public). Temporary group members and system lines excluded.
Messages tagged by type (text / sticker / photo / call / unsent…). Tone uses text only; volume counts use all types.
Aggregation and plotting in Python (pandas / matplotlib).
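
A minimal sketch of the parsing step described above. The exact line shapes are assumptions: I'm guessing date headers start with something like `2020/01/01` and message lines look like `HH:MM<TAB>name<TAB>text`. Real LINE exports differ by app version and language, and `parse_line_export` is a hypothetical name, not the script actually used.

```python
import re
from datetime import datetime

# Assumed line shapes; real LINE exports differ by app version and language.
DATE_RE = re.compile(r"^(\d{4}/\d{1,2}/\d{1,2})")
MSG_RE = re.compile(r"^(\d{1,2}:\d{2})\t([^\t]*)\t(.*)$")


def parse_line_export(lines):
    """Return a list of dicts: {"ts": datetime, "name": str, "text": str}."""
    messages = []
    current_date = None
    for raw in lines:
        line = raw.rstrip("\n")
        msg = MSG_RE.match(line)
        if msg and current_date is not None:
            hhmm, name, text = msg.groups()
            ts = datetime.strptime(f"{current_date} {hhmm}", "%Y/%m/%d %H:%M")
            messages.append({"ts": ts, "name": name, "text": text})
            continue
        date = DATE_RE.match(line)
        if date:
            current_date = date.group(1)
            continue
        # Anything else is treated as a continuation of the previous message.
        if messages and line:
            messages[-1]["text"] += "\n" + line
    return messages
```

One known weakness of this sketch: a continuation line that happens to look like `HH:MM<TAB>x<TAB>y` would be parsed as a new message, and empty continuation lines are dropped.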

Tone (sentiment)

koheiduck/bert-japanese-finetuned-sentiment, a 3-class (pos / neu / neg) Japanese model.
66,329 texts scored on GPU in batches; per message I take P(pos) − P(neg) in [−1, +1], then average per month.
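
The score itself is simple arithmetic once the model has returned class probabilities. A small sketch, with the model call left out and the probabilities below made up purely for illustration (`tone_score` and `monthly_tone` are hypothetical names):

```python
from collections import defaultdict


def tone_score(p_pos, p_neg):
    """Per-message tone in [-1, +1]: P(pos) - P(neg)."""
    return p_pos - p_neg


def monthly_tone(rows):
    """rows: iterable of (month_key, p_pos, p_neu, p_neg). Returns {month: mean score}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for month, p_pos, _p_neu, p_neg in rows:
        sums[month] += tone_score(p_pos, p_neg)
        counts[month] += 1
    return {m: sums[m] / counts[m] for m in sums}


# Made-up probabilities, not real scores from the chat data.
rows = [
    ("month-00", 0.8, 0.1, 0.1),
    ("month-00", 0.2, 0.6, 0.2),
    ("month-01", 0.1, 0.2, 0.7),
]
print(monthly_tone(rows))
```

A confidently neutral message lands near 0 because P(pos) and P(neg) are both small, which is why logistics-heavy months pull the average toward zero rather than toward either extreme.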

Turning-point detection

Rule-based: long near-zero stretches (silence), large month-over-month surges, and tone peaks — all from numbers only.
Plus: the per-month table of counts + tone scores fed to a bigger local model (Qwen2.5-72B via ollama) to guess events. No message text was given.
Real event dates were kept in a local note only, used for annotation and the answer check.
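
A rough sketch of what the count-based rules could look like. The thresholds (`max_count`, `min_months`, `ratio`, `min_count`) and function names are hypothetical, the monthly counts are invented, and the tone-peak rule is left out to keep it short.

```python
def find_silences(counts, max_count=5, min_months=3):
    """Return (start, end) index pairs of runs where counts stay <= max_count."""
    runs, start = [], None
    for i, c in enumerate(counts):
        if c <= max_count:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_months:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(counts) - start >= min_months:
        runs.append((start, len(counts) - 1))
    return runs


def find_surges(counts, ratio=3.0, min_count=50):
    """Return month indices where volume jumps by >= ratio over the previous month."""
    surges = []
    for i in range(1, len(counts)):
        prev = max(counts[i - 1], 1)  # avoid dividing by zero after a silence
        if counts[i] >= min_count and counts[i] / prev >= ratio:
            surges.append(i)
    return surges


# Invented monthly message counts, indexed by months since the start.
monthly_counts = [10, 200, 400, 2, 0, 0, 1, 300, 350]
print(find_silences(monthly_counts))
print(find_surges(monthly_counts))
```

Rules like these only react to sharp changes in the counts, which fits the result above: an event that barely moves the volume never crosses a threshold.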

Privacy

Every file containing message text (raw export, parsed data, scores) stays in a non-public folder.
Only aggregate numbers and charts are published. The chart x-axis is relativized to "months since the conversation started," hiding actual dates.
Apart from a few short, name- and place-free one-liners shown as scoring examples, no conversation content, real names, specific dates, or long text appears in the article or charts.

Tomorrow: Day 10

Weather forecasts say one temperature, but everyone feels it differently. Same degrees, different "do I need a coat?" So next I'm building my own personal "weather officer" AI: from past weather data, it'll tell me each morning something like "coat + beanie today." Over the next 100 days I'll teach it my own sense of cold — the start of a longer project.

#100ExperimentsWithDGX #LocalLLM

[Day 8] Pushing Looped Transformers Beyond Addition: OpenMythos on Bracket-Matching Depth

PEPPERCORN — Fri, 29 May 2026 06:27:04 +0000

Intro

Day 8!

A direct follow-up to Day 7: same OpenMythos-style mini model (3.4M params), same training pipeline, one task change — multi-digit addition swapped for nested-bracket parsing. The goal was to ask two follow-up questions Day 7 left open:

Does the "training-time loop count is the peak" finding generalize across tasks?
If we increase the structural complexity of the input (deeper nesting), does inference-time loop count start to matter?

Tools used: my home AI machine (DGX Spark, GB10) + OpenMythos (PyTorch reconstruction of the rumored Claude Mythos architecture) + synthetic bracket sequences.

Today's setup

Why bracket matching?

Day 7's task was 2-5 digit addition. Addition tests "carry propagation from low to high digit" — a fundamentally local, left-to-right state update. To probe whether looped depth helps with a different kind of structural reasoning, I wanted a task where:

The output depends on left-to-right state tracking (rules out attention-based global aggregation shortcuts).
The task admits an explicit notion of depth I can vary as a controlled difficulty knob.

Bracket matching fits both. The standard linear-time algorithm is push-on-open / pop-on-close with a stack. A model that has internalized that algorithm should scale gracefully with depth — and one that hasn't will visibly fall over.

Task: first-break-position prediction

Input: a string of ( ) [ ] { } characters, terminated by =.
Output: the left-most position at which the bracket structure breaks, as 2 digits, terminated by $. If the sequence is balanced, output --$.

Examples:

((()))=          → --$       (balanced)
([{}])=          → --$       (balanced)
([)]=            → 02$       (`)` at position 2 doesn't match preceding `[`)
(()(=            → 04$       (stack non-empty at end of string, position = len)
))=              → 00$       (close on empty stack at position 0)

The "break position" is defined by a stack parser scanning left-to-right:

Close bracket whose type ≠ stack top → return that close position.
Close bracket on empty stack → return that close position.
End of string with non-empty stack → return len(s).
Otherwise balanced → return -1 (output --).
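
The four rules above translate almost line for line into a stack parser. This is a sketch written from the rules as stated, not the script from the repo; `first_break_position` and `format_target` are hypothetical names, and the input is assumed to exclude the trailing `=`.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def first_break_position(s):
    """Return the left-most break position, or -1 if s is balanced."""
    stack = []
    for pos, ch in enumerate(s):
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # Close on empty stack, or close that does not match the stack top.
            if not stack or stack[-1] != PAIRS[ch]:
                return pos
            stack.pop()
        else:
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    # Leftover opens mean the break is reported at the end of the string.
    return len(s) if stack else -1


def format_target(s):
    """Render the training target: two digits plus '$', or '--$' when balanced."""
    pos = first_break_position(s)
    return "--$" if pos < 0 else f"{pos:02d}$"


for example in ["((()))", "([{}])", "([)]", "(()(", "))"]:
    print(f"{example}= -> {format_target(example)}")
```

The two "close" rules collapse into one condition because both return the position of the offending close bracket.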

Why not just binary balanced / imbalanced?

That was the original plan. A first smoke run with T/F output saturated to 100% accuracy across all depths (up to 10) by step 4,000. There are so many shortcut signals (length parity, open/close count, etc.) that a transformer never has to learn the actual stack algorithm.

The first-break-position output forces the model to commit to a specific character position, which can only be answered by tracking state left-to-right. After this change, smoke results at 5,000 steps showed clean depth-dependent difficulty (d=2: 100%, d=20: 71%) and the loss had room to keep dropping. That's the signal I needed to study loop-count behavior meaningfully.

Difficulty knob: depth

I trained and evaluated across depths {2, 4, 6, 8, 10, 12, 16, 20}, with pair count capped at min(2 * depth, 50) so the 2-digit position output stays in range. Balanced and imbalanced sequences mixed 50/50; imbalanced sequences generated by deleting a close (30%), deleting an open (30%), or substituting a bracket (40%).

Architectural changes from Day 7

Minimal — only what the new vocab and longer sequences required:

	Day 7 (addition)	Day 8 (brackets)
vocab_size	16	20
max_seq_len	32	128
max_loop_iters (train)	4	4
Difficulty axis	2-5 digits	depth 2-20
Answer tokens	1-6 (digits + `$`)	2 + `$`
Total params	3.39M	3.39M

Same MythosConfig template otherwise. Same hyperparameters (AdamW, max LR 3e-4, warmup 2000, cosine decay, 30k steps, fp32, 4 seeds in parallel).

Headline finding

The Day 7 "peak at training loop count" finding largely generalizes. With training max_loop_iters=4, accuracy peaks at T=4 or within a few points of it at T=8, and decays in both directions at every depth I tested.
But the peak height is much lower. Best accuracy was 66% at depth 2; depth 20 caps at ~36%. Day 7 hit 100% at d=5; brackets at the same parameter budget plateau dozens of points short.
Inference-time loop extrapolation does NOT improve deep-nesting performance. The hypothesis "deeper inputs benefit from more loops" did not reproduce: T=8 at best edges out T=4 by 1-3 points at mid depths, and T≥16 hurts at every depth, just as in Day 7.
Fixed-point reproduced, later. Cosine similarity between consecutive hidden states sits around 0.92-0.97 at T=3-4 and passes 0.99 by T=8, later than addition (which reached ~0.95 by T=2 and ~0.99 by T=4).

🪢 The task in pictures

Input:  ( ( [ ) ] ) =
Pos:    0 1 2 3 4 5

stack walk:
  pos 0: '(' → push '('             stack: ( 
  pos 1: '(' → push '('             stack: ( (
  pos 2: '[' → push '['             stack: ( ( [
  pos 3: ')' → top is '[', mismatch!  → first break at position 3

Expected output: 03$

The interesting thing about this task vs. addition: the answer can be anywhere from 0 to ~40 depending on the input, and the model has to commit to a specific integer. There's no global-aggregation shortcut — you have to walk left-to-right and remember what you've seen.

🔧 Pipeline

OpenMythos tiny (3.4M params, same as Day 7 modulo vocab + max_seq_len)
  ↓
Train 4 seeds in parallel, 30k steps, fp32 on DGX Spark (GB10)
  ↓
Experiment A: greedy autoregressive accuracy
              loops ∈ {1, 2, 4, 8, 16, 32}  ×  depth ∈ {2, 4, 6, 8, 10, 12, 16, 20}
  ↓
Experiment B: cosine similarity between consecutive hidden states
              ⇒ does the recurrent block reach a fixed-point?
              ⇒ does the fixed-point timing depend on depth?
  ↓
Compare against Day 7 (digits) along the same axes

Training throughput note (vs Day 7)

Day 7's 4-seed parallel training was fast because max_seq_len=32 left the GPU underutilized per process. With max_seq_len=128, a single process already saturates the GB10 — 4-seed parallel drops per-process throughput from ~60K tok/s to ~12.8K tok/s (a -79% per-process penalty). Aggregate parallel throughput is actually ~15% slower than sequential 4-seed.

I let it run in parallel anyway because it was overnight and I had no other DGX usage scheduled. Worth noting for anyone planning similar replications: longer sequences kill the "free" benefit of multi-seed parallelism on a single GPU.
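
The throughput comparison above is just two ratios. A quick check using only the numbers quoted in the text (variable names are mine):

```python
solo_tok_s = 60_000          # one process alone, from the text
parallel_tok_s = 12_800      # each of 4 processes when run together
n_seeds = 4

per_process_penalty = 1 - parallel_tok_s / solo_tok_s
aggregate_tok_s = n_seeds * parallel_tok_s
# Sequential runs each get the full solo rate, so compare aggregate vs solo.
aggregate_slowdown = 1 - aggregate_tok_s / solo_tok_s

print(f"per-process penalty: {per_process_penalty:.0%}")
print(f"aggregate: {aggregate_tok_s} tok/s, {aggregate_slowdown:.0%} below sequential")
```

Parallel seeds only pay off when the aggregate rate exceeds the solo rate, which is no longer the case here.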

GPU draw stayed at 51W / 72°C / 95% utilization throughout — comfortable enough to leave running.

📊 Results

Experiment A: accuracy heatmap

Mean exact-match accuracy across 4 seeds, 500 eval samples per condition:

Inference loops	d=2	d=4	d=6	d=8	d=10	d=12	d=16	d=20
1	0.11	0.05	0.03	0.02	0.01	0.01	0.01	0.02
2	0.32	0.20	0.13	0.08	0.08	0.08	0.07	0.07
4 (train)	0.66	0.56	0.50	0.45	0.44	0.41	0.41	0.36
8	0.58	0.56	0.51	0.47	0.46	0.44	0.39	0.34
16	0.55	0.51	0.44	0.41	0.40	0.38	0.36	0.32
32	0.55	0.48	0.42	0.40	0.39	0.37	0.36	0.31

Observations:

Peak at or next to T=4 in every depth column. Day 7's "loops help only in a narrow window centered on training" finding generalizes: T=4 is best at d=2, d=16, and d=20, ties T=8 at d=4, and trails T=8 by 1-3 points from d=6 through d=12. No column peaks at T≤2 or T≥16.
Depth scaling is graceful but the ceiling is low. Going from d=2 to d=20 at T=4, accuracy degrades smoothly (0.66 → 0.36), but the absolute numbers stay far from saturation.
The "deeper input ⇒ more loops" hypothesis does not hold. I'd hoped to see T=8 or T=16 begin to dominate at d=20, indicating inference-time scaling could rescue depth. Instead, the deepest columns (d=16, d=20) peak at T=4, and every column decays from T=8 onward: the same shape as Day 7's digit-count columns, just stretched lower.
T=8 is unusually competitive at mid-depths. At d=4 through d=12, T=8 matches or beats T=4 by up to 3 points. Possibly two adjacent settings of test-time depth around the training value are both near-optimal.
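
The table is small enough to check column by column. A sketch that reports which loop count is best at each depth and how far T=8 sits from T=4; the numbers are copied from the Experiment A table, the helper name is mine.

```python
depths = [2, 4, 6, 8, 10, 12, 16, 20]
# Rows copied from the Experiment A table: inference loops -> accuracy per depth.
acc = {
    1:  [0.11, 0.05, 0.03, 0.02, 0.01, 0.01, 0.01, 0.02],
    2:  [0.32, 0.20, 0.13, 0.08, 0.08, 0.08, 0.07, 0.07],
    4:  [0.66, 0.56, 0.50, 0.45, 0.44, 0.41, 0.41, 0.36],
    8:  [0.58, 0.56, 0.51, 0.47, 0.46, 0.44, 0.39, 0.34],
    16: [0.55, 0.51, 0.44, 0.41, 0.40, 0.38, 0.36, 0.32],
    32: [0.55, 0.48, 0.42, 0.40, 0.39, 0.37, 0.36, 0.31],
}


def best_loops(col):
    """Loop counts that tie for the highest accuracy in one depth column."""
    top = max(row[col] for row in acc.values())
    return [t for t, row in acc.items() if row[col] == top]


for col, d in enumerate(depths):
    gap = acc[8][col] - acc[4][col]
    print(f"d={d:>2}: best T={best_loops(col)}, T=8 minus T=4 = {gap:+.2f}")
```

With 500 samples per seed and 4 seeds, gaps of a point or two are within what I'd treat as noise, so this is a sanity check on the table rather than a significance test.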

Experiment B: fixed-point analysis

Mean cosine similarity between consecutive hidden states cos(h_t, h_{t-1}) measured at the first-answer-token position, averaged across 4 seeds, 200 samples per depth:

t	d=2	d=4	d=8	d=12	d=16	d=20
1	0.85	0.89	0.92	0.92	0.93	0.91
2	0.91	0.91	0.94	0.95	0.95	0.97
3	0.94	0.94	0.92	0.92	0.94	0.95
4	0.95	0.97	0.96	0.95	0.93	0.93
8	0.998	0.995	0.998	0.996	0.996	0.992
16	0.9994	0.9996	0.9989	0.9985	0.9976	0.9979
32	0.9998	0.9998	0.9998	0.9997	0.9995	0.9996

Three things to note:

Fixed-point timing is later than Day 7. Day 7 reached ~0.95 by T=2 and ~0.99 by T=4; brackets hover around 0.92-0.97 at T=3-4 and only pass 0.99 at T=8. Possibly the more complex left-to-right state needs longer to settle.
Depth dependence is small. d=20 traces almost on top of d=2, again echoing Day 7 (where digit-count had only marginal effect on fixed-point timing). "Harder problem ⇒ slower fixed-point" did not appear.
Hidden state is nearly static by T=8 (cosine ≥ 0.99) while accuracy keeps decaying through T=32. Same paradox as Day 7: extra loops are computation without information. Either the late-loop perturbations are a small but logit-relevant drift away from a converged answer, or this is purely a distribution-shift artifact of training only at T=4.

Comparison with Day 7

Axis	Day 7 (addition)	Day 8 (brackets)
Loop-count peak at T=train (=4)	Yes	Yes
Best accuracy at peak	100% (all digits)	66% (d=2), 36% (d=20)
Inference-time loop extrapolation	Hurts	Hurts
Cosine fixed-point arrival	~T=2	~T=3
Depth/digit dependence on fixed-point	Small	Small
Training dynamics	Grokking (sudden phase transition)	Smooth slow climb

Day 8 reproduces Day 7's qualitative loop-count and fixed-point findings. What changes is the training dynamics (a smooth climb instead of grokking) and the quantitative ceiling: at the same parameter budget and the same training compute, structure-tracking caps far below saturation while addition saturates.

💡 Tying back to the three perspectives

Day 7 tested looped transformers against three published views:

Saunshi et al. — loops can match deeper fixed-depth networks on algorithmic tasks
Geiping et al. (Huginn) — at scale, extra loops give marginal gains
Micheal Bee — loops plateau early at small scale (T=2 fixed-point)

Day 8 adds three more data points to the picture:

The "peak at training loop count" pattern persists across qualitatively different algorithmic tasks (addition vs. bracket parsing). This is consistent with Saunshi's framing but argues against naive depth-extrapolation at inference.
The fixed-point arrives at slightly different times for different tasks. Bee's "T=2" appears to be a property of the specific task and training recipe, not a universal property of looped transformers. Brackets need ~T=3-4 to plateau, addition needs ~T=2.
Task structural complexity matters more than loop count. At a fixed budget, the ceiling on accuracy is set by something else (model capacity? loss landscape? data efficiency?), not by the number of inference loops. Adding more loops can't compensate.

A useful refinement: looped transformers carry compute up to a depth bounded by the task's algorithmic complexity and the model's expressive capacity. Beyond that, the hidden state stops moving meaningfully and additional loops are computation without information. Day 7 showed this for a task within capacity (addition saturates); Day 8 shows it for a task that bumps against capacity (bracket parsing caps short).

🛠️ Technical details

Smoke history (why the task definition changed)

Initial smoke: balanced/imbalanced binary classification, depths 2-10.
Result: 100% accuracy across all depths by step 4,000.
Diagnosis: the shortcut signals (length parity, open/close count) are enough to solve the task without the stack algorithm, even with mutations that should defeat counting shortcuts. The binary (1-bit) output gives the model no incentive to track position-by-position state.

Second smoke: first-break-position output, depths 2-20.
Result at 5,000 steps: d=2 100%, d=20 71%, with loss still trending down (0.32 → still falling).
Diagnosis: depth-dependent difficulty visible, room to scale training to expose loop-count effects.

Lesson worth recording: output information density matters as much as task structure for studying loop behavior. A binary classifier with global-aggregation shortcuts is a weak probe of recurrent depth.

Config and hyperparameters

MythosConfig(
    vocab_size=20,         # 6 brackets + '=' + '$' + space + '-' + '0'-'9'
    dim=256,
    n_heads=8,
    n_kv_heads=2,          # GQA
    max_seq_len=128,       # Day 7 was 32
    max_loop_iters=4,
    prelude_layers=1,
    coda_layers=1,
    attn_type="gqa",
    n_experts=4,           # MoE FFN inside recurrent block
    n_shared_experts=1,
    n_experts_per_tok=2,
    expert_dim=512,
    lora_rank=8,
    rope_theta=10000.0,
)

Total parameters: 3,394,338 (~3.4M, matches Day 7 to within rounding).

Training:

Optimizer: AdamW, betas (0.9, 0.95), wd 0.1
LR: max 3e-4, warmup 2000 steps, cosine decay to 1e-5
Grad clip: 1.0
Batch size: 128
Max steps: 30000
dtype: fp32 (same RoPE-complex-buffer reason as Day 7)
4 seeds {0, 1, 2, 3} in parallel

Data generation

On-the-fly synthetic. For each sample:

Sample depth d ∈ {2, 4, 6, 8, 10, 12, 16, 20} uniformly
Sample pair count n_pairs ~ U[max(1, d-1), min(2*d, 50)]
Generate balanced parenthesization (random bracket types, nested or sequential)
With prob 0.5, apply a mutation: delete close (30%), delete open (30%), substitute (40%)
Compute first-break position with the stack parser; format output
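
A sketch of how the input side of these steps could be generated. Several details are assumptions on my part: depth is treated as a cap on nesting, open vs. close is a 50/50 coin flip whenever both are allowed, and mutation positions are picked uniformly. The actual script may differ, and all names are hypothetical.

```python
import random

OPEN_TO_CLOSE = {"(": ")", "[": "]", "{": "}"}
ALL_BRACKETS = "()[]{}"
DEPTHS = [2, 4, 6, 8, 10, 12, 16, 20]


def balanced_sequence(n_pairs, max_depth, rng):
    """Random balanced string with n_pairs pairs and nesting depth <= max_depth."""
    out, stack, opens_left = [], [], n_pairs
    while opens_left or stack:
        can_open = opens_left > 0 and len(stack) < max_depth
        if can_open and (not stack or rng.random() < 0.5):
            ch = rng.choice("([{")
            stack.append(OPEN_TO_CLOSE[ch])
            out.append(ch)
            opens_left -= 1
        else:
            out.append(stack.pop())
    return "".join(out)


def mutate(seq, rng):
    """One mutation: delete a close (30%), delete an open (30%), or substitute (40%)."""
    r = rng.random()
    if r < 0.6:
        targets = ")]}" if r < 0.3 else "([{"
        idx = rng.choice([i for i, ch in enumerate(seq) if ch in targets])
        return seq[:idx] + seq[idx + 1:]
    idx = rng.randrange(len(seq))
    new = rng.choice([b for b in ALL_BRACKETS if b != seq[idx]])
    return seq[:idx] + new + seq[idx + 1:]


def make_input(rng):
    """Return (input_string, was_mutated) for one training sample."""
    d = rng.choice(DEPTHS)
    n_pairs = rng.randint(max(1, d - 1), min(2 * d, 50))
    seq = balanced_sequence(n_pairs, d, rng)
    mutated = rng.random() < 0.5
    if mutated:
        seq = mutate(seq, rng)
    return seq + "=", mutated


rng = random.Random(0)
for _ in range(3):
    print(make_input(rng))
```

Each of the three mutations always breaks a balanced string: a deletion leaves an odd length, and a substitution either flips an open/close count or leaves a pair with mismatched types.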

Loss is applied only at positions following = (i.e., on the 2-digit answer + $).

Evaluation

Experiment A: greedy autoregressive generation, exact 3-token match (position digits + $). 500 samples per (seed, n_loops, depth).
Experiment B: re-implementation of OpenMythos forward to expose per-loop hidden states. Cosine similarity at the first answer-token position. 200 samples per (seed, depth), 32 loop iterations.
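
The Experiment B metric itself is easy to state in code once the per-loop hidden states are available. A plain-Python sketch, assuming `states[t]` is the hidden vector at the first answer-token position after loop t (extracting those from the model is the part not shown; names are hypothetical):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def consecutive_cosines(states):
    """states[t] is the hidden vector after loop t; returns cos(h_t, h_{t-1}) for t >= 1."""
    return [cosine(states[t], states[t - 1]) for t in range(1, len(states))]


# Toy vectors only, to show the call shape.
toy_states = [[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
print(consecutive_cosines(toy_states))
```

Cosine similarity ignores vector length, so a state that keeps growing or shrinking along the same direction would still read as 1.0 here.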

What I'd want to try next

Increase training-time loop count and re-measure. Does the peak track with training depth (suggesting it's purely a distribution-shift artifact) or does extrapolation stay broken?
Scale model dim while keeping loops fixed. Does a 10x bigger model break through the ~66% / ~36% bracket ceiling, or does the structure-tracking task itself need a different inductive bias?
Mix tasks in training. Train on addition + brackets jointly and see if there's interference or transfer.
Inject explicit halting (ACT). Let the model choose how many loops per token. Does it match the empirical optimum or settle elsewhere?

References

Training and evaluation scripts: https://github.com/SAETAG/dgx-100-experiments/tree/main/days/day08-bracket-matching/scripts.

Tomorrow: Day 9

Switching gears to something much more personal — handing private chat data to a local model and seeing what it surfaces…!

#100ExperimentsWithDGX #LocalLLM

[Day 7] Does Giving an AI More 'Thinking Time' Really Make It Smarter? Training an OpenMythos-Style Mini Model on DGX

PEPPERCORN — Tue, 19 May 2026 03:17:51 +0000

Intro

Day 7!

Reddit kept surfacing this new project called OpenMythos in my feed with "12 days to replicate frontier AI, ASI is near" headlines, and I got curious enough to dig in.

Tools used: my home AI machine (DGX Spark) + OpenMythos (PyTorch reconstruction of the rumored Claude Mythos architecture) + synthetic multi-digit addition.

The question: does giving an AI more "thinking time" (= more recurrent loops at inference) actually make it smarter?

Today's setup

The hype

On 2026-04-07, Anthropic announced Claude Mythos. Press coverage highlights zero-day discovery capabilities — reportedly 271 zero-days in Firefox and a 27-year-old bug in OpenBSD — but the model's architecture and weights remain unreleased. Anthropic kept Mythos itself behind a limited-access coalition (Project Glasswing — AWS, Apple, Microsoft, Google, CrowdStrike, Palo Alto, ~40 organizations) rather than releasing it publicly.

Twelve days later, Kye Gomez (Swarms) released OpenMythos, a PyTorch reconstruction of the suspected architecture. The repo is explicit upfront:

"an independent, community-driven theoretical reconstruction based solely on publicly available research and speculation. It is not affiliated with, endorsed by, or connected to Anthropic"

So OpenMythos is not Mythos. It's a hypothesis-in-code: a Recurrent-Depth Transformer (RDT) with MoE FFNs and MLA/GQA attention, capable of being trained from scratch on standard text data. No leaked weights, no distillation.

Reddit's "ASI is near" framing skips this critical distinction. The interesting question, once you set the hype aside, is whether the architectural idea — recurrent depth — actually works.

Note for this article: OpenMythos is not Claude Mythos — it's a theoretical reconstruction inspired by looped-transformer research. The experiments below are not "Claude Mythos capability tests" but rather "how does a looped / recurrent-depth structure behave on a small synthetic task."

Three perspectives on looped transformers

Browsing the literature, I found three different studies giving different pictures of how looped transformers behave:

Source	Scale	Claim
Saunshi et al. 2025 (ICLR, research paper)	tens of M params, synthetic	Loops work: k layers looped L times approximately matches kL-layer fixed-depth, on addition / p-hop induction / math
Geiping et al. 2025 (Huginn, research paper)	3.5B params, 800B tokens	Task-dependent: at scale on natural-language benchmarks, gains can be marginal (T=4 → T=32 only +1.82 points on GSM8K), though effects vary by task and compute regime
Micheal Bee 2026-04 (Medium, independent experiment blog)	17M params, 12 GPU-hours on RTX 5070 Ti	Loops plateau at T=2 in this small-scale setup: hidden state reaches a fixed-point that subsequent iterations cannot escape

Theory, large-scale empirics, and an independent solo replication give different pictures. I wanted to add a fourth data point from my own DGX Spark on a clean, controlled task — multi-digit addition.

What I'd hoped to see

Does training-time accuracy phase-transition (grok) at some step? (Saunshi 3-stage prediction)
Does test-time loop count matter? At what point does it stop helping?
Does the hidden state actually keep evolving across loops, or does it hit a fixed-point early? (the Bee question)

Headline finding

Loops help, but only within a narrow window centered on the training loop count. With training-time max_loop_iters=4, accuracy peaks at exactly T=4 (100% across all digit counts) and decays in both directions — fewer loops underthink, more loops overthink.
Bee's "T=2 fixed-point" reproduced. Cosine similarity between consecutive hidden states jumps from ~0.72 to ~0.95 at T=2, then climbs slowly to ~0.99 by T=4 and stays flat through T=32.
Striking per-seed grokking variance. Same hyperparameters, four seeds: seeds 1 and 3 solve 5-digit addition by step 4,000; seed 2 takes 10,000; seed 0 stalls at <10% until step 16,000, then jumps to 100%.
No depth extrapolation in this setup. Saunshi's claim that training at T=4 should generalize to deeper T at inference does not reproduce here — instead, T>4 hurts.

🌀 What is a "looped" transformer?

A standard transformer (GPT-4, Llama, most local LLMs) routes input tokens through a stack of distinct layers, each used exactly once per forward pass. To make it "think deeper," you stack more layers — increasing parameter count.

A looped transformer reuses the same parameters across multiple iterations. The model has a Prelude → Recurrent Block × T → Coda structure: a few standard layers up front, then one block iterated T times with input injection at every step, then a few more standard layers.

Input tokens
   ↓
[Prelude P]          — standard layers, run once
   ↓
[Recurrent Block R]  — one block looped T times
   ↑_______↓          h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
   ↓
[Coda C]             — standard layers, run once
   ↓
Output logits

At each loop iteration t, the hidden state updates via the LTI injection rule, and the encoded input e (Prelude output) is re-injected to keep the original signal alive across arbitrary depth. The injection parameters are constrained so that spectral radius ρ(A) < 1, which prevents divergence over many loops (Parcae stability framework).
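
To see why ρ(A) < 1 matters, here is a toy version of the injection rule with the Transformer term dropped and A, B taken to be diagonal. Both simplifications are mine, not OpenMythos's; with them the recurrence h_{t+1} = A·h_t + B·e has the closed-form fixed point h* = B·e / (1 − A) per dimension.

```python
def lti_step(h, e, a, b):
    """One injection step with diagonal A and B: h_next[i] = a[i]*h[i] + b[i]*e[i]."""
    return [ai * hi + bi * ei for hi, ei, ai, bi in zip(h, e, a, b)]


def run_loops(e, a, b, n_loops):
    """Iterate from h_0 = 0 and return the state after every loop."""
    h = [0.0] * len(e)
    states = []
    for _ in range(n_loops):
        h = lti_step(h, e, a, b)
        states.append(h)
    return states


e = [1.0, -2.0, 0.5]   # stand-in for the Prelude output
a = [0.5, 0.8, 0.2]    # diagonal of A; every |a_i| < 1, so rho(A) < 1
b = [1.0, 1.0, 1.0]    # diagonal of B

states = run_loops(e, a, b, 32)
fixed_point = [bi * ei / (1 - ai) for ei, ai, bi in zip(e, a, b)]
for t in (1, 2, 4, 8, 32):
    gap = max(abs(x - y) for x, y in zip(states[t - 1], fixed_point))
    print(f"T={t:>2}: max distance to fixed point = {gap:.6f}")
```

In this linear toy the distance to the fixed point shrinks geometrically with the largest |a_i|, so extra loops add almost nothing once it is small. The real block has a nonlinear Transformer term on top, so this only illustrates the stability argument, not the model's actual dynamics.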

The key claim: more loops at inference = deeper reasoning, without adding parameters. This is conceptually analogous to chain-of-thought scaling — except the "thinking" happens in continuous latent space rather than discrete token space.

🔧 Experimental setup

I trained a deliberately tiny OpenMythos variant on multi-digit addition. The model is small enough to run 4 seeds in parallel on a single GPU but large enough to exhibit the looped-transformer phenomena.

OpenMythos tiny (3.4M params)
  ↓
Train 4 seeds in parallel, 30k steps each, fp32 on DGX Spark (GB10)
  ↓
Experiment A: greedy autoregressive accuracy
              loops ∈ {1, 2, 4, 8, 16, 32}  ×  digits ∈ {2, 3, 4, 5}
  ↓
Experiment B: cosine similarity between consecutive hidden states
              ⇒ does the recurrent block reach a fixed-point?
  ↓
Compare against Saunshi / Huginn / Bee

Model config

MythosConfig(
    vocab_size=16,         # digits 0-9 + '+', '=', pad, eos
    dim=256,
    n_heads=8,
    n_kv_heads=2,          # GQA
    max_seq_len=32,
    max_loop_iters=4,      # training depth; inference varies
    prelude_layers=1,
    coda_layers=1,
    attn_type="gqa",
    n_experts=4,           # MoE FFN inside recurrent block
    n_shared_experts=1,
    n_experts_per_tok=2,
    expert_dim=512,
    lora_rank=8,           # depth-wise LoRA per loop step
)

Total parameters: 3,386,658 (~3.4M).

Data

On-the-fly synthetic addition. Operands are uniformly sampled from [10^(d-1), 10^d - 1] for digit count d ∈ {2, 3, 4, 5}. Sequence format "A+B=R$", where R = str(A+B)[::-1] (reverse-order answer, following Saunshi's convention so left-to-right autoregressive generation can carry digits naturally).

Loss is applied only at positions following the = token (i.e., on the answer tokens).
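
A small sketch of the sample format and the loss mask, assuming one token per character (my assumption; the real tokenizer mapping is not shown in this post). Function names are hypothetical.

```python
import random


def format_sample(a, b):
    """Render 'A+B=R$' with the sum written least-significant digit first."""
    return f"{a}+{b}={str(a + b)[::-1]}$"


def make_sample(d, rng):
    """Sample two d-digit operands uniformly from [10^(d-1), 10^d - 1]."""
    lo, hi = 10 ** (d - 1), 10 ** d - 1
    return format_sample(rng.randint(lo, hi), rng.randint(lo, hi))


def answer_mask(seq):
    """True for characters strictly after '=', i.e. the positions that carry loss."""
    eq = seq.index("=")
    return [i > eq for i in range(len(seq))]


print(format_sample(95, 17))
print(answer_mask("12+34=64$"))
print(make_sample(5, random.Random(0)))
```

Writing the sum in reverse means the first answer token is the ones digit, so each generated digit only depends on carries the model has already produced.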

Training

Optimizer: AdamW, betas (0.9, 0.95), wd 0.1
LR: max 3e-4, warmup 2000 steps, cosine decay to 1e-5
Grad clip: 1.0
Batch size: 128
Max steps: 30000
dtype: fp32
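
The learning-rate line above can be written as a single function. The warmup shape is not stated in this post, so the linear ramp here is an assumption; the constants come from the list, and `lr_at` is a hypothetical name.

```python
import math

MAX_LR, MIN_LR = 3e-4, 1e-5
WARMUP_STEPS, MAX_STEPS = 2000, 30000


def lr_at(step):
    """Linear warmup to MAX_LR (assumed shape), then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


for step in (0, 1999, 2000, 16000, 30000):
    print(step, f"{lr_at(step):.2e}")
```

Under this schedule the late-grokking seeds are still training at a meaningful learning rate when they finally jump, since step 16,000 is exactly halfway through the decay.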

Initially I tried bf16 to use the GB10 efficiently, but OpenMythos stores RoPE frequencies as complex64 buffers, and model.to(bfloat16) silently drops the imaginary part, breaking attention. For a 3.4M-param model on 128 GB of unified memory, fp32 is fine — the bottleneck is not memory but parallel scheduling.

Four seeds {0, 1, 2, 3} run in parallel on the same GPU. Per-seed throughput drops to ~12K tok/s (vs ~50K solo), but wall-clock time for all four is approximately equivalent to one solo run.

📊 Results

Experiment A: accuracy heatmap

Mean fully-correct rate across 4 seeds, 500 eval samples per condition:

Inference loops	d=2	d=3	d=4	d=5
1	0.38 ± 0.12	0.19 ± 0.09	0.09 ± 0.07	0.02 ± 0.02
2	0.53 ± 0.17	0.50 ± 0.12	0.16 ± 0.08	0.21 ± 0.16
4 (train)	1.00	1.00	1.00	1.00
8	0.98 ± 0.01	0.98 ± 0.01	0.94 ± 0.03	0.86 ± 0.08
16	0.91 ± 0.04	0.91 ± 0.05	0.75 ± 0.10	0.56 ± 0.16
32	0.62 ± 0.12	0.65 ± 0.13	0.45 ± 0.13	0.26 ± 0.17

Observations:

Peak is exactly at training-time loop count (T=4), 100% across all digit counts.
One doubling past the training loop count (T=8) is near-peak but already shows degradation at d=5 (86%).
Beyond T=8, accuracy collapses monotonically. At T=32, even 2-digit addition drops to 62%.
Under-looping (T=1, T=2) hurts more at higher digit counts, consistent with depth being needed to chain carries.

Experiment B: fixed-point analysis

Mean cosine similarity between consecutive hidden states cos(h_t, h_{t-1}) over answer positions, averaged across 4 seeds, 200 samples per digit:

t	d=2	d=3	d=4	d=5
1	0.711	0.726	0.745	0.744
2	0.961	0.967	0.957	0.946
3	0.985	0.986	0.977	0.971
4	0.993	0.992	0.986	0.983
8	0.999	0.999	0.998	0.996
16	0.9995	0.9996	0.9992	0.998
32	0.9995	0.9996	0.999	0.998

Bee's T=2 fixed-point claim is reproduced in spirit but not literally: cosine similarity jumps to ~0.95 at T=2 (vs. Bee's near-1.0), then asymptotes to ~0.99 by T=4 and stays flat through T=32.

The difference vs. accuracy is telling: hidden state is effectively static (by cosine similarity) from T=4 onwards, yet accuracy collapses at T=16-32. Two non-exclusive interpretations: (a) overthinking — late loops drift away from a converged solution; (b) distribution shift — training used T=4, so T>>4 is simply an out-of-distribution use of the model. Worth noting that cosine similarity ≈ 1 doesn't prove the hidden state is doing nothing — small logit-relevant deltas may still accumulate.

Digit-count dependence on fixed-point timing is small (d=5 lags d=2 by ~0.01 in cosine sim). "Harder problems take more loops to converge" is not observed here — they converge at the same rate but the converged state is just less accurate at higher digit counts.

Bonus: training dynamics

The most striking thing in the training curves is seed-dependent grokking timing. Four runs of identical hyperparameters:

seed 1: loss → 0 by step 3,000, all digits ≥88% by step 4,000
seed 3: loss → 0 by step 4,000, all digits ≥87% by step 4,000
seed 2: stuck at loss ~0.35 plateau until step 8,000, then collapses to 0 by step 10,000; d=4/5 jump from <10% to 99% in 2,000 steps
seed 0: stuck at loss ~0.30 plateau until step 15,000, then collapses; d=4 groks at step 12,000-14,000, d=5 groks at step 16,000

This is textbook Saunshi-style three-stage grokking (memorization → in-distribution → systematic), with the third-stage trigger varying by a factor of 4x in step count purely on random init. The largest seed gap (seed 0 vs. seed 1) is ~12,000 steps, roughly 1 hour of wall-clock on this DGX.

If you trained a single seed and stopped early, you might conclude "OpenMythos can't generalize beyond d=3" — which would be wrong. The architecture can solve all 4 digit buckets; some random seeds just need much longer to find the systematic-generalization solution.

💡 What this means for the three perspectives

Where my data point lands

My single-DGX small-scale result lands somewhere between Bee and a partial refutation of Saunshi:

Bee's fixed-point at small T is reproduced. Hidden state effectively stops evolving by T=4 (cosine sim ≥ 0.99) and certainly by T=8.
Saunshi's depth-extrapolation does NOT reproduce. Inference at T > train_T does not improve accuracy — it harms it. T=8 is already at 86% on d=5 (vs. 100% at T=4), and T=32 collapses to 26%. The "train at depth k, infer at depth k·L" recipe assumes the recurrent block has learned to keep refining; in my setup it has not.
Huginn's limited-gain finding is consistent in direction at small scale, and more extreme here: extra inference loops give negative ROI rather than diminishing positive ROI.
New observation: seed-dependent grokking with up to 12K-step variance. This is an under-emphasized variable in the public looped-transformer discourse — single-seed studies (Bee's solo replication, individual rows in Saunshi's tables) may be substantially under- or over-estimating typical behavior.

Reconciliation attempt

Theory (Saunshi), large-scale empirics (Huginn), and independent replication (Bee) may not actually be in contradiction — they may be measuring different facets of the same phenomenon at different scales:

Saunshi: shows loops can work on the right kind of problem (algorithmic, depth-bounded reasoning) at the right kind of scale (small synthetic).
Huginn: shows that loops trained at 3.5B / 800B token scale on natural-language data give only marginal gains on a benchmark (GSM8K) that already favors CoT.
Bee: shows that within a particular small-scale training recipe, the recurrent block's hidden state stops evolving very early in inference.

These three findings are compatible with a unified picture: loops carry compute, but only up to a depth bounded by the task's algorithmic complexity and the model's expressive capacity. Beyond that depth, the hidden state stops moving meaningfully, and additional loops are computation without information.

What I'd watch next

Increase loop count during training (here I used 4) and see if the inference-time scaling extends further
Try ACT halting more aggressively to see how the model self-regulates loop depth per token
Add task heterogeneity (mix p-hop induction or parity) to test whether the fixed-point timing varies by problem class

🛠️ Technical details

Reproducing this experiment

git clone https://github.com/kyegomez/OpenMythos
cd OpenMythos
pip install -e .

# Data, training, evaluation scripts (this Day 7 folder):
python scripts/train.py --seed 0 --max_steps 30000
python scripts/eval_accuracy.py --seeds 0 1 2 3
python scripts/eval_fixedpoint.py --seeds 0 1 2 3
python scripts/plot.py

The training and evaluation scripts are at https://github.com/SAETAG/dgx-100-experiments/tree/main/days/day07-openmythos-loop-debate/scripts.

What went wrong (and was fixed)

bf16 broke complex RoPE buffer: switched to fp32; fine at 3.4M parameters
Initial training-time max_loop_iters too small: kept at 4 per Saunshi's recipe; future experiments could vary this
Greedy generation is slow at high loop counts: each batch repeats n_loops forward passes through the recurrent block; for loops=32 this is non-trivial

Hyperparameter choices: why these

dim=256, expert_dim=512, 1 prelude / 1 coda layer: smallest config that still exhibits looping behavior; matches Saunshi's scale
n_experts=4: enough to demonstrate MoE routing without bloating params
lora_rank=8: depth-wise LoRA lets each loop iteration adapt slightly without breaking weight-sharing
max_seq_len=32: tight bound — d=5 addition fits in ~18 chars

Tomorrow: Day 8

A follow-up to Day 7, pushing looped thinking one step further into something harder…!

#100ExperimentsWithDGX #LocalLLM