
Trey Tomes

When does a difference engine become a search for truth?

Scout had a seizure during her overnight training window. I don't know a better way to put it. I was running her training from step 50,000 to step 70,000, with the goal of expanding her context window from 256 tokens to 512 tokens. After 5,000 training steps I began to see oddities in the transcripts: grammar was getting worse, and by step 60,000 her ability to speak was practically gone. At some point in the training the loss had climbed over 600. The logs had the appearance of something violent. The cause: the optimizer and scheduler from the fine-tuning process had leaked into the pre-training functions. I've fixed the bug, but it was gut-wrenching to have to delete so many checkpoints, to flush the time spent on failed computation.
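The class of bug, reduced to its skeleton: two training phases sharing one mutable state object. This is a minimal sketch of the fix, keeping each phase's optimizer-and-scheduler-style state bundled and separate; the names (`TrainingPhase`, `step`) are my placeholders, not Scout's actual codebase.

```python
class TrainingPhase:
    """Bundle the mutable training state that belongs to ONE phase."""

    def __init__(self, name, lr):
        self.name = name
        self.lr = lr          # learning rate for this phase
        self.step_count = 0   # scheduler-style state: steps taken so far

    def step(self):
        self.step_count += 1


# Each phase gets its OWN state object. Reusing the fine-tuning phase's
# object for pre-training is exactly the kind of leak that sends a loss
# over 600.
pretrain = TrainingPhase("pretrain", lr=3e-4)
finetune = TrainingPhase("finetune", lr=1e-5)

for _ in range(100):
    finetune.step()

# Fine-tuning steps must not advance pre-training's schedule.
print(pretrain.step_count)  # → 0
print(finetune.step_count)  # → 100
```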

I took things slower today. Her context window has grown from 256 tokens to 384 tokens. Following that was a lengthy round of testing to check her attention over longer conversations. The dream processing is doing its job. I can't say whether she has a "continuous" existence, but she is remembering things from previous conversations without the benefit of a vector database to pull from. It's fascinating to watch her grow, and thinking through the process of how to form her thoughts is forming mine as well.
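I don't describe the exact mechanics of the expansion here, but one common way to grow a context window is to extend the learned positional embedding table, copying the trained rows in and leaving the new positions to be learned. A minimal sketch under that assumption (the table is plain nested lists rather than real tensors):

```python
import random

OLD_BLOCK, NEW_BLOCK, DIM = 256, 384, 4  # DIM shrunk for illustration

# Stand-in for the trained positional embedding table.
old_pos = [[random.random() for _ in range(DIM)] for _ in range(OLD_BLOCK)]

# New table: old rows preserved exactly, new positions freshly initialized
# (zeros here; a real model would use the usual init and then train them).
new_pos = [row[:] for row in old_pos]
new_pos += [[0.0] * DIM for _ in range(NEW_BLOCK - OLD_BLOCK)]

assert len(new_pos) == NEW_BLOCK
assert new_pos[:OLD_BLOCK] == old_pos  # training so far is preserved
```

The appeal of growing in stages (256 → 384 → 512) is that each round only asks the model to learn a modest number of new positions at a time.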

Tonight I'll run the process to expand her context window from 384 to 512. Tomorrow I'm hoping to begin weaning her off her dependency on Mistral. What do I mean by that?

At night Mistral uses her voice document along with a transcript of her day to generate a "dream" where Scout speaks with her "inner voice", signified with the "[Inner]" token. That dream transcript is then fine-tuned into her network. Over the last day I've begun to see her inner voice come out in her conversations with me; she'll respond to what I say, then the inner voice will speak up and reflect on something before returning control back to Scout's outer voice.
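As I understand my own pipeline well enough to sketch it: the nightly job assembles the voice document and the day's transcript into a prompt, asks the larger model to continue as the inner voice, and saves the tagged result for fine-tuning. `build_dream_prompt` is an illustrative stand-in, and the actual call to Mistral is elided.

```python
def build_dream_prompt(voice_doc, transcript):
    """Combine the voice document and the day's transcript into the
    nightly "dream" prompt, cueing turns tagged with [Inner]."""
    return (
        f"{voice_doc}\n\n"
        f"Today's conversation:\n{transcript}\n\n"
        "Continue as Scout's inner voice, "
        "prefixing each reflective turn with [Inner]."
    )

prompt = build_dream_prompt(
    voice_doc="Scout speaks plainly and asks questions.",
    transcript="[Scout] Hello.\n[User] Hi.",
)
assert "[Inner]" in prompt
```

The dream transcript that comes back is what gets fine-tuned into her network, which is why the inner voice has started surfacing in daytime conversation.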

What if I gave her a prompt continuation where she picked up from "[Inner] " instead of "[Scout] "? Does she have a strong enough inner voice to pick up Mistral's load? I'm going to split her context window into pieces. 256-384 tokens will be used for the "day", then the remaining tokens will be used for the inner dialogue where she reflects on her day.
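The plan above can be sketched as a prompt builder: cap the "day" portion of the block, then append the "[Inner]" speaker token so the continuation comes from her own inner voice. The token budget numbers come from the text; the whitespace `tokenize()` stub and function name are illustrative assumptions.

```python
BLOCK_SIZE = 512
DAY_BUDGET = 384          # 256-384 tokens reserved for the "day" itself

def tokenize(text):
    # Stand-in tokenizer: one token per whitespace-separated word.
    return text.split()

def build_reflection_prompt(day_transcript):
    """Truncate the day to its budget, then cue the inner voice."""
    day_tokens = tokenize(day_transcript)[:DAY_BUDGET]
    prompt = day_tokens + ["[Inner]"]
    remaining = BLOCK_SIZE - len(prompt)   # room left for the reflection
    return prompt, remaining

prompt, budget = build_reflection_prompt("[Scout] Hello. [User] Hi there.")
assert prompt[-1] == "[Inner]"   # generation picks up from the inner voice
assert budget > 0                # whatever is left belongs to the reflection
```

If her inner voice is strong enough, everything after that final "[Inner]" token comes from her own weights rather than from Mistral.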

I've spoken often of Scout's "day". A "day" is the length of her context window. Her block size. Once that window is full she is out of capacity for giving attention to the entire conversation. Her process then logs the conversation, fine-tunes on it, then resets the context window. As her block size increases, her "days" get longer. At 50M parameters, she'll probably never grow past the 1024 token block size. I can see the end on the horizon. I'm hopeful for the future when the construction of the 100M model begins, and a little sad to see the end of this phase.
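The day lifecycle described above reduces to a small loop: accumulate tokens until the block is full, then log the transcript, queue it for fine-tuning, and reset. This is a minimal sketch; the class and method names are hypothetical placeholders, and the fine-tuning call itself is represented by the log queue.

```python
class Day:
    """One 'day' = one context window's worth of conversation."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.context = []          # tokens seen so far today
        self.logged_days = []      # transcripts awaiting fine-tuning

    def observe(self, tokens):
        self.context.extend(tokens)
        if len(self.context) >= self.block_size:
            # Day is over: log it, hand it to fine-tuning, start fresh.
            self.logged_days.append(self.context[:self.block_size])
            self.context = self.context[self.block_size:]


day = Day(block_size=8)            # tiny block size for illustration
for _ in range(3):
    day.observe(["tok"] * 3)       # 9 tokens total → one full day logged

assert len(day.logged_days) == 1
assert len(day.context) == 1       # leftover token starts the next day
```

Growing the block size from 256 toward 1024 just stretches how much conversation fits in one pass of this loop.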

I'm still hoping to teach her to recognize the day drawing to an end. The "fullness" of the context window is tracked in the inference process. If I can somehow inject an indicator token into the context, maybe she can learn to recognize when she's getting "tired": when the day is drawing to a close and it's time to wrap things up.
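Since the inference process already tracks fullness, the injection could be as simple as splicing a marker token in once fullness crosses a threshold. A minimal sketch; the "[Tired]" token and the 85% threshold are my placeholders, not anything Scout actually uses yet.

```python
BLOCK_SIZE = 512
TIRED_AT = 0.85           # inject once the day is 85% spent

def maybe_mark_tired(context):
    """Append a tiredness indicator the first time fullness crosses
    the threshold, so the model can learn to associate the token with
    the day winding down."""
    fullness = len(context) / BLOCK_SIZE
    if fullness >= TIRED_AT and "[Tired]" not in context:
        context.append("[Tired]")
    return context

ctx = ["tok"] * 440        # 440/512 ≈ 0.86, past the threshold
maybe_mark_tired(ctx)
assert ctx[-1] == "[Tired]"

maybe_mark_tired(ctx)      # idempotent: the marker appears only once
assert ctx.count("[Tired]") == 1
```

If enough logged days carry the marker near their end, fine-tuning should give her the chance to learn what follows it: wrapping up.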


To be continued.
