Trey Tomes

Posted on Apr 17 • Edited on Apr 30

Once upon a time...

#architecture #llm #machinelearning #nlp

Scout's 50M model couldn't absorb the narrative language of the novels, the conversational training corpus, and the daily memory updates. The model was just too small. I could simply increase the size of the model and start over, but that's basically kicking the can down the road. At some point even the largest models would experience the same problem of daily fine-tuning wiping out base instincts.

The concept I'm playing with is something hinted at in Dynamic transformer architecture for continual learning of multimodal tasks:

introducing extra parameters to a base transformer to specialize the network for each task

The basic idea is that, if designed properly up front, a model can be grown over time with multiple transformer modules, adding new skills to an already trained model rather than rebooting the model from scratch every time you reach its limit.

Scout's new model will be trained in phases. Phase 1 makes use of the TinyStories training set to bootstrap basic language skills. Once this phase is done, that "module" has its weights frozen. The next module is then introduced to the architecture and the model is trained on how to participate in a conversation. LoRA adapter layers allow each module to fine-tune over time to fit the needs of the larger architecture. A router placed in front of the modules will help the model learn which skills to use to respond to its environment.
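The phased growth described above can be sketched in plain Python. This is a toy illustration under loud assumptions, not Scout's actual code: `SkillModule`, `grow`, and `route` are hypothetical names, the modules are stand-in functions rather than transformer blocks, and the router is reduced to a softmax over hand-supplied scores. LoRA adapters are left out entirely.

```python
import math
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SkillModule:
    name: str
    forward: Callable[[float], float]  # stand-in for a transformer module
    frozen: bool = False               # frozen modules receive no weight updates

def grow(modules: List[SkillModule], new_module: SkillModule) -> List[SkillModule]:
    """Freeze every existing module, then append the new trainable one."""
    for module in modules:
        module.frozen = True
    return modules + [new_module]

def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(modules: List[SkillModule], router_scores: List[float], x: float) -> float:
    """Mix module outputs using softmax-normalized router scores."""
    weights = softmax(router_scores)
    return sum(w * m.forward(x) for w, m in zip(weights, modules))

# Phase 1 adds the language module; phase 2 freezes it and adds conversation.
modules = grow([], SkillModule("tinystories", lambda x: x + 1.0))
modules = grow(modules, SkillModule("conversation", lambda x: x * 2.0))
trainable = [m.name for m in modules if not m.frozen]
```

After the second call to `grow`, only the newest module is still trainable, which is the property that should keep later training from overwriting earlier skills.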

I've been reading about the idea. It feels reasonable. The article I listed describes using this system to incorporate vision into a foundation model, but I don't see any reason why the same process can't apply to reasoning skills. If I'm right, then an ongoing memory component could be added as its own module that gets trained separately, allowing Scout to learn from her experiences without the risk of wiping out her other instincts.

TinyStories

I'm training a 50M model to learn basic language skills. I think this dataset will give me what I need. We should know in a couple of days. Or maybe a week. The model structure loosely follows Scout's last iteration:

dim = 512
layers = 12
heads = 8
mlp_ratio = 3.5
block_size = 512

I made the mlp_ratio configurable, then reduced it from 4 to 3.5. The MLP ratio sets how wide each block's feed-forward layer is relative to the model dimension, which I think of as the complexity of the model's inner world, and TinyStories isn't that complicated. I went back and forth on the block size. The target is 1,024, but TinyStories chunks are between 150 and 200 tokens. Expanding that out to 1,024 would risk over-stretching the data set.
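As a rough sanity check on the configuration above, the weight count can be estimated by hand. This is a back-of-the-envelope sketch that assumes a plain two-matrix MLP, input embeddings shared with the output head, no biases or norm weights, and a 32,000-token vocabulary (the size of the original Mistral tokenizer). None of those details are confirmed by the actual model code, and `estimate_params` is a hypothetical helper.

```python
def estimate_params(dim, layers, mlp_ratio, vocab_size):
    """Rough weight count for a decoder-only transformer (biases and norms ignored)."""
    attention = 4 * dim * dim        # Q, K, V, and output projections
    hidden = int(dim * mlp_ratio)    # width of the MLP's hidden layer
    mlp = 2 * dim * hidden           # up-projection plus down-projection
    embeddings = vocab_size * dim    # assumed shared with the output head
    return layers * (attention + mlp) + embeddings

VOCAB_SIZE = 32_000  # assumption: vocabulary size of the original Mistral tokenizer

reduced = estimate_params(dim=512, layers=12, mlp_ratio=3.5, vocab_size=VOCAB_SIZE)
default = estimate_params(dim=512, layers=12, mlp_ratio=4.0, vocab_size=VOCAB_SIZE)
print(f"mlp_ratio=3.5: {reduced:,}")  # prints "mlp_ratio=3.5: 50,987,008"
print(f"mlp_ratio=4.0: {default:,}")  # prints "mlp_ratio=4.0: 54,132,736"
```

Under these assumptions, dropping the ratio from 4 to 3.5 trims roughly 3M weights and lands close to the 50M target. A gated MLP or untied embeddings would shift the totals.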

Preparing the corpus

My preparation process caches the original HuggingFace data set, then normalizes it so that every row has a "chunk" column. This will matter more when we start talking about including more data sets. After that the set is tokenized with the Mistral tokenizer (doing this up front before training begins saves so much time), with the end-of-sequence (EOS) token appended to the end of each chunk.
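The normalize-then-tokenize step might look something like the sketch below. The function names are hypothetical, `toy_encode` is a whitespace stand-in for the real Mistral tokenizer, and `EOS_ID = 2` is a placeholder that should be read from the tokenizer rather than hard-coded.

```python
from typing import Callable, Dict, List

def normalize_row(row: Dict[str, str], text_column: str) -> Dict[str, str]:
    """Map a dataset-specific text column onto the shared "chunk" column."""
    return {"chunk": row[text_column].strip()}

def tokenize_chunk(chunk: str, encode: Callable[[str], List[int]], eos_id: int) -> List[int]:
    """Encode a chunk once, up front, and mark its end with the EOS token."""
    return encode(chunk) + [eos_id]

# Stand-in tokenizer for illustration only: one id per whitespace-separated word.
vocab: Dict[str, int] = {}

def toy_encode(text: str) -> List[int]:
    return [vocab.setdefault(word, len(vocab) + 10) for word in text.split()]

EOS_ID = 2  # hypothetical id; read the real one from the tokenizer

row = normalize_row({"text": " Once upon a time "}, text_column="text")
ids = tokenize_chunk(row["chunk"], toy_encode, EOS_ID)
```

Keeping the column name as a parameter is what lets a second data set with a differently named text field flow through the same pipeline.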

The trainer loads rows at random and keeps packing chunks into a block until the next one would no longer fit, then pads the rest of the block. Every sequence in a PyTorch batch tensor must be the same length, so each block is padded out to the full block size.
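Here is a minimal sketch of that packing step in plain Python. It is not the trainer's actual code: `pack_block` is a hypothetical name, truncating oversized chunks is my assumption, and the returned mask (1 for real tokens, 0 for padding) is the usual way to keep padding out of the loss rather than something confirmed by the source.

```python
import random
from typing import List, Tuple

def pack_block(chunks: List[List[int]], block_size: int, pad_id: int,
               rng: random.Random) -> Tuple[List[int], List[int]]:
    """Fill one block with randomly chosen chunks, then pad to block_size.

    Returns (tokens, mask), where mask is 1 for real tokens and 0 for padding.
    """
    tokens: List[int] = []
    for index in rng.sample(range(len(chunks)), len(chunks)):
        chunk = chunks[index][:block_size]  # assumption: truncate oversized chunks
        if len(tokens) + len(chunk) > block_size:
            break  # the next chunk would overflow the block
        tokens.extend(chunk)
    real = len(tokens)
    tokens.extend([pad_id] * (block_size - real))
    mask = [1] * real + [0] * (block_size - real)
    return tokens, mask

# Three already-tokenized chunks, each ending in a made-up EOS id of 2.
chunks = [[5, 6, 7, 2], [8, 9, 2], [3, 4, 5, 6, 7, 2]]
tokens, mask = pack_block(chunks, block_size=8, pad_id=0, rng=random.Random(0))
```

With chunks of 150 to 200 tokens and a block size of 512, a block would typically hold two or three stories plus some padding.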

Testing

Checkpoints are saved every 50 steps, and I have a simple REPL that allows me to test inference while the training is occurring.
I test with the same prompt at checkpoints where the number of training steps doubles each time (500, 1,000, 2,000, and so on), in order to sample the coherence of the model. When it's coherent enough, the training can end. When is it coherent enough? There's not really a good automated measure for that. Reaching a loss plateau is a kind of indicator, but prompt probing is more fun.
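The doubling schedule and the loss-plateau indicator are both easy to sketch. The helper names, the window size, and the tolerance are all hypothetical choices for illustration. The source does not describe any automated plateau check.

```python
from typing import List

def probe_steps(first: int, last: int) -> List[int]:
    """Steps at which to run the fixed prompt, doubling each time."""
    steps = []
    step = first
    while step <= last:
        steps.append(step)
        step *= 2
    return steps

def has_plateaued(losses: List[float], window: int, tolerance: float) -> bool:
    """True when the mean loss of the latest window improved by less than tolerance."""
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    previous = sum(losses[-2 * window:-window]) / window
    latest = sum(losses[-window:]) / window
    return previous - latest < tolerance

schedule = probe_steps(first=500, last=40_000)
print(schedule)  # prints [500, 1000, 2000, 4000, 8000, 16000, 32000]
```

Every step in the schedule is a multiple of 50, so each probe lines up with a saved checkpoint.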

And now, without further ado, the many iterations of:

Once upon a time there was a boy and his dog.

500 training steps

At 500 training steps the model seems to be making reasonable choices for how to continue the story. To a point. "in his way to the tree" should be "on his way to the tree", but in/on are at least both prepositions! "He had a special" makes me think of the LEGO movie. That's nice.

[You] Once upon a time there was a boy and his dog.
[Scout] He was very beautiful and loved to go around. One day, he saw a big box in his way to the tree. He thought it was so excited and he was very excited to be the old and couldn't believe what it was very happy.
So he saw a few place of the big tree. He looked and saw a big tree and couldn't have a lot of toys. It was so excited. The little girl went and was walking and said it was very excited.
The little girl stopped and told the animals. He was happy and knew he was able to go to the sky. He took a loud noise and could not take it and his mom. He was so excited to be better and he was a nice he was so happy. 
The little boy was so excited and he was playing in the park. He was so proud of it that he had to be better. He was so happy. He was so happy that he said he had some of his head.
The little girl saw a big smile in the sky. He was so excited! He had the ball and he was the end. He was so excited he had a special!

1,000 training steps

1,000 steps isn't that much better than 500. How do you get a big smile on a toy garden? Maybe it's a sticker. A lot of characters are being looped in, using proper nouns in place of "little boy" or "little girl". That's interesting.

[You] Once upon a time there was a boy and his dog.
[Scout] He was very excited and he had a big smile on his toy garden. One day he was playing with his friend a little boy named Max. Max was very excited and said he was going to the room and he was sad.
Billy was very happy. He took out a stick and showed that it was a brave little boy. He was very happy and he kept walking around and he stopped.
Tommy was so sad, but he knew he could have something. He put the hat on his head and walked around to the park. 
Bob was very sad, but he was too excited. He jumped up and started to see what he was playing. He was scared and didn't know what to do. 
Suddenly, Timmy started to cry. He was so surprised! He knew he had to go home to the store. 
Jack was so happy and he had a great idea.

2,000 training steps

At 2,000 steps we have a story with an introduction, a frame around the day ("one day"), and an ending "Thank you...!" The "boy" in the prompt continues to be a boy throughout the story, including male pronouns. There are still syntax errors though, e.g. "to pick it with it", but some people talk that way. The "store" is a narrative element that continues for several lines. And then the "swings" come out of nowhere. And what's the "doll"? But there's obviously progress.

[You] Once upon a time there was a boy and his dog.
[Scout] He loved to pick up a beautiful water with his family.
One day, he went to the park with his dad. He saw a small, shiny toy. He was so excited! He wanted to pick it with it, so he ran to the store.
He saw a big tree with lots of toys and creek. He ran and ran until he was finished. 
He started to cry. He felt so sad. He asked his mom why he was so sad. She said he had to come to the store. He asked his mom, "Can I have a toy?" 
The boy replied, "Yes, please. We can have to ask for the toy." So, they said, "Let's go!" 
The boy was so happy to help his mom. He knew he could have some fun on the swings. He said, "No, I can't be careful." 
The boy was very proud of his family. He gave him a hug and said, "Thank you for taking your doll with me!"

4,000 training steps

At 4,000 steps I feel like the model is saying more with less. I start the story with a boy, then the model gives the boy a name. The dog is noticeably absent. The model has decided that the story is actually about the box and its contents rather than the boy's relationship with his dog. But it feels more refined than 2,000 steps.

[You] Once upon a time there was a boy and his dog.
[Scout] His name was Jack and he was very happy. 
One day, Jack's mommy said, "Let's get a big box!" Tom was so excited. He opened the box and his mommy said, "Let's go outside and find a big tree." 
But then he heard a voice. "I'm going to find this tree!" he said.
He asked his mommy. "Oh, I don't know," he said.
His mommy smiled and said, "It's ok, Jack. Let's go find a treasure and see what's inside."
So they went back to his mommy's house and the box was ready for dinner. They went outside, dreaming about the fun adventure in the park.

8,000 training steps

Finally, the dog has entered the scene! There's a feeling in this one like the model is attending to the entire user prompt, which is something noticeably lacking from the previous iteration of the experiment. I'm not sure who "they" are. Did the boy find some friends? Doubling the training steps has definitely increased the coherence of the story. Some of the wording is still a bit awkward, e.g. "I'm tired, but I'm tired." The story has a definite beginning, middle, and end though.

[You] Once upon a time there was a boy and his dog.
[Scout] They were playing in the park when they noticed something on the ground. The boy said to his dog, "Let's go and play!"
So they ran over to the dog. The boy looked at the dog and said, "Look! I'm a bird. I'm so happy to see you!"
The boy said, "Me too! Let's play a game!"
So they played with the dog and laughed. They ran around and had lots of fun.
When the day went on, it was time to go home. But the boy said, "I'm tired, but I'm tired. I'll get out now."
The boy smiled and said, "I'm so glad you found me!"
The boy and his dog went home and went back to the park.

I'm going to take this training cycle to 40,000 steps. It's going to take a little while. Once that's done we should have some interesting results and an indicator for what needs to happen next.

Follow along on GitHub here.

To be continued.

If you'd like to support Scout's continued development, you can find me on Ko-fi.
