Project Repository
Quick updates from Mustafa
Current Demo (unfortunately it crashes if there's enough traffic)
Hey everyone, it's me again. Hope everyone's had a great week since the last Devblog. Feels like an eternity ago. We've just finished the first week of nights&weekends and dropped our first demo! Unfortunately, since I could not get AWS free tier cloud hosting working, we'll have to live with an extremely sketchy Streamlit-hosted web app that crashes if more than 5 people access it at a time.

I cannot say working with AWS was a particularly fun experience, especially after I spent a few hours learning it, only to find out at the finish line that the free tier cannot handle PyTorch (which is rather necessary for an ML demo). We truly live in the twilight of cloud computing, where AWS, one of the biggest services out there, simply misses the mark when it comes to ease of use and engineering for error (shoutout to the Human-Computer Interaction course I took this year for allowing me to give a scientific breakdown of why AWS is horrible for early deployment). But it's not just AWS; Heroku really rubbed salt into the wound by removing its free tier. And although I grew to love Streamlit as a great demoing framework, the fact that there is no paid tier while the free tier supports something like 5 users at a time simply breaks my heart. Just when you thought they'd done it all, the Simpsons really came through with another eerily accurate vision of the future.
Going forward, I have some plans for using ONNX and making the demo client-side, so we can host it on the cloud for a lot cheaper.
Before we start discussing the demo itself, I think I need to first make clear that we are not intending on making a chatbot, or a chatbot NPC. Rather, we are making an open-model toolkit that allows any game developer to supply their own language model and have it work in their game. However, we were not able to do that within a week's time while also studying for exams. In the end, we settled on making the v1 demo more of a demonstration for us than for users. To users, it is basically a chatbot that speaks like Nick Wilde from Zootopia and has an extreme obsession with rabbits (to be fair, half of his dialogue in the movie is directed at Judy); to us, however, it is a test bench where we gauge our progress, fuck around and see what sticks.
Mustafa had the most important job going into this: figuring out how we can make the magic happen. He had previous experience working with LLMs, which is an area I am still reading up on. Due to the time constraints, he chose to focus on fine-tuning GPT-2 on Zootopia's script, rather than building out the open-model pipeline. Why Zootopia, though? It was the first movie that caught his eye on The Internet Movie Script Database. To be honest, I could not have made a better choice myself.
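For the curious, the fine-tuning step itself follows the standard Hugging Face causal-LM recipe. Here is a minimal sketch of roughly what that looks like; the file name and hyperparameters are illustrative stand-ins, not necessarily what Mustafa actually ran:

```python
# Minimal GPT-2 fine-tuning sketch. Assumes the cleaned Zootopia dialogue
# lives in nick_dialogue.txt (one utterance per line); hyperparameters are
# placeholders, not our actual settings.
from transformers import (GPT2LMHeadModel, GPT2Tokenizer, TextDataset,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Standard causal-LM setup: chop the text into fixed-size blocks and let
# the model learn to continue them.
dataset = TextDataset(tokenizer=tokenizer,
                      file_path="nick_dialogue.txt",
                      block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nick-gpt2",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("nick-gpt2")
```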
Fine-tuning the language model, as with most AI-related work, has a fair amount of prerequisite work. As part of the toolkit, we wanted users to be able to create characters based on their dialogue. For this, Mustafa created a script to go on IMSDb and scrape the movie requested by the user (rough shape sketched below). The data then has to be "cleaned" and processed for use. This proved fairly difficult, as there are multiple formatting inconsistencies that made it hard to create a generalized script to extract what we needed. It also did not help that we had an exam on Tuesday. In the end, we chose to hard-code a few things, and maybe move away from film scripts. At the moment, we are looking into a paper that was released about 2 weeks ago called "Generative Agents". I will go into a little more detail later when discussing future plans.
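The scraper's rough shape looks something like this, assuming requests and BeautifulSoup; the helper name and the exact title-to-URL convention are my own illustration:

```python
# Hypothetical sketch of the IMSDb scraping step. IMSDb appears to serve
# scripts at URLs like /scripts/Zootopia.html with the text in a <pre> tag,
# but the URL convention here is an assumption.
import requests
from bs4 import BeautifulSoup

def fetch_imsdb_script(title: str) -> str:
    url = f"https://imsdb.com/scripts/{title.replace(' ', '-')}.html"
    page = requests.get(url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    pre = soup.find("pre")  # the script body, when the page has one
    return pre.get_text() if pre else ""

raw_script = fetch_imsdb_script("Zootopia")
```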
Anyways, this is probably what everyone came here to see: how did it do? To be frank, not great. Nick has an extreme obsession with rabbits, frequently refers to Judy as "Judy Wilde" and does not seem exactly "smart". I suppose it is to be expected, considering Nick's dialogue in the movie is mostly directed at Judy. Although a lot of the responses do sound like something Nick would say, they are certainly not good enough to fool anyone. Another thing I tried to get working is contextual memory, but Nick just seems fairly confused when asked about something I have told him before, so for the sake of stability, I opted not to include it in the demo.

Of course, it is very easy to blame GPT-2 here, and we will, but we will also acknowledge that maybe the data we gave it could be better. Maybe we need to include the question along with its response, not just the response itself; maybe we need to include a little more information. We honestly have not tested this enough yet, but it is certainly underway now that I've finished with everything else. One thing I do want to try is GPT-3, since that model is a lot more powerful than GPT-2, as well as other language models that are easy to find out there. I also don't want to focus the model on a single character this time around, but rather have all characters start from a blank slate; when we speak to them, they are supplied with the necessary context as well as personality traits. This will depend heavily on how good the model is, but it's definitely worth a shot.
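To make that question-plus-response idea concrete, here is roughly the data fix we have in mind, assuming the cleaned script has already been parsed into (speaker, line) tuples; the function itself is hypothetical:

```python
# Pair each of Nick's lines with the line he is replying to, so the model
# sees the question and its response together instead of the bare response.
def build_pairs(lines, character="NICK"):
    pairs = []
    for (prev_speaker, prev_line), (cur_speaker, cur_line) in zip(lines, lines[1:]):
        if cur_speaker == character and prev_speaker != character:
            pairs.append(f"{prev_speaker}: {prev_line}\n{character}: {cur_line}")
    return pairs

# e.g. build_pairs([("JUDY", "You're under arrest!"),
#                   ("NICK", "For what, hurting your feelings?")])
```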
But let's not take too much away from Nick here; as I've shown, Nick does retain some context. He knows his name and the fact that he's a fox. That's definitely something cool.
So what are we going to do in the future? The first thing is for me to make sure we can get the demo out there. One thing I am currently looking into is ONNX, which I heard about so often during daily standups back at my first internship. With it, we can take the PyTorch model and convert it into the ONNX format, which we can then run inference with in the browser, shedding the cloud-hosting overhead of the PyTorch leviathan. It's quite insane to think that, as new as AI is, we've already figured out how to run it in the browser! It's exciting to put a name that sounds that cool on my resume.
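The conversion itself is deceptively short. A sketch, assuming the fine-tuned model from earlier; the opset version and axis names are my guesses at sensible defaults, and Hugging Face ships its own ONNX export tooling that may handle the fiddly bits better than this hand-rolled version:

```python
# Export the fine-tuned GPT-2 to ONNX so the browser can run inference.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("nick-gpt2")
model.eval()
model.config.use_cache = False     # skip past-key-value outputs for a simpler graph
model.config.return_dict = False   # plain tuple output traces more cleanly

# Trace with a dummy (batch, sequence) tensor of token ids.
dummy_input = torch.randint(0, 50257, (1, 64))
torch.onnx.export(
    model,
    (dummy_input,),
    "nick-gpt2.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    # Dynamic axes so variable-length prompts work at inference time.
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"},
                  "logits": {0: "batch", 1: "sequence"}},
    opset_version=14,
)
```

From there, a browser runtime like onnxruntime-web should be able to load the graph client-side, which is exactly the cheap hosting story we want.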
Second of all, I will be figuring out how to store the contextual memories for characters, as well as the characters themselves. Apparently ChromaDB is great for this, and it can run locally too, so I am looking into that.
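From my early poking around, the shape of it would be something like this, using ChromaDB's default in-process client and built-in embedding function; the collection name and documents are made up:

```python
# Store what the player tells a character as retrievable "memories".
import chromadb

client = chromadb.Client()  # runs locally, in-process
memories = client.create_collection(name="nick_memories")

# Each fact becomes a document; metadata lets us keep memories per character.
memories.add(
    documents=["The player said their favourite food is blueberries."],
    metadatas=[{"character": "nick"}],
    ids=["mem-001"],
)

# At reply time, pull the memories most similar to what was just said.
recalled = memories.query(query_texts=["What food do I like?"], n_results=3)
```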
Outside of all that, I am reading up on LLMs and Transformers, while also looking into research on simulating human behaviour and invoking a "response" using speech. Luckily for me, "Generative Agents: Interactive Simulacra of Human Behavior" came out fairly recently, and contains a lot of inspiration for us. The agent architecture discussed in the paper has a memory stream, which allows an agent to plan, reflect on itself, and act on certain perceptions. For example, when the agent is asked about a certain topic, it can retrieve relevant memories to "condition its response to the situation". This certainly aligns with our goals, as we want a player to be able to "convince" an NPC to do something for them. ChromaDB would also be a good fit here, as we can encode memories as contextual embeddings. To be honest, I feel like a kid in a candy store just thinking about the potential.
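Gluing those two ideas together, memory-conditioned replies could look like the sketch below, picking up the memories collection from earlier; build_prompt and the trait string are hypothetical, and unlike the paper, this ranks memories by relevance alone rather than relevance, recency, and importance:

```python
# Condition a character's reply on retrieved memories plus personality traits.
def build_prompt(memories, traits, player_line, n_results=3):
    recalled = memories.query(query_texts=[player_line], n_results=n_results)
    context = "\n".join(recalled["documents"][0])
    return (f"You are Nick Wilde. Traits: {traits}.\n"
            f"Relevant memories:\n{context}\n"
            f"Player: {player_line}\nNick:")

prompt = build_prompt(memories, "sly, sarcastic, secretly kind-hearted",
                      "Do you remember what food I like?")
# The prompt then goes to whatever language model the developer supplied.
```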
Mustafa and I both really want to have our work speak for us, so for our next demo, we will be attempting to make a simple game using the toolkit. After all, we are not making a chatbot, but rather a game dev toolkit. A great friend of mine suggested an "interrogation game", where you need to figure out if a suspect is lying, find cracks in their story, and use that to force them to tell you the truth. Essentially, it's L.A. Noire meets Phoenix Wright, but without the preset dialogue prompts. On a quick sidenote here: if you have not played the Phoenix Wright trilogy, make sure you do. This is quite a difficult task to accomplish in a week, but it will definitely get the word out there.
I seriously don't want to overstay my welcome, so I'll close it here very briefly. Thank you all for taking the time to read what I have to say about this project. Even though it's been part-time due to school, it's been a blast to see our idea taking form like this. I really hope I was able to convey that sense of awe and wonder in this honestly really long devblog. If you did enjoy my rambling, you can expect DevBlog #2 next week as well. I really took my time on this one and dropped it later than I promised due to exams. Really sorry about that, so again, if you are not a big fan of waiting on me, you can definitely check out Mustafa's Twitter, where he posts updates as soon as they roll out. If you know anyone who might be interested in this project, share this with them. We would appreciate any feedback and help we can get from the community. After all, our Human-Computer Interaction class did teach us that "We are not the user".
For all of the shortcomings our demo has shown, both of us, while disappointed, felt a lot more at ease. As bad as it is to see your brainchild fall short, we managed to find a lot of problems with it, as well as potential fixes. To be honest, that insight is more valuable than anything. So, in my first attempt at ending a devblog with a banger quote from yours truly: it is a lot more reassuring to see your stuff fail than succeed.
n&w update 8: and we're live! access a little demo of the toolkit here: https://t.co/aULlq5F5L2
— Mustafa Tariq (@mustafa_tariqk) April 17, 2023
they say if you're not embarrassed by the first version of your product then you're too late. does being too embarrassed mean we're super early? let's see haha