<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Giulio</title>
    <description>The latest articles on DEV Community by Giulio (@giubots).</description>
    <link>https://dev.to/giubots</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1147455%2F3dd69c9d-bc47-4fdc-93f9-01d936847508.jpg</url>
      <title>DEV Community: Giulio</title>
      <link>https://dev.to/giubots</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/giubots"/>
    <language>en</language>
    <item>
      <title>Three Tips for Your Next (Software) Demo</title>
      <dc:creator>Giulio</dc:creator>
      <pubDate>Sun, 28 Apr 2024 17:51:56 +0000</pubDate>
      <link>https://dev.to/giubots/three-tips-for-your-next-software-demo-3p3d</link>
      <guid>https://dev.to/giubots/three-tips-for-your-next-software-demo-3p3d</guid>
      <description>&lt;p&gt;&lt;strong&gt;Implementing something is always only half of the work; the rest is, well... &lt;em&gt;showtime&lt;/em&gt;!&lt;/strong&gt; An exciting demo can make the difference between inspiring the world with our creations and not even being noticed. Here are three tips we learned from participating in the &lt;a href="https://fti.vlaanderen/" rel="noopener noreferrer"&gt;Flanders Technology &amp;amp; Innovation&lt;/a&gt; festival in Antwerp, in March 2024.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/OzSG4oxSnKM"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  #1: Tailor to the audience
&lt;/h2&gt;

&lt;p&gt;Pulling off a successful demo is not easy, especially when the details of your work matter and the environment is not in your favour. We (&lt;a href="https://airo.ugent.be/projects/socialrobotics/" rel="noopener noreferrer"&gt;the AIRO Social Robotics group!&lt;/a&gt;) found ourselves in this very situation at the &lt;a href="https://fti.vlaanderen/" rel="noopener noreferrer"&gt;FTI&lt;/a&gt; science fair in Antwerp. Very briefly, the mission was to introduce the public to large language models and social robots, so we displayed two &lt;a href="https://furhatrobotics.com/" rel="noopener noreferrer"&gt;Furhat&lt;/a&gt; robots having an enjoyable conversation with each other about a topic chosen by the spectators.&lt;/p&gt;

&lt;p&gt;There are thousands of little nerdy things we wanted to tell people: all the small challenges we had to overcome to create the demo, the things we learned, how the technology works, and so on. But knowing that the event targeted curious (not tech-savvy) people and families, we held back. We also knew that the format was a demo stand, so we would have to compete with other stands for the crowd's attention. We greatly simplified the setup, down to the bare minimum: two robots, a topic, a conversation created by AI. Simple, easy to explain, and with a nice novelty effect.&lt;/p&gt;

&lt;h2&gt;
  
  
  #2: Think about the setting
&lt;/h2&gt;

&lt;p&gt;Forgetting this can lead to disastrous consequences. Planning to use a microphone in a noisy environment, relying on a projector or a monitor in daylight, expecting perfect Wi-Fi coverage: these are all easy mistakes to make if you don't think about the environment in advance.&lt;/p&gt;

&lt;p&gt;In our case, the demo relied heavily on people getting fascinated by the creative ways a large language model can put together a sound debate on a topic of choice. Among other things, foreseeing a bustling environment, we decided to display the dialogue on a monitor, so people could follow along and enjoy the show.&lt;/p&gt;

&lt;p&gt;We implemented this interface as an easy-to-use web application, detached from the code running the demo. We are planning to use it again in future demos and it's available open-source on &lt;a href="https://github.com/giubots/didisplay/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;! Since we were unsure whether the demo would be displayed on a monitor or projected onto a wall, we tried to make the text as clear as possible and we included both a light and dark theme, for optimal legibility in any light condition; at the same time, we added some futuristic-looking effects to attract people's attention. Here is how it looks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fgiubots%2Fdidisplay%2Fmaster%2Fscreenshots%2Fchat.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fgiubots%2Fdidisplay%2Fmaster%2Fscreenshots%2Fchat.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  #3: Do not forget the brand
&lt;/h2&gt;

&lt;p&gt;Everyone has a brand: the company backing your work, your university, an institution, or even just your name. We shouldn't be afraid of putting our &lt;em&gt;signature&lt;/em&gt; on our work. It can give authority to the demo, and contribute to attracting people's attention; overall, it helps to tell the background story of your work, and people love stories! We included our university and lab logos in the top left of the interface. While not fundamental for a good demo, this is something that is easy to forget but can add a professional touch to your work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;Setting up a demo for this year's &lt;a href="https://fti.vlaanderen/" rel="noopener noreferrer"&gt;FTI&lt;/a&gt; festival in Antwerp allowed us to reflect on three points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tailor your demo to the audience;&lt;/li&gt;
&lt;li&gt;Think in advance about the setting of your demo;&lt;/li&gt;
&lt;li&gt;Don't forget your brand.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For our demo, we wrote a simple stand-alone application to display conversation messages, easy to use and eye-catching. You can check it out on &lt;a href="https://github.com/giubots/didisplay/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;! Hopefully you found these three simple tips helpful. Good luck with your future demos!&lt;/p&gt;

</description>
      <category>phd</category>
      <category>tips</category>
      <category>demo</category>
    </item>
    <item>
      <title>Implementing Vision-Powered Chit-Chats with Robots: A GPT-4 Adventure 🤖👀</title>
      <dc:creator>Giulio</dc:creator>
      <pubDate>Fri, 17 Nov 2023 18:55:56 +0000</pubDate>
      <link>https://dev.to/giubots/implementing-vision-powered-chit-chats-with-robots-a-gpt-4-adventure-5fhg</link>
      <guid>https://dev.to/giubots/implementing-vision-powered-chit-chats-with-robots-a-gpt-4-adventure-5fhg</guid>
      <description>&lt;p&gt;Imagine a world where your favourite chatbot or social robot isn't just responding to text-based inputs but is also getting a real-time visual sneak peek into the conversation. Exciting, right? Well, we implemented just that with the help of GPT-4, and I'll explain how you can do it too! But first, here's a video showing the final result:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/ihl3zNr2H3E"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Check out our paper &lt;a href="https://doi.org/10.48550/arXiv.2311.08957" rel="noopener noreferrer"&gt;I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in Social Robots&lt;/a&gt; for more details.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this short adventure, we'll explore how to use large language models and live visual input from a webcam, mix them in an effective prompt, and summarise this to make it run faster and cheaper. We'll be creating a conversational experience that's actually context-aware. Want to dive straight into the code and try it yourself with a webcam or a Furhat robot? &lt;a href="https://github.com/giubots/vision-enabled-dialogue" rel="noopener noreferrer"&gt;Here is the repo&lt;/a&gt;. Ready to start? Let's go!&lt;/p&gt;

&lt;h2&gt;
  
  
  🖼️ GPT-4 and Images
&lt;/h2&gt;

&lt;p&gt;To start, you'll need an &lt;a href="https://openai.com/" rel="noopener noreferrer"&gt;OpenAI account&lt;/a&gt; and to get yourself an API key. I know... I would have liked an open-source alternative too, but we've tried IDEFICS and LLaVA without good results. So GPT-4 it is for now!&lt;/p&gt;

&lt;p&gt;We'll be using Python: run &lt;code&gt;pip install openai opencv-python&lt;/code&gt; to get the libraries we need. Here are a few lines of code to get you started with GPT-4 vision.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR-KEY-HERE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4-vision-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  📜 The Prompt
&lt;/h2&gt;

&lt;p&gt;The prompt that you want to send to GPT-4 has a somewhat complex structure, but this is what has worked reliably so far: basically, it's an array of messages. As you probably already know, GPT-4 supports different kinds of messages. Here's a quick overview.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;System Message&lt;/strong&gt; instructs the model on how to behave. It has this structure:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;To add &lt;strong&gt;text&lt;/strong&gt; from the user, or &lt;code&gt;base64&lt;/code&gt; &lt;strong&gt;images&lt;/strong&gt; (more on how to load images below), you'll want to use something like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data:image/jpeg;base64,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And finally, when you want to incorporate GPT-4's &lt;strong&gt;responses&lt;/strong&gt; into the prompt, this will be how:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_assistant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now, to put together all the text and images, you have different options. It is important to keep the right ordering of the elements, and the easiest way to do that is a big list. The &lt;a href="https://github.com/giubots/vision-enabled-dialogue" rel="noopener noreferrer"&gt;repo with the code&lt;/a&gt; of this project contains a &lt;code&gt;Conversation&lt;/code&gt; class that does this (and other things too, more on this later).&lt;/p&gt;
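As an illustration of that ordering, here is a minimal sketch of such a list, with the formatter helpers repeated so the snippet stands alone; the base64 placeholder is hypothetical, not real image data:

```python
def format_system(content):
    return {"role": "system", "content": content}

def format_text(content):
    return {"role": "user", "content": [{"type": "text", "text": content}]}

def format_image(content):
    return {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64," + content},
            }
        ],
    }

# System message first, then images and text in the order they occurred.
prompt = [
    format_system("You are impersonating a friendly kid."),
    format_image("PLACEHOLDER"),  # hypothetical base64 string, not a real image
    format_text("What do you see right now?"),
]
```

A list like this can go straight into the `messages` field of the request.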
&lt;h2&gt;
  
  
  📷 Taking pictures
&lt;/h2&gt;

&lt;p&gt;In this example, during the conversation with the system, we're going to incorporate images in our prompt by taking a picture with the webcam at the beginning of the user's turn. In the &lt;a href="https://github.com/giubots/vision-enabled-dialogue" rel="noopener noreferrer"&gt;repo&lt;/a&gt; you will find how to continuously take snapshots during the conversation at fixed intervals, load a video, or use a Furhat robot as the video source. Here, we will just open the webcam, take a pic, encode it into a string, close the webcam, and return the string.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_image&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;vid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;VideoCapture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imencode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;string64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;vid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;release&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;string64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  🪄 The System Prompt
&lt;/h2&gt;

&lt;p&gt;Awesome! We have all the components ready... except one: the system prompt. We have to tell GPT-4 how to interpret the images that we send, and how to respond. This takes patience, time, many trials, and a bit of prompt-engineering magic. Let's cut to the chase and have a peek at the prompt that gave us the results we liked the most.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are impersonating a friendly kid. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;In this conversation, what you see is represented by the images. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;For example, the images will show you the environment you are in and possibly the person you are talking to. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Try to start the conversation by saying something about the person you are talking to if there is one, based on accessories, clothes, etc. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If there is no person, try to say something about the environment, but do not describe the environment! &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Have a nice conversation and try to be curious! &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;It is important that you keep your answers short and to the point. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DO NOT INCLUDE EMOTICONS OR SMILEYS IN YOUR ANSWERS. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;As you can see, we ask the model to impersonate a &lt;em&gt;friendly kid&lt;/em&gt;; it sounds strange, but this removes most of those annoying warnings and disclaimers from the output of GPT-4. Then we tell the model that the images are what it sees, and that it would be nice to start the conversation by saying something nice about what it sees. GPT-4 will try hard to describe everything it sees, and we don't want that; we also don't want the model to ramble on forever, so we tell it not to. Finally, the friendly-kid persona that we summoned loves putting emojis in its answers; they're of no use to us, so we ask it not to include them, in uppercase, just to make it extra clear and loud.&lt;/p&gt;
&lt;h2&gt;
  
  
  🧩 Put Everything Together
&lt;/h2&gt;

&lt;p&gt;Let's glue all of this together, shall we?&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
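The embedded gist doesn't survive in this feed, so here is a hedged sketch of how the glue code could look; `run_turn`, `capture_fn`, and `query_fn` are illustrative names (not from the original gist), and the commented wiring at the bottom assumes the `format_system`, `get_image`, and `query` helpers from the snippets above:

```python
def format_text(content):
    return {"role": "user", "content": [{"type": "text", "text": content}]}

def format_image(content):
    return {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64," + content},
            }
        ],
    }

def format_assistant(content):
    return {"role": "assistant", "content": [{"type": "text", "text": content}]}

def run_turn(messages, user_text, capture_fn, query_fn):
    """One turn: snapshot what the camera sees, add the user's text, get a reply."""
    messages.append(format_image(capture_fn()))  # what the model "sees" right now
    messages.append(format_text(user_text))      # what the user just said
    reply = query_fn(messages)                   # ask GPT-4
    messages.append(format_assistant(reply))     # keep the reply in the history
    return reply

# Hypothetical wiring with the helpers defined earlier:
#
# messages = [format_system(system)]
# while True:
#     print(run_turn(messages, input("You: "), get_image, query))
```

Factoring the turn logic into `run_turn` keeps the webcam and the API behind function parameters, so you can swap in a video file or a Furhat feed without touching the loop.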



&lt;p&gt;Ta-daa! An infinite loop and it's done! Save this with a nice name like &lt;code&gt;main.py&lt;/code&gt; and run it with &lt;code&gt;python main.py&lt;/code&gt;. Fingers crossed: if everything goes well, you'll be taking pics from the webcam and having a nice chat about them. Nice, isn't it? Have fun exploring what happens when you turn off the lights and how GPT-4 responds to the weirdest scenarios. Be sure to follow OpenAI's terms of use and keep an eye on the bill, as sending a lot of full-resolution pictures can get pricey.&lt;/p&gt;

&lt;p&gt;As mentioned, in the &lt;a href="https://github.com/giubots/vision-enabled-dialogue" rel="noopener noreferrer"&gt;repo&lt;/a&gt; you can find a version that continuously captures frames from a webcam, a video, or a Furhat robot.&lt;/p&gt;

&lt;h2&gt;
  
  
  ✂️ Cut the prompt size
&lt;/h2&gt;

&lt;p&gt;You'll quickly notice that your prompt gets too big, slowing down responses and driving up costs. No good. To solve that, we thought of doing what's done with normal dialogue prompts: ask the LLM to summarise the first part of the conversation!&lt;/p&gt;

&lt;p&gt;But we can't summarise images and dialogue together: &lt;em&gt;a picture is worth a thousand words&lt;/em&gt;, and our dialogue would virtually disappear in a sea of image descriptions. Remember when I told you that the &lt;code&gt;Conversation&lt;/code&gt; class in the &lt;a href="https://github.com/giubots/vision-enabled-dialogue" rel="noopener noreferrer"&gt;repo&lt;/a&gt; was doing other things too? Well, when the prompt gets too long, this class asks GPT-4 to summarise some of the images in it. It scans the messages list, finds the first &lt;em&gt;n&lt;/em&gt; consecutive images, and substitutes them with a summary. If you are interested, &lt;a href="https://doi.org/10.48550/arXiv.2311.08957" rel="noopener noreferrer"&gt;this paper&lt;/a&gt; contains more details.&lt;/p&gt;

&lt;p&gt;
  Here is the code that we used in the &lt;code&gt;Conversation&lt;/code&gt; class.
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_fr_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Summarise the frames and return the new messages and the number of frames removed.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# fr_buff_size is the max number of images (frames) in the prompt
&lt;/span&gt;    &lt;span class="c1"&gt;# fr_recap is the max number of frames to summarise
&lt;/span&gt;    &lt;span class="c1"&gt;# Assuming number of frames in prompt &amp;gt; fr_buff_size &amp;gt; fr_recap
&lt;/span&gt;
    &lt;span class="c1"&gt;# Find the first frame and the last frame to summarise
&lt;/span&gt;    &lt;span class="n"&gt;first_fr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_frame&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;first_fr&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;first_fr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;

        &lt;span class="c1"&gt;# Include at most fr_recap frames, and stop if we see a user message
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;first_fr&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_user&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;first_fr&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fr_recap&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="c1"&gt;# Split the messages list
&lt;/span&gt;    &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_messages&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;first_fr&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;to_summarise&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;first_fr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate the summary
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;These are frames from a video. Summarise what&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s happening in the video in one sentence. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The frames are preceded by a context to help you summarise the video. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarise only the frames, not the context.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The images can be repeating, this is normal, do not point this out in the description.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Respond with only the summary in one sentence. This is very important. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Do not include warnings or other messages.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;gpt_format&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gpt_format&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;before&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gpt_format&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;to_summarise&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate the new message list with the summary
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;before&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;FSummaryMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;after&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;first_fr&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;h2&gt;
  
  
  🔭 What now?
&lt;/h2&gt;

&lt;p&gt;I hope this journey into combining GPT-4 with real-time visual input has sparked your curiosity! With these building blocks, you can create a truly interactive and context-aware conversational experience. So, what are you waiting for? Dive into the code, explore the fascinating intersection of language and vision, and let your creativity run wild. The future of chatbots and social robots is not just text-based: it's a dynamic fusion of words and images, and you're at the forefront of it. &lt;strong&gt;We'll keep working to improve this approach and to explore new and exciting ways to make conversational agents better&lt;/strong&gt;. Stay tuned! &lt;/p&gt;

</description>
      <category>python</category>
      <category>tutorial</category>
      <category>openai</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
