<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Giulio</title>
    <description>The latest articles on DEV Community by Giulio (@giubots).</description>
    <link>https://dev.to/giubots</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1147455%2F3dd69c9d-bc47-4fdc-93f9-01d936847508.jpg</url>
      <title>DEV Community: Giulio</title>
      <link>https://dev.to/giubots</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/giubots"/>
    <language>en</language>
    <item>
      <title>Three Tips for Your Next (Software) Demo</title>
      <dc:creator>Giulio</dc:creator>
      <pubDate>Sun, 28 Apr 2024 17:51:56 +0000</pubDate>
      <link>https://dev.to/giubots/three-tips-for-your-next-software-demo-3p3d</link>
      <guid>https://dev.to/giubots/three-tips-for-your-next-software-demo-3p3d</guid>
      <description>&lt;p&gt;&lt;strong&gt;Implementing something is always only half of the work; the rest is, well... &lt;em&gt;showtime&lt;/em&gt;!&lt;/strong&gt; An exciting demo can make the difference between inspiring the world with our creations and not even being noticed. Here are three tips we learned from participating in the &lt;a href="https://fti.vlaanderen/" rel="noopener noreferrer"&gt;Flanders Technology &amp;amp; Innovation&lt;/a&gt; festival in Antwerp, in March 2024.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/OzSG4oxSnKM"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  #1: Tailor to the audience
&lt;/h2&gt;

&lt;p&gt;Pulling off a successful demo is not easy, especially when the details of your work matter and the environment is not in your favour. We (&lt;a href="https://airo.ugent.be/projects/socialrobotics/" rel="noopener noreferrer"&gt;the AIRO Social Robotics group!&lt;/a&gt;) found ourselves in this very situation at the &lt;a href="https://fti.vlaanderen/" rel="noopener noreferrer"&gt;FTI&lt;/a&gt; science fair in Antwerp. Very briefly, the mission was to introduce the public to large language models and social robots, so we displayed two &lt;a href="https://furhatrobotics.com/" rel="noopener noreferrer"&gt;Furhat&lt;/a&gt; robots having an enjoyable conversation with each other about a topic chosen by the spectators.&lt;/p&gt;

&lt;p&gt;There are thousands of little nerdy things we wanted to tell people: all the small challenges we had to overcome to create the demo, the things we learned, how the technology works, and so on. But knowing that the event targeted curious (not tech-savvy) people and families, we held back. We also knew that the format was a demo stand, so we would have to compete with other stands for the crowd's attention. We greatly simplified the setup, down to the bare minimum: two robots, a topic, a conversation created by AI. Simple, easy to explain, and with a nice novelty effect.&lt;/p&gt;

&lt;h2&gt;
  
  
  #2: Think about the setting
&lt;/h2&gt;

&lt;p&gt;Forgetting this can lead to disastrous consequences. Planning to use a microphone in a noisy environment, relying on a projector or a monitor in daylight, expecting perfect Wi-Fi coverage: these are all easy mistakes to make if you don't think about the environment in advance.&lt;/p&gt;

&lt;p&gt;In our case, the demo relied heavily on people getting fascinated by the creative ways a large language model can put together a sound debate on a topic of choice. Among other things, foreseeing a bustling environment, we decided to display the dialogue on a monitor, so people could follow along and enjoy the show.&lt;/p&gt;

&lt;p&gt;We implemented this interface as an easy-to-use web application, detached from the code running the demo. We are planning to use it again in future demos and it's available open-source on &lt;a href="https://github.com/giubots/didisplay/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;! Since we were unsure whether the demo would be displayed on a monitor or projected onto a wall, we tried to make the text as clear as possible and we included both a light and dark theme, for optimal legibility in any light condition; at the same time, we added some futuristic-looking effects to attract people's attention. Here is how it looks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fgiubots%2Fdidisplay%2Fmaster%2Fscreenshots%2Fchat.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fgiubots%2Fdidisplay%2Fmaster%2Fscreenshots%2Fchat.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  #3: Do not forget the brand
&lt;/h2&gt;

&lt;p&gt;Everyone has a brand: the company backing your work, your university, an institution, or even just your name. We shouldn't be afraid of putting our &lt;em&gt;signature&lt;/em&gt; on our work. It can give authority to the demo, and contribute to attracting people's attention; overall, it helps to tell the background story of your work, and people love stories! We included our university and lab logos in the top left of the interface. While not fundamental for a good demo, this is something that is easy to forget but can add a professional touch to your work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;Setting up a demo for this year's &lt;a href="https://fti.vlaanderen/" rel="noopener noreferrer"&gt;FTI&lt;/a&gt; festival in Antwerp allowed us to reflect on three points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tailor your demo to the audience;&lt;/li&gt;
&lt;li&gt;Think in advance about the setting of your demo;&lt;/li&gt;
&lt;li&gt;Don't forget your brand.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For our demo, we wrote a simple stand-alone application to display conversation messages, easy to use and eye-catching. You can check it out on &lt;a href="https://github.com/giubots/didisplay/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;! Hopefully you found these three simple tips helpful. Good luck with your future demos!&lt;/p&gt;

</description>
      <category>phd</category>
      <category>tips</category>
      <category>demo</category>
    </item>
    <item>
      <title>Implementing Vision-Powered Chit-Chats with Robots: A GPT-4 Adventure 🤖👀</title>
      <dc:creator>Giulio</dc:creator>
      <pubDate>Fri, 17 Nov 2023 18:55:56 +0000</pubDate>
      <link>https://dev.to/giubots/implementing-vision-powered-chit-chats-with-robots-a-gpt-4-adventure-5fhg</link>
      <guid>https://dev.to/giubots/implementing-vision-powered-chit-chats-with-robots-a-gpt-4-adventure-5fhg</guid>
      <description>&lt;p&gt;Imagine a world where your favourite chatbot or social robot isn't just responding to text-based inputs but is also getting a real-time visual sneak peek into the conversation. Exciting, right? Well, we implemented just that with the help of GPT-4, and I'll explain how you can do it too! But first, here's a video showing the final result:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/ihl3zNr2H3E"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Check out our paper &lt;a href="https://doi.org/10.48550/arXiv.2311.08957" rel="noopener noreferrer"&gt;I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in Social Robots&lt;/a&gt; for more details.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this short adventure, we'll explore how to use large language models and live visual input from a webcam, mix them in an effective prompt, and summarise this to make it run faster and cheaper. We'll be creating a conversational experience that's actually context-aware. Want to dive straight into the code and try it yourself with a webcam or a Furhat robot? &lt;a href="https://github.com/giubots/vision-enabled-dialogue" rel="noopener noreferrer"&gt;Here is the repo&lt;/a&gt;. Ready to start? Let's go!&lt;/p&gt;

&lt;h2&gt;
  
  
  🖼️ GPT-4 and Images
&lt;/h2&gt;

&lt;p&gt;To start, you'll need an &lt;a href="https://openai.com/" rel="noopener noreferrer"&gt;OpenAI account&lt;/a&gt; and to get yourself an API key. I know... I would have liked an open-source alternative too, but we've tried IDEFICS and LLaVA without good results. So GPT-4 it is for now!&lt;/p&gt;

&lt;p&gt;We'll be using Python: run &lt;code&gt;pip install openai opencv-python&lt;/code&gt; to get the libraries we need. Here are a few lines of code to get you started with GPT-4 vision.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR-KEY-HERE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4-vision-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  📜 The Prompt
&lt;/h2&gt;

&lt;p&gt;The prompt that you want to send to GPT-4 has a somewhat complex structure, but this is what has worked reliably so far: basically, it's an array of messages. As you probably already know, GPT-4 supports different kinds of messages. Here's a quick overview.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;System Message&lt;/strong&gt; instructs the model on how to behave. It has this structure:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;To add &lt;strong&gt;text&lt;/strong&gt; from the user, or &lt;code&gt;base64&lt;/code&gt; &lt;strong&gt;images&lt;/strong&gt; (more on how to load images below), you'll want to use something like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data:image/jpeg;base64,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And finally, when you want to incorporate GPT-4's &lt;strong&gt;responses&lt;/strong&gt; into the prompt, this will be how:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_assistant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now, to put together all the text and images, you have different options. It is important to keep the right ordering of the elements, and the easiest way to do that is a big list. The &lt;a href="https://github.com/giubots/vision-enabled-dialogue" rel="noopener noreferrer"&gt;repo with the code&lt;/a&gt; of this project contains a &lt;code&gt;Conversation&lt;/code&gt; class that does this (and other things too, more on this later).&lt;/p&gt;
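As an illustration of that ordering, here is a minimal sketch of such a list, with the formatter helpers repeated so the snippet stands alone; the base64 placeholder is hypothetical, not real image data:

```python
def format_system(content):
    return {"role": "system", "content": content}

def format_text(content):
    return {"role": "user", "content": [{"type": "text", "text": content}]}

def format_image(content):
    return {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64," + content},
            }
        ],
    }

# System message first, then images and text in the order they occurred.
prompt = [
    format_system("You are impersonating a friendly kid."),
    format_image("PLACEHOLDER"),  # hypothetical base64 string, not a real image
    format_text("What do you see right now?"),
]
```

A list like this can go straight into the `messages` field of the request.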
&lt;h2&gt;
  
  
  📷 Taking pictures
&lt;/h2&gt;

&lt;p&gt;In this example, during the conversation with the system, we're going to incorporate images in our prompt by taking a picture with the webcam at the beginning of the user's turn. In the &lt;a href="https://github.com/giubots/vision-enabled-dialogue" rel="noopener noreferrer"&gt;repo&lt;/a&gt; you will find how to continuously take snapshots during the conversation at fixed intervals, load a video, or use a Furhat robot as the video source. Here, we will just open the webcam, take a pic, encode it into a string, close the webcam, and return the string.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_image&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;vid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;VideoCapture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imencode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;string64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;vid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;release&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;string64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  🪄 The System Prompt
&lt;/h2&gt;

&lt;p&gt;Awesome! We have all the components ready... except one: the system prompt. We have to tell GPT-4 how to interpret the images that we send, and how to respond. This takes patience, time, many trials, and a bit of prompt-engineering magic. Let's cut to the chase and have a peek at the prompt that gave us the results we liked the most.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are impersonating a friendly kid. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;In this conversation, what you see is represented by the images. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;For example, the images will show you the environment you are in and possibly the person you are talking to. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Try to start the conversation by saying something about the person you are talking to if there is one, based on accessories, clothes, etc. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If there is no person, try to say something about the environment, but do not describe the environment! &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Have a nice conversation and try to be curious! &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;It is important that you keep your answers short and to the point. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DO NOT INCLUDE EMOTICONS OR SMILEYS IN YOUR ANSWERS. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;As you can see, we ask the model to impersonate a &lt;em&gt;friendly kid&lt;/em&gt;; it sounds strange, but this removes most of those annoying warnings and disclaimers from the output of GPT-4. Then we tell the model that the images are what it sees, and that it would be nice to start the conversation by saying something nice about what it sees. GPT-4 will try hard to describe everything it sees, and we don't want that; we also don't want the model to ramble on forever, so we tell it not to. Finally, the friendly-kid persona that we summoned loves putting emojis in its answers; they're of no use to us, so we ask it not to include them, in uppercase, just to make it extra clear and loud.&lt;/p&gt;
&lt;h2&gt;
  
  
  🧩 Put Everything Together
&lt;/h2&gt;

&lt;p&gt;Let's glue all of this together, shall we?&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
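The embedded gist doesn't survive in this feed, so here is a hedged sketch of how the glue code could look; `run_turn`, `capture_fn`, and `query_fn` are illustrative names (not from the original gist), and the commented wiring at the bottom assumes the `format_system`, `get_image`, and `query` helpers from the snippets above:

```python
def format_text(content):
    return {"role": "user", "content": [{"type": "text", "text": content}]}

def format_image(content):
    return {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64," + content},
            }
        ],
    }

def format_assistant(content):
    return {"role": "assistant", "content": [{"type": "text", "text": content}]}

def run_turn(messages, user_text, capture_fn, query_fn):
    """One turn: snapshot what the camera sees, add the user's text, get a reply."""
    messages.append(format_image(capture_fn()))  # what the model "sees" right now
    messages.append(format_text(user_text))      # what the user just said
    reply = query_fn(messages)                   # ask GPT-4
    messages.append(format_assistant(reply))     # keep the reply in the history
    return reply

# Hypothetical wiring with the helpers defined earlier:
#
# messages = [format_system(system)]
# while True:
#     print(run_turn(messages, input("You: "), get_image, query))
```

Factoring the turn logic into `run_turn` keeps the webcam and the API behind function parameters, so you can swap in a video file or a Furhat feed without touching the loop.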



&lt;p&gt;Ta-daa! An infinite loop and it's done! Save this with a nice name like &lt;code&gt;main.py&lt;/code&gt; and run it with &lt;code&gt;python main.py&lt;/code&gt;. Fingers crossed: if everything goes well, you'll be taking pics from the webcam and having a nice chat about them. Nice, isn't it? Have fun exploring what happens when you turn off the lights and how GPT-4 responds to the weirdest scenarios. Be sure to follow OpenAI's terms of use and keep an eye on the bill, as sending a lot of full-resolution pictures can get pricey.&lt;/p&gt;

&lt;p&gt;As mentioned, in the &lt;a href="https://github.com/giubots/vision-enabled-dialogue" rel="noopener noreferrer"&gt;repo&lt;/a&gt; you can find a version that continuously captures frames from a webcam, a video, or a Furhat robot.&lt;/p&gt;

&lt;h2&gt;
  
  
  ✂️ Cut the prompt size
&lt;/h2&gt;

&lt;p&gt;You'll quickly notice that your prompt gets too big, slowing down responses and driving up costs. No good. To solve that, we thought of doing what's done with normal dialogue prompts: ask the LLM to summarise the first part of the conversation!&lt;/p&gt;

&lt;p&gt;But we can't summarise images and dialogue together: &lt;em&gt;a picture is worth a thousand words&lt;/em&gt;, and our dialogue would virtually disappear in a sea of image descriptions. Remember when I told you that the &lt;code&gt;Conversation&lt;/code&gt; class in the &lt;a href="https://github.com/giubots/vision-enabled-dialogue" rel="noopener noreferrer"&gt;repo&lt;/a&gt; was doing other things too? Well, when the prompt gets too long, this class asks GPT-4 to summarise some of the images in it. It scans the messages list, finds the first &lt;em&gt;n&lt;/em&gt; consecutive images, and substitutes them with a summary. If you are interested, &lt;a href="https://doi.org/10.48550/arXiv.2311.08957" rel="noopener noreferrer"&gt;this paper&lt;/a&gt; contains more details.&lt;/p&gt;

&lt;p&gt;
  Here is the code that we used in the &lt;code&gt;Conversation&lt;/code&gt; class.
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_fr_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Summarise the frames and return the new messages and the number of frames removed.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# fr_buff_size is the max number of images (frames) in the prompt
&lt;/span&gt;    &lt;span class="c1"&gt;# fr_recap is the max number of frames to summarise
&lt;/span&gt;    &lt;span class="c1"&gt;# Assuming number of frames in prompt &amp;gt; fr_buff_size &amp;gt; fr_recap
&lt;/span&gt;
    &lt;span class="c1"&gt;# Find the first frame and the last frame to summarise
&lt;/span&gt;    &lt;span class="n"&gt;first_fr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_frame&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;first_fr&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;first_fr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;

        &lt;span class="c1"&gt;# Include at most fr_recap frames, and stop if we see a user message
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;first_fr&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_user&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;first_fr&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fr_recap&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="c1"&gt;# Split the messages list
&lt;/span&gt;    &lt;span class="n"&gt;before&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_messages&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;first_fr&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;to_summarise&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;first_fr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate the summary
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;These are frames from a video. Summarise what&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s happening in the video in one sentence. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The frames are preceded by a context to help you summarise the video. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarise only the frames, not the context.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The images can be repeating, this is normal, do not point this out in the description.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Respond with only the summary in one sentence. This is very important. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Do not include warnings or other messages.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;gpt_format&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gpt_format&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;before&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gpt_format&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;to_summarise&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate the new message list with the summary
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;before&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;FSummaryMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;after&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;first_fr&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;h2&gt;
  
  
  🔭 What now?
&lt;/h2&gt;

&lt;p&gt;I hope this journey into combining GPT-4 with real-time visual input has sparked your curiosity! With these building blocks, you can create a truly interactive and context-aware conversational experience. So, what are you waiting for? Dive into the code, explore the fascinating intersection of language and vision, and let your creativity run wild. The future of chatbots and social robots is not just text-based: it's a dynamic fusion of words and images, and you're at the forefront of it. &lt;strong&gt;We'll keep working to improve this approach and to explore new and exciting ways to make conversational agents better&lt;/strong&gt;. Stay tuned! &lt;/p&gt;

</description>
      <category>python</category>
      <category>tutorial</category>
      <category>openai</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
