
Shanelle

Testing GPT-4 Vision with Lightrail

Intro

At its recent Developer Day, OpenAI announced several exciting updates to ChatGPT and its sister products. In addition to faster APIs and the GPT App Store, they also updated the GPT-4 model so that it can take images as input. Previously, GPT-4 was a text-only model; now it accepts both text and images as input and outputs text in response. As you can imagine, this opens up a variety of exciting applications for its users.
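To make that concrete, sending an image to the model is just a regular chat completion with image content in the message. Here's a minimal sketch, assuming the OpenAI Python SDK (v1.x), an API key in your environment, and the vision-capable model name that was current at DevDay; the image URL is a placeholder.

```python
# Minimal sketch of an image-input request (assumes openai>=1.x and
# OPENAI_API_KEY set in the environment; model name may change over time).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # vision-capable GPT-4 model announced at DevDay
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/screenshot.png"},  # placeholder
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```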

Today, I wanted to walk through some interesting applications of GPT-4 Vision with Lightrail, an open-source AI command center for developers that I’ve been contributing to. It’s a desktop app with an always-on ChatGPT instance that integrates with apps like Jupyter, VSCode, & Chrome. I’ve personally found it an easier way to access an LLM and automatically give it the necessary context in my queries without constantly copying & pasting into ChatGPT.

Generate Code from Screenshots

With the plethora of coding assistants on the market today, you can already use LLMs to generate code from text instructions. For example, you can instruct GitHub Copilot to create a SignUp button in React or bootstrap a new microservice for an enterprise-scale codebase. However, if you wanted to clone a screenshot or replicate the color & component styling of one of your favorite websites, it was difficult to give ChatGPT or other coding assistants the necessary context.

Now, with GPT-4 Vision and Lightrail, you can take a screenshot and describe what you would like to build or how you would like to modify the screenshot, and Lightrail can generate the relevant code right in your VSCode editor. The video below shows how a freelance frontend engineer might use Lightrail to tweak the styling of a React component based on a client request from a screenshot.
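Under the hood, a screenshot-to-code request looks roughly like the sketch below. Lightrail handles the screenshot capture and editor integration for you; the file name, prompt, and model name here are illustrative assumptions, not Lightrail's actual internals.

```python
# Hypothetical sketch of the screenshot-to-code flow: encode a local
# screenshot, ask the vision model for matching JSX/CSS, then drop the
# result into your editor.
import base64
from openai import OpenAI

client = OpenAI()

# "client_mockup.png" is a made-up example file name.
with open("client_mockup.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Update this React button component so it matches the "
                        "styling in the attached screenshot (colors, border "
                        "radius, spacing). Return only the updated JSX and CSS."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=1000,
)

print(response.choices[0].message.content)
```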

Ask Q’s about an Image

Beyond using Lightrail’s vision feature to generate code, you can feed in images and ask questions about them. Think of it as a more powerful Google Lens that can be accessed cross-application. Some of my favorite applications include analyzing charts & graphs, interpreting content in a PDF, and getting feedback on presentations or UX mockups.

What’s next?

With the latest OpenAI release, it feels like every day we’re getting closer to a future of persistent AI assistants that can ingest and interpret content in much the same way we do. Imagine having a helpful voice on your shoulder as you navigate the digital world. With Lightrail, you can already access content from across your apps and save long-form text content to a local vector DB to create a long-term memory for your LLM.
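As an illustration of that long-term-memory pattern (not Lightrail's actual implementation), here's a rough sketch using chromadb as a stand-in local vector store; the collection name and documents are made up.

```python
# Sketch of the "save now, recall later" pattern behind LLM long-term memory,
# using chromadb as a stand-in local vector store.
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to keep data on disk
memory = client.create_collection("assistant_memory")

# Save long-form content so it can be recalled later.
memory.add(
    ids=["note-1", "note-2"],
    documents=[
        "Client wants the sign-up button to use the brand's teal palette.",
        "The dashboard charts should default to a 30-day window.",
    ],
)

# Retrieve the most relevant notes for a new query and feed them to the LLM as context.
results = memory.query(query_texts=["What styling did the client ask for?"], n_results=1)
print(results["documents"][0])
```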

That said, I’m interested to see how multi-modal AI technology continues to develop. I’d love to have a long-term memory where I could search semantically across images, text, & video with the same accuracy as across textual content. Similarly, GPT-4 can currently take images as input, but it would be amazing if it could output both images & text.

Anyways, thanks so much for reading. If you’re interested in testing out Lightrail, you can download it here!
