gpt-4-vision-preview is the latest and (arguably) the most powerful model from OpenAI. It was announced on November 6, 2023 during OpenAI’s DevDay presentation, and it became the talk of social media within hours of becoming available.
Developers have already created apps that actively recognize what’s happening during a web live stream in real-time.
In another example, a developer combined several OpenAI APIs to analyze a Messi highlights video with the gpt-4-vision-preview model, create a voiceover script based on the video frames, and then generate audio using OpenAI Text-to-Speech.
Using gpt-4-vision-preview Model With Python
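First, install the openai package with pip. The example further down also loads the API key with python-dotenv, so install that as well:
pip install --upgrade openai python-dotenv  # --upgrade ensures we are using the latest version
Next, create an OpenAI client instance with your API key.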
How to obtain your OpenAI API key?
- Go to API keys – OpenAI API
- Click “Create new secret key”
- Provide a name and be sure to copy your new key.
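from openai import OpenAI
# Replace the placeholder with your own key; avoid committing real keys to source control
client = OpenAI(api_key="sk_YOUR_OPENAI_KEY")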
Making your first call with gpt-4-vision-preview Model
import os
from openai import OpenAI
from dotenv import load_dotenv
import base64
import mimetypes
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
def image_to_base64(image_path):
    # Guess the MIME type of the image
    mime_type, _ = mimetypes.guess_type(image_path)
    if not mime_type or not mime_type.startswith('image'):
        raise ValueError("The file type is not recognized as an image")
    # Read the image binary data
    with open(image_path, 'rb') as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode('utf-8')
    # Format the result with the appropriate prefix
    image_base64 = f"data:{mime_type};base64,{encoded_string}"
    return image_base64

base64_string = image_to_base64("image1.jpg")
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the attached image"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": base64_string,
                        "detail": "low"
                    }
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
Stepping Through the Code
The image_to_base64 function converts an image file to a base64-encoded string. It first checks that the file is an image, then reads and encodes the image’s binary data. Use this option to submit local images.
def image_to_base64(image_path):
    ...
You may also submit a publicly accessible image URL instead of base64 data, as in the example below:
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What’s in this image?",
                },
                {
                    "type": "image_url",
                    # Pass the URL inside an object, the same shape used for base64 data above
                    "image_url": {
                        "url": "https://www.goodfreephotos.com/albums/other-landscapes/rover-and-landscape-scenery.jpg",
                    },
                }
            ],
        }
    ],
    max_tokens=300,
)
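A prompt that compares images needs more than one image part in the same message. The sketch below shows one way to assemble such a message; the helper name build_image_message and the example.com URLs are hypothetical, and only the message shape ("type", "image_url", "url", "detail") comes from the examples above.

```python
from typing import Iterable


def build_image_message(prompt: str, image_urls: Iterable[str], detail: str = "auto") -> dict:
    """Build one user message with a text part followed by one image part per URL."""
    if detail not in {"low", "high", "auto"}:
        raise ValueError("detail must be 'low', 'high', or 'auto'")
    # The text part comes first, then each image as its own content part
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append(
            {"type": "image_url", "image_url": {"url": url, "detail": detail}}
        )
    return {"role": "user", "content": content}


# Placeholder URLs: swap in real image URLs or base64 data URLs
message = build_image_message(
    "What’s in these images? Is there any difference between them?",
    ["https://example.com/first.jpg", "https://example.com/second.jpg"],
)
print(len(message["content"]))  # one text part plus two image parts
```

The returned dictionary can be passed as the single element of the messages list in client.chat.completions.create.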
The Completion parameters
You may be familiar with the message object if you have used the OpenAI API in the past. With the release of the gpt-4-vision-preview model, OpenAI has introduced a new content type for message parts: “image_url”.
The url parameter accepts either a base64-encoded image (as a data URL) or a regular image URL.
The ‘detail’ parameter controls the resolution and token budget used to interpret an image. Setting ‘detail’ to ‘low’ restricts the model to a 512 x 512 pixel, low-resolution version of the image, at a cost of 85 tokens. This mode is quicker and more token-efficient, and suits scenarios where detailed analysis isn’t critical. Setting it to ‘high’ engages the high-resolution mode, which first presents the model with the low-resolution image and then with 512px tiles of the higher-resolution image, each tile costing 170 tokens on top of the 85-token base.
messages=[
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the attached image"},
            {
                "type": "image_url",
                "image_url": {
                    "url": base64_string,
                    "detail": "low"
                }
            },
        ],
    }
]
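To get a feel for how the detail setting turns into tokens, here is a rough estimator. It assumes the tiling rules described in OpenAI’s vision guide (fit the image within 2048 x 2048, shrink the shortest side to 768px, then count 512px tiles) and an 85-token base plus 170 tokens per tile, which are the figures implied by the prices quoted in the pricing section of this article. The function name estimate_image_tokens is hypothetical, and the rounding of scaled dimensions is an assumption.

```python
import math

BASE_TOKENS = 85    # assumed flat cost of the low-resolution pass
TILE_TOKENS = 170   # assumed cost of each 512px high-resolution tile


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate the input tokens charged for one image (a sketch, not an official formula)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if detail == "low":
        return BASE_TOKENS
    # Step 1: scale down (never up) to fit within a 2048 x 2048 square
    longest = max(width, height)
    if longest > 2048:
        width = round(width * 2048 / longest)
        height = round(height * 2048 / longest)
    # Step 2: scale down (never up) so the shortest side is at most 768px
    shortest = min(width, height)
    if shortest > 768:
        width = round(width * 768 / shortest)
        height = round(height * 768 / shortest)
    # Step 3: count the 512px tiles needed to cover the image
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return BASE_TOKENS + TILE_TOKENS * tiles


print(estimate_image_tokens(512, 512, "low"))
print(estimate_image_tokens(512, 512, "high"))
print(estimate_image_tokens(1024, 1024, "high"))
```

Treat the result as an estimate only; the usage field of the API response reports the tokens actually billed.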
GPT-4 Vision API Pricing
OpenAI prices the vision capability based on image resolution, billed as input tokens. For example, a 512×512 image costs $0.00255 in “high” detail mode or $0.00085 in “low” detail mode. At $0.01 per 1,000 input tokens, that corresponds to 255 and 85 tokens respectively.
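The arithmetic behind those prices can be sketched in a few lines. The rate of $0.01 per 1,000 input tokens is an assumption inferred from the figures above, so check the current pricing page before relying on it; the function name image_cost_usd is hypothetical.

```python
PRICE_PER_1K_INPUT_TOKENS = 0.01  # assumed USD rate; verify against current pricing


def image_cost_usd(tokens_per_image: int, images: int = 1,
                   price_per_1k: float = PRICE_PER_1K_INPUT_TOKENS) -> float:
    """Return the estimated USD cost of sending `images` images of a given token size."""
    return tokens_per_image * images * price_per_1k / 1000


# 85 tokens for "low" detail and 255 tokens for a 512x512 image in "high" detail
for label, tokens in (("low", 85), ("high", 255)):
    single = image_cost_usd(tokens)
    bulk = image_cost_usd(tokens, images=10_000)
    print(f"{label}: ${single:.5f} per image, ${bulk:.2f} per 10,000 images")
```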
This may not seem like a lot, but the cost adds up when processing hundreds or thousands of images. To control your input image sizes, use the code below to resize them.
First, ensure you have the Pillow library installed:
pip install Pillow
from PIL import Image
import os

def resize_image(image_path, new_width, new_height):
    with Image.open(image_path) as img:
        # Resize the image (Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter)
        img = img.resize((new_width, new_height), Image.LANCZOS)
        # Save the resized image next to the original with a "_resized" suffix
        base, ext = os.path.splitext(image_path)
        new_image_path = f"{base}_resized{ext}"
        img.save(new_image_path)
        print(f"Image saved as {new_image_path}")
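Because resize_image takes an explicit width and height, it can distort an image whose aspect ratio differs from the target. A small helper like the one below, which is a hypothetical addition and not part of the original code, computes dimensions that keep the aspect ratio while capping the longest side. The 512px default is an assumption chosen to match the tile size discussed earlier.

```python
def fit_within(width: int, height: int, max_side: int = 512) -> tuple[int, int]:
    """Return (new_width, new_height) scaled down so the longest side is at most max_side."""
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("width, height and max_side must be positive")
    longest = max(width, height)
    if longest <= max_side:
        # Already small enough: never upscale
        return width, height
    # Scale both sides by the same factor to preserve the aspect ratio
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height


print(fit_within(1920, 1080))  # (512, 288)
```

The resulting pair can be passed straight to resize_image as new_width and new_height.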
Real World Use of GPT-4 Vision API: Enhancing Web Experience with a Chrome Extension
OK, so the GPT-4 Vision API is cool and all. People have used it to create soccer highlight commentary and to interact with webcams, but let’s put gpt-4-vision-preview to the test and see how it fares with real-world problems.
Browser Extension – GPT Vision Assistant
The Chrome extension is designed to harness the GPT-4 Vision API and works in a streamlined three-step process:
First, it captures a screenshot of the current tab.
Second, the user is prompted to input a specific question or instruction, articulating what they seek to learn or accomplish with regard to the captured image.
Finally, this input, along with the screenshot, is sent as a request to the OpenAI API. The gpt-4-vision-preview model returns its findings directly within the extension interface, where the user can review them. This integration offers a powerful tool for real-time image analysis and interaction, all without leaving the browser tab.
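The three steps above boil down to building one request from a screenshot and a question. The extension itself would run in the browser, and its code is not shown here, so the following is only a Python sketch of the request body under that assumption. The function name build_vision_request and the placeholder screenshot bytes are hypothetical.

```python
import base64


def build_vision_request(screenshot_png: bytes, question: str,
                         detail: str = "low", max_tokens: int = 300) -> dict:
    """Combine a PNG screenshot and the user's question into chat completion arguments."""
    if not question.strip():
        raise ValueError("question must not be empty")
    # Step 1 output: the captured screenshot, encoded as a base64 data URL
    encoded = base64.b64encode(screenshot_png).decode("utf-8")
    data_url = f"data:image/png;base64,{encoded}"
    # Steps 2 and 3: pair the user's question with the image in a single user message
    return {
        "model": "gpt-4-vision-preview",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": data_url, "detail": detail}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


# Placeholder bytes stand in for a real screenshot of the current tab
payload = build_vision_request(b"\x89PNG placeholder", "Summarize what this page is about")
print(payload["messages"][0]["content"][0]["text"])
```

With the client from the earlier examples, the payload could be sent with client.chat.completions.create(**payload).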
Other Use Cases With GPT-4 Vision API
The GPT-4 Vision API’s capabilities extend beyond simple image recognition and analysis; they open up a world of possibilities for enhancing and streamlining our interactions with digital content. The Chrome extension we created on top of the API shows its practical utility in everyday web usage. Here are other use cases:
- Quality Assurance for Websites: Website developers and QA testers can use the extension to take screenshots of web pages and submit them directly to the GPT-4 Vision API with prompts like “Identify any visual inconsistencies across these screenshots” or “Does this layout comply with web accessibility standards?” The API can analyze the images for color contrast, element alignment, responsive design issues, and more, providing quick feedback that would traditionally require meticulous manual review.
- UX/UI Design Feedback: Designers can capture snapshots of their work and ask the model questions such as “What improvements can be made to this user interface to enhance user experience?” or “What are the best practices missing from this design?” This not only speeds up the iterative process of design but also injects an objective, data-driven perspective into creative workflows.
- Content Management and Moderation: For content managers and online moderators, the extension could be used to screen website content. By taking screenshots of various posts, images, or comments and querying the GPT-4 Vision API, the system could assist in identifying inappropriate content or copyright issues, streamlining the moderation process.
- Educational Content Interaction: Students and educators could use the extension to capture diagrams, equations, or charts from educational websites and ask the model to explain or solve them. This interactive approach could enhance online learning by providing instant assistance and explanations for complex visual information.
- Competitor Analysis: Marketing professionals might employ the extension to capture the layout of competitors’ websites, asking the model to analyze and compare branding consistency, messaging clarity, and call-to-action placements. This competitive intelligence can be invaluable for strategic planning.
- Accessibility Auditing: The extension could also be used to assist with ensuring web accessibility. Users could take snapshots of websites and ask the model to suggest alt text for the images shown (a screenshot cannot reveal whether alt text is already present in the markup) or to check whether the color schemes used are suitable for color-blind individuals.
- Automated Documentation: IT professionals and developers could use the extension to document the behavior of web applications. By taking screenshots and prompting the API to describe the process flow or detect anomalies, they could generate documentation or troubleshooting guides more efficiently.
- Code Generation from Designs: A revolutionary use case would be for front-end developers to send UX designs to the OpenAI API and ask it to generate the necessary HTML/CSS/JavaScript code. This could potentially reduce development time by providing a starting point for building out web interfaces.
If interested in the extension or other custom solutions using OpenAI, please email info@nextideated.com