Alexander Uspenskiy
Unlock the Magic of Images: A Quick and Easy Guide to Using the Cutting-Edge SmolVLM-500M Model

SmolVLM-500M-Instruct is a compact, state-of-the-art vision-language model with roughly 500 million parameters. Despite its small size, its capabilities are remarkably impressive.
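The script below needs PyTorch, Transformers, and Pillow. A minimal setup, assuming a recent Python environment (exact versions are up to you):

```shell
pip install torch transformers pillow
```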

Let's jump to the code:

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import warnings

warnings.filterwarnings("ignore", message="Some kwargs in processor config are unused")

def upload_and_describe_image(image_path):
    # Download (on first run) and load the processor and the 500M-parameter model
    processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
    model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")

    image = Image.open(image_path)

    # The <image> placeholder marks where the processor splices in the image tokens
    prompt = "Describe the content of this <image> in detail, give only answers in a form of text"
    inputs = processor(text=[prompt], images=[image], return_tensors="pt")

    # Inference only: disable gradient tracking to save memory and time
    with torch.no_grad():
        outputs = model.generate(
            pixel_values=inputs["pixel_values"],
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=150,
            do_sample=True,  # sample for more varied wording
            temperature=0.7
        )

    # Decode the generated token IDs back into text
    description = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return description.strip()

if __name__ == "__main__":
    image_path = "images/bender.jpg"

    try:
        description = upload_and_describe_image(image_path)
        print("Image Description:", description)
    except Exception as e:
        print(f"An error occurred: {e}")

This Python script uses the Hugging Face Transformers library to generate a textual description of an image. It loads a pre-trained vision-to-sequence model and processor, processes an input image, and generates a descriptive text based on the image content. The script handles exceptions and prints the generated description.
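One thing the script glosses over: PIL may open a PNG as RGBA or a palette image, while vision processors generally expect three-channel RGB input. A small sketch of a safeguard you could drop in before calling the processor (the helper name `load_image_rgb` is my own, not part of the original script):

```python
from PIL import Image

def load_image_rgb(image_path: str) -> Image.Image:
    """Open an image and normalize it to RGB, since the processor
    expects three channels (PNGs are often RGBA or palette-mode)."""
    image = Image.open(image_path)
    if image.mode != "RGB":
        image = image.convert("RGB")
    return image
```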

You can download it here: https://github.com/alexander-uspenskiy/vlm

Based on this original, non-stock image (put it in the images/ directory of the project):

[Image: images/bender.jpg]

Take a look at the description generated by the model (you can play with the prompt and parameters in the code to format the output better for any purpose): The robot is sitting on a couch. It has eyes and mouth. He is reading something. He is holding a book with his hands. He is looking at the book. In the background, there are books in a shelf. Behind the books, there is a wall and a door. At the bottom of the image, there is a chair. The chair is white. The chair has a cushion on it. In the background, the wall is brown. The floor is grey. in the image, the robot is silver and cream color. The book is brown. The book is open. The robot is holding the book with both hands. The robot is looking at the book. The robot is sitting on the couch.
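One formatting tweak worth knowing: `model.generate` returns the prompt tokens followed by the new tokens, so the decoded string typically starts with the prompt itself. If you want only the model's answer, a minimal string-level trim works (note that `extract_answer` is a hypothetical helper I'm adding here, not part of the original code):

```python
def extract_answer(decoded: str, prompt: str) -> str:
    """Strip the echoed prompt from the decoded output, keeping only
    the newly generated description."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    return decoded.strip()
```

You would call it as `extract_answer(description, prompt)` right after decoding.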

It looks excellent, and the model is both fast and resource-efficient compared to much larger LLMs.

Happy coding!
