A New Leader Emerges: IBM Granite Vision Excels in Document AI
Introduction
The landscape of artificial intelligence continues to evolve at a rapid pace, with new breakthroughs constantly pushing the boundaries of what’s possible. In the realm of multimodal AI, a significant contender has made its mark: the latest IBM Granite 3.3 2B
vision model. This compact yet powerful model recently debuted at number two on the OCRBench leaderboard, making it the most performant multimodal model under 7B parameters. This remarkable achievement highlights its capabilities in document understanding and sets a new benchmark for smaller, more efficient AI models in this critical domain.
TL;DR: What are vision model LLMs?
Vision Model Large Language Models (LLMs), often referred to as Vision-Language Models (VLMs) or Multimodal Large Language Models (MLLMs), represent a significant advancement in artificial intelligence by bridging the gap between computer vision and natural language processing. Unlike traditional LLMs that exclusively process text, VLMs are designed to understand and interact with both visual data (like images and videos) and textual data simultaneously. At their core, these models integrate a vision encoder, which extracts meaningful features and representations from visual inputs (e.g., recognizing objects, textures, and spatial relationships), with a language model, which excels at understanding and generating human-like text. These two components work in conjunction, often through sophisticated alignment and fusion mechanisms, to map visual and textual information into a shared embedding space. This allows VLMs to perform a variety of complex tasks, such as generating descriptive captions for images, answering questions about visual content, and even enabling visual search. By unifying perception and expression, VLMs enable AI systems to interpret and communicate about the world in a more holistic and intuitive manner, much closer to how humans perceive information.
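To make that encoder-plus-language-model pairing concrete, here is a deliberately tiny, untrained PyTorch sketch. It is not the Granite architecture; TinyVLM, the layer sizes, and the fake inputs are all illustrative assumptions, and only the overall flow (vision encoder → projection into the language model's embedding space → fused token sequence) mirrors the description above ⬇️
# toy_vlm.py - illustrative only: a minimal, untrained VLM skeleton
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_dim=512, text_dim=768):
        super().__init__()
        # Vision encoder: turns pixels into feature vectors (stand-in for a real ViT)
        self.vision_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(vision_dim))
        # Projector: aligns visual features with the language model's embedding space
        self.projector = nn.Linear(vision_dim, text_dim)
        # Language model stub: consumes the shared embedding sequence (stand-in for an LLM)
        self.language_model = nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True)

    def forward(self, image, text_embeddings):
        # Project image features into the text embedding space, as one extra "image token"
        visual_tokens = self.projector(self.vision_encoder(image)).unsqueeze(1)
        # Fuse image and text tokens into a single sequence the language model can attend over
        fused = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.language_model(fused)

# Fake inputs: one 224x224 RGB image and 16 text-token embeddings
model = TinyVLM()
out = model(torch.randn(1, 3, 224, 224), torch.randn(1, 16, 768))
print(out.shape)  # (1, 17, 768): the image token plus 16 text tokens in the shared space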
Test
To practically test the capabilities of the granite-vision-3.3-2b model, I've leveraged the provided code to run it locally on my machine. A key enhancement to this local setup is the implementation of a conversational interface. Instead of static queries, this allows for dynamic interaction, where I can pose questions or prompts to the model about a given image in a continuous chat format. This interactive mode offers a more flexible and insightful way to explore the model's understanding of visual content and its ability to respond to natural language queries, simulating a real-world application scenario.
Here we go with the test ⬇️
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
And install the requirements ⬇️
pip install 'transformers>=4.49' Pillow torch huggingface_hub
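Optionally, a quick sanity check that the core packages import cleanly before running the app (just a version print, nothing model-specific) ⬇️
python3 -c "import torch, transformers; print(torch.__version__, transformers.__version__)"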
And then the sample application 👨💻
# app.py
import os
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from huggingface_hub import hf_hub_download
import torch


def run_vision_inference_with_conversation(model_name: str = "ibm-granite/granite-vision-3.3-2b"):
    """Run an interactive image chat session with a Granite vision model.

    Args:
        model_name (str): The name of the Hugging Face model to use.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")

    print(f"Attempting to load model and processor: {model_name}")
    try:
        processor = AutoProcessor.from_pretrained(model_name)
        model = AutoModelForVision2Seq.from_pretrained(model_name).to(device)
        print("Model and processor loaded successfully.")
    except Exception as e:
        print(f"Error loading model or processor: {e}")
        print("Please ensure you have 'transformers' and 'torch' installed.")
        print("You might also need to log in to Hugging Face with 'huggingface-cli login'.")
        return

    print(f"Downloading example image from Hugging Face Hub for model: {model_name}")
    img_path = None
    try:
        # Download an example image provided with the model from the Hugging Face Hub
        img_path = hf_hub_download(repo_id=model_name, filename='example.png')
        print(f"Example image downloaded to: {img_path}")
    except Exception as e:
        print(f"Error downloading example image: {e}")
        print("Please check your internet connection or the model's repository for 'example.png'.")
        return

    print("\n--- Interactive Chat ---")
    print("Type your questions about the image. Type 'quit' or 'exit' to end the session.")

    # Main loop for interactive chat
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ["quit", "exit"]:
            print("Exiting application. Goodbye!")
            break

        # Pair the downloaded example image with the user's question in a single-turn conversation
        conversation = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "url": img_path},
                    {"type": "text", "text": user_input},
                ],
            },
        ]

        print("Applying chat template and preparing inputs...")
        try:
            inputs = processor.apply_chat_template(
                conversation,
                add_generation_prompt=True,
                tokenize=True,
                return_dict=True,
                return_tensors="pt"
            ).to(device)
            print("Inputs prepared successfully.")
        except Exception as e:
            print(f"Error applying chat template or preparing inputs: {e}")
            continue  # Continue to next iteration if input preparation fails

        print("Generating response from the model...")
        try:
            output = model.generate(**inputs, max_new_tokens=100)
            generated_text = processor.decode(output[0], skip_special_tokens=True)
            print("\n--- Model Output ---")
            print(generated_text)
            print("--------------------\n")
        except Exception as e:
            print(f"An error occurred during inference: {e}")
            print("This could be due to memory constraints or other issues during generation.\n")


if __name__ == "__main__":
    run_vision_inference_with_conversation()
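Save the script as app.py (matching the header comment) and launch it from the activated virtual environment ⬇️
python3 app.py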
The sample image will be downloaded along with the rest of the necessary files into the Hugging Face cache directory.
/Users/xxxxxx/.cache/huggingface/hub/models--ibm-granite--granite-vision-3.3-2b/snapshots/7fe917fdafb006f53aedf9589f148a83ec3cd8eb
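If you would rather keep the downloads out of the default cache, hf_hub_download accepts a cache_dir argument. The snippet below is an optional variation on the download step in app.py, not part of the original script; the ./hf_cache path is just an example ⬇️
from huggingface_hub import hf_hub_download

# Optional variation: place the example image in a project-local cache directory
img_path = hf_hub_download(
    repo_id="ibm-granite/granite-vision-3.3-2b",
    filename="example.png",
    cache_dir="./hf_cache",  # example path; any writable directory works
)
print(img_path)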
Results and outputs of the sample code are provided below:
--- Model Output ---
<|system|>
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
<|user|>
What is the highest scoring model on ChartQA and what is its score?
<|assistant|>
The highest scoring model on ChartQA is Granite-vision-3.3-2b with a score of 0.87.
--- Model Output ---
<|system|>
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
<|user|>
what are the evaluations?
<|assistant|>
We compare the performance of granite-vision-3.3-2b with previous versions of granite-vision models.
Despite the inherently demanding nature of running large language models, especially those with vision capabilities, the granite-vision-3.3-2b
model demonstrates commendable performance when executed locally. Even on a CPU-only laptop, the inference process completes within a reasonable timeframe. This efficiency is particularly noteworthy given that such models often benefit greatly from GPU acceleration, underscoring the model's optimization for more accessible hardware environments.
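To attach a concrete number to that "reasonable timeframe", one option is to time the generation step. The helper below is a hypothetical addition (timed_generate is not part of the original app.py) that could replace the direct model.generate call inside the chat loop ⬇️
import time

def timed_generate(model, inputs, max_new_tokens=100):
    # Illustrative helper: measure wall-clock latency of a single generation call
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    print(f"Generation took {elapsed:.1f}s ({output.shape[-1]} tokens in the output sequence)")
    return output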
Conclusion
In summary, the IBM Granite 3.3 2B vision model represents a significant leap forward in multimodal AI, notably securing a top position on the OCRBench leaderboard as the most performant model under 7B parameters. This achievement underscores the growing power of Vision Model LLMs, which seamlessly integrate visual and textual understanding to process complex information. Our practical exploration demonstrated this model’s capabilities in a local, interactive chat environment, allowing for dynamic engagement with visual content. Critically, the granite-vision-3.3-2b model exhibited commendable and reasonable execution times even on a CPU-only laptop, highlighting its efficiency and potential for broader accessibility beyond high-end, GPU-accelerated systems. This combination of strong performance and efficient local execution positions IBM Granite Vision as a compelling solution for various document understanding and multimodal AI applications.
Links
- IBM Granite Vision tops the chart for small models in document understanding: https://research.ibm.com/blog/granite-vision-ocr-leaderboard
- ibm-granite/granite-vision-3.3-2b: https://huggingface.co/ibm-granite/granite-vision-3.3-2b
- IBM Granite family on Hugging Face: https://huggingface.co/ibm-granite
- IBM Granite Community: https://github.com/ibm-granite-community