Video Link: https://youtu.be/pwW9zwVQ4L8
Repository Link: https://github.com/aryankargwal/genai-tutorials/tree/main
In the fast-evolving world of AI, Vision-Language Models (VLMs) have garnered attention for their ability to understand and generate responses based on visual and textual inputs. However, testing these models in a structured environment and comparing their performance across scenarios remains challenging. This blog walks you through an experiment in which we used a custom-built Streamlit web application to stress test multiple VLMs, including Llama 3.2, Qwen2-VL, and GPT-4o, on a range of tasks, analyzing their response tokens and latency on complex, multimodal questions.
Please note that most of the findings are still under wraps, as this application is part of my ongoing work on a VLM benchmark; the first installment, SynCap-Flickr8K, is already available on Hugging Face!
Why Compare Vision-Language Models?
The ability to compare the performance of different VLMs across domains is critical for:
- Understanding model efficiency (tokens used, latency).
- Measuring how well models can generate coherent responses based on image inputs and textual prompts.
- Creating benchmark datasets to further improve and fine-tune VLMs.
To achieve this, we built a VLM Stress Testing Web App in Python, utilizing Streamlit for a user-friendly interface. This allowed us to upload images, input textual prompts, and obtain model-generated responses in real time. The app also calculated and logged critical metrics such as the number of tokens used in responses and latency.
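As a rough sketch, the core of such an interface can be assembled from a handful of Streamlit widgets. The widget labels and model identifiers below are illustrative placeholders rather than the exact ones used in the repository:

```python
import streamlit as st

# Illustrative model identifiers -- substitute the IDs your API provider expects.
MODEL_OPTIONS = ["llama-3.2-vision", "qwen2-vl", "gpt-4o"]

st.title("VLM Stress Test")

# Core inputs: an image, a free-form question, and the model to query.
uploaded_image = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
question = st.text_input("Ask a question about the image")
model_id = st.selectbox("Model", MODEL_OPTIONS)
run = st.button("Run query")
```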
Project Setup
Our main application file, app.py, uses Streamlit as the frontend and issues API requests to the different VLMs. Each query to a model includes:
- Image: Encoded in Base64 format (see the encoding sketch just after this list).
- Question: A text input by the user.
- Model ID: We allow users to choose between multiple VLMs.
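Turning the uploaded file into the Base64 string that the request expects takes only a few lines; `encode_image` here is an illustrative helper, not necessarily the function name used in the repository:

```python
import base64

def encode_image(uploaded_file) -> str:
    """Return the bytes of an uploaded image file as a Base64-encoded string."""
    return base64.b64encode(uploaded_file.read()).decode("utf-8")
```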
The API response includes:
- Answer: The model-generated text.
- Latency: Time taken for the model to generate the answer.
- Token Count: Number of tokens used by the model in generating the response.
Below is the code structure for querying the models:
```python
import requests

# NOTE: endpoint and headers are placeholders -- point them at your provider's
# OpenAI-compatible chat completions API and supply your own API key.
url = "https://api.example.com/v1/chat/completions"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}

def query_model(base64_image, question, model_id, max_tokens=300,
                temperature=0.9, stream=False, frequency_penalty=0.2):
    # Attach the uploaded image as a Base64-encoded data URL.
    image_content = {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
        }
    }

    prompt = question

    # Build an OpenAI-style chat payload combining the text prompt and the image.
    data = {
        "model": model_id,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    image_content
                ]
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
        "frequency_penalty": frequency_penalty
    }

    # Send the request and return the parsed JSON response.
    response = requests.post(url, headers=headers, json=data)
    return response.json()
```
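On top of query_model, the app can time each call and pull the answer and token count out of the returned JSON. The sketch below assumes an OpenAI-style chat completions response schema (choices / usage fields); other providers may use different keys:

```python
import time

def run_query(base64_image, question, model_id):
    # Measure wall-clock latency around the API call.
    start = time.time()
    result = query_model(base64_image, question, model_id)
    latency = time.time() - start

    # Extract the generated answer and token usage (OpenAI-style schema assumed).
    answer = result["choices"][0]["message"]["content"]
    token_count = result.get("usage", {}).get("completion_tokens")

    return answer, latency, token_count
```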
Task Definitions and Experiments
We tested tasks from four different domains using the following models:
- Llama 3.2
- Qwen2-VL
- GPT-4o
Domains:
- Medical: Questions related to complex medical scenarios.
- Retail: Product-related queries.
- CCTV: Surveillance footage analysis.
- Art: Generating artistic interpretations and descriptions.
The experiment involved five queries per task for each model, and we recorded the following metrics (a small aggregation sketch follows this list):
- Tokens: The number of tokens used by the model to generate a response.
- Latency: Time taken to return the response.
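Once the five queries per task are logged, the means and standard deviations reported in the tables below can be computed with Python's built-in statistics module. This is a minimal sketch with made-up numbers; the app's exact logging format (and whether it uses sample or population standard deviation) may differ:

```python
import statistics

def summarize(values):
    """Return the mean and sample standard deviation of one task's measurements."""
    return statistics.mean(values), statistics.stdev(values)

# Illustrative latencies (in seconds) for five queries of a single task.
latencies = [1.2, 1.8, 2.1, 1.5, 1.9]
mean_latency, std_latency = summarize(latencies)
print(f"mean={mean_latency:.2f}s, std={std_latency:.2f}s")
```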
Results
Token Usage Comparison
The tables below highlight token usage across the four domains for the Llama 3.2 and GPT-4o models.
Task | Q1 Tokens | Q2 Tokens | Q3 Tokens | Q4 Tokens | Q5 Tokens | Mean Tokens | Standard Deviation (Tokens) |
---|---|---|---|---|---|---|---|
Medical (Llama) | 1 | 12 | 1 | 1 | 1 | 3.2 | 4.81 |
Retail (Llama) | 18 | 39 | 83 | 40 | 124 | 60.8 | 32.77 |
CCTV (Llama) | 18 | 81 | 83 | 40 | 124 | 69.2 | 37.29 |
Art (Llama) | 11 | 71 | 88 | 154 | 40 | 72.2 | 51.21 |
Task | Q1 Tokens | Q2 Tokens | Q3 Tokens | Q4 Tokens | Q5 Tokens | Mean Tokens | Standard Deviation (Tokens) |
---|---|---|---|---|---|---|---|
Medical (GPT) | 1 | 10 | 1 | 1 | 1 | 2.4 | 4.04 |
Retail (GPT) | 7 | 13 | 26 | 14 | 29 | 17.8 | 8.53 |
CCTV (GPT) | 7 | 8 | 26 | 14 | 29 | 16.8 | 7.69 |
Art (GPT) | 10 | 13 | 102 | 43 | 35 | 40.6 | 35.73 |
Latency Comparison
Latency, measured in seconds, is another critical factor in evaluating model performance, especially for real-time applications. The following tables display latency results for the same set of tasks.
Task | Q1 Latency (s) | Q2 Latency (s) | Q3 Latency (s) | Q4 Latency (s) | Q5 Latency (s) | Mean Latency (s) | Standard Deviation (s) |
---|---|---|---|---|---|---|---|
Medical (Llama) | 0.74 | 0.97 | 0.78 | 0.98 | 1.19 | 0.73 | 0.19 |
Retail (Llama) | 1.63 | 3.00 | 3.02 | 1.67 | 3.14 | 2.09 | 0.74 |
CCTV (Llama) | 1.63 | 3.00 | 3.02 | 1.67 | 3.14 | 2.09 | 0.74 |
Art (Llama) | 1.35 | 2.46 | 2.91 | 4.45 | 2.09 | 2.46 | 1.06 |
Task | Q1 Latency (s) | Q2 Latency (s) | Q3 Latency (s) | Q4 Latency (s) | Q5 Latency (s) | Mean Latency (s) | Standard Deviation (s) |
---|---|---|---|---|---|---|---|
Medical (GPT) | 1.35 | 1.50 | 1.21 | 1.50 | 1.23 | 1.38 | 0.10 |
Retail (GPT) | 1.24 | 1.77 | 2.12 | 1.35 | 1.83 | 1.63 | 0.29 |
CCTV (GPT) | 1.20 | 2.12 | 1.80 | 1.35 | 1.83 | 1.68 | 0.32 |
Art (GPT) | 1.24 | 1.77 | 7.69 | 3.94 | 2.41 | 3.61 | 2.29 |
Observations
- Token Efficiency: Llama generally uses far fewer tokens for short-answer tasks like Medical than for more open-ended domains like Art.
- Latency: Latency is higher for more complex images, especially for tasks like Retail and Art, indicating that these models take more time when generating in-depth descriptions or analyzing images.
- GPT vs. Llama: GPT models generally had lower token counts across the tasks, but the latency was comparable, with GPT showing slightly more variability in complex tasks like Art.
Conclusion and Future Work
This experiment highlights the importance of evaluating both token efficiency and latency when stress testing VLMs. The VLM Stress Test App allows us to quickly compare multiple models and analyze their performance across a variety of real-world tasks.
Future Plans:
- Additional Models: We plan to add more models like Mistral and Claude to the comparison.
- Expanded Dataset: New tasks in domains like Legal and Education will be added to challenge the models further.
- Accuracy Metrics: We'll also integrate accuracy metrics like BLEU and ROUGE scores in the next iteration; a rough scoring sketch follows below.
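As a preview of that accuracy step, the sketch below scores a generated answer against a reference caption using NLTK's sentence-level BLEU and the rouge-score package; the reference and candidate strings are made up for illustration:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "a golden retriever running across a grassy field"
candidate = "a dog runs through the grass"

# Sentence-level BLEU with smoothing (short answers otherwise collapse to zero).
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 and ROUGE-L F-measures against the same reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```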
Check out our GitHub repository for the code and further instructions on how to set up and run your own VLM experiments.