DEV Community

swift

I Benchmarked 9 Multimodal AI APIs So You Don't Have To

Last month I needed to pick a vision model for a document-processing pipeline. Simple ask, right? Wrong. The more I dug, the more I realized the "multimodal" label gets slapped on everything from "literally just OCR" to "I can hear a guitar solo and tell you it's in D minor." So I did what any sensible backend engineer would do: I spun up a test harness, queued up 9 models, and started throwing images and audio at them like a QA engineer with a grudge.

This is the writeup I wish I'd had before I started. All prices and benchmarks are from my own runs via Global API, which — fwiw — has become my default playground for this kind of thing because it doesn't lock you into one provider's quirks.

The Contenders

Before we get into the carnage, here's the roster. I focused on models exposed through Global API's unified endpoint because a) I don't want to manage 9 different API keys and 9 different auth flows, and b) the pricing shown is what you'd actually pay, not the "contact us for enterprise pricing" nonsense.

| Model | Provider | Modalities | Output $/M | Context |
| --- | --- | --- | --- | --- |
| Qwen3-VL-32B | Qwen | Image + Text | $0.52 | 32K |
| Qwen3-VL-30B-A3B | Qwen | Image + Text | $0.52 | 32K |
| Qwen3-VL-8B | Qwen | Image + Text | $0.50 | 32K |
| Qwen3-Omni-30B | Qwen | Image + Audio + Video + Text | $0.52 | 32K |
| GLM-4.6V | Zhipu | Image + Text | $0.80 | 32K |
| GLM-4.5V | Zhipu | Image + Text | $0.01 | 32K |
| Hunyuan-Vision | Tencent | Image + Text | $1.20 | 32K |
| Hunyuan-Turbo-Vision | Tencent | Image + Text | $1.20 | 32K |
| Doubao-Seed-2.0-Pro | ByteDance | Image + Text | $3.00 | 128K |

A few things stand out before we even run anything. First, GLM-4.5V at $0.01/M is suspiciously cheap. I'm not saying it's bad, but that's "is this a typo?" cheap. Second, the Qwen3-VL family clusters tightly around $0.50/M, which makes picking between them a quality question, not a budget question. Third, Doubao-Seed-2.0-Pro at $3.00/M is the expensive date — and it's the only one with a 128K context window, so there's a reason.

My Test Harness (Yes, It's Ugly)

Under the hood, the whole test rig is a Python loop with the OpenAI client pointed at Global API's base URL. This is, imo, the biggest reason to use a unified gateway — one client, one schema, and the multimodal content blocks work the same way for every provider:

import os

from openai import OpenAI

# Read the key from the environment instead of hardcoding it in the script.
client = OpenAI(
    api_key=os.environ["GLOBAL_API_KEY"],
    base_url="https://global-apis.com/v1",
)

MODELS = [
    "Qwen/Qwen3-VL-32B-Instruct",
    "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "Qwen/Qwen3-VL-8B-Instruct",
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "THUDM/GLM-4.6V",
    "THUDM/GLM-4.5V",
    "Tencent/HunyuanVision",
    "Tencent/HunyuanTurboVision",
    "Doubao/Doubao-Seed-2.0-Pro",
]

def test_object_recognition(image_url: str, model: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe everything you see in this image."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=1024,
    )
    return resp.choices[0].message.content

For the audio tests, I swapped the image block for an audio block, and only one model didn't yell at me about it. Spoiler: it's the Omni one.
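The loop that drives a test function across every model can be sketched as below. This is a minimal sketch, not the actual harness: `run_benchmark` and its `describe` parameter are hypothetical names, and the error handling assumes you want a rejected request (say, an audio block sent to a vision-only model) recorded rather than allowed to kill the run.

```python
import time
from typing import Callable, Dict, List, Optional


def run_benchmark(
    models: List[str],
    image_url: str,
    describe: Callable[[str, str], str],
) -> Dict[str, dict]:
    """Call describe(image_url, model) for each model and collect the results."""
    results: Dict[str, dict] = {}
    for model in models:
        start = time.perf_counter()
        output: Optional[str] = None
        error: Optional[str] = None
        try:
            output = describe(image_url, model)
        except Exception as exc:  # record the failure and move on to the next model
            error = f"{type(exc).__name__}: {exc}"
        results[model] = {
            "output": output,
            "error": error,
            "seconds": time.perf_counter() - start,  # wall-clock time per call
        }
    return results
```

Passing the request function in as an argument keeps the loop independent of the network: in a real run you would call something like `run_benchmark(MODELS, image_url, test_object_recognition)`, and in a dry run you can hand it a stub.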

Round 1: Object Recognition

I used a chaotic street scene — signs in three languages, a cat doing something unhinged, a parked scooter, the works. Prompt: "Describe everything you see in this image."

Qwen3-VL-32B came back with fifteen-plus distinct objects, including the brand on the scooter and the partially obscured sign in the back. It even caught the text on a shop window I'd personally squinted at for thirty seconds. Five stars, no notes.

GLM-4.6V was almost as good and noticeably better on Asian context — it correctly identified a regional noodle shop sign that the Qwen model just called "Chinese characters." Trade-offs are real.

Qwen3-Omni-30B performed very well, just slightly less detailed than its non-omni sibling. I assume some of its capacity is reserved for the audio/video branches, which is a fair architectural choice.

Hunyuan-Vision dropped the ball on small details — missed the cat entirely, misread the shop signage. GLM-4.5V was, for $0.01/M, surprisingly usable. Not great, but "acceptable" is the right word. You'd ship it to production for a low-stakes use case. You would not ship it for anything medical.

| Model | Accuracy | Detail Level | Notes |
| --- | --- | --- | --- |
| Qwen3-VL-32B | ⭐⭐⭐⭐⭐ | Excellent | 15+ objects, brands, text |
| GLM-4.6V | ⭐⭐⭐⭐ | Very good | Strong on Asian context |
| Qwen3-Omni-30B | ⭐⭐⭐⭐ | Very good | Slightly less detail than VL |
| Hunyuan-Vision | ⭐⭐⭐ | Good | Missed small details |
| GLM-4.5V | ⭐⭐⭐ | Adequate | Budget option, acceptable |

Round 2: OCR

OCR is where vision models either prove themselves or get exposed. I threw a multi-language document at them — mixed English, Simplified Chinese, and a Japanese subtitle. Nothing fancy, just a real-world mess.

Qwen3-VL-32B was a five-star wrecking ball across the board. English, Chinese, mixed: all clean. GLM-4.6V actually edged it out on pure Chinese extraction, which tracks given Zhipu's data lineage, but it dropped a star on English. Qwen3-Omni-30B was a consistent notch behind the 32B in all three categories. Hunyuan-Vision was fine on Chinese but sloppy on English, which is an interesting data point if you care about that.

| Model | English OCR | Chinese OCR | Mixed |
| --- | --- | --- | --- |
| Qwen3-VL-32B | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| GLM-4.6V | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Qwen3-Omni-30B | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Hunyuan-Vision | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |

If your pipeline is Chinese-first, GLM-4.6V deserves a serious look. If it's mixed, the Qwen3-VL-32B is the safer default.

Round 3: Charts and Diagrams

The test set was a bar chart, a stacked area chart, and a flow diagram. Prompt (bar chart version): "Analyze this bar chart and summarize the key trends."

Qwen3-VL-32B: perfect data extraction, excellent trend analysis, clean formatting. It gave me bullet points, identified the inflection point, and called out the outlier series by name. This is the kind of output you'd paste directly into a Slack message to your PM.

GLM-4.6V was excellent on extraction and very good on the analysis. The formatting was a hair less polished but nothing a re-prompt couldn't fix.

Qwen3-Omni-30B was very good across all three, with formatting that was honestly indistinguishable from the 32B. If you're already paying for Omni, the chart performance is a bonus.

| Model | Data Extraction | Trend Analysis | Formatting |
| --- | --- | --- | --- |
| Qwen3-VL-32B | Perfect | Excellent | Clean |
| GLM-4.6V | Excellent | Very good | Good |
| Qwen3-Omni-30B | Very good | Very good | Clean |

Round 4: Code Screenshot → Code

This is the test I ran for myself, because I've been meaning to automate "recreate this Stack Overflow answer from a screenshot." The test image was a Python function with weird indentation and a Unicode arrow.

Qwen3-VL-32B nailed it — 95% accurate, handled the indentation properly, preserved the Unicode arrow. GLM-4.6V came in at 90% with some minor formatting cleanup needed. Qwen3-Omni-30B hit 92% with a noticeable latency bump — nothing deal-breaking, but if you're doing real-time processing, the 32B is snappier.

| Model | Accuracy | Edge Cases |
| --- | --- | --- |
| Qwen3-VL-32B | 95% | Handled indentation, special chars |
| GLM-4.6V | 90% | Minor formatting issues |
| Qwen3-Omni-30B | 92% | Good, slight delay |
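If you want to put a number on "how close is the transcribed code to the original," one option is a character-level similarity ratio from Python's standard `difflib`. To be clear, this is a sketch of one possible metric and not necessarily how the percentages above were produced. The function name `code_similarity` is hypothetical. It ignores trailing whitespace but keeps leading indentation, since indentation is exactly what this test is about.

```python
import difflib


def code_similarity(reference: str, candidate: str) -> float:
    """Return a 0..1 similarity ratio between two code strings.

    Trailing whitespace on each line is ignored; leading indentation is kept,
    because a model that flattens indentation should lose points.
    """
    def normalize(text: str) -> str:
        lines = text.strip("\n").splitlines()
        return "\n".join(line.rstrip() for line in lines)

    matcher = difflib.SequenceMatcher(
        None, normalize(reference), normalize(candidate), autojunk=False
    )
    return matcher.ratio()
```

A ratio like this rewards near-misses, so it says nothing about whether the output actually runs. For a stricter check you could also try to compile or execute the result.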

The Audio Question (And Why Omni Is Worth the Hype)

Here's the thing nobody tells you: of these 9 models, exactly one accepts audio input. Qwen3-Omni-30B. Everyone else just looks at you like you asked them to smell the file.

I tested it on a multilingual podcast clip, a phone call recording, and a 30-second guitar riff. Results:

| Task | Result |
| --- | --- |
| Speech-to-text transcription | Excellent across multiple languages |
| Audio Q&A | Good: answered "what's being said in this recording?" correctly |
| Emotion detection | Works: picked up the sarcastic tone in a voicemail |
| Music description | Basic: described the guitar riff as "plucked string instrument, mid-tempo" |

The "basic" rating on music description is honest, not dismissive. The model isn't a music analyst, and it doesn't pretend to be. For speech, it's genuinely strong.

Here's the audio request pattern, in case you're building something with it:

def transcribe_audio(audio_url: str) -> str:
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio."},
                {"type": "audio_url", "audio_url": {"url": audio_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

Same client, same base URL, different modality block. This is the dream, honestly. RFC 7231 would be proud.
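Since the two request functions in this post differ only in the second content block, the block construction can be pulled into one helper. This is a sketch with a hypothetical name (`build_content` is not part of the harness), and it assumes the gateway accepts the `image_url` and `audio_url` block shapes exactly as used in the snippets above.

```python
from typing import Dict, List, Optional


def build_content(
    prompt: str,
    image_url: Optional[str] = None,
    audio_url: Optional[str] = None,
) -> List[Dict]:
    """Build the `content` list for a single user message."""
    content: List[Dict] = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    if audio_url is not None:
        # Block shape taken from the audio snippet above; of the nine models
        # tested, only the Omni model accepted it.
        content.append({"type": "audio_url", "audio_url": {"url": audio_url}})
    return content
```

The result goes straight into `messages=[{"role": "user", "content": build_content(...)}]`, so switching modality becomes an argument change rather than a new function.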

Show Me The Money (Pricing Breakdown)

I keep a spreadsheet for this stuff because at scale, half a dollar per million tokens turns into real money. Here are seven of the nine models, ranked by what you'd actually pay if you processed 1,000 images per run and 10,000 images per month. The per-run column works out to roughly 5,000 billed tokens per image at the listed rate:

| Model | $/M Output | 1,000 Image Analyses | Monthly (10K imgs) |
| --- | --- | --- | --- |
| GLM-4.5V | $0.01 | ~$0.05 | $0.50 |
| Qwen3-VL-8B | $0.50 | ~$2.50 | $25 |
| Qwen3-VL-32B | $0.52 | ~$2.60 | $26 |
| Qwen3-Omni-30B | $0.52 | ~$2.60 (+ audio) | $26 |
| GLM-4.6V | $0.80 | ~$4.00 | $40 |
| Hunyuan-Vision | $1.20 | ~$6.00 | $60 |
| Doubao-Seed-2.0-Pro | $3.00 | ~$15.00 | $150 |

A few things worth highlighting. The jump from GLM-4.5V at $0.50/month to Doubao at $150/month is 300x. That's not a pricing tier, that's a different product category. The Qwen3 cluster (8B, 32B, Omni) is effectively priced identically, which means the choice is purely about quality and features. GLM-4.6V is the premium Zhipu option, and you pay roughly 50% more for it than for the Qwen3 cluster. Hunyuan and Doubao are the big-invoice options, and of the two only Doubao gives you the bigger context window (128K) in return.
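The table's arithmetic is easy to reproduce if you want to plug in your own volumes. A minimal sketch: the 5,000 tokens-per-image figure is an assumption back-derived from the table ($2.50 per 1,000 images at $0.50/M), not a measured value, and `estimate_cost` is a hypothetical helper name.

```python
# Assumption: about 5,000 billed tokens per image, inferred from the table above.
TOKENS_PER_IMAGE = 5_000

# Listed price per million tokens, copied from the pricing table.
PRICES_PER_M = {
    "GLM-4.5V": 0.01,
    "Qwen3-VL-8B": 0.50,
    "Qwen3-VL-32B": 0.52,
    "Qwen3-Omni-30B": 0.52,
    "GLM-4.6V": 0.80,
    "Hunyuan-Vision": 1.20,
    "Doubao-Seed-2.0-Pro": 3.00,
}


def estimate_cost(
    price_per_m: float,
    images: int,
    tokens_per_image: int = TOKENS_PER_IMAGE,
) -> float:
    """Estimated spend in dollars for `images` analyses, rounded to cents."""
    return round(price_per_m * images * tokens_per_image / 1_000_000, 2)


# Monthly spend at 10,000 images for every model in the table.
monthly = {name: estimate_cost(price, 10_000) for name, price in PRICES_PER_M.items()}
```

Your real tokens-per-image number depends on image resolution and response length, so measure it from the `usage` field of a few responses before trusting any projection.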

What I'd Actually Ship

After all this, here's my decision tree for a typical backend use case:

  • Bulk document OCR on a budget → GLM-4.5V. The $0.01/M price is real and the quality is "good enough for non-critical paths."
  • General-purpose image understanding → Qwen3-VL-32B. Best all-rounder, fair price, handles edge cases.
  • Audio + image + video pipeline → Qwen3-Omni-30B. Only real choice
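The three rules listed above can be written down as a tiny routing function. This is a sketch that covers only those three cases; `pick_model` and its flags are hypothetical, the model IDs are the ones from the `MODELS` list earlier, and giving the audio/video requirement priority over budget is my assumption (no cheaper model in the lineup accepts audio).

```python
def pick_model(needs_audio_or_video: bool = False, budget_first: bool = False) -> str:
    """Return a model ID for the three use cases listed above."""
    if needs_audio_or_video:
        # Only the Omni model accepted audio input in these tests.
        return "Qwen/Qwen3-Omni-30B-A3B-Instruct"
    if budget_first:
        # Cheapest option; acceptable for non-critical paths.
        return "THUDM/GLM-4.5V"
    # Default all-rounder for general image understanding.
    return "Qwen/Qwen3-VL-32B-Instruct"
```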
