Posted on Jun 21

I Benchmarked 9 Multimodal AI APIs So You Don't Have To

#programming #python #machinelearning #deepseek

Last month I needed to pick a vision model for a document-processing pipeline. Simple ask, right? Wrong. The more I dug, the more I realized the "multimodal" label gets slapped on everything from "literally just OCR" to "I can hear a guitar solo and tell you it's in D minor." So I did what any sensible backend engineer would do: I spun up a test harness, queued up 9 models, and started throwing images and audio at them like a QA engineer with a grudge.

This is the writeup I wish I'd had before I started. All prices and benchmarks are from my own runs via Global API, which — fwiw — has become my default playground for this kind of thing because it doesn't lock you into one provider's quirks.

The Contenders

Before we get into the carnage, here's the roster. I focused on models exposed through Global API's unified endpoint because a) I don't want to manage 9 different API keys and 9 different auth flows, and b) the pricing shown is what you'd actually pay, not the "contact us for enterprise pricing" nonsense.

Model	Provider	Modalities	Output $/M	Context
Qwen3-VL-32B	Qwen	Image + Text	$0.52	32K
Qwen3-VL-30B-A3B	Qwen	Image + Text	$0.52	32K
Qwen3-VL-8B	Qwen	Image + Text	$0.50	32K
Qwen3-Omni-30B	Qwen	Image + Audio + Video + Text	$0.52	32K
GLM-4.6V	Zhipu	Image + Text	$0.80	32K
GLM-4.5V	Zhipu	Image + Text	$0.01	32K
Hunyuan-Vision	Tencent	Image + Text	$1.20	32K
Hunyuan-Turbo-Vision	Tencent	Image + Text	$1.20	32K
Doubao-Seed-2.0-Pro	ByteDance	Image + Text	$3.00	128K

A few things stand out before we even run anything. First, GLM-4.5V at $0.01/M is suspiciously cheap. I'm not saying it's bad, but that's "is this a typo?" cheap. Second, the Qwen3-VL family clusters tightly around $0.50/M, which makes picking between them a quality question, not a budget question. Third, Doubao-Seed-2.0-Pro at $3.00/M is the expensive date — and it's the only one with a 128K context window, so there's a reason.

My Test Harness (Yes, It's Ugly)

Under the hood, the whole test rig is a Python loop with the OpenAI client pointed at Global API's base URL. This is, imo, the biggest reason to use a unified gateway — one client, one schema, and the multimodal content blocks work the same way for every provider:

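import os

from openai import OpenAI

# Read the key from the environment (variable name is my choice) instead of hardcoding it.
client = OpenAI(
    api_key=os.environ["GLOBAL_API_KEY"],
    base_url="https://global-apis.com/v1",
)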

MODELS = [
    "Qwen/Qwen3-VL-32B-Instruct",
    "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "Qwen/Qwen3-VL-8B-Instruct",
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "THUDM/GLM-4.6V",
    "THUDM/GLM-4.5V",
    "Tencent/HunyuanVision",
    "Tencent/HunyuanTurboVision",
    "Doubao/Doubao-Seed-2.0-Pro",
]

def test_object_recognition(image_url: str, model: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe everything you see in this image."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=1024,
    )
    return resp.choices[0].message.content

For the audio tests, I swapped the image block for an audio block, and only one model didn't yell at me about it. Spoiler: it's the Omni one.
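
The "Python loop" half of the rig isn't shown above, so here is a minimal sketch of what such a loop could look like. The function name `run_all` and the result keys are my own for illustration, not something the gateway or the OpenAI client provides. It takes any callable that maps a model name to a string, records the wall-clock time, and keeps going when a model errors out (which is how you find out who yells about audio).

```python
import time
from typing import Callable


def run_all(models: list[str], call: Callable[[str], str]) -> list[dict]:
    """Call `call(model)` for each model and record output, error, and latency."""
    results = []
    for model in models:
        start = time.perf_counter()
        try:
            output, error = call(model), None
        except Exception as exc:
            # Keep going so one failing model doesn't end the whole run.
            output, error = None, f"{type(exc).__name__}: {exc}"
        results.append({
            "model": model,
            "output": output,
            "error": error,
            "seconds": round(time.perf_counter() - start, 3),
        })
    return results
```

With the function above, a run would be something like `run_all(MODELS, lambda m: test_object_recognition(IMAGE_URL, m))`, where `IMAGE_URL` is a placeholder for whatever test image you use.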

Round 1: Object Recognition

I used a chaotic street scene — signs in three languages, a cat doing something unhinged, a parked scooter, the works. Prompt: "Describe everything you see in this image."
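
If your test image lives on disk rather than at a public URL, one option is to inline it as a base64 data URL and pass that string as `image_url`. This is a sketch under an assumption: OpenAI's own API accepts data URLs in `image_url` blocks, but I haven't confirmed that every model behind the gateway does. The helper name `image_to_data_url` is mine.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a data URL usable in an image_url block."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognizable image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Keep an eye on file size if you go this route, since the encoded image travels inside the request body.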

Qwen3-VL-32B came back with fifteen-plus distinct objects, including the brand on the scooter and the partially obscured sign in the back. It even caught the text on a shop window I'd personally squinted at for thirty seconds. Five stars, no notes.

GLM-4.6V was almost as good and noticeably better on Asian context — it correctly identified a regional noodle shop sign that the Qwen model just called "Chinese characters." Trade-offs are real.

Qwen3-Omni-30B performed very well, just slightly less detailed than its non-omni sibling. I assume some of its capacity is reserved for the audio/video branches, which is a fair architectural choice.

Hunyuan-Vision dropped the ball on small details — missed the cat entirely, misread the shop signage. GLM-4.5V was, for $0.01/M, surprisingly usable. Not great, but "acceptable" is the right word. You'd ship it to production for a low-stakes use case. You would not ship it for anything medical.

Model	Accuracy	Detail Level	Notes
Qwen3-VL-32B	⭐⭐⭐⭐⭐	Excellent	15+ objects, brands, text
GLM-4.6V	⭐⭐⭐⭐	Very good	Strong on Asian context
Qwen3-Omni-30B	⭐⭐⭐⭐	Very good	Slightly less detail than VL
Hunyuan-Vision	⭐⭐⭐	Good	Missed small details
GLM-4.5V	⭐⭐⭐	Adequate	Budget option, acceptable

Round 2: OCR

OCR is where vision models either prove themselves or get exposed. I threw a multi-language document at them — mixed English, Simplified Chinese, and a Japanese subtitle. Nothing fancy, just a real-world mess.

Qwen3-VL-32B was a five-star wrecking ball across the board. English, Chinese, mixed: all clean. GLM-4.6V actually edged it out on pure Chinese extraction, which tracks given Zhipu's data lineage. Qwen3-Omni-30B was close behind the 32B, one star lower in each column. Hunyuan-Vision was fine on Chinese but sloppy on English, which is an interesting data point if you care about that.

Model	English OCR	Chinese OCR	Mixed
Qwen3-VL-32B	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐
GLM-4.6V	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐
Qwen3-Omni-30B	⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐
Hunyuan-Vision	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐

If your pipeline is Chinese-first, GLM-4.6V deserves a serious look. If it's mixed, the Qwen3-VL-32B is the safer default.

Round 3: Charts and Diagrams

Three inputs: a bar chart, a stacked area chart, and a flow diagram. Prompt for the bar chart: "Analyze this bar chart and summarize the key trends."

Qwen3-VL-32B: perfect data extraction, excellent trend analysis, clean formatting. It gave me bullet points, identified the inflection point, and called out the outlier series by name. This is the kind of output you'd paste directly into a Slack message to your PM.

GLM-4.6V was excellent on extraction and very good on the analysis. The formatting was a hair less polished but nothing a re-prompt couldn't fix.

Qwen3-Omni-30B was very good across all three, with formatting that was honestly indistinguishable from the 32B. If you're already paying for Omni, the chart performance is a bonus.

Model	Data Extraction	Trend Analysis	Formatting
Qwen3-VL-32B	Perfect	Excellent	Clean
GLM-4.6V	Excellent	Very good	Good
Qwen3-Omni-30B	Very good	Very good	Clean

Round 4: Code Screenshot → Code

This is the test I ran for myself, because I've been meaning to automate "recreate this Stack Overflow answer from a screenshot." The test image was a Python function with weird indentation and a Unicode arrow.

Qwen3-VL-32B nailed it — 95% accurate, handled the indentation properly, preserved the Unicode arrow. GLM-4.6V came in at 90% with some minor formatting cleanup needed. Qwen3-Omni-30B hit 92% with a noticeable latency bump — nothing deal-breaking, but if you're doing real-time processing, the 32B is snappier.

Model	Accuracy	Edge Cases
Qwen3-VL-32B	95%	Handled indentation, special chars
GLM-4.6V	90%	Minor formatting issues
Qwen3-Omni-30B	92%	Good, slight delay
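
If you want to put a number on "accuracy" for your own screenshots, here is one way to do it. To be clear, this is an assumption about method, not necessarily how the percentages above were measured: it compares the model's output to a hand-typed reference using `difflib.SequenceMatcher` from the standard library, after stripping trailing whitespace so invisible differences don't count. The name `code_similarity` is mine.

```python
from difflib import SequenceMatcher


def code_similarity(expected: str, actual: str) -> float:
    """Character-level similarity in [0, 1]; leading indentation still counts."""
    def normalize(text: str) -> str:
        # Drop trailing spaces per line and blank lines at either end.
        return "\n".join(line.rstrip() for line in text.strip("\n").splitlines())

    matcher = SequenceMatcher(None, normalize(expected), normalize(actual), autojunk=False)
    return matcher.ratio()
```

Because indentation is kept, a model that flattens a Python block gets penalized, which is the edge case this round was about.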

The Audio Question (And Why Omni Is Worth the Hype)

Here's the thing nobody tells you: of these 9 models, exactly one accepts audio input. Qwen3-Omni-30B. Everyone else just looks at you like you asked them to smell the file.

I tested it on a multilingual podcast clip, a phone call recording, and a 30-second guitar riff. Results:

Task	Result
Speech-to-text transcription	Excellent across multiple languages
Audio Q&A	Good — answered "what's being said in this recording?" correctly
Emotion detection	Works — picked up the sarcastic tone in a voicemail
Music description	Basic — described the guitar riff as "plucked string instrument, mid-tempo"

The "basic" rating on music description is honest, not dismissive. The model isn't a music analyst, and it doesn't pretend to be. For speech, it's genuinely strong.

Here's the audio request pattern, in case you're building something with it:

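def transcribe_audio(audio_url: str) -> str:
    """Ask the Omni model to transcribe the audio file at `audio_url`."""
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio."},
                # "audio_url" is the block this gateway accepted; OpenAI's own
                # chat schema uses "input_audio" with base64 data instead.
                {"type": "audio_url", "audio_url": {"url": audio_url}},
            ],
        }],
    )
    return resp.choices[0].message.content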

Same client, same base URL, different modality block. This is the dream, honestly. RFC 7231 would be proud.
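
Since the image and audio requests differ only in one content block, you could build the message with a single helper. This is a sketch based purely on the two request shapes shown in this post; `build_user_message` is a name I made up, and it deliberately covers only `image_url` and `audio_url` because those are the only blocks I've shown working.

```python
def build_user_message(prompt: str, media_url: str, kind: str) -> dict:
    """Build a user message with a text block plus one image or audio block."""
    if kind not in ("image", "audio"):
        raise ValueError(f"unsupported media kind: {kind!r}")
    block_type = f"{kind}_url"  # "image_url" or "audio_url", as in the requests above
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": block_type, block_type: {"url": media_url}},
        ],
    }
```

You would then pass `messages=[build_user_message(...)]` to `client.chat.completions.create` exactly as in the functions above.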

Show Me The Money (Pricing Breakdown)

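I keep a spreadsheet for this stuff because at scale, half a dollar per million tokens turns into real money. Here are seven of the nine models ranked by what you'd actually pay at 1,000 images per run and 10,000 images per month (Qwen3-VL-30B-A3B and Hunyuan-Turbo-Vision cost the same as their same-priced siblings). The per-run column works out to roughly 5,000 output tokens per image analysis, so scale it to your own response lengths: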

Model	$/M Output	1,000 Image Analyses	Monthly (10K imgs)
GLM-4.5V	$0.01	~$0.05	$0.50
Qwen3-VL-8B	$0.50	~$2.50	$25
Qwen3-VL-32B	$0.52	~$2.60	$26
Qwen3-Omni-30B	$0.52	~$2.60 (+ audio)	$26
GLM-4.6V	$0.80	~$4.00	$40
Hunyuan-Vision	$1.20	~$6.00	$60
Doubao-Seed-2.0-Pro	$3.00	~$15.00	$150

A few things worth highlighting. The jump from GLM-4.5V at $0.50/month to Doubao at $150/month is 300x. That's not a pricing tier, that's a different product category. The Qwen3 cluster (8B, 32B, Omni) is effectively priced identically, which means the choice is purely about quality and features. GLM-4.6V is the premium Zhipu option, and you pay roughly 50% more for it than for the Qwen3 cluster. Hunyuan and Doubao are the big-invoice options, and of the two only Doubao brings a big context window (128K against Hunyuan's 32K).
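
The table's arithmetic fits in one function, if you'd rather plug in your own volumes. The 5,000 tokens-per-image default is what the table's numbers imply ($2.50 for 1,000 images at $0.50/M); treat it as an assumption and swap in your real average. `estimate_cost` is a name I picked for the sketch.

```python
def estimate_cost(price_per_million: float, images: int, tokens_per_image: int = 5_000) -> float:
    """Estimated output-token cost in dollars, rounded to cents."""
    total_tokens = images * tokens_per_image
    return round(price_per_million * total_tokens / 1_000_000, 2)
```

For example, `estimate_cost(0.52, 10_000)` reproduces the Qwen3-VL-32B monthly row, and passing a smaller `tokens_per_image` shows how much shorter responses would save.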

What I'd Actually Ship

After all this, here's my decision tree for a typical backend use case:

Bulk document OCR on a budget → GLM-4.5V. The $0.01/M price is real and the quality is "good enough for non-critical paths."
General-purpose image understanding → Qwen3-VL-32B. Best all-rounder, fair price, handles edge cases.
Audio + image + video pipeline → Qwen3-Omni-30B. Only real choice
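
If you want those picks in code form, a lookup table is about as fancy as it needs to be. This sketch only encodes the three recommendations listed above; the use-case keys and the `pick_model` name are hypothetical, and the model IDs are the ones from my `MODELS` list.

```python
RECOMMENDED = {
    "bulk_ocr": "THUDM/GLM-4.5V",                 # cheapest, fine for non-critical paths
    "general_vision": "Qwen/Qwen3-VL-32B-Instruct",  # best all-rounder in these tests
    "audio": "Qwen/Qwen3-Omni-30B-A3B-Instruct",     # the only one here that takes audio
}


def pick_model(use_case: str) -> str:
    """Return the recommended model ID for a use case, or raise on unknown ones."""
    try:
        return RECOMMENDED[use_case]
    except KeyError:
        raise ValueError(f"no recommendation for use case: {use_case!r}") from None
```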
