I Spent Two Weeks Testing Multimodal AI APIs: Here's My Take
Okay, so I've been building a side project for the last few months: basically a tool that helps small e-commerce stores auto-tag their product photos and pull text out of screenshots customers send in. Sounds simple, right? Wrong. It sent me down a rabbit hole of testing EVERY multimodal AI API I could get my hands on, and honestly, I wish someone had just written this all down for me before I started. So here we are.
I'm gonna walk you through what I tried, what blew up, what cost way too much, and the one model I'm telling all my indie hacker friends about. Buckle up.
Why I Even Cared About Multimodal AI
Here's the thing: I was using GPT-4o for image stuff at first, and yeah, it works, but the bill started looking like a mortgage payment. $10/M output tokens is fine for an enterprise, but for a solo dev doing product tagging? I needed something cheaper that didn't make me want to cry every time I checked Stripe.
So I went hunting. I already had a Global API account (more on that later) and realized they had a bunch of multimodal models I'd never even heard of, let alone tried: Qwen3-VL, GLM-4.6V, Hunyuan stuff from Tencent, ByteDance's Doubao. Time to put 'em through the wringer.
The Models I Actually Tested
Here's the lineup. I'll keep the exact prices because honestly, when you're bootstrapping, every cent matters.
- Qwen3-VL-32B — Image + Text, $0.52/M output, 32K context
- Qwen3-VL-30B-A3B — Image + Text, $0.52/M output, 32K context
- Qwen3-VL-8B — Image + Text, $0.50/M output, 32K context
- Qwen3-Omni-30B — Image + Audio + Video + Text, $0.52/M output, 32K context
- GLM-4.6V — Image + Text, $0.80/M output, 32K context
- GLM-4.5V — Image + Text, $0.01/M output, 32K context
- Hunyuan-Vision — Image + Text, $1.20/M output, 32K context
- Hunyuan-Turbo-Vision — Image + Text, $1.20/M output, 32K context
- Doubao-Seed-2.0-Pro — Image + Text, $3.00/M output, 128K context
Nine models. Two weeks. Way too much coffee.
The Test Setup (aka My Garage Lab)
I'm not a fancy enterprise with GPU clusters. I'm one dude with a laptop and a bunch of test images I scraped from my own e-commerce inventory plus some tricky stuff I found on Reddit. I tested four main things:
- Object recognition on busy scenes
- OCR (text extraction, especially mixed English/Chinese)
- Chart and diagram understanding
- Code screenshot → actual code
I rated everything 1-5 stars because, honestly, I just made a spreadsheet and went to town. No fancy benchmark framework. Just vibes and pixel peeping.
Test 1: "Describe What You See"
I threw a gnarly street scene at these models: Tokyo's Shibuya crossing at night, signs everywhere, crowds, ads, the works. Here's what happened:
Qwen3-VL-32B absolutely crushed it. Got 5 stars. It picked out like 15+ objects, identified brands I didn't even remember were in the image, read the Japanese text on signs correctly. I was genuinely impressed. This is the model I ended up shipping with.
GLM-4.6V got 4 stars. Really good, especially on anything with an Asian context. It actually beat Qwen3 slightly on Chinese text recognition, which is wild for a model that's not even primarily Chinese-tuned anymore.
Qwen3-Omni-30B also 4 stars. Slightly less detail than the VL-32B, but the fact it can do audio too made me forgive it.
Hunyuan-Vision got 3 stars. It missed small details. Like, it could tell me "there are people walking" but couldn't tell me what brand was on their bag. For the price ($1.20/M), I expected more.
GLM-4.5V got 3 stars. It's the budget option at $0.01/M (yes, ONE CENT) and honestly for that price? Pretty good. Not great, but good enough if you're processing millions of images and need to save every penny.
Test 2: OCR — Where Things Get Interesting
OCR is weirdly where I had the strongest opinions. I made a test document with English, Chinese, and some mixed-language product labels. Here's how they did:
| Model | English | Chinese | Mixed |
|---|---|---|---|
| Qwen3-VL-32B | 5/5 | 5/5 | 5/5 |
| GLM-4.6V | 4/5 | 5/5 | 5/5 |
| Qwen3-Omni-30B | 4/5 | 4/5 | 4/5 |
| Hunyuan-Vision | 3/5 | 4/5 | 3/5 |
Qwen3-VL-32B was perfect on all three. Like, flawless. I tried to trip it up with weird fonts, smudged text, the kind of garbage you get from a phone camera in bad lighting — it still nailed it.
GLM-4.6V was actually slightly better on Chinese-only documents. So if you're doing Chinese OCR specifically and don't need English, GLM is your friend. The 4/5 on English was because it occasionally got weird with cursive fonts.
For my use case (mixed language e-commerce), Qwen3-VL-32B was the winner. Pretty much no contest.
Test 3: Charts and Diagrams (My Nemesis)
I hate chart understanding. It's the kind of task that makes me want to throw my laptop into the sea. But I needed it for one feature where users upload analytics screenshots and want a text summary. Here's how the top three did:
| Model | Data Extraction | Trend Analysis | Formatting |
|---|---|---|---|
| Qwen3-VL-32B | Perfect | Excellent | Clean |
| GLM-4.6V | Excellent | Very good | Good |
| Qwen3-Omni-30B | Very good | Very good | Clean |
Qwen3-VL-32B literally just nailed it. I gave it a bar chart with like 12 data points and a wonky pie chart, and it pulled the numbers out EXACTLY right and summarized the trends in clean bullet points. No "I can't quite tell" hedging. Just "this happened, this trend, this number." Chef's kiss.
GLM-4.6V was almost as good but its output formatting was a bit messier. I had to do more post-processing.
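A minimal sketch of what that kind of post-processing can look like, assuming the model was asked to reply in JSON and sometimes wraps the answer in extra chatter. `extract_json` is a hypothetical helper for illustration, not code from this project:

```python
import json


def extract_json(reply: str):
    """Return the first valid JSON object or array found in a model reply."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(reply):
        if ch in "{[":
            try:
                # raw_decode parses one JSON value starting at index i and
                # ignores whatever text follows it.
                value, _ = decoder.raw_decode(reply, i)
                return value
            except json.JSONDecodeError:
                # That bracket was just prose; keep scanning.
                continue
    raise ValueError("no JSON found in reply")


reply = 'Sure! Here are the values: {"Jan": 120, "Feb": 95} Hope that helps.'
print(extract_json(reply))
```

Scanning for the first parseable bracket is crude, but it copes with both leading and trailing chatter without any regex.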
Test 4: Code Screenshots — The Real Indie Hacker Test
This one was personal. As a dev, I take a LOT of code screenshots from Twitter, Stack Overflow, blog posts. I wanted to know if these models could just... turn them back into code I could paste. Here's what I got:
- Qwen3-VL-32B: 95% accuracy. Handled weird indentation, special characters, even some handwritten code. I'm STILL shook.
- GLM-4.6V: 90% accuracy. Minor formatting issues, but I could fix them in like 30 seconds.
- Qwen3-Omni-30B: 92% accuracy. Slight delay in response but the output was clean.
The 5-point gap between Qwen3-VL-32B and GLM-4.6V is the difference between "paste and run" and "paste and fix one line." Honestly, both were usable. I was NOT expecting this to work as well as it did.
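If you want to put a number on this yourself, one simple option is a character-level similarity ratio between the extracted code and the original source. This is a sketch using the standard library's `difflib`, assuming you have the original to compare against. `code_similarity` is a made-up helper name, and this is not necessarily how the percentages above were measured:

```python
from difflib import SequenceMatcher


def code_similarity(expected: str, extracted: str) -> float:
    """Return a 0..1 similarity ratio between two code snippets."""

    def normalise(src: str) -> str:
        # Drop trailing whitespace per line and blank lines at the edges,
        # but keep leading indentation because it matters for code.
        return "\n".join(line.rstrip() for line in src.splitlines()).strip("\n")

    # autojunk=False disables the heuristic that ignores very frequent
    # characters in long inputs, which would skew scores on real files.
    matcher = SequenceMatcher(None, normalise(expected), normalise(extracted),
                              autojunk=False)
    return matcher.ratio()


original = "for i in range(3):\n    print(i)\n"
from_model = "for i in range(3):\n    print(l)\n"  # 'i' misread as 'l'
print(f"{code_similarity(original, from_model):.0%}")
```

A ratio like this rewards near-misses, so it's worth also checking whether the extracted code actually runs.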
The Audio Stuff (Qwen3-Omni Is Weird And Cool)
Okay, so here's where things get interesting. Qwen3-Omni-30B is the ONLY model in this list that does audio. And video. It's truly omni-modal. I tested it with some podcast clips and customer voice notes.
| Task | Result |
|---|---|
| Speech-to-text transcription | Excellent (multiple languages) |
| Audio Q&A | Good |
| Emotion detection | Works |
| Music description | Basic |
For speech-to-text it was GREAT. Like, really great. I threw Spanish, Mandarin, and English at it and it handled all three. Emotion detection was hit-or-miss but it could tell if a speaker sounded frustrated vs. calm, which is way more useful than I thought it'd be.
The video part I didn't test as much because, honestly, video processing is expensive and my use case doesn't need it. But it's there if you need it.
The Pricing Reality Check
Okay, here's where it gets REAL. Let me show you what these cost at scale, because that's what actually matters when you're building something:
- GLM-4.5V at $0.01/M output: 1,000 image analyses = ~$0.05. Monthly at 10K images = $0.50. CHEAP.
- Qwen3-VL-8B at $0.50/M: 1,000 analyses = ~$2.50. Monthly at 10K = $25.
- Qwen3-VL-32B at $0.52/M: 1,000 = ~$2.60. Monthly = $26.
- Qwen3-Omni-30B at $0.52/M: 1,000 = ~$2.60 (plus audio). Monthly = $26.
- GLM-4.6V at $0.80/M: 1,000 = ~$4.00. Monthly = $40.
- Hunyuan-Vision at $1.20/M: 1,000 = ~$6.00. Monthly = $60.
- Doubao-Seed-2.0-Pro at $3.00/M: 1,000 = ~$15.00. Monthly = $150.
Compare that to GPT-4o at $10/M. For 10K images a month? I was looking at like $500+ on OpenAI. On Qwen3-VL-32B it's $26. That's not even a rounding error for most businesses. I literally cut my API bill by 95% by switching.
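For anyone who wants to rerun the math with their own volumes: the figures in the list all work out to about 5,000 billed tokens per analysis at the output rate (for example, $2.60 per 1,000 images at $0.52/M is 5M tokens). That per-image number is inferred from the prices above rather than measured, so treat it as an assumption and swap in your own. `monthly_cost` is just an illustrative helper:

```python
TOKENS_PER_ANALYSIS = 5_000  # assumption inferred from the figures above


def monthly_cost(price_per_million: float, images: int,
                 tokens_per_image: int = TOKENS_PER_ANALYSIS) -> float:
    """Estimated cost in dollars for a given number of image analyses."""
    total_tokens = images * tokens_per_image
    return price_per_million * total_tokens / 1_000_000


prices = {"GLM-4.5V": 0.01, "Qwen3-VL-32B": 0.52, "GPT-4o": 10.00}
for model, price in prices.items():
    print(f"{model}: ${monthly_cost(price, 10_000):.2f} per 10K images")
# GLM-4.5V: $0.50 per 10K images
# Qwen3-VL-32B: $26.00 per 10K images
# GPT-4o: $500.00 per 10K images
```

If your prompts produce shorter answers, the absolute numbers shrink but the ratios between models stay the same.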
Honestly, I gotta say: GLM-4.5V at $0.01/M is INSANE. Yes, it's not as good as the bigger models, but for bulk processing where you just need "good enough," it's a game changer. I'm using it as a pre-filter now: GLM-4.5V does the first pass, and the uncertain cases go to Qwen3-VL-32B for the harder stuff. Costs me like $2/month for that combo.
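That two-tier setup can be sketched roughly like this. How a case gets flagged as "uncertain" isn't spelled out above, so the sketch assumes the cheap model is told to answer `UNSURE` when it can't tell. That convention, the `tag_image` helper, and the GLM model ID are all placeholders rather than the real pipeline. The `ask` argument is any function that sends a prompt and image to a model and returns its text reply:

```python
from typing import Callable

CHEAP_MODEL = "glm-4.5v"  # placeholder ID; check your provider's model list
STRONG_MODEL = "Qwen/Qwen3-VL-32B-Instruct"

PROMPT = (
    "List the product tags for this image as a comma-separated line. "
    "If you are not confident, reply with exactly UNSURE."
)


def tag_image(image_url: str,
              ask: Callable[[str, str, str], str]) -> tuple[str, str]:
    """Try the cheap model first and escalate only when it is unsure.

    `ask(model, prompt, image_url)` must return the model's text reply.
    Returns a (model_used, reply) pair.
    """
    reply = ask(CHEAP_MODEL, PROMPT, image_url).strip()
    if reply and reply.upper() != "UNSURE":
        return CHEAP_MODEL, reply
    # Empty or UNSURE reply: pay for the bigger model on this image only.
    return STRONG_MODEL, ask(STRONG_MODEL, PROMPT, image_url).strip()
```

Keeping the API call behind `ask` makes the routing easy to test with a fake function before spending anything on real requests.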
My Actual Code (For The Devs Reading This)
Here's how I'm actually calling these models through Global API. Super simple, OpenAI-compatible:
```python
import os

from openai import OpenAI

# The endpoint is OpenAI-compatible, so the stock SDK works with a custom base URL.
client = OpenAI(
    api_key=os.environ.get("GLOBAL_API_KEY"),
    base_url="https://global-apis.com/v1",
)

# One user message carrying both the text prompt and the image URL.
response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What products are in this image? List brand names and colours."},
            {"type": "image_url", "image_url": {"url": "https://example.com/product.jpg"}},
        ],
    }],
    max_tokens=500,
)
print(response.choices[0].message.content)
```
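Customer screenshots usually live on disk rather than at a public URL. The OpenAI chat format also accepts base64 `data:` URLs in the `image_url` field; whether this provider accepts them for every model is an assumption worth testing before you rely on it. `image_message` is a hypothetical helper:

```python
import base64
import mimetypes
from pathlib import Path


def image_message(path: str, prompt: str) -> dict:
    """Build a user message that embeds a local image as a base64 data URL."""
    # Guess the MIME type from the file extension, falling back to JPEG.
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }
```

You'd pass the result as `messages=[image_message("screenshot.png", "Extract all text")]` in the same `client.chat.completions.create(...)` call.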
And here's the audio example with Qwen3-Omni, which is honestly the coolest one:
```python
# Audio processing with Qwen3-Omni (reuses the `client` created above).
# Note: `audio_url` is a provider-specific content type; OpenAI's own API
# expects `input_audio` with base64 data instead.
response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this audio and tell me the speaker's tone"},
            {"type": "audio_url", "audio_url": {"url": "https://example.com/voice-note.mp3"}},
        ],
    }],
    max_tokens=1000,
)
print(response.choices[0].message.content)
```
Pretty much drop-in compatible with the OpenAI SDK. I didn't have to change much in my codebase besides swapping the base URL and model names.
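Since only the model name changes between calls, running the same image past several models is a short loop. A rough sketch, assuming `client` is an OpenAI-compatible client like the one above. `compare_models` is a hypothetical helper, and the only model IDs I can vouch for are the two used in this post:

```python
def compare_models(client, models: list[str], prompt: str, image_url: str,
                   max_tokens: int = 500) -> dict[str, str]:
    """Send one prompt and image to each model and collect the replies."""
    results = {}
    for model in models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": image_url}},
                    ],
                }],
                max_tokens=max_tokens,
            )
            results[model] = response.choices[0].message.content
        except Exception as exc:
            # Record the failure and keep going so one bad model ID
            # doesn't sink the whole comparison run.
            results[model] = f"ERROR: {exc}"
    return results
```

Calling it with `["Qwen/Qwen3-VL-32B-Instruct", "Qwen/Qwen3-Omni-30B-A3B-Instruct"]` gives you a dict you can dump straight into a spreadsheet for side-by-side scoring.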
What I'd Actually Recommend
If you're an indie hacker building something with multimodal AI, here's my honest take after two weeks of testing:
For pure image understanding (best bang for buck): Qwen3-VL