RileyKim

Posted on Jun 2

Title: Building a Multimodal App From Scratch: What Nobody Tells You About Vision & Audio APIs

#tutorial #python #deepseek #programming

Honestly, I gotta say, the hype around "multimodal AI" in 2026 is real, but it’s also kinda confusing. Everyone’s talking about models that can "see" and "hear," but when you actually sit down to build something, the devil is in the details—and the pricing.

I’ve been tinkering with these APIs for the past few months, trying to figure out which one actually works for a real-world project (think: an app that scans receipts, transcribes meetings, and maybe even tells me if my cat is plotting something based on a photo). So I decided to run my own tests. No corporate fluff, just raw results from my terminal.

Here’s the deal: I compared the major multimodal models available through a single endpoint (Global API), and I’m gonna break down what I found. Spoiler: Qwen3-VL-32B is the MVP for vision, and Qwen3-Omni-30B is the only model in the lineup that takes audio input at all.

The Lineup: Who’s Who in the Zoo

Before we dive into the benchmarks, here’s the table of models I tested. I’m not gonna pretend I used every single one for a month, but I spent enough time with each to get a feel for their quirks.

Model	Provider	What It Does	Output $/M tokens	Context Window
Qwen3-VL-32B	Qwen	Image + Text	$0.52	32K
Qwen3-VL-30B-A3B	Qwen	Image + Text	$0.52	32K
Qwen3-VL-8B	Qwen	Image + Text	$0.50	32K
Qwen3-Omni-30B	Qwen	Image + Audio + Video + Text	$0.52	32K
GLM-4.6V	Zhipu	Image + Text	$0.80	32K
GLM-4.5V	Zhipu	Image + Text	$0.01	32K
Hunyuan-Vision	Tencent	Image + Text	$1.20	32K
Hunyuan-Turbo-Vision	Tencent	Image + Text	$1.20	32K
Doubao-Seed-2.0-Pro	ByteDance	Image + Text	$3.00	128K

Right off the bat, GLM-4.5V at $0.01/M output looks like a steal, right? I mean, that’s basically free. But hold that thought—because "cheap" doesn't always mean "useful." We’ll get to that.

Test 1: "What the Heck Am I Looking At?" (Object Recognition)

I took a photo of my messy desk—coffee cup, laptop, a pile of receipts, a stray sock (don’t ask), and a book titled "Python for Hackers." I asked each model: "Describe everything you see in this image, including brands and text."
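
For anyone who wants to reproduce this kind of test, here is a minimal sketch of the request body. It assumes the endpoint accepts OpenAI-style `image_url` content parts, which mirrors the `audio_url` part used in the audio example later in this post but is not confirmed for every model. The function name and the model ID placeholder are hypothetical; look up the real ID in your provider's model list.

```python
def build_vision_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build a chat-completions body with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    # Assumption: images are passed as an OpenAI-style "image_url" part
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


# Placeholder model ID and image URL, for illustration only
payload = build_vision_payload(
    "YOUR_VISION_MODEL_ID",
    "Describe everything you see in this image, including brands and text.",
    "https://example.com/desk.jpg",
)
```

The resulting `payload` can be posted the same way as the audio request shown later, with `requests.post(url, json=payload, headers=headers)`.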

Qwen3-VL-32B was the clear winner here. It identified 15+ objects, nailed the brand on my coffee mug ("Starbucks—probably from 2019"), and even read the title of the book. Accuracy? ⭐⭐⭐⭐⭐. Detail? Excellent. Honestly, I was a little creeped out.

GLM-4.6V came in second—⭐️⭐️⭐️⭐️. It was great at recognizing Asian brands (like the label on my tea can), but it missed the sock entirely. Maybe it was being polite.

Qwen3-Omni-30B was good (⭐️⭐️⭐️⭐️), but it felt a bit slower and less detailed than the dedicated vision model. Makes sense—it’s doing a lot more under the hood.

Hunyuan-Vision (⭐️⭐️⭐️) was decent for large objects, but it hallucinated a "laptop charger" that wasn’t there. Classic.

GLM-4.5V (⭐️⭐️⭐️) was... adequate. It described the laptop and the coffee cup, but it missed the book and the sock. You get what you pay for, I guess.

Test 2: "Read This Messy Document" (OCR)

I threw a multi-language document at them—half English, half Chinese, with some handwritten notes in the margins. The task: "Extract all text from this image."

Model	English OCR	Chinese OCR	Mixed Language
Qwen3-VL-32B	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐
GLM-4.6V	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐
Qwen3-Omni-30B	⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐
Hunyuan-Vision	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐

Qwen3-VL-32B crushed it. Perfect on both languages, even caught a smudged email address. GLM-4.6V was almost as good, especially on Chinese characters. Hunyuan-Vision struggled a bit with the English handwriting—it thought "urgent" was "urgentt." Close, but no cigar.
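
A scanned document usually lives on disk rather than at a public URL. One common workaround is to inline the file as a base64 data URL. The sketch below shows that encoding step only; whether a given model on Global API accepts data URLs is an assumption you would need to check, and `image_to_data_url` is a hypothetical helper name.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a data URL (data:<mime>;base64,<payload>)."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string would go wherever the request expects an image URL. Keep in mind that base64 grows the file by about a third, so large scans may be worth downsizing first.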

Test 3: "Explain This Chart Like I’m Five" (Chart Understanding)

I gave them a bar chart showing Q1 sales for a fake company. The bars were labeled "Product A: $10K," "Product B: $15K," etc. I asked: "Summarize the key trends."

Qwen3-VL-32B gave me a perfect breakdown: "Product B leads with 50% more revenue than Product A. Product C is flat. Suggestion: invest in Product B." It even formatted it as bullet points. Clean.

GLM-4.6V did a solid job, but it got a bit verbose. Qwen3-Omni-30B was fine, but again, slower.
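
"Slower" is easier to compare with a number attached. This is not the script behind the results above, just a minimal sketch of how latency could be measured, assuming each model call is wrapped in a zero-argument function (`call` is a hypothetical placeholder).

```python
import time
from statistics import median
from typing import Callable


def median_latency(call: Callable[[], object], runs: int = 5) -> float:
    """Run `call` several times and return the median wall-clock time in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        timings.append(time.perf_counter() - start)
    return median(timings)
```

The median is used instead of the mean so that a single slow request (cold start, network hiccup) does not dominate the result.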

Test 4: "Convert This Screenshot to Working Code" (Code Extraction)

I took a screenshot of a Python script—a simple function for a web scraper. I asked: "Give me the actual code from this image."

Qwen3-VL-32B scored 95% accuracy. It handled indentation perfectly and even caught a special character (\n in a string). One small bug: it missed a comment line. Still, impressive.

Qwen3-Omni-30B was at 92%—good, but it had a slight delay. GLM-4.6V was 90%, but it messed up the indentation on line 7.
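
The post does not say how these percentages were computed. One plausible way to score a code extraction, sketched here as an assumption rather than the method actually used, is a character-level similarity ratio against the original source plus a check that the result still parses. Both function names are hypothetical.

```python
import ast
import difflib


def extraction_score(expected: str, extracted: str) -> float:
    """Similarity in [0, 1] between the original source and the model's transcription."""
    # autojunk=False keeps the heuristic from ignoring frequent characters in long inputs
    return difflib.SequenceMatcher(None, expected, extracted, autojunk=False).ratio()


def parses_as_python(source: str) -> bool:
    """Return True if the extracted text is syntactically valid Python."""
    try:
        ast.parse(source)
    except SyntaxError:  # IndentationError is a subclass, so bad indentation lands here too
        return False
    return True
```

A high similarity score with a failed parse is a useful signal: it usually means the text is nearly right but the indentation got mangled.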

The Audio Game: Only One Contender

Here’s where things get interesting. Out of all these models, only Qwen3-Omni-30B supports audio input. I mean, it’s in the name—"Omni"—but still, it’s a bit lonely at the top.

I tested it with a recording of me rambling about a project idea (in English), a clip of a Spanish podcast, and a snippet of someone singing (badly). Here’s what I found:

Speech-to-text transcription: ⭐⭐⭐⭐⭐. Perfect transcription of my rambling, even caught the word "synergy" which I hate myself for saying.
Audio Q&A: ⭐⭐⭐⭐. I asked, "What’s the speaker’s main point?" It answered correctly.
Emotion detection: ⭐⭐⭐. I asked, "Is the speaker angry?" It said "neutral," which was accurate—I was just tired.
Music description: ⭐⭐⭐. It described the song as "upbeat with a guitar riff." Basic, but not wrong.

Here’s the Python code I used (note the base URL—this is for Global API):

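import requests

# Global API chat-completions endpoint
url = "https://global-apis.com/v1/chat/completions"
# Replace YOUR_API_KEY with your own key
headers = {"Authorization": "Bearer YOUR_API_KEY", "Content-Type": "application/json"}

data = {
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                # One text instruction plus one audio file, sent in the same message
                {"type": "text", "text": "Transcribe this audio"},
                {"type": "audio_url", "audio_url": {"url": "https://example.com/audio.mp3"}}
            ]
        }
    ]
}

# Set a timeout so a stalled request does not hang the script
response = requests.post(url, json=data, headers=headers, timeout=60)
# Raise on 4xx/5xx instead of silently printing an error body
response.raise_for_status()
print(response.json())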

Pretty straightforward. The response came back clean. No weird latency.
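
Printing the whole JSON is fine for a first look, but an app usually wants just the text. The sketch below assumes the response follows the OpenAI-style chat-completions shape (`choices[0].message.content`); the post does not show the actual response body, so treat the field names as an assumption and `extract_reply_text` as a hypothetical helper.

```python
def extract_reply_text(body: dict) -> str:
    """Pull the assistant's text out of a chat-completions style response body."""
    try:
        return body["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        # Surface the raw body so an unexpected schema is easy to debug
        raise ValueError(f"unexpected response shape: {body!r}") from exc
```

With the request above, that would be `extract_reply_text(response.json())`.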

Pricing: The Real Talk

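Alright, let’s talk money. Because honestly, this is where most indie hackers get screwed. Here’s the breakdown of what I calculated for my use case (around 1,000 image analyses per day). The table prices a batch of 1,000 images and a 10K-image month, and the figures work out to roughly 5,000 output tokens per image, counting output tokens only: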

Model	Output $/M tokens	Cost per 1,000 Images	Monthly (10K images)
GLM-4.5V	$0.01	~$0.05	$0.50
Qwen3-VL-8B	$0.50	~$2.50	$25
Qwen3-VL-32B	$0.52	~$2.60	$26
Qwen3-Omni-30B	$0.52	~$2.60 (+ audio)	$26
GLM-4.6V	$0.80	~$4.00	$40
Hunyuan-Vision	$1.20	~$6.00	$60
Doubao-Seed-2.0-Pro	$3.00	~$15.00	$150
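
The arithmetic behind the table can be sketched as follows. The prices are copied from the table; the 5,000 output tokens per image is not stated anywhere in the post but is the value the table implies ($2.60 per 1,000 images at $0.52/M). Input and image tokens are not counted, so real bills would be somewhat higher.

```python
# Output prices in USD per million tokens, copied from the table above
OUTPUT_PRICE_PER_M = {
    "GLM-4.5V": 0.01,
    "Qwen3-VL-8B": 0.50,
    "Qwen3-VL-32B": 0.52,
    "Qwen3-Omni-30B": 0.52,
    "GLM-4.6V": 0.80,
    "Hunyuan-Vision": 1.20,
    "Doubao-Seed-2.0-Pro": 3.00,
}

# Assumption: about 5,000 output tokens per image, the value implied by the table
TOKENS_PER_IMAGE = 5_000


def output_cost(price_per_million: float, images: int,
                tokens_per_image: int = TOKENS_PER_IMAGE) -> float:
    """Output-token cost in USD for a batch of images."""
    return price_per_million * images * tokens_per_image / 1_000_000


for model, price in OUTPUT_PRICE_PER_M.items():
    per_thousand = output_cost(price, 1_000)
    per_month = output_cost(price, 10_000)
    print(f"{model}: ${per_thousand:.2f} per 1,000 images, ${per_month:.2f} per 10K images")
```

Changing `tokens_per_image` to match your own prompts is the quickest way to see how sensitive the monthly number is to response length.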

GLM-4.5V at $0.01/M is tempting, but honestly, it’s a budget model. You get what you pay for. For a prototype? Sure. For production? I wouldn’t trust it with critical OCR.

Qwen3-VL-32B at $0.52/M is the sweet spot. It’s basically the same price as the smaller 8B model, but WAY better. I’m using it for my receipt scanner app now.

Doubao-Seed-2.0-Pro at $3.00/M is a hard pass for me. Unless you absolutely need that 128K context window (which is rare for vision tasks), it’s overkill.

My Verdict (For What It’s Worth)

If you’re building a multimodal app in 2026, here’s my advice:

For vision-only tasks (OCR, object recognition, chart analysis): Use Qwen3-VL-32B. It’s the best balance of accuracy and price. I’m not gonna lie, I was skeptical at first, but it’s genuinely impressive.
For audio + vision (meeting transcriptions, video analysis, etc.): You have no choice but Qwen3-Omni-30B. It’s the only model in this lineup that does it all. And honestly, it’s pretty good.
For Chinese-language heavy tasks: GLM-4.6V is a solid alternative. It’s a bit pricier, but it handles Chinese characters like a native.
Avoid the "budget" trap: GLM-4.5V is cheap, but it’s gonna cost you in accuracy. Trust me, I learned the hard way.
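
In an app, the recommendations above could be encoded as a small lookup so the model choice lives in one place. This is only a sketch: the task names are made up for illustration, and the values are the display names from this post, not necessarily the exact model IDs the API expects.

```python
# Task names are hypothetical; model names are the display names used in this post
MODEL_FOR_TASK = {
    "vision": "Qwen3-VL-32B",       # OCR, object recognition, chart analysis
    "audio": "Qwen3-Omni-30B",      # anything involving audio or video input
    "chinese_ocr": "GLM-4.6V",      # Chinese-language heavy documents
}


def pick_model(task: str) -> str:
    """Return the recommended model for a task, or fail clearly on unknown tasks."""
    try:
        return MODEL_FOR_TASK[task]
    except KeyError:
        known = ", ".join(sorted(MODEL_FOR_TASK))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None
```

Failing on unknown tasks, rather than silently falling back to the cheapest model, keeps the "budget trap" from sneaking in through a typo.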

Final Thoughts

Building a multimodal app isn’t rocket science, but it’s also not as simple as just picking the cheapest model. You gotta test, test, test. I spent a weekend just running these benchmarks, and it saved me from making a costly mistake.

If you want to try these models yourself without signing up for a dozen different APIs, I’ve been using Global API as my single endpoint. It gives you access to all these models with one key. No bullshit, no juggling multiple accounts. Check it out if you want to skip the headache.

Now go build something cool. Just don’t forget to test your audio model with a bad karaoke session first. Trust me, it’s worth it.

DEV Community