DEV Community

rarenode

Building My First AI App From Scratch: What Nobody Tells You About Speed

I'll never forget the day I deployed my first AI-powered chatbot to production. I was so proud of myself — three months out of a coding bootcamp, and I'd built something that actually worked. Users could ask questions, and it would respond. Magic, right?

Then the complaints started rolling in. "It's too slow." "I got bored waiting." "Your app is broken."

My heart sank. I'd spent weeks perfecting the prompts, tuning the temperature, making the responses sound human. But I'd completely ignored something way more important: speed.

Here's the thing nobody tells you in bootcamp: users don't care how smart your AI is if it takes forever to respond. They'll leave. They'll write bad reviews. They'll go to your competitor.

So I went down a rabbit hole. I tested 15 different AI models through the Global API (that's https://global-apis.com/v1 for the curious), measuring everything from how fast they spit out their first word to how many tokens they could generate per second.

What I found absolutely blew my mind.

The "Wait, That Fast?" Moment

Let me start with the biggest shocker: Step-3.5-Flash from StepFun. This thing outputs at 80 tokens per second with a time to first token (TTFT) of just 120 milliseconds.

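I had no idea models could be this fast. For context, the average person reads about 4-5 words per second. A token is usually a bit shorter than a word (a common rule of thumb for English is about 0.75 words per token), so 80 tokens per second is roughly 60 words per second, or about 12 to 15 times faster than you can read it. That's not a chatbot, that's a text firehose.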

But here's the crazy part: it only costs $0.15 per million output tokens. For a bootcamp grad building side projects on a shoestring budget, that's basically free.

I was so excited I literally called my bootcamp buddy at 11 PM. "Dude, you won't believe what I just found."

How I Actually Ran These Tests

Before we go further, let me show you how I set this up. I'm not some infrastructure wizard — I literally just wrote Python scripts. Here's the skeleton I used:

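import json
import time

import requests


def test_model_speed(model_name, api_key, timeout=60):
    """Stream one completion and measure TTFT and rough generation speed."""
    url = "https://global-apis.com/v1/chat/completions"

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    payload = {
        "model": model_name,
        "messages": [
            {"role": "user", "content": "Explain recursion in 200 words"}
        ],
        "stream": True,
        "max_tokens": 200
    }

    # perf_counter is a monotonic clock, so it is safer for timing than time.time()
    start_time = time.perf_counter()
    first_token_time = None
    total_tokens = 0

    response = requests.post(url, headers=headers, json=payload,
                             stream=True, timeout=timeout)
    response.raise_for_status()  # don't benchmark an error page

    for line in response.iter_lines():
        if not line:
            continue
        decoded = line.decode('utf-8')
        if not decoded.startswith('data: ') or decoded == 'data: [DONE]':
            continue
        try:
            data = json.loads(decoded[6:])
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of hiding every error
        choices = data.get('choices') or []
        # Only count chunks that carry text; the first chunk is often role-only
        if choices and choices[0].get('delta', {}).get('content'):
            if first_token_time is None:
                first_token_time = time.perf_counter() - start_time
            total_tokens += 1  # one streamed chunk is roughly, not exactly, one token

    elapsed = time.perf_counter() - start_time
    if first_token_time is None:
        raise RuntimeError(f"{model_name} returned no content")

    # Includes the wait for the first token, so it slightly understates raw speed
    tokens_per_sec = total_tokens / elapsed

    return {
        'model': model_name,
        'ttft_ms': round(first_token_time * 1000),
        'tokens_per_sec': round(tokens_per_sec, 1),
        'total_time': round(elapsed, 2)
    }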

Super simple, right? I ran each model 10 times from my apartment in Ohio and again from a server in Singapore. I wanted to see how geography affected things.
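Ten runs per model means ten slightly different numbers, so they have to be boiled down somehow. Here is a minimal sketch of one way to do it, assuming each run is a dict shaped like the output of test_model_speed. The helper name summarize_runs and the choice of the median are my own assumptions for illustration, and the sample numbers are made up.

```python
from statistics import median


def summarize_runs(runs):
    """Collapse repeated benchmark runs for one model into median values.

    `runs` is a list of dicts shaped like the output of test_model_speed.
    The median is an assumption made for this sketch: one slow run caused
    by a network hiccup skews a mean far more than a median.
    """
    if not runs:
        raise ValueError("need at least one run")
    return {
        "model": runs[0]["model"],
        "runs": len(runs),
        "ttft_ms": median(r["ttft_ms"] for r in runs),
        "tokens_per_sec": median(r["tokens_per_sec"] for r in runs),
    }


# Made-up sample numbers, for illustration only
sample = [
    {"model": "example-model", "ttft_ms": 110, "tokens_per_sec": 82.0},
    {"model": "example-model", "ttft_ms": 120, "tokens_per_sec": 80.0},
    {"model": "example-model", "ttft_ms": 900, "tokens_per_sec": 41.0},
]
print(summarize_runs(sample))
```

With the outlier run in the sample, the median TTFT stays at 120 ms, while a plain average would land well above 300 ms.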

The Rankings That Changed Everything

Okay, here's the full list I compiled. I tested models from all the big players — DeepSeek, Qwen, Tencent, StepFun, Zhipu, ByteDance, MiniMax, and Moonshot. Some names I'd never even heard of before.

Rank | Model | TTFT (ms) | Tokens/sec | Price (USD per 1M output tokens)
1 | Step-3.5-Flash | 120 | 80 | $0.15
2 | Qwen3-8B | 150 | 70 | $0.01
3 | DeepSeek V4 Flash | 180 | 60 | $0.25
4 | Hunyuan-TurboS | 200 | 55 | $0.28
5 | Doubao-Seed-Lite | 220 | 50 | $0.40
6 | Qwen3-32B | 250 | 45 | $0.28
7 | Hunyuan-Turbo | 280 | 42 | $0.57
8 | GLM-4-32B | 300 | 38 | $0.56
9 | Qwen3.5-27B | 350 | 35 | $0.19
10 | DeepSeek V4 Pro | 400 | 30 | $0.78
11 | MiniMax M2.5 | 450 | 28 | $1.15
12 | GLM-5 | 500 | 25 | $1.92
13 | Kimi K2.5 | 600 | 20 | $3.00
14 | DeepSeek-R1 | 800 | 15 | $2.50
15 | Qwen3.5-397B | 1200 | 10 | $2.34

I was shocked at the range. The fastest model (Step-3.5-Flash at 80 tok/s) is 8 times faster than the slowest (Qwen3.5-397B at 10 tok/s). And the price difference? $0.15 vs $2.34 per million output tokens. The slowest model costs more than 15 times as much per token and runs at 1/8th the speed.
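To make those ratios concrete, here is a small sketch that recomputes them from the two table rows and estimates how long a full answer would take to stream. The helper seconds_to_finish and the 500-token answer length are assumptions for illustration, and the estimate assumes the model generates at a constant rate after the first token.

```python
def seconds_to_finish(ttft_ms, tokens_per_sec, answer_tokens):
    """Rough wall-clock time for a whole answer: wait for the first token, then stream the rest."""
    return ttft_ms / 1000 + answer_tokens / tokens_per_sec


# Rows copied from the table above (price is USD per 1M output tokens)
fastest = {"model": "Step-3.5-Flash", "ttft_ms": 120, "tps": 80, "price": 0.15}
slowest = {"model": "Qwen3.5-397B", "ttft_ms": 1200, "tps": 10, "price": 2.34}

speed_ratio = fastest["tps"] / slowest["tps"]
price_ratio = slowest["price"] / fastest["price"]
print(f"speed ratio: {speed_ratio:.1f}x, price ratio: {price_ratio:.1f}x")

# 500 tokens is a hypothetical answer length, not a number from my tests
for row in (fastest, slowest):
    secs = seconds_to_finish(row["ttft_ms"], row["tps"], 500)
    print(f"{row['model']}: about {secs:.1f}s for a 500-token answer")
```

The per-token numbers look abstract until they are multiplied out: at these rates, the same answer takes seconds on one model and most of a minute on the other.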

The "Wait, It Costs WHAT?" Discovery

Let me tell you about my favorite discovery: Qwen3-8B.

This model costs $0.01 per million output tokens. That's one cent. For 70 tokens every second. With a 150ms first response time.

I literally refreshed my browser three times thinking the pricing page was broken. $0.01/M? That's practically free. At that price, you could generate a million tokens for the price of a gumball.

For my bootcamp projects, this was a game-changer. I could build a chatbot that responded instantly and it would cost me pennies to run.
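"Pennies to run" is easy to check with a little arithmetic. This is a rough sketch, assuming a made-up traffic level of 1,000 chats a day at 300 output tokens each. The function name output_cost_usd is hypothetical, and the estimate covers output tokens only, because those are the only prices in my table.

```python
def output_cost_usd(output_tokens, price_per_million_usd):
    """Cost of generated tokens only; input tokens are billed separately and not included."""
    return output_tokens / 1_000_000 * price_per_million_usd


# Hypothetical traffic: 1,000 chats a day, 300 output tokens each, for 30 days
monthly_tokens = 1_000 * 300 * 30

# Prices copied from the table (USD per 1M output tokens)
prices = {"Qwen3-8B": 0.01, "Step-3.5-Flash": 0.15, "Kimi K2.5": 3.00}

for model, price in prices.items():
    cost = output_cost_usd(monthly_tokens, price)
    print(f"{model}: ${cost:.2f} per month")
```

Under those assumptions the month comes to 9 million output tokens, which is where the gap between one cent and three dollars per million starts to show up on a bill.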

Here's how I used it in a real app:

import json

import requests


def chat_with_qwen(user_message, api_key):
    """Yield the reply piece by piece as it streams in."""
    response = requests.post(
        "https://global-apis.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "model": "Qwen3-8B",
            "messages": [{"role": "user", "content": user_message}],
            "stream": True,
            "max_tokens": 500
        },
        stream=True,
        timeout=60
    )
    response.raise_for_status()  # surface auth and quota errors early

    for line in response.iter_lines():
        if not line:
            continue
        decoded = line.decode('utf-8')
        if decoded.startswith('data: ') and decoded != 'data: [DONE]':
            data = json.loads(decoded[6:])
            choices = data.get('choices') or []
            if not choices:
                continue
            # The first chunk may carry only the role, so content can be missing
            content = choices[0].get('delta', {}).get('content') or ''
            if content:
                yield content


# Usage: print chunks as they arrive and collect the full reply.
# (A generator's return value is not visible to a for loop, so build the text here.)
full_response = ""
for chunk in chat_with_qwen("Explain machine learning like I'm 5", "your-api-key"):
    print(chunk, end='', flush=True)
    full_response += chunk

The Speed vs. Quality Tradeoff I Didn't Know Existed

Here's where things got interesting. I assumed faster = worse quality. But that's not always true.

DeepSeek V4 Flash runs at 60 tok/s with 180ms TTFT. That's lightning fast. But it also has GPT-4o-class quality (whatever that means — I'm still learning). At $0.25/M, it's the sweet spot for most of my projects.

Compare that to DeepSeek V4 Pro at 30 tok/s and $0.78/M. It runs at half the speed and costs about three times as much. The quality difference? Honestly, for my simple apps, I couldn't tell.

But then we get into the premium tier. Models like GLM-5 ($1.92/M, 25 tok/s) and Kimi K2.5 ($3.00/M, 20 tok/s) are clearly optimised for something else. These are for when you absolutely need the best possible answer and don't care about waiting 2-3 seconds.

The Geographic Surprise

One thing I hadn't considered: where the servers are matters a lot.

I tested from Ohio and Singapore. The difference was eye-opening:

  • Kimi K2.5: 600ms from US, 480ms from Asia (120ms, or 20%, faster)
  • GLM-5: 500ms from US, 420ms from Asia (80ms, or 16%, faster)
  • DeepSeek V4 Flash: 180ms from US, 150ms from Asia (only 30ms, or 17%, faster)

Qwen, GLM, and Kimi seem to have better infrastructure in Asia. DeepSeek's percentage gain was about the same, but the absolute gap was only 30ms, so in practice its latency felt similar from both locations.
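The comparison above is simple arithmetic on the TTFT numbers, and it helps to look at the absolute gap next to the percentage. Here is a minimal sketch using the three measurements from the list; the helper name latency_gap is my own, for illustration.

```python
def latency_gap(us_ms, asia_ms):
    """Return (gap in ms, percent faster from Asia, relative to the US number)."""
    gap = us_ms - asia_ms
    return gap, gap / us_ms * 100


# (US TTFT ms, Asia TTFT ms), copied from the list above
measurements = {
    "Kimi K2.5": (600, 480),
    "GLM-5": (500, 420),
    "DeepSeek V4 Flash": (180, 150),
}

for model, (us_ms, asia_ms) in measurements.items():
    gap, pct = latency_gap(us_ms, asia_ms)
    print(f"{model}: {gap}ms faster from Asia ({pct:.1f}%)")
```

A similar percentage can hide very different experiences: shaving 120ms off a 600ms wait is noticeable, while 30ms off an already fast response is not.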

For my US-based users, I learned to stick with models that have good US latency. The difference between 180ms and 600ms is the difference between "instant" and "I'm going to check Instagram while I wait."

What This Means for Real Users

I built a simple chat app to test user reactions to different speeds. Here's what I found:

  • Under 200ms TTFT: Users said "wow, this is fast" and kept chatting
  • 200-400ms: "Pretty good" — acceptable, no complaints
  • 400-800ms: "It's a bit slow" — some users got annoyed
  • 800ms+: "This app is broken" — users left within 2 minutes

The lesson? For interactive chat, keep TTFT under 400ms. That rules out MiniMax M2.5 (450ms), GLM-5 (500ms), Kimi K2.5 (600ms), DeepSeek-R1 (800ms), and Qwen3.5-397B (1200ms). DeepSeek V4 Pro (400ms) sits right on the line.
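The buckets and the 400ms cutoff can be applied to the whole table in a few lines. This is a sketch, not the script I used: the function name ttft_bucket and the short labels are made up, and I'm assuming each boundary value (200, 400, 800) belongs to the slower bucket, which the list above doesn't actually specify.

```python
def ttft_bucket(ttft_ms):
    """Map a TTFT to the user reaction I saw. Boundary values go to the slower bucket (an assumption)."""
    if ttft_ms < 200:
        return "wow, this is fast"
    if ttft_ms < 400:
        return "pretty good"
    if ttft_ms < 800:
        return "a bit slow"
    return "this app is broken"


# TTFT in ms, copied from the rankings table
TTFT_MS = {
    "Step-3.5-Flash": 120, "Qwen3-8B": 150, "DeepSeek V4 Flash": 180,
    "Hunyuan-TurboS": 200, "Doubao-Seed-Lite": 220, "Qwen3-32B": 250,
    "Hunyuan-Turbo": 280, "GLM-4-32B": 300, "Qwen3.5-27B": 350,
    "DeepSeek V4 Pro": 400, "MiniMax M2.5": 450, "GLM-5": 500,
    "Kimi K2.5": 600, "DeepSeek-R1": 800, "Qwen3.5-397B": 1200,
}

# Models strictly under the 400ms cutoff for interactive chat
chat_ready = [model for model, ms in TTFT_MS.items() if ms < 400]

for model, ms in TTFT_MS.items():
    print(f"{model}: {ms}ms -> {ttft_bucket(ms)}")
print("OK for chat:", ", ".join(chat_ready))
```

With a strict "under 400ms" rule, a model sitting at exactly 400ms is excluded, so it's worth deciding up front how to treat borderline cases.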

My Personal Recommendations (For Fellow Bootcamp Grads)

If you're building your first AI app like me, here's what I'd suggest:

For chatbots where speed matters: Use Step-3.5-Flash (80 tok/s, $0.15/M) or Qwen3-8B (70 tok/s, $0.01/M). Your users will think it's magic.

For production apps with quality needs: DeepSeek V4 Flash (60 tok/s, $0.25/M) is your friend. Good speed, good quality, reasonable price.

For complex reasoning tasks: Accept the speed hit and use DeepSeek V4 Pro (30 tok/s, $0.78/M) or GLM-4-32B (38 tok/s, $0.56/M). But warn your users there might be a delay.

For budget projects: Qwen3-8B at $0.01/M is unbeatable. You can run it all day for pocket change.

Why I'm Using Global API

Full disclosure: I found all these models through the Global API at https://global-apis.com/v1. It's one API key that gives you access to all 15+ models I tested.

For someone like me who can barely keep track of one API key, let alone 15, this is a lifesaver. No need to sign up for 15 different services. No need to remember 15 different pricing tiers. Just one endpoint, one key, and I can switch between models by changing one parameter.
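Here is roughly what "changing one parameter" looks like. It's a minimal sketch that only builds the request body and doesn't call the API. The helper name build_chat_payload is hypothetical, and I'm assuming the API accepts the model names exactly as written in my table, which may not hold for every model.

```python
def build_chat_payload(model, user_message, max_tokens=500):
    """Build the JSON body for a streaming chat request; only `model` changes between models."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": True,
        "max_tokens": max_tokens,
    }


# Model identifiers are assumed to match the names in the table
payload_a = build_chat_payload("Qwen3-8B", "Explain recursion in 200 words")
payload_b = build_chat_payload("Step-3.5-Flash", "Explain recursion in 200 words")

# List the keys whose values differ between the two payloads
changed = [key for key in payload_a if payload_a[key] != payload_b[key]]
print("fields that differ:", changed)
```

Keeping the model name as a plain argument means swapping models in a benchmark loop or a config file doesn't touch any other code.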

If you're building your first AI app and don't want to deal with infrastructure headaches, check it out. It's what I use for all my projects now.

Final Thoughts

Speed isn't just a nice-to-have — it's the difference between users loving your app and users hating it. I learned this the hard way.

But now I know: pick the right model for your use case. Don't overpay for quality you don't need. And always, always test from your users' location.

Happy coding, fellow bootcamp grads. Go build something fast.
