eagerspark

Posted on Jun 15

The Developer's Guide to Building AI Document Q&A Systems

#ai #machinelearning #webdev #python

Three months ago I graduated from a coding bootcamp. I could spin up a React app, build a REST API, and probably debug a CORS issue in my sleep. But AI? That was the scary stuff the senior devs talked about in hushed tones. I had no idea where to even start.

Then I got assigned a side project: build something that lets people upload a PDF and ask questions about it. My first thought was, "Cool, that's probably like, a few lines of code, right?" Spoiler: it was more than a few lines, but also way more achievable than I ever imagined. And honestly, the whole journey kind of blew my mind.

Let me walk you through what I learned, because if a bootcamp grad can figure this out, you definitely can too.

When It Clicked

The first thing that surprised me was how the whole "AI answering questions about your document" thing actually works under the hood. I thought it was some magical black box. Turns out it's not that complicated when you break it down.

You basically take a document, chunk it up into smaller pieces, send the relevant chunks along with the user's question to a language model, and ask it to answer based only on what you've provided. That's it. The fancy term is "Retrieval-Augmented Generation" (the "retrieval" part is picking which chunks to send), but I refuse to call it that because it made me feel dumb for way too long.
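To make the chunking step concrete, here is a minimal sketch. It splits on raw character counts with a small overlap so a sentence cut at a boundary still appears whole in one chunk. The function name `chunk_text` and the default sizes are assumptions for illustration, not what the project necessarily used; real systems often split on paragraphs or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows (illustrative defaults)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text,
        # so we do not emit a trailing chunk that is pure overlap.
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny demonstration with a 4-character window and 1 character of overlap.
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

The overlap trades a little extra token cost for fewer answers that get lost at a chunk boundary.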

The part that really got me, though, was discovering how many different AI models are out there. I always assumed "AI" meant "OpenAI" meant "ChatGPT" meant "really expensive." I was shocked when I found out there's this thing called Global API that gives you access to 184 different models through one connection point. One hundred and eighty-four. I still can't believe that's a real number.

The Pricing Rabbit Hole

Here's where things got interesting. I went down a pricing rabbit hole that I wasn't prepared for. Let me share some of what I found, because these numbers genuinely changed how I think about building with AI.

The cheapest models start at like $0.01 per million tokens. Million. Tokens. I had to Google what a token even was, and now I know it's roughly a piece of a word. So a million tokens is a lot of text. The most expensive models go up to $3.50 per million tokens. That's still a wide range, but compared to what I expected going in? Wild.

Let me show you the comparison that really made me sit up straight. I made a little table in my notes app, and I'm going to share it here because the difference is stark:

Model	Input ($/M tokens)	Output ($/M tokens)	Context Window
DeepSeek V4 Flash	0.27	1.10	128K
DeepSeek V4 Pro	0.55	2.20	200K
Qwen3-32B	0.30	1.20	32K
GLM-4 Plus	0.20	0.80	128K
GPT-4o	2.50	10.00	128K

Look at that last row. GPT-4o costs $10.00 per million output tokens. The cheapest model in this list (GLM-4 Plus) costs $0.80. That makes GPT-4o 12.5 times more expensive on output, and the input side has the same ratio ($2.50 vs. $0.20). For the exact same kind of task.
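If you want to see what those prices mean per question, the arithmetic is short. The sketch below uses the prices from the table above; the query size (3,000 input tokens, 300 output tokens) is a made-up example, not a measurement from my app.

```python
# USD per million tokens (input, output), copied from the table above.
PRICES = {
    "DeepSeek V4 Flash": (0.27, 1.10),
    "DeepSeek V4 Pro": (0.55, 2.20),
    "Qwen3-32B": (0.30, 1.20),
    "GLM-4 Plus": (0.20, 0.80),
    "GPT-4o": (2.50, 10.00),
}


def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request at the table's prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical query: 3,000 tokens of context + question, 300 tokens of answer.
for model in PRICES:
    print(f"{model:<18} ${query_cost(model, 3_000, 300):.6f} per query")
```

At that assumed size, every model in the table comes out at roughly a cent or less per query, and the cheaper ones at about a tenth of a cent.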

I had no idea. I genuinely thought GPT-4o was the only serious option. I was paying the "name brand tax" without even realizing it.

The deeper I dug, the more I found that for document Q&A specifically, you can expect a 40-65% cost reduction compared to just throwing GPT-4o at everything. The quality is comparable, sometimes better. Let me say that again: comparable or better quality, for roughly half the price. My bootcamp self would never have believed this.

My First Working Prototype

Okay so once I had my pricing revelation, I needed to actually build the thing. Here's where Global API came in clutch. I was expecting to need separate accounts, separate API keys, separate SDKs for every model I wanted to test. Nope. One endpoint, one key, 184 models.

Here's the actual code I used for my first successful test. I remember how good it felt when this ran without errors:

import openai
import os

client = openai.OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

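# Smoke test: no document is attached yet, so this only confirms
# that the endpoint, the key, and the model name all work.
response = client.chat.completions.create(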
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "Summarize the key points from this document."}],
)

print(response.choices[0].message.content)

That's it. That's the whole thing. I was using the OpenAI Python library, just pointing it at a different base URL, and suddenly I was talking to a completely different model. I almost didn't believe it worked the first time. I ran it three times before I accepted reality.

For my document Q&A system, I built a slightly more involved version that includes the document chunks in the prompt:

import openai
import os

client = openai.OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

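def answer_question(document_chunks, user_question):
    """Answer a question using only the supplied document chunks as context."""
    # Separate chunks with blank lines so the model can tell them apart.
    context = "\n\n".join(document_chunks)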

    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V4-Flash",
        messages=[
            {
                "role": "system",
                "content": "Answer the user's question based only on the provided context. If the answer isn't in the context, say so."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {user_question}"
            }
        ],
    )

    return response.choices[0].message.content

# Example usage
chunks = [
    "The company was founded in 2019 by three former Google engineers.",
    "Annual revenue reached $50M in 2024, up 200% from the previous year.",
    "The team is headquartered in Austin, Texas."
]

answer = answer_question(chunks, "When was the company founded?")
print(answer)

The deepseek-ai/DeepSeek-V4-Flash model is what I landed on for most queries. It costs $0.27 per million input tokens and $1.10 per million output tokens, with a 128K context window. For document Q&A, that context window is plenty for most documents. The 200K version (DeepSeek V4 Pro) was overkill for what I was building, but it's nice to know it's there.

Things I Wish Someone Had Told Me Earlier

After a few weeks of building, testing, breaking things, and rebuilding, I came up with a list of tips that would've saved me a ton of time. These are the things I now consider non-negotiable when building any AI-powered system.

Cache Aggressively

This was the single biggest cost saver for me. Once I added caching, I saw a 40% hit rate on common questions. Forty percent of my API calls just... disappeared. They served cached responses instead. That's roughly a 40% reduction in API cost for almost no extra work.

I was shocked at how simple this was. Just a Redis instance and a hash of the question (plus the document ID, so an answer from one PDF never gets served for another), and boom, free answers for repeat queries.
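Here's a rough sketch of that idea. To keep it runnable anywhere, it uses a plain dict as a stand-in for Redis; the class name `AnswerCache`, the `qa:` key prefix, and the `document_id` parameter are all assumptions for illustration. The key includes the document ID because the same question can have different answers in different documents.

```python
import hashlib


class AnswerCache:
    """In-memory stand-in for Redis; swap the dict for a Redis client in production."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(document_id: str, question: str) -> str:
        # Normalize case and whitespace so trivially different phrasings share a key.
        normalized = " ".join(question.lower().split())
        digest = hashlib.sha256(f"{document_id}\n{normalized}".encode("utf-8")).hexdigest()
        return f"qa:{digest}"

    def get_or_compute(self, document_id: str, question: str, compute):
        """Return a cached answer, or call compute() once and remember the result."""
        key = self.make_key(document_id, question)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = compute()
        self._store[key] = answer
        return answer
```

In the real app, `compute` would be a small function that calls the model, for example `lambda: answer_question(chunks, question)`. With Redis you would also want an expiry on each key so stale answers eventually drop out.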

Stream Your Responses

Honestly, I added streaming not for the cost benefit but because the user experience was terrible without it. Watching a loading spinner for 5 seconds while the model "thinks" feels broken. Streaming lets words appear as they're generated, which feels way more responsive even though the total time is the same.

The throughput numbers I saw were around 320 tokens per second with about 1.2 seconds of average latency. That's fast, but those 1.2 seconds feel like forever if you're not streaming.
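For reference, streaming with the OpenAI Python library comes down to passing `stream=True` and iterating over the chunks that come back. The sketch below wraps that in a generator; the function name `stream_answer` is my own, and it takes the client as a parameter so it works with the same `client` object shown earlier.

```python
def stream_answer(client, model, messages):
    """Yield text fragments as the model produces them."""
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,  # ask the API to send the answer incrementally
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            yield piece


# Usage (requires a configured client, so it is left commented out here):
# for piece in stream_answer(client, "deepseek-ai/DeepSeek-V4-Flash", messages):
#     print(piece, end="", flush=True)
```

On the web side you would forward each fragment to the browser as it arrives (server-sent events are a common choice) instead of printing it.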

Use Cheaper Models for Simple Stuff

Not every question needs the most powerful model. For stuff like "summarize this paragraph" or "extract the names from this text," I started using the GA-Economy tier models, and it cut costs by another 50%. The expensive models are like using a Ferrari to go to the corner store. Sometimes you just need a bicycle, and the bicycle is actually perfect.
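One simple way to do this kind of routing is a keyword heuristic, sketched below. Everything here is an assumption for illustration: the prefix list, the word limit, and especially `ECONOMY_MODEL`, which is a placeholder rather than a real model ID (check the provider's model list for an actual GA-Economy name).

```python
# Task verbs that usually signal a simple, mechanical request (illustrative list).
SIMPLE_TASK_PREFIXES = ("summarize", "extract", "list", "translate")

PRIMARY_MODEL = "deepseek-ai/DeepSeek-V4-Flash"
ECONOMY_MODEL = "ga-economy-placeholder"  # hypothetical ID, replace with a real one


def pick_model(question: str, max_simple_words: int = 25) -> str:
    """Send short, mechanical requests to the cheap model and the rest to the primary."""
    words = question.strip().lower().split()
    is_short = len(words) <= max_simple_words
    is_simple = bool(words) and words[0].startswith(SIMPLE_TASK_PREFIXES)
    return ECONOMY_MODEL if is_short and is_simple else PRIMARY_MODEL


print(pick_model("Summarize this paragraph"))
print(pick_model("Why did revenue grow faster than headcount in 2024?"))
```

A heuristic like this will misroute some questions, which is one more reason to watch answer quality per model.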

Monitor Quality Like a Hawk

This one I learned the hard way. I thought I had a working system, deployed it, and then got flooded with feedback that the answers were "weird" sometimes. Turns out cheap models are great until they aren't, and the failure mode is subtle. You need to track user satisfaction scores and watch for patterns.

I now log every interaction and have a simple thumbs up/thumbs down on the frontend. It's not fancy, but it tells me when something's off.
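A bare-bones version of that logging can be a JSON Lines file plus a function that turns the votes into a per-model satisfaction rate. This is a sketch under assumptions: the field names and the function names `log_interaction` and `satisfaction_by_model` are mine, and a real deployment would more likely write to a database.

```python
import json
import time
from collections import defaultdict


def log_interaction(path, model, question, answer, thumbs_up=None):
    """Append one interaction to a JSON Lines file; thumbs_up is True, False, or None."""
    record = {
        "timestamp": time.time(),
        "model": model,
        "question": question,
        "answer": answer,
        "thumbs_up": thumbs_up,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def satisfaction_by_model(path):
    """Return {model: share of thumbs-up} over interactions that received a vote."""
    votes = defaultdict(lambda: [0, 0])  # model -> [thumbs up, total votes]
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["thumbs_up"] is None:
                continue  # the user never voted on this answer
            votes[record["model"]][0] += int(record["thumbs_up"])
            votes[record["model"]][1] += 1
    return {model: ups / total for model, (ups, total) in votes.items()}
```

Comparing the rate between models, or week over week for the same model, is usually enough to spot when answers have started going "weird".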

Always Have a Fallback

The first time I hit a rate limit in production, my entire app went down. I was mortified. Now I have a fallback model configured. If the primary model fails or rate-limits, the system gracefully switches to a different one. Users don't even notice.

This was a 30-minute fix that probably saved me from getting fired.
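The fallback logic itself can be very small. Here is a sketch, with assumed names: `ask` is any function that takes a model ID and returns an answer, and `retry_on` is the set of exceptions that should trigger a switch. It catches every `Exception` by default only to stay self-contained; with the OpenAI library you would probably narrow it to something like `(openai.RateLimitError, openai.APIConnectionError, openai.APITimeoutError)`.

```python
def ask_with_fallback(ask, models, retry_on=(Exception,)):
    """Try each model in order; return (model, answer) from the first that succeeds."""
    last_error = None
    for model in models:
        try:
            return model, ask(model)
        except retry_on as error:
            # Remember the failure and move on to the next model in the list.
            last_error = error
    raise RuntimeError("all models failed") from last_error


# Usage (requires the client and answer_question from earlier, so it is commented out;
# the second model ID is only an example):
# model, answer = ask_with_fallback(
#     lambda m: answer_question_with_model(m, chunks, question),  # hypothetical helper
#     ["deepseek-ai/DeepSeek-V4-Flash", "another-model-id"],
# )
```

Returning the model name alongside the answer makes it easy to log how often the fallback actually kicks in.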

The Numbers That Made Me a Believer

Let me share some benchmarks I saw while researching. The quality score across these models averaged around 84.6% on standard Q&A benchmarks. That's not just "good enough," that's genuinely good. And remember, these are the cheaper models. The expensive ones might score a bit higher, but for document Q&A specifically, the difference isn't worth the cost.

The setup time was the other big surprise. I went from "I have no idea what I'm doing" to "I have a working prototype" in under 10 minutes. I timed it. Most of that was waiting for pip install to finish. The actual code was maybe 20 lines.

What I'd Tell Past Me

If I could go back to the version of me that was terrified of building anything with AI, I'd say a few things.

First: it's not magic. It's just APIs. If you can call a REST endpoint, you can build with AI. The abstractions are good now.

Second: stop assuming the most expensive option is the best. It's not. Especially for document Q&A, where the task is well-defined and the models have gotten really good across the board.

Third: the pricing is way more reasonable than you think. We're talking fractions of a cent per query in most cases. You can build something, put it in front of real users, and iterate without going bankrupt.

Fourth: the setup is fast. I cannot stress this enough. I spent more time on the HTML for the upload form than I did on the actual AI integration.

My Current Setup

For anyone curious what I actually ended up shipping, here's the gist. I'm using DeepSeek V4 Flash as the primary model for most document Q&A tasks. It runs at $0.27 per million input tokens and $1.10 per million output tokens. Super simple queries get routed to a GA-Economy model instead. For documents that exceed the 128K context window, I use DeepSeek V4 Pro with its 200K context.
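The context-window part of that routing can be sketched like this. Several things are assumptions: the Pro model ID is guessed by analogy with the Flash one, "128K" and "200K" are treated as 128,000 and 200,000 tokens, and the token estimate uses the rough rule of thumb of about four characters per token for English text. For anything precise you would count tokens with the model's own tokenizer.

```python
# Context windows in tokens; the Pro model ID is an assumed name.
CONTEXT_WINDOWS = {
    "deepseek-ai/DeepSeek-V4-Flash": 128_000,
    "deepseek-ai/DeepSeek-V4-Pro": 200_000,
}


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token for English text."""
    return len(text) // 4 + 1


def pick_model_for_context(context: str, question: str, reserve_for_answer: int = 2_000) -> str:
    """Return the smallest-window model that fits the prompt plus room for the answer."""
    needed = estimate_tokens(context) + estimate_tokens(question) + reserve_for_answer
    for model, window in sorted(CONTEXT_WINDOWS.items(), key=lambda item: item[1]):
        if needed <= window:
            return model
    raise ValueError(f"about {needed} tokens needed; no configured model is large enough")
```

When even the largest window is too small, the usual answer is to send fewer chunks rather than to look for a bigger model.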

The whole thing routes through Global API's unified endpoint. One base URL (https://global-apis.com/v1), one API key, and I can swap models whenever I want without rewriting a single line of integration code. That last part is what sealed the deal for me. I'm not locked into any one provider. If a better model comes out next week, I can test it in an afternoon.

Final Thoughts

Building this thing as a bootcamp grad was a turning point for me. I went from "AI is too complicated for me" to "AI is just another tool in my toolbox" in the span of a few weeks. The barrier to entry is way lower than I thought, and the cost is way lower than I thought.

If you're in a similar position, where you've been putting off building with AI because it seems expensive or scary, I'd say just start. Pick a use case. Write some code. Hit that API endpoint and see what happens. I was shocked at how quickly the pieces came together, and I think you will be too.

If you want to explore Global API and see what 184 models feel like, check it out at global-apis.com. They have a unified SDK that makes everything I described above even easier than what I showed you here. And I think they give out some free credits to get you started, so you can experiment without committing any money.

That's about it. Happy building, and don't let the AI stuff intimidate you. We're all just figuring it out as we go.

DEV Community
