Building a Production-Aware AI Backend with FastAPI

Most AI backend examples stop at one thing:

send a prompt, get a response.

That is fine for demos, but real systems usually need more than that.

Once you try to use AI inside an actual product, a few practical questions show up immediately:

  • How much does each request cost?
  • How long does each response take?
  • Can we stream output instead of waiting for a full response?
  • Can we reduce hallucinations by grounding responses in known data?
  • Can we log usage for billing and analytics?

I wanted to build something closer to that reality.

So instead of making another thin OpenAI wrapper, I built a FastAPI-based AI backend with:

  • synchronous responses
  • streaming responses
  • usage logging
  • token-based cost estimation
  • response time monitoring
  • lightweight context-based answering
  • Docker reproducibility

The result is a backend that feels much closer to something you could actually extend into an internal AI tool or SaaS feature.


Why I Built It This Way

A lot of AI tutorials focus on model output.

I wanted to focus on backend behavior.

That means:

  • responses should be explainable
  • system behavior should be predictable
  • logs should support observability
  • the backend should be structured for extension, not just for demo screenshots

In other words, I was less interested in “Can this call OpenAI?”
and more interested in “Can this behave like a real backend feature?”


Tech Stack

  • Python 3.11
  • FastAPI
  • SQLAlchemy 2.0
  • Alembic
  • OpenAI API
  • SQLite
  • Docker

What the Backend Does

The project currently includes:

  • POST /ai/test for standard AI responses
  • POST /ai/stream for streaming output
  • POST /ai/upload for adding text-based context data
  • POST /seed for inserting sample context
  • GET /ai/logs for inspecting stored usage logs

It also stores request-level data such as:

  • prompt
  • response
  • total tokens
  • estimated cost
  • response time
  • endpoint name
  • user id
  • timestamp

That logging layer turned out to be one of the most important parts of the project.

Once you can see how AI is actually being used, the backend stops being experimental and starts being operational.


Lightweight RAG-Style Answering

One of the goals was to reduce hallucinated answers.

Instead of letting the model answer freely from its general knowledge, I added a lightweight retrieval flow:

  1. search relevant records from the database
  2. inject the retrieved content into the prompt
  3. instruct the model to answer only from that context

This is not a full vector database setup.
It is intentionally lightweight.

The retrieval logic uses:

  • simple keyword-based matching
  • basic query pre-processing for Japanese text
  • AND search first
  • OR fallback if needed
  • safe fallback responses when no context is found
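The AND-then-OR retrieval with a safe fallback can be sketched in a few lines. This is a simplified stand-in (naive whitespace tokenisation, no Japanese pre-processing), not the project's exact implementation:

```python
def retrieve(query: str, documents: list[str]) -> list[str]:
    """Keyword retrieval: try AND matching first, fall back to OR."""
    terms = [t for t in query.lower().split() if t]
    # AND search: every term must appear in the document
    and_hits = [d for d in documents if all(t in d.lower() for t in terms)]
    if and_hits:
        return and_hits
    # OR fallback: any term is enough
    return [d for d in documents if any(t in d.lower() for t in terms)]

def build_prompt(query: str, context: list[str]) -> str:
    """Inject retrieved context and constrain the model to it."""
    if not context:
        # safe fallback when nothing matches
        return "Answer: I could not find relevant information in the provided data."
    joined = "\n".join(context)
    return (
        "Answer ONLY from the context below. If the answer is not in the "
        f"context, say you do not know.\n\nContext:\n{joined}\n\nQuestion: {query}"
    )
```

The key design point is the ordering: the stricter AND pass keeps results precise, and the OR pass only runs when precision would otherwise return nothing.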

So the project is better described as a lightweight RAG-style backend rather than a full enterprise retrieval system.

That was deliberate.

I wanted something small enough to understand, but structured enough to feel useful.


Why Streaming Matters

Streaming changes the feel of an AI product a lot.

Without streaming, the user waits for the full answer.
With streaming, the user gets feedback immediately.

That makes the backend feel much closer to a real assistant feature.

So I added /ai/stream and then made sure streaming requests were not treated like second-class citizens.

I wanted them logged too.

That meant tracking:

  • total tokens
  • estimated cost
  • response time
  • endpoint name

This was important because a lot of examples show streaming output, but do not show how to observe or measure it properly.

In practice, that observability layer is what makes the feature maintainable.
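One way to make a stream observable is to wrap the chunk generator so it accumulates the data needed for the final log record. This is a hedged sketch with a stubbed token estimate; in the real flow the token count would come from the provider's final usage event:

```python
import time
from typing import Iterator

def stream_with_logging(chunks: Iterator[str], log: dict) -> Iterator[str]:
    """Yield chunks to the client while accumulating observability data.

    `log` is filled in once the stream finishes, so the final record can be
    persisted the same way as synchronous requests.
    """
    start = time.perf_counter()
    parts = []
    for chunk in chunks:
        parts.append(chunk)
        yield chunk
    log["response"] = "".join(parts)
    log["response_time_ms"] = (time.perf_counter() - start) * 1000
    # the provider's final usage event would supply the real count;
    # a rough character-based estimate stands in here
    log["total_tokens"] = max(1, len(log["response"]) // 4)
    log["endpoint"] = "/ai/stream"
```

In FastAPI, this generator can be handed to `StreamingResponse`, with the populated `log` dict persisted after the generator is exhausted.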


Cost Tracking and Latency Monitoring

For regular /ai/test responses, token usage was straightforward to capture.

For streaming, it required a bit more work.

I refactored the provider layer so the streaming flow could still capture usage data at the end, then calculate an estimated cost and store it together with the final response log.

That gave me a much more useful log structure.

Instead of only storing “prompt” and “response,” I could now see:

  • how many tokens were used
  • how much the request approximately cost
  • how long the request took
  • which endpoint generated it

That is a much stronger foundation for:

  • cost visibility
  • future billing models
  • usage analytics
  • performance monitoring
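Cost estimation itself is simple once token usage is captured. A minimal sketch, with hypothetical per-million-token prices (real prices depend on the model and change over time):

```python
# Hypothetical prices in USD per 1M tokens; NOT real published rates.
PRICING = {
    "example-model": {"input": 0.15, "output": 0.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD from captured token usage."""
    price = PRICING[model]
    return (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000
```

Storing the estimate per request, rather than recomputing it later, also protects the logs against future price changes.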

One Practical Issue I Hit

Alembic autogeneration tried to include unrelated schema changes while I was extending the logging table.

It detected the new columns I wanted:

  • estimated_cost
  • response_time_ms
  • endpoint

But it also tried to remove an unrelated documents table.

That was a good reminder that migration generation is helpful, but not magical.

I manually cleaned the migration so it only included the actual intended schema change.
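After cleanup, the migration contained only the intended additive change. A sketch of what such a trimmed Alembic migration looks like (revision identifiers here are placeholders):

```python
"""add cost and latency columns to the request log table"""
from alembic import op
import sqlalchemy as sa

revision = "placeholder"       # filled in by Alembic
down_revision = "placeholder"  # filled in by Alembic

def upgrade():
    # only the intended columns; the autogenerated drop of the
    # unrelated documents table was removed by hand
    op.add_column("ai_logs", sa.Column("estimated_cost", sa.Float(), nullable=True))
    op.add_column("ai_logs", sa.Column("response_time_ms", sa.Float(), nullable=True))
    op.add_column("ai_logs", sa.Column("endpoint", sa.String(), nullable=True))

def downgrade():
    op.drop_column("ai_logs", "endpoint")
    op.drop_column("ai_logs", "response_time_ms")
    op.drop_column("ai_logs", "estimated_cost")
```

The table name `ai_logs` is an assumption for illustration. The habit worth keeping is reading every autogenerated operation before running it.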

That was one of those small but very real backend moments:
not “how do I make the feature work?”
but “how do I make the change safe?”


Current Logging Model

The request log now stores:

  • prompt
  • response
  • total_tokens
  • estimated_cost
  • response_time_ms
  • endpoint
  • user_id
  • created_at

That makes the backend feel much more production-aware than a simple AI demo.


What This Project Is Really About

The most important thing I learned is that AI backend work is not just model integration.

It is also about:

  • structure
  • safety
  • logging
  • reproducibility
  • monitoring
  • extension paths

Calling an API is easy.

Building something that behaves predictably when it grows is the harder part.

That is what I wanted this project to reflect.


What I Would Add Next

The next natural steps would be:

  • JWT authentication
  • token quota control
  • admin-facing usage analytics
  • Stripe integration
  • richer retrieval strategies
  • vector-based search when the use case really needs it

But even in its current form, the backend already demonstrates something important:

AI features become much more valuable when they are treated as backend systems, not just model calls.


Repository

GitHub: fastapi-ai-core

The repository includes Docker setup, logging examples, and a lightweight context-retrieval flow.

If you are building AI-enabled backend systems with FastAPI, I think this kind of structure is worth caring about early.

Because once usage grows, observability stops being a nice-to-have.
It becomes part of the product itself.
