Building a Production-Aware AI Backend with FastAPI

Most AI backend examples stop at one thing:

send a prompt, get a response.

That is fine for demos, but real systems usually need more than that.

Once you try to use AI inside an actual product, a few practical questions show up immediately:

  • How much does each request cost?
  • How long does each response take?
  • Can we stream output instead of waiting for a full response?
  • Can we reduce hallucinations by grounding responses in known data?
  • Can we log usage for billing and analytics?

I wanted to build something closer to that reality.

So instead of making another thin OpenAI wrapper, I built a FastAPI-based AI backend with:

  • synchronous responses
  • streaming responses
  • usage logging
  • token-based cost estimation
  • response time monitoring
  • lightweight context-based answering
  • Docker reproducibility

The result is a backend that feels much closer to something you could actually extend into an internal AI tool or SaaS feature.


Why I Built It This Way

A lot of AI tutorials focus on model output.

I wanted to focus on backend behavior.

That means:

  • responses should be explainable
  • system behavior should be predictable
  • logs should support observability
  • the backend should be structured for extension, not just for demo screenshots

In other words, I was less interested in “Can this call OpenAI?”
and more interested in “Can this behave like a real backend feature?”


Tech Stack

  • Python 3.11
  • FastAPI
  • SQLAlchemy 2.0
  • Alembic
  • OpenAI API
  • SQLite
  • Docker

What the Backend Does

The project currently includes:

  • POST /ai/test for standard AI responses
  • POST /ai/stream for streaming output
  • POST /ai/upload for adding text-based context data
  • POST /seed for inserting sample context
  • GET /ai/logs for inspecting stored usage logs

It also stores request-level data such as:

  • prompt
  • response
  • total tokens
  • estimated cost
  • response time
  • endpoint name
  • user id
  • timestamp

That logging layer turned out to be one of the most important parts of the project.

Once you can see how AI is actually being used, the backend stops being experimental and starts being operational.


Lightweight RAG-Style Answering

One of the goals was to reduce hallucinated answers.

Instead of letting the model answer freely from its general knowledge, I added a lightweight retrieval flow:

  1. search relevant records from the database
  2. inject the retrieved content into the prompt
  3. instruct the model to answer only from that context

This is not a full vector database setup.
It is intentionally lightweight.

The retrieval logic uses:

  • simple keyword-based matching
  • basic query pre-processing for Japanese text
  • AND search first
  • OR fallback if needed
  • safe fallback responses when no context is found
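The AND-then-OR retrieval with a safe fallback can be sketched in a few lines. This is a simplified stand-in (naive whitespace tokenisation, no Japanese pre-processing), not the project's exact implementation:

```python
def retrieve(query: str, documents: list[str]) -> list[str]:
    """Keyword retrieval: try AND matching first, fall back to OR."""
    terms = [t for t in query.lower().split() if t]
    # AND search: every term must appear in the document
    and_hits = [d for d in documents if all(t in d.lower() for t in terms)]
    if and_hits:
        return and_hits
    # OR fallback: any term is enough
    return [d for d in documents if any(t in d.lower() for t in terms)]

def build_prompt(query: str, context: list[str]) -> str:
    """Inject retrieved context and constrain the model to it."""
    if not context:
        # safe fallback when nothing matches
        return "Answer: I could not find relevant information in the provided data."
    joined = "\n".join(context)
    return (
        "Answer ONLY from the context below. If the answer is not in the "
        f"context, say you do not know.\n\nContext:\n{joined}\n\nQuestion: {query}"
    )
```

The key design point is the ordering: the stricter AND pass keeps results precise, and the OR pass only runs when precision would otherwise return nothing.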

So the project is better described as a lightweight RAG-style backend rather than a full enterprise retrieval system.

That was deliberate.

I wanted something small enough to understand, but structured enough to feel useful.


Why Streaming Matters

Streaming changes the feel of an AI product a lot.

Without streaming, the user waits for the full answer.
With streaming, the user gets feedback immediately.

That makes the backend feel much closer to a real assistant feature.

So I added /ai/stream and then made sure streaming requests were not treated like second-class citizens.

I wanted them logged too.

That meant tracking:

  • total tokens
  • estimated cost
  • response time
  • endpoint name

This was important because a lot of examples show streaming output, but do not show how to observe or measure it properly.

In practice, that observability layer is what makes the feature maintainable.
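One way to make a stream observable is to wrap the chunk generator so it accumulates the data needed for the final log record. This is a hedged sketch with a stubbed token estimate; in the real flow the token count would come from the provider's final usage event:

```python
import time
from typing import Iterator

def stream_with_logging(chunks: Iterator[str], log: dict) -> Iterator[str]:
    """Yield chunks to the client while accumulating observability data.

    `log` is filled in once the stream finishes, so the final record can be
    persisted the same way as synchronous requests.
    """
    start = time.perf_counter()
    parts = []
    for chunk in chunks:
        parts.append(chunk)
        yield chunk
    log["response"] = "".join(parts)
    log["response_time_ms"] = (time.perf_counter() - start) * 1000
    # the provider's final usage event would supply the real count;
    # a rough character-based estimate stands in here
    log["total_tokens"] = max(1, len(log["response"]) // 4)
    log["endpoint"] = "/ai/stream"
```

In FastAPI, this generator can be handed to `StreamingResponse`, with the populated `log` dict persisted after the generator is exhausted.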


Cost Tracking and Latency Monitoring

For regular /ai/test responses, token usage was straightforward to capture.

For streaming, it required a bit more work.

I refactored the provider layer so the streaming flow could still capture usage data at the end, then calculate an estimated cost and store it together with the final response log.

That gave me a much more useful log structure.

Instead of only storing “prompt” and “response,” I could now see:

  • how many tokens were used
  • how much the request approximately cost
  • how long the request took
  • which endpoint generated it

That is a much stronger foundation for:

  • cost visibility
  • future billing models
  • usage analytics
  • performance monitoring
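Cost estimation itself is simple once token usage is captured. A minimal sketch, with hypothetical per-million-token prices (real prices depend on the model and change over time):

```python
# Hypothetical prices in USD per 1M tokens; NOT real published rates.
PRICING = {
    "example-model": {"input": 0.15, "output": 0.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD from captured token usage."""
    price = PRICING[model]
    return (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000
```

Storing the estimate per request, rather than recomputing it later, also protects the logs against future price changes.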

One Practical Issue I Hit

Alembic autogeneration tried to include unrelated schema changes while I was extending the logging table.

It detected the new columns I wanted:

  • estimated_cost
  • response_time_ms
  • endpoint

But it also tried to remove an unrelated documents table.

That was a good reminder that migration generation is helpful, but not magical.

I manually cleaned the migration so it only included the actual intended schema change.
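After cleanup, the migration contained only the intended additive change. A sketch of what such a trimmed Alembic migration looks like (revision identifiers here are placeholders):

```python
"""add cost and latency columns to the request log table"""
from alembic import op
import sqlalchemy as sa

revision = "placeholder"       # filled in by Alembic
down_revision = "placeholder"  # filled in by Alembic

def upgrade():
    # only the intended columns; the autogenerated drop of the
    # unrelated documents table was removed by hand
    op.add_column("ai_logs", sa.Column("estimated_cost", sa.Float(), nullable=True))
    op.add_column("ai_logs", sa.Column("response_time_ms", sa.Float(), nullable=True))
    op.add_column("ai_logs", sa.Column("endpoint", sa.String(), nullable=True))

def downgrade():
    op.drop_column("ai_logs", "endpoint")
    op.drop_column("ai_logs", "response_time_ms")
    op.drop_column("ai_logs", "estimated_cost")
```

The table name `ai_logs` is an assumption for illustration. The habit worth keeping is reading every autogenerated operation before running it.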

That was one of those small but very real backend moments:
not “how do I make the feature work?”
but “how do I make the change safe?”


Current Logging Model

The request log now stores:

  • prompt
  • response
  • total_tokens
  • estimated_cost
  • response_time_ms
  • endpoint
  • user_id
  • created_at

That makes the backend feel much more production-aware than a simple AI demo.


What This Project Is Really About

The most important thing I learned is that AI backend work is not just model integration.

It is also about:

  • structure
  • safety
  • logging
  • reproducibility
  • monitoring
  • extension paths

Calling an API is easy.

Building something that behaves predictably when it grows is the harder part.

That is what I wanted this project to reflect.


What I Would Add Next

The next natural steps would be:

  • JWT authentication
  • token quota control
  • admin-facing usage analytics
  • Stripe integration
  • richer retrieval strategies
  • vector-based search when the use case really needs it

But even in its current form, the backend already demonstrates something important:

AI features become much more valuable when they are treated as backend systems, not just model calls.


Repository

GitHub: fastapi-ai-core

The repository includes Docker setup, logging examples, and a lightweight context-retrieval flow.

If you are building AI-enabled backend systems with FastAPI, I think this kind of structure is worth caring about early.

Because once usage grows, observability stops being a nice-to-have.
It becomes part of the product itself.
