Part 3 of building a retail inventory API and then giving it a brain.
In Part 1 I explained why I archived my first API and started over. In Part 2 I restructured it properly, migrated to Supabase, and got to 97% test coverage.
The retail API is solid now. Working in production. Tests passing. Architecture I can explain.
So I started the next thing: a chatbot API. Same stack, new layer. The goal: a conversational AI service that remembers what you said, manages long conversations intelligently, and eventually connects to the retail inventory data.
This is what the last few weeks looked like.
Why Build the API at All
I could use a ready-made chatbot SDK. Drop in a library, wrap the OpenAI call, done in an afternoon.
The problem is the same one I had with the retail API the first time. I could make it work without understanding any of it.
I wanted to know what happens when a conversation gets too long for the context window. How tokens get counted. Why the response structure changed between model versions. What "streaming" actually means at the transport layer.
The only way to learn that is to build it yourself and break it repeatedly.
The Stack and the Plan
Same core as the retail API: FastAPI, SQLModel, PostgreSQL. Add OpenAI's Python SDK.
Three-layer architecture: routers take requests, controllers handle logic, services talk to external APIs. Models define the data shapes. Everything lives under app/.
The plan was a roadmap of PRs, one feature at a time. No commits directly to main. Every change reviewed before merge.
I had a senior dev reviewing. Same as the retail API. He reviews my PRs, asks hard questions, and doesn't let things slide.
The First Six PRs Were Fast
Scaffold, config, models, logging, error handling, health endpoint. These were mechanical. I'd done the patterns before on the retail API. Took a few days.
The models are worth mentioning because they drove everything else.
Two tables: Conversation (user_id, title, timestamps) and Message (conversation_id, user_message, ai_response, model used, tokens consumed, latency). Every API call gets stored. I wanted to track exactly what the model said, which version said it, how many tokens it used, and how long it took.
This felt like overkill at the time. It wasn't. That data saved me multiple times during debugging.
PR 7: First Working Chat Endpoint
This is where it got real.
The chat endpoint needed to do a few things at once: create or continue a conversation, load the message history, build the right payload for OpenAI, call the API, store the result, return a clean response.
I wrote a single chat_controller function that handles both new and existing conversations. If no conversation_id is in the request, create one. If there is one, fetch the history and continue.
The controller builds the messages array like this:
```python
messages = [{"role": "system", "content": config.openai_system_prompt}]
for msg in history:
    messages.append({"role": "user", "content": msg.user_message})
    messages.append({"role": "assistant", "content": msg.ai_response})
messages.append({"role": "user", "content": request.user_message})
```
Simple. Obvious in hindsight.
The decorator for OpenAI error handling was more interesting. A @handle_openai_errors wrapper catches APITimeoutError, APIError, and generic exceptions, and converts them into clean HTTP responses with consistent error codes.
The controller doesn't need to know how OpenAI fails (it just calls the service).
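A sketch of what a decorator like that can look like. The exception and response classes here are stdlib stand-ins (so the snippet runs without the openai or fastapi packages installed), and the specific error codes are illustrative, not the project's actual ones:

```python
import asyncio
import functools

# Stand-ins for openai.APITimeoutError / openai.APIError and FastAPI's
# HTTPException, so this sketch has no third-party dependencies.
class APITimeoutError(Exception): ...
class APIError(Exception): ...

class HTTPException(Exception):
    def __init__(self, status_code, detail):
        self.status_code = status_code
        self.detail = detail

def handle_openai_errors(func):
    """Map OpenAI SDK failures to clean HTTP errors with consistent codes."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        except APITimeoutError:
            raise HTTPException(504, {"error_code": "openai_timeout"})
        except APIError:
            raise HTTPException(502, {"error_code": "openai_error"})
        except Exception:
            raise HTTPException(500, {"error_code": "internal_error"})
    return wrapper

@handle_openai_errors
async def flaky_call():
    raise APITimeoutError("upstream timed out")

try:
    asyncio.run(flaky_call())
except HTTPException as exc:
    print(exc.status_code)  # 504
```

The payoff is exactly the separation described above: every service call gets the same failure translation, and the controller only ever sees clean HTTP errors.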
PR 8: The Project Structure Refactor
The retail API taught me about app/ structure. I used it from the start here. But after seven PRs, I had a problem: config, database, and logging were scattered.
PR 8 moved everything infrastructure-related into app/core/. Config lives there. Database engine and session dependency live there. Logger setup lives there. Custom exceptions live there under app/core/errors/.
This sounds like housekeeping. It was. But now when I onboard someone, I can say: "The business logic is in controllers/. The infrastructure is in core/. The data shapes are in models/. Nothing bleeds between them."
That's worth a whole PR.
PR 11: The Debug Session From Hell
I had a working structure. Clean architecture. Good patterns.
Then I ran the API for the first time locally and nothing worked.
The first error was a datetime serialization crash. JSONResponse couldn't serialize a Python datetime object. The fix was one method call — model_dump(mode="json") instead of model_dump(). Mode "json" converts datetimes and UUIDs to strings; the default mode, "python", leaves them as Python objects. I didn't know that distinction existed.
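A stdlib-only illustration of why this crashes, and what mode="json" effectively does to the payload before serialization:

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

record = {"id": uuid4(), "created_at": datetime.now(timezone.utc)}

# This is the failure JSONResponse hits with raw Python objects:
try:
    json.dumps(record)
except TypeError as exc:
    print(exc)  # Object of type UUID is not JSON serializable

# model_dump(mode="json") does the equivalent of this conversion,
# turning UUIDs and datetimes into plain strings first:
jsonable = {
    "id": str(record["id"]),
    "created_at": record["created_at"].isoformat(),
}
print(json.dumps(jsonable))
```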
The second error was the OpenAI API key not being found. I'd configured it with pydantic-settings, which reads .env files into a Settings object. What I didn't know: pydantic-settings populates the Settings object, but it doesn't write to os.environ. The OpenAI library reads os.environ. So the key existed in config.openai_api_key but the OpenAI client couldn't see it.
Fix: explicitly pass the key at client initialization.
```python
client = AsyncOpenAI(api_key=config.openai_api_key)
```
I'd never had to think about this before because I'd never mixed a settings manager with a third-party SDK that reads environment variables directly.
Third error: max_tokens is deprecated for GPT-5 models. Use max_completion_tokens. Got a 400.
Fourth error: temperature isn't supported by GPT-5 mini. Another 400.
Fifth error: responses came back empty. The conversation history was being sent in reverse order (newest messages first). The model received the conversation backwards and returned nothing useful. Changed order_by(desc()) to order_by(asc()).
Sixth error: message history serialized as {} in the response. SQLModel's table=True models loaded from the database don't serialize cleanly through Pydantic v2 when using sa_column. The fix: Message.model_validate(msg) on each history item before returning.
Six separate bugs in one session. Each one taught me something I couldn't have read in a tutorial, because tutorials don't show you what breaks when you combine five technologies at once.
The API worked at the end of the day.
Context Window: The Feature I Underestimated
GPT-5 mini has a 128k-token context window. That sounds like plenty. It isn't an excuse to be lazy.
Sending the entire conversation history on every request is wasteful. At scale it's expensive. And for a long conversation, the model spends time processing messages from hours ago that aren't relevant anymore.
The initial implementation was message-count based: keep the last 15 messages. Simple, configurable via env var.
But message count and token count aren't the same thing. Ten short messages are not the same as ten essay-length responses.
After seeing GPT's assessment of the code, I moved to proper token-based trimming using tiktoken. The trimmer always preserves the system prompt and the latest user message; those are non-negotiable. Then it walks backwards through history, removing the oldest pairs of messages until the total token count fits within the limit.
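The trimming loop looks roughly like this. Names are illustrative rather than the actual code, and `count_tokens` is a word-count stand-in for tiktoken (the real version would use `len(encoding.encode(text))`):

```python
def count_tokens(text: str) -> int:
    # Stand-in for tiktoken; real code would encode with the model's
    # tokenizer and count the resulting tokens.
    return len(text.split())

def trim_history(system_prompt, history, latest_user_message, limit):
    """Drop the oldest user/assistant pairs until the payload fits.

    The system prompt and the latest user message are always kept.
    `history` is a list of (user_message, ai_response) pairs, oldest first.
    """
    # Tokens we must always pay for, no matter what gets evicted:
    fixed = count_tokens(system_prompt) + count_tokens(latest_user_message)
    kept = list(history)

    def total():
        return fixed + sum(count_tokens(u) + count_tokens(a) for u, a in kept)

    while kept and total() > limit:
        kept.pop(0)  # evict the oldest pair first
    return kept
```

Anything popped off the front is what gets handed to the summarization step instead of being thrown away.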
When messages get evicted, they don't disappear. A background task runs after the main response is returned. It takes the evicted messages and calls a cheap utility model to update a rolling summary, which gets stored on the Conversation record and injected back into context on the next request.
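A minimal sketch of that summarization step. `summarize` stands in for the cheap-model call, and every name here is hypothetical, not the actual implementation:

```python
def update_rolling_summary(previous_summary, evicted_pairs, summarize):
    """Fold evicted (user, assistant) message pairs into a rolling summary.

    `summarize` is the call to the cheap utility model; it receives a
    prompt string and returns the updated summary text.
    """
    transcript = "\n".join(
        f"User: {u}\nAssistant: {a}" for u, a in evicted_pairs
    )
    prompt = (
        "Update this conversation summary with the new messages.\n"
        f"Current summary: {previous_summary or '(none)'}\n"
        f"New messages:\n{transcript}"
    )
    return summarize(prompt)
```

The returned summary is what gets stored on the Conversation record and injected back into context on the next request.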
The result: conversations can run indefinitely without the model losing the thread of what was discussed early on.
This is the kind of problem that seems solved by "just use a big context window" until you start thinking about cost, latency, and what actually matters in a conversation.
PR 12: CRUD Endpoints and a Database Lesson
PATCH to update a conversation title. DELETE to remove a conversation entirely.
The DELETE looked simple. Fetch the conversation, delete the messages, delete the conversation, commit. Four lines.
It crashed with a foreign key violation. PostgreSQL won't let you delete a conversation record if message records still reference it, even if you've already queued those messages for deletion in the same session. SQLAlchemy batches the deletes and executes them in the wrong order.
The fix is one line: db.flush() after deleting the messages. This forces the message deletions to hit the database before the conversation deletion is issued.
I knew foreign keys existed. I didn't know SQLAlchemy's unit of work could silently reorder operations in a way that breaks FK constraints. Now I do.
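The ordering lesson can be reproduced with nothing but stdlib sqlite3 (standing in for PostgreSQL here). Children first, then the parent — which is the order db.flush() guarantees in the SQLAlchemy version:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE conversation (id INTEGER PRIMARY KEY)")
con.execute("""CREATE TABLE message (
    id INTEGER PRIMARY KEY,
    conversation_id INTEGER REFERENCES conversation(id))""")
con.execute("INSERT INTO conversation VALUES (1)")
con.execute("INSERT INTO message VALUES (1, 1)")

# Deleting the parent while a message still references it violates the FK:
try:
    con.execute("DELETE FROM conversation WHERE id = 1")
except sqlite3.IntegrityError as exc:
    print(exc)  # FOREIGN KEY constraint failed

# Children first, then the parent -- this order succeeds:
con.execute("DELETE FROM message WHERE conversation_id = 1")
con.execute("DELETE FROM conversation WHERE id = 1")
con.commit()
```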
What the Code Looks Like Now
The API has:
- POST `/chat/` — new conversation
- POST `/chat/{conversation_id}` — continue an existing conversation
- GET `/chat/{conversation_id}` — full message history
- GET `/chat/conversations/{user_id}` — list a user's conversations
- PATCH `/chat/{conversation_id}/title` — rename a conversation
- DELETE `/chat/{conversation_id}` — delete a conversation and all its messages
Token-based context trimming. Rolling conversation summary. Exponential backoff retry for empty responses. Dual model setup — expensive model for chat, cheap model for background tasks like summarization.
The architecture is clean enough that adding streaming later won't require rewriting the controller (just changing how the service returns its response).
What I Learned That I Didn't Expect to Learn
OpenAI changes their API more than I expected. Parameters get deprecated between minor versions. A field that works on GPT-4o doesn't work on GPT-5 mini. If you don't read the changelog you get 400 errors with no obvious explanation.
Background tasks in FastAPI are useful but need their own database sessions. The main request session might be closed by the time the background task runs. Passing the engine and creating a new session inside the task is the right pattern.
TypeVar matters for decorator type safety. @handle_openai_errors originally had Any as the return type, which silently disabled type checking on every wrapped function. Fixing it to Callable[..., Awaitable[T]] took ten minutes and caught nothing immediately, but it makes the codebase defensible.
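A minimal sketch of that typed decorator shape (generic name, not the project's actual code). With a TypeVar, the wrapper preserves each wrapped coroutine's return type; annotating the decorator as returning Any would erase it for every caller:

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable, TypeVar

T = TypeVar("T")

def traced(func: Callable[..., Awaitable[T]]) -> Callable[..., Awaitable[T]]:
    """Wrap a coroutine function without losing its return type."""
    @functools.wraps(func)
    async def wrapper(*args: Any, **kwargs: Any) -> T:
        # The type checker still knows wrapper returns whatever func returns.
        return await func(*args, **kwargs)
    return wrapper

@traced
async def answer() -> int:
    return 42

print(asyncio.run(answer()))  # 42
```

Here `asyncio.run(answer())` is typed as `int` for the checker, instead of `Any`.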
PR descriptions are not commit messages. A commit says what changed. A PR description explains what you changed, why you changed it, and what wasn't obvious. I'm still getting this wrong, still getting feedback from the senior dev on it, still improving.
Where It Goes From Here
Auto-title generation: the model names the conversation from the first message instead of truncating it to 50 characters.
Streaming responses: instead of waiting for the full reply, tokens arrive as they're generated.
Prompt loader: load system prompts from a file, select them via config. This is where the behavioral prompt work I've been doing separately gets wired in.
After that: Docker, tests, then the RAG layer that connects this chatbot to the retail inventory data.
The retail API is the foundation. This is the layer that makes it conversational. The third layer will make it actually know something about the domain.
Transitioning from retail operations to AI engineering. Building a fashion-focused retail API with a chatbot layer on top.
Follow the journey: GitHub | LinkedIn