Chris Kechagias

Building a Chatbot API From Scratch — Part 2: Streaming, Prompt Engineering and Docker

This is actually Part 4 of building a retail inventory API and then giving it a brain.

In Part 3 I built the chatbot foundation: FastAPI, PostgreSQL, conversation memory, context trimming, rolling summarization, and 13 PRs worth of broken things. The API worked. It remembered what you said. It didn't fall over when the context got too long.

That was enough to call it functional. But it didn't feel finished. No streaming. No real identity. No way to run it anywhere except my machine.

Five PRs later, all of that changed. Some of it was clean. Some of it was not.


PR 14 — Auto-Title Generation

Small PR. Big quality-of-life improvement.

Every new conversation started with the title "New Chat..." and stayed that way forever. I wanted it to generate automatically from the first message, without blocking the response.

The approach: fire a background task after the conversation is created.

if not request.title:
    asyncio.create_task(
        update_conversation_title(engine, conversation.id, request.user_message)
    )

asyncio.create_task() schedules it and moves on. The 201 Created fires immediately. The title shows up a second or two later. Clean.
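The fire-and-forget shape is easy to demonstrate in isolation. A minimal sketch (the simulated task and the dict "conversation" are stand-ins, not the article's real models):

```python
import asyncio

results = []

async def update_conversation_title(conversation_id: int, user_message: str) -> None:
    """Stand-in for the real background task; the LLM call is simulated."""
    await asyncio.sleep(0.01)  # pretend this is the utility-model call
    results.append((conversation_id, f"Title for: {user_message[:20]}"))

async def create_conversation(user_message: str) -> dict:
    conversation = {"id": 1, "title": "New Chat..."}
    # schedule and move on: the 201 goes out before the title exists
    asyncio.create_task(update_conversation_title(conversation["id"], user_message))
    return conversation

async def main() -> list:
    conv = await create_conversation("How do I track inventory?")
    assert conv["title"] == "New Chat..."  # response returned immediately
    await asyncio.sleep(0.05)              # let the background task finish
    return results

print(asyncio.run(main()))
```

One caveat worth knowing: asyncio only keeps a weak reference to tasks, so in long-running code you should hold a reference to the task (or add it to a set) to keep it from being garbage-collected mid-flight.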

But before I got there, I spent an embarrassing amount of time debugging. The first version of generate_conversation_title was calling the main model (gpt-5-mini) and getting back empty responses. Latency was around 22 seconds. 22 seconds for a title.

The problem was max_completion_tokens. I had it set to 1000, which is too low for reasoning models (they need token budget to think before responding). But even after bumping it, the main model was overkill for something this simple.

The fix was a dual model setup. A utility model (gpt-5-nano) for cheap background tasks, and the main model only for actual chat. After the switch, latency dropped from 22 seconds to under 2. While testing the fix I noticed OpenAI had released gpt-5.4-mini and gpt-5.4-nano in March 2026, so I bumped both models while I was in there. 3x faster, same quality.

latency_ms: 4527  # gpt-5-mini
latency_ms: 1418  # gpt-5.4-mini

The background task lives in summarizer.py and uses that utility model:

async def update_conversation_title(engine, conversation_id, user_message: str):
    """Background task to generate and set a title for a newly created conversation."""
    try:
        with Session(engine) as session:
            conv = session.get(Conversation, conversation_id)
            if not conv:
                return
            title = await generate_conversation_title(user_message)
            if title:
                conv.title = title
                session.add(conv)
                session.commit()
                logger.info(f"Title updated for {conversation_id}: {title}")
    except Exception as e:
        logger.error(f"Title generation failed: {e}")

Two lessons that burned me:

Background tasks need their own DB session. You can't pass the request session in (it gets closed before the task runs). Always create a fresh Session(engine) inside the background function.

@handle_openai_errors cannot be used on background tasks. The decorator wraps exceptions into HTTP responses, which makes no sense in a fire-and-forget context. Plain try/except is the right pattern.


PR 15 — Streaming (SSE)

This one took the most time.

The goal was to replace the blocking endpoint (wait for the full response, return it) with a streaming one. Tokens arrive at the client as they're generated, using Server-Sent Events.

The Service Layer

The streaming function is an async generator. This is where the first real problem appeared.

I tried to decorate it with @handle_openai_errors like everything else:

@handle_openai_errors  # THIS BREAKS IT
async def get_chat_completion_stream(...):
    ...
    async for chunk in response:
        yield chunk

The decorator wraps the function with return await func(...). But func is an async generator — you can't await a generator. It returns a generator object, not a coroutine. The error:

object async_generator can't be used in 'await' expression

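You can reproduce the failure in isolation without any OpenAI code at all. The decorator's `return await func(...)` boils down to this:

```python
import asyncio

async def numbers():
    yield 1
    yield 2

async def broken():
    # calling an async generator function returns an async_generator
    # object, not a coroutine, so this await raises TypeError
    return await numbers()

try:
    asyncio.run(broken())
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```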
The fix: remove the decorator entirely and handle errors inline.

async def get_chat_completion_stream(
    messages: list[dict], model: str | None = None, max_retries: int = 2
):
    """Streams a chat completion response from the OpenAI API."""
    model = model or config.openai_model

    for attempt in range(max_retries + 1):
        try:
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
                stream=True,
                stream_options={"include_usage": True},
                max_completion_tokens=config.openai_max_completion_tokens,
            )
            async for chunk in response:
                yield chunk
            return

        except (openai.APIError, openai.APITimeoutError) as e:
            if attempt < max_retries:
                wait_time = (attempt + 1) * 2
                await asyncio.sleep(wait_time)
                continue
            raise OpenAIServiceException(
                message=f"OpenAI stream error: {str(e)}",
                status_code=getattr(e, "status_code", 502),
                error_code="OPENAI_STREAM_ERROR",
            )

Note the stream_options={"include_usage": True}. This tells OpenAI to include token usage in the final chunk, so you don't need tiktoken on the streaming path.

The final chunk has chunk.choices == [] and chunk.usage populated:

async for chunk in get_chat_completion_stream(trimmed_messages):
    if chunk.choices and chunk.choices[0].delta.content:
        delta = chunk.choices[0].delta.content
        full_content += delta
        yield f"data: {json.dumps({'content': delta})}\n\n"

    if chunk.usage:
        metadata = {
            "model": chunk.model,
            "tokens": chunk.usage.total_tokens,
        }

The Controller

The controller returns a StreamingResponse wrapping an async generator. DB writes happen after the stream completes (never during).

async def stream_generator():
    full_content = ""
    metadata = {}
    start_time = time.perf_counter()

    try:
        async for chunk in get_chat_completion_stream(trimmed_messages):
            if chunk.choices and chunk.choices[0].delta.content:
                delta = chunk.choices[0].delta.content
                full_content += delta
                yield f"data: {json.dumps({'content': delta})}\n\n"
            if chunk.usage:
                metadata = {"model": chunk.model, "tokens": chunk.usage.total_tokens}

        # Stream done — now persist
        latency_ms = (time.perf_counter() - start_time) * 1000
        new_msg = Message(
            conversation_id=conversation.id,
            user_message=request.user_message,
            ai_response=full_content,
            ai_model=metadata.get("model", config.openai_model),
            tokens_used=metadata.get("tokens", 0),
            latency_ms=latency_ms,
        )
        db.add(new_msg)
        db.commit()

        yield "data: [DONE]\n\n"

    except Exception as e:
        logger.error(f"Mid-stream failure: {e}")
        yield f"data: {json.dumps({'error': 'Stream interrupted'})}\n\n"

return StreamingResponse(stream_generator(), media_type="text/event-stream")

Use time.perf_counter() for latency, not time.time(). It's higher resolution and not affected by system clock changes.
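The pattern is just two reads of the monotonic clock around the work (here a sleep stands in for consuming the stream):

```python
import time

start = time.perf_counter()
time.sleep(0.05)  # stand-in for consuming the stream
latency_ms = (time.perf_counter() - start) * 1000

print(f"latency_ms: {latency_ms:.0f}")
```

With time.time(), an NTP adjustment mid-stream could make the measured latency jump or even go negative; perf_counter is guaranteed monotonic.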

To verify streaming is actually working (not just dumping everything at once):

curl -N -X POST http://localhost:8000/chat/ \
  -H "Content-Type: application/json" \
  -d '{"user_id": "your-uuid", "user_message": "Tell me a short story"}'

The -N flag disables buffering. If it works, you'll see chunks appear one by one in the terminal.
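On the client side, each event is a "data: ..." line followed by a blank line. A minimal parser for the frames this endpoint emits, fed a canned payload instead of a live connection so it's self-contained:

```python
import json

# canned payload in the frame format the endpoint above emits
raw = (
    'data: {"content": "Once"}\n\n'
    'data: {"content": " upon"}\n\n'
    'data: {"content": " a time"}\n\n'
    "data: [DONE]\n\n"
)

def parse_sse(payload: str) -> str:
    """Concatenate the content deltas out of a run of SSE frames."""
    text = ""
    for frame in payload.split("\n\n"):
        if not frame.startswith("data: "):
            continue
        data = frame[len("data: "):]
        if data == "[DONE]":
            break
        text += json.loads(data)["content"]
    return text

print(parse_sse(raw))  # Once upon a time
```

A real client would apply the same logic line-by-line over the HTTP response body as chunks arrive.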


PR 16 — Composable Prompt System

This is the one I'm most proud of.

The system prompt was hardcoded in config. One string. Not flexible, not maintainable, and impossible to tune without touching code. I wanted something I could compose and experiment with.

The Architecture

The idea: YAML as a control layer, markdown files as the actual content. Each prompt is assembled at request time by layering components in a fixed order:

base → core → styles → rules → intensity

The folder structure:

prompts/
├── prompts.yaml          # which components each prompt uses
├── base/                 # the foundation (default, concise, tutor...)
├── core/                 # identity and persona
├── styles/               # tone modifiers (casual, formal, sarcastic)
├── rules/                # behavioral constraints (communication, factuality...)
└── intensity/            # tone calibration (low, medium, high)

The YAML defines each named prompt:

stoic:
  base: base/default.md
  core: [identity, persona]
  rules: [communication, factuality]
  intensity: medium

summarizer:
  base: base/summarizer.md
  core: [identity]
  rules: [suppression, factuality]
  intensity: high

The PromptLoader

The loader reads everything at startup and caches it in memory. No file I/O on every request.

class PromptLoader:
    def __init__(self):
        self.base_path = Path(__file__).resolve().parent.parent.parent / "prompts"

        with open(self.base_path / "prompts.yaml", "r") as f:
            self.config = yaml.safe_load(f)

        self.cache = {}
        self._preload_files()

    def _preload_files(self):
        for folder in ["base", "core", "rules", "styles", "intensity"]:
            dir_path = self.base_path / folder
            if not dir_path.exists():
                continue
            for file in dir_path.glob("*.md"):
                key = f"{folder}/{file.stem}"
                self.cache[key] = file.read_text().strip()

    def build(self, name: str, intensity_override: str | None = None, **kwargs) -> str:
        cfg = self.config[name]
        parts = []

        base_name = Path(cfg["base"]).stem
        parts.append(self._get(f"base/{base_name}"))

        for section in ["core", "styles", "rules"]:
            for item in cfg.get(section, []):
                parts.append(self._get(f"{section}/{item}"))

        intensity = intensity_override or cfg.get("intensity", "medium")
        parts.append(self._get(f"intensity/{intensity}"))

        prompt = "\n\n".join(p for p in parts if p)
        return prompt.format(**kwargs)

The **kwargs handles variable injection. The summarizer prompt has {input} and {existing_summary} placeholders:

prompt = loader.build(
    "summarizer",
    input=new_content,
    existing_summary=conv.summary or "No previous summary.",
)

One thing to watch: any { or } in a markdown prompt file that isn't a variable placeholder will break prompt.format(**kwargs) with a KeyError. Avoid curly braces in prompt content unless they're intentional variables.
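str.format treats every brace pair as a placeholder, so literal braces in prompt content have to be doubled. A quick demonstration of both the escape and the failure mode:

```python
# doubled braces survive formatting as literal braces
template = 'Summarize: {input}\nRespond as {{"summary": "..."}}'
print(template.format(input="meeting notes"))

# an unescaped brace is treated as a placeholder and raises KeyError
try:
    "Use {braces} literally".format(input="x")
except KeyError as exc:
    print(f"KeyError: {exc}")
```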

The Persona

The identity prompt is the philosophical foundation of the whole thing. This is what I actually care about:

The assistant exists to sharpen thinking, not replace it.
It does not comfort. It does not flatter. It does not fill silence with noise.
When a question is asked, it answers. When a problem is presented, it cuts to what matters. When thinking is lazy or circular, it names that — once, directly, without judgment.
The goal is not to be helpful in the way a tool is helpful. The goal is to leave the person thinking more clearly than they arrived.

The chat endpoint now accepts a prompt_key field. Omit it and it defaults to "stoic". Every conversation persists the prompt key so it stays consistent across messages.

{
  "user_id": "...",
  "user_message": "What is RAG?",
  "prompt_key": "tutor"
}

The prompt library lives outside app/ as a standalone content layer. It's not application code — it's configuration. Same reasoning as why tests live outside app/.


PR 17 — Post-Feature Cleanup

After three feature PRs, the codebase had accumulated some debt:

  • chat_controller was still in the file (fully functional, no longer routed anywhere)
  • Three separate PromptLoader() instances across different files
  • Missing docstrings on PromptLoader, build(), stream_generator()
  • No success logging on the streaming path

The singleton fix: export one shared instance from prompt_loader.py and import it everywhere.

# prompt_loader.py — bottom of file
loader = PromptLoader()

# everywhere else
from .prompt_loader import loader

Same pattern as client = AsyncOpenAI(...) in the OpenAI service. One instance, loaded once at startup.

Cleanup PRs don't feel exciting but they're the difference between a codebase you're proud of and one you're embarrassed to show.


PR 18 — Containerization

Dockerfile is deliberately minimal:

FROM python:3.11-slim

WORKDIR /app

COPY pyproject.toml .
COPY uv.lock .

RUN pip install uv
RUN uv sync --frozen --no-dev

COPY . .

EXPOSE 8000

CMD [".venv/bin/uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

uv sync --frozen --no-dev is the important part — reproducible install, no dev dependencies in the image.

The docker-compose.yml spins up the API and a PostgreSQL 15 container, with a health check on the DB before the API starts:

depends_on:
  db:
    condition: service_healthy

Without condition: service_healthy, the API container starts before Postgres is ready and crashes. I learned this the hard way on the first run.
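For reference, a typical healthcheck block on the db service looks like this (the interval values and the pg_isready user are assumptions; match the user to your compose file's POSTGRES_USER):

```yaml
db:
  image: postgres:15
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U postgres"]
    interval: 5s
    timeout: 5s
    retries: 5
```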

The .dockerignore uses a whitelist approach (deny everything, explicitly allow what's needed):

*

!pyproject.toml
!uv.lock
!main.py
!app/
!prompts/

The !prompts/ line is easy to miss. The prompt library lives outside app/, so without it the container starts with an empty cache and every loader.build() call returns an empty string. No crash, no error, just silently wrong behavior.

All Docker commands are wired into taskipy:

task build    # docker compose up --build -d
task start    # docker compose up -d
task stop     # docker compose down
task logs     # docker compose logs -f

What's Next

Phase 2 is complete. The API streams responses, has a composable identity, runs in Docker, and manages context intelligently.

Phase 3 starts with a Telegram bot as a real frontend (no more Swagger UI demos). It'll live in a separate repo, store a per-user OpenAI key, and talk to this API over HTTP. After that: testing, then RAG.

The repo is public: GitHub

Follow the journey: LinkedIn | Medium


Built PR by PR. Mistakes included.
