DEV Community: David

AI Companion in Production by Month Three: 5 Architecture Decisions and Infra Tuning

David — Sun, 12 Jul 2026 04:22:43 +0000

AI Companion in Production by Month Three: 5 Architecture Decisions and Infra Tuning

Anyone who has tried to build an AI chat product using the most obvious stack — a chat-completions API, OpenAI-style memory, and a single Stable Diffusion endpoint — eventually hits the same walls.

The bot forgets the conversation after ten messages. Sometimes the server returns HTTP 200 as if everything is fine, but the response body contains an empty string: no error, no timeout, no exception. The model simply refuses to speak, and it does so silently. The same text prompt produces two different faces. And if you try to put a generated character into a specific dress from a catalog, it does not work at all.

For the last three months I have been running an AI companion in production. The same backend serves both a Telegram bot and a web app. The audience is hundreds of daily users, not hundreds of thousands. Free-to-paid conversion is in the single digits, which is normal for an early-stage product.

So this article will not contain “millions of MAU” numbers. Instead, it will contain token prices, cache hit effects, daily cost ceilings, production tuning, and the before/after of the infrastructure changes that actually moved our DAU ceiling.

This post combines four engineering build logs from the “Building HoneyChat” series into one article. I also added two sections that were not part of the original posts: unit economics in month three and the operational tuning that more than doubled the DAU ceiling without rewriting the architecture.

Memory: Redis + ChromaDB
LLM routing and prompt caching
Visual consistency: LoRA and IP-Adapter
Unit economics in month three
Production tuning in month three
What I would rebuild differently today
Where this runs in production

TL;DR

Memory: Redis for the hot message buffer plus ChromaDB for compressed summaries of conversation chunks. Three reads happen in parallel. Turning every single message into a vector is a direct path to millions of low-quality documents and noisy retrieval.
LLM routing: the user can choose the relationship pace in the UI: slow_burn, instant, plus the legacy default natural. Each pace and each plan can use a different model. There is also a fallback chain across different providers. The main trap: a model may return HTTP 200 with an empty response because a content filter fired. That is not an exception, not a timeout, just silence.
Prompt caching: on Gemini 3.1 Flash Lite, a single cache_control: ephemeral marker on top of the system prompt saves about 75% on the cached part of the request. In my case, this one marker covers roughly a quarter of the whole LLM budget.
Images: LoRA is a small adapter trained separately for each character. It teaches Stable Diffusion to recognize a specific face. On top of that, IP-Adapter, with moderate strength and early cutoff, can render a specific catalog item without destroying the character’s face.
Production tuning: LRU eviction in ChromaDB, uvicorn worker restarts by request count, 90 seconds for graceful shutdown, and a higher daily cost ceiling. Together, these moved the API memory ceiling from about 500 DAU to about 1,200 DAU, and the ChromaDB ceiling from about 800 DAU to more than 2,000 DAU. The architecture itself did not change.

Below is the detailed version: code, real numbers, and the things I would do differently if I started again today.

1. Memory: Redis + ChromaDB

Why a rolling summary is not enough

The standard beginner path is simple: put the last N messages into context and forget everything else. After 10–20 messages, the context falls out and the bot forgets the user’s name, earlier agreements, or the emotional thread of the scene.

The obvious fix is: “Let’s just increase the context window.”

That hits two problems:

The token cost grows quickly in long conversations.
Even with a long context, models still start losing details from the middle.

The next obvious solution is a rolling summary: every N messages, generate a compressed version.

It is cheap, but it loses nuance when summarized repeatedly. Run this manually:

Message 1: “She said she hated her boss because he takes credit for her work.”

Summary 1: “The user mentioned tension with a manager at work.”

Summary 2: “The user is stressed because of work.”

Summary 3: “The user has a job.”

By the fourth iteration, the reason is gone. The bot starts sounding like a broken record.

The fix is to split memory into layers:

recent messages are stored verbatim;
only truly old chunks are compressed;
semantic search can still retrieve any summary from the conversation history.

Architecture: two independent layers

Redis is the hot buffer.

It is keyed by (user_id, character_id, session_id), has bounded length, a short TTL, and is updated synchronously on every message. Think of it as short-term memory: the latest 20–30 messages.

ChromaDB is the vector store.

It stores compressed summaries of dialogue chunks, not individual messages. Writes are asynchronous and batched. Search works through embedding similarity.

The key idea: vectorize summaries, not every message.

Ten weeks of active chat becomes 30–50 documents per collection, not tens of thousands. The index stays compact. Search quality does not get polluted by short replies like “yeah” or “ok”, which produce weak vectors and create noisy matches.

A note on session_id.

In the web chat, I support “scenes”: a user can start a new conversation with the same character in a different setting, and memory should not leak from the previous scene. That is why Redis keys and ChromaDB collections include session_id when it exists.

The Telegram bot still runs in a compatibility mode without session separation. That layer exists for backwards compatibility.

A summary document in ChromaDB looks like this:

{
  "id": "summary:uid42:char_anna:sess_kn3a:turn_120",
  "document": "Anna and the user discussed his problems at work...",
  "metadata": {
    "type": "summary",
    "turn_range": "100-120",
    "ts": "2026-05-20T14:32:00Z",
    "lang": "en"
  }
}

The type: "summary" field did not exist in the first version. Initially, documents had almost no metadata. Later, when I added new document types such as event and fact, I had to write a backwards compatibility layer.

My advice: put type into metadata from day one, even if you currently have only one document type.

Writing to Redis: bounded list + TTL in one pipeline

async def save_message(user_id: int, char_id: str, role: str, content: str) -> None:
    r = get_redis()
    key = f"chat:{user_id}:{char_id}:messages"

    msg = json.dumps({
        "role": role,
        "content": content,
        "ts": datetime.now(timezone.utc).isoformat(),
    })

    pipe = r.pipeline()
    pipe.rpush(key, msg)
    pipe.ltrim(key, -HOT_BUFFER_SIZE, -1)
    pipe.expire(key, 86400 * HOT_BUFFER_TTL_DAYS)
    await pipe.execute()

Three things matter here.

First, ltrim runs on every write. The list keeps only the latest N messages. Memory usage per user is constant and does not grow with conversation length.

Second, the key TTL is refreshed on every write. Inactive users disappear automatically. Also, set Redis maxmemory-policy to allkeys-lru. The default noeviction policy refuses writes when memory is full, and that surprise usually happens at the worst possible moment.

Third, rpush + ltrim + expire are pipelined. That is one Redis round trip, not three.

Reading: three sources in parallel

async def build_prompt_context(user_id: int, char_id: str, user_query: str) -> dict:
    recent, summary, memories = await asyncio.gather(
        get_recent(user_id, char_id),
        get_latest_summary(user_id, char_id),
        get_relevant_memories(user_id, char_id, user_query),
    )

    return {
        "recent": recent,
        "summary": summary,
        "memories": memories,
    }

The fresh buffer lives in Redis with a short TTL.

If a summary cache is stale, the system reads from ChromaDB and writes the result back to Redis so the next request is hot again.

Production traps I hit

Race condition between two summarization tasks

Two user messages may arrive almost at the same time. Both launch summarization. Two overlapping documents get written to the collection.

In production, I keep a global dictionary:

_SUMMARIZE_TASKS: dict[str, asyncio.Task]

The key is:

f"{user_id}:{char_id}:{session_id}"

When a new task appears, the previous one is cancelled with task.cancel().

The user clears history while summarization is still running

The user presses “reset chat” while a background task is still working. The summary arrives into a collection that should already be gone.

The fix: check whether the Redis key still exists before writing. If the key disappeared, the task exits silently.

Empty summaries cached with long TTL

Sometimes the LLM returns an empty string because of a rate limit or provider issue. I cached that empty string for three days.

The fix is trivial:

if summary:
    await cache_summary(summary)

Missing collections for new users

A query to a non-existing ChromaDB collection throws an exception. This is normal for first messages from a new user. Wrap it in try/except and return an empty result.

2. LLM routing and prompt caching

Why one model for everything does not work

At first, I wanted to pick one good model and stop thinking about it.

After a couple of weeks in production, it became obvious why this does not work.

There are three reasons.

Free and paid plans pull the economics in opposite directions

A free user can send 20 messages per day.

If every message goes to a flagship model, the free user costs you more than they pay. And they pay exactly zero.

A top-tier paid user, on the other hand, expects quality. If they pay for the premium plan, they should get the premium model.

One model for everyone creates one of two bad outcomes:

free users are silently subsidized by paid users;
paid users get the same quality as free users and feel cheated.

The average solution satisfies nobody.

Models treat content differently

GPT and Claude-style models often refuse scenes that are completely normal and legal for an adult companion product.

Less regulated models are more permissive, but often worse at long-context coherence. They forget who said what ten messages ago.

Something is always a trade-off.

Users choose their own relationship pace

In the UI, the user chooses between two relationship modes:

slow_burn: “let’s get to know each other first, no instant 18+”
instant: “get to the interesting part without long setup”

There is also a legacy database value:

natural: the default for users who never opened the relationship pace setting.

These modes affect not only the story, but also expectations from the model.

An instant user usually sends shorter messages and expects an answer in about three seconds.

A slow_burn or natural user is more likely to write long descriptive scenes and tolerate a 10-second response.

A hardcoded single model loses in one of those cases: either it is too slow for short chat, or too dry for scenes.

There is also a separate response style dimension:

standard
cinematic: a long scene with action markers like ✦action✦
brief: one or two short sentences
slang: SMS-style
conversational: natural dialogue without action markers

Style and pace are independent. A user can pick any style on any pace.

The response style almost does not affect model routing, but it does affect the final prompt. brief, slang, and conversational remove cinematic markup and limit answer length at the prompt level.

So the routing scheme is:

plan × relationship pace → primary model

Plus a fallback chain through different providers.

I use OpenRouter as an LLM provider aggregator: one API, one key, many backend providers, and visibility into which backend is actually serving each model.

Current model map

Plan	Relationship pace	Model	Input/output price per 1M tokens	Cache
Free / Basic / Premium	`slow_burn`, `natural`	`qwen/qwen3-235b-a22b-2507`	$0.07 / $0.10	no
Free / Basic / Premium	`instant` + explicit request	`deepseek/deepseek-v4-flash`	$0.14 / $0.28	implicit, automatic
VIP / Elite	any pace	`google/gemini-3.1-flash-lite`	$0.25 / $1.50	explicit marker
Fallbacks	if refusal or empty response	`x-ai/grok-4.20`, then `minimax/minimax-m2-her`	usage-based	—

Qwen3-235B-A22B-2507 is the cheapest decent model I have tried.

It is a 235B parameter MoE model. MoE means “Mixture of Experts”: internally, the model has several specialized expert subnetworks, and only part of them are activated per request. This makes it faster and cheaper than a dense model of the same size.

It has a 131k token context window and costs $0.10 per million output tokens. For the free plan, that is enough.

Gemini 3.1 Flash Lite on paid plans gives a 1M token context window and better coherence for long scenes. But without caching it is much more expensive than Qwen, especially on output.

That is why the next section is about caching.

Prompt caching on Gemini 3.1: where 25% of the budget was hiding

The principle is embarrassingly obvious once you learn it.

Every request to the model starts with the same large system prompt: character description, behavior rules, tone instructions, safety rules. This part is identical across requests within the same dialogue.

The provider can cache it after the first call and charge less for the cached part on later requests. You pay mostly for the unique part: the user’s fresh message.

OpenRouter supports prompt caching, but the details vary a lot by model.

I spent about one and a half months paying full price for Gemini before reading the docs carefully enough and adding one marker to one line.

The effect: about 25% of the entire project LLM budget.

One of those commits where you do not know whether to feel proud or annoyed.

Empirical picture:

DeepSeek V4 Flash caches automatically. My test showed 1.5k–1.8k cached tokens per turn.
DeepSeek Chat v3.1 showed no visible cache. Skipped.
Qwen3-235B is not in the OpenRouter supported-cache list. Skipped.
Gemini 3.1 Flash Lite requires an explicit cache_control marker. In a test, 3,772 out of 3,779 prompt tokens were cached. Cached reads cost about 25% of normal input, so the saving is about 75% on the cached part.

The cache lifetime is short, about five minutes (ephemeral), but OpenRouter uses sticky routing: subsequent requests in the same dialogue go to the same backend provider. So the cache stays hot while the user is actively chatting.

The minimum block size is about 1,024 characters. The system prompt almost always passes that threshold.

Here is the production helper:

_EXPLICIT_CACHE_PREFIXES = ("google/gemini-",)

def _apply_prompt_caching(messages: list[dict], model: str) -> list[dict]:
    """
    Wrap the first system message in block format with cache_control: ephemeral
    for providers that require an explicit marker, such as Gemini.

    Other providers either cache implicitly, like DeepSeek V4,
    or silently ignore the marker.
    """
    if not model.startswith(_EXPLICIT_CACHE_PREFIXES):
        return messages

    out: list[dict] = []
    cached_one = False

    for m in messages:
        if not cached_one and m.get("role") == "system":
            content = m.get("content")

            if isinstance(content, str) and len(content) >= 1024:
                new_msg = dict(m)
                new_msg["content"] = [{
                    "type": "text",
                    "text": content,
                    "cache_control": {"type": "ephemeral"},
                }]
                out.append(new_msg)
                cached_one = True
                continue

        out.append(m)

    return out

Four details are easy to miss.

First, add the marker only for model families that require it. Other providers may ignore it, but some OpenAI SDK clients can reject the field during validation.

Second, place the marker only on the system prompt. The message history is different on every turn, so caching it is pointless. The system prompt is the largest stable part of the request.

Third, keep the 1,024-character threshold. Below that, OpenRouter does not cache anything.

Fourth, cache_control requires a different content format. A normal content: "string" will not work. It must become a block array:

[
  {
    "type": "text",
    "text": "...",
    "cache_control": {"type": "ephemeral"}
  }
]

This is easy to miss in the docs, but required by OpenRouter.

The HTTP 200 empty-response trap

Some reasoning models run content checks before returning the final answer.

On a borderline request, they do not return an HTTP error. They return HTTP 200 with a body like this:

{
  "choices": [{
    "finish_reason": "content_filter",
    "message": { "content": "" }
  }]
}

Empty string. No exception. No status code that your retry logic can catch.

If your retry logic only handles httpx.HTTPStatusError, the empty answer goes straight to the user.

The fix is one function that validates the choice before passing it downstream:

def _is_silent_refusal(choice: dict) -> bool:
    """
    Reasoning models can return HTTP 200 + finish_reason=content_filter
    + content="". If you only look at the HTTP status, the user gets an
    empty message.
    """
    reason = choice.get("finish_reason")
    content = choice.get("message", {}).get("content") or ""

    return reason in ("content_filter", "length") and not content.strip()

I also check content.strip() separately from finish_reason. Some models return finish_reason=stop with empty content when they refuse softly.

Fallback chain

async def complete(messages, *, primary=None, chain=None) -> CompletionResult:
    models = list(chain) if chain is not None else _build_chain(primary)

    async with httpx.AsyncClient() as client:
        for attempt, model in enumerate(models):
            try:
                data = await _call(client, model, messages)

            except httpx.HTTPStatusError as e:
                if e.response.status_code in TRANSIENT_CODES:
                    continue
                raise

            except (httpx.ReadTimeout, httpx.ConnectError):
                continue

            choice = (data.get("choices") or [{}])[0]

            if _is_silent_refusal(choice):
                continue

            content = choice.get("message", {}).get("content") or ""
            if not content.strip():
                continue

            return CompletionResult(
                content=content,
                model=model,
                attempt=attempt,
            )

    raise AllModelsFailedError(f"no model returned usable content; tried {models}")

The fallback rule is: use different providers.

If the primary model is hosted by provider A, the fallback should go through provider B. A fallback from the same provider often fails on the same content, because the moderation filter may sit at the API gateway layer before the model itself.

OpenRouter makes this visible in the model metadata.

What I log:

Which model actually answered: primary or fallback, and the fallback index.
Time to first token versus total response time.
Token cost split by plan and model.

If the primary model refuses 10% of a certain request class, that is not a retry problem. It is a routing problem. Move that class to another primary model.

3. Visual consistency: LoRA and IP-Adapter

This section covers two related tasks:

Keep the character’s face stable from generation to generation. That is LoRA.
Render a specific catalog item on top of that character without breaking the face. That is IP-Adapter layered on top of LoRA.

3.1. Why “same prompt = same face” does not work

It seems intuitive:

anime girl, long silver hair, green eyes, Arknights operator outfit
seed=12345

Should produce Anna. Always.

In practice, it does not.

Three reasons.

Batch size changes the output

In many Stable Diffusion configurations, one image with batch_size=1 and the first image from a batch_size=4 batch can differ even with the same seed. The random generator state depends on tensor dimensionality. This is not a bug; it is a sampler implementation detail.

External APIs shift samplers and defaults

If you call external services such as fal.ai, Replicate, or Together, the provider can update the model, change default parameters, or switch samplers. Your “fixed” character drifts over weeks.

One of my characters aged about five years in a month because a provider rolled back a minor model version without warning anyone.

Long prompts saturate

After a certain number of tags, adding more details stops helping. The model works with an approximate template of the character and interpolates inside it. In other words, it inserts an average of something similar that it has seen during training.

3.2. IP-Adapter alone is weak for faces

IP-Adapter is a technique that passes a reference image along with the text prompt. The model uses visual features from that image during generation.

It is great for rendering products.

It is weak for preserving only the face.

The problem is that IP-Adapter pulls everything from the reference image: lighting, pose, background, sometimes even clothes. If you lower the weight, the face weakens. If you increase it, the reference dominates the whole generation.

IP-Adapter works well when the reference is exactly what you want to preserve: a specific product or object. When you want to preserve only a face, it is the wrong tool.

3.3. LoRA per character: scale and cost

LoRA, or Low-Rank Adaptation, is a small set of additional weights layered over a base Stable Diffusion model.

A LoRA file is usually 100–300 MB versus 6–7 GB for the base SDXL model. So it is 20–30 times smaller than the base model.

When trained on 20–30 clean images of one character in different poses, angles, and lighting, it encodes the character’s face into neural weights, not into a text prompt. The model “learns” what that character looks like and can generate them consistently across prompts.

Real production numbers:

The catalog has 100+ characters. Each has a separate LoRA.
Training runs on rented GPU through Vast.ai, usually RTX 3090/4090 with 24 GB VRAM.
The GPU is rented monthly, so the marginal cost of training and inference is close to zero. We pay a fixed rental fee; the number of LoRAs trained during the month is mostly a capacity question.
We start paying per image only if the GPU goes offline and we fall back to paid providers such as OpenRouter Flux/Riverflow at about $0.03–$0.07 per image.
One SDXL LoRA takes 15–25 minutes of GPU time depending on dataset size and number of steps.
Checkpoints live in Storj S3. One LoRA safetensors file is 100–300 MB depending on rank, usually 16–32 in our setup.

Inference runs on a separate GPU with the base SDXL model preloaded and hot-swapped LoRAs per request. The base model is loaded into VRAM once. Switching LoRA for a specific character takes tens of milliseconds.

The pipeline:

workflow = [
    "Checkpoint",           # base SDXL
    f"LoRA: {char.lora}",   # character-specific LoRA
    "FreeU",                # noise rebalance, quality boost without much compute
    "KSampler",
]

For projects with one or two characters, a LoRA pipeline may be overkill. For 50+ characters, it is the only reasonable architecture.

3.4. What matters during training

A skeleton config for a Kohya_ss trainer:

[model_arguments]
pretrained_model_name_or_path = "<path/to/sdxl-base.safetensors>"

[dataset_arguments]
train_data_dir = "./dataset/train"
resolution = "1024,1024"
caption_extension = ".txt"

[training_arguments]
output_dir = "./output"
output_name = "<your_character_v1>"
learning_rate = "<tune>"
max_train_steps = "<tune>"
train_batch_size = "<tune>"

[network_arguments]
network_module = "networks.lora"
network_dim = "<tune>"
network_alpha = "<tune>"

The parameters that actually matter — learning rate, steps, rank, alpha, dataset size — depend on what you are training.

Anime faces converge differently from realistic faces. There is no universal best setting. Run three or four variants and compare grids.

Rules that worked for me:

Dataset quality matters more than size. Twenty clean diverse images beat one hundred noisy ones.
Use different poses and lighting for the same face. Thirty copies of the same angle teach the model that angle, not the character.
Captions should describe the scene, not the character. “girl in a garden” is better than “Anna in a garden”. You want the model to learn the face from visual context, not bind it to the word “Anna”.
Tune rank separately. If rank is too low, the model underfits and the face is vague. If rank is too high, the model overfits and the face becomes stiff and resists pose or emotion changes.

3.5. IP-Adapter on top of LoRA for catalog items

Now we have a stable Anna LoRA.

The user buys a specific dress.

We need both:

Anna’s face does not drift;
this exact dress is recognizable.

Prompt engineering does not solve this. A prompt like:

Anna wearing a red silk dress with a white collar

will produce some red silk dress, not the exact SKU.

SKU-level accuracy requires a visual reference.

The conflict:

LoRA fixes the face.
IP-Adapter pulls the reference image.
If IP-Adapter strength is too high, Anna starts looking like the reference.
If it is too low, the dress becomes vaguely similar, not exact.

The solution is controlled by two parameters.

`weight`

How strongly IP-Adapter affects generation. Value range: 0 to 1.

Below the middle range, the reference becomes more like a mood. Above the middle range, it dominates everything.

The lower half, about 0.2–0.5, usually gives the best face/clothing balance.

`end_at`

At which share of generation steps IP-Adapter turns off.

Stable Diffusion does not draw an image in one step. It gradually removes noise during 20–50 denoising steps.

If IP-Adapter runs through all steps, it also affects final face details. If it stops at 70–90% of the way, the last steps run only under the LoRA-modified model, and the face gets pulled back toward the character.

Roughly speaking:

The product gets its shape in the middle of generation, and the face is polished at the end.

Node order in ComfyUI:

[Checkpoint Loader]
  -> [LoRA Loader: character_lora]
  -> [FreeU: quality touch-up]
  -> [IPAdapter Advanced: reference, weight=W, end_at=E]
  -> [KSampler]
  -> [VAE Decode]

LoRA goes before IP-Adapter. LoRA modifies the base model weights. IP-Adapter modifies intermediate cross-attention layers during generation. When IP-Adapter stops at end_at, the remaining steps run on the LoRA-modified model without IP-Adapter influence.

That is what lets the face return to normal.

How to tune `weight` and `end_at` in practice

Use a reference with a clean background and a character with an already stable LoRA.
Start with weight=0.4, end_at=0.8. In my production pipeline, this usually gives a stable balance of face and clothing.
If the face drifts, lower weight or end_at by 0.05.
If the product is not close enough to the reference, raise weight by 0.05.
Do not jump by 0.1 too early. The working range is narrower than it looks.

When you switch the base model, both numbers will probably move. Do not hardcode them forever.

3.6. How this is assembled in the product

Catalog references: each visual item stores a link to its reference image in S3.
Previews are generated in advance: when the user opens the shop, they see each item rendered on the active character. These previews are not generated on page load. They are generated by a background Celery task and served from S3 cache.
The same weight and end_at values go into the video start frame: tune them on static images first, then carefully pass them into the video pipeline.
Not every product is visual: some catalog items are stat bonuses, dialogue unlocks, or relationship flags. They do not have images. The catalog has an explicit visual: true|false flag, and the API rejects non-visual items before they enter the GPU queue.

3.7. Visual stack traps

Face drifted on shop previews

I set IP-Adapter weight too high because I wanted the clothes to match better. After user complaints, I moved it back to the lower half of the range.

The lesson is boring but true: tune one variable at a time, even when it feels slow.

Presigned reference URLs expired during the task

The catalog in S3 was served through short-lived presigned URLs. The background task picked up the URL in the queue, but ComfyUI downloaded it later. By then, the URL was dead.

The fix: the task downloads the image itself and sends a local filename to ComfyUI.

IP-Adapter version mismatch with SDXL

IP-Adapter Plus ships in several files tied to specific SDXL versions. A mismatch may not crash. It can simply produce a weaker result.

Pin the IP-Adapter version in deployment config together with the base model. Treat them as a pair.

A non-visual item crashed the image pipeline

The API tried to run a product with no image through the image pipeline. The fix is the visual flag and a boundary check before the task reaches the queue.

4. Unit economics in month three

Context

The project has been live for three months. The audience is hundreds of daily users, not hundreds of thousands. Conversion to paid plans is in the single digits, which is typical for an early-stage product.

The implication: free users must be cheap to serve. Otherwise, they burn the budget before you even understand whether you have found paying users.

Every percent saved on the cost of a free user is extra runway.

The real daily ceilings live in one config file:

daily_cost_alert_usd: float = 30.0      # sends a Telegram alert to admin, does not block
daily_cost_hard_stop_usd: float = 50.0  # is_generation_allowed() returns False

Without intervention, the monthly budget is roughly $900–$1,500. On a normal day, we spend much less: tens of dollars. Peak scenarios, such as massive image or video generation, can approach the alert level.

Counters and the “block on measurement failure” rule

Counters live in Redis with a seven-day TTL:

costs:daily:{YYYY-MM-DD}
costs:user:{uid}:{YYYY-MM-DD}

Writes use atomic INCRBYFLOAT. One Redis command increments a floating-point number. No races under parallel calls.

A function checks whether a new generation is allowed by reading the counter and comparing it with the ceiling.

The key detail:

If Redis is unavailable, the function returns “not allowed”.

This is a fail-closed approach.

It sounds paranoid until you wake up and find that some runaway generation loop burned the daily budget overnight because the counter was silent.

The rule is now strict: if we cannot measure, we block.

The cost of the false negative is low: the user did not get one response.

The cost of the false positive is high: an uncontrolled loop can burn money at $1.50 per million output tokens.

Where the money goes

Cost item	Share	Notes
Gemini 3.1 FL for VIP / Elite	40–50%	Expensive output at $1.50/1M tokens, partially offset by caching
Qwen3-235B for Free / Basic / Premium	15–20%	Cheap, but high request volume
Fallback LLMs: Grok / MiniMax	<5%	Trigger only when primary model refuses or fails
Voice: Inworld TTS-1.5 Max	15–20%	$10 per million characters, roughly $0.005–$0.01 per voice message
Images: Vast.ai GPU + fallback	15–25%	Fixed monthly GPU fee, fallback around $0.03–$0.07 per image
Video: WaveSpeed + fal Pixverse	spikes	$0.16–$0.25 per clip, available from Premium plans

What prompt caching saves

A rough estimate for one VIP reply.

Without cache:

system prompt ~3,800 tokens × $0.25/1M = $0.00095 input
output ~300 tokens × $1.50/1M = $0.00045
total ≈ $0.0014 per reply

With cache hit on the system prompt after the first message in a session:

system prompt ~3,800 tokens × $0.0625/1M = $0.000238
output ~300 tokens × $1.50/1M = $0.00045
total ≈ $0.00069 per reply

That is about a 50% saving on replies after the first one.

Almost every active session is mostly “reply 2+”, so this has a real budget impact. In my mix, caching covers about a quarter of the total LLM budget.

DeepSeek V4 adds another 10–15% saving on instant pace replies through implicit caching. Qwen3 does not participate in OpenRouter prompt caching.

5. Production tuning in month three

Architecture alone does not move the DAU ceiling if the operational layer is falling apart.

You also need to tune memory limits, timeouts, worker restarts, and daily cost ceilings. This part rarely appears in “how to build a chatbot” tutorials, but it directly determines whether you hit the wall at 500 daily users or 1,500.

Below are four concrete changes I made in the last few weeks, with before/after numbers.

5.1. ChromaDB: LRU eviction and higher memory limit

This problem appeared gradually.

First, the ChromaDB container started eating 2 GB. I raised it to 4 GB. Two weeks later, it hit OOMKill again, so I raised it to 6 GB. After another week, it looked like I was simply paying for the database’s growing appetite.

The culprit matched open ChromaDB issues #3336 and #5843 in the 0.5.x branch.

The internal segment cache keeps loaded segments forever and does not evict them automatically, even when a collection has not been used for a long time.

I had one collection per (user × character × session) pair: 2,233 collections. Each gradually pulled its index into memory and never let it go.

Container memory grew steadily: about 250 MB per week.

The fix was in the docs: enable LRU eviction.

LRU means Least Recently Used. Old unused collections are evicted first; active ones stay in memory.

Config:

chromadb:
  environment:
    CHROMA_SEGMENT_CACHE_POLICY: LRU
    CHROMA_MEMORY_LIMIT_BYTES: "8589934592"  # 8 GB, LRU triggers above this
  deploy:
    resources:
      limits:
        memory: 10G

After that, the active in-memory dataset collapsed to 50–200 MB: mostly sessions from the last five minutes. Cold collections are evicted automatically. Linear memory growth stopped.

Without LRU, memory death would arrive in six to eight weeks. Bonus risk: if ChromaDB crashes during a write, embeddings from the latest session can be silently lost.

I added another 2 GB over the LRU limit as a buffer for backups. A 3.3 GB tar snapshot competed for IO with ChromaDB, and Sentry showed intermittent backup failures I could not reproduce manually. After raising the limit, the problem disappeared.

The architectural limitation remains:

2,233 collections are a consequence of my own memory architecture.

At 5,000 DAU, this can become tens of thousands of collections. LRU will start evicting too aggressively, and latency will rise due to “evict-load-evict-load” behavior.

At that point, I will need to migrate to one shared collection with session_id filtering in metadata. That is a couple of weeks of refactoring. I postponed it until it becomes necessary.

5.2. Restarting uvicorn workers by request count

Before:

A FastAPI/uvicorn worker leaked memory:

480 MB at startup -> ~800 MB after 8 hours

With four workers and a 1.5 GB per-worker budget, that becomes about 6 GB, which is the container limit. Estimated memory death: around 500 DAU.

I added:

api:
  command: >
    uvicorn api.main:app
    --workers 4
    --limit-max-requests 5000
    --timeout-graceful-shutdown 90

--limit-max-requests 5000 restarts a worker after it handles 5,000 requests. The leak does not have time to accumulate.

After:

At 300 DAU and about 30k requests per day, each worker restarts one or two times per day. Memory returns to about 480 MB. Four workers use around 1.9 GB, leaving a 3x safety margin inside the 6 GB limit.

The API memory ceiling moved from about 500 DAU to about 1,200 DAU.

5.3. Graceful shutdown: 90 seconds for LLM requests to finish

The previous fix created a new problem.

Workers now restart regularly. If a worker restarts while an LLM request is in progress, it can interrupt a 10–60 second request. The user sees “Network error”, HTTP 502, or an empty answer.

On paper:

6 restarts per day × ~3 in-flight requests = ~18 visible errors/day

And that was caused by my own optimization.

The fix is to give the worker 90 seconds to finish active requests.

Important: this must be set in two layers:

api:
  command: uvicorn ... --timeout-graceful-shutdown 90
  stop_grace_period: 90s

The p99 of an LLM request is around 60 seconds, so 90 seconds gives a 50% buffer.

After that, user-visible restart errors dropped to zero.

A common mistake: setting only the uvicorn option is not enough. Docker still kills the container after its own default timeout, around 10 seconds. You need both settings.

5.4. Daily cost ceiling: $30 → $50

One day in May, daily spend jumped to $21.70. That was too close to the old $30 hard stop.

If the same spend happened in the first half of the day, we would hit the hard stop in the evening and users would get “generation unavailable” instead of replies.

I raised the hard stop to $50 and kept the alert at $30.

The alert at 60% of the ceiling gives five to six hours to notice and react before blocking.

At the current user profile, about $0.06 per DAU per day, the $50 hard stop should not fire until roughly 800 DAU. After that, we need either better paid conversion, upsells, or a higher ceiling with updated economics.

Summary: where the ceilings moved

Bottleneck	Before	After	Change
API memory / OOM risk	~500 DAU	~1,200 DAU	+140%
ChromaDB memory	~800 DAU	~2,000+ DAU	+150%
Daily hard cost stop	~480 DAU	~830 DAU	+73%
502 on worker restart	~18/day	0	fixed
IO competition during backup	intermittent failures	fixed	fixed

Cost of changes:

+4 GB host memory for ChromaDB.
No meaningful CPU overhead from LRU.
About 0.5 seconds of downtime during container recreation.
Low risk: all settings are from official ChromaDB and uvicorn docs, no hacks.

Result:

This tuning bought roughly 6–12 months without touching the ChromaDB architecture or hardware.

The next bottlenecks are Vast.ai GPU as a single point of failure and the single 32 GB host that runs everything. If the machine dies, the product dies. After 1,500 DAU, this becomes critical.

6. What I would rebuild differently today

If I could rewind three months and rebuild this with everything I know now, here is what I would change.

Memory

I would not use pgvector for this exact workload. On short queries over summaries, retrieval quality was worse than ChromaDB. For other workloads, pgvector may win.
I would not vectorize every message. The index grows, but search quality does not.
I would summarize fixed windows by number of messages, not by time. A daily summary is useless for a user who sends 500 messages in one day.
I would add background-task cancellation and metadata.type in ChromaDB documents from day one.

LLM routing

I would route by relationship pace from the first day.
I would also allow response style to override the model when needed.
I would add cache_control on Gemini immediately. I lost about one and a half months of unnecessary spend.
I would create a separate metric for silent model refusals: HTTP 200 with an empty body. It is rare, but without a metric you will not see it.
I would not use the same OpenRouter key in dev and prod. The rate limit is shared, and development noise eats production quota.

Images

I would launch the image pipeline with LoRA from day one, even with only three characters. Inconsistent images on the free tier kill the first impression before the user reaches the strong parts of the product.
I would build datasets manually instead of scraping. Five iterations of 20 hand-picked images beat a noisy scrape of 200.
I would version LoRAs: char_v1, char_v2. They should live in parallel, so a regression can be rolled back for one character without rolling back the whole pipeline.
I would store IP-Adapter settings such as weight and end_at as deployment parameters, not code constants. When the base model changes, those values move.

Unit economics and production tuning

I would add cache_control on day one. It is a one-function helper, but I lived without it for a month and a half.
I would unify cost counters across all generation types. Right now, a mapping dictionary has to be updated every time a new generation type appears.
I would add a daily per-user cost ceiling alongside the global hard stop. Currently, one very active user could theoretically consume a large share of the daily budget before rate limiting catches up.
I would build a dashboard for cache hit rate. OpenRouter returns prompt_tokens_details.cached_tokens in each response, but without aggregation you will not notice when caching breaks because a prompt format changed.

Still open bottlenecks

Vast.ai GPU is a single point of failure. The solution is a second hot-standby GPU. It starts paying for itself around 1,500+ DAU.
One 32 GB host runs the whole product. If the machine dies, the product dies. At our current scale, this is acceptable. Later, it is not.
2,233 ChromaDB collections are an architectural limitation. LRU hides the issue but does not solve it.

Where this runs in production

Everything above runs in one backend behind HoneyChat: an AI companion product available both as a Telegram bot and as a web app.

The same chat, memory, characters, and LoRA pipeline are available from both Telegram and the browser, with history synchronized between them.

If you want to try the architecture described in this article:

Telegram: @HoneyChatAIBot, run /start. The free tier gives 20 messages per day without registration.
Web: honeychat.bot. Same backend, full chat interface, images, and voice.
Code examples: public tutorial folders with runnable examples for each engineering part are available at github.com/sm1ck/honeychat/tree/main/tutorial. They can be cloned and started with Docker Compose.

If you are building something similar and hit one of the same walls, I would be interested in hearing about it, especially around:

race conditions caused by user actions such as clear history or switch character;
tuning weight and end_at on newer SDXL forks;
memory architectures for long-lived AI companions.

There is surprisingly little public material about this outside of anime-generation communities.

Sources and related docs

Memory:

ChromaDB docs
ChromaDB issue #3336: segment cache growth
ChromaDB issue #5843: LRU eviction
Redis LTRIM

LLM:

OpenRouter model list
OpenRouter prompt caching docs
Chat Completions finish_reason semantics

Visual stack:

LoRA paper, Hu et al., 2021
Kohya_ss SDXL training
IP-Adapter by Tencent AI Lab
ComfyUI IPAdapter Plus
SDXL base model

Infrastructure:

uvicorn deployment docs
Docker Compose stop_grace_period

AI Chatbot Memory Architecture in 2026 — RAG, Long Context, and Hybrid Approaches Compared

David — Mon, 08 Jun 2026 06:53:33 +0000

Building a chatbot that "remembers" conversations is one of the most misunderstood problems in production AI systems.

Marketing copy at every consumer chat product claims "extended memory" or "persistent memory," but the underlying architecture varies wildly. The implementation choice determines whether your bot genuinely recalls last week's conversation or just has a slightly larger context window.
This is a technical breakdown of the three memory architectures used in production AI chatbots as of 2026, with tradeoffs, when to use each, and what consumer apps actually implement under the hood.

The four memory approaches you'll see in production

The "AI memory" landscape splits into four approaches, each with different infrastructure cost, latency, and recall fidelity:

Pure context window — feed the model the last N tokens of conversation, nothing more. This is what most "no memory" products do, often dressed up as "extended memory."
Vector-based RAG — store conversation chunks in a vector database, retrieve semantically relevant chunks at query time, insert them into the prompt.
Structured fact extraction — parse conversations into discrete facts (name, preferences, events), store as structured data, inject at query time.
Hybrid — combine vector RAG for "fuzzy" recall, structured facts for "hard" details, and recent context for continuity. Most consumer chat products use approach #1 (pure context window) and call it memory. Approach #4 is what you actually want for real cross-session recall but requires the most infrastructure.

Pure context window — the cheap default

This is what Character.AI's "extended memory" feature actually is. The model sees:

_> [system prompt with character definition]

[last N messages from current session]
[optional: up to 15 pinned messages]
[user's new message]_
That's it. There's no database of past conversations. When you start a new session, the model has zero context from previous sessions. The "memory" is purely the in-session conversation history.
Pros:
• Trivial implementation (just send recent messages to the model)
• Zero infrastructure beyond your LLM API
• No retrieval latency
Cons:
• No actual cross-session memory
• Hard cap on conversation length (model context window)
• Older messages from current session get truncated as window fills
Consumer products using this: Character.AI (all tiers), Chai (all tiers), most ChatGPT wrapper apps, Telegram bots without backend storage.
When to use it: MVP prototypes, single-session use cases, or products where forgetting is feature (e.g., privacy-focused ephemeral chat).

Vector-based RAG — the standard "real memory" approach

Vector RAG is the most common approach for products that genuinely persist memory across sessions. Implementation pattern:

_> # Storage path: every user message + bot response is chunked and embedded

async def store_turn(user_id, role, text):
chunks = chunk_text(text, max_tokens=200)
for chunk in chunks:
embedding = await embed(chunk)
vector_db.upsert(
id=f"{user_id}{role}{timestamp}",
vector=embedding,
metadata={"user_id": user_id, "role": role, "text": chunk, "ts": now()}
)_

_> # Retrieval path: query vector DB for relevant context, inject into prompt

async def build_prompt(user_id, query):
query_vec = await embed(query)
relevant = vector_db.query(query_vec, top_k=10, filter={"user_id": user_id})
context = "\n".join([r.metadata["text"] for r in relevant])
return f"Relevant past conversations:\n{context}\n\nCurrent query: {query}"_

The vector database choice matters significantly:

• Pinecone — managed, easy to start, gets expensive at scale (~$70/mo per pod minimum). Good for teams that don't want infrastructure overhead.
• Weaviate — open source, self-host or managed. Solid choice for production with custom requirements.
• ChromaDB — embedded or server mode. Great for prototyping and single-server deployments. Less suitable for horizontal scaling.
• Qdrant — Rust-based, excellent performance, good for high-throughput. Active development.
• pgvector — Postgres extension. If you already have Postgres and don't need massive scale, this is often the simplest path.

Pros:
• Semantically relevant recall — bot finds "what's similar to what we're discussing now"
• Scales to millions of conversations per user
• Works across sessions, weeks, months

Cons:
• Retrieval latency (typically 50-200ms before LLM call)
• Vector DB cost grows linearly with data
• Quality depends heavily on embedding model and chunk strategy
• Cold-start: requires N+ conversations before recall feels "real"

Consumer products using this: HoneyChat (ChromaDB), several "AI friend" apps built in 2024-2025.
When to use it: Cross-session memory is core to product value. Users expect bot to remember names, preferences, and relationship history.

Structured fact extraction — for "hard" memory

Vector RAG is great for fuzzy recall ("we talked about your trip to Japan") but bad at structured facts ("user's name is Alex, prefers tea, has a cat named Mochi"). For these, an additional layer parses conversations into structured data.
Implementation pattern:

_> async def extract_facts(user_id, turn_text):

# Use a smaller, fast model for extraction
response = await llm.complete(
    model="claude-haiku-or-similar",
    prompt=f"Extract facts about the user from this message as JSON: {turn_text}",
    schema={"facts": [{"category": "string", "value": "string", "confidence": "float"}]}
)
for fact in response["facts"]:
    if fact["confidence"] > 0.7:
        facts_db.upsert(user_id, fact["category"], fact["value"])
async def build_prompt(user_id, query):
facts = facts_db.list(user_id) # all known facts
facts_str = "\n".join([f"{f.category}: {f.value}" for f in facts])
vector_context = await vector_db.query(...) # RAG for fuzzy recall
return f"What we know:\n{facts_str}\n\nRelevant past:\n{vector_context}\n\nQuery: {query}"_

Pros:
• Bot reliably knows hard facts (name, age, preferences) — no embedding similarity gymnastics
• Cheap to query at runtime (key-value lookup)
• Can be edited/corrected by user explicitly

Cons:
• Extraction step adds cost and latency (typically 100-300ms per turn)
• Extraction quality depends on extraction model
• Schema design is important — too rigid loses nuance, too loose duplicates facts

Consumer products using this: Nomi AI (structured facts is core to their architecture), HoneyChat (in addition to vector RAG), some enterprise customer service bots.
When to use it: Hard facts matter. User explicitly says "remember that I prefer tea" and expects this to persist. Common in companion apps and personal assistants.

Hybrid: the production-grade pattern

Real production systems combine all three approaches:

_> Memory layers (highest fidelity to lowest):

Structured facts (key-value, "user_name=Alex, prefers=tea")

Recent conversation buffer (last N=20-50 messages, in-memory or Redis)

Vector RAG (semantic search over all conversation history)

Optional: episodic summaries (LLM-generated summaries of past sessions) At query time: async def build_context(user_id, query): facts = await facts_db.get_all(user_id) # 1ms lookup recent = await redis.get_recent(user_id, n=20) # 5ms lookup relevant = await vector_db.query(query, user_id, top_k=5) # 50-100ms return f""" Facts about user: {facts} Recent conversation: {recent} Relevant past context: {relevant} Current query: {query} """_

This hybrid is what serious production AI companion products use. It's expensive in infrastructure (Redis + vector DB + facts DB + extraction model) but delivers the experience users describe as "the bot really knows me."
Latency budget for hybrid approach typically lands around 200-400ms before the main LLM call. With a streaming response from a fast model like Claude Haiku, total time-to-first-token stays under 1 second — acceptable for chat UX.

Memory architecture decisions in the wild

Based on observation of leading platforms in 2026:
• Character.AI: pure context window. No cross-session memory architecture. Pinned messages (up to 15) are the only persistence layer. Premium tier extends context window size but doesn't add memory layers.
• Chai: pure context window with very short active dialog memory (2-3 messages in active context per community reports). Claims a "Persisted Memory" feature on PRO that appears to be a limited structured-facts layer storing basic profile data between sessions but not extending active context.
• Replika: hybrid — structured facts (the "Diary" feature is essentially curated structured memory) plus vector RAG plus recent buffer. By far the strongest memory architecture in the consumer category, which is why it remains relevant despite the 2023 ERP debacle.
• Nomi AI: structured-facts heavy with vector RAG augmentation. Their "structured facts" branding accurately describes their architecture.
• HoneyChat: full hybrid — ChromaDB vector RAG + structured facts per character session + Redis recent buffer + optional episodic summaries for long histories.
• JanitorAI: depends entirely on which OpenRouter model you choose. The platform itself has minimal memory layer — most "memory" is in the system prompt the user maintains manually.

When pure context window is enough

Not every product needs hybrid memory. Use the simplest architecture that works:
• Single-session productivity tools (writing assistant, code helper): pure context window
• Short-form Q&A bots (FAQ, customer service triage): pure context window
• Companion or relationship-focused apps: hybrid required for credibility
• Long-form roleplay platforms: at least vector RAG, hybrid for premium tier
• Enterprise knowledge management: vector RAG over knowledge base, not user history
The memory architecture should match user expectations. Promising "extended memory" with only a larger context window is a marketing claim that doesn't survive contact with users who actually test cross-session recall.

The cost reality

Memory architectures cost real money:

Approach Storage cost Per-query cost Infrastructure complexity

Pure context window $0 $0 extra Trivial

Vector RAG $0.05-0.30 per user/month (depending on DB choice) +50-200ms latency, +embedding cost Moderate

Structured facts <$0.01 per user/month +extraction LLM cost (~$0.001 per turn) Moderate

Hybrid Sum of above Sum of above High

Approach	Storage cost	Per-query cost	Infrastructure complexity
Pure context window	$0	$0 extra	Trivial
Vector RAG	$0.05-0.30 per user/month (depending on DB choice)	+50-200ms latency, +embedding cost	Moderate
Structured facts	<$0.01 per user/month	+extraction LLM cost (~$0.001 per turn)	Moderate
Hybrid	Sum of above	Sum of above	High

For a 100K MAU consumer app, hybrid memory infrastructure runs $5-15K/month in storage + compute. This is real budget that has to come out of subscription revenue.
The 2023-2026 consumer apps that promise "real memory" at $5-10/month subscription pricing are either:

Subsidizing memory infrastructure with VC funding (most common)
Quietly degrading memory architecture as user base scales (Replika did this 2022-23)
Marketing context-window expansion as "memory" (Character.AI, Chai) There are exceptions — products with genuinely engineered persistent memory at sustainable unit economics. They tend to be either narrow vertical apps (Nomi text-only) or built on cost-efficient infrastructure (HoneyChat's ChromaDB self-hosted approach).

Recommendations for builders

If you're shipping an AI chat product in 2026:

Be honest about what your memory does. If it's a context window, don't call it "extended memory." Users will test it and figure out the truth within a week.
Pick architecture based on use case, not aspiration. Pure context window is fine for productivity tools. Hybrid is required for companion apps if you want to compete on retention.
Budget for memory infrastructure. It's not optional if "memory" is a marketed feature.
Test cross-session recall with real users. Internal QA usually tests within a single session. Real users notice broken cross-session memory within days.
Plan for graceful degradation as scale grows. Memory architecture that works at 1K users may not work at 100K. Build with horizontal scaling in mind from day one. The best AI chat products in 2026 win on memory architecture as much as model quality. Users tolerate slightly weaker LLM responses if the bot genuinely remembers them. They abandon stronger LLMs that feel anonymous.

Why Context Window Is Not Enough for AI Character Memory

David — Sun, 31 May 2026 08:01:04 +0000

When I started building AI characters, I thought memory was mostly a context-length problem.

If the model could see more previous messages, the character would remember more.
If the context window was larger, the conversation would feel more continuous.
If we could fit enough history into the prompt, the problem would be solved.

That assumption was wrong.

A larger context window helps, but it does not create real memory.

For AI character products, users do not only want the model to see more tokens. They want the character to feel like the same character tomorrow.

They want continuity.

They want the character to remember the tone of the relationship, the current roleplay world, the user’s preferences, the previous emotional state, and the small details that make the conversation feel personal.

That is not the same as dumping chat history into a prompt.

A context window gives the model temporary visibility.

Memory gives the product persistent relevance.

The quick version

A context window helps an AI character stay coherent inside the current conversation.

Long-term memory helps the character preserve useful information across sessions.

A practical memory system for AI characters usually needs several layers:

session context;
user profile memory;
character state;
relationship state;
semantic retrieval;
summary memory;
safety and privacy filters.

The hard part is not storing everything.

The hard part is deciding what should be remembered, retrieved, updated, ignored, or forgotten.

Context window vs memory

A context window is the amount of information the model can see at generation time.

Memory is a product-level system that decides which information should survive beyond the current prompt.

They are related, but they are not the same thing.

You can have a huge context window and still have bad memory.

You can also have a smaller context window and still create a good memory experience if you retrieve the right information at the right moment.

Here is the difference:

Context window:
"What can the model see right now?"
Memory:
"What should the product preserve and reuse later?"
For a simple chatbot, a larger context window may be enough.

For an AI character, it usually is not.

Why dumping history into the prompt fails

The naive approach looks like this:
Take the full chat history
↓
Append it to the prompt
↓
Ask the model to continue
This works for short conversations.

Then it starts to break.

1. It becomes expensive

Long prompts cost more.

They also increase latency, which matters a lot in conversational products. If every reply becomes slower because the product keeps inserting more and more history, the experience starts to feel heavy.

For AI companions and character chats, response speed is part of the emotional experience.

A delayed answer can break the rhythm.

2. It becomes noisy

More context is not always better context.

If the prompt contains too many old messages, the model may focus on irrelevant details.

The user mentioned a random movie once three weeks ago.
The model suddenly brings it up at the wrong moment.
The user feels watched, not understood.

Bad memory can be worse than no memory.

Good memory is selective.

3. It does not rank importance

Raw chat history does not tell the model what matters.

A user may say:

"I prefer slow, quiet conversations when I'm tired."
That is probably important.

The same user may also say:

"I had pasta today."
That is probably not important unless it becomes a recurring preference.

A context dump treats both as just text.

A memory system should not.

4. It does not handle cross-session continuity well

Users do not always talk in one long uninterrupted thread.

They return tomorrow.
They switch devices.
They open Telegram, then continue in the browser.
They talk to different characters.
They start a new roleplay world.

A context window alone does not solve this.

Memory has to exist outside one prompt and one session.

What AI character memory actually needs to preserve

When people hear “memory,” they often think of fact recall.

Things like:

User's name
User's favorite movie
User's city
User's pet's name
These can be useful, but AI character memory is broader than facts.

A character should also remember patterns.

For example:

User prefers short replies when tired.
User likes slow-burn fantasy roleplay.
User dislikes overly energetic responses.
User is practicing Spanish casually.
User and this character are in a cautious but warm relationship dynamic.
The current story arc is set in an abandoned library.
For AI characters, the most useful memory is often not a fact.

It is a preference, a dynamic, or a narrative state.

A practical memory stack

Here is a simplified architecture that I find useful:

User message
↓
Input moderation / safety checks
↓
Session context
↓
Memory retrieval query
↓
Relevant memories from vector database
↓
User profile + character state + relationship state
↓
Prompt assembly
↓
LLM response
↓
Memory extraction / summarization
↓
Store / update / ignore / delete
This is not the only possible architecture, but it separates the main responsibilities.

Let’s break it down.

1. Session context

Session context is the short-term state of the current conversation.

It includes:

recent messages;
current topic;
active scene;
temporary instructions;
immediate user request.

It answers the question:

What is happening right now?
This layer usually lives directly in the prompt.

It is necessary, but it is not long-term memory.

If session context is your only memory layer, the character may feel coherent for one conversation and then reset later.

2. User profile memory

User profile memory stores relatively stable preferences about the user.

Examples:

User prefers concise replies.
User likes calm conversations.
User is practicing Japanese.
User prefers being called Alex.
User dislikes pushy motivational language.
This memory should be handled carefully.

It directly affects trust.

If the system stores incorrect preferences, the user should be able to correct them. If the system stores sensitive information, the user should understand how memory works.

For consumer AI, memory is not only an engineering problem.

It is also a trust problem.

3. Character state

AI characters also need memory about themselves.

This is where many products fail.

They remember something about the user, but the character drifts.

Character state can include:
Character personality
Backstory
Speaking style
Emotional range
Relationship constraints
Visual identity
Voice style
Current character arc
Example:

Character state:

Reserved and calm.

Uses dry humor.

Trust develops slowly.

Avoids sudden emotional intensity.

Replies in short, thoughtful sentences unless asked for detail. For character products, consistency is part of the product contract.

If the user chooses or creates a character, they expect that character to remain recognizable.

4. Relationship state

Relationship state is different from global user memory.

The same user may want different dynamics with different characters.

With one character, the tone may be playful.
With another, it may be mentor-like.
With another, it may be slow-burn roleplay.
With another, it may be language practice.

If everything is flattened into one global user profile, you lose this nuance.

Relationship state answers:

What is the current dynamic between this user and this character?
Example:

Relationship state:

User and character are building a slow-burn fantasy dynamic.

Current tone is cautious but warm.

Character should not act overly familiar yet.

They are gradually building trust. This layer matters a lot in roleplay and AI companion products.

A roleplay arc is not just chat history.

It is a shared state.

5. Semantic retrieval

This is where vector search becomes useful.

The goal is not to retrieve memories by exact keyword match.

The goal is to retrieve by meaning.

If the user says:

"I'm tired today. Can we do something quiet?"
A keyword-based system may not retrieve much.

A semantic system might retrieve:
User prefers calm, low-pressure conversations.
User likes quiet fantasy settings.
User often responds well to short, gentle replies.
User previously enjoyed an abandoned library scene.
That is the difference between literal memory and semantic memory.

A useful AI character memory system should retrieve meaning, not just words.

The exact vector database is an implementation detail. It could be ChromaDB, pgvector, Qdrant, Pinecone, Weaviate, or something else.

The product principle is the same:

Retrieve the context that helps the next response feel continuous.

6. Summary memory

Raw chat logs are usually not the best long-term memory format.

They are too verbose and too noisy.

A better approach is to summarize important sessions, scenes, or patterns.

Instead of storing twenty messages, store something like:

Summary:
User and character started a quiet fantasy scene in an abandoned library.
User preferred slow pacing, subtle tension, and gradual trust-building.
The scene ended with the character offering to show a hidden archive.
This is much more useful than blindly storing every line.

Summary memory helps with:

lower token usage;
clearer retrieval;
better prompt assembly;
less noise;
easier memory management.

But summaries must be updated carefully.

A bad summary can distort the relationship, the story, or the user’s preference.

7. Safety and privacy filters

Memory should not store everything.

This is one of the most important parts.

Some information should be ignored.
Some should be summarized.
Some should expire.
Some should require explicit user control.
Some should never become personalization memory.

Examples:

Do not store:

sensitive personal identifiers unless truly needed;

crisis messages as normal personalization memory;

unsafe content;

random one-off details with no future value;

private information that the user did not intend as a preference.
Store carefully:

communication preferences;

boundaries;

language-learning goals;

recurring story state;

character-specific relationship dynamics.
The more personal the product feels, the more careful memory needs to be.

Bad memory vs good memory

Here is a simple example.

User says:

I like slower conversations. I’m into quiet fantasy settings, abandoned libraries, and characters who reveal themselves gradually.
Bad memory:

User likes fantasy.
Better memory:

User prefers slow-paced fantasy scenes, quiet atmosphere, abandoned-library settings, gradual emotional reveal, and low-pressure dialogue.
Why is the second better?

Because it preserves the pattern, not just the noun.

The useful memory is not “fantasy.”

The useful memory is the user’s preferred interaction style.

That difference matters a lot in AI character products.

Prompt assembly example

Once the memory layers exist, the next step is prompt assembly.

A simplified prompt may look like this:

System:
You are the selected AI character. Stay consistent with the character profile.
Character state:

Reserved, calm, dry humor.

Trust develops slowly.

Avoids sudden emotional intensity. Relationship state:

User and character are building a slow-burn fantasy dynamic.

Current tone: cautious but warm.

Continue from the abandoned library arc if relevant. Relevant user memories:

User prefers slow-paced scenes.

User dislikes overly energetic replies.

User is practicing Spanish casually.

User prefers short replies when tired. Current session: User: "I'm tired today. Can we do something quiet?" The response should not simply list the memories.

That would feel robotic.

The model should use memory to choose a better response.

For example:

Of course. We can keep it quiet tonight.
Maybe we return to the old library — not the dangerous part yet, just the upper floor where the rain taps against the glass roof. I can show you one small secret, and we do not have to rush.
Of course. We can keep it quiet tonight.

Maybe we return to the old library — not the dangerous part yet, just the upper floor where the rain taps against the glass roof. I can show you one small secret, and we do not have to rush.
The user does not need to see the memory system.

They just need to feel continuity.

Memory extraction

After the model replies, the system needs to decide whether anything should be stored or updated.

This is where many products over-store.

Not every message deserves memory.

A memory extraction step can classify information like this:

Should this message create or update memory?
Categories:

stable preference

temporary preference

character-specific relationship state

roleplay world state

language-learning goal

safety boundary

no memory needed Example:

User: Actually, I prefer shorter replies when I'm tired.

This should probably update memory:

Memory update:

User prefers shorter replies when tired.

Another example:

User: I had pasta today.

This usually should not become long-term memory.

Unless it becomes a repeated preference or relevant part of the current story, it can be ignored.

The hard part is knowing the difference.

A simple memory extraction prompt

A simplified extraction prompt could look like this:

You are a memory extraction system.
Given the conversation, extract only information that will likely improve future conversations.
Do not store sensitive personal data unless the user clearly intends it as a preference.
Do not store one-off details unless they are important for an ongoing story or relationship.
Do not store unsafe content.
Return JSON:
{
"should_store": boolean,
"memory_type": "stable_preference | temporary_preference | relationship_state | story_state | language_goal | safety_boundary | none",
"memory": "short memory text",
"reason": "why this is useful or not useful"
}
Example output:

{
"should_store": true,
"memory_type": "stable_preference",
"memory": "User prefers shorter replies when tired.",
"reason": "This preference can improve future response style."
}
This is not enough for production by itself, but it shows the idea.

Memory extraction should be explicit, structured, and conservative.

Common mistakes

Here are the mistakes I would avoid.

Mistake 1: Storing too much

More memory is not always better.

Too much memory creates noise and can make the character bring up irrelevant details.

Mistake 2: Storing facts instead of patterns

Facts are useful, but patterns are often more valuable.

User likes fantasy.

is weaker than:

User prefers slow-paced fantasy scenes with gradual trust-building.

Mistake 3: Mixing global user memory with character-specific state

A user may want different dynamics with different characters.

Do not flatten everything into one profile.

Mistake 4: Making memory creepy

If the character constantly says:

I remember that you told me...

the experience can become uncomfortable.

Good memory should be felt, not announced every time.

Mistake 5: No user control

Users should understand that memory exists.

They should have reasonable ways to correct, manage, or clear it.

Memory without control damages trust.

Mistake 6: Treating safety as an afterthought

Safety rules should be part of the memory pipeline.

Not something added later.

Where HoneyChat fits

This is the direction we are building toward in HoneyChat: AI characters for Telegram and web with long-term memory, voice messages, AI photos, short videos, and character consistency.

The hard part is not making the first message impressive.

The hard part is making the next session feel connected.

A user should be able to start in Telegram, continue in the browser, return later, and still feel like the same character remembers the important parts.

That is the product goal.

Not infinite chat history.

Not a bigger prompt for the sake of it.

Continuity.

Final takeaway

The next generation of AI character products will not be judged only by model quality.

They will be judged by continuity.

Context windows make chats longer.

Memory makes characters persistent.

That is the real difference between a chatbot and a companion.

DEV Community: David

AI Companion in Production by Month Three: 5 Architecture Decisions and Infra Tuning

AI Companion in Production by Month Three: 5 Architecture Decisions and Infra Tuning

Table of contents

TL;DR

1. Memory: Redis + ChromaDB

Why a rolling summary is not enough

Architecture: two independent layers

Writing to Redis: bounded list + TTL in one pipeline

Reading: three sources in parallel

Production traps I hit

Race condition between two summarization tasks

The user clears history while summarization is still running

Empty summaries cached with long TTL

Missing collections for new users

2. LLM routing and prompt caching

Why one model for everything does not work

Free and paid plans pull the economics in opposite directions

Models treat content differently

Users choose their own relationship pace

Current model map

Prompt caching on Gemini 3.1: where 25% of the budget was hiding

The HTTP 200 empty-response trap

Fallback chain

3. Visual consistency: LoRA and IP-Adapter

3.1. Why “same prompt = same face” does not work

Batch size changes the output

External APIs shift samplers and defaults

Long prompts saturate

3.2. IP-Adapter alone is weak for faces

3.3. LoRA per character: scale and cost

3.4. What matters during training

3.5. IP-Adapter on top of LoRA for catalog items

weight

end_at

How to tune weight and end_at in practice

3.6. How this is assembled in the product

3.7. Visual stack traps

Face drifted on shop previews

Presigned reference URLs expired during the task

IP-Adapter version mismatch with SDXL

A non-visual item crashed the image pipeline

4. Unit economics in month three

Context

Counters and the “block on measurement failure” rule

Where the money goes

What prompt caching saves

5. Production tuning in month three

5.1. ChromaDB: LRU eviction and higher memory limit

5.2. Restarting uvicorn workers by request count

5.3. Graceful shutdown: 90 seconds for LLM requests to finish

5.4. Daily cost ceiling: $30 → $50

Summary: where the ceilings moved

6. What I would rebuild differently today

Memory

LLM routing

Images

Unit economics and production tuning

Still open bottlenecks

Where this runs in production

Sources and related docs

AI Chatbot Memory Architecture in 2026 — RAG, Long Context, and Hybrid Approaches Compared

The four memory approaches you'll see in production

Pure context window — the cheap default

Vector-based RAG — the standard "real memory" approach

Structured fact extraction — for "hard" memory

Hybrid: the production-grade pattern

Memory architecture decisions in the wild

When pure context window is enough

The cost reality

Recommendations for builders

Why Context Window Is Not Enough for AI Character Memory

The quick version

Context window vs memory

Why dumping history into the prompt fails

1. It becomes expensive

2. It becomes noisy

3. It does not rank importance

4. It does not handle cross-session continuity well

What AI character memory actually needs to preserve

`weight`

`end_at`

How to tune `weight` and `end_at` in practice