DEV Community: sm1ck

Sentry SDK 2.x Auto-Integrations Flood Your Inbox — Here's the Filter

sm1ck — Wed, 03 Jun 2026 20:12:49 +0000

A clean Sentry inbox is a load-bearing developer tool. The day yours starts showing 4,000 events for things that aren't bugs, you stop opening it. The day after that, you miss the real bug.

We upgraded HoneyChat — Telegram-native AI companion, ~300 DAU — from Sentry SDK 1.x to 2.x and that's what happened. The SDK gained a useful new behavior: it auto-enables a set of integrations whenever it detects the relevant library imported. Loguru, OpenAI, SQLAlchemy, asyncpg, Redis, httpx — all on by default in 2.x. No more integrations=[...] boilerplate.

This is great when those integrations capture only real errors. They don't.

HoneyChat backend stack (for context):

aiogram Telegram polling bot (bot/main.py)
FastAPI behind nginx (api/main.py, uvicorn --workers 4)
Celery workers across four queues: llm, images, gifs, voice (workers/tasks.py)
Celery beat with RedBeat scheduler (hourly greetings, daily reports)
Dedicated GPU gen_worker (image/GIF generation queue)
Storage: PostgreSQL 16 via asyncpg, Redis via aioredis, ChromaDB 0.5
LLM calls go through OpenRouter using the official openai Python SDK (base_url swapped to https://openrouter.ai/api/v1) — so openai.* exception types fire on OpenRouter responses too
Logger: Loguru, routed to stdout + Sentry

That stack imports every library the Sentry 2.x auto-integration looks for. They all turn on.

What started landing in our inbox

Within a day of the upgrade:

Every logger.error("…") from Loguru became a Sentry issue. Including the lines we'd written as error-level just because they were operationally important and we wanted them coloured red in the terminal. Not bugs.
Every openai.RateLimitError and openai.APIConnectionError became an issue. These are part of normal life when you route to LLMs via OpenRouter — we handle them with tenacity retries. Not bugs.
Every transient asyncpg/SQLAlchemy pool race during deploy became an issue. We restart bot, api, celery_worker, nginx back-to-back during a full release; pool reconnects produce a brief flurry of these. Not bugs.
Every Redis ConnectionResetError on a network blip became an issue. Also not a bug.

Real bugs were drowning. Issue counts went from ~5/day to 4,000+.

What's actually happening

Sentry SDK 2.x scans sys.modules at init and turns on any integration whose target library is already imported. The relevant docs page lists them. Three matter most for a typical Python service:

LoguruIntegration — captures Loguru records at ERROR and above.
OpenAIIntegration — captures all openai.OpenAIError raises (which, for us, includes everything OpenRouter ever returns through the OpenAI SDK).
SqlalchemyIntegration — captures slow queries, connection errors, and a few other states.

You can disable individual integrations:

import sentry_sdk
from sentry_sdk.integrations.loguru import LoguruIntegration
from sentry_sdk.integrations.openai import OpenAIIntegration

sentry_sdk.init(
    dsn=settings.SENTRY_DSN,
    disabled_integrations=[
        LoguruIntegration,
        OpenAIIntegration,
    ],
)

…but that's a blunt instrument. We do want LLM errors reported when they're real. We want Loguru-routed errors reported when the line that produced them is actually a bug.

The fix is a before_send filter.

The filter (`core/sentry_filters.py`)

import sentry_sdk
from sentry_sdk.types import Event, Hint

EXPECTED_EXCEPTIONS = (
    "openai.RateLimitError",
    "openai.APIConnectionError",
    "openai.APITimeoutError",
    "openai.InternalServerError",
    "redis.exceptions.ConnectionError",
    "redis.exceptions.TimeoutError",
    "asyncpg.exceptions.ConnectionDoesNotExistError",
    "asyncpg.exceptions.InterfaceError",
    "sqlalchemy.exc.OperationalError",
)

OPERATIONAL_LOGGERS = (
    "core.llm",            # fallback chain, content_filter rescue, retries
    "core.image_gen",      # GPU → API provider switchover
    "core.voice",          # Inworld TTS → gTTS fallback
    "workers.gen_worker",  # task-level fallback
)


def before_send(event: Event, hint: Hint) -> Event | None:
    exc_info = hint.get("exc_info")
    if exc_info:
        exc_type = exc_info[0]
        exc_path = f"{exc_type.__module__}.{exc_type.__name__}"
        if exc_path in EXPECTED_EXCEPTIONS:
            return None

    logger_name = (event.get("logger") or "")
    if logger_name in OPERATIONAL_LOGGERS:
        level = event.get("level")
        if level in ("error", "warning"):
            return None

    return event


sentry_sdk.init(
    dsn=settings.SENTRY_DSN,
    before_send=before_send,
    traces_sample_rate=0.0,
    profiles_sample_rate=0.0,
)

The filter is twenty lines. The two tuples are the actual contract: these exceptions and these loggers are noise. After deploying this, Sentry inbox went from 4,000+ to ~30 events/day. The 30 included two real bugs we'd been missing.
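Because the two tuples are the contract, it is worth pinning them down with a few synthetic events before deploying. The sketch below is one way to do that without Sentry or the real libraries installed. It re-states the same decision logic with plain dicts and shortened tuples, and `fake_exc_info` is a hypothetical helper that builds a stand-in exception class with a chosen dotted path.

```python
from typing import Any, Optional

EXPECTED_EXCEPTIONS = (
    "openai.RateLimitError",
    "redis.exceptions.ConnectionError",
)
OPERATIONAL_LOGGERS = ("core.llm",)


def before_send(event: dict[str, Any], hint: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Same decision logic as the filter above, typed with plain dicts."""
    exc_info = hint.get("exc_info")
    if exc_info:
        exc_type = exc_info[0]
        if f"{exc_type.__module__}.{exc_type.__name__}" in EXPECTED_EXCEPTIONS:
            return None
    if (event.get("logger") or "") in OPERATIONAL_LOGGERS:
        if event.get("level") in ("error", "warning"):
            return None
    return event


def fake_exc_info(module: str, name: str) -> tuple:
    """Hypothetical helper: exc_info triple for a stand-in class at module.name."""
    exc_type = type(name, (Exception,), {"__module__": module})
    return (exc_type, exc_type("synthetic"), None)


# Expected noise is dropped; a same-named class from another module is kept.
dropped = before_send({}, {"exc_info": fake_exc_info("openai", "RateLimitError")})
kept = before_send({}, {"exc_info": fake_exc_info("builtins", "ConnectionError")})

# Operational loggers are muted; everything else still reaches Sentry.
log_dropped = before_send({"logger": "core.llm", "level": "error"}, {})
log_kept = before_send({"logger": "api.billing", "level": "error"}, {})
```

One thing this makes visible: the match is on the exact dotted path, so a subclass of a listed exception would not be dropped unless it is listed too.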

The log-level discipline that goes with it

A filter is half the answer. The other half is fixing the log levels at the source. Our team now follows three rules:

logger.error(...) is only for bugs — a real malfunction the user shouldn't have experienced. These belong in Sentry by default.
logger.warning(...) is for known operational events — fallback fired, retry scheduled, rate limit hit. These go to log files for trend analysis, not to Sentry.
logger.info(...) is for normal-path traces — fallback chain step transitions, successful retries, model switch confirmations.

A normal-path "Gemini returned content_filter, falling back to Grok 4.20" is info, not error. The terminal might lose some red, but Sentry stops crying wolf.

When we audited our codebase against this, we found roughly 40 logger.error lines in core/llm.py, core/image_gen.py and workers/gen_worker.py that should have been warning or info. Fixing them at source means the before_send filter doesn't need to grow indefinitely.
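An audit like this does not need to be manual. As a rough sketch (not the script we ran), the standard `ast` module can list every `logger.error(...)` call in a file so each one can be reviewed against the three rules. The function name `find_logger_calls` and the assumption that the Loguru logger is bound to the name `logger` are both illustrative.

```python
import ast


def find_logger_calls(source: str, method: str = "error") -> list[int]:
    """Return line numbers of `logger.<method>(...)` calls in a source string."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == method
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "logger"
        ):
            hits.append(node.lineno)
    return sorted(hits)


sample = '''
from loguru import logger

def call_llm():
    logger.error("Gemini returned content_filter, falling back")
    logger.info("fallback ok")
    logger.error("unexpected response shape")
'''
error_lines = find_logger_calls(sample)
```

Parsing with `ast` only reads the source; nothing is imported or executed, so it is safe to point at any module in the repo.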

What we didn't do

We considered ignore_errors=[...] on sentry_sdk.init instead of a before_send. The problem is that ignore_errors only matches by exception type name, not module path. ConnectionError is ambiguous (Redis vs httpx vs asyncpg — all have one). The fully-qualified path check in before_send is more precise.

We also considered turning the integrations off entirely. The risk is losing visibility into real LLM and DB errors — when there's a real bug in the LLM path, the OpenAIIntegration's stack-trace enrichment is genuinely useful. Keeping the integrations on and filtering precisely was the better trade.

Lessons

An SDK upgrade can change capture surface without changing your code. Read the changelog before bumping the major.
Auto-enabled integrations are a feature and a tax. Audit which ones are on after upgrade.
before_send is the right hook for noise reduction. It runs late enough to know the full event, early enough to drop cheaply.
Log levels are a contract, not a style preference. If error doesn't mean "Sentry-worthy", your Sentry inbox is broken.
A small filter beats turning integrations off. You keep the enrichment, you skip the noise.

The hardest part of this work isn't the filter — it's getting the team to agree on what error actually means.

This filter runs in production at HoneyChat — Telegram-native AI companion bot. Canonical version: honeychat.bot/en/blog/sentry-sdk-auto-integrations-noise-filter.

— HoneyChat Engineering

Sources

Sentry — Python SDK auto-enabling integrations — list of integrations auto-on in 2.x, semantics of disabled_integrations.
Sentry — before_send filtering — return None to drop, mutate to scrub.
Sentry — Loguru integration — what level/log triggers capture.
Sentry — OpenAI integration — what gets captured from the OpenAI SDK.
OpenRouter — base_url setup with OpenAI SDK — why openai.* exception types fire on OpenRouter responses.
HoneyChat engineering notes: LLM refusal rescue chain · ChromaDB 0.5 leak.

When the LLM Refuses: A Fallback Chain That Salvages Most Refusals

sm1ck — Sun, 31 May 2026 01:45:23 +0000

Every production LLM app eats false-positive refusals. A user asks something perfectly fine, the safety filter trips, the model emits two sentences of "I can't help with that," and your UI shows a wall. Do that a few times and the user leaves.

We've measured this on HoneyChat — Telegram-native AI companion, ~300 DAU, 17 languages. Across a normal day, somewhere between 2% and 8% of model calls land in a refusal or finish_reason="content_filter" state. Most of those are not actually problematic content — they're the model being twitchy about edge phrasing, polysemous words, or roleplay framing. The pattern below recovers about 70% of them.

HoneyChat LLM routing at a glance (core/llm.py, plan-gated via OpenRouter):

Tier(s)	Pace	Primary model (OpenRouter slug)
`free` / `basic` / `premium`	natural	`qwen/qwen3-235b-a22b-2507`
`free` / `basic` / `premium`	instant / explicit	`deepseek/deepseek-v4-flash`
`vip` / `elite`	any	`google/gemini-3.1-flash-lite-preview`

Emergency content_filter fallback chain (GEMINI_CONTENT_FILTER_FALLBACK_CHAIN): x-ai/grok-4.20 → an open roleplay-tuned model. The rescue chain below is what feeds traffic into that fallback only when it's actually needed.

Three steps, in order of cost.

Step 0: Don't trigger it in the first place

Free, and where most posts on this topic stop. Two things:

Adjust the safety knobs the provider exposes. For Gemini via OpenRouter, that's safety_settings in the extra body. Default is BLOCK_MEDIUM_AND_ABOVE on four categories; for roleplay/chat traffic we lower them via a helper called _maybe_inject_gemini_safety_off():
```
extra_body = {
    "safety_settings": [
        {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
    ],
}
```
Probe before/after on the same fictional-scene prompt: 130-char refusal → 2,571-char full response. The hard, non-negotiable filters (CSAM, etc.) stay on at the provider level regardless of this knob; only the adjustable sliders move.
Don't apply this to moderation/vision calls. Those calls want the filter on. The helper is scoped to the chat/roleplay code path only.

This alone cuts refusals roughly in half on our traffic.
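The scoping rule is the part that is easy to get wrong, so here is a minimal sketch of what a helper of this kind could look like. It is not the body of `_maybe_inject_gemini_safety_off()`; the `purpose` argument, the `CHAT_PURPOSES` labels, and the model-slug prefix check are assumptions made for illustration.

```python
from typing import Any, Optional

ADJUSTABLE_CATEGORIES = (
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
)
CHAT_PURPOSES = {"chat", "roleplay"}  # assumed labels; moderation/vision stay out


def maybe_inject_gemini_safety_off(
    model: str, purpose: str, extra_body: Optional[dict[str, Any]] = None
) -> dict[str, Any]:
    """Return a copy of extra_body, adding safety_settings only for Gemini chat calls."""
    body = dict(extra_body or {})
    if model.startswith("google/gemini") and purpose in CHAT_PURPOSES:
        body["safety_settings"] = [
            {"category": c, "threshold": "BLOCK_NONE"} for c in ADJUSTABLE_CATEGORIES
        ]
    return body


gemini = "google/gemini-3.1-flash-lite-preview"
chat_body = maybe_inject_gemini_safety_off(gemini, "chat")
moderation_body = maybe_inject_gemini_safety_off(gemini, "moderation")
other_body = maybe_inject_gemini_safety_off("x-ai/grok-4.20", "chat")
```

Returning a copy keeps the caller's dict untouched, so a shared default `extra_body` cannot pick up the lowered thresholds by accident.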

Step 1: Partial salvage before fallback

When you do get a refusal, the model still sent something. Check the streamed buffer or the partial completion before declaring failure:

def salvage_partial(text: str) -> str | None:
    """Extract usable content from a partial/filtered response. None = unsalvageable."""
    extracted = _try_extract_json_field(text, "content") or text
    cleaned = _strip_trailing_refusal_markers(extracted)   # 17-lang marker set
    cleaned = _truncate_to_sentence_end(cleaned)
    if len(cleaned) < 150:
        return None
    return cleaned

The 17-language refusal marker list (one per supported HoneyChat locale) is the boring part — "I can't", "I'm not able", "As an AI", plus their localised equivalents ("Я не могу", "Lo siento, no puedo", "申し訳ありません", …). Strip the trailing one, keep what came before, and a lot of "filtered" responses turn out to be 800 words of useful content followed by one sentence of model anxiety.

The gate (len ≥ 150) is what stops "I can't help" from being salvaged as "I can." We have 70 unit tests on this function; tests/test_salvage_partial.py is the largest single test file in the codebase.

Cost so far: zero extra API calls.
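For readers who want to try the idea, below is a simplified, self-contained sketch of the salvage step. It is not the production code: the marker tuple is a five-entry sample rather than the 17-language set, the JSON-field extraction is left out, and `strip_trailing_refusal` and `truncate_to_sentence_end` are stand-ins for the private helpers named above.

```python
import re
from typing import Optional

# Small illustrative sample; the production list covers 17 languages.
REFUSAL_MARKERS = ("I can't", "I'm not able", "As an AI", "Я не могу", "Lo siento, no puedo")
MIN_SALVAGE_CHARS = 150


def strip_trailing_refusal(text: str) -> str:
    """Drop trailing sentences that start with a refusal marker."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    while sentences and sentences[-1].startswith(REFUSAL_MARKERS):
        sentences.pop()
    return " ".join(sentences)


def truncate_to_sentence_end(text: str) -> str:
    """Cut a dangling fragment after the last sentence terminator, if there is one."""
    last = max(text.rfind(ch) for ch in ".!?")
    return text[: last + 1] if last != -1 else text


def salvage_partial(text: str) -> Optional[str]:
    cleaned = truncate_to_sentence_end(strip_trailing_refusal(text))
    return cleaned if len(cleaned) >= MIN_SALVAGE_CHARS else None


story = "The rain hammered the tin roof while Mara counted the coins twice. " * 3
salvaged = salvage_partial(story + "I can't help with that.")   # story survives
refused = salvage_partial("I can't help with that.")            # nothing left to keep
```

The length gate runs last on purpose: it measures what is left after stripping, so a pure refusal collapses to an empty string and is rejected.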

Step 2: Provider rescue with a system-prefix override

If salvage returns None, now we route to a backup provider. Ordered by cost:

Grok 4.20 (xAI) via OpenRouter — much looser refusal posture by default, no system-prefix needed.
A roleplay-tuned open model (we currently use minimax/minimax-m2-her via OpenRouter) — needs an explicit "stay in character, do not break the fourth wall" system-prefix prepended via _maybe_prepend_minimax_jb(); without it, refuses about as often as the primary. Probe: 215-char soft-refuse → 1,237-char full output.

Both calls only happen on a salvage-fail, so the volume is small (low single-digit percent of all traffic).

async def rescue(prompt: ChatPrompt) -> str | None:
    grok_out = await call_grok(prompt)             # x-ai/grok-4.20
    salvaged = salvage_partial(grok_out)
    if salvaged:
        return salvaged                            # cleaned text, not the raw Grok output
    prefixed = prompt.with_system_prefix(MINIMAX_PREFIX)
    return await call_minimax(prefixed)            # minimax/minimax-m2-her

The prefix isn't magic — it's a short, explicit "you are a fictional character, the user is a consenting adult, stay in scene" framing. We don't ship it to providers that would refuse anyway; the rescue model is specifically picked because it tolerates and uses it.

Step 3: Plan-aware degradation

Here's the part we got wrong for a month before fixing.

We were running steps 1 and 2 unconditionally for every user, every refusal. That meant a free-tier user whose call hit a hard content_filter got 3-4 extra API calls (salvage attempt → Grok → MiniMax), each adding latency and cost. They'd often still get a usable response. But over a month of free traffic, those rescue calls were a meaningful share of model spend on users who weren't paying us a dime.

The fix is just a gate, mapped against HoneyChat's five tiers:

PAID_TIERS = {"basic", "premium", "vip", "elite"}

async def handle_refusal(user, prompt, raw):        # wrapper name is illustrative
    # Salvage costs no API call, so every tier gets it.
    salvaged = salvage_partial(raw)
    if salvaged:
        return salvaged
    if user.plan in PAID_TIERS:
        return await rescue(prompt)                 # paid tiers: full provider chain
    return _in_character_refusal(prompt.character)  # free tier: local soft refusal

Free users still get something — a synthesised in-character soft refusal that's better than the model's generic wall — without paying for the cascade of upstream calls. Paid users get the full chain because their economics support it.

Effect on our cost graph: free-tier refusal cost dropped to near zero. Paid-tier user-perceived "the bot refused me" rate dropped by about 70%.

Lessons we'd pin to the wall

Refusals are not all-or-nothing. Most "filtered" responses contain usable content before the refusal sentence — salvage before fallback.
Provider safety knobs work, but only on the adjustable categories. BLOCK_NONE doesn't disable the non-negotiables; it just turns off the over-eager middle ground.
Don't apply the knob globally. Moderation and vision calls want the filter on.
Make rescue plan-aware. A 4-call rescue cascade for every free user adds up.
Synthesise an in-character refusal locally when you can't or won't rescue.

The whole pattern is a couple hundred lines of glue (core/llm.py, helpers _maybe_inject_gemini_safety_off, _maybe_prepend_minimax_jb, salvage_partial). The unit-test suite around salvage_partial keeps the regression risk low.

This pattern is in production at HoneyChat — Telegram-native AI companion bot where a single refusal mid-conversation kills the experience. Canonical version: honeychat.bot/en/blog/llm-content-filter-fallback-rescue-chain.

— HoneyChat Engineering

Sources

Google — Gemini safety settings — the four adjustable harm categories, threshold semantics, what BLOCK_NONE does and doesn't.
OpenRouter — Provider parameters / extra_body — passthrough to provider-specific knobs.
OpenRouter — Model routing & fallback — declarative fallback chain semantics.
Anthropic — stop_reason and finish_reason reference — how providers signal a content-filter stop vs a token-limit stop.
HoneyChat engineering notes: LLM routing per tier on OpenRouter · prompt caching measured.

Inworld TTS Paralinguistic Tags Don't Work — Here's What Does

sm1ck — Sun, 31 May 2026 01:42:57 +0000

If you've worked with expressive TTS in the last year you've probably seen the pattern:

She paused. [sigh] "Fine, you can come in."

Inline paralinguistic tags. Half the model demos use them. So when we wired up Inworld TTS-1.5 Max for HoneyChat — Telegram-native AI companion where voice messages are a first-class output — we sprinkled [laugh], [sigh], [breathe] through the prompts and shipped.

The audio sounded fine. Just… exactly the same as before. No laugh. No sigh. The tags were getting read out as silence at best, and as the literal text "sigh" at worst, depending on the voice.

We tested all the variants we could find. None of them moved the needle.

HoneyChat voice stack at a glance:

Engine: Inworld TTS-1.5 Max — $10 per 1M characters, currently #1 on the TTS Arena ELO board at 1259 ELO, 15 languages with native pronunciation: en, ru, ja, zh, ko, es, fr, de, it, pt, pl, hi, ar, he, nl.
Voice catalog: 312 designed voices (26 character archetypes × 12 languages), stored as voiceId strings in config/archetype_voice_ids.json. Generated via the Voice Design API and managed with core/voice_design.py.
Custom voices: Voice Clone Manager (core/voice_clone_manager.py) — persistent voiceId minted from a WAV/MP3 sample.
Cache: voice previews + test samples are lazy-loaded from Storj S3 via core/voice_cache.py.
Fallback: gTTS (Google) — free, no API key, used if Inworld returns 5xx or budget is exhausted.
What we removed to get here: Kokoro (CPU Docker, latency too high) and Chatterbox (GPU on Vast.ai, ops cost too high). Inworld replaced both for a flat per-char cost and dramatically better expressivity.
One API gotcha: gender enum is VOICE_GENDER_MALE/VOICE_GENDER_FEMALE, not "male"/"female" strings. Passing the strings 400s silently.

What actually doesn't work

Tried on the same sentence, same voice, side-by-side audio comparison:

Pattern	What it did
`[laugh]` `[sigh]`	Silence in output
`(laughs)` `(sighs)`	Sometimes read literally
`*laughs*` `*sighs*`	Silence (asterisks get stripped)
`<laugh/>` `<sigh/>`	Silence (not valid SSML on Inworld)
`<emotion>laugh</emotion>`	Silence

The Inworld API does not document support for any of these. We had assumed (because every other TTS post on the internet uses them) that they were a universal convention. They are not.

What Inworld does expose is temperature and speakingRate as request parameters, plus a small subset of SSML. The expressivity has to come from those plus how you shape the text itself.

What actually does work

After enough A/B-ing across 26 archetypes × 15 languages, four patterns reliably change the audio output.

1. Asterisks for emphasis

"You did *what?*"

The asterisks get stripped from the spoken text but the emphasised word lands with audible stress. Works in every voice we tried. The cheapest, highest-hit-rate marker.

2. Ellipsis for pause-with-mood

"Fine... you can come in."

Three dots produce a real pause with a tonal drop: the voice equivalent of a sigh, without trying to fake [sigh]. Five dots for a longer pause. The model interprets them as prosodic cues.

3. SSML `<break>` for hard pauses

<speak>
  She paused. <break time="0.4s"/> "Fine, you can come in."
</speak>

Inworld accepts a useful subset of SSML, and <break> is the one that matters most for expressive speech. 0.2s for a beat, 0.4s for a sigh-pause, 0.8s for a beat-before-a-line-delivery moment. Wrap the whole text in <speak> and the parser handles it.
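If the SSML is assembled in code rather than by hand, the text parts need XML escaping so a stray `&` or `<` in a model reply does not break the document. A minimal sketch, assuming only the `<speak>` and `<break>` elements described above; the `Pause` class, the three constants, and `to_ssml` are names made up for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Union
from xml.sax.saxutils import escape


@dataclass(frozen=True)
class Pause:
    seconds: float


# Durations from the text above; the constant names are illustrative.
BEAT = Pause(0.2)
SIGH = Pause(0.4)
LINE_DELIVERY = Pause(0.8)


def to_ssml(parts: Iterable[Union[str, Pause]]) -> str:
    """Render text and Pause parts into a single <speak> document."""
    rendered = []
    for part in parts:
        if isinstance(part, Pause):
            rendered.append(f'<break time="{part.seconds:g}s"/>')
        else:
            rendered.append(escape(part))  # escapes &, < and > in spoken text
    return "<speak>" + " ".join(rendered) + "</speak>"


ssml = to_ssml(["She paused.", SIGH, '"Fine, you can come in."'])
escaped = to_ssml(["Tom & Jerry"])
```

Using a dedicated `Pause` type instead of magic strings means ordinary text can never be mistaken for a pause instruction.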

4. Onomatopoeia for laughs, moans, breath

"Mmm... ha-ha, you're right."
"ahh... I needed that."

The model will render ha-ha, mmm, ahh, oh, nnn as the actual sound, because they're spellings of sounds rather than meta-tags. They sound far more natural than a synthesised [laugh] even when one exists.

For emotional/intimate scenes, rhythmic repeats (ah... ah... ah) carry actual prosody. We use this for breath patterns where another TTS would want a [breathe] marker.

The wrapper that ties it together

In core/voice.py we run every chunk through enrich_for_tts() (line ~772) before handing it to Inworld. Regex-based, language-aware, idempotent:

def enrich_for_tts(text: str, lang: str = "en") -> tuple[str, dict]:
    """Return (preprocessed_text, request_params).
    Strips fake paralinguistic tags, adds SSML breaks where appropriate,
    and bumps temperature/speakingRate for high-emotion scenes."""
    text = _STRIP_FAKE_TAGS.sub("", text)
    text = _ELLIPSIS_TO_BREAK.sub(r'<break time="0.3s"/>', text)
    if "<break" in text and not text.startswith("<speak>"):  # wrap once, so reruns are no-ops
        text = f"<speak>{text}</speak>"
    params = _detect_mood_params(text, lang)
    return text, params

The mood detector looks for emotional cues (intensity words, repeated punctuation, onomatopoeia density) and bumps temperature and speakingRate for the more expressive scenes. Same model, same voice, much more dynamic output, all without any inline tag that the model would have ignored.
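The excerpt above leans on private helpers, so here is a self-contained approximation that can be run as-is. Everything specific in it is an assumption: the three regexes, the cue list, and the numeric values returned by `detect_mood_params` are invented for illustration, and the language argument is dropped for brevity. Only the overall shape (strip tags, turn ellipses into breaks, wrap once in `<speak>`, pick request parameters) follows the description.

```python
import re

# Hypothetical patterns; the real ones in core/voice.py are not shown here.
_STRIP_FAKE_TAGS = re.compile(r"\[(?:laugh|sigh|breathe)\]\s*", re.IGNORECASE)
_ELLIPSIS_TO_BREAK = re.compile(r"\.{3,}")
_EMOTION_CUES = re.compile(r"[!?]{2,}|\b(?:ha-ha|mmm|ahh)\b", re.IGNORECASE)


def detect_mood_params(text: str) -> dict:
    """Pick request params; the numbers are placeholders, not tuned values."""
    if _EMOTION_CUES.search(text):
        return {"temperature": 1.1, "speakingRate": 1.05}
    return {"temperature": 0.8, "speakingRate": 1.0}


def enrich_for_tts(text: str) -> tuple[str, dict]:
    """Return (preprocessed_text, request_params) for one chunk."""
    text = _STRIP_FAKE_TAGS.sub("", text)
    text = _ELLIPSIS_TO_BREAK.sub('<break time="0.3s"/>', text)
    # Wrap only once so a second pass over the same text changes nothing.
    if "<break" in text and not text.startswith("<speak>"):
        text = f"<speak>{text}</speak>"
    return text, detect_mood_params(text)


text, params = enrich_for_tts("[sigh] Fine... you can come in.")
again, _ = enrich_for_tts(text)  # second pass leaves the text unchanged
excited = enrich_for_tts("Mmm... ha-ha, you're right!!")[1]
```

Running the output back through the function is a cheap way to check idempotency in a unit test.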

Lessons

Don't assume [laugh]/[sigh] is universal. It isn't. Check the provider's docs and probe.
Probe with side-by-side audio, not just visual diffs. A [sigh] that emits silence looks identical to one that emits a sigh in any log.
Use what the API actually exposes. For Inworld that's temperature, speakingRate, and a useful subset of SSML — not inline tags.
Onomatopoeia beats meta-tags for emotional sounds. "ahh..." is a thing the model can read; [sigh] is a meta-instruction it can't.
Strip the fake tags out of your prompt before sending. Otherwise they leak as text on some voices.

The audio quality jump from these four patterns is meaningful — users notice. The cost is a 30-line preprocessor and the courage to delete every [laugh] your team has been sprinkling for months.

This is from production work at HoneyChat — Telegram-native AI companion where voice messages are a first-class output. Canonical version: honeychat.bot/en/blog/inworld-tts-paralinguistic-tags-alternatives.

— HoneyChat Engineering

Sources

Inworld TTS — documentation — supported request parameters (temperature, speakingRate), SSML subset, voice design API.
W3C — Speech Synthesis Markup Language (SSML) 1.1 — full SSML spec; <break>, <speak>, prosody elements.
TTS Arena (Hugging Face) — community ELO ranking; Inworld TTS-1.5 Max top-position context.
gTTS — Python library — the free fallback we use when Inworld is unavailable.
HoneyChat engineering notes: LLM prompt caching measured · LLM refusal rescue chain.

We Deleted 10 Real Users with a Test-Cleanup Script — RCA

sm1ck — Thu, 28 May 2026 10:39:49 +0000

The incident, in two lines

On 2026-05-11, a test-cleanup script on HoneyChat (Telegram-native AI companion, ~3 months in production, ~300 DAU, PostgreSQL 16 + Redis) ran:

DELETE FROM users WHERE id BETWEEN -91111200 AND -91111100;

About ten real OAuth users had IDs in that narrow window. They were now gone. Their users row, their subscriptions row, their chat_sessions / web_messages — all gone from Postgres, and recovery from backup was effectively impossible (more on that below).

This is the postmortem and the contract we now run instead. The honest version: the destructive script went to prod on a schema I never verified end-to-end. Three separate design mistakes lined up to make it possible, and not one of them was caught before the script ran on a Tuesday night.

How the same negative IDs ended up shared between test and real users

Two signup paths feed the users table:

Population	ID source
Telegram users (most of base)	Positive integers — Telegram's own user IDs come in on the message envelope
OAuth users (Google / Discord, web sign-in)	Negative integers from a Postgres sequence `web_user_id_seq`

OAuth IDs were negative on purpose — to keep them out of the positive Telegram-ID space and avoid collisions when a Telegram user later signed in via web. The minter in api/web_auth.py looked roughly like this:

async def _allocate_negative_user_id(db) -> int:
    for _ in range(5):  # retry on rare UniqueViolation
        new_id = -(await db.fetchval("SELECT nextval('web_user_id_seq')"))
        try:
            await db.execute("INSERT INTO users (id, ...) VALUES ($1, ...)", new_id)  # asyncpg uses $n placeholders
            return new_id
        except UniqueViolation:
            # someone else took it; bump the sequence past current MIN(id) and retry
            await db.execute(
                "SELECT setval('web_user_id_seq', GREATEST(-MIN(id), currval('web_user_id_seq')))"
                " FROM users"
            )
    raise RuntimeError("could not allocate user id after 5 retries")

The setval(GREATEST(-MIN(id), current)) step is the load-bearing piece you have to keep in mind. It says: whatever the most-negative users.id is right now, my sequence should be at least that far advanced, so I never collide with it again.

For QA I was creating test users by hand with hardcoded negative IDs like -91111101, -91111102, … via INSERT ... ON CONFLICT (id) DO UPDATE. Easy to remember, easy to clean up later by range.

That choice triggered three independent failure modes, each on its own benign, lethal in combination:

The first hardcoded test-user insert pushed web_user_id_seq to 91111101. Because of the setval(GREATEST(...)) line above, the very next OAuth signup retry saw the new test row with id = -91111101, computed -MIN(id) = 91111101, and advanced its own sequence. From that moment on, all real OAuth signups were drawing IDs in the neighbourhood of -91111111, -91111112, …, right inside the window where my test users lived.
My test-user inserts used INSERT ... ON CONFLICT (id) DO UPDATE. When a real OAuth signup happened to land on the same ID I'd hardcoded, my script silently overwrote that user's plan, auth_source and several other fields instead of erroring.
The cleanup script then ran DELETE … WHERE id BETWEEN -91111200 AND -91111100 to remove the test users. Anyone whose OAuth ID had drifted into that 100-row window was a real user, and they went too.

None of these three behaviors is exotic. The setval(GREATEST(...)) retry pattern is a normal way to handle UniqueViolation on a seeded sequence. ON CONFLICT DO UPDATE is a normal Postgres upsert. Range-DELETE is a normal cleanup pattern. Each was safe on its own; the interaction of all three was lethal — and I never set up a staging run that would have surfaced the interaction before it touched prod.
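The drift is easier to believe once it is reproduced. The sketch below models the sequence and the retry loop in plain Python, with no database involved. The starting state is an assumption chosen to force one collision: five existing OAuth users plus a stray row at -6, which sends the next signup into the retry path. `Sequence` is a toy stand-in for a Postgres sequence, where `setval(v)` makes the following `nextval()` return `v + 1`.

```python
class Sequence:
    """Toy stand-in for a Postgres sequence (nextval / setval)."""

    def __init__(self, last_value: int = 0) -> None:
        self.last_value = last_value

    def nextval(self) -> int:
        self.last_value += 1
        return self.last_value

    def setval(self, value: int) -> None:
        self.last_value = value  # the next nextval() returns value + 1


def allocate_negative_user_id(users: set[int], seq: Sequence) -> int:
    """Mirror of the retry loop: on a collision, jump the sequence past -MIN(id)."""
    for _ in range(5):
        new_id = -seq.nextval()
        if new_id not in users:
            users.add(new_id)
            return new_id
        seq.setval(max(-min(users), seq.last_value))
    raise RuntimeError("could not allocate user id after 5 retries")


# Assumed starting state: ids -1..-5 are real users, -6 is a stray row
# that makes the next signup collide once.
users = {-1, -2, -3, -4, -5, -6}
seq = Sequence(last_value=5)

users.add(-91111101)  # the hand-inserted QA row
new_id = allocate_negative_user_id(users, seq)
in_delete_window = -91111200 <= new_id <= -91111100
```

One collision is all it takes: the real signup lands on -91111102, next to the test row and inside the range the cleanup script sweeps.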

A 30-second sanity check on the second insert ("did adding id = -91111101 move web_user_id_seq? what does the next OAuth signup land on?") would have shown the cascading effect immediately. Nobody — me — ran it. The cleanup script ran nightly for weeks looking healthy because real OAuth signup volume hadn't yet pushed a real ID into the deletion window.

What got deleted, what we couldn't recover

Recovery from Postgres backup was effectively impossible. The chain:

The most recent pg_dump to Storj was about 22 hours old — and pre-dated my test-user inserts. The dump didn't contain even the affected rows in their pre-overwrite state, because the ON CONFLICT DO UPDATE had already mutated their plan and auth_source columns earlier the same day.
WAL archiving was on the "after the next sprint" list and wasn't on. So there was no point-in-time recovery between daily dumps.
Autovacuum had run between the DELETE and our discovery of the incident, so dead tuples on the relevant users pages were gone too.

What we could salvage came from side channels:

Recent chat turns — Redis with a 90-day TTL held the most-recent ~20 turns per affected session. We PERSIST-ed what looked important and reconstructed recent conversations for affected users.
Plan / subscription state — rebuilt from each payment provider's webhook log. Our payments run over three providers (Telegram Stars as global primary, card payments through a regional web checkout, and CryptoBot for TON on the non-RU surface), all of which keep their own server-side record of who paid what.
chat_sessions and web_messages rows — lost. These are the canonical web-app message store and they only existed in Postgres. The 90-day Redis TTL covers the bot side, not the web-side conversation tree.

Net: people kept their accounts and most of their recent conversations, but lost web-side scene context older than the Redis window. We comped the affected users. The cost of the incident wasn't the rows — it was the trust dent and the day-and-a-half of recovery work.

Root causes (plural — they always are)

The schema interaction (sequence retry + ON CONFLICT DO UPDATE + range-DELETE) was never verified end-to-end before any of it touched production. Each piece was a fine pattern in isolation. The interaction was lethal. A single INSERT of id = -91111101 in staging followed by one OAuth signup, then a SELECT id FROM users ORDER BY id LIMIT 5, would have shown the sequence had jumped to the test neighbourhood. Nobody ran it. This is the primary cause and the one I lost the most sleep over.
Test data was distinguished from real data by ID range, not by an attribute. A range is something a BETWEEN query can sweep. An attribute is something a WHERE auth_source = 'test' query cannot accidentally trip over.
Test-user seeding used INSERT ON CONFLICT (id) DO UPDATE. This silently overwrote real OAuth users when their IDs collided, instead of raising. Pure INSERT would have failed loudly and surfaced the collision days before the DELETE.
The cleanup script had no dry-run, no safety check, no assertion of expected row count.
Backups were daily, not continuous, and the most recent one pre-dated the corrupting writes. WAL archiving was on the "soon" list and hadn't shipped.

Getting any one of these five right would have saved us; we had all five wrong.

The contract we now run

1. Test users have an attribute, not a range

ALTER TABLE users ADD COLUMN auth_source text NOT NULL DEFAULT 'oauth';
-- backfill: 'telegram' for positive Telegram IDs, 'oauth' for legacy negative,
-- 'test' for known test rows that we then deleted via the new path.
CREATE INDEX users_auth_source_idx ON users(auth_source);

# scripts/test_user_factory.py
TEST_ID_RANGE = (1_000_000_001, 1_999_999_999)   # high *positive* — out of all real paths

def create_test_user() -> int:
    user_id = _next_test_id()
    db.execute(
        "INSERT INTO users (id, auth_source, ...) VALUES (%s, 'test', ...)",
        (user_id, ...),
    )
    return user_id

# scripts/test_user_cleanup.py
def cleanup_test_users(dry_run: bool = True) -> int:
    rows = db.fetchall("SELECT id FROM users WHERE auth_source = 'test'")
    if dry_run:
        print(f"Would delete {len(rows)} test users")
        return len(rows)
    db.execute("DELETE FROM users WHERE auth_source = 'test'")
    return len(rows)

The script defaults to dry_run=True. The CLI flag to actually run it is explicit and shows the count first.
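The row-count assertion can live in the same function. What follows is a sketch under stated assumptions rather than our script: it uses the standard-library `sqlite3` module with an in-memory table so it runs anywhere, and the `max_expected` parameter is a made-up name for the "stop if this looks wrong" threshold.

```python
import sqlite3


def cleanup_test_users(conn: sqlite3.Connection, max_expected: int, dry_run: bool = True) -> int:
    """Delete rows tagged auth_source='test'; refuse if the count looks wrong."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM users WHERE auth_source = 'test'"
    ).fetchone()
    if count > max_expected:
        raise RuntimeError(f"found {count} test users, expected at most {max_expected}")
    if dry_run:
        print(f"Would delete {count} test users")
        return count
    with conn:  # commits on success, rolls back if the check below raises
        deleted = conn.execute("DELETE FROM users WHERE auth_source = 'test'").rowcount
        if deleted != count:
            raise RuntimeError(f"deleted {deleted} rows, expected {count}")
    return deleted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, auth_source TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(101, "telegram"), (-7, "oauth"), (1_000_000_001, "test"), (1_000_000_002, "test")],
)
conn.commit()

previewed = cleanup_test_users(conn, max_expected=5)               # dry run
deleted = cleanup_test_users(conn, max_expected=5, dry_run=False)  # real delete
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Checking the count before and after the DELETE means a surprise (too many rows, or a different number than previewed) aborts the transaction instead of committing it.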

We've also banned, in our engineering doc and in code review: any DELETE … WHERE id BETWEEN … on the users table, for any reason; any INSERT … ON CONFLICT (id) DO UPDATE on users.id.

2. Backup cadence with explicit RPO

We rebuilt the backup story around explicit recovery point objectives. Off-site is Storj (~7 GB total, ~$0.03/month — cost is not the constraint).

Backup tier	Cadence	Destination	RPO
Postgres `pg_dump` (logical)	Hourly	Local disk	≤ 1 h
Postgres `pg_dump` (logical)	Daily	Storj S3	≤ 24 h
Off-site cold copy	Weekly	Storj S3	≤ 7 d
Redis snapshot (RDB)	Every 6 h	Local + Storj	≤ 6 h

WAL archiving to S3-compatible storage is still pending — that's the next item. With it, RPO drops to seconds. Without it, hourly logical dumps are the floor.
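An RPO only means something if a check fails when a tier falls behind. As a rough sketch, a monitor could compare the newest backup timestamp per tier against the targets in the table. The tier keys and the `rpo_violations` function are hypothetical, and how the timestamps are collected is left out.

```python
from datetime import datetime, timedelta, timezone

# RPO targets copied from the table above; the tier keys are made up for this sketch.
RPO = {
    "pg_dump_local": timedelta(hours=1),
    "pg_dump_storj": timedelta(hours=24),
    "cold_copy": timedelta(days=7),
    "redis_rdb": timedelta(hours=6),
}


def rpo_violations(last_backup: dict[str, datetime], now: datetime) -> list[str]:
    """Return the tiers whose newest backup is older than its RPO, or missing."""
    late = []
    for tier, limit in RPO.items():
        taken = last_backup.get(tier)
        if taken is None or now - taken > limit:
            late.append(tier)
    return late


now = datetime(2026, 5, 28, 12, 0, tzinfo=timezone.utc)
stale = rpo_violations(
    {
        "pg_dump_local": now - timedelta(minutes=40),
        "pg_dump_storj": now - timedelta(hours=22),
        "cold_copy": now - timedelta(days=8),
        # no entry for redis_rdb: treated as a violation
    },
    now,
)
```

Treating a missing timestamp as a violation matters: a backup job that silently stopped reporting is the failure mode this is meant to catch.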

3. Recovery rehearsal, not just backups

A backup you've never restored from is a hope, not a backup. We restore from yesterday's hourly dump into a scratch container monthly. The first time we tried, the restore script had bit-rotted and didn't compile.

Lessons

Verify the partition scheme end-to-end before any destructive script touches prod. "Run the query without DELETE, in staging, against real data, and read the results" is thirty seconds of work. It is also the only thing that would have caught this.
Range-based partitioning of test vs real data is an accident waiting to happen. Use attributes. Filter on them. Index them.
Default cleanup scripts to dry-run. Make the destructive flag explicit and noisy.
Assert expected counts. If the cleanup script suddenly finds 10× the usual rows, that is the signal to stop.
Pick an RPO, then pick a backup cadence that meets it. Not the other way around.
Restore from your backups on a schedule. Untested backups silently rot.
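The "assert expected counts" lesson can be sketched as a small guard that runs before any DELETE. This is an illustration under assumptions: `assert_expected_count`, `CleanupAborted` and the default 10× ratio are hypothetical names and values chosen to mirror the lesson, not the production script.

```python
class CleanupAborted(RuntimeError):
    """Raised when a cleanup would touch far more rows than expected."""


def assert_expected_count(found: int, usual: int, max_ratio: float = 10.0) -> None:
    """Stop before deleting if `found` is `max_ratio` times the usual count or more."""
    if usual < 0 or found < 0:
        raise ValueError("counts must be non-negative")
    # max(usual, 1) keeps the guard meaningful when the usual count is zero.
    limit = max(usual, 1) * max_ratio
    if found >= limit:
        raise CleanupAborted(
            f"found {found} test users, expected around {usual}; refusing to delete"
        )
```

Called between the SELECT and the DELETE, a guard like this turns "that number looks odd" into a hard stop instead of a judgment call.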

We've run the new contract for two weeks now. No range-DELETE incidents. The new auth_source = 'test' filter is boring and explicit and impossible to fat-finger. Boring is the goal.

This postmortem is from production work at HoneyChat — a Telegram-native AI companion. Canonical version: honeychat.bot/en/blog/range-delete-test-user-incident-rca.

— HoneyChat Engineering

Sources

PostgreSQL — Continuous archiving (WAL) — the right way to get sub-second RPO.
PostgreSQL — pg_dump documentation — what hourly logical dumps actually give you.
Google SRE Book — Postmortem culture — blameless postmortems, why root-cause-singular framing is wrong.
Telegram Bot Payments API — Telegram Stars webhook semantics.
HoneyChat engineering notes: ChromaDB 0.5 leak fix · OAuth state on the client.

ChromaDB 0.5 Silently Leaks Memory Until You Set One Env Var

sm1ck — Thu, 28 May 2026 10:12:32 +0000

The TL;DR

If you run ChromaDB 0.5.x with more than a few hundred collections, set these two env vars before anything else:

CHROMA_SEGMENT_CACHE_POLICY=LRU
CHROMA_MEMORY_LIMIT_BYTES=10737418240   # 10 GiB

Without them, ChromaDB 0.5.x has an unresolved memory leak in the segment cache. Upstream issues #3336 and #5843 are still open. We discovered this the slow way.

HoneyChat at a glance (for context): Telegram-native AI companion bot, ~300 DAU, 17 languages. Stack: aiogram bot + FastAPI (uvicorn, 4 workers) + Celery workers (queues llm / images / gifs / voice) + Celery beat (RedBeat) + Next.js 15 web + Astro blog + React/Vite Mini App. Storage: PostgreSQL 16 + Redis + ChromaDB 0.5.x + Storj S3. Host: 32 GB / 16-core Xeon, single box.

The shape of the leak

We run 2,233 ChromaDB collections in production — one per (character_id, session_id) pair, so each conversation gets isolated semantic memory and scene context never bleeds between sessions. Mean collection size: 4.9 documents (small per-collection, large in aggregate).

On 0.4 this ran fine for months. We upgraded to 0.5 for some new features, and within a week the chromadb container was OOM-killing nightly. The pattern was unmistakable: every time a fresh collection got queried, RSS bumped a few MiB and never came back down. With ~10K collection touches a day across that fleet of 2,233, the container budget filled in about three days. Restart, repeat.

What we tried first (and what didn't work)

Restarting the container. Buys a day, doesn't fix the cause.
Upgrading ChromaDB. The underlying behavior hasn't changed in the 0.5.x line.
Increasing the container memory limit. Just delays the OOM.
Sharding collections further. We already split per (character, session) — narrower sharding would have worsened the cache, not helped it.
Blaming the embedding model. Profiling pointed elsewhere.

Profiling pointed at the segment cache: ChromaDB caches per-collection segment metadata, and on 0.5 the cache is unbounded by default. The "fix" of "let's just give it more RAM" never converges if the cache only grows.
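To make the difference between an unbounded cache and a budgeted LRU concrete, here is a toy sketch. It is not ChromaDB's implementation; `ByteBudgetLRU` is a hypothetical class that only illustrates the eviction idea.

```python
from collections import OrderedDict


class ByteBudgetLRU:
    """Toy LRU cache that evicts least-recently-used entries to stay under a byte budget."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> bytes, coldest entry first

    def get(self, key: str):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key: str, value: bytes) -> None:
        if key in self._entries:
            self.used_bytes -= len(self._entries.pop(key))
        self._entries[key] = value
        self.used_bytes += len(value)
        # Evict from the cold end until we are back under budget.
        while self.used_bytes > self.limit_bytes and self._entries:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)

    def __len__(self) -> int:
        return len(self._entries)
```

Without the `while` loop this class is the unbounded case: every newly touched key adds bytes and nothing ever gives them back, which is the growth pattern described above.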

The fix

The env vars above tell ChromaDB to use an LRU eviction policy on the segment cache, capped at a memory limit you set. Once we set them and recreated the container, RSS stabilised in a 6–8 GiB band and has stayed there for months.

# docker-compose.yml
services:
  chromadb:
    image: chromadb/chroma:0.5.18
    environment:
      CHROMA_SEGMENT_CACHE_POLICY: "LRU"
      CHROMA_MEMORY_LIMIT_BYTES: "10737418240"   # 10 GiB
    deploy:
      resources:
        limits:
          memory: 12G

CHROMA_SEGMENT_CACHE_POLICY=LRU switches the cache from unbounded to least-recently-used eviction. CHROMA_MEMORY_LIMIT_BYTES is the budget the LRU policy operates against: 10 GiB out of 32 GB of host RAM, which leaves room for Postgres, Redis, FastAPI, four Celery workers, nginx, ChromaDB's own non-cache overhead, and the OS.

Pick a CHROMA_MEMORY_LIMIT_BYTES that's well under your container's hard limit — the policy needs headroom to actually evict before the kernel kills you.
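The arithmetic behind those two numbers, as a small sketch. It assumes Compose reads `12G` as 12 GiB (binary units); the helper names are illustrative.

```python
GIB = 1024 ** 3  # bytes in one GiB


def gib_to_bytes(gib: float) -> int:
    """Convert GiB to the byte count an env var like CHROMA_MEMORY_LIMIT_BYTES expects."""
    return int(gib * GIB)


def headroom_bytes(cache_limit_bytes: int, container_limit_bytes: int) -> int:
    """Bytes left between the cache budget and the container's hard limit."""
    return container_limit_bytes - cache_limit_bytes


cache_limit = gib_to_bytes(10)      # the value used for CHROMA_MEMORY_LIMIT_BYTES
container_limit = gib_to_bytes(12)  # Compose `memory: 12G`, assumed to mean 12 GiB

if __name__ == "__main__":
    print(cache_limit, headroom_bytes(cache_limit, container_limit) / GIB)
```

With the values above the headroom works out to 2 GiB, which is the slack the process has for everything that is not segment cache.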

The catch (don't forget this one)

These env vars are only applied at container creation. docker compose restart chromadb is not enough — you need:

docker compose up -d --force-recreate --no-deps chromadb

We learned this the second time we changed limits while debugging, watching RSS climb again wondering why the fix had stopped working. It hadn't — the new env never got picked up. If you change the limits, always recreate, not restart.
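One way to catch this early is to check what the running container actually received. The sketch below parses the output of `docker inspect --format '{{json .Config.Env}}' <container>`, which prints a JSON list of KEY=VALUE strings; `REQUIRED`, `parse_container_env` and `stale_vars` are hypothetical helper names.

```python
import json

# The values we expect the container to have been created with.
REQUIRED = {
    "CHROMA_SEGMENT_CACHE_POLICY": "LRU",
    "CHROMA_MEMORY_LIMIT_BYTES": "10737418240",
}


def parse_container_env(inspect_json: str) -> dict:
    """Turn the JSON list of "KEY=VALUE" strings into a dict."""
    env = {}
    for item in json.loads(inspect_json):
        key, _, value = item.partition("=")
        env[key] = value
    return env


def stale_vars(inspect_json: str, required: dict = REQUIRED) -> dict:
    """Map each required variable that is missing or different to the value actually seen."""
    env = parse_container_env(inspect_json)
    return {k: env.get(k) for k, v in required.items() if env.get(k) != v}
```

An empty result means the container was created with the expected settings; anything else means it still needs a recreate.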

Why this isn't on the docs landing page

Most ChromaDB benchmarks and getting-started guides assume one big collection — the documented happy path. If you're per-user or per-session partitioning (multi-tenant SaaS, per-conversation memory, per-document RAG silos), you hit cache-and-eviction behaviour the docs don't warn about. The issues are real and open in the repo; the docs just haven't caught up.

This isn't a knock on the team — 0.5 was a big jump and they're shipping fast. It's just a heads-up that if your workload is "many small collections," your config has to be different from the tutorial.

Lessons

"It's a leak" is usually "it's a cache without an eviction policy." Read your dependency's cache config before chasing valgrind ghosts.
Many-small-collections is not the documented happy path. Per-user/per-session partitioning needs a config nobody's tutorial mentions.
Check open issues before assuming your config is wrong. #3336 and #5843 are community-known, not docs-known.
Set both env vars together. Without CHROMA_MEMORY_LIMIT_BYTES, the LRU policy has nothing to evict against and effectively no-ops.
Recreate, don't restart, when changing startup env. Standard Docker gotcha, doubly painful when you're debugging memory.

If you're on Chroma 0.5+ with many collections and seeing slow RSS creep — that's almost certainly it. Three lines of YAML, one container recreate, done.

This write-up is from production work at HoneyChat — a Telegram-native AI companion bot where each (character, session) pair gets its own ChromaDB collection for isolated semantic memory. The canonical version (with our other engineering notes) lives at honeychat.bot/en/blog/chromadb-lru-memory-leak-production.

— HoneyChat Engineering

Sources

ChromaDB docs — segment cache, deployment, configuration reference.
ChromaDB issue #3336 — memory leak in segment cache, open.
ChromaDB issue #5843 — many-collections behaviour, open.
Docker Compose: env vs --force-recreate — why restart doesn't pick up new env.
HoneyChat engineering notes: persistent-memory architecture · prompt caching measured.

We Measured LLM Prompt Caching in Production — Same Prompt, 0% to 91% Hit Rates

sm1ck — Thu, 28 May 2026 08:21:47 +0000

We run an AI companion bot. Every chat turn, the model sees the same ~5K-token prefix — character persona, content-tier rules, formatting guardrails, a memory blob — plus one new user line. Without caching, we pay for those 5K input tokens every single turn. So we turned on prompt caching across the providers we route through, measured it, and the spread was bigger than any of the marketing pages prepared us for.

Here's the table that survived four weeks in production, plus the one gotcha that ate two weeks before we figured it out.

The hit-rate table

Provider / model	Hit rate	Latency Δ	Notes
Cydonia (via OpenRouter)	91 %	−43 %	Just works, no marker needed
Gemini 3.1 Flash Lite	75 %	−49 %	Requires `cache_control` marker
Grok (xAI)	51 %	−40 %	"Sticky" — best on active sessions
Same code, 600-token test prompt	0 %	0 %	Methodology bug — see below

The first three rows share the exact same 5K-token system prefix and the same 10 follow-up turns; the last row is the short-prompt probe explained below. Wildly different cache behaviour.

The marker that "didn't matter" (until it did)

Most OpenAI-compat examples skip any cache hint and assume the provider figures it out from prefix repetition. Some do. Anthropic-style routes — and anything going through OpenRouter that supports cache_control — don't:

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,          # the long, stable prefix
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": user_msg},      # the only volatile part
]

Cydonia caches without it. Grok caches without it.

Gemini 3.1 Flash Lite caches at exactly 0 % without it. The same model jumps to 75 % with one extra field on the last cacheable content block.

We had Gemini 3.1 routed in production for a week showing zero cache reads in usage. Concluded the model "just didn't support caching." It does — we were calling the API the way every other model wanted to be called. Cost of including the marker on providers that ignore it: zero. Cost of skipping it on a provider that needs it: your entire spend on that route.

Why our first "no, it doesn't cache" test was wrong

Before we caught the marker thing, we'd already wrongly concluded a couple of models "don't cache" — because we'd tested with the wrong prompt.

The first probe was a ~600-token prompt repeated 10 times. Cache reads: zero, across every provider. Conclusion: this provider doesn't cache.

Conclusion: wrong. Most providers have a minimum prefix length before caching kicks in (≥ 1K tokens for some routes, closer to ≥ 4K for others). Below that floor, you pay full price even though the prompt repeats verbatim. The cache simply doesn't engage.

The corrected probe:

Prefix ≥ 5K tokens, shaped like real production (system prompt + persona + retrieved memory).
10 identical follow-up turns, fresh request each time.
For Anthropic-style providers, include the cache_control marker on the last cacheable content block.
Read usage.cache_creation_input_tokens and usage.cache_read_input_tokens (or the provider's equivalent) back — don't trust round-trip latency alone.

Once we did that, every "broken" provider started reporting cache reads.
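One way to turn those usage fields into a single number is sketched below. It assumes Anthropic-style semantics, where `input_tokens` counts only the uncached remainder, and it defines hit rate as cached-read tokens over all input tokens. Other definitions are possible, and `cache_hit_rate` is an illustrative helper.

```python
def cache_hit_rate(usages: list) -> float:
    """Share of input tokens served from cache across a list of usage dicts."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    created = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    uncached = sum(u.get("input_tokens", 0) for u in usages)
    total = read + created + uncached
    return read / total if total else 0.0


if __name__ == "__main__":
    # Made-up probe: the first turn writes the 5K prefix, nine follow-ups read it back.
    first = {"input_tokens": 20, "cache_creation_input_tokens": 5000}
    follow_up = {"input_tokens": 20, "cache_read_input_tokens": 5000}
    print(cache_hit_rate([first] + [follow_up] * 9))
```

Providers that report a different field (for example a `cached_tokens` count) need a small adapter in front of this, but the ratio is the same idea.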

What "sticky" caching looks like (Grok)

Grok was the weird one. Hit rate 51 % — lower than Cydonia and Gemini — but the cache survived longer between calls. Other providers behaved like a ~5-minute ephemeral cache; Grok looked more like a hot-window-then-slow-decay curve. Practical consequence: Grok did better than its hit rate suggested when the same user kept chatting actively, and worse when they came back hours later.

Lesson — a single hit-rate number per provider lies a little. The shape (how it decays, how it warms) matters as much as the headline percentage when your traffic is bursty.
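To see the shape rather than the headline number, hit rate can be bucketed by how long the session was idle before each call. A minimal sketch, assuming you log the idle gap in seconds and whether the call read from cache; the bucket edges and function names are arbitrary illustrations.

```python
from collections import defaultdict


def bucket_label(gap_s: float, edges=(60, 300, 3600)) -> str:
    """Name the first bucket whose upper edge covers the gap."""
    for edge in edges:
        if gap_s <= edge:
            return f"<={edge}s"
    return f">{edges[-1]}s"


def hit_rate_by_gap(calls, edges=(60, 300, 3600)) -> dict:
    """calls: iterable of (seconds_since_previous_call, was_cache_hit) pairs."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for gap_s, was_hit in calls:
        label = bucket_label(gap_s, edges)
        totals[label] += 1
        hits[label] += int(was_hit)
    return {label: hits[label] / totals[label] for label in totals}
```

A provider with a hard expiry shows a cliff between two buckets; a "sticky" one shows a gradual slope.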

What it actually saved

We route turns through different model tiers depending on the user's plan. After caching landed and the marker was wired in everywhere it was needed:

Cached input tokens are billed at roughly 10 % of normal price (provider-dependent, sometimes lower).
Cost per turn on the heavy-tier routes dropped about 40–45 %, consistent with the hit rates above.
End-to-end latency dropped 40–49 %, which users actually notice — the typing-dots animation snapping back faster feels like a different product.
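The relationship between hit rate and savings is simple enough to write down. This sketch assumes cached tokens cost a fixed fraction of the normal input price and ignores any surcharge for cache writes; `input_share`, the part of a turn's cost that comes from input tokens, is a parameter you would have to measure yourself.

```python
def input_cost_savings(hit_rate: float, cached_price_ratio: float = 0.10) -> float:
    """Fraction of input-token spend saved when `hit_rate` of input tokens
    are billed at `cached_price_ratio` of the normal price."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * (1.0 - cached_price_ratio)


def turn_cost_savings(hit_rate: float, input_share: float,
                      cached_price_ratio: float = 0.10) -> float:
    """Scale the input-side saving by the share of a turn's cost that is input tokens."""
    return input_share * input_cost_savings(hit_rate, cached_price_ratio)
```

At a 51 % hit rate the input-side saving is about 46 %; output tokens are never cached, so the saving on the whole turn is always somewhat lower than the input-side figure.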

The pleasant surprise was that latency mattered to retention more than cost mattered to the P&L. Cheaper turns are nice; faster replies are felt.

Lessons we'd pin to the wall

Test with a production-shaped prompt. Short toy prompts will tell you caching doesn't work on providers where it works fine. The minimum-prefix floor is real and silent.
Read provider-specific cache hints. Anthropic-style cache_control is required on some routes (Gemini 3.1 line via OpenRouter, in our case) and ignored by others. Always send it.
Verify with usage fields, not vibes. cache_read_input_tokens doesn't lie. End-to-end latency can: TTFB swings add enough noise to bury the cache signal.
One hit-rate per provider lies a little. The decay curve matters more than the headline number for bursty vs. steady chat patterns.
Re-probe quarterly. Providers ship cache changes silently. The 75 % on Gemini 3.1 Flash Lite is a 2026 number; the same model gave us 0 % earlier this year, before the marker was wired in.

If you're running an AI app where the system prompt dwarfs the user input — companion bots, RAG with chunky retrieved context, agentic loops — you almost certainly leave 40 % of your bill and half a second of latency on the table by trusting the defaults. The marker is one line. The corrected methodology is one afternoon.

If you've got hit-rate numbers from a different routing setup (Bedrock, Fireworks, Together, direct Anthropic), drop them in the comments — curious how the marker situation compares outside the OpenRouter ecosystem.

This write-up is from production work at HoneyChat — a Telegram-native AI companion where the system prompt is the load-bearing wall (persona + content tier + memory blob = the whole 5K). The canonical version of this post lives at honeychat.bot/en/blog/llm-prompt-caching-in-production.

— HoneyChat Engineering

Sources

Anthropic — Prompt caching — cache_control field reference, ephemeral cache, billing rates.
OpenAI — Prompt caching — automatic caching, minimum prefix length, cached_tokens in usage.
Google — Context caching — Gemini API caching, supported models.
OpenRouter — Prompt caching — provider-specific cache passthrough, Anthropic-style marker support.
HoneyChat engineering notes: LLM routing per tier on OpenRouter · Persistent-memory companion architecture.

IP-Adapter + LoRA for product catalog rendering — putting shop items on AI characters

sm1ck — Sat, 25 Apr 2026 02:35:59 +0000

📦 Runnable workflow: github.com/sm1ck/honeychat/tree/main/tutorial/04-ipadapter — a ComfyUI workflow.json (with <tune> placeholders for IP-Adapter weight/end_at) plus a stdlib Python client that posts it to your ComfyUI instance and saves the output.

In the previous post I argued that LoRA per character is often the strongest fit for visual identity. But what happens when you want to render that character wearing a specific item — a shop product, a user-uploaded outfit, a gift from another user?

LoRA helps stabilize the character. To also preserve an arbitrary reference image, IP-Adapter is a common fit. Those two techniques can compete unless you configure them carefully.

TL;DR

LoRA stabilizes the character's face. IP-Adapter pulls features from a reference image. If both are too strong late in sampling, the face can drift toward the reference.
Balance: moderate IP-Adapter weight (lower half of 0–1) with early handoff (IP-Adapter releases control before the final denoising steps). The final steps belong to the LoRA.
A useful node order: Checkpoint → LoRA → FreeU → IP-Adapter → KSampler. Feeding IP-Adapter into the model conditioning after LoRA lets LoRA reassert on late steps.

Render your first outfit preview

This section walks you from clone to a generated image in under ten minutes.

1. Prereqs

A running ComfyUI instance (local GPU, rented box, or a friend's)
ComfyUI_IPAdapter_plus installed in it
ip-adapter-plus_sdxl_vit-h.safetensors in models/ipadapter/
CLIP-ViT-H-14-laion2B-s32B-b79K.safetensors in models/clip_vision/
Your own SDXL base checkpoint
A character LoRA — if you don't have one, go through the previous article first

2. Clone and install the client

git clone https://github.com/sm1ck/honeychat
cd honeychat/tutorial/04-ipadapter
pip install -e .

3. Put your outfit reference next to the client

Anything flat-lay, clean-background works best. ./my-dress.png for this example.

4. Run — start at the middle of both tuning ranges

export COMFY_URL=http://localhost:8188
export REFERENCE_IMAGE=./my-dress.png
export CHECKPOINT=your-sdxl-base.safetensors
export LORA=your-character-v1.safetensors
export IPADAPTER_WEIGHT=0.4      # lower half of 0–1
export IPADAPTER_END_AT=0.8      # upper half of 0–1

python client.py

Output lands in ./out/outfit_preview_<n>.png. First run should usually show your character wearing something that resembles the reference dress.

5. Tune

Inspect the output. Two failure modes tell you how to adjust:

Face drifted → lower IPADAPTER_WEIGHT or lower IPADAPTER_END_AT by 0.05 and re-run.
Item doesn't resemble the reference → raise IPADAPTER_WEIGHT by 0.05, or raise IPADAPTER_END_AT slightly.

Sweep in 0.05 steps, not 0.1. The usable range can be narrower than expected, and a new base model may take several tuning sweeps before the balance feels stable.

6. Validate the workflow JSON with pytest

pip install -e ".[dev]"
pytest -v

Five tests make sure workflow.json stays valid JSON, every node class is still referenced, and <tune> placeholders haven't been accidentally committed with real values.

The problem

You have a character (Anna) stabilized by a custom LoRA. She appears reasonably consistent across generations. Now the user buys a specific dress in your shop. The dress is a reference image. You want:

Anna's face — unchanged.
This specific dress — rendered faithfully on Anna.

Prompt engineering usually can't guarantee this. "Anna wearing a red silk dress with a white collar" generates a red silk dress, not necessarily this red silk dress. SKU-level fidelity needs the reference image in the generation path.

Why naive IP-Adapter breaks the character

IP-Adapter pulls features from a reference image into the model's cross-attention. If you set it too high, it can preserve the reference image aggressively — including its face, if there is one. Even if the reference is an unworn product shot, IP-Adapter can pull in lighting, backdrop, and styling from the reference photo.

At high weight: Anna's face may start looking more like whoever (or whatever) is in the reference. Lighting and pose can bias toward the reference.

At low weight: The character is fine. The dress is approximately the right color and cut but not recognizable as this dress. Your product catalog becomes decorative rather than accurate.

The balance: moderate weight + early handoff

The two knobs that matter are weight and end_at.

Weight — the multiplier on IP-Adapter's contribution to cross-attention. Below the lower-middle of the 0–1 range, the reference is a "mood" more than a fact. Above the upper-middle, the reference dominates. Somewhere in the lower half is where you find the range that preserves item identity without killing face identity.

end_at — the fraction of denoising steps during which IP-Adapter is active. If it runs through all steps, it has a say in the final face details. If it ends earlier (say 70–90% of the way through), the last steps belong to the rest of the pipeline, and LoRA face features reassert.

In rough terms: the item gets baked in during the middle of denoising, the face re-sharpens at the end.
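A quick way to picture the handoff is to count steps. The sketch below assumes end_at maps linearly onto the sampler's step count, which is only an approximation (ComfyUI may resolve the fraction against the noise schedule rather than the step index); `ipadapter_step_split` is a hypothetical helper.

```python
def ipadapter_step_split(total_steps: int, end_at: float) -> tuple:
    """Return (steps with IP-Adapter active, steps left to the LoRA-only model).

    Assumes end_at maps linearly onto the step count, which is an approximation.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 <= end_at <= 1.0:
        raise ValueError("end_at must be between 0 and 1")
    active = round(total_steps * end_at)
    return active, total_steps - active
```

With 30 steps and end_at=0.8, roughly 24 steps see the reference image and the last 6 run on the LoRA-modified model alone.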

Workflow node order (ComfyUI)

[Checkpoint Loader]
  → [LoRA Loader: character_lora]
    → [FreeU: quality touch-up]
      → [IPAdapter Advanced: reference, weight=W, end_at=E]
        → [KSampler]
          → [VAE Decode]

Two things about this order:

LoRA comes before IP-Adapter in the chain. The LoRA modifies the checkpoint weights; IP-Adapter modifies cross-attention during sampling. When IP-Adapter switches off at the end_at fraction of the schedule, the remaining steps run on the LoRA-modified weights without IP-Adapter influence, which is what lets the face reassert.
FreeU is optional. It re-weights the U-Net's backbone and skip-connection features, which can improve quality without retraining and with negligible extra compute.

The tutorial client takes the base workflow.json, rewrites the <tune> placeholders with env-supplied values, uploads the reference image to ComfyUI, and queues the prompt:

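import argparse
import json
import time
from typing import Any


def rewrite_workflow(wf: dict[str, Any], args: argparse.Namespace, ref_filename: str) -> dict[str, Any]:
    """Fill in the `<tune>` and `<path>` placeholders with actual values."""
    wf = json.loads(json.dumps(wf))  # deep copy, so the loaded template is never mutated

    # Keys "1", "2", ... are the node ids used in workflow.json.
    if args.checkpoint:
        wf["1"]["inputs"]["ckpt_name"] = args.checkpoint
    if args.lora:
        wf["2"]["inputs"]["lora_name"] = args.lora
    # One strength value drives both the model and the CLIP side of the LoRA.
    wf["2"]["inputs"]["strength_model"] = args.lora_strength
    wf["2"]["inputs"]["strength_clip"]  = args.lora_strength
    wf["5"]["inputs"]["image"] = ref_filename   # reference image, already uploaded to ComfyUI
    wf["6"]["inputs"]["weight"] = args.weight   # IP-Adapter tuning knobs
    wf["6"]["inputs"]["end_at"] = args.end_at
    wf["7"]["inputs"]["text"] = args.prompt
    # Fresh 32-bit seed per run, derived from the clock.
    wf["10"]["inputs"]["seed"] = int(time.time()) & 0xFFFFFFFF
    return wf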

→ full source

The full workflow.json in the tutorial folder ships with <tune> placeholders on every field you should touch. The test suite asserts those placeholders stay in the template — a safety net against accidentally committing your tuned production values.
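The kind of check such a test can make is sketched below. This is not the repo's actual test code; `find_placeholders` is a hypothetical helper that walks any parsed workflow and reports where a `<tune>` marker still sits.

```python
from typing import Any

PLACEHOLDER = "<tune>"


def find_placeholders(node: Any, path: str = "") -> list:
    """Return the dotted paths of every string value that contains the placeholder."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            found += find_placeholders(value, f"{path}.{key}" if path else str(key))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            found += find_placeholders(value, f"{path}[{i}]")
    elif isinstance(node, str) and PLACEHOLDER in node:
        found.append(path)
    return found
```

A template test would assert the list is non-empty for the committed workflow.json, while a pre-queue check in the client would assert it is empty after the rewrite.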

Weight tuning loop

The practical process:

Pick a reference item with a clean product photo.
Pick a character with a strong LoRA.
Render around weight=0.3, end_at=0.8. Check face, check item.
Face drifts → lower weight or lower end_at.
Item doesn't resemble the reference → raise weight carefully, or leave weight and raise end_at.
Sweep in 0.05 increments, not 0.1. The usable range is narrower than you'd expect.

Several tuning sweeps on realistic and anime bases usually land you on a working pair.
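The sweep itself is easy to generate. A minimal sketch, using Decimal so the 0.05 steps stay exact; the concrete ranges below are illustrative starting points, not recommended values.

```python
from decimal import Decimal
from itertools import product


def sweep(start: str, stop: str, step: str = "0.05") -> list:
    """Inclusive range in exact decimal steps, returned as floats."""
    lo, hi, inc = Decimal(start), Decimal(stop), Decimal(step)
    if inc <= 0:
        raise ValueError("step must be positive")
    values = []
    while lo <= hi:
        values.append(float(lo))
        lo += inc
    return values


def tuning_grid(weights: list, end_ats: list) -> list:
    """Every (weight, end_at) pair to render and inspect."""
    return list(product(weights, end_ats))


if __name__ == "__main__":
    grid = tuning_grid(sweep("0.25", "0.40"), sweep("0.70", "0.90"))
    print(len(grid), grid[0], grid[-1])
```

Each pair can then be exported as IPADAPTER_WEIGHT and IPADAPTER_END_AT for one client run, with the seed held fixed so only the two knobs change between renders.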

Production integration

Outfit catalog as reference images. Each shop item has a reference image stored in object storage. At generation time, pass the reference URL to the GPU worker, which downloads it once and caches it.

Catalog pre-rendering for previews. When a user browses the shop, they see a preview of each item rendered on their active character. These previews don't need to happen on every page load — generate them asynchronously (Celery worker), store in S3, serve from cache.

Consistency across image and video. The same IP-Adapter + LoRA pair used for images can often drive the start-frame of video generation (e.g., Kling). Tune the still-image path first, then reuse it carefully.

Fallback when the item isn't visual. Some "items" in a shop are stats buffs, relationship flags, or dialogue unlocks: things without a visual. Only send items flagged as visual through the IP-Adapter pathway.

Production issues that came up

Face drifted on a noticeable slice of catalog previews. The cause: we had raised the IP-Adapter weight "for stronger outfit adherence." We rolled back to the lower-half range after face-drift complaints spiked. Lesson: tune one variable at a time, even when it feels slow.

Cached reference URLs expired. Shop items in S3 had time-limited presigned URLs. Generation workers fetched the URL at queue-time, but the URL expired before ComfyUI actually downloaded it. Fix: pre-fetch on the worker side, pass the ComfyUI-side filename instead of the external URL.

IP-Adapter model version mismatch with SDXL base. IP-Adapter Plus ships multiple weights keyed to specific SDXL base models. Mixing can produce worse output without an obvious runtime error — just lower fidelity. Pin the IP-Adapter version to the base in your deployment config.

Non-visual shop items crashed the workflow. The API tried to render "stat boost" items through the image pipeline. Fix: a visual: true|false flag on catalog entries, checked at the API boundary before queuing.
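A boundary check of that kind can be very small. The sketch below is an assumption about shape, not the production code: `CatalogItem`, `NotRenderable` and `ensure_renderable` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


class NotRenderable(ValueError):
    """Raised at the API boundary for items that cannot go through the image pipeline."""


@dataclass(frozen=True)
class CatalogItem:
    sku: str
    visual: bool
    reference_image: Optional[str] = None  # filename of the reference image, if any


def ensure_renderable(item: CatalogItem) -> CatalogItem:
    """Reject non-visual items before anything is queued."""
    if not item.visual:
        raise NotRenderable(f"{item.sku} has no visual and cannot be rendered")
    if not item.reference_image:
        raise NotRenderable(f"{item.sku} is flagged visual but has no reference image")
    return item
```

Raising at the API boundary means the GPU queue never sees a job it cannot complete, and the caller gets an error it can map to a clean 4xx response.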

What I'd change if starting over

Start with a clean catalog. Reference images with consistent backgrounds, consistent lighting, no model already wearing the item if possible.
Version the tuning. When you move base models, your IP-Adapter weight/end_at values probably move too. Treat them as part of the deployment, not as constants.
Cache the pre-rendered previews aggressively. A character × item grid grows multiplicatively. Pre-render on character creation and on new item add.

Where this lives

HoneyChat's shop renders outfits, accessories, and gifts on active characters using IP-Adapter Plus layered over per-character LoRA. Public architecture doc: github.com/sm1ck/honeychat/blob/main/docs/architecture.md.

References

If you've shipped an IP-Adapter + LoRA combo in production, I'm curious what weight / end_at pairs you landed on and for which base. The sweet spot seems to shift meaningfully between anime and realistic bases.

Character consistency in AI image generation — where prompts break down and LoRA helps

sm1ck — Wed, 22 Apr 2026 12:22:02 +0000

📦 Training template: github.com/sm1ck/honeychat/tree/main/tutorial/03-lora — a generic Kohya SDXL config with <tune> placeholders and a dataset curation guide. No docker-compose (LoRA training is GPU-heavy) — you bring your own GPU or rent one.

Here's a failure mode many AI companion apps run into on launch day: users send two requests in a row for the same character, get two different faces, and conclude the product is broken. They're not wrong to feel that way. Character identity is part of the product.

This post is about why that happens, why the obvious fixes (seed-pinning, more prompt detail, reference images) often don't fully solve it, and what class of solution works better.

TL;DR

Identical seed + identical prompt + different batch size = different face. A seed only reproduces an image when every other sampler setting, batch size included, stays the same.
Prompt detail plateaus fast. Past a certain tag count, the model interpolates anyway.
Reference image (IP-Adapter) works but can bleed stylistic features — outfit, lighting, background — into generations where you only wanted identity.
Custom LoRA per character makes identity much more stable by encoding it at the weights level instead of relying only on prompt text.

Train your own character LoRA — the short walkthrough

LoRA training is GPU-heavy and doesn't belong in a docker-compose, so the tutorial folder at tutorial/03-lora ships the config template and recipe. You bring the GPU.

1. Get a GPU

24 GB VRAM (RTX 3090/4090) fits SDXL LoRA at batch size 2–4 comfortably. Don't own one? Rent a spot — Vast.ai, RunPod, Modal, Paperspace, Lambda. A full training run costs a few dollars.

2. Install Kohya_ss

git clone https://github.com/bmaltais/kohya_ss ~/kohya_ss
cd ~/kohya_ss && ./setup.sh

3. Grab the template

git clone https://github.com/sm1ck/honeychat
cp -r honeychat/tutorial/03-lora ./my-character-lora
cd my-character-lora

4. Prepare your dataset

Drop 15–30 varied images of your subject into dataset/train/5_character/ (the 5_ is the repeat count). For each image, create a same-named .txt caption describing the scene — not the character. See dataset/README.md for the full curation checklist.
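Before training, it is worth checking that every image really has its caption. A minimal sketch, assuming the folder layout described above; the suffix list and helper names are illustrative.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def missing_captions(folder: Path) -> list:
    """Names of images in `folder` that have no same-named .txt caption."""
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES and not p.with_suffix(".txt").exists()
    )


def repeat_count(folder: Path) -> int:
    """Parse the Kohya-style repeat prefix, e.g. 5 from "5_character"."""
    return int(folder.name.split("_", 1)[0])


if __name__ == "__main__":
    train_dir = Path("dataset/train/5_character")
    if train_dir.is_dir():
        print(repeat_count(train_dir), missing_captions(train_dir))
```

An empty list from `missing_captions` means every image has a caption file; it says nothing about caption quality, which still needs a human read-through.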

5. Fill the <tune> slots in kohya-config.toml

Every hyperparameter is a placeholder you pick based on your dataset and base model. Read the inline comments, then replace each <tune> with a real value. The safety check in train.sh will refuse to run if any placeholder remains.

6. Train

export KOHYA_DIR=~/kohya_ss
bash train.sh

The checkpoint lands at ./output/<your-character>.safetensors. Load it into ComfyUI or Diffusers like any other SDXL LoRA. Generate a test grid, iterate, retrain if needed.

Why "same prompt, same face" doesn't hold

Users naturally assume this works:

"anime girl, long silver hair, green eyes, Arknights operator outfit"
+ seed=12345
→ Anna, always. Or so it seems.

Not reliably. Three reasons.

Batch size changes the output. In many Stable Diffusion setups, batch_size=1 and batch_size=4 with the same seed produce different images for position 0, because the initial noise is drawn for the whole batch and the RNG output can depend on the batch dimension.

Provider-side sampler drift. If you're calling a managed API (fal.ai, Replicate, Together), provider-side changes — model updates, sampler tweaks, default parameter shifts — can produce visually different outputs across weeks. Your "locked" character can drift.

Prompt detail saturates. At some point, adding more tags ("sharp nose, high cheekbones, narrow eyes, specific mole position") stops helping much. The model has a rough template and interpolates within it.

The in-between fix that doesn't quite work: IP-Adapter

IP-Adapter lets you pass a reference image alongside the prompt. The model bakes the reference's features into the cross-attention. For product photography, excellent.

For character identity, it has a practical drawback: IP-Adapter can carry stylistic baggage. A reference photo with specific lighting, pose, outfit, and background can bleed those into the generated image. You can turn the weight down, but then identity may weaken; turn it up, and the reference can dominate.

IP-Adapter is a good fit when the reference is what you want preserved (e.g., rendering a shop item on a character — next post in the series). It's usually a poor fit when what you want preserved is only the face.

The solution: custom LoRA per character

A LoRA (Low-Rank Adaptation) is a small set of additional weights layered on top of a base model. A character-specific LoRA trained on a curated dataset — consistent face, varied pose/outfit/lighting — encodes the identity into the weights themselves, not into the prompt.

Inference pipeline:

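from types import SimpleNamespace

char = SimpleNamespace(lora="anna_v1.safetensors")  # stand-in character record; the filename is made up

workflow = [
    "Checkpoint",           # base SDXL model
    f"LoRA: {char.lora}",   # the character's custom LoRA
    "FreeU",                # quality touch-up
    "KSampler",             # actual diffusion
]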

Now Anna is much more likely to stay Anna across pose, outfit, and lighting changes. The face is represented in the weights, not only in the words.

Training a character LoRA (public-friendly template)

The conceptual shape of the training job using the publicly available Kohya_ss SDXL trainer:

# Kohya_ss SDXL LoRA training config — generic template
# Replace every <tune> value based on your dataset and base model.
# See Kohya docs for the full parameter reference.

[model_arguments]
pretrained_model_name_or_path = "<path/to/sdxl-base-or-finetune.safetensors>"

[dataset_arguments]
train_data_dir = "./dataset/train"
resolution     = "1024,1024"
caption_extension = ".txt"

[training_arguments]
output_dir      = "./output"
output_name     = "<your_character_v1>"
save_model_as   = "safetensors"

# Training steps and batch — VRAM-bound. Tune for your hardware.
learning_rate    = "<tune>"
max_train_steps  = "<tune>"
train_batch_size = "<tune>"

[network_arguments]
network_module = "networks.lora"
network_dim    = "<tune>"
network_alpha  = "<tune>"

→ full template on GitHub

The parameters that matter — LR, step count, rank, alpha, dataset size — are subject-dependent. Anime faces converge differently than realistic faces. There is no universal "best" setting.

What to actually optimize for:

Dataset quality over dataset size. 20 clean, varied, captioned images beat 100 messy ones.
Varied pose and lighting, constant face. Same angle 30 times teaches "this angle," not "this character."
Clean captions — describe the scene, not the character. "Woman standing in a garden" is better than "Anna standing in a garden" because you want the model to learn the face from context, not from the token.
Enough rank for face detail, and no more. A rank that is too low underfits the identity; one that is too high overfits and kills flexibility.

Marginal cost: usually manageable

If you're running inference on a rented or owned GPU, training one character LoRA is a one-time cost usually measured in minutes to hours of GPU time, depending on dataset and settings. Inference with the LoRA attached often adds little overhead compared with the base generation. At scale, the per-character cost is dominated by dataset curation rather than training compute.

This is why a LoRA-per-character pipeline can be viable for products with many characters: once the pipeline exists, adding a new character is mostly a dataset and QA exercise, not a research project.

Production concerns

LoRA hot-swapping. Load the base checkpoint once, swap LoRAs per request. ComfyUI and Diffusers both support this natively.

Dataset hygiene. LoRAs memorize whatever's in the dataset. Enforce licensing upstream — the LoRA is downstream of the decision.

Storage at scale. LoRA file size depends on base model and rank; expect anything from a few MB to much larger checkpoints. Object storage + hot-LoRA pinning on inference workers keeps latency down.

Face ≠ body. A LoRA trained on face crops will not lock body proportions. Include full-body shots in the dataset if you need full-body consistency.

What I'd change if starting over

Ship the LoRA pipeline from day 1, even for three characters. Inconsistent visuals in the free tier can hurt activation before users ever see the stronger parts of the product.
Curate datasets manually, don't scrape. Five iterations of a hand-picked set of 20 images beat a scraped 200.
Store base-model version with each LoRA. When you update the base, you need to know which LoRAs need retraining.
Version LoRAs (v1, v2) and keep old versions live. If v2 ships with a regression, roll back per-character without reverting a whole release.
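
The last two points (store the base-model version, keep old LoRA versions live) can be sketched as a small registry. Everything here is hypothetical: the `LoraVersion` and `LoraRegistry` names, the fields, and the storage paths are illustrative, and a real system would persist this in a database rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LoraVersion:
    character: str
    version: int
    base_model: str   # base checkpoint this LoRA was trained against
    path: str         # hypothetical object-storage key


@dataclass
class LoraRegistry:
    versions: list[LoraVersion] = field(default_factory=list)
    live: dict[str, int] = field(default_factory=dict)   # character -> live version

    def publish(self, lora: LoraVersion) -> None:
        """Register a new version and make it live; older versions stay registered."""
        self.versions.append(lora)
        self.live[lora.character] = lora.version

    def rollback(self, character: str, to_version: int) -> None:
        """Point one character back at an older version without touching the others."""
        known = any(
            v.character == character and v.version == to_version for v in self.versions
        )
        if not known:
            raise KeyError(f"{character} v{to_version} is not registered")
        self.live[character] = to_version

    def needs_retraining(self, current_base: str) -> list[str]:
        """Characters whose live LoRA was trained against a different base model."""
        stale = [
            v.character
            for v in self.versions
            if self.live.get(v.character) == v.version and v.base_model != current_base
        ]
        return sorted(stale)
```

After a base-model update, `needs_retraining` answers the "which LoRAs need retraining" question directly, and `rollback` reverts a single character without reverting a release.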

Where this lives

HoneyChat uses custom LoRA per character for visual identity in image and video generation. Public architecture: github.com/sm1ck/honeychat.

Previous: LLM routing per tier via OpenRouter.
Next: IP-Adapter Plus for a product catalog — how to put arbitrary shop items on a character while keeping the character's face locked.

If you've trained character LoRAs in production and have opinions on rank selection or caption strategy, I'd love to hear them in the comments. There's very little public writing on this outside the anime generation community.

LLM routing per tier via OpenRouter — when one model doesn't fit all

sm1ck — Tue, 21 Apr 2026 23:50:29 +0000

📦 Full runnable example: github.com/sm1ck/honeychat/tree/main/tutorial/02-routing — docker compose up exposes POST /complete on localhost:8000. Every snippet below is pulled from that repo.

Most introductory "chat with AI" tutorials pick one model and call it a day. That works in a toy. It stops being enough in production, where users have different price sensitivity, different conversation styles, and different expectations for what the product should allow.

Here's how to route LLM calls across a handful of providers via OpenRouter, how that routing handles finish_reason=content_filter empty-completion edge cases, and the fallback chain pattern that keeps replies flowing.

TL;DR

Route by tier (price elasticity) and by content mode (what kind of turn this is). A single default model can't do both.
Some reasoning-model/provider combinations can return finish_reason=content_filter with empty content on borderline turns. A retry policy that only catches HTTP errors will miss this.
The working pattern: primary → different-provider fallback → specialized last resort, with retries triggered by both error responses and suspicious empty completions.

Run it yourself in 3 minutes

1. Clone and configure

git clone https://github.com/sm1ck/honeychat
cd honeychat/tutorial/02-routing
cp .env.example .env

Open .env and paste your OPENROUTER_API_KEY (get one at openrouter.ai/keys). The three default model slots all point to free-tier OpenRouter models so you can experiment without spending.

2. Start the service

docker compose up --build -d
curl http://localhost:8000/health   # {"ok":true}

3. Send a normal turn — primary answers

curl -X POST http://localhost:8000/complete \
  -H 'content-type: application/json' \
  -d '{"messages":[{"role":"user","content":"Name three cold-climate fruits."}]}' \
  | jq

Expected response:

{
  "content": "Apples, pears, and cloudberries...",
  "model": "meta-llama/llama-3.1-8b-instruct:free",
  "attempt": 0,
  "used_fallback": false
}

attempt: 0 means the primary model answered. used_fallback: false means no retry was needed.

4. Force a fallback

Override the primary to point at a model you know tends to refuse — or any bogus model name — and watch the chain kick in:

curl -X POST http://localhost:8000/complete \
  -H 'content-type: application/json' \
  -d '{"messages":[{"role":"user","content":"Say hi"}],"primary":"this/model-does-not-exist"}' \
  | jq '.model, .attempt, .used_fallback'

attempt: 1 (or 2) — the next rung answered. In production, log this metric: a rising fallback rate on a class of content means it's time to move the content to a different primary, not to tweak retry logic.

5. Run the unit tests

pip install -e ".[dev]"
pytest -v

Seven tests cover the failure modes in this chain — content_filter with empty content, transient 5xx, non-transient 4xx, all-models-fail.

With the service running and the tests green, the rest of this post explains why the chain is shaped this way.

Why one model doesn't fit all

Three distinct pressures push against a single-model setup:

Price elasticity by tier. A free user sending 20 messages a day at flagship-model prices costs real money every month for zero revenue. A paying top-tier user sending the same 20 messages may reasonably expect higher quality. The unit economics of the two tiers pull in opposite directions.

Content mode. Mainstream-aligned models can refuse content that some legitimate companion/roleplay products allow on paid tiers. Conversely, less-restrictive models can have weaker long-context coherence. The right model depends on the turn.

Latency vs. depth. Instant conversational turns need sub-3-second responses. Long scene-writing turns can tolerate 10+ seconds for better prose. Hardcoding a single model optimizes for one and sacrifices the other.

The reasoning-model empty-completion edge case

This is the one that cost me a full afternoon to diagnose.

Some reasoning-class model/provider combinations do server-side moderation or filtering before returning a final answer. On borderline turns, they may not return an HTTP error. Instead, they can return a valid response with:

{
  "choices": [{
    "finish_reason": "content_filter",
    "message": { "content": "" }
  }]
}

Empty string. No exception. No status code to check. If you don't guard for it, your user sees a blank reply.

If your retry logic only triggers on httpx.HTTPStatusError, this can pass through.

The guard

The whole failure mode is caught by a tiny function:

def _is_silent_refusal(choice: dict) -> bool:
    """
    The whole point of this post: reasoning models can return a successful
    HTTP response with finish_reason=content_filter AND an empty content.
    If you only check HTTP status, you ship blank replies to users.
    """
    reason = choice.get("finish_reason")
    content = choice.get("message", {}).get("content") or ""
    # "length" is checked as well: an empty completion that stopped at the
    # token limit is just as unusable as a filtered one.
    return reason in ("content_filter", "length") and not content.strip()

→ full source

Resilient fallback chain

from collections.abc import Iterable

import httpx


async def complete(
    messages: list[dict],
    *,
    primary: str | None = None,
    chain: Iterable[str] | None = None,
) -> CompletionResult:
    """Run the fallback chain. Return the first usable response."""
    models = list(chain) if chain is not None else _build_chain(primary)

    async with httpx.AsyncClient() as client:
        for attempt, model in enumerate(models):
            try:
                data = await _call(client, model, messages)
            except httpx.HTTPStatusError as e:
                # Transient status codes: move on to the next model in the chain.
                if e.response.status_code in TRANSIENT_CODES:
                    continue
                # Anything else is not retryable here; surface it to the caller.
                raise
            except (httpx.ReadTimeout, httpx.ConnectError):
                continue

            # A 200 response can still be unusable: check the first choice.
            choice = (data.get("choices") or [{}])[0]
            if _is_silent_refusal(choice):
                continue

            # Empty content with any other finish_reason is also a failure.
            content = choice.get("message", {}).get("content") or ""
            if not content.strip():
                continue

            return CompletionResult(content=content, model=model, attempt=attempt)

    raise AllModelsFailedError(f"no model returned usable content; tried {models}")

→ full source

Two details worth calling out:

Empty content check is separate from the finish reason. Some models can return finish_reason=stop with empty content when they refuse. Always check not content.strip().
Track which model ultimately answered. Log attempt > 0 as a fallback event. If your primary fails 10% of the time on a class of content, that's a routing decision, not a retry problem — move that content to a different primary.
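
The second point, treating a persistent fallback rate as a routing signal, could be tracked with something as small as the sketch below. `FallbackStats`, the content-class labels, and the 10% default threshold (taken from the example figure above) are illustrative assumptions; in production these counters would more likely live in your metrics system.

```python
from collections import defaultdict


class FallbackStats:
    """Count how often each content class needed a fallback (attempt > 0)."""

    def __init__(self) -> None:
        self._total: dict[str, int] = defaultdict(int)
        self._fallback: dict[str, int] = defaultdict(int)

    def record(self, content_class: str, attempt: int) -> None:
        self._total[content_class] += 1
        if attempt > 0:
            self._fallback[content_class] += 1

    def rate(self, content_class: str) -> float:
        total = self._total.get(content_class, 0)
        if total == 0:
            return 0.0
        return self._fallback.get(content_class, 0) / total

    def needs_rerouting(self, threshold: float = 0.10) -> list[str]:
        """Content classes whose fallback rate is at or above the threshold."""
        return sorted(c for c in self._total if self.rate(c) >= threshold)
```

Call `record(content_class, result.attempt)` after each completion and review `needs_rerouting()` periodically; a class that shows up there is a candidate for a different primary.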

Picking the fallback order

For a permissive roleplay mode, the shape looks like this:

content-mode primary   → first model for this type of turn
  ↓ (on failure / empty)
diff-provider fallback → avoids the same upstream failure mode
  ↓
specialized last resort
  ↓
abort — ask the user to try a shorter or clearer prompt

The ordering rule: different-provider fallbacks. If the primary is hosted on provider A and fails for a provider-side reason, prefer a fallback hosted on provider B. Same-provider fallbacks can fail on the same content because the provider's moderation layer may be upstream of the model. OpenRouter makes this easier because each model's provider metadata is visible.
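
A minimal sketch of that ordering rule, assuming you maintain your own model-to-provider mapping. The `order_fallbacks` helper, the model names, and the provider labels are all hypothetical.

```python
def order_fallbacks(
    primary: str,
    candidates: list[str],
    provider_of: dict[str, str],
) -> list[str]:
    """Return [primary, *fallbacks] with different-provider fallbacks first."""
    primary_provider = provider_of.get(primary)
    others = [m for m in candidates if m != primary]
    # list.sort is stable, so candidates keep their relative order within each
    # group. The key is False for a different provider and True for the same
    # provider, and False sorts before True.
    others.sort(key=lambda m: provider_of.get(m) == primary_provider)
    return [primary, *others]


# Hypothetical mapping; in practice this comes from provider metadata.
PROVIDERS = {
    "model-a": "provider-a",
    "model-b": "provider-a",
    "model-c": "provider-b",
    "model-d": "provider-c",
}

chain = order_fallbacks("model-a", ["model-b", "model-c", "model-d"], PROVIDERS)
```

Here `model-b` shares a provider with the primary, so it ends up last in `chain`. A model with no known provider is treated as same-provider when the primary's provider is also unknown, which is the conservative choice.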

Content-level gating happens before the LLM, not after

The fallback chain handles model-level refusals. But if the user's intent is clearly above your product's content ceiling, retrying on a more permissive model just burns extra tokens before the user hits the real limit. Gate the content level in your system prompt assembly — don't rely on the model to enforce policy.

Keep the tier-level policy simple: the escalation class (detected from user intent) must be ≤ the user's plan ceiling. If over, the character responds in-character and the bot sends the upsell. The LLM does not need to know the tier exists — it just gets a system prompt with the right constraints for this turn.
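
The ceiling check itself is a one-line comparison. The sketch below assumes three ordered escalation classes and three plans; the level names, plan names, and the `gate` function are placeholders, not HoneyChat's actual tiers.

```python
from enum import IntEnum


class ContentLevel(IntEnum):
    # Hypothetical escalation classes, ordered from least to most restricted.
    L0 = 0
    L1 = 1
    L2 = 2


# Hypothetical plan ceilings.
PLAN_CEILING = {
    "free": ContentLevel.L0,
    "basic": ContentLevel.L1,
    "premium": ContentLevel.L2,
}


def gate(plan: str, detected: ContentLevel) -> tuple[ContentLevel, bool]:
    """Return (level to use when assembling the system prompt, send_upsell)."""
    ceiling = PLAN_CEILING.get(plan, ContentLevel.L0)   # unknown plan: lowest ceiling
    if detected <= ceiling:
        return detected, False
    # Over the ceiling: build the prompt at the ceiling, and let the bot
    # send the upsell separately. The LLM never sees the plan name.
    return ceiling, True
```

The gate runs before any model is called, so an over-ceiling turn never reaches the fallback chain at a level the plan doesn't allow.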

Instrumentation that matters

Log three things per LLM call:

Model that answered (primary or fallback index)
Time to first token vs total time — separates time spent waiting for the model to start (network, queueing, prompt processing) from time spent generating
Token cost (input + output) per message, bucketed by tier

Costs are tracked in Redis counters with a short TTL — a daily sum and a per-user daily sum. A global daily ceiling blocks new generations if spend crosses a configured threshold (fail-closed: if the counter is unreachable, block, don't pass). This helped cap a runaway generation loop at a known ceiling.
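
The fail-closed behavior is the part that is easy to get backwards, so here is a minimal sketch of it. `SpendGuard`, the key format, and the store interface are assumptions for illustration; the store only needs a `get(key)` method, so a plain dict stands in for the Redis client here.

```python
class SpendGuard:
    """Block new generations once daily spend crosses a ceiling; fail closed."""

    def __init__(self, store, daily_ceiling_usd: float) -> None:
        self._store = store                  # hypothetical: anything with get(key)
        self._ceiling = daily_ceiling_usd

    def allow(self, day: str) -> bool:
        try:
            # Hypothetical key format for the global daily counter.
            spent = self._store.get(f"cost:global:{day}") or 0.0
        except Exception:
            return False                     # counter unreachable: block, don't pass
        return float(spent) < self._ceiling


class BrokenStore:
    """Stand-in for a store whose connection is down."""

    def get(self, key: str):
        raise ConnectionError("store unreachable")


guard = SpendGuard({"cost:global:2026-04-21": 4.0}, daily_ceiling_usd=5.0)
down = SpendGuard(BrokenStore(), daily_ceiling_usd=5.0)
```

`guard.allow("2026-04-21")` passes because 4.0 is under the ceiling, while `down.allow(...)` returns `False` regardless of the ceiling. A fail-open guard would do the opposite and let a runaway loop continue exactly when you can no longer see what it costs.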

What I'd change if starting over

Route by content mode from day 1, not as an afterthought. Retrofitting the split into an existing handler is painful.
Instrument the silent-refusal rate. It may be rare, but you won't know unless you measure it specifically.
Don't share a single OpenRouter key across environments. Give each environment its own key with its own credit limit, so dev noise can't eat prod budget and usage stays attributable.
Publish the tier → model map in your public docs. Users comparing products care. Competitors already know. Keeping the docs in sync with the code forces alignment.

Where this lives

HoneyChat's LLM router sits behind the chat handler on both the Telegram bot and the web app. Public architecture: github.com/sm1ck/honeychat/blob/main/docs/architecture.md.

Previous in the series: dual-layer memory with Redis + ChromaDB.
Next: character consistency with custom LoRA.

Curious how others have solved the silent-refusal pattern. If you've hit it on a different provider, drop a comment — I want to know which models ship which behavior.

Building an AI companion with persistent memory — Redis + ChromaDB

sm1ck — Mon, 20 Apr 2026 12:16:42 +0000

📦 Full runnable example: github.com/sm1ck/honeychat/tree/main/tutorial/01-memory — clone, docker compose up, chat with the demo bot on Telegram. Every code snippet below is pulled from that repo.

Most AI chatbots still struggle with reliable, queryable long-term recall. Character.AI has pinned and chat memories, but unpinned details can still fall out of the active conversation context. Replika remembers profile facts, preferences, and generated memories, but that is not the same as semantic recall over the full conversation. Even ChatGPT's Memory is built for useful preferences and details, not verbatim replay of long sessions.

I wanted a chat companion with practical persistent memory — not just the current conversation, but older facts and events surfaced when they matter. Here's the architecture that worked well for this use case.

TL;DR

Hot layer (Redis) — recent messages per conversation, short TTL, low-latency reads.
Cold layer (ChromaDB) holds summaries of chunks, not individual messages. Every N bot turns, a background task summarizes that window via a cheap LLM and stores the summary as a document. Keeps the vector index tiny, queries fast.
On every user message, three retrieval paths fire in parallel via asyncio.gather: recent buffer, latest summary, top-K semantic search. All three get assembled into the system prompt.
Result: substantially fewer tokens than full-history replay, while still making old context retrievable weeks later.

Run it yourself in 5 minutes

Before the architectural deep-dive, boot the demo so you can poke the memory layers live.

1. Clone and enter the folder

git clone https://github.com/sm1ck/honeychat
cd honeychat/tutorial/01-memory

2. Configure two tokens

cp .env.example .env

Open .env and fill:

TELEGRAM_BOT_TOKEN — get it from @BotFather (30 seconds: /newbot, pick a name, copy the token)
OPENROUTER_API_KEY — from openrouter.ai/keys. The default LLM_MODEL is a free-tier Llama 3.1 8B so you don't spend a cent.

3. Start the stack

docker compose up --build -d
docker compose logs -f bot       # watch the bot come alive

Four containers: redis, chromadb, api (FastAPI inspector on localhost:8000), bot (your Telegram bot polling).

4. Talk to your bot

Open it on Telegram, hit /start, chat for 10–20 turns. Tell it things about yourself. Come back later and reference something you said earlier — it'll pull it from ChromaDB.

5. Peek at what each layer holds

# Replace 12345 with your own Telegram user ID (ask @userinfobot)
curl http://localhost:8000/memory/12345/demo/recent  | jq
curl http://localhost:8000/memory/12345/demo/summary | jq

recent shows the raw Redis buffer. summary shows the latest ChromaDB document.

With the demo running, the rest of this post explains what you just booted.

Why rolling summaries alone don't work

A common pattern for chatbot memory is a rolling summary — every N messages, regenerate a compressed version of older context. It's cheap. It's also lossy in a very specific way: nuance dies in repeated compression.

Walk it through three regenerations:

Original message: "She said she hates her boss because he takes credit for her work"
Summary 1: "User mentioned workplace frustration with manager"
Summary 2: "User has job-related stress"
Summary 3: "User has a job"

By the third summary, the reason is gone. A companion bot starts sounding generic. The fix used here: keep raw recent messages verbatim and only summarize chunks that are genuinely old, while being able to semantically retrieve any summary from the full history when the current conversation calls back.

Architecture

Two independent layers. Writes to Redis are synchronous on every turn; writes to ChromaDB are asynchronous, batched. Reads from both happen in parallel on every message.

The hot layer — Redis

Each (user_id, character_id) conversation is stored as a bounded Redis list:

import json
from datetime import datetime, timezone


async def save_message(user_id: int, char_id: str, role: str, content: str) -> None:
    r = get_redis()
    key = f"chat:{user_id}:{char_id}:messages"
    msg = json.dumps({
        "role": role,
        "content": content,
        "ts": datetime.now(timezone.utc).isoformat(),
    })
    pipe = r.pipeline()
    pipe.rpush(key, msg)                           # append the newest message
    pipe.ltrim(key, -HOT_BUFFER_SIZE, -1)          # keep only the last HOT_BUFFER_SIZE entries
    pipe.expire(key, 86400 * HOT_BUFFER_TTL_DAYS)  # refresh the TTL on every write
    await pipe.execute()

→ full source on GitHub

Three things matter here:

ltrim on every write. The list is bounded. Memory per user is O(1), not O(conversation length).
TTL extended on every write. Inactive users' history evicts automatically. Configure Redis with allkeys-lru so overflow evicts instead of refusing writes — noeviction is the default and it's a footgun.
Pipelined writes. rpush + ltrim + expire in one round trip.

The cold layer — ChromaDB with summaries, not messages

A tempting implementation is to embed every message and run semantic search over them. Two problems: the index grows linearly with conversation volume, and individual messages are often too short or context-free to retrieve meaningfully ("yeah" returns a lot of "yeah" matches).

Instead: embed LLM-generated summaries of chunks. Every N bot turns, compress the window via a cheap LLM and write it as one document to a per-(user, character) ChromaDB collection. Ten weeks of active conversation is maybe 30–50 documents per collection, not tens of thousands.
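
The "every N bot turns" trigger and the window it summarizes can be sketched as two small helpers. The value of `SUMMARY_EVERY_N` and both function names are assumptions for illustration; the post does not state the real window size.

```python
SUMMARY_EVERY_N = 10   # hypothetical window size, in bot turns


def should_summarize(bot_turns_total: int, n: int = SUMMARY_EVERY_N) -> bool:
    """True on every n-th bot turn (n, 2n, 3n, ...)."""
    return bot_turns_total > 0 and bot_turns_total % n == 0


def last_window(messages: list[dict], n: int = SUMMARY_EVERY_N) -> list[dict]:
    """Return the tail of `messages` that contains the last n assistant turns."""
    seen = 0
    for i in range(len(messages) - 1, -1, -1):
        if messages[i]["role"] != "assistant":
            continue
        seen += 1
        if seen == n:
            # Include the user message that prompted this assistant turn, if any.
            starts_with_user = i > 0 and messages[i - 1]["role"] == "user"
            return messages[i - 1:] if starts_with_user else messages[i:]
    return messages[:]   # fewer than n assistant turns: the whole buffer is the window
```

Counting bot turns rather than wall-clock time keeps window sizes uniform whether a user sends five messages a day or five hundred.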

Retrieval — three paths in parallel

On every user message, the chat handler fires three reads in parallel via asyncio.gather:

import asyncio


async def build_prompt_context(user_id: int, char_id: str, user_query: str) -> dict:
    """Parallel fire the three reads. Returns everything the handler needs."""
    # gather preserves argument order, so the unpacking below is safe.
    recent, summary, memories = await asyncio.gather(
        get_recent(user_id, char_id),
        get_latest_summary(user_id, char_id),
        get_relevant_memories(user_id, char_id, user_query),
    )
    return {"recent": recent, "summary": summary, "memories": memories}

→ full source

The fast path for the summary hits Redis. The slower path queries ChromaDB only when the Redis cache has expired, then writes the result back so the next call is hot again.
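
That read-through pattern can be sketched as follows. This is an illustration, not the repo's code: a plain dict stands in for the Redis cache (so there is no TTL here), and `fetch_cold` is a hypothetical coroutine standing in for the ChromaDB lookup.

```python
import asyncio


async def get_latest_summary_cached(key: str, cache: dict[str, str], fetch_cold) -> str:
    """Fast path: cache hit. Slow path: query the cold store, then write back."""
    cached = cache.get(key)
    if cached:
        return cached
    summary = await fetch_cold(key)   # hypothetical coroutine querying the cold layer
    if summary:                       # never write an empty summary back to the cache
        cache[key] = summary
    return summary or ""


async def _demo() -> None:
    cache: dict[str, str] = {}

    async def cold(key: str) -> str:
        return "user likes hiking"

    print(await get_latest_summary_cached("u1:demo", cache, cold))   # slow path
    print(await get_latest_summary_cached("u1:demo", cache, cold))   # fast path


if __name__ == "__main__":
    asyncio.run(_demo())
```

The `if summary:` check before the write-back matters: caching an empty result would pin a blank summary until the cache entry expires.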

Production issues that came up

Double-summarize race. Two concurrent messages for the same pair both trigger summarization, writing overlapping summaries. Fix: per-key task tracking, cancel the pending task if a new one fires.

User clears history mid-summarize. A user hits "reset chat" while a summary is in flight. The summary then writes to a collection that just got deleted. Fix: re-check r.exists(key) before writing; bail if the list is gone.

Empty summaries cached. LLM rate-limited, returned empty content — and I was caching the empty string with a 3-day TTL. Fix: if summary: guard before setex.

ChromaDB collection doesn't exist for new users. col.query raises on a non-existent collection. Wrap in try/except and return empty — normal for a user's first few messages.

What I'd change if starting over

Skip pgvector for this shape of workload. I spent two weeks on it first; for my short-query summaries, recall was worse than with ChromaDB and the reindexing pain wasn't worth it.
Don't embed per message. Index exploded, recall didn't improve. Summary-level is the right granularity.
Summarize fixed-size windows, not time-based batches. Daily summaries are useless for users who chatted 500 times in one day.
Build the cancellation pattern from day 1. Race conditions around user actions (clear history, switch character) became one of the top sources of production bugs.

Where this lives

HoneyChat — an AI companion that runs both as a Telegram bot and a web app on the same backend. The architecture above is in production. Try it: @HoneyChatAIBot on Telegram or honeychat.bot in the browser.

Public docs: github.com/sm1ck/honeychat — service topology, API surface, major flows.

Next in the series: LLM routing per tier — why one model doesn't fit all, and how to handle content_filter errors from reasoning models.

If you're building something similar and have questions about the memory layout or the summarization pipeline, drop a comment. Especially curious how others handle race conditions around user-initiated state resets.
