<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Boris Kl</title>
    <description>The latest articles on DEV Community by Boris Kl (@lamas51).</description>
    <link>https://dev.to/lamas51</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3942385%2F8d8793b0-7612-4b5a-a70c-1d4a8b562b8a.png</url>
      <title>DEV Community: Boris Kl</title>
      <link>https://dev.to/lamas51</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lamas51"/>
    <language>en</language>
    <item>
      <title>I set up Claude Code for a real production project. Here's what actually earned its keep</title>
      <dc:creator>Boris Kl</dc:creator>
      <pubDate>Sat, 06 Jun 2026 13:53:19 +0000</pubDate>
      <link>https://dev.to/lamas51/i-set-up-claude-code-for-a-real-production-project-heres-what-actually-earned-its-keep-56i7</link>
      <guid>https://dev.to/lamas51/i-set-up-claude-code-for-a-real-production-project-heres-what-actually-earned-its-keep-56i7</guid>
      <description>&lt;p&gt;Everyone's got a "10 AI coding tricks" post. This isn't that. This is what's left after three weeks of running Claude Code on a real project — a bilingual booking bot for a beauty salon (Telegram + WhatsApp, Postgres, Google Calendar) — once the novelty wore off and only the useful parts survived.&lt;/p&gt;

&lt;p&gt;Out of the box, Claude Code is a very smart intern with amnesia. Every session it shows up brilliant and clueless. The whole game is fixing the clueless part. Four things did that for me: a CLAUDE.md file, two custom agents, one skill, and two hooks. Everything else I tried, I deleted.&lt;/p&gt;

&lt;h2&gt;
  
  
  CLAUDE.md: the file that pays rent every single day
&lt;/h2&gt;

&lt;p&gt;CLAUDE.md sits in your repo root and gets read at the start of every session. Mine started as three lines. It grew every time the assistant did something I had to undo.&lt;/p&gt;

&lt;p&gt;That's the trick, honestly. Don't write CLAUDE.md upfront — grow it from failures. Mine now includes things like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Architecture rules&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Business logic lives in src/core/ and must not know about
  Telegram or WhatsApp. Channel code lives in src/adapters/.
&lt;span class="p"&gt;-&lt;/span&gt; All times stored in UTC; convert only for display.
&lt;span class="p"&gt;-&lt;/span&gt; Booking creation must stay double-booking-safe — never remove
  locks or constraints around it.

&lt;span class="gu"&gt;## Working agreements&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Before "done": run typecheck &amp;amp;&amp;amp; lint &amp;amp;&amp;amp; test and show the result.
&lt;span class="p"&gt;-&lt;/span&gt; Schema changes go through a migration file. Always.
&lt;span class="p"&gt;-&lt;/span&gt; Prefer the smallest diff that does the job.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each of those lines exists because the assistant once did the opposite. It put Telegram-specific code in core logic — new rule. It "fixed" a timezone bug by converting at storage time — new rule. It reported "done" with failing types — new rule.&lt;/p&gt;

&lt;p&gt;Three weeks in, I almost never repeat an instruction. That file is the difference between an assistant and a goldfish.&lt;/p&gt;

&lt;h2&gt;
  
  
  Custom agents: the reviewer I argue with
&lt;/h2&gt;

&lt;p&gt;Custom agents live in &lt;code&gt;.claude/agents/&lt;/code&gt; as markdown files with a system prompt. You invoke them for a specific job, they do it with their own instructions and tool limits, and they don't pollute your main session's context.&lt;/p&gt;

&lt;p&gt;The one that earns its keep daily is a code reviewer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;code-reviewer&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Reviews changes for bugs and security issues before they are committed."&lt;/span&gt;
&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Read, Grep, Glob, Bash&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

You are a strict but practical code reviewer.
Check, in this order: correctness (timezone boundaries,
double-booking windows), security (unvalidated webhook input,
SQL built by concatenation, missing signature checks),
project rules from CLAUDE.md, and whether behavior changed
without a test changing.
Report findings ordered by severity, with file:line and a
concrete fix. If something is fine, don't pad the review.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The point isn't that it catches everything. The point is that it's a &lt;em&gt;different context&lt;/em&gt; with one job. The main session wrote the code and is biased toward liking it. The reviewer agent reads it cold. It regularly catches things the main session waved through — a webhook handler that trusted &lt;code&gt;message_id&lt;/code&gt; without checking the signature, a slot calculation that broke across midnight.&lt;/p&gt;

&lt;p&gt;It found the midnight bug before my client's customers did. That one agent paid for the whole setup.&lt;/p&gt;
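
&lt;p&gt;For readers who haven't written one, this is roughly what the missing signature check looks like. It's a minimal Python sketch, not the project's actual handler: it assumes the provider signs the raw request body with HMAC-SHA256 and sends the hex digest in a header value shaped like &lt;code&gt;sha256=HEX&lt;/code&gt; (the scheme Meta uses for WhatsApp webhooks). The function name and arguments are made up for illustration.&lt;/p&gt;

```python
import hashlib
import hmac


def signature_is_valid(secret, raw_body, header_value):
    """Check a 'sha256=HEX' style signature over the raw request body.

    `secret` and `raw_body` are bytes; `header_value` is the header string.
    """
    prefix = "sha256="
    if not header_value.startswith(prefix):
        return False
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, header_value[len(prefix):])
```

&lt;p&gt;The important detail is that the check runs on the raw bytes before any field such as &lt;code&gt;message_id&lt;/code&gt; is parsed or trusted.&lt;/p&gt;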

&lt;h2&gt;
  
  
  A skill that stops me from skipping steps
&lt;/h2&gt;

&lt;p&gt;Skills are reusable workflows — a SKILL.md file describing a procedure the assistant follows when you invoke it. I have exactly one that matters, &lt;code&gt;/add-feature&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Restate what we're building, confirm.&lt;/li&gt;
&lt;li&gt;List files that will change and why. Smallest possible diff.&lt;/li&gt;
&lt;li&gt;Implement, following CLAUDE.md.&lt;/li&gt;
&lt;li&gt;Write tests for the changed units.&lt;/li&gt;
&lt;li&gt;Run the code-reviewer agent on the diff. Fix what it finds.&lt;/li&gt;
&lt;li&gt;Summarize: what changed, how to try it, what I must do manually.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Nothing clever. It's a checklist. But here's the thing about checklists — they work precisely because on the fifth feature of the day, &lt;em&gt;I&lt;/em&gt; would skip the review step. The skill doesn't get tired at 11pm. Pilots figured this out decades ago; we're just catching up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hooks: the two-line insurance policy
&lt;/h2&gt;

&lt;p&gt;Hooks run shell commands on events. I only need two.&lt;/p&gt;

&lt;p&gt;The first blocks any edit to secrets files. The assistant has no business touching &lt;code&gt;.env&lt;/code&gt;, ever, and now it physically can't — a PreToolUse hook checks the file path and exits with an error if it looks like secrets. Cost me five minutes to write. Worth it the first time a refactor tried to "helpfully" update an env var.&lt;/p&gt;

&lt;p&gt;The second runs the typecheck after every file edit and pipes problems straight back into the session. The assistant sees its own type errors immediately instead of discovering them at the end, which means it fixes them while the context is hot. This one change cut my "it said done but nothing compiles" rate to roughly zero.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"PreToolUse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Edit|Write"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bash .claude/hooks/protect-secrets.sh"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"PostToolUse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Edit|Write"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npm run --silent typecheck"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
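
&lt;p&gt;The secrets guard itself isn't shown above, and the original is a bash script. Here is a hedged Python sketch of the same idea. It assumes the hook receives a JSON payload on stdin with the edited path under &lt;code&gt;tool_input.file_path&lt;/code&gt;, and that exit code 2 is what blocks the tool call; check both against the hooks documentation for your version. The pattern lists and function names are hypothetical.&lt;/p&gt;

```python
import json
from pathlib import PurePosixPath

# Hypothetical patterns; adjust to whatever counts as a secret in your repo.
SECRET_NAMES = {".env", "credentials.json"}
SECRET_SUFFIXES = {".pem", ".key"}


def looks_like_secret(file_path):
    """Return True when the file name matches one of the secrets patterns."""
    name = PurePosixPath(file_path.replace("\\", "/")).name
    if name in SECRET_NAMES or name.startswith(".env."):
        return True
    return PurePosixPath(name).suffix in SECRET_SUFFIXES


def run_hook(stdin_text):
    """Turn the hook's JSON payload into an exit code: 2 blocks, 0 allows."""
    payload = json.loads(stdin_text)
    file_path = payload.get("tool_input", {}).get("file_path", "")
    return 2 if looks_like_secret(file_path) else 0


# As a real hook script you would end with something like:
#   import sys; sys.exit(run_hook(sys.stdin.read()))
```

&lt;p&gt;Matching on the file name rather than the full path keeps the check working no matter which directory the edit targets.&lt;/p&gt;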



&lt;h2&gt;
  
  
  What I tried and deleted
&lt;/h2&gt;

&lt;p&gt;For honesty's sake: I also built an agent for writing commit messages (the main session does this fine), a skill for deployments (too risky to automate, I want my hands on that), and a hook that auto-ran the full test suite on every edit (made everything crawl — the typecheck is the right granularity; full tests run at review time).&lt;/p&gt;

&lt;p&gt;If a piece of setup doesn't save you something every day, it's not configuration, it's clutter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest summary
&lt;/h2&gt;

&lt;p&gt;Claude Code without setup is a talented freelancer on their first day, every day. With a grown-from-failures CLAUDE.md, one cold-eyed reviewer agent, one checklist skill and two hooks, it's closer to a colleague who's been on the project for a month.&lt;/p&gt;

&lt;p&gt;The setup took me about two hours total, spread over days, mostly as reactions to things that annoyed me. The payback is that I now ship features for a production bot — payments, reminders, a wait-list — in evenings, alone, without the quality dropping.&lt;/p&gt;

&lt;p&gt;Start with CLAUDE.md. Add a reviewer agent the first time you catch a bug you should've caught. Grow the rest from your own failures — they're better teachers than my list anyway.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>automation</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>One year of self-hosted n8n on a $6 Hetzner VPS</title>
      <dc:creator>Boris Kl</dc:creator>
      <pubDate>Wed, 27 May 2026 11:49:40 +0000</pubDate>
      <link>https://dev.to/lamas51/one-year-of-self-hosted-n8n-on-a-6-hetzner-vps-4ee7</link>
      <guid>https://dev.to/lamas51/one-year-of-self-hosted-n8n-on-a-6-hetzner-vps-4ee7</guid>
      <description>&lt;h1&gt;
  
  
  One year of self-hosted n8n on a $6 Hetzner VPS
&lt;/h1&gt;

&lt;p&gt;Twelve months ago I moved my workflow automation off Zapier and onto a single Hetzner CX22 — €4.51/mo, 2 vCPU, 4 GB RAM, 40 GB disk. One Docker host, one n8n container, one Postgres, one Caddy reverse proxy. It's run four production workflows continuously since then, with one outage I'll get to below.&lt;/p&gt;

&lt;p&gt;This post is not a "n8n vs Zapier" pitch. It's a year of operating notes — what stayed cheap, what broke, what I'd do differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hetzner Cloud CX22 (Falkenstein)
├── Docker
│   ├── n8n (latest stable)
│   ├── postgres:15
│   └── caddy (with automatic TLS)
├── UFW (22, 80, 443 only)
└── borgbackup → Hetzner Storage Box (€3.81/mo)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Caddy bit matters more than people think. n8n's built-in HTTP is fine for localhost, but webhook receivers need real TLS, and Caddy gives you ACME, HTTP→HTTPS redirect, and per-domain certificates with almost no config. The Caddyfile is six lines. You don't have to think about it again.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's running
&lt;/h2&gt;

&lt;p&gt;Four workflows. None of them invented; all real:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Telegram bot dispatcher.&lt;/strong&gt; Inbound webhook → routing logic → either a Postgres write or a downstream service call. About 40 events/day average, occasional 200-event spikes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RSS aggregator → Telegram channel.&lt;/strong&gt; Polls 12 feeds every 15 min, dedupes by URL hash in Postgres, posts new items to a private channel. ~30 posts/day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Form submission → CRM-lite.&lt;/strong&gt; A few WordPress sites hit a webhook on form submit; n8n writes to Postgres, sends an email confirmation, and logs to a Discord channel for me.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily reporting cron.&lt;/strong&gt; Pulls metrics from three internal APIs at 06:00, builds a markdown digest, emails it, also posts it to Slack.&lt;/li&gt;
&lt;/ol&gt;
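
&lt;p&gt;The "dedupe by URL hash" step in workflow 2 is simple enough to sketch outside n8n. This Python version is an illustration only: a set stands in for the Postgres table, the function names are invented, and the normalization (trim whitespace and a trailing slash) is an assumption about what counts as the same URL.&lt;/p&gt;

```python
import hashlib


def url_key(url):
    """Hash a lightly normalized URL so trivial variants share one key."""
    normalized = url.strip().rstrip("/")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def new_items(urls, seen):
    """Return the URLs not seen before, recording their hashes in `seen`."""
    fresh = []
    for url in urls:
        key = url_key(url)
        if key not in seen:
            seen.add(key)
            fresh.append(url)
    return fresh
```

&lt;p&gt;In a real setup the set would be a table with a unique constraint on the hash column, so a duplicate insert fails instead of posting twice.&lt;/p&gt;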

&lt;p&gt;None of these need millisecond latency. All of them benefit from being one config-pull away from changing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost breakdown (12 months)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Monthly&lt;/th&gt;
&lt;th&gt;Annual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hetzner CX22&lt;/td&gt;
&lt;td&gt;€4.51&lt;/td&gt;
&lt;td&gt;€54.12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage Box (backup)&lt;/td&gt;
&lt;td&gt;€3.81&lt;/td&gt;
&lt;td&gt;€45.72&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Domain (.dev)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;€12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~€9.32&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~€112&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Equivalent Zapier seat for the same task volume would have been ~$30-50/month depending on the plan, so we're looking at roughly €350-500 saved over the year. Not life-changing. The real win is something else, which I'll get to.&lt;/p&gt;

&lt;h2&gt;
  
  
  What broke (the one outage)
&lt;/h2&gt;

&lt;p&gt;Month four. n8n upgraded from v1.x to a major release. I'd been running &lt;code&gt;docker compose pull&lt;/code&gt; weekly without pinning, because "it's been fine." The upgrade introduced a breaking change to how credentials were stored. Container started; UI loaded; every workflow showed "credentials missing" and refused to execute.&lt;/p&gt;

&lt;p&gt;Root cause: I had no version-pin and no upgrade test. The backup was fine (borg snapshots intact), but the restore-and-investigate took me a Saturday afternoon.&lt;/p&gt;

&lt;p&gt;What I changed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pinned n8n image to a specific minor version (&lt;code&gt;n8nio/n8n:1.45.x&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Added a staging instance on a second Hetzner VPS (€3/mo CX21) that gets the upgrade first.&lt;/li&gt;
&lt;li&gt;Subscribed to the n8n releases RSS feed so I see breaking changes before I pull.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In hindsight: a SaaS would have done the upgrade for me and either Bigger Things would have broken (multi-tenant blast radius) or none of this would have ever happened. Pick your trade.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual win (it's not the money)
&lt;/h2&gt;

&lt;p&gt;The €350/year doesn't matter. What matters is that &lt;strong&gt;workflows live in git-tracked JSON exports I own, on infrastructure I own&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When a workflow changes, I commit the n8n export. When something breaks, I can diff yesterday's export against today's and see what shifted. When the credentials database gets weird, I open psql and look at the rows. When the webhook target changes, I write the new URL in a Caddyfile and reload — no support ticket, no rate limit on changes, no "this requires an upgrade to the Team plan."&lt;/p&gt;

&lt;p&gt;On Zapier, the same change graph is a black box. Some changes are free, some require the next plan tier, and you don't always know which until the click. With n8n on a box you control, the question "can I do this?" reduces to "is it physically possible?" — and the answer is almost always yes.&lt;/p&gt;
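
&lt;p&gt;Git already does the diffing once exports are committed, but the same comparison is easy to script. A small Python sketch, assuming the exports are JSON documents (n8n's export format) that have already been loaded; the function name and the "yesterday"/"today" labels are placeholders.&lt;/p&gt;

```python
import difflib
import json


def export_diff(old_export, new_export):
    """Return a unified diff of two workflow exports (parsed JSON)."""
    def render(doc):
        # Stable key order keeps the diff about content, not serialization.
        return json.dumps(doc, indent=2, sort_keys=True).splitlines()

    return list(difflib.unified_diff(
        render(old_export), render(new_export),
        fromfile="yesterday", tofile="today", lineterm="",
    ))
```

&lt;p&gt;Sorting keys before diffing matters because an export can reorder fields between runs without any real change.&lt;/p&gt;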

&lt;h2&gt;
  
  
  Things I'd do differently if starting today
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pin the image from day one.&lt;/strong&gt; Whatever the cost in "missing the new shiny feature for a week" is dwarfed by the cost of an unscheduled Saturday.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use external Postgres, not the docker-compose one.&lt;/strong&gt; Hetzner offers managed Postgres now. €11/mo, automatic backups, no "my container restarted and ate the WAL" risk. I'd take the €11 hit gladly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't put auth on the webhook receivers via n8n itself.&lt;/strong&gt; Put it at Caddy or a separate gateway. n8n's auth model exists, but you can't reuse it for non-n8n endpoints, and you'll regret the coupling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write the runbook first, not after the first outage.&lt;/strong&gt; "How do I restore from borg," "how do I roll the credentials key," "where are the env files" — five minutes to write, an hour to rediscover when stressed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't put more than 10 workflows on one box.&lt;/strong&gt; Memory usage scales with concurrent executions, and a runaway loop in one workflow will starve the others. If you go past 10, split them across two n8n instances.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  When NOT to self-host
&lt;/h2&gt;

&lt;p&gt;This setup works because the four workflows are mine, the data is mine, and downtime measured in hours (not minutes) is acceptable. If any of those three change, the calculus changes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If a client depends on the webhook receiver having 99.95% uptime, this single-box setup is wrong. Use n8n Cloud or a multi-node deployment.&lt;/li&gt;
&lt;li&gt;If the workflows touch regulated data (HIPAA, PCI, GDPR's stricter applications), don't reach for the cheapest box. Use a vendor who'll sign a DPA and an audit-ready hosting tier.&lt;/li&gt;
&lt;li&gt;If you're a team of more than three and people need fine-grained access, n8n self-host's RBAC is workable but not great. The Cloud tier handles teams better.&lt;/li&gt;
&lt;li&gt;If your time is worth more than €30/month, and the workflows are simple enough that Zapier or Make.com handles them without ceremony, the savings aren't worth the operating load. Pay for the SaaS.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The five-line take
&lt;/h2&gt;

&lt;p&gt;Self-hosted n8n on a cheap VPS is one of those rare cases where the "boring" answer is also the cheap one and also the powerful one. Run it for a year before you decide it's not for you. Pin your versions. Write the runbook. Don't put it on the same box as anything else important.&lt;/p&gt;

&lt;p&gt;— Boris (&lt;a href="https://twitter.com/lamastoma" rel="noopener noreferrer"&gt;@lamastoma&lt;/a&gt;)&lt;/p&gt;





</description>
      <category>n8n</category>
      <category>selfhosted</category>
      <category>automation</category>
      <category>devops</category>
    </item>
    <item>
      <title>A Production Python Telegram Bot Was Crashing Every 2 Hours. The Fix Was 18 Lines.</title>
      <dc:creator>Boris Kl</dc:creator>
      <pubDate>Wed, 20 May 2026 13:28:23 +0000</pubDate>
      <link>https://dev.to/lamas51/a-production-python-telegram-bot-was-crashing-every-2-hours-the-fix-was-18-lines-29di</link>
      <guid>https://dev.to/lamas51/a-production-python-telegram-bot-was-crashing-every-2-hours-the-fix-was-18-lines-29di</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"If you see cascading errors, find the first thing that fails and stop reading the log there. Everything after the first failure is the system reacting to the first failure."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A production Python Telegram bot I was looking after started crashing every 2-3 hours. The traceback was a horror show — &lt;code&gt;TelegramRetryAfter&lt;/code&gt;, then &lt;code&gt;asyncio.TimeoutError&lt;/code&gt;, then &lt;code&gt;sqlite3.OperationalError: database is locked&lt;/code&gt;, then 47 leaked sessions, then the process got OOM-killed, then systemd restarted it. Then it happened again, 140 minutes later, like clockwork.&lt;/p&gt;

&lt;p&gt;The temptation when you see this kind of cascade is to throw the whole architecture out. &lt;em&gt;"SQLite can't handle our scale, let's move to Postgres."&lt;/em&gt; &lt;em&gt;"Bare asyncio is too low-level, let's add a queue."&lt;/em&gt; &lt;em&gt;"Let's rewrite it in Go."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I didn't do any of those things. The fix was 18 lines of code in one middleware file. The bot has been up for weeks since.&lt;/p&gt;

&lt;p&gt;Here's the diagnosis, the fix, and the takeaway. The code is real (anonymized of any client specifics) and the numbers are real.&lt;/p&gt;

&lt;h2&gt;
  
  
  The symptoms
&lt;/h2&gt;

&lt;p&gt;Stack: &lt;code&gt;Python 3.12&lt;/code&gt;, &lt;code&gt;aiogram 3.x&lt;/code&gt;, &lt;code&gt;SQLite&lt;/code&gt; for user state, &lt;code&gt;asyncio&lt;/code&gt; everywhere. Volume: about 4,000 daily incoming messages. Not high-throughput.&lt;/p&gt;

&lt;p&gt;The log every 140 minutes looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[14:22:01] ERROR  aiogram.TelegramRetryAfter: flood control, retry in 28s
[14:22:03] ERROR  asyncio.TimeoutError in update handler
[14:22:05] WARNING bot.session not closed (47 active)
[14:22:08] ERROR  sqlite3.OperationalError: database is locked
[14:22:14] ERROR  ...same pattern, multiplying...
[14:22:20] ERROR  process killed by OOM
[14:22:21] INFO   systemd: restarted
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Process up ~140 minutes. Then the cascade. Then restart. Repeat.&lt;/p&gt;

&lt;h2&gt;
  
  
  What looked plausible (and was wrong)
&lt;/h2&gt;

&lt;p&gt;When I started looking, the first hypothesis was &lt;em&gt;"SQLite is the bottleneck — it can't handle the concurrency."&lt;/em&gt; That's the most obvious thing to say when you see &lt;code&gt;database is locked&lt;/code&gt; in a log.&lt;/p&gt;

&lt;p&gt;It was wrong. Here's why I dropped it after 30 minutes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4,000 messages a day is nothing for SQLite.&lt;/strong&gt; SQLite handles tens of thousands of writes per second on modest hardware. If we were hitting a SQLite ceiling, we'd be hitting it under steady load, not in sudden bursts. The 140-minute interval was the giveaway — something was &lt;em&gt;accumulating&lt;/em&gt;, not saturating.&lt;/p&gt;

&lt;p&gt;The second hypothesis was &lt;em&gt;"We're hitting Telegram API rate limits."&lt;/em&gt; That's what &lt;code&gt;TelegramRetryAfter&lt;/code&gt; literally says. But again, 4,000 messages a day = roughly 1 message every 20 seconds on average. Telegram's per-bot rate limit is 30 messages per second. We weren't even in the same order of magnitude.&lt;/p&gt;

&lt;p&gt;So whatever was happening was &lt;em&gt;bursty&lt;/em&gt;, not steady-state. And the bot was somehow turning a steady stream of inbound updates into a burst of outbound API calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual root cause
&lt;/h2&gt;

&lt;p&gt;Here's what was happening, step by step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A user sends a message. &lt;code&gt;aiogram&lt;/code&gt; receives it as an update.&lt;/li&gt;
&lt;li&gt;The handler runs, does some work, and sends a reply to Telegram.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normally:&lt;/strong&gt; that reply goes out, the handler returns, the asyncio task ends, the &lt;code&gt;bot.session&lt;/code&gt; HTTP connection is released.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What actually happened:&lt;/strong&gt; &lt;em&gt;no throttle middleware existed.&lt;/em&gt; If 5-10 users happened to message in the same second (which happens during peak hours), the bot fired 5-10 outbound &lt;code&gt;sendMessage&lt;/code&gt; API calls &lt;em&gt;concurrently&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Five or ten outbound requests inside one second pushed us past Telegram's per-second rate limit. Telegram answered with &lt;code&gt;429 Too Many Requests&lt;/code&gt; and a &lt;code&gt;retry_after&lt;/code&gt; value in the response body.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aiogram&lt;/code&gt; raised &lt;code&gt;TelegramRetryAfter&lt;/code&gt;. But the handler that raised it was &lt;em&gt;waiting&lt;/em&gt; on the API response — it couldn't release its HTTP session until the retry window closed (28 seconds in the log above).&lt;/li&gt;
&lt;li&gt;While that handler was waiting, the next inbound update hit the same handler code. Another async task spawned. Another &lt;code&gt;bot.session&lt;/code&gt; connection opened. Another wait.&lt;/li&gt;
&lt;li&gt;Now we have two stuck tasks, each holding a connection, each blocked on &lt;code&gt;retry_after&lt;/code&gt;. Both tasks also need to update the user's row in SQLite. SQLite has no row-level locks: the first writer locks the whole database for writes, and the second writer waits. Deadlock potential.&lt;/li&gt;
&lt;li&gt;Multiply this by 10 minutes of bursty traffic. Now you have 47 leaked sessions, an SQLite deadlock, and a Python process eating memory because tasks aren't completing.&lt;/li&gt;
&lt;li&gt;OOM killer hits. Systemd restarts. Cycle resets.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The cascade had &lt;strong&gt;one&lt;/strong&gt; cause: no rate limit on the bot's &lt;em&gt;inbound&lt;/em&gt; side. Everything downstream was just the system reacting to the upstream pressure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix — 18 lines
&lt;/h2&gt;

&lt;p&gt;A throttle middleware. Drop incoming updates from a user if they already had a message in the last second. That's it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# middleware.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aiogram&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseMiddleware&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aiogram.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Update&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cachetools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TTLCache&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ThrottleMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseMiddleware&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Drop second-message-within-N-seconds per user.

    Without this, bursty inbound traffic translates 1:1 into bursty
    outbound API calls and trips Telegram&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s flood control.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rate_limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TTLCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rate_limit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Update&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;  &lt;span class="c1"&gt;# silently drop — user is over their rate limit
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then wire it up and register a clean shutdown handler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
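&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aiogram&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Bot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dispatcher&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;middleware&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ThrottleMiddleware&lt;/span&gt;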

&lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BOT_TOKEN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Dispatcher&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ThrottleMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rate_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_shutdown&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Close the bot session explicitly. Otherwise sessions leak
    on graceful shutdown and the next start hits a connection pool
    in a weird state.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;on_shutdown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's 18 lines of production code plus one test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_middleware.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;middleware&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ThrottleMiddleware&lt;/span&gt;


&lt;span class="nd"&gt;@pytest.mark.asyncio&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_throttle_drops_rapid_second_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mocker&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;middleware&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ThrottleMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rate_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mocker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncMock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;return_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# helper to build a fake aiogram Update
&lt;/span&gt;
    &lt;span class="c1"&gt;# First message — goes through
&lt;/span&gt;    &lt;span class="n"&gt;result1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result1&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Second message same user, same second — dropped
&lt;/span&gt;    &lt;span class="n"&gt;result2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result2&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assert_called_once&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole patch.&lt;/p&gt;
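
&lt;p&gt;The test calls a &lt;code&gt;make_event&lt;/code&gt; helper that isn't shown here. Below is a minimal sketch of what such a helper could look like, assuming the middleware only ever reads &lt;code&gt;event.message.from_user.id&lt;/code&gt;. It builds a duck-typed stand-in with &lt;code&gt;types.SimpleNamespace&lt;/code&gt; rather than a real aiogram &lt;code&gt;Update&lt;/code&gt;, so treat it as an illustration, not necessarily the helper in the repo.&lt;/p&gt;

```python
from types import SimpleNamespace


def make_event(user_id):
    """Build a duck-typed stand-in for an aiogram Update (hypothetical helper).

    Only the attributes the middleware reads are provided:
    event.message.from_user.id. Pass user_id=None to simulate an
    update that carries no message at all.
    """
    if user_id is None:
        return SimpleNamespace(message=None)
    from_user = SimpleNamespace(id=user_id)
    return SimpleNamespace(message=SimpleNamespace(from_user=from_user))


event = make_event(user_id=123)
print(event.message.from_user.id)
```

&lt;p&gt;Because the stand-in has no aiogram dependency, the test stays fast and never touches the network.&lt;/p&gt;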

&lt;h2&gt;
  
  
  Why this works
&lt;/h2&gt;

&lt;p&gt;The fix doesn't make SQLite faster. It doesn't add a queue. It doesn't change anything about how the handlers process messages. It just stops the &lt;em&gt;upstream pressure&lt;/em&gt; before it cascades downstream.&lt;/p&gt;

&lt;p&gt;Once incoming updates are rate-limited to one per second per user, a single user's burst can no longer fan out into 10 concurrent outbound API calls. The bot has at most 1-2 in flight. Telegram never gets angry. &lt;code&gt;TelegramRetryAfter&lt;/code&gt; never fires. Handlers never get stuck waiting. Sessions never leak. SQLite never sees concurrent writes for the same row.&lt;/p&gt;

&lt;p&gt;The cascade isn't a chain. It's a tree, and the throttle cuts the tree at the root.&lt;/p&gt;
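
&lt;p&gt;To see the effect in isolation, here is a stdlib-only sketch. It is not the shipped middleware: &lt;code&gt;SketchThrottle&lt;/code&gt; and &lt;code&gt;simulate_burst&lt;/code&gt; are hypothetical names, and a plain dict of timestamps stands in for &lt;code&gt;TTLCache&lt;/code&gt; (so, unlike the real cache, it has no size cap). The handler just records calls in place of an outbound Telegram API request.&lt;/p&gt;

```python
import asyncio
import time


class SketchThrottle:
    """Stdlib-only stand-in for the per-user throttle (not the shipped class)."""

    def __init__(self, rate_limit=1.0):
        self.rate_limit = rate_limit
        self.last_seen = {}  # user_id -> monotonic time of last accepted message

    async def __call__(self, handler, user_id):
        now = time.monotonic()
        last = self.last_seen.get(user_id)
        if last is not None and self.rate_limit > now - last:
            return None  # dropped: same user, still inside the window
        self.last_seen[user_id] = now
        return await handler(user_id)


async def simulate_burst(user_ids):
    """Send one message per entry in user_ids at effectively the same instant."""
    calls = []

    async def handler(user_id):
        calls.append(user_id)  # stands in for an outbound Telegram API call
        return "processed"

    throttle = SketchThrottle(rate_limit=1.0)
    await asyncio.gather(*(throttle(handler, uid) for uid in user_ids))
    return len(calls)


# Ten messages from one user vs. one message each from ten users.
print(asyncio.run(simulate_burst([42] * 10)))
print(asyncio.run(simulate_burst(list(range(1, 11)))))
```

&lt;p&gt;Ten messages from one user reach the handler once. Ten messages from ten different users reach it ten times, because the limit is per user and not global.&lt;/p&gt;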

&lt;h2&gt;
  
  
  The result
&lt;/h2&gt;

&lt;p&gt;Numbers (real, from production):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First 4 hours after deploy:&lt;/strong&gt; zero &lt;code&gt;TelegramRetryAfter&lt;/code&gt;. Zero &lt;code&gt;TimeoutError&lt;/code&gt;. Session count stable at 1-2 (vs. climbing past 40 every two hours before).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First 24 hours:&lt;/strong&gt; zero errors of any kind in the log.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First 7 days:&lt;/strong&gt; zero crashes. Zero systemd restarts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bot has been up continuously since the deploy. Same SQLite. Same asyncio. Same handlers. The only thing that changed is the throttle middleware.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell a junior on the team
&lt;/h2&gt;

&lt;p&gt;A few generic takeaways that apply far beyond this specific bug:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Find the first failure in the log and stop reading.&lt;/strong&gt; When you see cascading errors, everything after the first failure is the system reacting to the first failure. Don't try to "fix" the downstream errors. Find the upstream cause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Missing upstream backpressure is the cause about 80% of the time, in my experience, when you see async-Python cascades.&lt;/strong&gt; When the downstream component (SQLite, HTTP client, worker pool) looks stuck, it's almost always because the upstream is feeding it work too fast. Rate-limit the upstream first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The temptation to rewrite is almost always wrong early in diagnosis.&lt;/strong&gt; "Rewrite in Go" / "switch to Postgres" / "add a queue" are valid responses to &lt;em&gt;real&lt;/em&gt; scale problems. They're not valid responses to "I haven't figured out the bug yet." Spend an hour with the actual logs first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Volume matters less than burstiness.&lt;/strong&gt; A system handling 4k messages/day average can absolutely fall over from 10 messages in one second. The metric you care about is &lt;em&gt;peak concurrency&lt;/em&gt;, not &lt;em&gt;total throughput&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Test the throttle as a unit, not as an integration.&lt;/strong&gt; The fix above has one test (12 lines). It doesn't try to spin up a real bot. It just verifies the middleware behavior in isolation. That's enough — the actual production behavior is downstream of this contract holding.&lt;/p&gt;
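
&lt;p&gt;To make point 4 concrete, here is a small sketch that compares average rate with peak per-second load. The arrival times are made up for illustration (4,000 evenly spaced messages plus one burst of 10), and &lt;code&gt;peak_per_second&lt;/code&gt; and &lt;code&gt;average_per_second&lt;/code&gt; are hypothetical helper names, not part of the patch.&lt;/p&gt;

```python
from collections import Counter


def peak_per_second(timestamps):
    """Return the largest number of messages that landed in any one-second bucket."""
    buckets = Counter(int(ts) for ts in timestamps)
    return max(buckets.values(), default=0)


def average_per_second(timestamps, window_seconds):
    """Return the mean arrival rate over the whole window."""
    return len(timestamps) / window_seconds


# Hypothetical arrivals, in seconds since the start of the day:
# 4,000 messages spread evenly, plus a burst of 10 inside a single second.
day = 86_400
steady = [i * (day / 4000) for i in range(4000)]
burst = [30_000 + i * 0.05 for i in range(10)]
arrivals = steady + burst

print(round(average_per_second(arrivals, day), 3))
print(peak_per_second(arrivals))
```

&lt;p&gt;The average comes out under 0.05 messages per second, while the peak second carries 10. A dashboard that only shows the daily total would never reveal that burst.&lt;/p&gt;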

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;The middleware and the test are public:&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://github.com/lamas51/claude-code-templates" rel="noopener noreferrer"&gt;github.com/lamas51/claude-code-templates&lt;/a&gt; (case studies folder)&lt;/p&gt;

&lt;p&gt;The same repo also has the Claude Code agent/skill/hook templates I deploy across Go, Python, and WordPress projects. Feel free to fork.&lt;/p&gt;

&lt;h2&gt;
  
  
  About me
&lt;/h2&gt;

&lt;p&gt;I'm Boris, an IT pro since 1999. I run production code across Go, Python, and React, mostly for small and mid-size businesses. For the last 18 months I've leaned heavily on a Claude Code workflow.&lt;/p&gt;

&lt;p&gt;If you have a production Python service throwing similar cascades and want help diagnosing it, I take this kind of work through Fiverr (clean scope, escrow, no off-platform contact):&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://www.fiverr.com/lamastoma" rel="noopener noreferrer"&gt;fiverr.com/lamastoma&lt;/a&gt; — Python / n8n / Telegram bot bug fixing in 24 hours&lt;/p&gt;

&lt;p&gt;Open to questions in the comments — happy to dig into specifics if you're seeing something similar.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anonymized: no client data. The diagnosis flow and final patch are the actual ones I shipped.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>aiogram</category>
      <category>asyncio</category>
      <category>debugging</category>
    </item>
  </channel>
</rss>
