<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Boris Kl</title>
    <description>The latest articles on DEV Community by Boris Kl (@lamas51).</description>
    <link>https://dev.to/lamas51</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3942385%2F8d8793b0-7612-4b5a-a70c-1d4a8b562b8a.png</url>
      <title>DEV Community: Boris Kl</title>
      <link>https://dev.to/lamas51</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lamas51"/>
    <language>en</language>
    <item>
      <title>A Production Python Telegram Bot Was Crashing Every 2 Hours. The Fix Was 18 Lines.</title>
      <dc:creator>Boris Kl</dc:creator>
      <pubDate>Wed, 20 May 2026 13:28:23 +0000</pubDate>
      <link>https://dev.to/lamas51/a-production-python-telegram-bot-was-crashing-every-2-hours-the-fix-was-18-lines-29di</link>
      <guid>https://dev.to/lamas51/a-production-python-telegram-bot-was-crashing-every-2-hours-the-fix-was-18-lines-29di</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"If you see cascading errors, find the first thing that fails and stop reading the log there. Everything after the first failure is the system reacting to the first failure."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A production Python Telegram bot I was looking after started crashing every 2-3 hours. The traceback was a horror show — &lt;code&gt;TelegramRetryAfter&lt;/code&gt;, then &lt;code&gt;asyncio.TimeoutError&lt;/code&gt;, then &lt;code&gt;sqlite3.OperationalError: database is locked&lt;/code&gt;, then 47 leaked sessions, then the process got OOM-killed, then systemd restarted it. Then it happened again, 140 minutes later, like clockwork.&lt;/p&gt;

&lt;p&gt;The temptation when you see this kind of cascade is to throw the whole architecture out. &lt;em&gt;"SQLite can't handle our scale, let's move to Postgres."&lt;/em&gt; &lt;em&gt;"Bare asyncio is too low-level, let's add a queue."&lt;/em&gt; &lt;em&gt;"Let's rewrite it in Go."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I didn't do any of those things. The fix was 18 lines of code in one middleware file. The bot has been up for weeks since.&lt;/p&gt;

&lt;p&gt;Here's the diagnosis, the fix, and the takeaway. The code is real (stripped of client specifics) and the numbers are real.&lt;/p&gt;

&lt;h2&gt;The symptoms&lt;/h2&gt;

&lt;p&gt;Stack: &lt;code&gt;Python 3.12&lt;/code&gt;, &lt;code&gt;aiogram 3.x&lt;/code&gt;, &lt;code&gt;SQLite&lt;/code&gt; for user state, &lt;code&gt;asyncio&lt;/code&gt; everywhere. Volume: about 4,000 daily incoming messages. Not high-throughput.&lt;/p&gt;

&lt;p&gt;The log every 140 minutes looked like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[14:22:01] ERROR  aiogram.TelegramRetryAfter: flood control, retry in 28s
[14:22:03] ERROR  asyncio.TimeoutError in update handler
[14:22:05] WARNING bot.session not closed (47 active)
[14:22:08] ERROR  sqlite3.OperationalError: database is locked
[14:22:14] ERROR  ...same pattern, multiplying...
[14:22:20] ERROR  process killed by OOM
[14:22:21] INFO   systemd: restarted
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Process up ~140 minutes. Then the cascade. Then restart. Repeat.&lt;/p&gt;
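&lt;p&gt;With a log like this, the useful line is the earliest failure. Here is a minimal sketch of pulling it out programmatically. The sample is the log above, abridged, and the &lt;code&gt;first_failure&lt;/code&gt; helper and its regex are illustrative assumptions that only fit this timestamp-and-level layout.&lt;/p&gt;

```python
import re

# Abridged copy of the log shown above.
LOG = """\
[14:22:01] ERROR  aiogram.TelegramRetryAfter: flood control, retry in 28s
[14:22:03] ERROR  asyncio.TimeoutError in update handler
[14:22:05] WARNING bot.session not closed (47 active)
[14:22:08] ERROR  sqlite3.OperationalError: database is locked
[14:22:20] ERROR  process killed by OOM
"""

# Matches "[HH:MM:SS] LEVEL message"; an assumption about the log layout.
LINE = re.compile(r"\[(\d{2}:\d{2}:\d{2})\]\s+(\w+)\s+(.*)")


def first_failure(log_text, levels=("ERROR", "CRITICAL")):
    """Return (timestamp, message) of the earliest failing line, or None."""
    for line in log_text.splitlines():
        match = LINE.match(line)
        if match and match.group(2) in levels:
            return match.group(1), match.group(3)
    return None


first = first_failure(LOG)
print(first)
```

&lt;p&gt;Everything the helper skips is, by the rule quoted at the top, the system reacting to that first line.&lt;/p&gt;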

&lt;h2&gt;What looked plausible (and was wrong)&lt;/h2&gt;

&lt;p&gt;When I started looking, the first hypothesis was &lt;em&gt;"SQLite is the bottleneck — it can't handle the concurrency."&lt;/em&gt; That's the most obvious thing to say when you see &lt;code&gt;database is locked&lt;/code&gt; in a log.&lt;/p&gt;

&lt;p&gt;It was wrong. Here's why I dropped it after 30 minutes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4,000 messages a day is nothing for SQLite.&lt;/strong&gt; SQLite can sustain thousands of writes per second on modest hardware, and far more when writes are batched into transactions. If we were hitting a SQLite ceiling, we'd be hitting it under steady load, not in sudden bursts. The 140-minute interval was the giveaway: something was &lt;em&gt;accumulating&lt;/em&gt;, not saturating.&lt;/p&gt;

&lt;p&gt;The second hypothesis was &lt;em&gt;"We're hitting Telegram API rate limits."&lt;/em&gt; That's what &lt;code&gt;TelegramRetryAfter&lt;/code&gt; literally says. But again, 4,000 messages a day works out to roughly 1 message every 22 seconds on average. Telegram's published guidance for bots is on the order of 30 messages per second overall. On average, we weren't even in the same order of magnitude.&lt;/p&gt;
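&lt;p&gt;The back-of-the-envelope numbers are easy to check. This sketch uses only the two figures quoted in the text (4,000 messages a day and a 30-per-second limit); the variable names are mine.&lt;/p&gt;

```python
# Back-of-the-envelope check using the figures quoted in the text.
SECONDS_PER_DAY = 24 * 60 * 60
daily_messages = 4_000
quoted_limit_per_second = 30  # the per-bot figure cited in the text

average_gap_seconds = SECONDS_PER_DAY / daily_messages
average_rate_per_second = daily_messages / SECONDS_PER_DAY
headroom = quoted_limit_per_second / average_rate_per_second

print(f"one message every {average_gap_seconds:.1f} s on average")
print(f"average rate: {average_rate_per_second:.3f} msg/s")
print(f"headroom versus the quoted limit: about {headroom:.0f}x")
```

&lt;p&gt;An average this far under the limit says nothing about what happens inside any single second, which is where the rest of the diagnosis goes.&lt;/p&gt;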

&lt;p&gt;So whatever was happening was &lt;em&gt;bursty&lt;/em&gt;, not steady-state. And the bot was somehow turning a steady stream of inbound updates into a burst of outbound API calls.&lt;/p&gt;

&lt;h2&gt;The actual root cause&lt;/h2&gt;

&lt;p&gt;Here's what was happening, step by step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A user sends a message. &lt;code&gt;aiogram&lt;/code&gt; receives it as an update.&lt;/li&gt;
&lt;li&gt;The handler runs, does some work, and sends a reply to Telegram.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normally:&lt;/strong&gt; that reply goes out, the handler returns, the asyncio task ends, the &lt;code&gt;bot.session&lt;/code&gt; HTTP connection is released.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What actually happened:&lt;/strong&gt; &lt;em&gt;no throttle middleware existed.&lt;/em&gt; If 5-10 users happened to message in the same second (which happens during peak hours), the bot fired 5-10 outbound &lt;code&gt;sendMessage&lt;/code&gt; API calls &lt;em&gt;concurrently&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Five or ten outbound requests inside one second were enough to trip Telegram's flood control. Telegram answered with &lt;code&gt;429 Too Many Requests&lt;/code&gt; and a &lt;code&gt;retry_after&lt;/code&gt; value in the response body.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aiogram&lt;/code&gt; raised &lt;code&gt;TelegramRetryAfter&lt;/code&gt;. But the handler that raised it was &lt;em&gt;waiting&lt;/em&gt; on the API response — it couldn't release its HTTP session until the retry window closed (28 seconds in the log above).&lt;/li&gt;
&lt;li&gt;While that handler was waiting, the next inbound update hit the same handler code. Another async task spawned. Another &lt;code&gt;bot.session&lt;/code&gt; connection opened. Another wait.&lt;/li&gt;
&lt;li&gt;Now we have two stuck tasks, each holding a connection, each blocked on &lt;code&gt;retry_after&lt;/code&gt;. Both tasks also need to update the user's row in SQLite. SQLite allows only one writer at a time and locks the whole database file, not a single row, so the second writer waits. If the wait outlasts the busy timeout, it fails with &lt;code&gt;database is locked&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Multiply this by 10 minutes of bursty traffic. Now you have 47 leaked sessions, SQLite writers timing out on the lock, and a Python process eating memory because tasks aren't completing.&lt;/li&gt;
&lt;li&gt;OOM killer hits. Systemd restarts. Cycle resets.&lt;/li&gt;
&lt;/ol&gt;
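&lt;p&gt;The pile-up in steps 6 to 9 can be reproduced in miniature. This is a simulation sketch only: it makes no Telegram calls, &lt;code&gt;stuck_handler&lt;/code&gt; and &lt;code&gt;simulate&lt;/code&gt; are hypothetical names, and the timings are arbitrary stand-ins for the 28-second retry window.&lt;/p&gt;

```python
import asyncio


async def stuck_handler(state, wait_seconds):
    """Stand-in for a handler that holds a connection while it waits."""
    state["open"] += 1
    state["peak"] = max(state["peak"], state["open"])
    try:
        await asyncio.sleep(wait_seconds)  # simulated wait; no real API call
    finally:
        state["open"] -= 1


async def simulate(arrivals, gap, wait_seconds):
    """Start one handler per arrival, spaced `gap` seconds apart."""
    state = {"open": 0, "peak": 0}
    tasks = []
    for _ in range(arrivals):
        tasks.append(asyncio.create_task(stuck_handler(state, wait_seconds)))
        await asyncio.sleep(gap)
    await asyncio.gather(*tasks)
    return state["peak"]


# Handlers wait far longer than the gap between arrivals: they all overlap.
piled_up = asyncio.run(simulate(arrivals=20, gap=0.001, wait_seconds=0.5))
# Handlers finish before the next arrival: nothing accumulates.
drained = asyncio.run(simulate(arrivals=20, gap=0.005, wait_seconds=0.0))
print(piled_up, drained)
```

&lt;p&gt;The arrival rate is nearly the same in both runs. What changes the peak is how long each handler stays open, which is exactly what a retry wait inflates.&lt;/p&gt;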

&lt;p&gt;The cascade had &lt;strong&gt;one&lt;/strong&gt; cause: no rate limit on the bot's &lt;em&gt;inbound&lt;/em&gt; side. Everything downstream was just the system reacting to the upstream pressure.&lt;/p&gt;
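&lt;p&gt;The &lt;code&gt;database is locked&lt;/code&gt; error from the log is easy to reproduce with the standard library. This sketch assumes a throwaway database file and a hypothetical &lt;code&gt;user_state&lt;/code&gt; table; &lt;code&gt;timeout=0&lt;/code&gt; is set so the blocked writer fails at once instead of waiting.&lt;/p&gt;

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "state.db")

# isolation_level=None gives manual transaction control; timeout=0 makes
# a blocked writer fail immediately instead of waiting for the lock.
first = sqlite3.connect(db_path, timeout=0, isolation_level=None)
second = sqlite3.connect(db_path, timeout=0, isolation_level=None)
first.execute("CREATE TABLE user_state (user_id INTEGER PRIMARY KEY, step TEXT)")

first.execute("BEGIN IMMEDIATE")  # first writer takes the write lock
first.execute("INSERT INTO user_state VALUES (1, 'start')")

try:
    # The second writer wants a different row, but the lock covers the
    # whole database, so its write transaction is refused.
    second.execute("BEGIN IMMEDIATE")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

first.execute("COMMIT")            # lock released
second.execute("BEGIN IMMEDIATE")  # now the second writer gets through
second.execute("INSERT INTO user_state VALUES (2, 'start')")
second.execute("COMMIT")

print(error_message)
first.close()
second.close()
```

&lt;p&gt;With a non-zero timeout the second writer would wait instead of failing, which is harmless when writers finish quickly and fatal when the first writer is a task stuck behind a retry window.&lt;/p&gt;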

&lt;h2&gt;The fix: 18 lines&lt;/h2&gt;

&lt;p&gt;A throttle middleware. Drop an incoming update from a user if that user already had a message accepted within the last second. That's it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# middleware.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aiogram&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseMiddleware&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aiogram.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Update&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cachetools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TTLCache&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ThrottleMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseMiddleware&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Drop second-message-within-N-seconds per user.

    Without this, bursty inbound traffic translates 1:1 into bursty
    outbound API calls and trips Telegram&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s flood control.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rate_limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TTLCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rate_limit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Update&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_user&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;  &lt;span class="c1"&gt;# silently drop — user is over their rate limit
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
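&lt;p&gt;To see the drop rule on its own, here is a dependency-free sketch of the same check. It is not the shipped middleware: &lt;code&gt;SimpleThrottle&lt;/code&gt; is a hypothetical name, it swaps &lt;code&gt;TTLCache&lt;/code&gt; for a plain dict, and the injectable &lt;code&gt;clock&lt;/code&gt; is only there so the behavior can be stepped through without sleeping.&lt;/p&gt;

```python
import time


class SimpleThrottle:
    """Dict-based stand-in for the TTLCache check (hypothetical name)."""

    def __init__(self, rate_limit=1.0, clock=time.monotonic):
        self.rate_limit = rate_limit
        self.clock = clock
        # user_id -> time of the last accepted message. Unlike TTLCache
        # with maxsize, this dict is never pruned.
        self.last_seen = {}

    def allow(self, user_id):
        now = self.clock()
        last = self.last_seen.get(user_id)
        if last is not None and self.rate_limit > now - last:
            return False  # still inside the window: drop
        self.last_seen[user_id] = now
        return True


# Step a fake clock by hand instead of sleeping.
fake_now = [100.0]
throttle = SimpleThrottle(rate_limit=1.0, clock=lambda: fake_now[0])

decisions = [throttle.allow(123)]       # first message from user 123
fake_now[0] = 100.4
decisions.append(throttle.allow(123))   # same user, 0.4 s later
decisions.append(throttle.allow(456))   # different user, same moment
fake_now[0] = 101.5
decisions.append(throttle.allow(123))   # 1.5 s after the first message
print(decisions)
```

&lt;p&gt;As in the middleware above, a dropped message does not extend the window: the timer runs from the last accepted message.&lt;/p&gt;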



&lt;p&gt;Then wire it up and add a clean shutdown:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
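&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aiogram&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Bot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dispatcher&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;middleware&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ThrottleMiddleware&lt;/span&gt;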

&lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BOT_TOKEN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Dispatcher&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ThrottleMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rate_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_shutdown&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Close the bot session explicitly so the HTTP connections
    it holds are released on graceful shutdown instead of being
    left to leak.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;on_shutdown&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's 18 lines of production code plus one test:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_middleware.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;middleware&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ThrottleMiddleware&lt;/span&gt;


&lt;span class="nd"&gt;@pytest.mark.asyncio&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_throttle_drops_rapid_second_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mocker&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;middleware&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ThrottleMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rate_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mocker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncMock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;return_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# helper to build a fake aiogram Update
&lt;/span&gt;
    &lt;span class="c1"&gt;# First message — goes through
&lt;/span&gt;    &lt;span class="n"&gt;result1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result1&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Second message same user, same second — dropped
&lt;/span&gt;    &lt;span class="n"&gt;result2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result2&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assert_called_once&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
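&lt;p&gt;The test calls a &lt;code&gt;make_event&lt;/code&gt; helper that isn't shown. A minimal guess at one follows, assuming the middleware only ever reads &lt;code&gt;event.message.from_user.id&lt;/code&gt;. It builds a duck-typed stand-in, not a real aiogram &lt;code&gt;Update&lt;/code&gt;, and &lt;code&gt;make_non_message_event&lt;/code&gt; is an extra hypothetical helper for the no-message branch.&lt;/p&gt;

```python
from types import SimpleNamespace


def make_event(user_id):
    """Build a stand-in exposing only event.message.from_user.id."""
    return SimpleNamespace(
        message=SimpleNamespace(from_user=SimpleNamespace(id=user_id))
    )


def make_non_message_event():
    """Stand-in for an update that carries no message at all."""
    return SimpleNamespace(message=None)


event = make_event(user_id=123)
print(event.message.from_user.id)
```

&lt;p&gt;A stand-in like this keeps the test free of aiogram's model validation. The trade-off is that it won't catch a field being renamed in the real &lt;code&gt;Update&lt;/code&gt; type.&lt;/p&gt;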



&lt;p&gt;That's the whole patch.&lt;/p&gt;

&lt;h2&gt;Why this works&lt;/h2&gt;

&lt;p&gt;The fix doesn't make SQLite faster. It doesn't add a queue. It doesn't change anything about how the handlers process messages. It just stops the &lt;em&gt;upstream pressure&lt;/em&gt; before it cascades downstream.&lt;/p&gt;

&lt;p&gt;Once incoming updates are rate-limited per-user at 1 per second, the bot never has 10 concurrent outbound API calls. It has at most 1-2. Telegram never gets angry. &lt;code&gt;TelegramRetryAfter&lt;/code&gt; never fires. Handlers never get stuck waiting. Sessions never leak. SQLite never has writers queuing behind a stuck task on its single write lock.&lt;/p&gt;

&lt;p&gt;The cascade isn't a chain. It's a tree, and the throttle cuts the tree at the root.&lt;/p&gt;

&lt;h2&gt;The result&lt;/h2&gt;

&lt;p&gt;Numbers (real, from production):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First 4 hours after deploy:&lt;/strong&gt; zero &lt;code&gt;TelegramRetryAfter&lt;/code&gt;. Zero &lt;code&gt;TimeoutError&lt;/code&gt;. Session count stable at 1-2 (vs. climbing past 40 every two hours before).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First 24 hours:&lt;/strong&gt; zero errors of any kind in the log.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First 7 days:&lt;/strong&gt; zero crashes. Zero systemd restarts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bot has been up continuously since deploy. Same SQLite. Same asyncio. Same handlers. The only thing that changed is the throttle middleware.&lt;/p&gt;

&lt;h2&gt;What I'd tell a junior on the team&lt;/h2&gt;

&lt;p&gt;A few generic takeaways that apply far beyond this specific bug:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Find the first failure in the log and stop reading.&lt;/strong&gt; When you see cascading errors, everything after the first failure is the system reacting to the first failure. Don't try to "fix" the downstream errors. Find the upstream cause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Unchecked upstream pressure is the cause most of the time when you see async-Python cascades (about 80%, in my experience).&lt;/strong&gt; When the downstream component (SQLite, HTTP client, worker pool) looks stuck, it's almost always struggling with something the upstream is doing too fast. Rate-limit the upstream first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The temptation to rewrite is almost always wrong early in diagnosis.&lt;/strong&gt; "Rewrite in Go" / "switch to Postgres" / "add a queue" are valid responses to &lt;em&gt;real&lt;/em&gt; scale problems. They're not valid responses to "I haven't figured out the bug yet." Spend an hour with the actual logs first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Volume matters less than burstiness.&lt;/strong&gt; A system handling 4k messages/day average can absolutely fall over from 10 messages in one second. The metric you care about is &lt;em&gt;peak concurrency&lt;/em&gt;, not &lt;em&gt;total throughput&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Test the throttle as a unit, not as an integration.&lt;/strong&gt; The fix above has one test (12 lines). It doesn't try to spin up a real bot. It just verifies the middleware behavior in isolation. That's enough — the actual production behavior is downstream of this contract holding.&lt;/p&gt;
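&lt;p&gt;To make takeaway 4 concrete, here is a small sketch that compares average rate with peak per-second count. The arrival times are invented for illustration, and &lt;code&gt;peak_per_second&lt;/code&gt; is a hypothetical helper using fixed one-second buckets.&lt;/p&gt;

```python
from collections import Counter


def peak_per_second(timestamps):
    """Largest number of events falling in any one whole second."""
    buckets = Counter(int(ts) for ts in timestamps)
    return max(buckets.values(), default=0)


# Hypothetical arrival times in seconds: sparse traffic plus one burst.
arrivals = [0.0, 21.6, 43.2, 64.8, 64.81, 64.82, 64.85, 64.9, 86.4]

average_rate = len(arrivals) / (arrivals[-1] - arrivals[0])
peak = peak_per_second(arrivals)
print(f"average {average_rate:.2f}/s, peak {peak}/s")
```

&lt;p&gt;Fixed buckets can split a burst that straddles a second boundary, so a sliding window would give a stricter number. For spotting the gap between average and peak, the coarse version is usually enough.&lt;/p&gt;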

&lt;h2&gt;Code&lt;/h2&gt;

&lt;p&gt;The middleware and the test are public:&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://github.com/lamas51/claude-code-templates" rel="noopener noreferrer"&gt;github.com/lamas51/claude-code-templates&lt;/a&gt; (case studies folder)&lt;/p&gt;

&lt;p&gt;Same project also has Claude Code agent/skill/hook templates I deploy across Go, Python, and WordPress projects — feel free to fork.&lt;/p&gt;

&lt;h2&gt;About me&lt;/h2&gt;

&lt;p&gt;I'm Boris, an IT pro since 1999. I run production code across Go, Python, and React, mostly for small and mid-size businesses. For the last 18 months I've been working heavily with Claude Code workflows.&lt;/p&gt;

&lt;p&gt;If you have a production Python service throwing similar cascades and want help diagnosing it, I take this kind of work through Fiverr (clean scope, escrow, no off-platform contact):&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://www.fiverr.com/lamastoma" rel="noopener noreferrer"&gt;fiverr.com/lamastoma&lt;/a&gt; — Python / n8n / Telegram bot bug fixing in 24 hours&lt;/p&gt;

&lt;p&gt;Open to questions in the comments — happy to dig into specifics if you're seeing something similar.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Anonymized: no client data. The diagnosis flow and final patch are the actual ones I shipped.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>aiogram</category>
      <category>asyncio</category>
      <category>debugging</category>
    </item>
  </channel>
</rss>
