Boris Kl

Posted on May 20

A Production Python Telegram Bot Was Crashing Every 2 Hours. The Fix Was 18 Lines.

#python #aiogram #asyncio #debugging

"If you see cascading errors, find the first thing that fails and stop reading the log there. Everything after the first failure is the system reacting to the first failure."

A production Python Telegram bot I was looking after started crashing every 2-3 hours. The traceback was a horror show — TelegramRetryAfter, then asyncio.TimeoutError, then sqlite3.OperationalError: database is locked, then 47 leaked sessions, then the process got OOM-killed, then systemd restarted it. Then it happened again, 140 minutes later, like clockwork.

The temptation when you see this kind of cascade is to throw the whole architecture out. "SQLite can't handle our scale, let's move to Postgres." "Bare asyncio is too low-level, let's add a queue." "Let's rewrite it in Go."

I didn't do any of those things. The fix was 18 lines of code in one middleware file. The bot has been up for weeks since.

Here's the diagnosis, the fix, and the takeaway. The code is real (anonymized of any client specifics) and the numbers are real.

The symptoms

Stack: Python 3.12, aiogram 3.x, SQLite for user state, asyncio everywhere. Volume: about 4,000 daily incoming messages. Not high-throughput.

The log every 140 minutes looked like this:

[14:22:01] ERROR  aiogram.TelegramRetryAfter: flood control, retry in 28s
[14:22:03] ERROR  asyncio.TimeoutError in update handler
[14:22:05] WARNING bot.session not closed (47 active)
[14:22:08] ERROR  sqlite3.OperationalError: database is locked
[14:22:14] ERROR  ...same pattern, multiplying...
[14:22:20] ERROR  process killed by OOM
[14:22:21] INFO   systemd: restarted

Process up ~140 minutes. Then the cascade. Then restart. Repeat.

What looked plausible (and was wrong)

When I started looking, the first hypothesis was "SQLite is the bottleneck — it can't handle the concurrency." That's the most obvious thing to say when you see database is locked in a log.

It was wrong. Here's why I dropped it after 30 minutes:

4,000 messages a day is nothing for SQLite. With batched transactions, SQLite can handle tens of thousands of writes per second on modest hardware. If we were hitting a SQLite ceiling, we'd be hitting it under steady load, not in sudden bursts. The 140-minute interval was the giveaway: something was accumulating, not saturating.
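To see why "database is locked" points at contention rather than volume, here is a minimal sketch using only the standard library. The table name, rows, and temp-file path are made up for illustration; the point is that a second connection is rejected while a first one holds the write lock, even when the two touch different rows.

```python
import os
import sqlite3
import tempfile


def second_writer_error() -> str:
    """Return the error a second writer gets while a first writer holds the lock."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")

    # isolation_level=None puts the connection in autocommit mode, so the
    # BEGIN/COMMIT statements below are sent exactly as written.
    first = sqlite3.connect(path, isolation_level=None)
    first.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, state TEXT)")
    first.execute("INSERT INTO users VALUES (1, 'idle'), (2, 'idle')")

    # timeout=0 makes the second connection fail at once instead of waiting.
    second = sqlite3.connect(path, timeout=0, isolation_level=None)

    first.execute("BEGIN IMMEDIATE")  # take the write lock and hold it
    first.execute("UPDATE users SET state = 'busy' WHERE id = 1")
    try:
        # A different row, but the write lock covers the whole database file.
        second.execute("UPDATE users SET state = 'busy' WHERE id = 2")
        return "no error"
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        first.execute("COMMIT")
        first.close()
        second.close()


print(second_writer_error())  # database is locked
```

One slow writer is enough to produce the error at any message volume, which is why the error alone says little about whether SQLite is undersized.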

The second hypothesis was "We're hitting Telegram API rate limits." That's what TelegramRetryAfter literally says. But again, 4,000 messages a day is roughly 1 message every 22 seconds on average. Telegram's bot FAQ puts the broadcast limit at around 30 messages per second overall (and about one message per second within a single chat). On average, we weren't even in the same order of magnitude.

So whatever was happening was bursty, not steady-state. And the bot was somehow turning a steady stream of inbound updates into a burst of outbound API calls.
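The gap between average rate and peak rate is easy to put numbers on. This is a rough sketch: the 4,000 messages/day figure is from the bot, while the arrival timestamps are hypothetical and only there to show how a burst disappears inside a daily average.

```python
from collections import Counter

SECONDS_PER_DAY = 24 * 60 * 60


def average_gap_seconds(messages_per_day: int) -> float:
    """Average time between messages if they were spread evenly over a day."""
    return SECONDS_PER_DAY / messages_per_day


def peak_per_second(arrival_times: list[float]) -> int:
    """Largest number of arrivals that fall in the same one-second bucket."""
    buckets = Counter(int(t) for t in arrival_times)
    return max(buckets.values(), default=0)


print(round(average_gap_seconds(4_000), 1))  # 21.6

# Hypothetical arrivals: a quiet minute, then eight users inside one second.
arrivals = [0.0, 21.0, 43.5] + [60.0 + i * 0.1 for i in range(8)]
print(peak_per_second(arrivals))  # 8
```

The average says "one message every 21.6 seconds"; the peak bucket says "eight at once". Only the second number predicts flood control.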

The actual root cause

Here's what was happening, step by step:

1. A user sends a message. aiogram receives it as an update.
2. The handler runs, does some work, and sends a reply to Telegram.
3. Normally: that reply goes out, the handler returns, the asyncio task ends, the bot.session HTTP connection is released.
4. What actually happened: no throttle middleware existed. If 5-10 users happened to message in the same second (which happens during peak hours), the bot fired 5-10 outbound sendMessage API calls concurrently.
5. Five or ten outbound requests inside one second tripped Telegram's flood control. Telegram answered with 429 Too Many Requests and a retry_after value in the response body.
6. aiogram raised TelegramRetryAfter. But the handler that raised it was waiting on the API response, so it couldn't release its HTTP session until the retry window closed (28 seconds in the log above).
7. While that handler was waiting, the next inbound update hit the same handler code. Another async task spawned. Another bot.session connection opened. Another wait.
8. Now we have two stuck tasks, each holding a connection, each blocked on retry_after. Both tasks also need to update the user's row in SQLite. SQLite has no row-level locks: the first writer locks the whole database file and the second writer waits. If that wait outlasts the busy timeout, the second writer fails with database is locked.
9. Multiply this by 10 minutes of bursty traffic. Now you have 47 leaked sessions, SQLite writers timing out on the lock, and a Python process eating memory because tasks aren't completing.
10. OOM killer hits. Systemd restarts. Cycle resets.

The cascade had one cause: no rate limit on the bot's inbound side. Everything downstream was just the system reacting to the upstream pressure.
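The pile-up in the steps above can be modeled in a few lines. This is a back-of-the-envelope model, not a measurement: the 28-second hold comes from the log, but the arrival pattern (one update every 2 seconds) and the 0.3-second healthy handler time are assumptions chosen for illustration.

```python
def peak_in_flight(arrivals: list[float], hold_seconds: float) -> int:
    """Peak number of handlers alive at once if each one holds its
    connection for hold_seconds after it starts."""
    events = []
    for start in arrivals:
        events.append((start, 1))                  # handler starts
        events.append((start + hold_seconds, -1))  # handler finishes
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a finishing handler is counted before a starting one.
    events.sort()

    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Hypothetical traffic: one update every 2 seconds for a minute.
arrivals = [i * 2.0 for i in range(30)]

print(peak_in_flight(arrivals, hold_seconds=0.3))   # 1
print(peak_in_flight(arrivals, hold_seconds=28.0))  # 14
```

Same traffic, same handler code. The only thing that changed is how long each task holds on, and the number of open connections goes from 1 to 14.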

The fix — 18 lines

A throttle middleware. Drop incoming updates from a user if they already had a message in the last second. That's it.

# middleware.py
from aiogram import BaseMiddleware
from aiogram.types import Update
from cachetools import TTLCache


class ThrottleMiddleware(BaseMiddleware):
    """Drop second-message-within-N-seconds per user.

    Without this, bursty inbound traffic translates 1:1 into bursty
    outbound API calls and trips Telegram's flood control.
    """

    def __init__(self, rate_limit: float = 1.0):
        self.cache = TTLCache(maxsize=10_000, ttl=rate_limit)

    async def __call__(self, handler, event: Update, data):
        # from_user is Optional on aiogram's Message type, so guard against None.
        user = event.message.from_user if event.message else None
        if user is None:
            return await handler(event, data)  # nothing to throttle on
        if user.id in self.cache:
            return  # silently drop: user is over their rate limit
        self.cache[user.id] = True
        return await handler(event, data)
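If you want to poke at the throttling logic without aiogram or cachetools installed, here is a standard-library sketch of the same idea. It is not the code that shipped: SimpleThrottle, its allow method, and the injectable clock are names invented for this example, and a plain dict stands in for TTLCache.

```python
import time
from typing import Callable, Hashable


class SimpleThrottle:
    """Allow one event per key per rate_limit seconds; reject the rest."""

    def __init__(self, rate_limit: float = 1.0,
                 clock: Callable[[], float] = time.monotonic):
        self.rate_limit = rate_limit
        self.clock = clock
        self._last_allowed: dict[Hashable, float] = {}

    def allow(self, key: Hashable) -> bool:
        now = self.clock()
        last = self._last_allowed.get(key)
        if last is not None and now - last < self.rate_limit:
            return False  # still inside the window: drop
        self._last_allowed[key] = now
        return True


# A fake clock makes the behaviour easy to check without sleeping.
now = [0.0]
throttle = SimpleThrottle(rate_limit=1.0, clock=lambda: now[0])

print(throttle.allow(123))  # True: first message
print(throttle.allow(123))  # False: same user, same second
print(throttle.allow(456))  # True: different user
now[0] = 1.5
print(throttle.allow(123))  # True: the window has passed
```

One difference worth knowing: the dict here grows without bound, while TTLCache evicts expired entries and caps the size at maxsize. For a long-running bot that matters, which is a good reason to stay with TTLCache in production.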

And wire it up plus a clean shutdown:

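# main.py
import os

from aiogram import Bot, Dispatcher

from middleware import ThrottleMiddleware

# Assumption: the token is read from an environment variable named BOT_TOKEN.
BOT_TOKEN = os.environ["BOT_TOKEN"]

bot = Bot(token=BOT_TOKEN)
dp = Dispatcher()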

dp.update.middleware(ThrottleMiddleware(rate_limit=1.0))


async def on_shutdown():
    """Close the bot session explicitly. Otherwise sessions leak
    on graceful shutdown and the next start hits a connection pool
    in a weird state.
    """
    await bot.session.close()


dp.shutdown.register(on_shutdown)

That's 18 lines of production code plus one test:

# test_middleware.py
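# Requires pytest-asyncio (async tests) and pytest-mock (the mocker fixture).
from types import SimpleNamespace

import pytest
from middleware import ThrottleMiddleware


def make_event(user_id):
    """Duck-typed stand-in for an aiogram Update. The middleware only
    reads event.message.from_user.id, so that is all we build."""
    user = SimpleNamespace(id=user_id)
    return SimpleNamespace(message=SimpleNamespace(from_user=user))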


@pytest.mark.asyncio
async def test_throttle_drops_rapid_second_message(mocker):
    middleware = ThrottleMiddleware(rate_limit=1.0)
    handler = mocker.AsyncMock(return_value="processed")

    event = make_event(user_id=123)  # helper to build a fake aiogram Update

    # First message — goes through
    result1 = await middleware(handler, event, {})
    assert result1 == "processed"

    # Second message same user, same second — dropped
    result2 = await middleware(handler, event, {})
    assert result2 is None

    handler.assert_called_once()

That's the whole patch.

Why this works

The fix doesn't make SQLite faster. It doesn't add a queue. It doesn't change anything about how the handlers process messages. It just stops the upstream pressure before it cascades downstream.

Once incoming updates are rate-limited per-user at 1 per second, the bot never has 10 concurrent outbound API calls. It has at most 1-2. Telegram never gets angry. TelegramRetryAfter never fires. Handlers never get stuck waiting. Sessions never leak. SQLite never has two writers queued up behind the same lock.

The cascade isn't a chain. It's a tree, and the throttle cuts the tree at the root.

The result

Numbers (real, from production):

First 4 hours after deploy: zero TelegramRetryAfter. Zero TimeoutError. Session count stable at 1-2 (vs. climbing past 40 every two hours before).
First 24 hours: zero errors of any kind in the log.
First 7 days: zero crashes. Zero systemd restarts.

Bot has been up continuously since deploy. Same SQLite. Same asyncio. Same handlers. The only thing that changed is the throttle middleware.

What I'd tell a junior on the team

A few generic takeaways that apply far beyond this specific bug:

1. Find the first failure in the log and stop reading. When you see cascading errors, everything after the first failure is the system reacting to the first failure. Don't try to "fix" the downstream errors. Find the upstream cause.

2. Missing upstream backpressure is the cause maybe 80% of the time, in my experience, when you see async-Python cascades. When the downstream component (SQLite, HTTP client, worker pool) looks stuck, it's almost always waiting for something the upstream is doing too fast. Rate-limit the upstream first.

3. The temptation to rewrite is almost always wrong early in diagnosis. "Rewrite in Go" / "switch to Postgres" / "add a queue" are valid responses to real scale problems. They're not valid responses to "I haven't figured out the bug yet." Spend an hour with the actual logs first.

4. Volume matters less than burstiness. A system handling 4k messages/day average can absolutely fall over from 10 messages in one second. The metric you care about is peak concurrency, not total throughput.

5. Test the throttle as a unit, not as an integration. The fix above has one test (12 lines). It doesn't try to spin up a real bot. It just verifies the middleware behavior in isolation. That's enough — the actual production behavior is downstream of this contract holding.
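Takeaway 1 is mechanical enough to script. The sketch below assumes the log format shown earlier ([HH:MM:SS] LEVEL message) and treats any ERROR that arrives after a quiet gap as the start of a new cascade. The function name, the 5-minute gap, and the two 16:42 lines are assumptions added for illustration; they are not from the real log.

```python
import re
from datetime import datetime, timedelta

LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s+(\w+)\s+(.*)$")


def first_failures(lines: list[str],
                   quiet_gap: timedelta = timedelta(minutes=5)) -> list[str]:
    """Return the first ERROR line of each cascade.

    A new cascade starts when an ERROR appears more than quiet_gap after
    the previous ERROR, or when there was no previous one. Timestamps
    carry no date, so a log that crosses midnight needs extra handling.
    """
    firsts = []
    last_error_at = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match.group(2) != "ERROR":
            continue
        at = datetime.strptime(match.group(1), "%H:%M:%S")
        if last_error_at is None or at - last_error_at > quiet_gap:
            firsts.append(line)
        last_error_at = at
    return firsts


log = [
    "[14:22:01] ERROR  aiogram.TelegramRetryAfter: flood control, retry in 28s",
    "[14:22:03] ERROR  asyncio.TimeoutError in update handler",
    "[14:22:05] WARNING bot.session not closed (47 active)",
    "[14:22:08] ERROR  sqlite3.OperationalError: database is locked",
    "[14:22:20] ERROR  process killed by OOM",
    "[14:22:21] INFO   systemd: restarted",
    # Hypothetical next cycle, 140 minutes later:
    "[16:42:02] ERROR  aiogram.TelegramRetryAfter: flood control, retry in 28s",
    "[16:42:04] ERROR  asyncio.TimeoutError in update handler",
]

for line in first_failures(log):
    print(line)
```

On this sample it prints the two TelegramRetryAfter lines and nothing else. If the same error opens every cascade, that is where the debugging should start.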

Code

The middleware and the test are public:

→ github.com/lamas51/claude-code-templates (case studies folder)

Same project also has Claude Code agent/skill/hook templates I deploy across Go, Python, and WordPress projects — feel free to fork.

About me

I'm Boris, an IT pro since 1999. I run production code across Go, Python, and React, mostly for small and mid-size businesses. For the last 18 months I've leaned heavily on Claude Code workflows.

If you have a production Python service throwing similar cascades and want help diagnosing it, I take this kind of work through Fiverr (clean scope, escrow, no off-platform contact):

→ fiverr.com/lamastoma — Python / n8n / Telegram bot bug fixing in 24 hours

Open to questions in the comments — happy to dig into specifics if you're seeing something similar.

Anonymized — no client data, the diagnosis flow and final patch are the actual ones I shipped.
