<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: anna-maree morris</title>
    <description>The latest articles on DEV Community by anna-maree morris (@annamareemorris).</description>
    <link>https://dev.to/annamareemorris</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4007188%2Fbd323bb5-a87f-4b5b-9b62-7a9493249de4.jpg</url>
      <title>DEV Community: anna-maree morris</title>
      <link>https://dev.to/annamareemorris</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/annamareemorris"/>
    <language>en</language>
    <item>
      <title>Your Python rate limiter is lying to you the moment you add a second server</title>
      <dc:creator>anna-maree morris</dc:creator>
      <pubDate>Mon, 29 Jun 2026 03:06:03 +0000</pubDate>
      <link>https://dev.to/annamareemorris/your-python-rate-limiter-is-lying-to-you-the-moment-you-add-a-second-server-2df5</link>
      <guid>https://dev.to/annamareemorris/your-python-rate-limiter-is-lying-to-you-the-moment-you-add-a-second-server-2df5</guid>
      <description>&lt;p&gt;Most rate-limiter tutorials show you a tidy little token bucket that works perfectly — on one machine. Then you deploy to production, where you're running three copies of your app behind a load balancer, and the limiter quietly stops doing its job. Nobody gets an error. Nothing crashes. Your "100 requests per minute" just silently becomes 300, and you don't find out until something downstream falls over.&lt;/p&gt;

&lt;p&gt;This post is about &lt;em&gt;why&lt;/em&gt; that happens, a small demo you can run to see it, and the one change that fixes it.&lt;/p&gt;

&lt;h2&gt;The limiter that works on your laptop&lt;/h2&gt;

&lt;p&gt;Here's a textbook in-memory token bucket. The maths is correct: tokens refill at a fixed rate, a request spends one, and you reject when the bucket is empty.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TokenBucketLimiter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;refill_rate&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;capacity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;refill_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;refill_rate&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;capacity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;allow_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;capacity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;refill_rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On a single process, this is fine. The problem is the word &lt;em&gt;single&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;Problem one: every server counts in private&lt;/h2&gt;

&lt;p&gt;The state (&lt;code&gt;self.tokens&lt;/code&gt; and &lt;code&gt;self.last&lt;/code&gt;) lives in the memory of one process. Run two copies of your app and each has its own bucket. With traffic spread across them, the limit you &lt;em&gt;think&lt;/em&gt; you set is multiplied by the number of instances, workers, or containers you're running:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;intended limit:  100/min
3 servers:        300/min   (each counts on its own)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't a bug in the code. It's the code doing exactly what in-memory state does: not sharing. To enforce one limit across many servers, the count has to live somewhere all of them can see — like Redis.&lt;/p&gt;
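&lt;p&gt;To see the multiplication without deploying anything, here is a minimal sketch that round-robins a burst of requests over several private buckets, roughly the way a load balancer would. It is an illustration under assumptions, not a benchmark: the &lt;code&gt;clock&lt;/code&gt; parameter, &lt;code&gt;frozen_clock&lt;/code&gt; and &lt;code&gt;count_granted&lt;/code&gt; are hypothetical additions, and the clock is frozen so that no refill happens and the count is reproducible.&lt;/p&gt;

```python
import itertools
import time


class TokenBucketLimiter:
    """Same in-memory token bucket as above, with an injectable clock."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow_request(self):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def frozen_clock():
    """Clock that never advances, so no tokens are refilled during the burst."""
    return 0.0


def count_granted(num_servers, num_requests, capacity=100):
    """Round-robin num_requests over num_servers private buckets and count grants."""
    servers = [
        TokenBucketLimiter(capacity, capacity / 60, clock=frozen_clock)
        for _ in range(num_servers)
    ]
    balancer = itertools.cycle(servers)  # stand-in for a round-robin load balancer
    return sum(next(balancer).allow_request() for _ in range(num_requests))


if __name__ == "__main__":
    print("1 server: ", count_granted(1, 500))  # 100
    print("3 servers:", count_granted(3, 500))  # 300
```

&lt;p&gt;With one bucket the burst is capped at 100. With three, each bucket grants its own 100, so 300 requests get through.&lt;/p&gt;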

&lt;h2&gt;Problem two: even with shared state, the naive fix still leaks&lt;/h2&gt;

&lt;p&gt;So you reach for Redis and write the obvious thing: read the token count, do the maths in Python, write it back.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# read
# ... refill + check in Python ...
&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                 &lt;span class="c1"&gt;# write
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks shared, and it is — but it's still wrong, because the read and the write are two separate trips to Redis with your Python logic in between. Under concurrency, two requests can both read the same balance &lt;em&gt;before&lt;/em&gt; either writes back, and both decide they're allowed. That's a classic read-modify-write race, and it gets worse the more traffic you have — exactly when you need the limiter most.&lt;/p&gt;

&lt;p&gt;How bad is it? Here's a tiny experiment: fire 50 concurrent requests at a bucket with a capacity of 10.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capacity 10  ·  50 concurrent requests
naive read-modify-write  -&amp;gt;  granted 42   (over by 32)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Forty-two grants from a bucket that should allow ten. The limiter isn't limiting.&lt;/p&gt;
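&lt;p&gt;The interleaving can be reproduced without Redis. The sketch below is a simulation, not the experiment above: a plain dict stands in for the shared Redis key, and a &lt;code&gt;threading.Barrier&lt;/code&gt; (my addition) forces the worst case, where all 50 requests read the balance before any of them writes. A real run is not forced like this and lands somewhere between 10 and 50 depending on timing.&lt;/p&gt;

```python
import threading

CAPACITY = 10
REQUESTS = 50

store = {"tokens": CAPACITY}               # stand-in for the shared Redis key
all_have_read = threading.Barrier(REQUESTS)
results = [False] * REQUESTS               # one slot per request, so no shared counter


def naive_allow(i):
    tokens = store["tokens"]               # read
    all_have_read.wait()                   # worst case: everyone reads before anyone writes
    if tokens >= 1:                        # check, using a value that is already stale
        store["tokens"] = tokens - 1       # write
        results[i] = True


threads = [threading.Thread(target=naive_allow, args=(i,)) for i in range(REQUESTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

granted = sum(results)
print(f"granted {granted} of {REQUESTS}, stored balance is now {store['tokens']}")
```

&lt;p&gt;Every request saw a balance of 10, so every request was granted, and the stored balance ends at 9 as if only one token had been spent.&lt;/p&gt;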

&lt;h2&gt;The fix: make the whole decision atomic&lt;/h2&gt;

&lt;p&gt;It leaks because the decision is spread across multiple Redis calls. The fix is to make the &lt;em&gt;entire&lt;/em&gt; read-check-spend happen as one indivisible operation on the Redis server, using a Lua script. Redis runs a script to completion before it serves any other command, so no other request can interleave between the read and the write.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- runs atomically on the Redis server&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'HGET'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;'tokens'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;-- (refill from elapsed time, clamp to capacity) ...&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'HSET'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;'tokens'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="c1"&gt;-- allowed&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;       &lt;span class="c1"&gt;-- rejected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same experiment, same burst, with the decision moved into one atomic script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capacity 10  ·  50 concurrent requests
atomic Lua script  -&amp;gt;  granted 10   (holds)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exactly ten. The line holds, no matter how many requests arrive at once or how many servers they come from.&lt;/p&gt;

&lt;p&gt;Two details worth getting right while you're in there. First, read the current time from Redis (its &lt;code&gt;TIME&lt;/code&gt; command) rather than from each app server's clock, so independent servers don't disagree about elapsed time. On Redis versions older than 5.0, a script that calls &lt;code&gt;TIME&lt;/code&gt; and then writes typically has to call &lt;code&gt;redis.replicate_commands()&lt;/code&gt; first. Second, decide explicitly what happens if Redis is unreachable: fail closed to protect the backend, or fail open to keep serving. That's a real decision, not a default to stumble into.&lt;/p&gt;
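&lt;p&gt;One way to keep the outage policy explicit is to route every check through a small wrapper. This is a sketch with assumptions: &lt;code&gt;LimiterUnavailable&lt;/code&gt;, &lt;code&gt;guarded_allow&lt;/code&gt; and &lt;code&gt;store_is_down&lt;/code&gt; are hypothetical names. With redis-py you would more likely catch &lt;code&gt;redis.exceptions.ConnectionError&lt;/code&gt; and &lt;code&gt;redis.exceptions.TimeoutError&lt;/code&gt; instead.&lt;/p&gt;

```python
class LimiterUnavailable(Exception):
    """Hypothetical error for 'the shared store could not be reached'."""


def guarded_allow(check, fail_open):
    """Run check(); if the store is down, return the policy chosen up front.

    fail_open=True  keeps serving traffic while the limiter is blind.
    fail_open=False rejects traffic to protect whatever sits behind the limiter.
    """
    try:
        return check()
    except LimiterUnavailable:
        return fail_open


def store_is_down():
    """Stand-in for a limiter call made while Redis is unreachable."""
    raise LimiterUnavailable("connection refused")


if __name__ == "__main__":
    print("fail open:  ", guarded_allow(store_is_down, fail_open=True))   # True
    print("fail closed:", guarded_allow(store_is_down, fail_open=False))  # False
```

&lt;p&gt;Making &lt;code&gt;fail_open&lt;/code&gt; a required argument means nobody gets a policy by accident: every call site has to state which way it fails.&lt;/p&gt;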

&lt;h2&gt;Why this matters more than it looks&lt;/h2&gt;

&lt;p&gt;A rate limiter that over-grants under load can be worse than no limiter, because it gives you false confidence. It passes every single-process test you write on your laptop and then fails silently in the one environment it exists for: production, under concurrency, across servers. The only way to trust one is to test it the way it'll actually be hit, with hundreds of simultaneous requests at a single bucket, and assert that it never exceeds capacity.&lt;/p&gt;
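&lt;p&gt;A test along those lines might look like the sketch below. &lt;code&gt;count_grants&lt;/code&gt; and &lt;code&gt;LockedBucket&lt;/code&gt; are hypothetical: the bucket is an in-memory stand-in with no refill, so the expected count is exact, and a lock plays the role of Redis running one script at a time. In a real suite you would pass your Redis-backed &lt;code&gt;allow_request&lt;/code&gt; instead.&lt;/p&gt;

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def count_grants(allow, requests, workers=64):
    """Call allow() `requests` times from a thread pool and count the grants."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(allow) for _ in range(requests)]
        return sum(f.result() for f in futures)


class LockedBucket:
    """In-memory stand-in for an atomic limiter (no refill, so counts are exact)."""

    def __init__(self, capacity):
        self.tokens = capacity
        self.lock = threading.Lock()

    def allow_request(self):
        with self.lock:  # read, check and spend as one indivisible step
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False


def test_never_exceeds_capacity():
    bucket = LockedBucket(capacity=10)
    assert count_grants(bucket.allow_request, requests=500) == 10


if __name__ == "__main__":
    test_never_exceeds_capacity()
    print("500 concurrent requests, 10 granted")
```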

&lt;p&gt;If you'd rather not build and test this yourself, I package exactly this as a single-file, fully-tested drop-in (the atomic Lua limiter plus the concurrency test suite that proves the guarantee). It's &lt;a href="https://annamaree.gumroad.com/l/hkjqga" rel="noopener noreferrer"&gt;here&lt;/a&gt;. But the technique above is the important part — whether you buy it, copy it, or roll your own, move the decision into one atomic operation and your limiter will tell the truth.&lt;/p&gt;

</description>
      <category>python</category>
      <category>redis</category>
      <category>webdev</category>
      <category>backend</category>
    </item>
  </channel>
</rss>
