<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Piko Monde</title>
    <description>The latest articles on DEV Community by Piko Monde (@pikomonde).</description>
    <link>https://dev.to/pikomonde</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F173108%2F8994e825-cbe9-4e07-aedf-1d52ea2593c9.png</url>
      <title>DEV Community: Piko Monde</title>
      <link>https://dev.to/pikomonde</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pikomonde"/>
    <language>en</language>
    <item>
      <title>How We Saved a High-Traffic IoT Service from 200 RPS to 20,000+ RPS (and a $42k+ AWS Bill)</title>
      <dc:creator>Piko Monde</dc:creator>
      <pubDate>Sun, 01 Mar 2026 17:12:19 +0000</pubDate>
      <link>https://dev.to/pikomonde/how-we-saved-a-high-traffic-iot-service-from-200-rps-to-20000-rps-and-a-42k-aws-bill-14j1</link>
      <guid>https://dev.to/pikomonde/how-we-saved-a-high-traffic-iot-service-from-200-rps-to-20000-rps-and-a-42k-aws-bill-14j1</guid>
      <description>&lt;p&gt;In the world of high-concurrency systems, throwing more hardware at a problem is often the most expensive way to fail. Recently, I revisited some investigation logs and Go &lt;code&gt;pprof&lt;/code&gt; profiles from a project I handled four years ago as a contractor for an Automobile &lt;strong&gt;&lt;em&gt;IoT&lt;/em&gt;&lt;/strong&gt; company. At the time, the company was managing telemetry for tens of thousands of connected vehicles.&lt;/p&gt;

&lt;p&gt;The service was struggling with massive CPU utilization and scaling issues. Despite being backed by a significant cloud budget, the infrastructure was buckling under a load that, on paper, should have been manageable. This is a story of how we moved from a state of "throwing money at the fire" to a lean, high-performance architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Infrastructure Bottleneck: Roughly 27 Nodes for 200 RPS
&lt;/h2&gt;

&lt;p&gt;My first task at this company was to optimize our gateway server. Since I was a contractor and didn't have direct access yet, the Engineering Lead and I sat down to review the dashboard together. When he showed me the infrastructure, I was floored.&lt;/p&gt;

&lt;p&gt;The service was running on a cluster of roughly &lt;strong&gt;27 large compute instances&lt;/strong&gt; on AWS. As a contractor I wasn’t shown the billing console directly, but back-of-the-envelope math told the story: at on-demand pricing for large compute-optimized instances (I’m using c6g.16xlarge for this calculation), this cluster was burning somewhere &lt;strong&gt;north of $42,000 a month&lt;/strong&gt; — for this one service layer alone. Worse, CPU usage across all nodes was pegged at 90%+.&lt;/p&gt;
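
&lt;p&gt;For readers who want to redo the back-of-the-envelope math, here is a minimal sketch. The hourly rate and the 730-hour month are assumptions for illustration, not billing data; on-demand prices vary by region and change over time, so check current pricing before reusing the number.&lt;/p&gt;

```go
package main

import "fmt"

// monthlyOnDemandCost estimates a monthly bill from a node count, an hourly
// on-demand rate in USD, and the number of billable hours in a month.
func monthlyOnDemandCost(nodes int, hourlyUSD float64, hoursPerMonth float64) float64 {
	return float64(nodes) * hourlyUSD * hoursPerMonth
}

func main() {
	// Assumptions, not billing data: 27 nodes, an illustrative hourly rate of
	// 2.176 USD for c6g.16xlarge, and 730 hours in an average month.
	cost := monthlyOnDemandCost(27, 2.176, 730)
	fmt.Printf("estimated monthly cost: $%.2f\n", cost)
}
```

&lt;p&gt;With these assumed inputs the estimate comes to roughly $42.9k per month, which is where the "north of $42,000" figure comes from.&lt;/p&gt;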

&lt;p&gt;The Engineering Manager felt his hands were tied; his schedule was packed with meetings, and the rest of the team was primarily focused on DevOps. I was essentially the only one available, and I had been hired specifically to fix this mess. Looking at the code, the EM had a hunch: &lt;em&gt;"I think it's the RBAC logic, especially when unmarshalling the user entity from Redis cache,"&lt;/em&gt; he told me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identifying the Bottleneck: A Lesson in Profiling
&lt;/h2&gt;

&lt;p&gt;To understand why a cluster that large couldn’t handle a measly 200 RPS, I started my investigation using API call samples provided by one of the DevOps engineers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up Observability
&lt;/h3&gt;

&lt;p&gt;I implemented the &lt;code&gt;net/http/pprof&lt;/code&gt; package to capture live data under a simulated load, by adding a simple blank import:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="s"&gt;"net/http/pprof"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that import in place, the service exposed debugging endpoints under &lt;code&gt;/debug/pprof/&lt;/code&gt;. The package registers its handlers on Go's default HTTP mux as a side effect of being imported, so this works as long as the service serves that mux.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Local Benchmark Discrepancy
&lt;/h3&gt;

&lt;p&gt;I began the investigation using &lt;strong&gt;&lt;em&gt;go-wrk&lt;/em&gt;&lt;/strong&gt;. I ran a standard 5-second test with 10 concurrent connections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go-wrk &lt;span class="nt"&gt;-d&lt;/span&gt; 5 &lt;span class="nt"&gt;-c&lt;/span&gt; 10 http://localhost:8080/v1/telemetry
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To my surprise, my local environment was hitting nearly &lt;strong&gt;2,000 RPS&lt;/strong&gt;, ten times the production capacity. This discrepancy was confusing. I initially thought the bottleneck might be related to minor inefficiencies in the code that only compounded at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  The First Round of Optimizations
&lt;/h3&gt;

&lt;p&gt;At this point, I performed optimizations based on the initial &lt;code&gt;pprof&lt;/code&gt; data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RBAC JSON Unmarshalling:&lt;/strong&gt; The service used a Redis-based Role-Based Access Control (RBAC) system. For every request, it fetched a massive User struct from Redis and unmarshalled the entire JSON object to check a single permission bit. I refactored this to store simple boolean flags or bitmasks in Redis. This reduced the CPU time spent on JSON decoding by nearly 30%.&lt;/p&gt;

&lt;p&gt;The profile at this stage looked something like this — JSON unmarshalling is visible, eating around 20–30% of CPU, but nothing that would explain a full cluster meltdown:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feooj4c503ueycnyes2eq.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feooj4c503ueycnyes2eq.webp" alt="Figure 1A: Initial pprof profile — RBAC JSON unmarshalling is the largest consumer, but the overall picture still looks manageable. (Reproduced on a test service for illustration; the original production profiles are proprietary.)" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I thought I had found the solution. But when we deployed the RBAC fix to production, nothing changed. The CPU stayed at 100%, and the throughput stayed at 200 RPS.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Breakthrough: Production vs. Local Payloads
&lt;/h2&gt;

&lt;p&gt;The "Eureka" moment happened during a 1-on-1 session with my Engineering Manager. We compared my local requests with actual production traffic and realized I was &lt;strong&gt;missing a specific header&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When I added that header and ran &lt;code&gt;pprof&lt;/code&gt; again, the profile revealed something entirely different. The request wasn't hitting the standard user path; it was being routed to a specific IoT middleware. That middleware hid an expensive step: it wasn't checking session or refresh tokens. Instead, it was &lt;strong&gt;validating every single incoming telemetry token using Bcrypt&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As soon as I updated my local &lt;code&gt;go-wrk&lt;/code&gt; script to include the production-spec device headers, my local CPU instantly spiked to 100%, and my RPS plummeted to exactly what we saw in production: &lt;strong&gt;200 RPS&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is what the new profile looked like. Same service, same endpoint — spot the difference:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uvt3n90qnosc9da6b48.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uvt3n90qnosc9da6b48.webp" alt="Figure 1B: Profile after adding the IoT device header. bcrypt.CompareHashAndPassword has consumed almost everything — there is almost nothing else left. (Reproduced on a test service for illustration.)" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bcrypt Implementation Issue: Self-Inflicted DDoS
&lt;/h2&gt;

&lt;p&gt;The resulting Flame Graph was unmistakable. A single function was consuming over 80% of the total CPU time: &lt;code&gt;bcrypt.CompareHashAndPassword&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We discovered that for every single telemetry ping, sent by thousands of vehicles every few seconds, the system was executing a &lt;strong&gt;Bcrypt&lt;/strong&gt; comparison.&lt;/p&gt;

&lt;p&gt;From an architectural perspective, this is a catastrophic anti-pattern. &lt;strong&gt;Bcrypt&lt;/strong&gt; is designed by cryptographers to be &lt;strong&gt;slow&lt;/strong&gt;. Its "cost factor" sets the amount of work per hash, and each increment doubles that work, so that even with massive hardware, brute-forcing a password takes an eternity. It is meant for login endpoints, not telemetry endpoints.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;We were essentially DDoS-ing ourselves.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uk9g002bi7bg3cgvfk7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uk9g002bi7bg3cgvfk7.jpg" alt="Production CPU at 100%. “Oh, we are DDoS-ing ourselves.” — This is Fine, KC Green, Gunshow comic (2013). Via Know Your Meme." width="716" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the intention was to have high security for every data point, the implementation was impractical. For high-frequency IoT data, a common pattern is to use Bcrypt only during the initial handshake, exchanging credentials for a lightweight session token or a short-lived JWT (JSON Web Token). Once a device is authenticated, subsequent telemetry is validated with a symmetric-key signature check or a token lookup, both of which are orders of magnitude faster than Bcrypt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Caching and Singleflight
&lt;/h2&gt;

&lt;p&gt;Since we couldn't update the firmware of thousands of vehicles overnight, we had to build a server-side "shield".&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementing the Cache
&lt;/h3&gt;

&lt;p&gt;I consulted with the Security team first. The whole point of Bcrypt is to prevent brute-forcing, so I asked: "If we cache the successful validation result, is it still secure?" They gave us the green light.&lt;/p&gt;

&lt;p&gt;We introduced a caching layer. After a successful Bcrypt validation, we stored the &lt;strong&gt;SHA-256 hash&lt;/strong&gt; of the device credentials in Redis with a &lt;strong&gt;30-second TTL&lt;/strong&gt; (Time to Live). For any subsequent request within that window, the server would simply compare the SHA-256 hashes, a process that takes nanoseconds, instead of running the Bcrypt algorithm.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resolving the Thundering Herd with Singleflight
&lt;/h3&gt;

&lt;p&gt;After deploying the cache, we saw a massive improvement, but every 30 seconds, the CPU would spike again. This was the &lt;strong&gt;"Thundering Herd"&lt;/strong&gt; problem. When the cache expired, all concurrent requests for the same device would see a cache miss and simultaneously trigger a Bcrypt operation.&lt;/p&gt;

&lt;p&gt;To resolve this, we implemented &lt;code&gt;golang.org/x/sync/singleflight&lt;/code&gt;. The logic was simple: for a given device ID, only one Bcrypt operation should be "in flight" at any given time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Example of the logic we implemented&lt;/span&gt;
&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deviceID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Only one goroutine per deviceID executes this at a time&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;validateBcrypt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storedHash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;providedPassword&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By using &lt;strong&gt;&lt;em&gt;&lt;u&gt;singleflight&lt;/u&gt;&lt;/em&gt;&lt;/strong&gt;, if 1,000 requests for Device-A arrive at the same time during a cache miss, &lt;strong&gt;one&lt;/strong&gt; request performs the Bcrypt check, while the other 999 wait for that single result.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on multi-instance behavior:&lt;/strong&gt; Singleflight is local to the instance. In our multi-node setup, we might still perform one Bcrypt operation per instance during a cache miss — but this was a negligible cost compared to the thousands of operations we were doing previously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A note on security:&lt;/strong&gt; While this fixed the performance, the system remains somewhat vulnerable to a focused DDoS. If an attacker cycles through thousands of different invalid tokens, the CPU would still spike because each unique invalid token triggers a new Bcrypt operation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Results: 90%+ Reduction Across the Board
&lt;/h2&gt;

&lt;p&gt;The impact was immediate. During a final 1-on-1, my Engineering Manager decided to test the limits. He scaled the environment down to just two small/medium instances.&lt;/p&gt;

&lt;p&gt;Here is what changed under the hood — from every request triggering bcrypt directly, to a layered shield where 99% of requests never touch it at all:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9eq8tbdtvw8nx936byu.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9eq8tbdtvw8nx936byu.webp" alt="Request flow before (left) and after (right) the cache + singleflight fix. Cache hits return in nanoseconds; bcrypt is now rare and controlled." width="680" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We watched the dashboard in silence. The system didn't just survive; it thrived.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;200 RPS&lt;/td&gt;
&lt;td&gt;20,000+ RPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU Utilization&lt;/td&gt;
&lt;td&gt;90–100%&lt;/td&gt;
&lt;td&gt;&amp;lt;50%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instance count&lt;/td&gt;
&lt;td&gt;~27 nodes&lt;/td&gt;
&lt;td&gt;2 smaller instances&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Est. monthly cost&lt;/td&gt;
&lt;td&gt;North of $42k/mo*&lt;/td&gt;
&lt;td&gt;Drastically reduced&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*&lt;em&gt;Estimated from node count and on-demand pricing for the instance family. Exact billing figures were not shared with me as a contractor.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After that meeting, I received an email that honestly made me a bit emotional. The EM sent a company-wide announcement, crediting me for fixing the system's core stability and drastically slashing the AWS bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Profile early, profile often.&lt;/strong&gt; Don't guess where the bottleneck is. Use &lt;code&gt;pprof&lt;/code&gt; to see exactly where the CPU cycles are going.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production parity matters.&lt;/strong&gt; My local tests were misleading at first because they didn't send the headers that production devices send, so they never reached the slow code path. Always ensure your load tests mimic real-world traffic patterns, including headers and metadata.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security vs. performance.&lt;/strong&gt; Security is paramount, but expensive algorithms like Bcrypt don't belong in high-frequency "hot paths." Use them for handshakes, not for every data packet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Beware of the Thundering Herd.&lt;/strong&gt; Caching is not a silver bullet. When dealing with high concurrency, always consider what happens when the cache expires. Tools like &lt;code&gt;singleflight&lt;/code&gt; are essential in your Go toolkit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure is not a substitute for optimization.&lt;/strong&gt; Scaling horizontally can hide bad code for a while, but eventually, the technical debt will become too expensive to ignore.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;More posts at &lt;a href="https://blog.pikomo.top?utm_source=dev.to"&gt;blog.pikomo.top&lt;/a&gt; · &lt;a href="https://github.com/pikomonde" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. If this saved you some debugging time, &lt;a href="https://ko-fi.com/pikomonde" rel="noopener noreferrer"&gt;Ko-fi&lt;/a&gt; is always appreciated.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>pprof</category>
      <category>go</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
