Oded Keren
When a Few Noisy Devices Took Down the System: Lessons from a Production Investigation

We investigated a production incident that started during a canary IoT firmware OTA rollout and quickly escalated into a system-down alarm.

From the outside, the symptoms looked familiar: EC2 instances became slow and stuck, SQS backlog grew, retries piled up, and API latency increased across the system. At first, it did not look like a dramatic traffic spike, and that was part of what made the incident confusing. The real story only appeared once we zoomed in enough.

The trigger

A new firmware version was rolled out gradually in a canary release. That firmware had a bug related to the control panel sensor sensitivity in our device. As a result, affected machines started reporting rapid on/off transitions multiple times per second. Each transition generated an EWI event to the cloud.

The rollout was still in an early canary stage. Only a few hundred devices at most had received the version, yet that was enough to seriously impact the system. That was the first sign that this was not simply a "high traffic" story.

The event flow

The relevant system flow looked roughly like this:

(diagram: System Event Flows)

Depending on the event type and code path, processing could lead to multiple scenarios:

  • updates to Aurora-backed device data
  • inserts into a DynamoDB history table
  • push notifications to users
  • data notifications that triggered app refresh flows
  • in some scenarios, calls to IoT Fleet Index as part of device-related checks

At that point in the investigation, the picture was still incomplete. Yes, the affected devices were generating many more EWI events than expected. But that alone still did not explain why the system became unhealthy so quickly.

What made the incident especially interesting was that it was triggered by only a relatively small subset of devices.

What we saw first

At the beginning of the investigation, we looked at which APIs were slowest and which were being called most frequently. That helped, but it was also misleading at first: overall traffic did not look dramatically different, so this was not a case of the whole fleet suddenly generating a huge visible spike.

Some APIs were slow, yet at other times the same APIs behaved normally, and that intermittency made the picture unclear. The real change came from a relatively small subset of devices generating EWI events at a very high rate, and that alone was enough to push shared backend resources into contention.

Still, one thing became clear fairly early: the issue was not isolated to only one endpoint. DeviceService was slow, but other services using the same Aurora cluster were slow too. We also saw Aurora errors such as:

    Lock wait timeout exceeded; try restarting transaction

So Aurora was clearly part of the story. But it still was not the whole story.

The real picture appeared only after slicing by device, user, and event type

The real breakthrough came only after looking at observability dashboards that aggregated by device, user, event type, and API pattern. That changed the picture completely.

A relatively small subset of devices was producing EWI events at an extremely high rate. Those few problematic devices created a disproportionately large amount of backend activity. That behavior kept hitting the same backend paths, the same device-related rows, and the same supporting flows around them.

Only after looking at the system through that lens did the incident start to make sense.
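The slicing itself does not require anything exotic. The actual dashboards aggregated production telemetry, but as an illustration only, here is a minimal Python sketch of the idea: group raw events by (device_id, event_type) and surface the heaviest producers. The device IDs and event payload shape are hypothetical.

```python
from collections import Counter

def top_producers(events, n=3):
    """Aggregate raw events by (device_id, event_type) and return the
    heaviest producers -- the view that exposes noisy devices."""
    counts = Counter((e["device_id"], e["event_type"]) for e in events)
    return counts.most_common(n)

# Hypothetical sample: two noisy devices drown out a quiet fleet.
events = (
    [{"device_id": "dev-42", "event_type": "EWI"}] * 500
    + [{"device_id": "dev-77", "event_type": "EWI"}] * 300
    + [{"device_id": f"dev-{i}", "event_type": "telemetry"} for i in range(50)]
)
print(top_producers(events, n=2))
# [(('dev-42', 'EWI'), 500), (('dev-77', 'EWI'), 300)]
```

Service-level averages would flatten this signal completely; only the per-device grouping makes the two outliers visible.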

Why the impact grew so fast

What made this incident severe was not just the firmware bug itself, nor just the number of events. It was the interaction of several factors.

First, those repeated events kept hitting backend paths that were sensitive to this kind of pressure: the same device-related rows were updated again and again, some important indexes were missing or not aligned to the real access patterns, transactions were open longer than needed, and ORM behavior added extra queries inside already busy flows.

Second, the event flow had secondary effects. Data notifications triggered app-side flows that made backend API calls automatically to fetch updated state, and also caused notification-related cleanup operations such as deletes. Those additional reads and deletes added more pressure and also made the investigation itself more confusing, because now the database showed stress from multiple directions at once.

And once Aurora slowed down, retries continued feeding the same stressed path back into the system.

So this was not just a story of “too many events.” It was a story of repeated high-rate device activity colliding with hot database paths, imperfect indexing and transaction design, app-triggered follow-up traffic, and retry behavior that made the situation even worse.

Why a small canary was enough

One of the most important lessons from this incident was that the system did not need a fleet-wide traffic explosion to fail. Once those hot paths became contended, the impact spread well beyond the original devices.

That was an important lesson: sometimes the real risk is not broad load across the system, but concentrated repeated activity on specific hot rows, tables, and flows. A very small subset of devices can create enough contention to degrade the experience for everyone else.

Why Aurora became the visible bottleneck

Aurora was where a lot of the pain surfaced. Even simple queries such as findById() started taking longer, and that confused the investigation at first. It was tempting to ask whether this was a pure database sizing problem, a configuration issue, a MySQL version issue, or maybe a general infrastructure issue.

We checked all of those directions. We looked at lock waits, configurations, and the broader database behavior. What helped the most here was AWS Aurora Performance Insights. It made it much easier to see where the database was spending time, which SQL patterns were involved, where locking showed up, and how CPU rose during the incident. That visibility was critical, because the issue was not just “the DB is slow.” It was about understanding why it became slow.

What the database investigation revealed

  1. Some tables were missing indexes, others carried duplicate indexes that were not needed, and some large tables were not indexed correctly for the way they were actually queried and updated. One important example was optimistic locking: some tables used versioned updates, but the relevant update pattern did not have the right composite index on (id, version). That became especially painful under repeated contention.

  2. Transactions were open longer than they needed to be. Some business flows held transactions open across too much work. For example, in some cases additional logic or API calls were happening while the transaction was still active. That meant transactions lived longer than they should have, increasing the chance of waiting, blocking, and timeout under pressure.

  3. ORM behavior added extra pressure. Hibernate relationship modeling and eager/lazy loading choices caused additional queries inside already busy flows. Under normal load this can stay invisible for a long time; under repeated high-rate updates, it becomes part of the problem.

  4. The incident was not only about writes. The notification-driven app flow also increased reads and delete operations, especially against a large notifications-related table that was not indexed correctly for that usage pattern. That contributed to the Aurora CPU increase and made the overall symptom picture noisier.
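The optimistic-locking finding in point 1 is worth making concrete. The real system runs on Aurora MySQL with Hibernate, but as an illustrative stand-in, here is a SQLite-backed Python sketch of the versioned-update pattern. Table and column names are hypothetical; the point is the `WHERE id = ? AND version = ?` predicate, which a composite (id, version) index can serve directly, whereas an index on id alone forces extra row work on every conflicting retry.

```python
import sqlite3

# In-memory sketch of an optimistic-locking update.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE device_state (id TEXT, version INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_device_id_version ON device_state (id, version)")
conn.execute("INSERT INTO device_state VALUES ('dev-42', 1, 'off')")

def versioned_update(conn, device_id, expected_version, new_payload):
    """Return True if the row was updated, False on a version conflict."""
    cur = conn.execute(
        "UPDATE device_state SET payload = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_payload, device_id, expected_version),
    )
    return cur.rowcount == 1

print(versioned_update(conn, "dev-42", 1, "on"))   # True: versions matched
print(versioned_update(conn, "dev-42", 1, "off"))  # False: stale version
```

Under hot-row contention, a high proportion of these updates come back with rowcount 0 and get retried, so making the predicate cheap matters far more than it does under normal load.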

It was the combination of repeated high-rate updates from affected devices, hot-row and hot-table patterns, indexing problems, broad transaction scope, ORM inefficiencies, notification-triggered follow-up DB work, and retries continuing to recycle the load back into the same path.

That combination is what turned the canary issue into a production incident.

Retries made the situation worse

When Lambda called DeviceService and received a timeout, the message returned to SQS and was retried. That kept happening too aggressively and for far too long. So once the system entered a degraded state, retries stopped acting like recovery and started amplifying the load instead. The queue was no longer just buffering work; it was feeding the same unhealthy workload back into an already stressed path.

Rollback was necessary but not sufficient

The first action was obvious and necessary:

  • stop the OTA rollout
  • revert the bad firmware
  • protect the IoT rule path so affected firmware versions would stop pushing those EWI events deeper into the system

That was the immediate containment step, but it became clear that rollback alone would not be enough. The bad firmware exposed weaknesses that were already there: too much expensive work per event in some flows, insufficient protection against repeated noisy devices, indexing issues, transaction-scope issues, and retry behavior that could amplify failure.

In that sense, the firmware bug triggered the incident, but the investigation exposed deeper hardening work that needed to happen anyway.

What we changed

Database and schema improvements

  • add missing indexes, remove duplicate or unnecessary indexes
  • fix composite indexing for hot optimistic-locking update paths
  • change parts of the database schema and relationships, including removing and merging tables where needed, and redesign some hot paths

Transaction and Hibernate fixes

  • shorten transaction scope, avoid keeping transactions open across unnecessary work, avoid making API calls while transactions are still active
  • improve relationship handling
  • revisit eager/lazy behavior and how mappings were written
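The transaction-scope change can be illustrated with a small sketch. The real flows are Java/Hibernate, but the shape of the fix is language-independent: commit the database write first, then do the slow external work. The function and table names below are hypothetical.

```python
import sqlite3

def notify_users(device_id):
    """Stand-in for an external API call (push notification, etc.)."""
    return f"notified for {device_id}"

def handle_event(conn, device_id, state):
    """Commit the row update, then do the slow external work.
    The anti-pattern was calling notify_users() while the transaction
    was still open, holding row locks across a network call."""
    with conn:  # transaction covers only the write
        conn.execute(
            "UPDATE device_state SET payload = ? WHERE id = ?",
            (state, device_id),
        )
    return notify_users(device_id)  # runs outside the transaction

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE device_state (id TEXT, payload TEXT)")
conn.execute("INSERT INTO device_state VALUES ('dev-42', 'off')")
print(handle_event(conn, "dev-42", "on"))  # prints "notified for dev-42"
```

Under normal load the difference is invisible; under contention, the lock-hold time per event is roughly the transaction duration, so shaving a network round-trip out of it directly reduces the lock waits that showed up as Aurora timeouts.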

Flow minimization

  • reduce unnecessary work in some event paths
  • minimize some flows entirely
  • in some scenarios move to lighter approaches, for example using IoT Shadow where it was enough for the business need

Notification and app-behavior changes

  • reduce and throttle push and data notifications
  • change app behavior so incoming data notifications would not trigger such an expensive backend pattern

Retry and protection changes

  • correct retry handling so failures would not recycle almost endlessly
  • build a new rate-limiting mechanism by device and event type using DynamoDB
  • use that same mechanism as a near-real-time observability layer for diagnosing abnormal device behavior
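The rate-limiting idea can be sketched as a fixed-window counter keyed by (device_id, event_type). This in-memory version is only a stand-in for the DynamoDB-backed mechanism (which would use atomic counter updates with a TTL on window keys); the limits and key shape are illustrative.

```python
import time
from collections import defaultdict

class DeviceRateLimiter:
    """Fixed-window counter keyed by (device_id, event_type, window).
    In-memory stand-in for a DynamoDB-backed mechanism; the same
    counters double as an observability signal for abnormal devices."""
    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.counters = defaultdict(int)

    def allow(self, device_id, event_type, now=None):
        now = time.time() if now is None else now
        window_start = int(now // self.window)
        key = (device_id, event_type, window_start)
        self.counters[key] += 1
        return self.counters[key] <= self.limit

limiter = DeviceRateLimiter(limit=3, window_seconds=60)
decisions = [limiter.allow("dev-42", "EWI", now=100.0) for _ in range(5)]
print(decisions)  # [True, True, True, False, False]
```

Rejected events can be dropped, sampled, or parked, but either way the noisy device stops reaching the expensive downstream flow, and the counters themselves show which devices are misbehaving in near real time.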

The biggest lesson

Taking a system down does not always require a massive fleet-wide spike. Sometimes a single firmware issue, or a very small number of devices repeatedly hitting the same hot path, is enough to create contention that affects many unrelated flows. Once that happens, the whole system can start looking sick even though the original trigger came from only a very small corner of it.

The broader engineering lesson

Resilience in production is deeply tied to the quality of the software and data design behind the flows. That includes things like:

  • whether indexes actually match real access patterns
  • whether table relationships reflect current workload realities
  • whether transaction boundaries are kept tight
  • whether ORM usage is predictable under pressure
  • whether retries are bounded correctly
  • whether a noisy producer can be isolated before it stresses shared resources

The system was good enough under normal behavior. But this incident exposed assumptions that did not hold under this failure mode. And that is exactly why these kinds of investigations are valuable.

A practical takeaway for IoT and event-driven systems

In IoT systems especially, not every high-frequency device signal should be allowed to trigger the full weight of downstream processing again and again.

You need the ability to ask:

  • Which devices are suddenly producing abnormal load?
  • Which event types are exploding?
  • Which backend paths are getting hot?
  • Which tables or queries are becoming contention points?
  • Are retries helping recovery, or making the situation worse?

Service-level dashboards are not enough. Without slicing by device, user, and event type, we would have seen only “general slowness.” That zoomed-in observability is what revealed the real shape of the incident.

Final thought

The firmware issue triggered the incident, but the deeper value came from what it revealed. It showed how a small number of noisy devices could create enough repeated pressure on specific rows, tables, and backend flows to impact the whole system. And it pushed us to improve the platform in a much more meaningful way than simply reverting one bad rollout: better indexing, better transaction handling, cleaner relationships, lighter flows, bounded retries, and stronger protection against abnormal device behavior.
