Your SQS Queue Is Redelivering Messages Your Lambda Is Still Processing

#aws #typescript #serverless #opensource

Your order-processing Lambda starts sending duplicate confirmation emails. Not always — maybe one order in twenty. CloudWatch shows more invocations than messages published. The function code hasn't changed in weeks. What changed is that someone added a fraud check that pushed processing time from 25 seconds to around 45, and your SQS queue is still running the default 30-second visibility timeout.

That combination is the whole bug. When a Lambda pulls a message from SQS, the message isn't deleted — it's hidden for the duration of the visibility timeout. If the function is still working when that window closes, SQS assumes the consumer died and hands the same message to another invocation. Now two Lambdas are processing the same order, both will "succeed," and both will send the email. Nothing errors. Nothing retries. There is no log line that says "this message was delivered twice because your timeouts are misconfigured."
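To make the timing concrete, here is a deliberately simplified model of the redelivery window. The function name `extraDeliveries` is hypothetical, and the model ignores polling latency, batching, and whatever the later invocations do; it only counts how many times the message becomes visible again while the first invocation is still running.

```typescript
// Simplified model (an assumption for illustration, not SQS's exact behavior):
// the message becomes visible again at t = visibilitySec, 2 * visibilitySec, ...
// because each redelivery restarts the visibility timer.
function extraDeliveries(processingSec: number, visibilitySec: number): number {
  if (processingSec <= 0 || visibilitySec <= 0) {
    throw new RangeError("both durations must be positive");
  }
  // Count the re-visibility instants that fall strictly before the
  // first invocation finishes and deletes the message.
  return Math.ceil(processingSec / visibilitySec) - 1;
}

console.log(extraDeliveries(25, 30)); // 0: finished inside the window
console.log(extraDeliveries(45, 30)); // 1: redelivered at the 30-second mark
```

With the numbers from the scenario above, the 25-second handler never overlaps the window, while the 45-second handler is still running when the message reappears.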

Infrawise (npm) flags this exact mismatch as a high-severity finding before it costs you an afternoon of staring at idempotency-free handler code. This post walks through why the bug is so hard to see, how the detection works, and how to keep an AI assistant from reintroducing it.

Why you never catch this one yourself

Three things make this misconfiguration nearly invisible:

It passes every test. In local tests and staging, your handler processes a synthetic message in two seconds. The 30-second visibility window never comes close to expiring. The bug only exists under production conditions — real payload sizes, real downstream latency, cold starts stacking on top of slow dependencies.

The defaults set the trap. SQS queues default to a 30-second visibility timeout. Lambda functions routinely get their timeout bumped to 60, 120, or 900 seconds as they grow. Nobody bumps the queue at the same time, because the two settings live in different consoles, different IaC resources, and usually different pull requests.

The failure signature points elsewhere. Duplicate processing looks like an application bug. You'll audit your handler for accidental double-sends, check whether the producer published twice, and read DynamoDB conditional-write docs before anyone thinks to compare two timeout values across two services.

The fix is one line of IaC. Finding out you need it is the expensive part.

How infrawise detects the mismatch

Infrawise builds a graph of your actual AWS account and runs rule-based analyzers over it. No LLM is involved in the analysis, so the results are deterministic: a finding either fires or it doesn't.

For this check, two extractions matter:

Queue attributes. For every queue returned by ListQueues, infrawise calls GetQueueAttributes and records VisibilityTimeout alongside the redrive policy, encryption status, and approximate message counts. A queue node in the graph carries its visibilityTimeoutSec.

Event source mappings. Infrawise paginates through ListEventSourceMappings and attaches each mapping to its Lambda. Every SQS-type mapping becomes a triggers edge in the graph: queue:aws:order-events → lambda:aws:process-order. Disabled mappings are skipped — a queue that used to feed a Lambda doesn't generate noise.
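The two extractions can be pictured as a normalization step from raw API responses into graph nodes and edges. The sketch below is not infrawise's code: the type and function names are hypothetical, and only the response fields `QueueArn`, `VisibilityTimeout`, `RedrivePolicy`, `EventSourceArn`, `FunctionArn`, and `State` are taken from the SQS and Lambda APIs. It also assumes unqualified function ARNs (no alias or version suffix).

```typescript
// Subset of the GetQueueAttributes response; SQS returns values as strings.
interface RawQueueAttributes {
  QueueArn: string;
  VisibilityTimeout: string;
  RedrivePolicy?: string; // present only when a dead-letter queue is configured
}

// Subset of one ListEventSourceMappings entry.
interface RawMapping {
  EventSourceArn: string;
  FunctionArn: string;
  State: string; // e.g. "Enabled", "Disabled"
}

interface QueueNode { id: string; visibilityTimeoutSec: number; hasDlq: boolean; }
interface TriggerEdge { from: string; to: string; }

// Last colon-separated ARN segment: the queue name or the function name.
const lastSegment = (arn: string): string => arn.split(":").pop() ?? arn;

function toQueueNode(attrs: RawQueueAttributes): QueueNode {
  return {
    id: `queue:aws:${lastSegment(attrs.QueueArn)}`,
    visibilityTimeoutSec: Number(attrs.VisibilityTimeout),
    hasDlq: attrs.RedrivePolicy !== undefined,
  };
}

function toTriggerEdges(mappings: RawMapping[]): TriggerEdge[] {
  return mappings
    .filter((m) => m.EventSourceArn.split(":")[2] === "sqs") // SQS sources only
    .filter((m) => m.State !== "Disabled") // skip disabled mappings
    .map((m) => ({
      from: `queue:aws:${lastSegment(m.EventSourceArn)}`,
      to: `lambda:aws:${lastSegment(m.FunctionArn)}`,
    }));
}
```

Filtering on the ARN's service segment rather than on a name pattern keeps the edges tied to what the mapping actually points at.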

Then the VisibilityTimeoutMismatchAnalyzer walks the graph: for each queue that actually triggers a Lambda, it compares the queue's visibility timeout against that function's configured timeout. If the visibility timeout is smaller, you get a high-severity finding.

Two details keep this precise rather than noisy:

Only wired-up queues are checked. A queue with no active event source mapping is ignored. The analyzer isn't pattern-matching on names or guessing at architecture — it follows the same edge SQS itself uses to deliver messages.
The comparison uses real values from both sides. The Lambda timeout comes from the function configuration; the visibility timeout comes from the live queue attributes. If your Terraform says one thing and someone changed the queue in the console, infrawise reports what's actually deployed.
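The comparison itself is small enough to sketch. This is a minimal illustration of the rule as described above, not the actual VisibilityTimeoutMismatchAnalyzer source; the types and the function name `findVisibilityTimeoutMismatches` are hypothetical.

```typescript
interface QueueNode { id: string; name: string; visibilityTimeoutSec: number; }
interface LambdaNode { id: string; name: string; timeoutSec: number; }
interface TriggerEdge { from: string; to: string; } // queue id -> lambda id
interface Finding { severity: "high"; queue: string; lambda: string; message: string; }

function findVisibilityTimeoutMismatches(
  queues: QueueNode[],
  lambdas: LambdaNode[],
  edges: TriggerEdge[],
): Finding[] {
  const queueById = new Map(queues.map((q) => [q.id, q] as const));
  const lambdaById = new Map(lambdas.map((l) => [l.id, l] as const));
  const findings: Finding[] = [];

  // Only queues that appear on a triggers edge are ever compared.
  for (const edge of edges) {
    const queue = queueById.get(edge.from);
    const fn = lambdaById.get(edge.to);
    if (!queue || !fn) continue; // edge points outside the scanned graph

    if (queue.visibilityTimeoutSec < fn.timeoutSec) {
      findings.push({
        severity: "high",
        queue: queue.name,
        lambda: fn.name,
        message:
          `Queue "${queue.name}" visibility timeout (${queue.visibilityTimeoutSec}s) ` +
          `is less than Lambda "${fn.name}" timeout (${fn.timeoutSec}s)`,
      });
    }
  }
  return findings;
}
```

Because the loop is driven by edges rather than by queues, an unwired queue with a short visibility timeout never produces a finding.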

Running infrawise analyze against the scenario above prints:

  1.  HIGH   Queue "order-events" visibility timeout (30s) is less than Lambda "process-order" timeout (120s)
       If the Lambda takes longer than the visibility timeout, SQS will re-deliver the message to another consumer while the original invocation is still running, causing duplicate processing.
       → Set the visibility timeout for "order-events" to at least 720s (6× the Lambda timeout of 120s), per AWS best practice.

The 6× multiplier follows AWS's own guidance for Lambda event source mappings: set the queue's visibility timeout to at least six times the function timeout. The extra headroom gives Lambda time to retry a batch if the function is throttled, rather than covering only a single invocation.
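The arithmetic behind the recommendation can be written as a small helper. This is a sketch under stated assumptions: the function name is hypothetical, and the optional `batchingWindowSec` term reflects my reading of the AWS guidance that a configured MaximumBatchingWindowInSeconds should be added on top, which is worth verifying against the current Lambda documentation. The 12-hour cap is the SQS maximum for a visibility timeout.

```typescript
const SQS_MAX_VISIBILITY_TIMEOUT_SEC = 43_200; // 12 hours, the SQS upper bound

function recommendedVisibilityTimeoutSec(
  lambdaTimeoutSec: number,
  batchingWindowSec = 0, // assumption: add the mapping's batching window if one is set
): number {
  if (lambdaTimeoutSec <= 0 || batchingWindowSec < 0) {
    throw new RangeError("timeouts must be positive and the window non-negative");
  }
  const recommended = 6 * lambdaTimeoutSec + batchingWindowSec;
  return Math.min(recommended, SQS_MAX_VISIBILITY_TIMEOUT_SEC);
}

console.log(recommendedVisibilityTimeoutSec(120)); // 720
console.log(recommendedVisibilityTimeoutSec(900)); // 5400
```

Even at Lambda's 900-second maximum, the recommended value stays well under the SQS cap, so the clamp only matters if a large batching window is added.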

The fix — and keeping it fixed when AI writes your infra

The immediate fix is mechanical. In CDK:

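import * as cdk from 'aws-cdk-lib';
import * as sqs from 'aws-cdk-lib/aws-sqs';
// Inside a Stack or Construct, where `this` is the scope:
const orderEvents = new sqs.Queue(this, 'OrderEvents', {
  visibilityTimeout: cdk.Duration.seconds(720), // 6× the consumer's 120s timeout
});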

The longer-term problem is that this class of bug gets reintroduced constantly — and increasingly by AI assistants. Ask an assistant to "add an SQS queue that triggers the order processor" and it will happily emit a queue with default settings wired to a 120-second function. The code is syntactically perfect. It deploys. It has the bug from paragraph one built in.

This is where infrawise's MCP server changes the workflow. Run once:

infrawise start --claude

It scans your account, writes .mcp.json to the project root, and opens Claude Code with the analysis available as tools. From then on the assistant can call get_queue_details, which returns every queue with its visibilityTimeoutSec, DLQ presence, encryption, FIFO flag, and message counts — plus any findings attached to that queue. The tool description explicitly tells the model to verify visibility timeout against Lambda timeout when reviewing messaging architecture, so an assistant asked to touch queue infrastructure checks the real numbers instead of emitting defaults.

Concretely, the before/after looks like this:

Before: "add a queue for order events" → assistant generates new sqs.Queue(...) with no visibility timeout → default 30s ships → duplicates appear weeks later when processing slows down.
After: the assistant calls get_queue_details and get_lambda_overview, sees the consumer's 120s timeout, and generates the queue with visibilityTimeout: cdk.Duration.seconds(720), citing the existing high-severity finding if one is already live.

Two adjacent findings tend to show up in the same report, and they compound. If the mismatched queue also has no dead-letter queue (another high-severity check), your duplicate-processing problem coexists with silent message loss: a message that keeps failing is retried until the queue's retention period expires, then deleted with nowhere to land. And note the fix's flip side: a properly long visibility timeout means a message whose invocation genuinely failed can stay hidden for the rest of that window before it is retried, which is exactly why the DLQ check matters alongside this one. Duplicate delivery can still happen in rare cases even with correct timeouts (SQS standard queues are at-least-once), so idempotent handlers remain good practice. But there's a difference between designing for a rare edge case and misconfiguring your way into hitting it on every slow invocation.
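For the idempotency half of that advice, a minimal sketch of a claim-then-process guard follows. Everything here is illustrative: `InMemoryIdempotencyStore` and `processOnce` are hypothetical names, and the in-memory set is only a stand-in. Separate Lambda invocations do not share memory, so a real handler would need a shared store with an atomic claim, such as a DynamoDB conditional write.

```typescript
// Stand-in for a shared store; NOT suitable across Lambda invocations.
class InMemoryIdempotencyStore {
  private readonly seen = new Set<string>();

  // Returns true only for the first caller to claim a given key.
  claim(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }

  // Releases a claim so a later retry can attempt the work again.
  release(key: string): void {
    this.seen.delete(key);
  }
}

async function processOnce<T>(
  store: InMemoryIdempotencyStore,
  key: string, // a business key such as the order ID
  payload: T,
  handler: (payload: T) => Promise<void>,
): Promise<"processed" | "skipped"> {
  if (!store.claim(key)) return "skipped"; // a duplicate delivery
  try {
    await handler(payload);
    return "processed";
  } catch (err) {
    store.release(key); // failed work should remain retryable
    throw err;
  }
}
```

Keying on a business identifier rather than the SQS message ID also covers the case where the producer publishes the same order twice.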

Analysis results are cached for 24 hours, and the MCP server refreshes stale analysis at session start, so the numbers your assistant reads track what's actually deployed.

Key takeaways

A Lambda timeout longer than its queue's visibility timeout makes duplicate processing inevitable once invocations slow down. SQS redelivers any message whose consumer is still running when the visibility window closes, so every invocation that outlasts the window produces a duplicate.
The defaults create the bug: queues start at 30 seconds, Lambda timeouts grow over time, and the two values live in different resources that rarely change in the same PR.
Set visibility timeout to 6× the consumer Lambda's timeout, per AWS guidance, and pair every triggered queue with a dead-letter queue so messages that keep failing land somewhere you can inspect instead of expiring silently.
Check deployed values, not IaC intent. Infrawise compares the live queue attribute against the live function configuration, so console drift gets caught too.
Give your AI assistant the real numbers. With infrawise's MCP tools connected, an assistant generating queue infrastructure reads actual timeouts instead of shipping defaults.

Try it against your own account — or against a free LocalStack sandbox if you want to see the findings without touching real AWS: GitHub · npm.