
Roman Tsypuk for AWS Community Builders

Posted on • Originally published at tsypuk.github.io

How switching to SQS Batch operations improves Performance and Billing

Abstract

In this post, we explore how refactoring SQS message processing from individual SendMessage calls to batched SendMessageBatch operations can significantly improve application performance and reduce SQS billing costs by lowering the number of API requests (IOPS).

The idea

When monitoring a Golang application with DataDog, we can measure SQS message sending in detail. By comparing a traditional loop-based send approach versus batch sending, we can see clear differences in timing, network calls, and resource usage.

Full Datadog tracing of SQS is not supported for all languages:

Set DD_TRACE_CLOUD_REQUEST_PAYLOAD_TAGGING=all and DD_TRACE_CLOUD_RESPONSE_PAYLOAD_TAGGING=all on this service to enable complete payload tagging.

https://docs.datadoghq.com/tracing/guide/aws_payload_tagging/?tab=nodejs

DD_TRACE_CLOUD_REQUEST_PAYLOAD_TAGGING=all
DD_TRACE_CLOUD_RESPONSE_PAYLOAD_TAGGING=all

For Golang, you can leverage Datadog attribute tags to inspect payload metadata.

Regular SQS message send operations

Sending messages one by one involves multiple network calls and extra overhead.

The following tracing diagram shows what the timing looks like when using a loop-based send.

img2.png

For example, sending 7 messages individually took 175ms, with 7 separate HTTP requests. The first call typically dominates the timing due to DNS lookup and connection setup.

Since the service runs in the same K8s cluster, we can treat the experiment as clean, with no additional network overhead skewing the results.
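To make the loop shape concrete, here is a minimal sketch of the one-by-one approach. The `messageSender` interface and `countingSender` type are hypothetical stand-ins for the SQS client, used only to show that each message costs one separate request:

```go
package main

import "fmt"

// messageSender is a hypothetical stand-in for the SQS client's
// SendMessage call, so the loop shape can be shown without AWS access.
type messageSender interface {
	Send(body string) error
}

// countingSender records how many network calls the loop triggers.
type countingSender struct{ calls int }

func (s *countingSender) Send(body string) error {
	s.calls++ // each message costs one HTTP request to SQS
	return nil
}

// sendOneByOne mirrors the loop-based approach: one API call per message.
func sendOneByOne(s messageSender, bodies []string) error {
	for _, b := range bodies {
		if err := s.Send(b); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	s := &countingSender{}
	bodies := make([]string, 7)
	for i := range bodies {
		bodies[i] = fmt.Sprintf("msg-%d", i)
	}
	_ = sendOneByOne(s, bodies)
	fmt.Println(s.calls) // 7 messages -> 7 separate requests
}
```

With a real SQS client behind the interface, the first of those requests would also pay the DNS and connection-setup cost mentioned above.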

Sending messages in a Batch

img3.png

AWS SQS allows sending up to 10 messages per batch. Sending 20 messages in 2 batches demonstrates significant efficiency gains:

  • Sent 3x more messages.
  • Made 10x fewer HTTP requests.
  • Total processing time reduced by ~3x.
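Because SQS caps a SendMessageBatch request at 10 entries, the producer has to split its messages into chunks before sending. A minimal sketch of that chunking step (the `chunkBatches` helper is illustrative, not from the post's codebase):

```go
package main

import "fmt"

// chunkBatches splits messages into SQS-sized batches; SQS caps a
// SendMessageBatch request at 10 entries.
func chunkBatches(msgs []string, size int) [][]string {
	var batches [][]string
	for start := 0; start < len(msgs); start += size {
		end := start + size
		if end > len(msgs) {
			end = len(msgs)
		}
		batches = append(batches, msgs[start:end])
	}
	return batches
}

func main() {
	msgs := make([]string, 20)
	for i := range msgs {
		msgs[i] = fmt.Sprintf("msg-%d", i)
	}
	batches := chunkBatches(msgs, 10)
	fmt.Println(len(batches)) // 20 messages -> 2 batch requests instead of 20
}
```

This is where the 10x reduction in HTTP requests comes from: 20 individual sends become 2 batch calls.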

Response examples:

When a batch send is performed, the response contains a status for each message in the batch, including any error. This means the batch call itself can succeed while individual messages within it fail; parsing the per-message statuses lets you efficiently replay failed messages or handle such cases with fallback business logic.



{
  "Successful": [
    {
      "ID": "0",
      "MessageID": "655f3404-fbe4-4c51-8868-b5c604bd5f6d",
      "Error": null
    },
    {
      "ID": "1",
      "MessageID": "daf36653-9abb-490b-b620-608efa24a219",
      "Error": null
    },
    {
      "ID": "2",
      "MessageID": "93f4dcfd-0500-4076-90f2-3b880b32c943",
      "Error": null
    },
    {
      "ID": "3",
      "MessageID": "f6c7b079-98f5-4290-b293-2ac6e43ed6f2",
      "Error": null
    },
    {
      "ID": "4",
      "MessageID": "2b4a96bc-b4ec-4711-9473-d887dd3213f7",
      "Error": null
    },
    {
      "ID": "5",
      "MessageID": "1bd30cd9-f9c1-4b47-8d6d-2e23ce771841",
      "Error": null
    },
    {
      "ID": "6",
      "MessageID": "8eed75ef-2563-442e-a191-6b3dff29d635",
      "Error": null
    },
    {
      "ID": "7",
      "MessageID": "c65a36ce-7ce0-444c-9974-96648dcae0ea",
      "Error": null
    },
    {
      "ID": "8",
      "MessageID": "75379265-52f9-4a60-8c3a-0537cffdaa80",
      "Error": null
    },
    {
      "ID": "9",
      "MessageID": "59239903-d4d9-498f-9a08-6d7d7ae8beba",
      "Error": null
    },
    {
      "ID": "10",
      "MessageID": "9a614c58-113b-487d-a8f1-7509f93b42f9",
      "Error": null
    },
    {
      "ID": "11",
      "MessageID": "1077de5c-8f0f-4d5b-a0fe-dca45712bfdf",
      "Error": null
    },
    {
      "ID": "12",
      "MessageID": "8b0f5836-0e01-4a88-9793-4bac2a6d879a",
      "Error": null
    }
  ],
  "Failed": []
}

AWS Console Behavior

Batch sending does not change how messages appear in SQS. Each message is stored individually, so consumers don’t need any changes to handle batches.

img4.png

The same messages, with the same structure, are posted to and present in SQS.

However, there are further optimization techniques on the other side, such as tuning the consumer batch size when polling messages from SQS.

Golang Implementation Example

// Fragment of the producer's batching loop: batch holds up to 10 messages
// and i is the offset of this batch within the full message slice.
for idx, msg := range batch {
  b, err := json.Marshal(msg) // payload serialization (JSON)
  if err != nil {
    return result, err
  }

  entry := &sqs.SendMessageBatchRequestEntry{
    Id:          aws.String(fmt.Sprintf("%d", i+idx)), // Unique ID within batch
    MessageBody: aws.String(string(b)),
  }

  if taskConfig.MessageGroupId != "" {
    entry.SetMessageGroupId(taskConfig.MessageGroupId)
  }
  if taskConfig.MessageDeduplicationId != "" {
    entry.SetMessageDeduplicationId(taskConfig.MessageDeduplicationId)
  }
  if taskConfig.DelaySeconds > 0 {
    entry.SetDelaySeconds(taskConfig.DelaySeconds)
  }
  entries = append(entries, entry)
}

type BatchResult struct {
    Successful []BatchResultEntry
    Failed     []BatchResultEntry
}

// BatchResultEntry represents a single entry in a batch result
type BatchResultEntry struct {
    ID        string
    MessageID string
    Error     error
}

// Send the batch (this runs inside the loop over batches)
input := &sqs.SendMessageBatchInput{
  QueueUrl: stp.url,
  Entries:  entries,
}

output, err := stp.c.SendMessageBatchWithContext(ctx, input)
if err != nil {
  err = handleSqsErrors(err)
  // Mark all entries in this batch as failed
  for idx := range batch {
    result.Failed = append(result.Failed, BatchResultEntry{
      ID:    fmt.Sprintf("%d", i+idx),
      Error: err,
    })
  }
  continue
}

Individual message details

img5.png

This exact MessageId was returned in the Successful section of the batch response.

Additional things to check and optimize

Deduplication technique

Before sending messages, deduplicate them: this reduces SQS request (IOPS) usage, decreases processing latency, and lowers the load on the consumer side by avoiding unneeded storage reads, rewrites, and so on.
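A minimal sketch of such a pre-send deduplication step. Here messages are keyed by their body; in practice you would key on a business identifier (the `dedupe` helper and the `order-*` payloads are illustrative):

```go
package main

import "fmt"

// dedupe drops repeated payloads before they reach SQS, preserving
// first-seen order; every duplicate skipped is one fewer billed send
// and one fewer message for the consumer to process.
func dedupe(msgs []string) []string {
	seen := make(map[string]struct{}, len(msgs))
	out := make([]string, 0, len(msgs))
	for _, m := range msgs {
		if _, ok := seen[m]; ok {
			continue // duplicate: skip instead of paying for another send
		}
		seen[m] = struct{}{}
		out = append(out, m)
	}
	return out
}

func main() {
	msgs := []string{"order-1", "order-2", "order-1", "order-3", "order-2"}
	fmt.Println(dedupe(msgs)) // [order-1 order-2 order-3]
}
```

Deduplicating here also shrinks the number of batches, compounding the savings from batch sending.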

Distributed tracing frameworks can consume an SQS batch slot for metainformation

Some distributed tracing frameworks propagate metainformation through async transports like SQS.
If you use them, check the integration details, since they can affect the effective maximum batch size. For example, Datadog uses one batch element to propagate tracing metainformation,
which is consumed on the receiving side and attached as a span to the same trace.

X-Ray, being a proprietary AWS technology, does not consume any slots in an SQS batch; it submits span/trace info via a UDP daemon instead.

Limitations:

  • maximum message payload size (1 MB)
  • batch size (10 messages)
  • payload serialization (JSON)

Conclusions:

Switching the implementation from loop-based Send to batch Send significantly decreased the overall timing, reduced network round trips, and, as a bonus, lowered the SQS bill (thanks to the 10x reduction in API calls).

