Abstract
In this post, we explore how refactoring SQS message sending from individual SendMessage calls to batch SendMessageBatch operations can
significantly improve application performance and reduce the SQS bill by lowering the number of API requests (IOPS).
The idea
When monitoring a Golang application with Datadog, we can measure SQS message sending in detail. By comparing a traditional loop-based send approach with batch sending, we can see clear differences in timing, network calls, and resource usage.
Full Datadog tracing of SQS is not supported for all languages:
Set DD_TRACE_CLOUD_REQUEST_PAYLOAD_TAGGING=all and DD_TRACE_CLOUD_RESPONSE_PAYLOAD_TAGGING=all on this service to enable complete payload tagging.
https://docs.datadoghq.com/tracing/guide/aws_payload_tagging/?tab=nodejs
DD_TRACE_CLOUD_REQUEST_PAYLOAD_TAGGING=all
DD_TRACE_CLOUD_RESPONSE_PAYLOAD_TAGGING=all
For Golang, you can leverage Datadog attribute tags to inspect payload metadata.
Regular SQS message send operations
Sending messages one by one involves multiple network calls and extra overhead.
The following tracing diagram shows what the timing looks like when sending in a loop.
For example, sending 7 messages individually took 175ms, with 7 separate HTTP requests. The first call typically dominates the timing due to DNS lookup and connection setup.
But since the service is running in the same K8s cluster, we can assume the experiment is clean and no additional network overhead is present.
Sending messages in a Batch
AWS SQS allows sending up to 10 messages per batch. Sending 20 messages in 2 batches demonstrates significant efficiency gains compared to the loop above:
- Sent ~3x more messages (20 vs. 7).
- Made 10x fewer HTTP requests per message (2 requests for 20 messages).
- Reduced total processing time by ~3x.
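The gains come from chunking the outgoing messages into SQS-sized groups. A minimal stdlib-only sketch (the `chunk` helper is ours, not part of the AWS SDK):

```go
package main

import "fmt"

// maxBatchSize is the SQS SendMessageBatch limit of 10 entries per call.
const maxBatchSize = 10

// chunk splits msgs into consecutive slices of at most size elements.
func chunk(msgs []string, size int) [][]string {
	var batches [][]string
	for len(msgs) > 0 {
		n := size
		if len(msgs) < n {
			n = len(msgs)
		}
		batches = append(batches, msgs[:n])
		msgs = msgs[n:]
	}
	return batches
}

func main() {
	msgs := make([]string, 20)
	for i := range msgs {
		msgs[i] = fmt.Sprintf("msg-%d", i)
	}
	// 20 messages become 2 SendMessageBatch calls instead of 20 SendMessage calls.
	fmt.Println(len(chunk(msgs, maxBatchSize))) // prints 2
}
```

Each inner slice then becomes the `Entries` of one SendMessageBatch call.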
Response examples:
When a batch send is performed, the batch response contains a status for each submitted message, including any error.
So the batch call itself can complete while some messages within it fail; parsing these per-message statuses lets you efficiently
replay failed messages or handle them with fallback business logic.
{
  "Successful": [
    { "ID": "0",  "MessageID": "655f3404-fbe4-4c51-8868-b5c604bd5f6d", "Error": null },
    { "ID": "1",  "MessageID": "daf36653-9abb-490b-b620-608efa24a219", "Error": null },
    { "ID": "2",  "MessageID": "93f4dcfd-0500-4076-90f2-3b880b32c943", "Error": null },
    { "ID": "3",  "MessageID": "f6c7b079-98f5-4290-b293-2ac6e43ed6f2", "Error": null },
    { "ID": "4",  "MessageID": "2b4a96bc-b4ec-4711-9473-d887dd3213f7", "Error": null },
    { "ID": "5",  "MessageID": "1bd30cd9-f9c1-4b47-8d6d-2e23ce771841", "Error": null },
    { "ID": "6",  "MessageID": "8eed75ef-2563-442e-a191-6b3dff29d635", "Error": null },
    { "ID": "7",  "MessageID": "c65a36ce-7ce0-444c-9974-96648dcae0ea", "Error": null },
    { "ID": "8",  "MessageID": "75379265-52f9-4a60-8c3a-0537cffdaa80", "Error": null },
    { "ID": "9",  "MessageID": "59239903-d4d9-498f-9a08-6d7d7ae8beba", "Error": null },
    { "ID": "10", "MessageID": "9a614c58-113b-487d-a8f1-7509f93b42f9", "Error": null },
    { "ID": "11", "MessageID": "1077de5c-8f0f-4d5b-a0fe-dca45712bfdf", "Error": null },
    { "ID": "12", "MessageID": "8b0f5836-0e01-4a88-9793-4bac2a6d879a", "Error": null }
  ],
  "Failed": []
}
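Parsing that envelope is straightforward. A stdlib-only sketch (the struct and helper names are ours, mirroring the response shape above) that extracts the entry IDs from the Failed section for replay:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// entryResult mirrors one entry of the batch response shown above.
type entryResult struct {
	ID        string  `json:"ID"`
	MessageID string  `json:"MessageID"`
	Error     *string `json:"Error"` // nil on success
}

// batchResult mirrors the whole response envelope.
type batchResult struct {
	Successful []entryResult `json:"Successful"`
	Failed     []entryResult `json:"Failed"`
}

// failedIDs returns the entry IDs that need to be replayed
// or routed to fallback business logic.
func failedIDs(raw []byte) ([]string, error) {
	var r batchResult
	if err := json.Unmarshal(raw, &r); err != nil {
		return nil, err
	}
	ids := make([]string, 0, len(r.Failed))
	for _, e := range r.Failed {
		ids = append(ids, e.ID)
	}
	return ids, nil
}

func main() {
	raw := []byte(`{"Successful":[{"ID":"0","MessageID":"m-0","Error":null}],
		"Failed":[{"ID":"1","MessageID":"","Error":"throttled"}]}`)
	ids, err := failedIDs(raw)
	fmt.Println(ids, err) // prints [1] <nil>
}
```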
AWS Console Behavior
Batch sending does not change how messages appear in SQS. Each message is stored individually, so consumers don’t need any changes to handle batches.
The same messages, with the same structure, are posted to and visible in SQS.
However, there are further optimization techniques, such as tuning the consumer batch size when polling messages from SQS.
Golang Implementation Example
// Build up to 10 entries per batch; i is the offset of this batch within
// the full message slice, so i+idx stays unique across the whole send.
for idx, msg := range batch {
	b, err := json.Marshal(msg) // payload serialization (JSON, see limitations)
	if err != nil {
		return nil, err
	}
	entry := &sqs.SendMessageBatchRequestEntry{
		Id:          aws.String(fmt.Sprintf("%d", i+idx)), // unique ID within batch
		MessageBody: aws.String(string(b)),
	}
	if taskConfig.MessageGroupId != "" { // FIFO queues only
		entry.SetMessageGroupId(taskConfig.MessageGroupId)
	}
	if taskConfig.MessageDeduplicationId != "" { // FIFO queues only
		entry.SetMessageDeduplicationId(taskConfig.MessageDeduplicationId)
	}
	if taskConfig.DelaySeconds > 0 {
		entry.SetDelaySeconds(taskConfig.DelaySeconds)
	}
	entries = append(entries, entry)
}
// BatchResult aggregates the per-message outcomes of a batch send.
type BatchResult struct {
	Successful []BatchResultEntry
	Failed     []BatchResultEntry
}

// BatchResultEntry represents a single entry in a batch result
type BatchResultEntry struct {
	ID        string
	MessageID string
	Error     error
}
// Send the batch and fan the per-message statuses into result
input := &sqs.SendMessageBatchInput{
	QueueUrl: stp.url,
	Entries:  entries,
}
output, err := stp.c.SendMessageBatchWithContext(ctx, input)
if err != nil {
	err = handleSqsErrors(err)
	// Mark all entries in this batch as failed
	for idx := range batch {
		result.Failed = append(result.Failed, BatchResultEntry{
			ID:    fmt.Sprintf("%d", i+idx),
			Error: err,
		})
	}
	continue
}
// The call succeeded, but individual entries may still have failed
for _, s := range output.Successful {
	result.Successful = append(result.Successful, BatchResultEntry{
		ID:        aws.StringValue(s.Id),
		MessageID: aws.StringValue(s.MessageId),
	})
}
for _, f := range output.Failed {
	result.Failed = append(result.Failed, BatchResultEntry{
		ID:    aws.StringValue(f.Id),
		Error: fmt.Errorf("%s: %s", aws.StringValue(f.Code), aws.StringValue(f.Message)),
	})
}
Dedicated message details
The MessageId visible on the message in the AWS console is exactly the one returned in the Successful section of the batch response.
Additional things to check and optimize
Deduplication technique
Before sending the messages, perform deduplication: this reduces SQS request usage, decreases processing latency, and reduces load on the consumer side by avoiding unneeded storage reads, rewrites, and so on.
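A minimal sketch of such pre-send deduplication, assuming each outgoing message exposes an application-level key (the Message type and its Key field are hypothetical):

```go
package main

import "fmt"

// Message is a hypothetical outgoing message with an
// application-level deduplication key.
type Message struct {
	Key  string
	Body string
}

// dedupe keeps the first message seen for each key, preserving order,
// so fewer entries reach SendMessageBatch and the consumer.
func dedupe(msgs []Message) []Message {
	seen := make(map[string]struct{}, len(msgs))
	out := make([]Message, 0, len(msgs))
	for _, m := range msgs {
		if _, ok := seen[m.Key]; ok {
			continue
		}
		seen[m.Key] = struct{}{}
		out = append(out, m)
	}
	return out
}

func main() {
	msgs := []Message{{"a", "1"}, {"b", "2"}, {"a", "3"}}
	fmt.Println(len(dedupe(msgs))) // prints 2
}
```

For FIFO queues, SQS offers its own content-based deduplication via MessageDeduplicationId, but deduplicating before the send still saves the request itself.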
Distributed tracing frameworks can consume an SQS batch slot for metainformation
Some distributed tracing frameworks propagate metainformation through async transports like SQS.
If you are using one, check its integrations, since this can affect the maximum batch size. For example, Datadog uses one batch element to propagate tracing metainformation,
which is consumed on the receiving side and attached as a span to the same trace.
X-Ray, since it is a proprietary AWS technology, does not use any slots in an SQS batch; it submits span/trace information through a UDP daemon instead.
Limitations:
- message payload size (1 MB, which also caps the total batch payload)
- batch size (10 messages)
- payload serialization (JSON)
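The first two limits can be enforced before calling the API, avoiding a rejected request. A stdlib sketch (the constants reflect the list above; `validateBatch` is our own helper, and the total-payload check assumes the batch total shares the per-message limit):

```go
package main

import (
	"errors"
	"fmt"
)

const (
	maxBatchEntries = 10      // SendMessageBatch entry limit
	maxPayloadBytes = 1 << 20 // 1 MB payload limit from the list above
)

// validateBatch rejects a batch that would be refused by SQS,
// checking both the entry count and the payload sizes.
func validateBatch(bodies [][]byte) error {
	if len(bodies) > maxBatchEntries {
		return errors.New("too many entries in batch")
	}
	total := 0
	for _, b := range bodies {
		if len(b) > maxPayloadBytes {
			return errors.New("single message payload too large")
		}
		total += len(b)
	}
	if total > maxPayloadBytes {
		return errors.New("total batch payload too large")
	}
	return nil
}

func main() {
	ok := [][]byte{[]byte(`{"id":1}`), []byte(`{"id":2}`)}
	fmt.Println(validateBatch(ok)) // prints <nil>
}
```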
Conclusions:
Switching the implementation from loop-based Send to batch Send significantly decreased the overall timing, reduced network round trips, and, as a bonus, decreased the SQS bill (by cutting the number of API calls ~10x).



