With the help of EventBridge Pipes and Firehose, you can export DynamoDB data to S3 in near real-time, enabling you to query it using SQL via Athena.
This is the second part of a two-part series covering how to copy data from DynamoDB to S3 and query it with Athena.
- Part 1: Query DynamoDB with SQL using Athena - Leveraging DynamoDB Exports to S3
- Part 2 (this article): Query DynamoDB with SQL using Athena - Leveraging EventBridge Pipes and Firehose
DynamoDB is a highly scalable NoSQL database, but it has limited query capabilities like any NoSQL database. By exporting data to S3 and querying it with Athena using SQL, we can combine the scalability of DynamoDB with the flexibility of SQL queries. This offers a truly serverless approach to SQL queries.
The solution is built using CDK and TypeScript. It covers both single- and multiple-table designs.
Real-Time Data Flow to S3
In Part 1, we explored how to use DynamoDB's native export to S3. While that works well for certain use cases, it falls short when you need near-real-time data. In Part 2, we’ll demonstrate a solution to stream new DynamoDB data to S3 in near real-time using EventBridge Pipes and Firehose.
An added advantage of using Firehose over DynamoDB’s native export feature is that Firehose can convert the data to efficient formats, such as Parquet, which optimizes both storage and Athena query performance. You can also filter the data, so not all records need to be exported.
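To make the wiring concrete, here is a minimal CDK sketch of the source side: a DynamoDB table with streams enabled feeding an EventBridge Pipe that filters events before handing them to Firehose. Construct IDs, the filter pattern, and the Firehose ARN placeholder are assumptions for illustration, not the exact implementation.

```typescript
// Sketch only: IDs, batch size, and the Firehose target ARN are assumptions.
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as pipes from 'aws-cdk-lib/aws-pipes';

export class ExportStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // DynamoDB table with streams enabled so the pipe can read changes
    const table = new dynamodb.Table(this, 'Table', {
      partitionKey: { name: 'pk', type: dynamodb.AttributeType.STRING },
      stream: dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
    });

    const pipeRole = new iam.Role(this, 'PipeRole', {
      assumedBy: new iam.ServicePrincipal('pipes.amazonaws.com'),
    });

    new pipes.CfnPipe(this, 'Pipe', {
      roleArn: pipeRole.roleArn,
      source: table.tableStreamArn!,
      sourceParameters: {
        dynamoDbStreamParameters: { startingPosition: 'LATEST', batchSize: 10 },
        // Filtering happens here: e.g. forward only inserts and updates
        filterCriteria: {
          filters: [
            { pattern: JSON.stringify({ eventName: ['INSERT', 'MODIFY'] }) },
          ],
        },
      },
      // Placeholder: the ARN of your Firehose delivery stream
      target: 'arn:aws:firehose:region:account:deliverystream/your-stream',
    });
  }
}
```

The pipe role also needs permissions to read the stream and write to Firehose, which are omitted here for brevity.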
Downsides and Considerations
- Delay
By exporting DynamoDB data via EventBridge Pipes and Firehose, we achieve near-real-time data storage in S3, albeit with some delay. How much delay depends on the buffering configuration you set on Firehose. Firehose batches the data so that multiple records are stored in one file. This is beneficial for cost, and above all it is great for Athena, which performs better when working with fewer, larger files.
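This trade-off is controlled through Firehose's buffering hints: Firehose flushes a file to S3 when either the time or the size limit is reached, whichever comes first. A hedged sketch using the CDK L1 construct (bucket and role ARNs are placeholders, and the values are illustrative):

```typescript
import * as firehose from 'aws-cdk-lib/aws-kinesisfirehose';

// Illustrative values: buffer up to 5 minutes or 64 MB per file,
// whichever limit is hit first. Larger buffers mean fewer, larger
// files (better for Athena) at the cost of more delay.
new firehose.CfnDeliveryStream(this, 'DeliveryStream', {
  extendedS3DestinationConfiguration: {
    bucketArn: destinationBucketArn, // placeholder
    roleArn: firehoseRoleArn,        // placeholder
    bufferingHints: {
      intervalInSeconds: 300, // maximum delay before a flush
      sizeInMBs: 64,          // or flush once 64 MB is buffered
    },
  },
});
```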
- Append-Only Nature
Data in S3 is stored in an append-only format. If a record is deleted in DynamoDB, it won’t be removed from S3. Instead, you will need to store a record with a “deleted” flag and adjust your queries to account for this. Similarly, updates in DynamoDB are stored as new records, meaning that your Athena queries must be designed to retrieve the latest version of a record, which adds complexity. Such queries can be slower and more expensive, but Athena is cheap, so this is rarely a problem unless you have terabytes of data.
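To illustrate the latest-version logic your queries must encode, here is a minimal in-memory TypeScript sketch. The `Item` shape and field names (`pk`, `ts`, `deleted`) are assumptions; in Athena itself, the same effect is typically achieved with a `ROW_NUMBER()` window function partitioned by the key and ordered by the timestamp descending, keeping only the first row per key.

```typescript
// Assumed record shape: each DynamoDB change lands in S3 as its own row.
interface Item {
  pk: string;        // primary key in DynamoDB
  ts: number;        // write timestamp, used to order versions
  deleted?: boolean; // soft-delete flag written instead of removing the row
}

// Keep only the newest version of each key, then drop soft-deleted rows —
// roughly what filtering on "row number = 1 AND NOT deleted" does in SQL.
function latestLive(items: Item[]): Item[] {
  const latest = new Map<string, Item>();
  for (const item of items) {
    const current = latest.get(item.pk);
    if (!current || item.ts > current.ts) latest.set(item.pk, item);
  }
  return [...latest.values()].filter((i) => !i.deleted);
}
```

Calling `latestLive` with two versions of `a` (the newer one soft-deleted) and one version of `b` returns only `b`: the delete hides `a` without any row ever being removed.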
- Duplicate Records
There’s no guarantee that data transferred to S3 will be free of duplicates. EventBridge Pipes and Firehose do not ensure exactly-once delivery, so some records may be duplicated. Athena looks like an SQL database, but it is not one: it has no primary-key constraints, as you might subconsciously expect, and the transfer process is not idempotent, so duplicates can appear. As a result, most aggregate queries will need a DISTINCT clause or a similar deduplication approach.
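As a sketch of what DISTINCT buys you, here is a small TypeScript helper that removes exact duplicate rows; the function name and the serialization-based approach are illustrative, not part of the actual pipeline.

```typescript
// Remove exact duplicate rows, mirroring SELECT DISTINCT in Athena.
// Caveat: JSON.stringify is key-order sensitive, so two objects with
// the same fields in a different order would not count as duplicates.
function distinctRows<T>(rows: T[]): T[] {
  const seen = new Set<string>();
  return rows.filter((row) => {
    const key = JSON.stringify(row);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```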
While Athena is cost-effective, handling duplicate records or complex queries may lead to performance trade-offs, especially with large datasets.