Rita {FlyNerd} Lyczywek

How-To Export AWS DynamoDB Data To S3 For Recurring Tasks

Have you ever tried to schedule an export of DynamoDB data to S3? I mean an automated recurring task that runs every day, e.g. at 6 AM?
You went to the AWS console only to discover that it limits you to a single one-click export?

I did 😏

Therefore, in this article I'll cover the whole process of exporting AWS DynamoDB data to S3 as a recurring task. Additionally, I'd like my data to be filtered by a secondary index. I'll also answer why and how to do this, and compare the solutions AWS offers.

📔 One side note: I explore universal options, but keep in mind that my table size is below 1 GB.

✨ Let's go! ✨

First things first: Why?

Why export data from DynamoDB to S3?

From the AWS website we can learn the benefits of and reasons for exporting data from DynamoDB to S3. They divide them into:

  • ETL: Perform ETL (Extract, Transform, Load) operations on the exported data in S3, and then import the transformed data back into DynamoDB.

  • Data Archiving: Retain historical snapshots for audit and compliance requirements

  • Data Integration: Integrate the data with other services and applications

  • Data Lake: Build a data lake in S3, allowing users to perform analytics across multiple data sources using services such as Amazon Athena, Amazon Redshift, and Amazon SageMaker

  • Ad-hoc queries: Query data from Athena or Amazon EMR without affecting your DynamoDB capacity

In my case, the BI team asked for a daily snapshot of our DynamoDB table, but only a partial export. So I started the investigation: what are my options?

How to export data from DynamoDB to S3?

At the beginning, I excluded the idea of scanning the table at the Lambda level. Such a solution would be inefficient and costly, and since AWS has dedicated tools for this, it would also be a waste of time.

These are the 3 possible ways in 2023:

  • "basic" Export DynamoDB to S3 feature
  • AWS Glue Job
  • AWS Data Pipeline (to be deprecated)

But before you start, prepare the following.

Requirements:

  • Enable point-in-time recovery (PITR) on the source table; it allows you to export table data from any point in time within the PITR window, up to 35 days (a programmatic sketch for enabling it follows this list).

  • Add an IAM role with permissions to access the DynamoDB table and write to the S3 bucket, allowing:

    • ExportTableToPointInTime (DynamoDB)
    • PutObject (S3)
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["dynamodb:ExportTableToPointInTime"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "*"
    }
  ]
}
```

  • S3 bucket - create a new bucket or select an existing one
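
For the PITR requirement above, here's a minimal sketch of enabling it with the AWS SDK v3 DynamoDB client; the table name is a placeholder of mine, and you can just as well flip the switch in the console:

```typescript
import {
  DynamoDBClient,
  UpdateContinuousBackupsCommand,
} from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

// Turn on point-in-time recovery for the source table ("MyTable" is a placeholder)
await client.send(
  new UpdateContinuousBackupsCommand({
    TableName: "MyTable",
    PointInTimeRecoverySpecification: { PointInTimeRecoveryEnabled: true },
  })
);
```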

Built-in Export DynamoDB to S3

The DynamoDB Export to S3 feature is the easiest way to dump table data to S3. It also doesn't run a scan against the whole table, so it's the efficient, cheaper way.

It is a simple, one-click feature in the DynamoDB console that exports the data in either JSON or Amazon Ion text format.

BUT

🚨 there is no way to filter the data before export

Built-in Export DynamoDB to S3 feature

HOW-TO: Export DynamoDB β†’ S3

Step by step instruction:
  1. Go to the DynamoDB console and select the table you want to export.
  2. Open the "Export table" tab, click the export button, and fill in the details:
    • S3 bucket
    • IAM role (created earlier)
    • Format: choose the format for the exported data (DynamoDB JSON or Amazon Ion)
  3. Start the export process and wait for it to complete.
  4. Check the S3 bucket to verify that the exported data is available in the specified format.

Lambda

As you may notice, there's no option to schedule a recurring task at the AWS console level.

That's why we need a minimal Lambda function, triggered daily at the specified time (e.g. via an EventBridge rule), that calls ExportTableToPointInTime from the AWS SDK (a minimal sketch follows below).

AWS SDK v3 DynamoDB Client | ExportTableToPointInTimeCommand

AWS Lambda export
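
Here's a minimal sketch of such a handler (TypeScript, AWS SDK v3). The TABLE_ARN and EXPORT_BUCKET environment variables and the exports/ prefix are placeholders of mine, not a prescribed setup:

```typescript
import {
  DynamoDBClient,
  ExportTableToPointInTimeCommand,
} from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

export const handler = async (): Promise<void> => {
  // Kick off an export of the table's current state to S3
  const response = await client.send(
    new ExportTableToPointInTimeCommand({
      TableArn: process.env.TABLE_ARN,      // ARN of the source table
      S3Bucket: process.env.EXPORT_BUCKET,  // destination bucket created earlier
      S3Prefix: `exports/${new Date().toISOString().slice(0, 10)}/`, // one folder per day
      ExportFormat: "DYNAMODB_JSON",        // or "ION"
    })
  );
  console.log("Export started:", response.ExportDescription?.ExportArn);
};
```

An EventBridge rule with a schedule expression such as cron(0 6 * * ? *) would then trigger it every day at 6 AM UTC.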

Monitoring

Monitoring is not provided by default; we need to configure it ourselves. For example, use AWS CloudTrail logs for the table export to enable logging, continuous monitoring, and auditing.

Limitations

  • Always dumps the whole table data
  • Recurring tasks need an extra Lambda that runs once per day
  • Task number: up to 300 export tasks, or up to 100 TB of table size, can be exported concurrently. Doc
  • Format: DynamoDB JSON or Amazon Ion text format

Cost

Export to S3 is "free" to set up, as it's part of the DynamoDB service.
We are charged $0.10 per GB exported, plus additional S3 costs for data storage and upload, which vary depending on the region you're in.
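
A quick back-of-envelope for my case (a table below 1 GB, exported once a day): roughly 30 exports × $0.10/GB ≈ $3 per month, plus the S3 storage of the snapshots.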

AWS Glue Jobs

Probably, (right now) it is the best way for data integration, especially when the source needs to stay alive while copying. AWS Glue is flexible, as it allows you to export data not just from DynamoDB, but also from other AWS services.
It is efficient for large datasets because the export feature uses the DynamoDB backup/export functionality (so it doesn't do a scan on the source table). In other words, it performs the Export to S3 (described above) under the hood.

AWS Glue can crawl a DynamoDB table, extract the data into Amazon S3, and perform analysis using SQL queries. Technically, AWS Glue runs jobs in an Apache Spark serverless environment.

📔 Side note: I'm not covering the ETL capabilities of Glue here. I only need a data export, but if you plan to use Glue for ETL operations, you may want to create a Glue Data Catalog for your jobs.

AWS Glue to export data to S3

HOW-TO: Export AWS Glue Jobs β†’ S3

Step by step instruction:

  1. Go to AWS Glue Jobs:
    • Navigate to AWS Glue Studio
    • Click on the "Jobs" menu
    • Click the "Add job" button to create a new AWS Glue job
  2. Select the source: the DynamoDB table
  3. Select the destination: S3
  4. Confirm with the create button

This opens the Glue Studio editor.

AWS Glue Studio

The visual editor guides you through the job's properties, but you need to know what you want to do, because it is a powerful tool full of options.

AWS Glue job editor

Configure the job details:

  1. Data source: DynamoDB table
    • AWS Glue uses DynamoDB's Export to S3 feature and creates a temporary S3 bucket
  2. Data transform: ApplyMapping - add filtering in SQL
  3. Data target: set the format (e.g. JSON), then select the S3 bucket as the destination
  4. Set the schedule (a programmatic alternative is sketched below)

Add AWS job schedule
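
If you'd rather define the schedule outside the Studio UI, a sketch with the AWS SDK v3 Glue client could look like this; the trigger and job names are placeholders of mine:

```typescript
import { GlueClient, CreateTriggerCommand } from "@aws-sdk/client-glue";

const glue = new GlueClient({});

// Create a scheduled trigger that starts the export job every day at 6 AM UTC
await glue.send(
  new CreateTriggerCommand({
    Name: "daily-dynamodb-export",                   // placeholder trigger name
    Type: "SCHEDULED",
    Schedule: "cron(0 6 * * ? *)",                   // same cron syntax as in the Studio UI
    Actions: [{ JobName: "dynamodb-to-s3-export" }], // placeholder job name
    StartOnCreation: true,
  })
);
```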

After configuring the AWS Glue job, click the Run Job button to start the export process.
AWS Glue will automatically extract the data from DynamoDB and store it in the specified S3 bucket.

👉 In my case, I also:

  • set a timeout of 8 hours,
  • added retries - 3 per day; AWS Glue will automatically restart the job if it fails,
  • narrowed down the number of workers from the default 10 to 2 (an experimental decision: the export takes 10 min, and the total cost is lower than with 10 allocated workers; again, this may vary depending on the size of the input data).

Monitoring

Some logs & monitoring are created by default with the job, which is nice 👍

It's a good idea to add alerts in AWS CloudWatch that can notify you via email/Slack/any way you want when your job is failing.

Jobs Cost

$0.44 per DPU-hour + S3 storage
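
A rough back-of-envelope for my setup, assuming 2 G.1X workers (1 DPU each) and the ~10-minute daily run mentioned above: 2 DPU × 1/6 hour ≈ 0.33 DPU-hours, i.e. about $0.15 per run, or roughly $4.50 per month plus S3 storage.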

AWS Data Pipeline [to be deprecated]

✨ I'll quickly go through the main aspects, but without a detailed configuration, because I personally skipped it - it's not worth delving into anymore.

AWS Data Pipeline is being deprecated and will no longer be available after January 1, 2025. AWS recommends alternative solutions

Please note that Data Pipeline service is in maintenance mode and we are not planning to expand the service to new regions. We plan to remove console access by 04/30/2023.

Unfortunately, AWS Data Pipeline is still often recommended on StackOverflow 😉

AWS Data Pipeline is a more complex service that requires configuration, management, and monitoring of pipelines. Sounds similar to Glue when it comes to the functionalities (flexible - many data sources, large scale). What's the difference?

Disadvantages

Deprecated, sure 🙈

Also, this approach is a bit old-fashioned, as it utilises EC2 instances and triggers an EMR cluster to perform the export activity. If the instance and cluster configuration are not properly provided in the pipeline, it could cost... 💸 dearly 💸

HOW-TO: Export AWS Data Pipeline β†’ S3

To export a DynamoDB table, we start with the AWS Data Pipeline console to create a new pipeline. The pipeline launches an Amazon EMR cluster to perform the actual export. Amazon EMR reads the data from DynamoDB, and writes the data to the export file in an Amazon S3 bucket.

AWS Data Pipeline

  • AWS Data Pipeline - manages the import/export workflow for you.

  • Amazon S3 - contains the data that you export from DynamoDB, or import into DynamoDB.

  • Amazon EMR - runs a managed Hadoop cluster to perform reads and writes between DynamoDB and Amazon S3 (the cluster configuration is one m3.xlarge leader node and one m3.xlarge core node)

Pipeline Cost

It charges for pipeline creation, execution, and storage

  • $0.06 per low-frequency task [e.g. a daily activity]
  • $1.00 per high-frequency task [e.g. an hourly activity]
  • Amazon EMR (+EC2) cost for 120 minutes ≈ $17 per month
  • additional S3 cost

Overview: steps to export a table from Amazon DynamoDB to an Amazon S3 bucket via AWS Data Pipeline.

Comparison of the three options:

| | AWS DynamoDB Export | AWS Glue Job | AWS Data Pipeline |
| --- | --- | --- | --- |
| Use for | Data transfer | ETL, Data Catalog, AWS Glue Crawlers | Data transfer, transform and process |
| Serverless | Yes | Yes | No (the default setting manages the lifecycle of AWS EMR clusters and AWS EC2 instances to execute jobs) |
| Allows filters / mapping | No | Yes | Yes |
| Cost | $0.10/GB + S3 storage | $0.44/DPU-hour + S3 storage | $1.00 per high-freq task, $0.06 per low-freq task + Amazon EMR (+EC2) + S3 storage |
| Data replication | Full table; export from a specific point in time | Full table; export from a specific point in time; incremental | Full table; incremental replication via timestamp |
| Output format | JSON, Ion (json.gz - compressed) | JSON, Ion, CSV, Parquet, XML, Avro, grokLog, ORC (compression optional) | CSV, JSON, custom formats |

Final thoughts on export costs 💰

As an AWS developer, you should have a little bit of an accountant inside your heart too 🖤.
Try to keep your records as small as possible, and use on-demand pricing wisely. It is ✨ so convenient ✨, I know. While it may not seem expensive and you don't need to think about scaling, it can sometimes be up to 4-6 times more expensive per request compared to provisioned capacity. Therefore, it's better to sit down and calculate before making a final decision.
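
To make that concrete, a back-of-envelope sketch assuming the 2023 us-east-1 list prices (about $1.25 per million on-demand write request units vs. $0.00065 per provisioned WCU-hour): one fully utilised WCU handles 3,600 writes per hour for $0.00065, while the same 3,600 writes on-demand cost 3,600 × $1.25 / 1,000,000 ≈ $0.0045, roughly 7× more. Real tables are never 100% utilised, which is where the 4-6× range comes from.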

For me, on-demand is cheaper than fixed capacity, but please refer to the oldest programmer's answer: "IT DEPENDS".

AWS costs - it depends

What's next?

Now your data is in S3? Then it's time to think about a retention policy 🧹 and when to archive data to AWS S3 Glacier. But maybe that's a subject for the next post.
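
As a teaser, a minimal sketch of such a lifecycle rule with the AWS SDK v3 S3 client; the bucket name, prefix and the 30/365-day thresholds are placeholders of mine:

```typescript
import {
  S3Client,
  PutBucketLifecycleConfigurationCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Move daily snapshots to Glacier after 30 days and delete them after a year
await s3.send(
  new PutBucketLifecycleConfigurationCommand({
    Bucket: "my-export-bucket",             // placeholder bucket name
    LifecycleConfiguration: {
      Rules: [
        {
          ID: "archive-dynamodb-exports",
          Status: "Enabled",
          Filter: { Prefix: "exports/" },   // only the export snapshots
          Transitions: [{ Days: 30, StorageClass: "GLACIER" }],
          Expiration: { Days: 365 },
        },
      ],
    },
  })
);
```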

Worth reading 📚

How can I back up a DynamoDB table to Amazon S3?

Accelerate Amazon DynamoDB data access in AWS Glue jobs using the new AWS Glue DynamoDB Export connector

Amazon CloudWatch alarms on AWS Glue job

Top comments (2)

Roman Roshchin

just what I needed!

noddyStark • Edited

While selecting the incremental export, it says "The export period must be between 15 minutes and 24 hours in length. The start is inclusive. The end time is exclusive. For date, use YYYY/MM/DD format. For time, use 24-hour format."

What does it really mean? Does it mean we can export data in a time difference of minimum 15 minutes and maximum 24 hours?