GargeeBhatnagar for AWS Community Builders

Export database snapshots manually to S3 & export S3 content in Glue Data Catalog using Crawler for tables fetched in Athena

“Challenge faced: how to reduce the cost of exporting RDS data to S3 without supplying the RDS username, password, and other connection details.” First and foremost, I checked the different methods for exporting RDS data to S3. I found options such as AWS Data Pipeline, AWS Glue, and a manual export from a database snapshot. Data Pipeline was ruled out by region: I had to do the task in the Mumbai region, where the Data Pipeline service is not available. Next I looked at AWS Glue, but to export data to S3 with Glue I would need to enter the RDS username, password, and other details in the crawler. That left only one option: export the DB snapshot to S3 manually. I calculated the cost, which is charged on a per-GB basis, and chose the manual route.

“Challenge faced: how to get the crawler in Glue to run and load the data into the Glue Data Catalog.” When I first exported the data to S3 manually, I did not supply a KMS key ARN during the configuration process, even though my RDS database is encrypted for security purposes. As a result, the crawler configured to load the S3 data into the Glue Data Catalog did not run properly and did not add any tables to the catalog. To resolve this, I repeated the manual export of the DB snapshot to S3 using a KMS key, and after configuring the crawler in Glue I added the Glue role as a key user on that KMS key.
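If you want to script this fix instead of using the console, here is a minimal boto3 sketch. The console's "key user" step edits the key policy; a KMS grant is one programmatic way to give the crawler role the same decrypt access. The ARNs below are hypothetical placeholders.

```python
import boto3

kms = boto3.client("kms", region_name="ap-south-1")

# Hypothetical identifiers -- substitute your own KMS key ARN and the
# Glue service role created for the crawler (AWSGlueServiceRole-test-demo).
key_arn = "arn:aws:kms:ap-south-1:123456789012:key/EXAMPLE-KEY-ID"
glue_role_arn = (
    "arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole-test-demo"
)

# Allow the Glue role to decrypt the KMS-encrypted snapshot export in S3.
kms.create_grant(
    KeyId=key_arn,
    GranteePrincipal=glue_role_arn,
    Operations=["Decrypt", "DescribeKey"],
)
```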

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months. AWS Glue provides both visual and code-based interfaces to make data integration easier. Users can easily find and access data using the AWS Glue Data Catalog. To learn more, read the AWS Glue documentation.

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds.

With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. Athena is integrated out of the box with the AWS Glue Data Catalog, allowing you to create a unified metadata repository across various services, crawl data sources to discover schemas and populate your catalog with new and modified table and partition definitions, and maintain schema versioning. To learn more, read the Amazon Athena documentation.

In this post, you will learn how to export database snapshots manually to S3 and load the S3 content into the Glue Data Catalog using a crawler so that the tables can be queried in Athena. Here I used a CentOS EC2 server and a MySQL RDS database, and created a database snapshot that I needed to export to S3 so that the data could be crawled into Glue. From there I can run queries in Athena and store the query output in an S3 bucket for easy access by users.

Prerequisites

You’ll need an Amazon EC2 server for this post. Getting started with Amazon EC2 provides instructions on how to launch an EC2 server.

You’ll also need an Amazon RDS database. Getting started with RDS provides instructions on how to create one. For this blog, I assume that I have an EC2 server and an RDS database with a recent database snapshot that contains the current databases and tables of the RDS instance.

Architecture Overview

Diagram 1
Diagram 2
Diagram 3

The architectural diagrams show the overall deployment architecture and the data flow across the RDS DB snapshot, S3, AWS Glue, and AWS Athena.

Solution overview

The blog post consists of the following phases:

  1. Export the RDS DB snapshot to S3.
  2. Create and run a crawler in AWS Glue to load the S3 data into the Glue Data Catalog.
  3. Run queries in Athena and store the query output in an S3 bucket.

I have an EC2 server and an RDS database with the latest DB snapshot, as shown below.

Phase 1: Export the RDS DB snapshot to S3

  1. Open the AWS RDS console and go to the Snapshots section. Choose the snapshot, click the Actions drop-down, and choose "Export to Amazon S3".
  2. Give the export identifier name as test-demo-export-1.
  3. You can choose the partial option to export only selected databases or tables.
  4. I chose the "All" option, as I needed to export the full data to S3.
  5. Created a bucket named test-demo-export with default settings.
  6. Selected the S3 bucket "test-demo-export" and created a new IAM role named test-demo-export in the export DB snapshot configuration.
  7. Created a KMS key named "export-rds-snapshot-to-S3" in Key Management Service with default settings and copied the ARN of the created key into the export DB snapshot configuration.
  8. Click the "Export to Amazon S3" button; the export process then starts.
  9. After the export task to the S3 bucket completed, I was able to see the databases and tables in the bucket named test-demo-export.
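The same export can also be started programmatically. Below is a minimal boto3 sketch using the names from this walkthrough; the account ID, snapshot ARN, and KMS key ID are hypothetical placeholders.

```python
import boto3

rds = boto3.client("rds", region_name="ap-south-1")

# Start the snapshot export; ARNs are placeholders for your own resources.
response = rds.start_export_task(
    ExportTaskIdentifier="test-demo-export-1",
    SourceArn="arn:aws:rds:ap-south-1:123456789012:snapshot:my-db-snapshot",
    S3BucketName="test-demo-export",
    IamRoleArn="arn:aws:iam::123456789012:role/test-demo-export",
    KmsKeyId="arn:aws:kms:ap-south-1:123456789012:key/EXAMPLE-KEY-ID",
    # Omit ExportOnly to export everything (the "All" option above), or
    # pass e.g. ExportOnly=["mydb.mytable"] for a partial export.
)
print(response["Status"])  # STARTING, then IN_PROGRESS, then COMPLETE
```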

Phase 2: Create and run a crawler in AWS Glue to load the S3 data into the Glue Data Catalog

  1. In the AWS Glue console, go to the Crawlers section and click the "Add crawler" button. Give the crawler name as test-demo and click Next.
  2. In "Specify crawler source type", choose the "Data stores" option and the "Crawl all folders" option, then click Next.
  3. Choose S3 as the source data store from the drop-down list. Choose the "Specified path in my account" option, give the path of the S3 data location as s3://test-demo-export/test-demo-export-1/, and click Next.
  4. Choose No for "Add another data store" and click Next.
  5. Create a new role named test-demo and click Next.
  6. Choose "Run on demand" as the crawler schedule, since the crawler only needs to run once, then click Next.
  7. Click "Add database", give the name test-demo, leave the other options at their defaults, and click Create. The table prefix, grouping behavior for S3 data, and configuration options are optional, so leave them at their defaults and click Next.
  8. Review the crawler information and click the Finish button.
  9. The crawler is created and ready to run.
  10. Before running the crawler, go to Customer managed keys and add the Glue service role created during crawler configuration (AWSGlueServiceRole-test-demo) as a key user. Note that without this step the crawler won't add tables to the Data Catalog.
  11. Choose the crawler and click the "Run crawler" option.
  12. Once the crawler starts and finishes, it returns to the Ready state. We can see the number of tables added and the logs in CloudWatch.
  13. Click the crawler name to see the configuration of the crawler.
  14. Click the Tables option to see the tables added in the test-demo database.
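For reference, here is a minimal boto3 sketch of the same crawler setup and run, assuming the service role already exists and has been added as a KMS key user (step 10). The names mirror the walkthrough.

```python
import time
import boto3

glue = boto3.client("glue", region_name="ap-south-1")

# Create an on-demand crawler over the exported snapshot data in S3.
glue.create_crawler(
    Name="test-demo",
    Role="AWSGlueServiceRole-test-demo",
    DatabaseName="test-demo",
    Targets={"S3Targets": [{"Path": "s3://test-demo-export/test-demo-export-1/"}]},
)

glue.start_crawler(Name="test-demo")

# Wait for the crawler to return to the READY state; the discovered
# tables then appear in the Glue Data Catalog.
while glue.get_crawler(Name="test-demo")["Crawler"]["State"] != "READY":
    time.sleep(30)
```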

Phase 3: Run queries in Athena and store the query output in an S3 bucket

  1. Go to the AWS Athena console and select the database from the drop-down list; you can then see its tables.
  2. Create a folder named "athena-test-demo-output" in the same bucket to save the output of the queries run in Athena.
  3. In the Athena console, go to Settings, set the query result location to the output folder path s3://test-demo-export/athena-test-demo-output/, and click Save.
  4. Choose any table and preview it by clicking the three-dots option. Then click the "Run query" button to get the results.
  5. Check the output folder to see the results of the executed query. You can download the results from the bucket by choosing the file.
  6. To create a table manually without running the crawler, go to the "Create table" option, give the database name, the S3 bucket path, the data format, the column names and types, and the partition details, then click the "Create table" button.
  7. You can view the history of executed queries and download their results.
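The same query flow can be driven with boto3 as well. This is a minimal sketch assuming the database crawled in Phase 2; the table name in the query is a hypothetical placeholder.

```python
import time
import boto3

athena = boto3.client("athena", region_name="ap-south-1")

# Run a query against a crawled table; results land in the output folder.
qid = athena.start_query_execution(
    QueryString='SELECT * FROM "my_table" LIMIT 10;',  # placeholder table name
    QueryExecutionContext={"Database": "test-demo"},
    ResultConfiguration={
        "OutputLocation": "s3://test-demo-export/athena-test-demo-output/"
    },
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    print(rows[0])  # first row holds the column headers
```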

Clean-up

Delete the environment when you are done: the EC2 server, the RDS database, the AWS Glue crawler and database, the Athena query settings, the KMS key, the IAM roles, the S3 bucket, and the CloudWatch log group.

Pricing

Let's review the pricing and the estimated cost of this example.

For every DB snapshot export to S3, the cost depends on the total storage size of the data →
Cost per GB = $0.01
For S3 storage →
Cost per GB = $0.025
For the AWS Glue crawler, the cost depends on the crawler run time →
Cost per Data Processing Unit-Hour = $0.44
Your Glue storage cost is $0, as storage for the first million tables is free. Your first million Glue requests are also free; above the free tier, requests are billed at $1 per million.
For AWS Athena →
Cost per TB scanned = $5
For Key Management Service →
Cost per customer managed KMS key = $1

You can learn more by visiting the AWS Glue pricing and AWS Athena pricing service details page.

In my case, the cost of using the AWS Glue and AWS Athena services was:
For the DB snapshot export to S3 (by total storage size of data) →
Cost of 20 GB = $0.20
For S3 storage →
Cost of 2 GB = $0.05
For the AWS Glue crawler (by crawler run time) →
Cost of 0.074 DPU-Hours = $0.03
For AWS Athena →
Cost of 0.000060 TB scanned = $0.00
For Key Management Service →
Cost of 0.022 key-months = $0.02
Total cost = $(0.20 + 0.05 + 0.03 + 0.00 + 0.02) = $0.30
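As a quick sanity check, here is the same arithmetic in a few lines of Python, using the unit prices quoted above:

```python
# Back-of-the-envelope cost check for this example.
snapshot_export = 20 * 0.01    # 20 GB exported at $0.01/GB
s3_storage = 2 * 0.025         # 2 GB stored at $0.025/GB-month
glue_crawler = 0.074 * 0.44    # 0.074 DPU-hours at $0.44/DPU-hour
athena = 0.000060 * 5          # ~60 MB scanned at $5/TB
kms_key = 0.022 * 1            # pro-rated customer managed key at $1/month

total = snapshot_export + s3_storage + glue_crawler + athena + kms_key
print(f"${total:.2f}")  # ~$0.30
```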

Summary

In this post, I showed you how to export database snapshots manually to S3 and load the S3 content into the Glue Data Catalog using a crawler so that the tables can be queried in Athena.
For more details on the Glue and Athena services, check out Get started with AWS Glue and AWS Athena by opening the AWS Glue console and the Athena console. To learn more, read the AWS Glue documentation and the Amazon Athena documentation.

Thanks for reading!

Connect with me: Linkedin
