DEV Community

Frugal SQL data access with Athena and Blue / Green support

Introduction

In this post I look at a frugal architecture for SQL based data access.

The prompt for writing this blog post came from a recent discussion about an application a team was looking to migrate to the cloud.

The key requirement for the migration was the ability to run SQL against the data, which was very small in volume (<200 MB).

During the discussion I turned to Athena, which is one of my favourite AWS services. Athena offers JDBC drivers, so I suggested we could use it in place of the MySQL database that was going to be provided by RDS.

I was also asked how I would handle a blue / green style deployment with Athena. The specific requirement was that each time the application was deployed the database would be replaced with a new version including all data.

SQL Setup

With Athena there is no visible database resource to create like there is with RDS. The steps to allow SQL access to the data are as follows.

  • An S3 bucket to store the data in.
  • A Glue database and table that define the structure of the data held in S3.

A quick way to test this out is to use a tool like Mockaroo to generate some test data and then have a Glue Crawler analyse the data in S3 and create the required data catalog entries.

Here is the sample schema definition in Mockaroo.

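If you prefer to stay local, a few lines of Python can generate similar test data. The column names below (id, first_name, last_name, email) are my assumption standing in for the Mockaroo schema; adjust them to match your own definition.

```python
import csv
import random

# Small pools of fake values to sample from
FIRST = ["Ada", "Grace", "Alan", "Edsger"]
LAST = ["Lovelace", "Hopper", "Turing", "Dijkstra"]


def write_test_data(path, rows=100):
    """Write a CSV of fake rows with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "first_name", "last_name", "email"])
        for i in range(1, rows + 1):
            first = random.choice(FIRST)
            last = random.choice(LAST)
            email = f"{first.lower()}.{last.lower()}@example.com"
            writer.writerow([i, first, last, email])


write_test_data("myapp_sql_blue.csv", rows=100)
```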

From here I created two S3 buckets. One would hold data for my 'Blue' deployment and one for the 'Green' deployment. I called the buckets myapp.sql.blue and myapp.sql.green.

In Glue I created a database called MyApp just to provide a logical separation between this and any other databases I may have in the same account.


I downloaded two sets of data from Mockaroo and uploaded a file to each of the S3 buckets.


I then created a Glue Crawler for each bucket. Here is an example.


Running both crawlers populates the Glue data catalog with two tables.
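The console steps above can also be scripted. Here is a sketch using boto3; the crawler names are my own convention, and you'd need to supply an IAM role that Glue can assume with access to the buckets.

```python
def crawler_definition(colour, role_arn):
    """Build the Glue crawler definition for one deployment colour.

    The name format is an assumption; the database and bucket names
    match the ones used in this post.
    """
    return {
        "Name": f"myapp-sql-{colour}",
        "Role": role_arn,
        "DatabaseName": "myapp",
        "Targets": {"S3Targets": [{"Path": f"s3://myapp.sql.{colour}/"}]},
    }


def create_and_run_crawlers(role_arn):
    import boto3  # imported here so crawler_definition works without the SDK

    glue = boto3.client("glue")
    glue.create_database(DatabaseInput={"Name": "myapp"})
    for colour in ("blue", "green"):
        definition = crawler_definition(colour, role_arn)
        glue.create_crawler(**definition)
        glue.start_crawler(Name=definition["Name"])
```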


At this point I'm able to run SQL in Athena against each of the tables.


Athena Blue / Green

For the Blue / Green component I utilise a view created in Athena. Just like in RDS, views in Athena can be created over one or more tables using a SQL query. We don't need anything too complex here.

create view myapp_sql as select * from myapp_sql_blue;

In the data catalog the view appears alongside the tables.


This gives me a consistent name for my application to point to. When I want to switch over the data being used, I can simply recreate the view pointing at either the blue or the green bucket's data.

create or replace view myapp_sql as select * from myapp_sql_green;

Testing

Let's create a Lambda function to test this out.

import pyathena


def lambda_handler(event, context):

    # Connect to Athena; query results are staged in the S3 bucket below
    connection = pyathena.connect(
      s3_staging_dir="s3://athena.myapp.work/",
      region_name="us-east-1"
    )

    cursor = connection.cursor()

    # Query the view, which points at whichever table is currently live
    query = "SELECT * FROM myapp.myapp_sql"

    cursor.execute(query)

    results = cursor.fetchall()

    print(results)

    return {
        'statusCode': 200
    }

Running this gives the output below.


A note on IAM. The Lambda function will require permissions for Athena and S3. For testing purposes I attached the AmazonAthenaFullAccess and AmazonS3FullAccess managed policies. In production you should scope the IAM policy down to the least privileges required.
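As a rough sketch of what a tighter inline policy might look like, expressed here as a Python dict: the actions are standard IAM actions for Athena, Glue and S3, but the resource ARNs are my assumptions based on the bucket names used in this post, and you should verify the exact set your queries need.

```python
# Sketch of a least-privilege policy for the Lambda role. The S3 ARNs are
# assumptions matching the buckets in this post; Athena also needs read
# access to the Glue data catalog to resolve the view and tables.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {   # submit queries and fetch results
            "Effect": "Allow",
            "Action": ["athena:StartQueryExecution",
                       "athena:GetQueryExecution",
                       "athena:GetQueryResults"],
            "Resource": "*",
        },
        {   # read table metadata from the Glue data catalog
            "Effect": "Allow",
            "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetTables"],
            "Resource": "*",
        },
        {   # read the data buckets; write query results to the staging bucket
            "Effect": "Allow",
            "Action": ["s3:GetBucketLocation", "s3:ListBucket",
                       "s3:GetObject", "s3:PutObject"],
            "Resource": [
                "arn:aws:s3:::myapp.sql.blue", "arn:aws:s3:::myapp.sql.blue/*",
                "arn:aws:s3:::myapp.sql.green", "arn:aws:s3:::myapp.sql.green/*",
                "arn:aws:s3:::athena.myapp.work", "arn:aws:s3:::athena.myapp.work/*",
            ],
        },
    ],
}
```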

Deployment Switchover

Let's now imagine that I have a pipeline that has loaded my new data into the second bucket. In the pipeline I can run the following step to switch the view over to the latest data.

export TABLE_SUFFIX=green
aws athena start-query-execution --query-string "create or replace view myapp_sql as select * from myapp_sql_$TABLE_SUFFIX" --result-configuration "OutputLocation=s3://athena.myapp.work" --query-execution-context "Database=myapp"
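One caveat: the CLI call returns as soon as the query is submitted, not when it completes, so a careful pipeline step should poll for the result before moving on. Here is a sketch in Python using boto3 (the function names are my own):

```python
import time


def build_switch_query(target_suffix):
    """Return the DDL that repoints the view at the blue or green table."""
    return ("create or replace view myapp_sql "
            f"as select * from myapp_sql_{target_suffix}")


def switch_view(target_suffix, database="myapp",
                output="s3://athena.myapp.work/"):
    import boto3  # imported here so build_switch_query works without the SDK

    athena = boto3.client("athena")
    execution = athena.start_query_execution(
        QueryString=build_switch_query(target_suffix),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )
    query_id = execution["QueryExecutionId"]

    # start_query_execution is asynchronous: poll until the DDL finishes
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(1)
```

Failing the pipeline when the returned state is not SUCCEEDED keeps a broken switchover from silently going unnoticed.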

Considerations

This setup should work well for simple SQL-based access to data where volumes are not too high. You can optimise queries further within Athena by using columnar data formats such as Parquet.

Costs for the storage, assuming the S3 standard tier, will be ~$0.023 per GB-month. Querying via Athena costs $5.00 per TB of data scanned. We only pay when we run a query, unlike RDS, which we pay for even when we are not running SQL.
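To make that concrete, here is the back-of-envelope arithmetic for the ~200 MB dataset, assuming (my assumption) 10,000 queries a month that each scan the full dataset:

```python
# Monthly cost estimate using the prices quoted above
DATA_GB = 0.2                  # the ~200 MB dataset
S3_PER_GB_MONTH = 0.023        # S3 standard tier
ATHENA_PER_TB = 5.00           # Athena, per TB scanned
QUERIES_PER_MONTH = 10_000     # assumption: each query scans everything

storage_cost = DATA_GB * S3_PER_GB_MONTH
scan_cost = QUERIES_PER_MONTH * (DATA_GB / 1000) * ATHENA_PER_TB

print(f"storage: ${storage_cost:.4f}/month")   # fractions of a cent
print(f"queries: ${scan_cost:.2f}/month")
```

Even at that query volume the bill is around ten dollars a month, well under the cost of keeping a small RDS instance running around the clock.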

As long as the access characteristics of your application are a match for the performance of the AWS services used, S3-based SQL access via Athena is a tough one to beat for those looking to be The Frugal Architect.
