<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matt Houghton</title>
    <description>The latest articles on DEV Community by Matt Houghton (@mattdevdba).</description>
    <link>https://dev.to/mattdevdba</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F497647%2F5b057ee3-a70e-41a2-8652-dcd4b6d8a36c.jpeg</url>
      <title>DEV Community: Matt Houghton</title>
      <link>https://dev.to/mattdevdba</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mattdevdba"/>
    <language>en</language>
    <item>
      <title>Battle of the LLMs – How the Rise of DeepSeek Changes the Competitive Landscape</title>
      <dc:creator>Matt Houghton</dc:creator>
      <pubDate>Mon, 11 Aug 2025 16:14:36 +0000</pubDate>
      <link>https://dev.to/aws-builders/battle-of-the-llms-how-the-rise-of-deepseek-changes-the-competitive-landscape-1hgn</link>
      <guid>https://dev.to/aws-builders/battle-of-the-llms-how-the-rise-of-deepseek-changes-the-competitive-landscape-1hgn</guid>
      <description>&lt;p&gt;CDL Head of Architecture, Matt Eisengruber, and Data &amp;amp; AI Architect, Matt Houghton, assess the implications of DeepSeek as it emerges from China as a major player in the LLM space.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1: Industry Coverage and Benchmarks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;In the rapidly evolving world of artificial intelligence, DeepSeek has emerged as a formidable contender in the landscape of Large Language Models (LLMs). Developed with a focus on high-performance reasoning and multilingual capabilities, DeepSeek gained traction for its open-source transparency and competitive benchmark results. As organisations increasingly rely on LLMs for automation, analytics, and customer engagement, DeepSeek’s rise signals a shift toward more accessible and customisable AI solutions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Open-source LLMs like DeepSeek are democratising access to cutting-edge AI, enabling innovation beyond the walls of Big Tech.” — Dr. Andrew Ng, AI Pioneer&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Industry Coverage
&lt;/h3&gt;

&lt;p&gt;The LLM space is currently dominated by models from OpenAI (GPT-4), Anthropic (Claude), Google DeepMind (Gemini), and Meta (LLaMA). However, the emergence of open-weight and open-source alternatives like Mistral and DeepSeek is reshaping the competitive landscape.&lt;/p&gt;

&lt;p&gt;DeepSeek, developed by a Chinese AI research group, has positioned itself as a high-performing, multilingual model with strong reasoning capabilities. It supports both instruction-following and code generation tasks, making it versatile for enterprise use.&lt;/p&gt;

&lt;p&gt;Key differentiators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Excellent price-for-performance ratio&lt;/li&gt;
&lt;li&gt;Open-source availability (Apache 2.0 licence)&lt;/li&gt;
&lt;li&gt;Multilingual support, including Chinese and English&lt;/li&gt;
&lt;li&gt;Strong performance on reasoning benchmarks like MATH and GSM8K&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Benchmarks
&lt;/h3&gt;

&lt;p&gt;DeepSeek has demonstrated impressive results across several industry-standard benchmarks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4i8w2py09z7if5c9plg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4i8w2py09z7if5c9plg.png" alt=" " width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffkctf7aozau1pcisy1x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffkctf7aozau1pcisy1x.png" alt=" " width="800" height="273"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Source: DeepSeek GitHub, llm-stats.com&lt;/p&gt;

&lt;h3&gt;
  
  
  Expert Opinions
&lt;/h3&gt;

&lt;p&gt;At the 2025 AI Frontiers Conference, DeepSeek was highlighted as a “breakthrough in open source reasoning”, with several researchers praising its balance of performance and accessibility.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“DeepSeek’s performance on multilingual and mathematical reasoning tasks is a game-changer for global enterprises.” — Dr. Fei-Fei Li, Stanford University&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Part 2: CDL’s approach to testing LLMs
&lt;/h2&gt;

&lt;p&gt;When testing Generative AI (GenAI) systems, "non-deterministic" refers to the inherent challenge that the same input can produce different outputs on repeated runs, making it difficult to reliably test and verify the system's behaviour.&lt;/p&gt;

&lt;p&gt;At CDL, we have introduced new testing approaches and tools to address these non-deterministic challenges in GenAI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run multiple tests with the same input and analyse the distribution of outputs to understand the range of possible results.&lt;/li&gt;
&lt;li&gt;Develop specific metrics to measure different aspects of the output like coherence, accuracy, relevance, and bias to assess quality even with variations in responses.&lt;/li&gt;
&lt;li&gt;Use a wide variety of input prompts to test the model's ability to handle different contexts and situations.&lt;/li&gt;
&lt;li&gt;Continuously monitor the performance of the model in real-world scenarios and use feedback to refine the training data and improve its accuracy.&lt;/li&gt;
&lt;li&gt;Move to intent-based testing where we focus on evaluating whether the output aligns with the intended meaning or purpose of the prompt rather than just checking for exact matches.&lt;/li&gt;
&lt;/ul&gt;
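The first bullet - running the same prompt repeatedly and analysing the spread of outputs - can be sketched in a few lines of Python. This is a minimal illustration rather than CDL's actual tooling; invoke_model is a placeholder for whatever client calls the LLM under test.

```python
from collections import Counter

def output_distribution(invoke_model, prompt, runs=10):
    """Call the model repeatedly with the same prompt and
    summarise the spread of answers it produces."""
    answers = [invoke_model(prompt) for _ in range(runs)]
    counts = Counter(answers)
    modal_answer, freq = counts.most_common(1)[0]
    return {
        "unique_answers": len(counts),
        "modal_answer": modal_answer,
        "consistency": freq / runs,  # 1.0 means fully deterministic
    }
```

In practice the summary would feed the quality metrics described above (coherence, accuracy, relevance, bias) rather than a simple exact-match count.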

&lt;h3&gt;
  
  
  Testing Methodology
&lt;/h3&gt;

&lt;p&gt;Our process for testing a model is first to define a set of questions, shown in the diagram as the prompts. These live in a JSON Lines file containing the question, the expected answer (also known as the ground truth data) and a category.&lt;/p&gt;
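As an illustration, one record of such a JSON Lines file might look like the sketch below. The field names are hypothetical; the post describes the fields but not the exact keys used.

```python
import json

# Hypothetical key names: the post names the fields (question,
# ground truth, category) but not the exact JSON keys used.
record = {
    "question": "What is an insurance excess?",
    "ground_truth": "The amount the policyholder pays towards a claim.",
    "category": "policy-terms",
}

# Each line of the .jsonl file is one self-contained JSON object.
line = json.dumps(record)
parsed = json.loads(line)
print(parsed["category"])  # prints policy-terms
```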

&lt;p&gt;The tests, known as evaluations, are run in a couple of modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;First, we run an automated LLM-as-a-judge evaluation. This is where we ask our LLM under test to answer the question, then pass that answer along with the ground truth data to a second LLM and ask it to grade the first LLM's response.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The second mode is a human evaluation. We take the same prompts and ground truth data, but this time we ask a team of people to evaluate the model.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
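The LLM-as-a-judge mode boils down to constructing a grading prompt from the question, the candidate answer and the ground truth. A simplified sketch follows; the wording and rubric are illustrative assumptions, not the actual prompts used.

```python
def build_judge_prompt(question, candidate_answer, ground_truth):
    """Assemble the instruction sent to the second (judge) LLM.
    A production rubric would be richer; this shows the shape."""
    return (
        "You are grading another model's answer.\n"
        "Question: " + question + "\n"
        "Candidate answer: " + candidate_answer + "\n"
        "Reference answer: " + ground_truth + "\n"
        "Reply CORRECT if the candidate matches the reference in meaning, "
        "otherwise reply INCORRECT, with a one-line justification."
    )
```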

&lt;p&gt;Evaluation output is shown in the Bedrock console and all output and results are stored in S3 for further analysis. We have also enabled these tests as part of our CI / CD Pipelines, you can &lt;a href="https://matthoughton.cloud/markdown-template.html?md=/blog/2025/2025-05-12_CI%20CD%20For%20Bedrock%20Evaluations.md" rel="noopener noreferrer"&gt;read Matt Houghton’s blog&lt;/a&gt; on how this was completed.&lt;/p&gt;

&lt;p&gt;We utilise the Amazon Bedrock evaluations feature to assess the performance and effectiveness of the model.&lt;/p&gt;

&lt;p&gt;Amazon Bedrock computes our required performance metrics, such as the semantic robustness of a model and the correctness of a knowledge base in retrieving information and generating responses.&lt;/p&gt;

&lt;p&gt;For model evaluations, we use both automatic evaluations and a team of human workers to rate the outputs and provide their input for the evaluation. This approach gives us flexibility, such as utilising both company employees and industry subject-matter experts - in this case, from the insurance industry. We can also include and assess retrieval-augmented generation (RAG) workloads to validate that knowledge bases retrieve highly relevant information and generate useful, appropriate responses.&lt;/p&gt;
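Bedrock evaluation jobs can also be created programmatically through boto3's bedrock client and its create_evaluation_job call. The sketch below only assembles a request body; every name, ARN, model identifier and S3 URI is a placeholder, and the parameter shapes should be checked against the current API reference before use.

```python
# Sketch of a request body for bedrock.create_evaluation_job.
# All names, ARNs and S3 URIs below are placeholders.
request = {
    "jobName": "deepseek-rag-eval",
    "roleArn": "arn:aws:iam::123456789012:role/BedrockEvalRole",
    "evaluationConfig": {
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "insurance-prompts",
                        "datasetLocation": {"s3Uri": "s3://my-bucket/prompts.jsonl"},
                    },
                    "metricNames": ["Builtin.Accuracy"],
                }
            ]
        }
    },
    "inferenceConfig": {
        "models": [{"bedrockModel": {"modelIdentifier": "placeholder-model-id"}}]
    },
    "outputDataConfig": {"s3Uri": "s3://my-bucket/eval-output/"},
}
# client = boto3.client("bedrock"); client.create_evaluation_job(**request)
```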

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bixl2v72l2ri93h6hdf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bixl2v72l2ri93h6hdf.jpg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Test Results
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bt6vxicer2xokiol7cy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1bt6vxicer2xokiol7cy.jpg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87jg91ghl8qg3rwksxql.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87jg91ghl8qg3rwksxql.jpg" alt=" " width="800" height="449"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;As the tables above show, DeepSeek holds its own and lives up to its hype against other models as a chatbot when fielding insurance-specific queries in a RAG-based architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 3: Closing Thoughts on the Results and Possible Implications on the Insurance Industry
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Summary of Findings
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;DeepSeek R1 is a top-tier open-source LLM that could be used in the insurance industry as a chatbot.&lt;/li&gt;
&lt;li&gt;It performs competitively with proprietary models in most benchmarks.&lt;/li&gt;
&lt;li&gt;Its low hallucination rate and high accuracy make it suitable for enterprise applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implications for the Insurance Industry
&lt;/h3&gt;

&lt;p&gt;DeepSeek’s capabilities open new possibilities for insurers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Risk Assessment: Automating underwriting with accurate, explainable reasoning.&lt;/li&gt;
&lt;li&gt;Fraud Detection: Analysing patterns in claims with multilingual support.&lt;/li&gt;
&lt;li&gt;Customer Service: Deploying chatbots that understand complex queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“LLMs like DeepSeek can transform how insurers interact with customers and assess risk, especially in multilingual markets.” — Insurance AI Journal, May 2025&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Future Outlook
&lt;/h3&gt;

&lt;p&gt;As DeepSeek continues to evolve, we anticipate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Larger context windows for document-heavy industries&lt;/li&gt;
&lt;li&gt;Integration with retrieval-augmented generation (RAG) for real-time data access&lt;/li&gt;
&lt;li&gt;Domain-specific fine-tuning for insurance, legal, and healthcare sectors&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;DeepSeek is not just another LLM; it’s a signal of the growing power and potential of open-source AI. For the insurance industry, it represents a cost-effective, high-performance alternative to proprietary models. As we continue to explore its applications, stay tuned for future posts where we dive into other models as they emerge.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>deepseek</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Frugal SQL data access with Athena and Blue / Green support</title>
      <dc:creator>Matt Houghton</dc:creator>
      <pubDate>Tue, 12 Mar 2024 21:21:03 +0000</pubDate>
      <link>https://dev.to/aws-builders/frugal-sql-data-access-with-athena-and-blue-green-support-1ool</link>
      <guid>https://dev.to/aws-builders/frugal-sql-data-access-with-athena-and-blue-green-support-1ool</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this post I look at a frugal architecture for SQL-based data access.&lt;/p&gt;

&lt;p&gt;The prompt for writing this blog post came from a recent discussion on an application a team were looking to migrate to the cloud. &lt;/p&gt;

&lt;p&gt;The main requirement for the migration was the ability to run SQL against the data, which was very small in volume (&amp;lt;200 MB).&lt;/p&gt;

&lt;p&gt;During the discussion I turned to Athena, which is one of my favourite AWS services. Athena offers JDBC drivers, so I suggested we could swap it in for the MySQL database that was going to be provided by RDS.&lt;/p&gt;

&lt;p&gt;I was also asked how I would handle a blue / green style deployment with Athena. The specific requirement was that each time the application was deployed the database would be replaced with a new version including all data.&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL Setup
&lt;/h2&gt;

&lt;p&gt;With Athena there is no visible database resource to create like there is with RDS. The steps to allow SQL access to data are as follows.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An S3 bucket to store the data.&lt;/li&gt;
&lt;li&gt;A Glue database and table that define the structure of the data held in S3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A quick way to test this out is to use a tool like &lt;a href="https://mockaroo.com" rel="noopener noreferrer"&gt;Mockaroo&lt;/a&gt; to generate some test data and then have a Glue Crawler analyse the data in S3 and create the required data catalog entries.&lt;/p&gt;

&lt;p&gt;Here is the sample schema definition in Mockaroo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlal58w5mqwe738e1yo2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlal58w5mqwe738e1yo2.png" alt=" " width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From here I created two S3 buckets. One would hold data for my 'Blue' deployment and one for the 'Green' deployment. I called the buckets myapp.sql.blue and myapp.sql.green.&lt;/p&gt;

&lt;p&gt;In Glue I created a database called MyApp just to provide a logical separation between this and any other databases I may have in the same account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgmn9rx2283tvk0y1p1c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgmn9rx2283tvk0y1p1c.png" alt=" " width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I downloaded two sets of data from Mockaroo and uploaded a file to each of the S3 buckets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dqropsjc53sdgknrdhb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dqropsjc53sdgknrdhb.png" alt=" " width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I then created a Glue Crawler for each bucket. Here is an example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcd6qion0vqpyoonpq8bo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcd6qion0vqpyoonpq8bo.png" alt=" " width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Running both crawlers populates the Glue data catalog with two tables.&lt;/p&gt;
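If you would rather not run a crawler, the equivalent catalog entry can be declared by hand with an external table in Athena. The columns below are hypothetical stand-ins for whatever your generated Mockaroo schema contains.

```sql
-- Hypothetical columns; match these to your generated CSV.
CREATE EXTERNAL TABLE myapp_sql_blue (
  id INT,
  first_name STRING,
  last_name STRING,
  email STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://myapp.sql.blue/'
TBLPROPERTIES ('skip.header.line.count' = '1');
```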

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3madp6619x14tv03ydq1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3madp6619x14tv03ydq1.png" alt=" " width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c9tuq8a0663uowj47kq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c9tuq8a0663uowj47kq.png" alt=" " width="800" height="665"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point I'm able to run SQL in Athena against each of the tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5rutzepwteaw5i7785x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5rutzepwteaw5i7785x.png" alt=" " width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Athena Blue / Green
&lt;/h2&gt;

&lt;p&gt;For the Blue / Green component I utilise a view created in Athena. Just as in RDS, views can be created over one or more tables in Athena using an SQL query. We don't need anything too complex here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create view myapp_sql as select * from myapp_sql_blue;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the data catalog the view appears alongside the tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx03f9mqitjwqmmurt0mb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx03f9mqitjwqmmurt0mb.png" alt=" " width="600" height="958"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This gives me a consistent name for my application to point to. When I want to switch over the data being used, I can simply recreate the view pointing at either the blue or green bucket's data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create or replace view myapp_sql as select * from myapp_sql_green;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;Let's create a Lambda function to test this out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import pyathena


def lambda_handler(event, context):

    connection = pyathena.connect(
      s3_staging_dir="s3://athena.myapp.work/",
      region_name="us-east-1"
    )

    cursor = connection.cursor()

    query = "SELECT * FROM myapp.myapp_sql"

    cursor.execute(query)

    results = cursor.fetchall()

    print(results)

    return {
        'statusCode': 200
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this gives the output below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4laeuf2rf782gasx9u0k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4laeuf2rf782gasx9u0k.png" alt=" " width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A note on IAM: the Lambda function will require permissions for Athena and S3. For testing purposes I attached the AmazonAthenaFullAccess and AmazonS3FullAccess managed policies. In production you should scope the permissions down to the least privilege required.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Deployment Switchover
&lt;/h2&gt;

&lt;p&gt;Let's now imagine that I have a pipeline that has loaded up my new data to my second bucket. In the pipeline I can run the following step to switch the view to the latest data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export TABLE_SUFFIX=green
aws athena start-query-execution --query-string "create or replace view myapp_sql as select * from myapp_sql_$TABLE_SUFFIX" --result-configuration "OutputLocation=s3://athena.myapp.work" --query-execution-context "Database=myapp"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Considerations
&lt;/h2&gt;

&lt;p&gt;This setup should work well for simple SQL-based access to data where volumes are not too high. You can optimise queries further within Athena by using columnar data formats such as Parquet.&lt;/p&gt;

&lt;p&gt;Storage costs, assuming the S3 standard tier, are ~$0.023 per GB-month. Querying via Athena costs $5.00 per TB of data scanned. We only pay when we run a query, unlike RDS, which we pay for even when we are not running SQL.&lt;/p&gt;
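Plugging the post's ~200 MB dataset into those prices gives a feel for the numbers. This is a rough sketch: prices vary by region, and Athena bills a 10 MB minimum per query.

```python
# Rough cost sketch for a ~200 MB dataset (prices as quoted above).
storage_gb = 0.2                        # ~200 MB in S3 standard
storage_per_month = storage_gb * 0.023  # USD 0.023 per GB-month

scanned_tb = 0.2 / 1000                 # one full-table scan, in TB
cost_per_scan = scanned_tb * 5.00       # USD 5.00 per TB scanned

print(round(storage_per_month, 4))      # prints 0.0046
print(round(cost_per_scan, 4))          # prints 0.001
```

Even at hundreds of queries a day, this stays far below the cost of an always-on RDS instance.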

&lt;p&gt;As long as the access characteristics of your application are a match for the performance of the AWS services used, S3-based SQL access via Athena is a tough one to beat for those looking to be &lt;a href="http://thefrugalarchitect.com" rel="noopener noreferrer"&gt;The Frugal Architect&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>database</category>
      <category>sql</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Billing for SaaS with EMF and CloudWatch Metric Streams</title>
      <dc:creator>Matt Houghton</dc:creator>
      <pubDate>Fri, 08 Mar 2024 16:24:38 +0000</pubDate>
      <link>https://dev.to/aws-builders/billing-for-saas-with-emf-and-cloudwatch-metric-streams-4i6p</link>
      <guid>https://dev.to/aws-builders/billing-for-saas-with-emf-and-cloudwatch-metric-streams-4i6p</guid>
      <description>&lt;p&gt;In this post I'm looking at how Software as a Service (SaaS) providers running on AWS can use a few AWS Services to build out a mechanism for collecting billing/metering metrics from their software and process them in order to bill a customer based on usage.&lt;/p&gt;

&lt;p&gt;The main services I will cover are use of AWS CloudWatch embedded metric format (EMF) together with AWS CloudWatch Metric Streams.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is EMF?
&lt;/h2&gt;

&lt;p&gt;The CloudWatch embedded metric format allows you to generate custom metrics asynchronously in the form of logs written to CloudWatch Logs. You can embed custom metrics alongside detailed log event data, and CloudWatch automatically extracts the custom metrics so that you can visualize and alarm on them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is CloudWatch Metric Streams?
&lt;/h2&gt;

&lt;p&gt;You can use metric streams to continually stream CloudWatch metrics to a destination of your choice, with near-real-time delivery and low latency. Supported destinations include AWS destinations such as Amazon Simple Storage Service and several third-party service provider destinations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using EMF in your Application
&lt;/h2&gt;

&lt;p&gt;Imagine a sample Python application returning "hello world" to simulate a successful call. Each call to the application is captured for billing purposes using EMF. Lambda &lt;a href="https://docs.powertools.aws.dev/lambda/python/latest/" rel="noopener noreferrer"&gt;Powertools&lt;/a&gt; is used to reduce the amount of code we need to write.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;metrics.add_metric(name="SuccessfulGet", unit=MetricUnit.Count, value=1)
metrics.add_dimension(name="Customer", value="MattHoughton")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These two lines output the required billing metrics.&lt;/p&gt;

&lt;p&gt;The SuccessfulGet metric name can be customised for your application. The value should be a sensible identifier for the chargeable action. For example, in the world of insurance you may have actions such as CreatePolicy, CreateQuote, UpdateCar etc.&lt;/p&gt;

&lt;p&gt;On the Lambda function configuration the following environment variables also need to be set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POWERTOOLS_SERVICE_NAME: SuggestTheNameOfYourSoftware
POWERTOOLS_METRICS_NAMESPACE: SuggestSomethingLikeBilling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the sample Lambda function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
from aws_lambda_powertools import Metrics
from aws_lambda_powertools.metrics import MetricUnit

metrics = Metrics()

@metrics.log_metrics
def lambda_handler(event, context):

    #do something of value

    metrics.add_metric(name="CreatePolicy", unit=MetricUnit.Count, value=1)
    metrics.add_dimension(name="Customer", value="MattHoughton")  # example only - don't hard-code this; source it from the request payload

    return {
        "statusCode": 200,
        "body": json.dumps({
            "message": "hello world"
        }),
    }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Testing the function in the console you should get this response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "statusCode": 200,
  "body": "{\"message\": \"hello world\"}"
}

START RequestId: xxxx Version: $LATEST
{
  "_aws": {
    "Timestamp": 1709911806737,
    "CloudWatchMetrics": [
      {
        "Namespace": "DemoBilling",
        "Dimensions": [
          [
            "Customer",
            "service"
          ]
        ],
        "Metrics": [
          {
            "Name": "CreatePolicy",
            "Unit": "Count"
          }
        ]
      }
    ]
  },
  "Customer": "MattHoughton",
  "service": "DemoProductName",
  "CreatePolicy": [
    1
  ]
}
END RequestId: xxxx
REPORT RequestId: xxxx  Duration: 1.42 ms   Billed Duration: 2 ms   Memory Size: 128 MB Max Memory Used: 37 MB  Init Duration: 177.72 ms    
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the Lambda is executed the billing metrics get stored in CloudWatch Logs and are visible in CloudWatch Metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg71p23otoflsurx3qpzr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg71p23otoflsurx3qpzr.png" alt=" " width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Costs for EMF are based on CloudWatch Logs ingestion, which in EU-WEST-1 is $0.57 per GB. When I was testing with the example 624-byte payload generated by Powertools, the costs came out as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each metric above stored costs $0.0000003&lt;/li&gt;
&lt;li&gt;One million metrics stored costs: $0.33&lt;/li&gt;
&lt;li&gt;Ten million metrics stored costs: $3.31&lt;/li&gt;
&lt;/ul&gt;
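Those figures can be reproduced directly from the 624-byte payload and the $0.57 price, assuming CloudWatch bills per binary GB ingested:

```python
GB = 1024 ** 3            # assuming billing per binary GB ingested
payload_bytes = 624       # example EMF payload size from above
price_per_gb = 0.57       # CloudWatch Logs ingestion, EU-WEST-1

def ingestion_cost(metric_count):
    """USD cost of ingesting metric_count EMF payloads."""
    return metric_count * payload_bytes / GB * price_per_gb

print(round(ingestion_cost(1), 7))           # prints 3e-07
print(round(ingestion_cost(1_000_000), 2))   # prints 0.33
print(round(ingestion_cost(10_000_000), 2))  # prints 3.31
```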

&lt;h2&gt;
  
  
  Collecting and Processing the Billing Metrics
&lt;/h2&gt;

&lt;p&gt;To pull out all of the EMF metrics relating to billing, we will set up a metric stream to send them to an S3 bucket.&lt;/p&gt;

&lt;p&gt;Under CloudWatch in the console, select Metric Streams and create a metric stream. We will walk through the Quick setup for S3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhmn4i84sfk3yilo8l5h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhmn4i84sfk3yilo8l5h.png" alt=" " width="800" height="716"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under metrics to be streamed limit this to only the metrics related to billing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsta6uk5wxm06ignmxlg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsta6uk5wxm06ignmxlg.png" alt=" " width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the metric stream that is created you will see details for the other components created for you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiw2wr5h265096p7n10o4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiw2wr5h265096p7n10o4.png" alt=" " width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Data Firehose&lt;/li&gt;
&lt;li&gt;IAM Roles&lt;/li&gt;
&lt;li&gt;S3 Bucket&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you run the Lambda function a few more times and then view the Data Firehose, you will see the metrics being delivered.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsa4iarx1r59a37uv8r8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsa4iarx1r59a37uv8r8.png" alt=" " width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now if you look in the S3 bucket you will find objects created. By default, they are partitioned by year/month/day/hour.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz995pciteil253fa4rcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz995pciteil253fa4rcg.png" alt=" " width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is sample content published to S3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"metric_stream_name":"DemoBillingMetricStream","account_id":"xxxx","region":"us-east-1","namespace":"DemoBilling","metric_name":"CreatePolicy","dimensions":{"Customer":"MattHoughton","service":"DemoProductName"},"timestamp":1709913900000,"value":{"max":1.0,"min":1.0,"sum":9.0,"count":9.0},"unit":"Count"}
{"metric_stream_name":"DemoBillingMetricStream","account_id":"xxxx","region":"us-east-1","namespace":"DemoBilling","metric_name":"CreatePolicy","dimensions":{"Customer":"MattHoughton","service":"DemoProductName"},"timestamp":1709913960000,"value":{"max":1.0,"min":1.0,"sum":30.0,"count":30.0},"unit":"Count"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
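&lt;p&gt;Because each delivered object is newline-delimited JSON like the sample above, downstream code can parse it with nothing more than the standard library. A minimal sketch:&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

# Two records copied from the sample S3 object above (account id elided as in the post)
sample = """\
{"metric_stream_name":"DemoBillingMetricStream","account_id":"xxxx","region":"us-east-1","namespace":"DemoBilling","metric_name":"CreatePolicy","dimensions":{"Customer":"MattHoughton","service":"DemoProductName"},"timestamp":1709913900000,"value":{"max":1.0,"min":1.0,"sum":9.0,"count":9.0},"unit":"Count"}
{"metric_stream_name":"DemoBillingMetricStream","account_id":"xxxx","region":"us-east-1","namespace":"DemoBilling","metric_name":"CreatePolicy","dimensions":{"Customer":"MattHoughton","service":"DemoProductName"},"timestamp":1709913960000,"value":{"max":1.0,"min":1.0,"sum":30.0,"count":30.0},"unit":"Count"}
"""

records = [json.loads(line) for line in sample.splitlines() if line.strip()]
total_calls = sum(r["value"]["sum"] for r in records)
first_seen = datetime.fromtimestamp(records[0]["timestamp"] / 1000, tz=timezone.utc)
print(len(records), total_calls, first_seen.isoformat())
# 2 39.0 2024-03-08T16:05:00+00:00
```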



&lt;h2&gt;
  
  
  Further Processing
&lt;/h2&gt;

&lt;p&gt;From this point we have a lot of flexibility in how we can choose to process this data.&lt;/p&gt;

&lt;p&gt;We can trigger a Lambda function that sends these metric payloads to an accounting / invoicing system.&lt;/p&gt;
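&lt;p&gt;As a rough sketch of that idea (the invoice schema, unit price and downstream call here are all hypothetical, not part of the metric stream itself), such a Lambda could map each metric record to a billing line item:&lt;/p&gt;

```python
import json

UNIT_PRICE = 0.01  # hypothetical price per billed API call

def metric_to_invoice_line(record: dict) -> dict:
    """Map one streamed metric record to a billing line item (hypothetical schema)."""
    quantity = record["value"]["sum"]
    return {
        "customer": record["dimensions"]["Customer"],
        "service": record["dimensions"]["service"],
        "operation": record["metric_name"],
        "quantity": quantity,
        "amount": round(quantity * UNIT_PRICE, 2),
    }

def lambda_handler(event, context):
    # A real handler would fetch the Firehose-delivered object from S3 first;
    # this sketch assumes the newline-delimited JSON body is already in the event.
    lines = [metric_to_invoice_line(json.loads(l))
             for l in event["body"].splitlines() if l.strip()]
    # send_to_invoicing(lines)  # hypothetical downstream call
    return {"lines": len(lines)}
```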

&lt;p&gt;We can also continue to use AWS Services. As the data is in S3 we can easily add this to a Glue data catalog and query it using Athena. We could even start to build dashboards and reports using QuickSight.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>sass</category>
      <category>data</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Using Snowflake data hosted in GCP with AWS Glue</title>
      <dc:creator>Matt Houghton</dc:creator>
      <pubDate>Mon, 29 Jan 2024 18:37:51 +0000</pubDate>
      <link>https://dev.to/aws-builders/using-snowflake-data-hosted-in-gcp-with-aws-glue-3eb0</link>
      <guid>https://dev.to/aws-builders/using-snowflake-data-hosted-in-gcp-with-aws-glue-3eb0</guid>
      <description>&lt;p&gt;This post covers a use case of accessing data held in a Snowflake database hosted in GCP within an AWS Glue ETL job.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Snowflake?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.snowflake.com" rel="noopener noreferrer"&gt;Snowflake&lt;/a&gt; is a cloud-based data warehousing platform that provides a fully managed and scalable solution for storing and analyzing large volumes of data. It is not a traditional relational database but rather a data warehouse as a service. Snowflake is designed to work with cloud platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).&lt;/p&gt;

&lt;h2&gt;
  
  
  What is AWS Glue?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/glue" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt; is a fully managed extract, transform, and load (ETL) service provided by Amazon Web Services (AWS). It is designed to make it easy for users to prepare and load their data for analysis. AWS Glue simplifies the process of building and managing ETL workflows by providing a serverless environment for running ETL jobs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Snowflake Prerequisites
&lt;/h2&gt;

&lt;p&gt;As I wasn't a user of Snowflake already I &lt;a href="https://signup.snowflake.com/" rel="noopener noreferrer"&gt;signed up for a free trial&lt;/a&gt; in order to work on this use case.  The process was simple and I didn't need to provide any form of payment to try it out. Well done Snowflake!&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Data
&lt;/h2&gt;

&lt;p&gt;I generated some test data to load into Snowflake using &lt;a href="https://mockaroo.com" rel="noopener noreferrer"&gt;Mockaroo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndzn7rhbnhbjig8bjbny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndzn7rhbnhbjig8bjbny.png" alt=" " width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Snowflake Setup
&lt;/h2&gt;

&lt;p&gt;Within Snowflake I created a &lt;a href="https://docs.snowflake.com/en/sql-reference/ddl-database" rel="noopener noreferrer"&gt;database, schema&lt;/a&gt; and a &lt;a href="https://docs.snowflake.com/en/user-guide/warehouses-overview" rel="noopener noreferrer"&gt;warehouse&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4awpt2spgjmbugqty089.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4awpt2spgjmbugqty089.png" alt=" " width="800" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyj4f1x86kjcsie37yt4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyj4f1x86kjcsie37yt4.png" alt=" " width="800" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2oginair0h516pz2g677.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2oginair0h516pz2g677.png" alt=" " width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With these things in place I loaded the data from my Mockaroo generated JSON file into a new table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu88j1nqytgol5j1exf5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu88j1nqytgol5j1exf5n.png" alt=" " width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v6otmg8enoh0ctiqp5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v6otmg8enoh0ctiqp5t.png" alt=" " width="800" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnniwjn0mu8av0jenaw14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnniwjn0mu8av0jenaw14.png" alt=" " width="598" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data can now be queried.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrja9xrvjf5ab2r9wlul.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrja9xrvjf5ab2r9wlul.png" alt=" " width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next I created the user and role that Glue will use to connect to Snowflake.&lt;/p&gt;

&lt;p&gt;Create a role called car_sales.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fci41nk648u3azxn347o3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fci41nk648u3azxn347o3.png" alt=" " width="397" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Execute the SQL below to assign privileges to the new role.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;grant select on table matt.matt.car_sales to role car_sales;

grant usage on database matt to role car_sales;

grant usage on schema matt to role car_sales;

grant usage on warehouse matt to role car_sales;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
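&lt;p&gt;If you need to repeat this for several tables or roles, the same grants can be generated programmatically. A small sketch (the helper name is mine; it simply mirrors the statements above):&lt;/p&gt;

```python
def snowflake_grants(role: str, database: str, schema: str,
                     warehouse: str, table: str) -> list[str]:
    """Generate the read-only grants a Glue service user needs (mirrors the SQL above)."""
    return [
        f"grant select on table {database}.{schema}.{table} to role {role};",
        f"grant usage on database {database} to role {role};",
        f"grant usage on schema {schema} to role {role};",
        f"grant usage on warehouse {warehouse} to role {role};",
    ]

for statement in snowflake_grants("car_sales", "matt", "matt", "matt", "car_sales"):
    print(statement)
```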



&lt;p&gt;Create a user for Glue to connect as - ensure you set the default warehouse and assign the car_sales role.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlodnsk4xygvuzai4gvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlodnsk4xygvuzai4gvs.png" alt=" " width="551" height="839"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Glue Setup
&lt;/h2&gt;

&lt;p&gt;Create an IAM role for the Glue job to execute. Note Glue will need to be able to access Secrets Manager.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhk1ezyzwk7cy6kwdncln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhk1ezyzwk7cy6kwdncln.png" alt=" " width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Secrets Manager create a secret.&lt;/p&gt;

&lt;p&gt;Select "Other type of secret". For the keys, use sfUser, sfPassword and sfWarehouse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l43s27mxd6iopsibg9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6l43s27mxd6iopsibg9w.png" alt=" " width="800" height="838"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdflt3upqf9m45zvuq2ts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdflt3upqf9m45zvuq2ts.png" alt=" " width="800" height="893"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now in Glue, create a data connection to Snowflake.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbucd8pi4flp383qtmee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhbucd8pi4flp383qtmee.png" alt=" " width="800" height="630"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw70na04o1baq1equcsny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw70na04o1baq1equcsny.png" alt=" " width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1t2i2u8yy0lsnpawv6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1t2i2u8yy0lsnpawv6i.png" alt=" " width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can find your Snowflake URL in Snowflake by &lt;br&gt;
selecting Admin &amp;gt; Accounts &amp;gt; . . . Manage URLs&lt;/p&gt;

&lt;p&gt;For the AWS Secret, select the one you created earlier.&lt;/p&gt;

&lt;p&gt;Give your connection a sensible name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftf929xeexydvgm1b6vf3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftf929xeexydvgm1b6vf3.png" alt=" " width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Extract Snowflake Data with Glue ETL
&lt;/h2&gt;

&lt;p&gt;Create a new Visual ETL job in Glue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro931eanaop3xm2x08tp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro931eanaop3xm2x08tp.png" alt=" " width="800" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From Sources, select Snowflake.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jc99d0n6jj4pft7g8s2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jc99d0n6jj4pft7g8s2.png" alt=" " width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under Data source properties, complete the details as shown.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s4qp25esaunysllhbk4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s4qp25esaunysllhbk4.png" alt=" " width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Test that the connection is working by selecting the Snowflake connection in the visual editor. This opens the Data preview window. Select the IAM role created earlier; a data preview will start and display a sample of the data from Snowflake. This will take a few minutes to run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdumq3bvxdx5sa5s40ru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdumq3bvxdx5sa5s40ru.png" alt=" " width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Within Activity in Snowflake, you will see that the query has been executed by the GLUE user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtsxs3cca3ie0akwjdax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtsxs3cca3ie0akwjdax.png" alt=" " width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can continue to complete your ETL job within Glue. For example, if I only want cars made by Porsche, I could add an SQL transformation step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z03ld2xiijnzsdxgz4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z03ld2xiijnzsdxgz4d.png" alt=" " width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Selecting the Script tab provides the Glue ETL code for the job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame


def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -&amp;gt; DynamicFrame:
    for alias, frame in mapping.items():
        frame.toDF().createOrReplaceTempView(alias)
    result = spark.sql(query)
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)


args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node Snowflake
Snowflake_node1706530098901 = glueContext.create_dynamic_frame.from_options(
    connection_type="snowflake",
    connection_options={
        "autopushdown": "on",
        "dbtable": "car_sales",
        "connectionName": "snowflake_glue_connection",
        "sfDatabase": "matt",
        "sfSchema": "matt",
    },
    transformation_ctx="Snowflake_node1706530098901",
)

# Script generated for node SQL Query
SqlQuery0 = """
select * from myDataSource
where car_make = 'Porsche'
"""
SQLQuery_node1706531170324 = sparkSqlQuery(
    glueContext,
    query=SqlQuery0,
    mapping={"myDataSource": Snowflake_node1706530098901},
    transformation_ctx="SQLQuery_node1706531170324",
)

job.commit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The cloud provider hosting Snowflake makes little difference to how easily Snowflake integrates as a data source in Glue ETL jobs.&lt;/p&gt;

&lt;p&gt;The low code / no code interface for building ETL jobs with Glue makes it simple to gather data from many sources once initial IAM, secrets and connections are in place.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>glue</category>
      <category>snowflake</category>
    </item>
    <item>
      <title>Scaling ML Education With AWS DeepRacer</title>
      <dc:creator>Matt Houghton</dc:creator>
      <pubDate>Wed, 30 Aug 2023 13:41:06 +0000</pubDate>
      <link>https://dev.to/aws-builders/scaling-ml-education-with-aws-deepracer-3kbn</link>
      <guid>https://dev.to/aws-builders/scaling-ml-education-with-aws-deepracer-3kbn</guid>
      <description>&lt;p&gt;In this blog post I will outline how we used DeepRacer to scale Machine Learning (ML) education across our organisation.&lt;/p&gt;

&lt;p&gt;The blog will cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;How to acquire the cars, track and build the associated event space.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The various roles on our Pit Crew and how each contributes to the success of the event.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How AWS can support you in running your event along with links to training resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The timeline for events leading up to our championship race and how to keep your team engaged.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The costs for the event and the results we saw for the investment.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/deepracer/" rel="noopener noreferrer"&gt;AWS DeepRacer&lt;/a&gt; is a service offered by Amazon Web Services (AWS) that combines machine learning, cloud computing, and robotics to provide a platform for learning and experimenting with reinforcement learning. &lt;/p&gt;

&lt;p&gt;Reinforcement learning is a type of machine learning where agents learn to make decisions by interacting with an environment and receiving feedback in the form of rewards.&lt;/p&gt;

&lt;p&gt;In the context of DeepRacer, the environment is a physical or virtual racetrack, and the agent is an autonomous racing car. Participants can train and develop the car's racing skills using reinforcement learning techniques, allowing the car to learn how to navigate the track and optimize its racing performance over time.&lt;/p&gt;

&lt;p&gt;Overall, AWS DeepRacer is an educational and hands-on way for people to dive into reinforcement learning and gain practical experience in training AI models to perform specific tasks, such as autonomous racing.&lt;/p&gt;

&lt;p&gt;Between April 2023 and July 2023, I put together a team to organise and run a DeepRacer event at &lt;a href="https://www.cdl.co.uk" rel="noopener noreferrer"&gt;CDL&lt;/a&gt; with the aim to scale Machine Learning education across our organisation.&lt;/p&gt;

&lt;h1&gt;
  
  
  Equipment Required
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Track
&lt;/h2&gt;

&lt;p&gt;We used the 2018 DeepRacer track. This is the smallest track available, and it provided us with some flexibility on location. The track is now referred to as the A-Z Speedway.&lt;/p&gt;

&lt;p&gt;The spec for the track is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track size: 26’ x 17’ (7.9248 m x 5.1816 m)&lt;/li&gt;
&lt;/ul&gt;
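&lt;p&gt;If you are checking whether a room is big enough, the imperial-to-metric conversion is easy to verify:&lt;/p&gt;

```python
FT_TO_M = 0.3048  # exact definition of the international foot

track_width_m = 26 * FT_TO_M  # 26 feet
track_depth_m = 17 * FT_TO_M  # 17 feet
print(round(track_width_m, 4), round(track_depth_m, 4))  # 7.9248 5.1816
```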

&lt;p&gt;The track should have the following colours:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Field ("the green") = PMS 3395C&lt;/li&gt;
&lt;li&gt;Road surface and AWS logo = PMS 432C&lt;/li&gt;
&lt;li&gt;Dotted center line ("yellow") = PMS 137C&lt;/li&gt;
&lt;li&gt;Track boundaries ("white side lines") = CMYK 0-0-2-0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We were advised that carpet-based tracks have fewer issues with light reflection than their vinyl alternatives.&lt;/p&gt;

&lt;p&gt;The spec in Illustrator Format for the track is available to download from the &lt;a href="https://docs.aws.amazon.com/deepracer/latest/developerguide/samples/deepracer-A-to-Z-speedway-basic.ai.zip" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We ordered the track in July 2022 and were given a figure of £1314.02 for us to get this printed to carpet using &lt;a href="https://www.bannerworld.co.uk/product/custom-printed-carpet/" rel="noopener noreferrer"&gt;bannerworld.co.uk&lt;/a&gt;. It came to us in three sections. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Barrier
&lt;/h2&gt;

&lt;p&gt;After looking at this &lt;a href="https://www.amazon.com/dp/B0BVB14J6T/ref=s9_acsd_al_bw_c2_x_1_i?pf_rd_m=ATVPDKIKX0DER&amp;amp;pf_rd_s=merchandised-search-2&amp;amp;pf_rd_r=N4YSSF0DK2N7F0NE83ZD&amp;amp;pf_rd_t=101&amp;amp;pf_rd_p=d3e8afec-d1ed-4329-a3d1-3f70b155cc7f&amp;amp;pf_rd_i=32957528011" rel="noopener noreferrer"&gt;A to Z Speedway Printed Wall Barrier for AWS DeepRacer Race Track&lt;/a&gt;, I felt the cost and the issues around delivery to the UK were going to be problematic, so I decided to try and build my own.&lt;/p&gt;

&lt;p&gt;Spoiler alert: although the barrier worked, it is probably the main thing I would change for future events.&lt;/p&gt;

&lt;p&gt;The materials used were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Green plastic ‘Corex’ board, 4mm thick, A1 in size. I ordered 30 sheets from Vesey Gallery via their &lt;a href="https://www.ebay.co.uk/itm/152138450975?hash=item236c28881f:g:VNoAAOSwjVVVsAHX" rel="noopener noreferrer"&gt;eBay shop&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To join the corex board together I used plastic Shower/Bath screen joiner/seals. They came from Shower Seal UK Ltd. I ordered them from their eBay store. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8 x 2M lengths of &lt;a href="https://www.ebay.co.uk/itm/283819171966?var=586298035256" rel="noopener noreferrer"&gt;Straight Seals&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;2 x 2M lengths of &lt;a href="https://www.ebay.co.uk/itm/283819171966?var=586298035248" rel="noopener noreferrer"&gt;Corner Seals&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I cut these to the length required to fit the A1 corex plastic sheets with a small handsaw.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgusnn1sfbg3871mp6t8u.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgusnn1sfbg3871mp6t8u.JPG" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To provide additional support to the barrier I added some desk drawer units at regular intervals. The barrier has a couple of issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's too flexible in the gaps without a desk drawer. This meant that the cars could crash into it and cause the corex board to detach from the jointing strips.&lt;/li&gt;
&lt;li&gt;Where there was no gap, the car would occasionally get damaged when it hit a desk drawer at high speed. We had to fix the car shells during our event with black gaffer tape. We also had to secure the cameras with elastic bands to prevent them from coming loose.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8whxaetvp9nere6slu2.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8whxaetvp9nere6slu2.JPG" alt=" " width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The company logo and DeepRacer event branding were added by ordering large stickers from &lt;a href="https://www.stickermule.com/" rel="noopener noreferrer"&gt;Sticker Mule&lt;/a&gt;. Five copies of our logo, printed using the Wall graphics option and sized at 508 mm x 173 mm, cost a total of $63.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cars
&lt;/h2&gt;

&lt;p&gt;We ordered 2 x &lt;a href="https://www.amazon.com/dp/B081GZSJVL/ref=s9_acsd_al_bw_c2_x_1_i?pf_rd_m=ATVPDKIKX0DER&amp;amp;pf_rd_s=merchandised-search-1&amp;amp;pf_rd_r=H8EJQE5FF9K1QGXMXYJB&amp;amp;pf_rd_t=101&amp;amp;pf_rd_p=2ae26ae6-11bb-49e1-a151-d10ee4f68e61&amp;amp;pf_rd_i=32957528011" rel="noopener noreferrer"&gt;DeepRacer EVOs&lt;/a&gt;. I already owned one, so we had three cars in total to work with.&lt;/p&gt;

&lt;p&gt;Each car comes in two boxes. The original DeepRacer car and then the EVO upgrade kit.&lt;/p&gt;

&lt;p&gt;We used &lt;a href="https://www.shipito.com/en/" rel="noopener noreferrer"&gt;Shipito&lt;/a&gt; to get the cars shipped to the UK.&lt;/p&gt;

&lt;p&gt;For racing events we do not use the EVO Kit. This matches what AWS do at their events.&lt;/p&gt;

&lt;p&gt;When racing the cars, we learnt a couple of tips about battery placement. It's much easier to secure the motor battery to the top of the compute battery using the Velcro strip. This allows much faster swapping of batteries during races.&lt;/p&gt;

&lt;p&gt;The cars need to be calibrated following the instructions within the &lt;a href="https://docs.aws.amazon.com/deepracer/latest/developerguide/deepracer-set-up-vehicle-test-drive.html" rel="noopener noreferrer"&gt;DeepRacer Car console&lt;/a&gt;. In practice we found that calibration mode would sometimes not work. When this happens, we learned that the following actions are helpful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When working with the cars during calibration place them on something so the wheels don't touch the surface. The masking tape included in the car box works well for this.&lt;/li&gt;
&lt;li&gt;Cycle the power on the car.&lt;/li&gt;
&lt;li&gt;Put the car into manual mode and spin the wheels forward and then backwards multiple times.&lt;/li&gt;
&lt;li&gt;Check the compute battery; if it has two bars or fewer, replace it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Car Parts
&lt;/h2&gt;

&lt;p&gt;You will need an iPad or similar to control the car speed trackside. We had two iPad minis to ensure we always had one charged up.&lt;/p&gt;

&lt;p&gt;We ordered spare batteries for the cars:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.amazon.co.uk/dp/B08Y8YN283?psc=1&amp;amp;ref=ppx_yo2ov_dt_b_product_details" rel="noopener noreferrer"&gt;URGENEX 2S Lipo Battery 7.4v Lipo with JST Plug&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These batteries come with USB charging cables. We ordered &lt;a href="https://www.amazon.co.uk/gp/product/B07T7GLJYJ/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&amp;amp;psc=1" rel="noopener noreferrer"&gt;this charger&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.amazon.co.uk/dp/B08SFHT1VV?psc=1&amp;amp;ref=ppx_yo2ov_dt_b_product_details" rel="noopener noreferrer"&gt;Compute Battery&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For storage of batteries we ordered a number of these &lt;a href="https://www.amazon.co.uk/gp/product/B097259MRS/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&amp;amp;psc=1" rel="noopener noreferrer"&gt;protective bags&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As the cars ship with US plugs we ordered 4 of these &lt;a href="https://www.amazon.co.uk/dp/B0085IZQ4E?psc=1&amp;amp;ref=ppx_yo2ov_dt_b_product_details" rel="noopener noreferrer"&gt;Combo Travel Adaptors By SKROSS&lt;/a&gt; to allow us to keep up with our battery charging requirements.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://www.amazon.co.uk/dp/B07QB6D2Z5?psc=1&amp;amp;ref=ppx_yo2ov_dt_b_product_details" rel="noopener noreferrer"&gt;battery tester&lt;/a&gt; is also essential to check quickly that the battery is in a good state.&lt;/p&gt;

&lt;p&gt;A few tips on battery health:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure the compute battery shows more than two bars. If it doesn't, swap it out for a fresh battery.&lt;/li&gt;
&lt;li&gt;Rotate the motor batteries regularly - we wouldn't race a car unless it showed green on the DeepRacer Car console.&lt;/li&gt;
&lt;li&gt;Have a system within the Pit Crew for rotating batteries. We used a large bowl to put used batteries in and someone would pick these up and get them on charge for later use.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Timer
&lt;/h2&gt;

&lt;p&gt;So that lap times could be accurately captured, we added a pressure-sensor-based timer to the track. By the final race we had two versions of this, as a prefabricated module became available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Timer Option 1
&lt;/h3&gt;

&lt;p&gt;For the first timer I mostly followed the instructions on &lt;a href="https://github.com/davidfsmith/deepracer-timer" rel="noopener noreferrer"&gt;David Smith's GitHub Repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For the parts, I ordered the following items:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.amazon.co.uk/gp/product/B07Q1BYDS7/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&amp;amp;psc=1" rel="noopener noreferrer"&gt;Sound Microphone Sensor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.co.uk/gp/product/B07PM5PTPQ/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&amp;amp;psc=1" rel="noopener noreferrer"&gt;Thin Film Pressure Sensor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.co.uk/dp/B087X2ZHHR?psc=1&amp;amp;ref=ppx_yo2ov_dt_b_product_details" rel="noopener noreferrer"&gt;Electrical Terminal Blocks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.co.uk/dp/B074P726ZR?ref=ppx_yo2ov_dt_b_product_details&amp;amp;th=1" rel="noopener noreferrer"&gt;Jumper Wire Set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://thepihut.com/products/prototyping-wire-spool-set?variant=36945199825" rel="noopener noreferrer"&gt;Wire Spool Set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://thepihut.com/products/heat-shrink-pack?variant=27740410385" rel="noopener noreferrer"&gt;Heat Shrink Pack&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz7qsdumnl9rsqgd84lo.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz7qsdumnl9rsqgd84lo.JPG" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After removing the microphone sensor, I soldered on a terminal block so the pressure sensor cables could be screwed into place.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4n65hglbe7bsg2gc2w7.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4n65hglbe7bsg2gc2w7.JPG" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I soldered two metres of cable to each pressure sensor.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrlzuq9swni6phpful7q.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrlzuq9swni6phpful7q.JPG" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The final connection to the Raspberry Pi is shown below. I used off-the-shelf jumper cables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwiiaihlo1tthbeaf9aii.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwiiaihlo1tthbeaf9aii.JPG" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
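&lt;p&gt;For anyone curious how the sensor events become lap times, here is a minimal sketch of the timing logic. It is an illustration only; the real sensor and GPIO handling live in David Smith's repo, and the debounce window below is an assumed value.&lt;/p&gt;

```python
import time

class LapTimer:
    """Turn start/finish-line sensor triggers into lap times.

    Illustration only: the real hardware handling is in
    davidfsmith/deepracer-timer. A car crossing the line can fire
    the pressure sensor several times, so triggers inside an
    assumed debounce window are ignored.
    """

    def __init__(self, debounce_s=2.0):
        self.debounce_s = debounce_s
        self.last_cross = None
        self.laps = []

    def crossing(self, t=None):
        """Register a sensor trigger at time t (seconds).

        Returns the completed lap time, or None if this trigger
        started the first lap or was debounced."""
        if t is None:
            t = time.monotonic()
        if self.last_cross is None:
            self.last_cross = t  # first crossing starts lap one
            return None
        gap = t - self.last_cross
        if gap > self.debounce_s:
            self.laps.append(gap)
            self.last_cross = t
            return gap
        return None  # double-trigger from the same crossing

    def best_lap(self):
        return min(self.laps) if self.laps else None
```

&lt;p&gt;Feeding it timestamps of 0.0, 0.5 and 12.3 seconds records one 12.3-second lap, with the 0.5-second trigger discarded as a double hit.&lt;/p&gt;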

&lt;p&gt;To make the pressure sensor look like the finish line of a race track, I bought some thin card, double-sided tape and some &lt;a href="https://www.amazon.co.uk/dp/B0BSDYMY2L?psc=1&amp;amp;ref=ppx_yo2ov_dt_b_product_details" rel="noopener noreferrer"&gt;chequered flag&lt;/a&gt; tape.&lt;/p&gt;

&lt;p&gt;The pressure sensors were attached to the card with double-sided tape, and the chequered flag tape was added to the card to create the finish-line effect. The wires were run under the carpet track and stuck down using &lt;a href="https://www.amazon.co.uk/dp/B002TOL45K?psc=1&amp;amp;ref=ppx_yo2ov_dt_b_product_details" rel="noopener noreferrer"&gt;green gaffer tape&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fej35pfc3mdcusqmf8yga.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fej35pfc3mdcusqmf8yga.JPG" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a video of the timer being tested.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Uy3SP-BIAfg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Timer Option 2
&lt;/h3&gt;

&lt;p&gt;Shortly before our final race the &lt;a href="https://digitalracingkings.com/products/unofficial-deepracer-timer-for-raspberry-pi-3-5" rel="noopener noreferrer"&gt;Digital Racing Kings Unofficial DeepRacer Timer&lt;/a&gt; became available. We ordered one of these to replace the adapted microphone sensors. We found it much better, and it required much less calibration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfld5iuvwd19vaz03e9v.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfld5iuvwd19vaz03e9v.JPG" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Schedule of Events
&lt;/h1&gt;

&lt;p&gt;We built up to the final race on the physical track through a number of virtual and in-person events.&lt;/p&gt;

&lt;h2&gt;
  
  
  Engagement Sessions
&lt;/h2&gt;

&lt;p&gt;Our AWS account team arranged for &lt;a href="https://twitter.com/davidfsmith/" rel="noopener noreferrer"&gt;David Smith&lt;/a&gt;, a Solutions Architect within AWS ML Thought Leadership, to deliver our initial training. David is the person you will often see at AWS Summits leading the Pit Crew.&lt;/p&gt;

&lt;p&gt;David delivered a number of sessions with the CDL team.&lt;/p&gt;

&lt;p&gt;The first session covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intro to DeepRacer, Machine Learning and the reward functions. (26 Mins)&lt;/li&gt;
&lt;li&gt;First Model and Training (36 Mins)&lt;/li&gt;
&lt;li&gt;Q&amp;amp;A (21 Mins)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the Q&amp;amp;A we had a break and returned a few hours later for further Q&amp;amp;A once everyone had got the chance to build a model.&lt;/p&gt;

&lt;p&gt;This session attracted 105 attendees. After the first session David shared a number of follow-on resources to help our team continue their DeepRacer learning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=43zqI0n4D7A" rel="noopener noreferrer"&gt;DeepRacer Video&lt;/a&gt; to learn more about Deep Racer.
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/deepracer/getting-started/?nc=sn&amp;amp;loc=6" rel="noopener noreferrer"&gt;Getting started With DeepRacer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/deepracer/latest/developerguide/deepracer-get-started-training-model.html" rel="noopener noreferrer"&gt;Train Your First Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/deepracer/latest/developerguide/deepracer-reward-function-input.html" rel="noopener noreferrer"&gt;Input parameters available&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/deepracer/racing-tips/" rel="noopener noreferrer"&gt;Racing tips&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://deepracing.io/#home" rel="noopener noreferrer"&gt;Deepracer community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/oscarYCL/deepracer-waypoints-workshop/blob/main/Waypoint%20Map/reinvent_base(new_2018).png" rel="noopener noreferrer"&gt;Waypoints&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws-deepracer-community.github.io/deepracer-for-cloud/" rel="noopener noreferrer"&gt;DeepRacer for cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://catalog.workshops.aws/deepracer-200l/en-US" rel="noopener noreferrer"&gt;DeepRacer workshop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;PRO TIP: Try setting the Discount Factor hyperparameter to 0.95.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://bit.ly/drl200console" rel="noopener noreferrer"&gt;Workshop console dive&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://bit.ly/drl200ws" rel="noopener noreferrer"&gt;Workshop pre-recorded session&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
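&lt;p&gt;To make the resources above concrete, here is a minimal reward function in the shape the DeepRacer console expects: a Python function that receives the documented params dictionary and returns a float. The centreline bands below are the usual starting-point example, not a tuned racing model.&lt;/p&gt;

```python
def reward_function(params):
    """Minimal centreline-following reward for AWS DeepRacer.

    params is the dictionary the DeepRacer service passes in; only
    three of its documented keys are used here.
    """
    if not params['all_wheels_on_track']:
        return 1e-3  # near-zero reward for leaving the track

    # 0.0 at the centreline, 0.5 at the track edge.
    marker = params['distance_from_center'] / params['track_width']

    if marker > 0.5:
        return 1e-3  # centre of the car is off the track
    if marker > 0.25:
        return 0.1
    if marker > 0.1:
        return 0.5
    return 1.0  # hugging the centreline
```

&lt;p&gt;The service evaluates this for every step of every training episode, so even small changes to the bands (or to hyperparameters such as the discount factor) can noticeably change behaviour.&lt;/p&gt;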

&lt;p&gt;About a month after the initial engagement session we had a couple of check-in sessions with AWS. The first was a 'pro' session for racers who wanted to do a deep dive on particular topics. The second session was for people who were still getting started with DeepRacer and wanted to ask more basic questions.&lt;/p&gt;

&lt;p&gt;About a month before the final race, we had our track in place and a timer working so we started to offer weekly check-in sessions trackside. These sessions allowed anyone to bring their model along and a member of our Pit Crew would help the team race. We also offered a remote service where people could send a message with their model, and we would record the car going round the track.&lt;/p&gt;

&lt;p&gt;This time on track proved to be one of the most popular sessions, and we were often oversubscribed. Having the chance to tinker with the track, play about and crash the cars also helped train our Pit Crew for the final race.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Engagement Tips
&lt;/h2&gt;

&lt;p&gt;In between the engagement sessions we tried a number of things to keep people interested and build the momentum towards the final race.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Blog Posts: One of the Pit Crew wrote about their experience of training models and offered hints and tips.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Race Chat: We used MS Teams to host a chat with the racers. This helped build up a healthy competition between the teams, offer support, answer questions and debug issues. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We took our models to the in-person London Meetup to test them out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We took our models to the AWS London Summit to try them out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We also used the chat to post links to the &lt;a href="https://www.twitch.tv/thetrackboss" rel="noopener noreferrer"&gt;live feeds&lt;/a&gt; of the various AWS Deep Racer events that were happening at the summits across the world.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Virtual League. Within the DeepRacer section of the AWS console we created a CDL league. This allowed our racers to submit their models to the virtual league over the three months build up to the final race.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8efpwj9i5snr4ffcibqt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8efpwj9i5snr4ffcibqt.png" alt=" " width="800" height="621"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Final
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Pit Crew
&lt;/h2&gt;

&lt;p&gt;The Pit Crew are the hardest working people during the DeepRacer event. Do not underestimate how vital they are to keep the event running smoothly. Our event had 26 teams / 72 people taking part.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track Boss: Monitoring the car as it races on the track and putting it back on the track if it crashes. This is hard work. Rotate the person after each race.&lt;/li&gt;
&lt;li&gt;Mechanics: Charging batteries and swapping them out within the cars and doing the calibration. Fixing the cars if any parts fail.&lt;/li&gt;
&lt;li&gt;Lap Timer: Working with the track boss to capture complete laps and disqualify any invalid laps.&lt;/li&gt;
&lt;li&gt;Team Liaison: Finding the next team to race, explaining how to use the DeepRacer Car console via the iPad and getting them ready to race.&lt;/li&gt;
&lt;li&gt;Commentary: Helping the event flow for those watching in person or via live stream by making announcements, doing interviews or general comments on the action.&lt;/li&gt;
&lt;li&gt;Merchandise: Dealing with the free swag.&lt;/li&gt;
&lt;li&gt;Audio Visual: Connecting AV equipment, sound checking and checking on the live stream.&lt;/li&gt;
&lt;li&gt;Official Photographer: Capturing the joy / pain of the winners / losers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Racing Structure
&lt;/h2&gt;

&lt;p&gt;Our Pit Crew also established the running order for the event. Each race is three minutes, within which each team can complete as many laps as they are able to. The fastest of their laps establishes their position on the leaderboard.&lt;/p&gt;
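&lt;p&gt;The ranking rule above (best single lap across a team's session) can be sketched as follows; the team names and lap times are made up for illustration.&lt;/p&gt;

```python
def leaderboard(session_laps):
    """Rank teams by their single fastest lap (seconds).

    session_laps maps team name to the lap times completed in that
    team's three-minute session; teams with no complete laps are
    left off the board.
    """
    best = {team: min(laps) for team, laps in session_laps.items() if laps}
    return sorted(best.items(), key=lambda entry: entry[1])

# Hypothetical session results:
laps = {
    'Team A': [11.2, 10.4, 10.9],
    'Team B': [10.1, 12.5],
    'Team C': [],  # crashed out without a complete lap
}
```

&lt;p&gt;With this sample data, Team B tops the board on 10.1 seconds despite completing fewer laps than Team A.&lt;/p&gt;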

&lt;p&gt;As we knew the number of teams taking part, we did some estimates of how long it would take to put the car on the track, get each team in position and complete three minutes of racing. We validated with AWS that 5-7 minutes per team was about right. This did work out correctly for us, but the Pit Crew didn't get any time to relax. We're told this is normal for AWS-run events too, and there is an unwritten rule that everyone on the Pit Crew should be active at all times.&lt;/p&gt;

&lt;p&gt;We held two racing sessions with an hour break for lunch. Starting at 10AM we got all teams to complete two three-minute races by around 3:30PM.&lt;/p&gt;
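&lt;p&gt;As a rough sanity check on the schedule, the lower end of that 5-7 minute estimate reproduces our actual day. The numbers below are back-of-envelope arithmetic, not an official AWS formula.&lt;/p&gt;

```python
teams = 26
races_per_team = 2
minutes_per_slot = 5  # low end of the 5-7 minute estimate validated with AWS

racing_minutes = teams * races_per_team * minutes_per_slot  # 260 minutes
lunch_minutes = 60
total_hours = (racing_minutes + lunch_minutes) / 60  # roughly 5.3 hours

# A 10AM start plus roughly 5.3 hours lands close to our ~3:30PM finish.
```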

&lt;p&gt;There was some debate amongst teams on whether they should try two different models. Whilst we didn't reach any firm conclusions, we did notice the following real-world effects.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Environmental factors can impact model performance. We did some prior testing with lighting for example and went with what we found to be most reliable.&lt;/li&gt;
&lt;li&gt;Healthy competition amongst teams led to a feeling that certain cars were more reliable than others.&lt;/li&gt;
&lt;li&gt;We tried to support teams by advising them to start their models at a slower speed and experiment with the speed adjustment via the iPad over the three minutes.&lt;/li&gt;
&lt;li&gt;Once a speed benchmark is established further adjustments can be made around corners and straight sections of the track.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Event Space
&lt;/h2&gt;

&lt;p&gt;In addition to the track, we added screens so people could keep an eye on our leader board and watch the racing. We had two cameras on the track and connected these back to a &lt;a href="https://www.blackmagicdesign.com/products/atemmini" rel="noopener noreferrer"&gt;Blackmagic ATEM Mini Extreme&lt;/a&gt;. This allowed us to record all camera and graphics feeds for editing after the event and also live stream via MS Teams to the whole company.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F594gy2rjtf522ywi3d2d.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F594gy2rjtf522ywi3d2d.JPG" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwfkz5qm6mxe8yo557hn.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwfkz5qm6mxe8yo557hn.JPG" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/EQwkaKl5Ino"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;In preparation for our event I attended a couple of DeepRacer events at AWS Summits and also through the &lt;a href="https://www.meetup.com/aws-deepracer-community-uk-ireland/" rel="noopener noreferrer"&gt;UK DeepRacer Meetup&lt;/a&gt;. The latter was hosted by JP Morgan who have been doing events for a long time. Having in person conversations with the community members, JP Morgan and AWS was really helpful to me.&lt;/p&gt;

&lt;h2&gt;
  
  
  DeepRacer Event Manager (DREM)
&lt;/h2&gt;

&lt;p&gt;DREM is an AWS built application that is the missing piece for any DeepRacer event. If you have ever raced at an AWS Summit this is the system in use at those events.&lt;/p&gt;

&lt;p&gt;At the time of writing, it is working its way towards a public beta, but we were fortunate enough to get our hands on it via our AWS account team. DREM made the event run much more smoothly and has some awesome features that we could not find in the public AWS or open-source tools.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allows racers to upload their models via login to their DREM account.&lt;/li&gt;
&lt;li&gt;Car management, which includes loading models to the cars. Effectively you have a fleet of cars registered in DREM and can send the model to any car. Doing this manually via separate logins to each car console now feels painful after using DREM.&lt;/li&gt;
&lt;li&gt;Integrates with the lap timer device recording all key lap data.&lt;/li&gt;
&lt;li&gt;Uses lap timer data to create a live scoreboard for the event.&lt;/li&gt;
&lt;li&gt;Provides on-screen graphics showing the current racer and their lap time stats such as current and best.&lt;/li&gt;
&lt;li&gt;Provides a console for the Track Boss / Pit Crew to invalidate laps e.g. when the car leaves the track and crosses the pressure sensor incorrectly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I can't wait for the beta version to be available and recommend that anyone doing a DeepRacer event speaks to AWS about availability of DREM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prizes
&lt;/h2&gt;

&lt;p&gt;We offered a prize to the team for the fastest lap and another prize for a 'highly commended' category. The latter was intended to be a way for us to incentivise people to take part in the final race even if they had not done that well in the virtual league.&lt;/p&gt;

&lt;p&gt;In practice we found the virtual race to be very different to the physical race. There are far more variables in real life and the virtual race is more of a dust-free lab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23h5klo1q3bxm645j6hh.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23h5klo1q3bxm645j6hh.JPG" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Swag
&lt;/h2&gt;

&lt;p&gt;As an added incentive to take part in the racing, we arranged a selection of DeepRacer swag. On the morning of the event, we only allowed our racers to pick up swag. In the afternoon we opened up the swag to everyone.&lt;/p&gt;

&lt;p&gt;To capture feedback on the event, we made swag pickup conditional on completing a survey. See the results later on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrzgtjuxpww76fiecu7g.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrzgtjuxpww76fiecu7g.JPG" alt=" " width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fue9ceebsdzgrb4mn7mz5.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fue9ceebsdzgrb4mn7mz5.JPG" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The T-Shirts were kindly arranged and provided by our AWS Account team. They also designed the logo for the event.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklsqb05k5vqrgbakv4hw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklsqb05k5vqrgbakv4hw.png" alt=" " width="800" height="851"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The rest of the Swag came through contacts in the &lt;a href="https://join.slack.com/t/aws-ml-community/shared_invite/zt-226ch5s9i-tZ_5Ggqbimn3YuOGJ9~bcw" rel="noopener noreferrer"&gt;DeepRacer Community&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;This Slack workspace is where the DeepRacer community gets together to talk about all things DeepRacer. The community was an invaluable resource for ideas and support in the run-up to our event. A special thanks to &lt;a href="https://twitter.com/breadcentric" rel="noopener noreferrer"&gt;Tomasz Ptak&lt;/a&gt;, who put me in contact with &lt;a href="https://www.promoveritas.com" rel="noopener noreferrer"&gt;Promo Veritas&lt;/a&gt;, who sent us DeepRacer hoodies, caps and socks.&lt;/p&gt;

&lt;h1&gt;
  
  
  Costs
&lt;/h1&gt;

&lt;p&gt;Overall, we had capital costs of £4,973.22 &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The two DeepRacer cars: £1,507.48&lt;/li&gt;
&lt;li&gt;Various cables and a dedicated 4G router: £474.59&lt;/li&gt;
&lt;li&gt;DeepRacer AWS charges not covered by our APN credits: £623.66&lt;/li&gt;
&lt;li&gt;The barrier: £799.47&lt;/li&gt;
&lt;li&gt;Track and lap timer: £1,415.79&lt;/li&gt;
&lt;li&gt;Prizes: £152.23&lt;/li&gt;
&lt;/ul&gt;
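&lt;p&gt;For anyone budgeting a similar event, the line items above sum exactly to the headline figure:&lt;/p&gt;

```python
costs_gbp = {
    'Two DeepRacer cars': 1507.48,
    'Cables and dedicated 4G router': 474.59,
    'AWS charges beyond APN credits': 623.66,
    'Barrier': 799.47,
    'Track and lap timer': 1415.79,
    'Prizes': 152.23,
}
total = round(sum(costs_gbp.values()), 2)  # 4973.22
```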

&lt;p&gt;We estimated that our AWS bill for DeepRacer would come to $3,719. This was covered by AWS Innovation Sandbox Credits, which are designed to help partners integrate AWS services into a solution or launch a product to general availability by offsetting AWS usage costs incurred during development. This benefit is available to AWS Partners that build or offer services and solutions.&lt;/p&gt;

&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;We had 71 people respond to our survey, all of whom had signed up to race. After the event we also held a retrospective to help us form suggestions for the future of DeepRacer at CDL.&lt;/p&gt;

&lt;p&gt;94% said they would take part in another CDL track event with 90% wanting to take part again in the virtual aspect of DeepRacer.&lt;/p&gt;

&lt;p&gt;All respondents thought that DeepRacer encouraged or improved collaboration within our teams. 96% of people thought the ~£5k expenditure was good value for money.&lt;/p&gt;

&lt;p&gt;87.4% of people reported learning more about Machine Learning by taking part.&lt;/p&gt;

&lt;p&gt;We also captured free text feedback:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Great event, would like to see this on an annual basis."&lt;/p&gt;

&lt;p&gt;"Found training on the physical track invaluable.  There are quite a few differences in behaviour between the virtual and physical track."&lt;/p&gt;

&lt;p&gt;"To properly explore the different factors that go into a successful model takes time.  Would happily have continued exploring different models and different approaches to the models with more hours of training."&lt;/p&gt;

&lt;p&gt;"Small team, so was easy to engage with each other and bounce ideas off each other."&lt;/p&gt;

&lt;p&gt;"All in all, a very enjoyable way to get into AI and machine learning."&lt;/p&gt;

&lt;p&gt;"Great use of space. Fantastic collaboration opportunity. Great team building. Fun." &lt;/p&gt;

&lt;p&gt;"Cracking thing to do. Encourages a very social twist on a tech thing, it improves diversification with gamified learning which a lot of people benefit from." &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  What's Next?
&lt;/h1&gt;

&lt;p&gt;We are currently looking at running future DeepRacer events. We are particularly interested in running an event with one or more of our customers. We are also looking at contributing to the community initially through a DeepRacer event for a local college or university.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>ai</category>
    </item>
    <item>
      <title>SSL For RDS With Glue Python Job and AWS SDK For Pandas</title>
      <dc:creator>Matt Houghton</dc:creator>
      <pubDate>Sun, 06 Nov 2022 10:40:36 +0000</pubDate>
      <link>https://dev.to/aws-builders/ssl-for-rds-with-glue-python-job-and-aws-sdk-for-pandas-2cf6</link>
      <guid>https://dev.to/aws-builders/ssl-for-rds-with-glue-python-job-and-aws-sdk-for-pandas-2cf6</guid>
      <description>&lt;p&gt;This blog post is the result of a recent interaction with AWS Support. As always they were very helpful in resolving the issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS SDK For Pandas
&lt;/h2&gt;

&lt;p&gt;Recently AWS renamed the AWS Data Wrangler Python library to the &lt;a href="https://aws-sdk-pandas.readthedocs.io/en/stable/#" rel="noopener noreferrer"&gt;AWS SDK for Pandas&lt;/a&gt;. This is an AWS Professional Services open-source Python initiative that extends the power of the Pandas library to AWS, connecting DataFrames and AWS data-related services.&lt;/p&gt;

&lt;p&gt;Built on top of other open-source projects like Pandas, Apache Arrow and Boto3, it offers abstracted functions to execute usual ETL tasks like load/unload data from Data Lakes, Data Warehouses and Databases.&lt;/p&gt;

&lt;p&gt;I was looking to use the integration with AWS Glue to use a glue connection within some Python ETL code. The connection in my case was to an Amazon RDS PostgreSQL database.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;awswrangler&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;wr&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;con_postgresql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;postgresql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My-RDS-PostgreSQL-Connection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;con_postgresql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The theory was that the connection could be defined once in Glue and used by multiple AWS Glue ETL jobs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon RDS Ready - Encryption Requirements
&lt;/h2&gt;

&lt;p&gt;The purpose of the &lt;a href="https://aws.amazon.com/rds/partners/?blog-posts-cards.sort-by=item.additionalFields.modifiedDate&amp;amp;blog-posts-cards.sort-order=desc&amp;amp;partner-solutions-cards.sort-by=item.additionalFields.partnerNameLower&amp;amp;partner-solutions-cards.sort-order=asc&amp;amp;awsf.partner-solutions-filter-partner-type=*all&amp;amp;awsf.partner-solutions-filter-product=*all&amp;amp;awsf.partner-solutions-filter-location=*all" rel="noopener noreferrer"&gt;Amazon Relational Database Service (RDS) Ready Program&lt;/a&gt; is to recognise AWS Partner products that support the use of an Amazon RDS database as a backend for business applications, whether deployed within a customer’s AWS account or provided as SaaS deployed in the APN Partner’s AWS account.&lt;/p&gt;

&lt;p&gt;This program requires that products follow AWS security, availability, reliability, performance and other architecture best practices while integrating with Amazon RDS.&lt;/p&gt;

&lt;p&gt;At CDL our software has been accredited as Amazon RDS Ready, and we apply these standards when developing new solutions. Specifically on data encryption, the Amazon RDS Ready program states:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DBCONN-004 - Data Encryption:&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;For business applications where data encryption is a requirement for security compliance, the product must support encryption of data at rest and in transit for Amazon RDS.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At CDL we ensure that data in transit to RDS is encrypted by setting the rds.force_ssl parameter to 1. See &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL.Concepts.General.SSL.html#PostgreSQL.Concepts.General.SSL.Requiring" rel="noopener noreferrer"&gt;Using SSL with a PostgreSQL DB instance - Amazon Relational Database Service&lt;/a&gt;.&lt;/p&gt;
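The rds.force_ssl parameter lives in the instance's DB parameter group. As an illustration (the parameter group name here is an assumption, not from the original setup), it can be set from the AWS CLI like this:

```shell
# Set rds.force_ssl=1 in a custom DB parameter group (group name is hypothetical)
aws rds modify-db-parameter-group \
  --db-parameter-group-name my-postgres-params \
  --parameters "ParameterName=rds.force_ssl,ParameterValue=1,ApplyMethod=immediate"
```

After this, the instance rejects any client connection that does not negotiate SSL.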
&lt;h2&gt;
  
  
  Attempting an SSL Connection From Glue To RDS
&lt;/h2&gt;

&lt;p&gt;A Glue connection is created to an RDS database that has rds.force_ssl set.&lt;/p&gt;

&lt;p&gt;This is done via the legacy Glue connection screen in the console, as this allows us to test the connection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff38cjxt2o51ud7az6ndk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff38cjxt2o51ud7az6ndk.png" alt="Glue Connection" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the connection test succeeds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj83y03bx6ibxcxqfygss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj83y03bx6ibxcxqfygss.png" alt="Glue Connection Test OK" width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Next we try to use that connection in an AWS Glue Python job utilising the AWS SDK for Pandas.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;awswrangler&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;wr&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;con_postgresql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;postgresql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My-RDS-PostgreSQL-Connection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;con_postgresql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the job returns errors about SSL. I got a couple of different errors while debugging different versions of the code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft6w1vh9e2bk0cmqwbjg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft6w1vh9e2bk0cmqwbjg.png" alt="SSL Error" width="800" height="164"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Finvwx0qeuh1b0o8t7q8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Finvwx0qeuh1b0o8t7q8x.png" alt="SSL Error" width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After some back and forth with AWS Support to debug the issue, the service team identified the following:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Currently, awswrangler loads and uses the default SSL configuration when creating boto3 session clients.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It was clear from the errors we received that this default did not include the Amazon RDS root CA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To override the default configuration, it’s possible to use the connect() function in awswrangler, which allows an SSL context to be passed in.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We need to download the RDS root certificate and point the SSL context at it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;awswrangler&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;wr&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ssl&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;download_rds_root_ca&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Downloading RDS CA root cert…&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlretrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://s3.amazonaws.com/rds-downloads/rds-ca-2019-root.pem&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Downloaded RDS CA root cert.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_rds_ssl_context&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;cafile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp/rds-ca-2019-root.pem&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="nf"&gt;download_rds_root_ca&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cafile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ssl_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ssl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SSLContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ssl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROTOCOL_TLS&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;ssl_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verify_mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ssl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CERT_REQUIRED&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;ssl_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_verify_locations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cafile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cafile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ssl_context&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connecting to RDS database…&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rds_ssl_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_rds_ssl_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;con_postgresql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;postgresql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My-RDS-PostgreSQL-Connection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ssl_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rds_ssl_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successfully connected to RDS database.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
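As an aside, on recent Python versions an equivalent verifying context can be built with ssl.create_default_context, which enables certificate verification and hostname checking out of the box; a minimal sketch (the cafile path shown in the comment is the one downloaded above):

```python
import ssl

# create_default_context() returns a context with verify_mode=CERT_REQUIRED
# and check_hostname=True already set
ssl_context = ssl.create_default_context()

# Pointing it at the downloaded RDS bundle would then be a one-liner:
# ssl_context.load_verify_locations(cafile="/tmp/rds-ca-2019-root.pem")

print(ssl_context.verify_mode == ssl.CERT_REQUIRED)
```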



&lt;h2&gt;
  
  
  Run With SSL
&lt;/h2&gt;

&lt;p&gt;Running the job again with the correct SSL certificate in place, we get a successful execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cepiajb4td9dp0ozx8v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cepiajb4td9dp0ozx8v.png" alt="Job Run Ok" width="800" height="151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqe734omy8x9v4hlv0b8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqe734omy8x9v4hlv0b8.png" alt="Jon Run Logs" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>glue</category>
      <category>python</category>
      <category>etl</category>
    </item>
    <item>
      <title>Using Athena Views As A Source In Glue</title>
      <dc:creator>Matt Houghton</dc:creator>
      <pubDate>Wed, 16 Feb 2022 17:03:02 +0000</pubDate>
      <link>https://dev.to/aws-builders/using-athena-views-as-a-source-in-glue-k09</link>
      <guid>https://dev.to/aws-builders/using-athena-views-as-a-source-in-glue-k09</guid>
      <description>&lt;p&gt;Whilst working with AWS Glue recently I noticed that I was unable to use a view created in Athena as a source for an ETL job in the same way that I could use a table that had been cataloged.&lt;/p&gt;

&lt;p&gt;The error I received was this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;An&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="n"&gt;occurred&lt;/span&gt; &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;calling&lt;/span&gt; &lt;span class="n"&gt;o73&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getCatalogSource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;No&lt;/span&gt; &lt;span class="n"&gt;classification&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;mydatabase&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v_my_view&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rather than try to recreate the view using a new PySpark job, I used the Athena JDBC driver as a custom JAR in a Glue job to be able to query the view I wanted to use.&lt;/p&gt;

&lt;p&gt;This blog post contains my notes on how this works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Drivers
&lt;/h2&gt;

&lt;p&gt;Create or reuse an existing S3 bucket to store the Athena JDBC driver JAR file. The JAR files are available to &lt;a href="https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html" rel="noopener noreferrer"&gt;download from AWS&lt;/a&gt;. I used the latest version, which at the time of writing was the JDBC Driver with AWS SDK, AthenaJDBC42_2.0.27.1000.jar (compatible with JDBC 4.2; requires JDK 8.0 or later).&lt;/p&gt;

&lt;h2&gt;
  
  
  IAM
&lt;/h2&gt;

&lt;p&gt;The Glue job will need not only Glue service privileges but also IAM privileges to access the S3 buckets and the AWS Athena service.&lt;/p&gt;

&lt;p&gt;For Athena, this would provide Glue with full permissions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VisualEditor1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"logs:CreateLogStream"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"logs:AssociateKmsKey"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"athena:*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"logs:CreateLogGroup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"logs:PutLogEvents"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:athena:*:youraccount:workgroup/*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:athena:*:youracccont:datacatalog/*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:logs:*:*:/aws-glue/*"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Create Glue ETL Job
&lt;/h2&gt;

&lt;p&gt;My use case for the Glue job was to query the view I had and save the results into Parquet format to speed up future queries against the same data.&lt;/p&gt;

&lt;p&gt;The following code allows you to query an Athena view as a source for a data frame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.transforms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;getResolvedOptions&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GlueContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.job&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Job&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.dynamicframe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DynamicFrame&lt;/span&gt;

&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getResolvedOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOB_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkContext&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;glueContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GlueContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;
&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOB_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;athena_view_dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;driver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;com.simba.athena.jdbc.Driver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AwsCredentialsProviderClass&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc:awsathena://athena.eu-west-1.amazonaws.com:443&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbtable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AwsDataCatalog.yourathenadatabase.yourathenaview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S3OutputLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://yours3bucket/temp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;athena_view_dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;printSchema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key things to be aware of in this code snippet are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;driver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;com.simba.athena.jdbc.Driver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are telling Glue which class within the JDBC driver to use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AwsCredentialsProviderClass&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This uses the IAM role assigned to the Glue job to authenticate to Athena. You can use other authentication methods, like AWS_ACCESS_KEY or federated authentication, but I think using IAM makes the most sense for an ETL job that will most likely run on a schedule or event.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc:awsathena://athena.eu-west-1.amazonaws.com:443&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I am using Athena in Ireland (eu-west-1); if you are using a different region, update this accordingly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbtable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AwsDataCatalog.yourathenadatabase.yourathenaview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fully qualified name of the view in your Athena catalog, in the format 'AwsDataCatalog.Database.View'. For example, for this query run in Athena:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"AwsDataCatalog"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"vehicles"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"v_electric_cars"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You would set the dbtable option to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbtable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AwsDataCatalog.vehicles.v_electric_cars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
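If the catalog, database, and view names are parameterised in the job, a small helper (hypothetical, not part of the driver or the original script) keeps the dbtable value in the expected format:

```python
def athena_dbtable(database: str, view: str, catalog: str = "AwsDataCatalog") -> str:
    """Build the fully qualified catalog.database.view name the JDBC driver expects."""
    return f"{catalog}.{database}.{view}"

print(athena_dbtable("vehicles", "v_electric_cars"))
# AwsDataCatalog.vehicles.v_electric_cars
```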



&lt;p&gt;The last option tells Glue which S3 location to use as temporary storage for the data returned from Athena.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S3OutputLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://yours3bucket/temp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point you can test that it works. When running the job, you need to tell Glue the location of the Athena JDBC driver JAR file that was uploaded to S3.&lt;/p&gt;

&lt;p&gt;If you are working in the AWS Glue console, the parameter to set can be found under Job Details --&amp;gt; Advanced --&amp;gt; Dependent JARs path.&lt;/p&gt;

&lt;p&gt;The parameter needs to be set to the full path and filename of the JAR file, for example s3://yours3bucket/jdbc-drivers/AthenaJDBC42_2.0.27.1000.jar.&lt;/p&gt;

&lt;p&gt;Setting this in the console ensures that the correct argument is passed into the Glue job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--extra-jars&lt;/span&gt; s3://yours3bucket/jdbc-drivers/AthenaJDBC42_2.0.27.1000.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
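If the job is created from the AWS CLI rather than the console, the same argument can be supplied via --default-arguments; a sketch, where the job name, role, and script location are assumptions:

```shell
# Create the Glue job with the Athena JDBC driver on its classpath
# (job name, role, and script path are hypothetical)
aws glue create-job \
  --name athena-view-to-parquet \
  --role MyGlueJobRole \
  --command Name=glueetl,ScriptLocation=s3://yours3bucket/scripts/job.py \
  --default-arguments '{"--extra-jars": "s3://yours3bucket/jdbc-drivers/AthenaJDBC42_2.0.27.1000.jar"}'
```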



&lt;p&gt;The final code, including the conversion to Parquet format, looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.transforms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;getResolvedOptions&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GlueContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.job&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Job&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.dynamicframe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DynamicFrame&lt;/span&gt;

&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getResolvedOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOB_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkContext&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;glueContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GlueContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;
&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOB_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;athena_view_dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;driver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;com.simba.athena.jdbc.Driver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AwsCredentialsProviderClass&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc:awsathena://athena.eu-west-1.amazonaws.com:443&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbtable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AwsDataCatalog.vehicles.v_electric_cars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S3OutputLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://yours3bucket/temp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;athena_view_dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;printSchema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;athena_view_datasource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DynamicFrame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromDF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;athena_view_dataframe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;athena_view_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pq_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write_dynamic_frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;athena_view_datasource&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;connection_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;glueparquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;connection_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://yourotherS3Bucket/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;partitionKeys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;format_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snappy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;transformation_ctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ParquetConversion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>aws</category>
      <category>sql</category>
      <category>etl</category>
      <category>glue</category>
    </item>
    <item>
      <title>What's New in ML? re:Invent ML Keynote</title>
      <dc:creator>Matt Houghton</dc:creator>
      <pubDate>Tue, 08 Dec 2020 19:14:53 +0000</pubDate>
      <link>https://dev.to/aws-builders/what-s-new-in-ml-re-invent-ml-keynote-5fol</link>
      <guid>https://dev.to/aws-builders/what-s-new-in-ml-re-invent-ml-keynote-5fol</guid>
      <description>&lt;p&gt;It's week two of re:Invent and that includes the first ever dedicated keynote for Machine Learning. Here are the features that I found interesting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fj83lw6di4lflms7d7a7m.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fj83lw6di4lflms7d7a7m.jpeg" alt="Alt Text" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Aurora ML
&lt;/h1&gt;

&lt;p&gt;This feature aims to bring ML to SQL without you needing to learn ML. When you run a query, Amazon Aurora handles the integration with AWS ML services for you.&lt;/p&gt;

&lt;p&gt;Aurora exposes ML models as SQL functions, allowing you to use standard SQL to build applications that call ML models, pass data to them, and return predictions as query results. The models can include ones you trained in SageMaker, Comprehend or models offered by AWS partners.&lt;/p&gt;
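
&lt;p&gt;As a rough sketch of what that looks like (the table, columns, and function name here are hypothetical, not from the announcement), a query calling a model might read:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical example: score churn risk with an ML-backed SQL function
SELECT customer_id,
       predict_churn(age, tenure_months, monthly_spend) AS churn_score
FROM customers
ORDER BY churn_score DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;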

&lt;p&gt;&lt;a href="https://aws.amazon.com/rds/aurora/machine-learning/" rel="noopener noreferrer"&gt;https://aws.amazon.com/rds/aurora/machine-learning/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhareunyx95gun2t3i0va.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhareunyx95gun2t3i0va.png" alt="Alt Text" width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Lookout For Metrics
&lt;/h1&gt;

&lt;p&gt;Amazon Lookout for Metrics uses machine learning to automatically detect and diagnose anomalies in business and operational time series data.&lt;/p&gt;

&lt;p&gt;You can connect data stores like S3, RDS and Redshift, as well as SaaS applications, and monitor the metrics that are important to your business.&lt;/p&gt;

&lt;p&gt;Amazon Lookout simplifies the process by automatically inspecting and preparing the data and building a custom ML model. It's powered by the experience Amazon has built up doing this internally.&lt;/p&gt;

&lt;p&gt;Sign up for the preview here &lt;a href="https://aws.amazon.com/lookout-for-metrics/" rel="noopener noreferrer"&gt;https://aws.amazon.com/lookout-for-metrics/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0reopyijlrqz31ew8vsc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0reopyijlrqz31ew8vsc.png" alt="Alt Text" width="715" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Neptune ML
&lt;/h1&gt;

&lt;p&gt;Amazon Neptune is a fully managed graph database service designed to work with highly connected datasets. The new ML feature brings predictions to graphs. &lt;/p&gt;

&lt;p&gt;For large graphs with billions of relationships, it’s hard to discover insights using queries based only on human intuition. For this reason, you can use ML on graphs to automatically reveal new insights and make predictions.&lt;/p&gt;

&lt;p&gt;Using graph neural networks (GNNs), a machine learning (ML) technique purpose-built for graphs, you can improve the accuracy of most graph predictions by over 50%. &lt;/p&gt;

&lt;p&gt;Neptune ML uses the Deep Graph Library (DGL), an open-source library to which AWS contributes, making it easy to develop and apply GNN models on graph data.&lt;/p&gt;

&lt;p&gt;Read the AWS Database blog on the announcement here &lt;a href="https://aws.amazon.com/blogs/database/announcing-amazon-neptune-ml-easy-fast-and-accurate-predictions-on-graphs/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/database/announcing-amazon-neptune-ml-easy-fast-and-accurate-predictions-on-graphs/&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;There is also a getting started guide here &lt;a href="https://aws.amazon.com/blogs/database/how-to-get-started-with-neptune-ml/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/database/how-to-get-started-with-neptune-ml/&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyhftukdkrgz3tdu2spuo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyhftukdkrgz3tdu2spuo.jpeg" alt="Alt Text" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Redshift ML
&lt;/h1&gt;

&lt;p&gt;Like Aurora ML, this feature aims to bring ML to Redshift using SQL.&lt;/p&gt;

&lt;p&gt;The CREATE MODEL SQL command is used in Redshift to specify your training data. Redshift ML will then compile and import the trained model inside the Redshift data warehouse and prepare a SQL function for use in SQL queries.&lt;/p&gt;
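
&lt;p&gt;A minimal sketch of that workflow (the table, column, and function names here are hypothetical) might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical example: train on query results, then call the generated function
CREATE MODEL customer_churn_model
FROM (SELECT age, tenure_months, monthly_spend, churned FROM customer_history)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftML'
SETTINGS (S3_BUCKET 'yours3bucket');

SELECT customer_id, predict_churn(age, tenure_months, monthly_spend) AS churn_score
FROM customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;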

&lt;p&gt;See the product page for more details and to get started &lt;a href="https://aws.amazon.com/redshift/features/redshiftML/" rel="noopener noreferrer"&gt;https://aws.amazon.com/redshift/features/redshiftML/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv3k24itbsqr5g1lylxc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fv3k24itbsqr5g1lylxc6.png" alt="Alt Text" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  HealthLake
&lt;/h1&gt;

&lt;p&gt;HealthLake aims to take medical data and provide the tools and machine learning to make it available for analytics in a way that is 'HIPAA-eligible' and supports the industry-standard Fast Healthcare Interoperability Resources (FHIR) format.&lt;/p&gt;

&lt;p&gt;Using NLP and Comprehend Medical processing, the data is made available for search and query using QuickSight, SageMaker and third-party applications.&lt;/p&gt;

&lt;p&gt;Sign up for the preview here &lt;a href="https://aws.amazon.com/healthlake/" rel="noopener noreferrer"&gt;https://aws.amazon.com/healthlake/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Watch the two-minute intro video.&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/HM-7YkMt9Y4"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>aws</category>
      <category>analytics</category>
      <category>datascience</category>
    </item>
    <item>
      <title>re:Invent Week 2: Data Sessions</title>
      <dc:creator>Matt Houghton</dc:creator>
      <pubDate>Mon, 07 Dec 2020 20:56:05 +0000</pubDate>
      <link>https://dev.to/aws-builders/re-invent-week-2-data-sessions-2ge0</link>
      <guid>https://dev.to/aws-builders/re-invent-week-2-data-sessions-2ge0</guid>
<description>&lt;p&gt;Week 2 starts soon, so here are my picks of the data-related sessions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Machine Learning Keynote&lt;/li&gt;
&lt;li&gt;Using Amazon QLDB as a system-of-trust database for core business apps&lt;/li&gt;
&lt;li&gt;Get started with Amazon SageMaker in minutes&lt;/li&gt;
&lt;li&gt;What’s new in Amazon RDS for SQL Server&lt;/li&gt;
&lt;li&gt;Fast distributed training and near-linear scaling with PyTorch on AWS&lt;/li&gt;
&lt;li&gt;Building a successful inventory planning solution with Amazon Forecast&lt;/li&gt;
&lt;li&gt;Get deep insights about your ML models during training&lt;/li&gt;
&lt;li&gt;Amazon Aurora Serverless v2: Instant scaling for demanding workloads&lt;/li&gt;
&lt;li&gt;Paving the way toward automated driving with BMW Group&lt;/li&gt;
&lt;li&gt;Migrating databases to Amazon DocumentDB (with MongoDB compatibility)&lt;/li&gt;
&lt;li&gt;Train large models with billions of parameters in TensorFlow 2.0 &lt;/li&gt;
&lt;li&gt;How New Relic is migrating its Apache Kafka cluster to Amazon MSK&lt;/li&gt;
&lt;li&gt;Deliver viewing experiences for super fans with Amazon Personalize&lt;/li&gt;
&lt;li&gt;Running Apache Cassandra workloads with Amazon Keyspaces&lt;/li&gt;
&lt;li&gt;Harness the power of data with AWS analytics&lt;/li&gt;
&lt;li&gt;What’s new with Amazon Redshift&lt;/li&gt;
&lt;li&gt;Power modern serverless applications with GraphQL and AWS AppSync&lt;/li&gt;
&lt;li&gt;How Amazon Redshift powers large-scale analytics for Amazon.com&lt;/li&gt;
&lt;li&gt;New use cases for Amazon Redshift&lt;/li&gt;
&lt;li&gt;Beyond AWS DMS: Programs and partners to ace your migration&lt;/li&gt;
&lt;li&gt;Amazon.com’s use of AI/ML to enhance the customer experience&lt;/li&gt;
&lt;li&gt;Migrating a legacy data warehouse to Amazon Redshift&lt;/li&gt;
&lt;li&gt;What’s new with Amazon EMR&lt;/li&gt;
&lt;li&gt;Choose the right machine learning algorithm in Amazon SageMaker&lt;/li&gt;
&lt;li&gt;Understanding AWS Lambda streaming events&lt;/li&gt;
&lt;li&gt;Infrastructure Keynote (ok not strictly data services but it's always interesting to see the scale of what is powering them)&lt;/li&gt;
&lt;li&gt;Serverless data preparation with AWS Glue&lt;/li&gt;
&lt;li&gt;Deep dive on Amazon Aurora with MySQL compatibility&lt;/li&gt;
&lt;li&gt;Under the hood: How Amazon uses AWS for analytics at petabyte scale&lt;/li&gt;
&lt;li&gt;Building real-time applications using Apache Flink&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>database</category>
      <category>datascience</category>
      <category>analytics</category>
    </item>
    <item>
      <title>What's New In Data: re:Invent Andy Jassy Keynote</title>
      <dc:creator>Matt Houghton</dc:creator>
      <pubDate>Tue, 01 Dec 2020 21:41:05 +0000</pubDate>
      <link>https://dev.to/aws-builders/whats-new-in-data-re-invent-andy-jassy-keynote-7fa</link>
      <guid>https://dev.to/aws-builders/whats-new-in-data-re-invent-andy-jassy-keynote-7fa</guid>
      <description>&lt;p&gt;It's a different experience this year. The chat with my teammates is a mixture of discussion about new features and pictures of good times in Vegas from previous re:Invent conferences.&lt;/p&gt;

&lt;p&gt;Andy Jassy has finished the first keynote of 2020 and I was not disappointed. Lots of great new features that we have use cases for.&lt;/p&gt;

&lt;p&gt;Here are my favourite data related features announced during the Andy Jassy re:Invent keynote.&lt;/p&gt;

&lt;h1&gt;
  
  
  Glue Elastic Views
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fx0zplildtbtsrpw0u0yt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fx0zplildtbtsrpw0u0yt.png" alt="Alt Text" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most data teams and customers I work with have data in multiple places. You might have a CRM system, an accounts system, document management etc. Bringing all this data together and keeping it up to date in a 'single customer view' for analytics workloads is something data engineers spend a lot of time thinking about.&lt;/p&gt;

&lt;p&gt;I've used Materialised Views heavily in the past to convert transactional data models into views more suitable for reporting queries.&lt;/p&gt;

&lt;p&gt;Glue Elastic Views looks like a great feature for when you have data in multiple types of databases and want to apply Change Data Capture (CDC) and materialised-view-style functionality.&lt;/p&gt;

&lt;p&gt;I can't wait to get hands-on with the preview. You can sign up today at &lt;a href="https://aws.amazon.com/glue/features/elastic-views/" rel="noopener noreferrer"&gt;https://aws.amazon.com/glue/features/elastic-views/&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Quicksight Q
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1euua12m0dd76drmhbw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1euua12m0dd76drmhbw9.png" alt="Alt Text" width="735" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was already a fan of QuickSight due to the pay per session pricing. It works really well when you consider the minimum user licensing for some other data visualisation tools.&lt;/p&gt;

&lt;p&gt;I also like the features for embedding QuickSight dashboards into your applications.&lt;/p&gt;

&lt;p&gt;The newly announced ability to ask questions of your data in natural language makes it even easier for end users of your applications to benefit from analytics in a much more consistent and integrated way.&lt;/p&gt;

&lt;p&gt;The Q feature is in preview and you can sign up at &lt;a href="https://aws.amazon.com/quicksight/q/?nc=sn&amp;amp;loc=4" rel="noopener noreferrer"&gt;https://aws.amazon.com/quicksight/q/?nc=sn&amp;amp;loc=4&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out the blog on QuickSight Q here &lt;a href="https://aws.amazon.com/blogs/aws/amazon-quicksight-q-to-answer-ad-hoc-business-questions/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/aws/amazon-quicksight-q-to-answer-ad-hoc-business-questions/&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  New gp3 EBS Volumes
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fap1xnvcloycckgu34so7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fap1xnvcloycckgu34so7.png" alt="Alt Text" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can now scale your storage volume performance independent of storage capacity. Oh and it's up to 20% cheaper than gp2. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2020/12/introducing-new-amazon-ebs-general-purpose-volumes-gp3/" rel="noopener noreferrer"&gt;https://aws.amazon.com/about-aws/whats-new/2020/12/introducing-new-amazon-ebs-general-purpose-volumes-gp3/&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Aurora Serverless v2
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1m23z1wrnlz1v83cjfb7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1m23z1wrnlz1v83cjfb7.png" alt="Alt Text" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;v2 claims to scale almost instantly, in a fraction of a second. The scaling is adjusted in fine-grained increments to provide just the right amount of database resources for the application. &lt;/p&gt;

&lt;p&gt;The preview currently covers MySQL compatibility and includes Aurora features like Global Database, Multi-AZ deployment and read replicas.&lt;/p&gt;

&lt;p&gt;Sign up for the preview at &lt;a href="https://aws.amazon.com/rds/aurora/serverless/" rel="noopener noreferrer"&gt;https://aws.amazon.com/rds/aurora/serverless/&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Babelfish for PostgreSQL
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjilqcuz8crsjh9hmurjy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjilqcuz8crsjh9hmurjy.png" alt="Alt Text" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I've seen quite a number of database workload migrations to the cloud. Often these also include moving from a commercial database engine to an open-source engine like PostgreSQL. Tools like AWS DMS and Qlik Replicate do a good job of handling the data migration and conversion of data types. What is often more time consuming is the migration of database code such as PL/SQL to the open-source equivalent.&lt;/p&gt;

&lt;p&gt;Babelfish looks to address the database code migration problem for MS SQL to PostgreSQL migrations.&lt;/p&gt;

&lt;p&gt;Babelfish adds an endpoint to PostgreSQL that understands the SQL Server wire protocol, Tabular Data Stream (TDS), as well as commonly used T-SQL commands.&lt;/p&gt;

&lt;p&gt;With Babelfish enabled, you don’t have to swap out database drivers or take on the significant effort of rewriting and verifying all of your applications’ database requests.&lt;/p&gt;

&lt;p&gt;Check out the AWS Open Source blog on Babelfish here &lt;a href="https://aws.amazon.com/blogs/opensource/want-more-postgresql-you-just-might-like-babelfish/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/opensource/want-more-postgresql-you-just-might-like-babelfish/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS plans to open source Babelfish in Q1 2021; until then you can sign up for the Amazon Aurora preview. You can also check out the Babelfish community here &lt;a href="https://babelfish-for-postgresql.github.io/babelfish-for-postgresql/" rel="noopener noreferrer"&gt;https://babelfish-for-postgresql.github.io/babelfish-for-postgresql/&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  SageMaker Data Wrangler
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5au1o6o48oz0p3rfw32e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5au1o6o48oz0p3rfw32e.png" alt="Alt Text" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In some industries, up to 92% of analytics project time is spent on data wrangling (sourcing, ETL, cleaning, etc.) just to get ready for the actual machine learning and analytics workloads.&lt;/p&gt;

&lt;p&gt;Amazon SageMaker Data Wrangler claims to reduce the time it takes to aggregate and prepare data for machine learning. It simplifies data preparation and feature engineering, letting you complete each step of the workflow, including data selection, cleansing, exploration, and visualisation, from a single visual interface.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/sagemaker/data-wrangler/" rel="noopener noreferrer"&gt;https://aws.amazon.com/sagemaker/data-wrangler/&lt;/a&gt; &lt;/p&gt;

&lt;h1&gt;
  
  
  SageMaker Feature Store
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fr2i6v7y7hcggf001x28p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fr2i6v7y7hcggf001x28p.png" alt="Alt Text" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Like data wrangling, feature engineering can be a time-consuming process. Once it is completed, it makes sense to share the results with other people who might be developing machine learning workloads based on the same datasets.&lt;/p&gt;

&lt;p&gt;Just as a data catalog enables an organisation to discover data assets, the new Feature Store in SageMaker provides a repository where you can store and access features, so it’s much easier to name, organise, and reuse them across teams.&lt;/p&gt;

&lt;p&gt;Check out the details here &lt;a href="https://aws.amazon.com/sagemaker/feature-store/" rel="noopener noreferrer"&gt;https://aws.amazon.com/sagemaker/feature-store/&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  SageMaker Pipelines
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmh2fz3wsdqdknjubsi1t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmh2fz3wsdqdknjubsi1t.png" alt="Alt Text" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bringing CI/CD to machine learning workloads, SageMaker Pipelines helps you automate different steps of the ML workflow, including data loading, data transformation, training and tuning, and deployment.&lt;/p&gt;

&lt;p&gt;Check out the details here &lt;a href="https://aws.amazon.com/sagemaker/pipelines/" rel="noopener noreferrer"&gt;https://aws.amazon.com/sagemaker/pipelines/&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  What's Next?
&lt;/h1&gt;

&lt;p&gt;It's been a great start to re:Invent. I can't wait to see what else they have in store for us.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>analytics</category>
      <category>database</category>
      <category>aws</category>
    </item>
    <item>
      <title>re:Invent: Data Sessions Week 1</title>
      <dc:creator>Matt Houghton</dc:creator>
      <pubDate>Sun, 29 Nov 2020 23:14:05 +0000</pubDate>
      <link>https://dev.to/aws-builders/re-invent-data-sessions-week-1-4dka</link>
      <guid>https://dev.to/aws-builders/re-invent-data-sessions-week-1-4dka</guid>
<description>&lt;p&gt;Here is my list of week one re:Invent sessions focussed on data and analytics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to use fully managed Jupyter notebooks in Amazon SageMaker&lt;/li&gt;
&lt;li&gt;What’s new with Amazon S3&lt;/li&gt;
&lt;li&gt;How BMW Group uses AWS serverless analytics for a data-driven ecosystem&lt;/li&gt;
&lt;li&gt;Innovate faster with applications on AWS storage&lt;/li&gt;
&lt;li&gt;Embed analytics in your applications with Amazon QuickSight&lt;/li&gt;
&lt;li&gt;Discovering insights from customer surveys at McDonald’s&lt;/li&gt;
&lt;li&gt;What’s new in Amazon ElastiCache&lt;/li&gt;
&lt;li&gt;How FINRA operates PB-scale analytics on data lakes with Amazon Athena&lt;/li&gt;
&lt;li&gt;How Zynga modernized mobile analytics with Amazon Redshift RA3&lt;/li&gt;
&lt;li&gt;Implementing MLOps practices with Amazon SageMaker&lt;/li&gt;
&lt;li&gt;Break down data silos: Build a serverless data lake on Amazon S3&lt;/li&gt;
&lt;li&gt;Amazon DocumentDB (with MongoDB compatibility) Deep Dive&lt;/li&gt;
&lt;li&gt;Gameloft: A zero downtime data lake migration deep dive&lt;/li&gt;
&lt;li&gt;Nationwide’s journey to a governed data lake on AWS&lt;/li&gt;
&lt;li&gt;BI at hyperscale: Quickly build and scale dashboards with Amazon QuickSight&lt;/li&gt;
&lt;li&gt;Building for the future with AWS databases&lt;/li&gt;
&lt;li&gt;How the NFL builds computer vision training datasets at scale&lt;/li&gt;
&lt;li&gt;What’s new in Amazon RDS&lt;/li&gt;
&lt;li&gt;Serverless analytics at Equinox Media: Handling growth during disruption&lt;/li&gt;
&lt;li&gt;Data modeling with Amazon DynamoDB – Part 1&lt;/li&gt;
&lt;li&gt;Dive deep into AWS Schema Conversion Tool and AWS DMS&lt;/li&gt;
&lt;li&gt;How Vyaire uses AWS analytics to scale ventilator production&lt;/li&gt;
&lt;li&gt;From POC to production: Strategies for achieving machine learning at scale&lt;/li&gt;
&lt;li&gt;The right tool for the job: Enabling analytics at scale at Intuit&lt;/li&gt;
&lt;li&gt;Secure and compliant machine learning for regulated industries&lt;/li&gt;
&lt;li&gt;How Disney+ uses fast data ubiquity to improve the customer experience&lt;/li&gt;
&lt;li&gt;Deep dive on Amazon Aurora with PostgreSQL compatibility&lt;/li&gt;
&lt;li&gt;Train and tune ML models to the highest accuracy using Amazon SageMaker&lt;/li&gt;
&lt;li&gt;How Nielsen built a multi-petabyte data platform using Amazon EMR&lt;/li&gt;
&lt;li&gt;How Goldman Sachs uses an Amazon MSK backbone for its Transaction Banking Platform&lt;/li&gt;
&lt;li&gt;Data modeling with Amazon DynamoDB – Part 2&lt;/li&gt;
&lt;li&gt;How Disney+ scales globally on Amazon DynamoDB&lt;/li&gt;
&lt;li&gt;Productionizing R workloads using Amazon SageMaker, featuring Siemens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a full list of sessions and to register visit &lt;a href="https://reinvent.awsevents.com" rel="noopener noreferrer"&gt;https://reinvent.awsevents.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>datascience</category>
      <category>analytics</category>
      <category>database</category>
    </item>
    <item>
      <title>Unify data silos with AWS AppSync</title>
      <dc:creator>Matt Houghton</dc:creator>
      <pubDate>Tue, 24 Nov 2020 13:14:52 +0000</pubDate>
      <link>https://dev.to/aws-builders/unify-data-silos-with-aws-appsync-2gkl</link>
      <guid>https://dev.to/aws-builders/unify-data-silos-with-aws-appsync-2gkl</guid>
      <description>&lt;h1&gt;
  
  
  Silos
&lt;/h1&gt;

&lt;p&gt;Most organisations that process data will have experienced the concept of data in silos. This is where an application is built for a particular purpose and tied to a data store. While this may solve a particular business problem, as time passes developers and engineers may start to spend time extracting data from these silos for other purposes such as analytics and machine learning.&lt;/p&gt;

&lt;p&gt;If you are lucky, your teams might have provided APIs to access the data, but what if that API is missing two key fields that you need, or returns too much data?&lt;/p&gt;

&lt;p&gt;For older software that uses a relational database as its data store, it's more likely that the application talks SQL over JDBC/ODBC and no API is available.&lt;/p&gt;

&lt;p&gt;Pulling disparate datasets together to present them for new projects can be time-consuming. Engineers also have to deal with application modernisation projects, such as breaking up monoliths as part of a cloud migration. Keeping the lights on whilst providing a path to making your architecture cloud friendly is a delicate balancing act.&lt;/p&gt;

&lt;p&gt;This post looks into GraphQL, specifically the AWS implementation via AppSync and how it can be used to help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provide a flexible API for developers&lt;/li&gt;
&lt;li&gt;Join data from silos together&lt;/li&gt;
&lt;li&gt;Provide a migration path for application modernisation by moving some data into DynamoDB while keeping some in an RDBMS&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's GraphQL?
&lt;/h2&gt;

&lt;p&gt;Organizations choose to build APIs with &lt;a href="https://graphql.org" rel="noopener noreferrer"&gt;GraphQL&lt;/a&gt; because it gives developers the ability to query multiple databases, microservices, and APIs with a single GraphQL endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's AppSync?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/appsync/" rel="noopener noreferrer"&gt;AWS AppSync&lt;/a&gt; is a fully managed service that makes it easy to develop GraphQL APIs. Out of the box it allows connections to data sources like AWS DynamoDB, Lambda, and more.&lt;/p&gt;

&lt;h1&gt;
  
  
  Data Sources
&lt;/h1&gt;

&lt;p&gt;In this example we will provide a unified API that is able to query data from the following data stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DynamoDB - Representing a fairly new cloud native application.&lt;/li&gt;
&lt;li&gt;RDS - Representing a traditional 3 tier app that has been migrated to the cloud.&lt;/li&gt;
&lt;li&gt;Lambda - Representing a serverless application.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Throughout we'll use dummy/test vehicle data that we want to bring together.&lt;/p&gt;

&lt;h2&gt;
  
  
  DynamoDB
&lt;/h2&gt;

&lt;p&gt;Create a table named vehicle. The key is vehicle_id (string).&lt;/p&gt;

&lt;p&gt;Add some test data by adding a couple of items.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpwv5x8imozex2xishyro.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpwv5x8imozex2xishyro.png" alt="Alt Text" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lambda
&lt;/h2&gt;

&lt;p&gt;We now create a quick Lambda function that mocks returning some data for a given vehicle_id.&lt;/p&gt;

&lt;p&gt;The lambda code is shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

print('Loading function')

def lambda_handler(event, context):
    print (json.dumps(event))
    print (context)

    vehicle_id={}
    vehicle_id=event['source']['vehicle_id']
    print(vehicle_id)

    vehicles = {
        "123456" : { "vehicle_id" : "123456", "fuel" : "electric", "category": "SUV" },
        "987654321" : { "vehicle_id" : "987654321", "fuel": "hybrid", "category": "Saloon"}
    }

    print(vehicles[vehicle_id])
    return (vehicles[vehicle_id])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
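&lt;p&gt;The handler can be sanity-checked locally before wiring it into AppSync by invoking it with the event shape AppSync sends, where the parent object appears under source. A minimal sketch (a trimmed copy of the handler is repeated here so the snippet runs standalone):&lt;/p&gt;

```python
import json

# Trimmed copy of the handler above, so this snippet is self-contained.
def lambda_handler(event, context):
    vehicle_id = event['source']['vehicle_id']
    vehicles = {
        "123456": {"vehicle_id": "123456", "fuel": "electric", "category": "SUV"},
        "987654321": {"vehicle_id": "987654321", "fuel": "hybrid", "category": "Saloon"}
    }
    return vehicles.get(vehicle_id)

# AppSync invokes the function with the parent Vehicle object under 'source'.
event = {"source": {"vehicle_id": "123456"}}
result = lambda_handler(event, None)
print(json.dumps(result))  # {"vehicle_id": "123456", "fuel": "electric", "category": "SUV"}
```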



&lt;h2&gt;
  
  
  RDS (Aurora PostgreSQL)
&lt;/h2&gt;

&lt;p&gt;Out of the box, AppSync supports Aurora Serverless RDS instances. Create an RDS Aurora PostgreSQL instance named vehicle-accident.&lt;/p&gt;

&lt;p&gt;It's important to enable the Data API feature which is a connectionless Web Service API for running SQL queries against the database.&lt;/p&gt;
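&lt;p&gt;Outside AppSync, the same Data API can be called from any SDK. Here is a hedged sketch (the ARNs, database name, and client are placeholders supplied by the caller) with a small helper that flattens the Data API record format into plain rows:&lt;/p&gt;

```python
# Sketch only: resource_arn, secret_arn, and database are placeholders for your own values.
def query_accidents(client, resource_arn, secret_arn, database, vehicle_id):
    """Run one SQL statement through the RDS Data API (no open connection required)."""
    response = client.execute_statement(
        resourceArn=resource_arn,
        secretArn=secret_arn,
        database=database,
        sql='select damage, cost from accident where vehicle_id = :vid',
        parameters=[{'name': 'vid', 'value': {'stringValue': vehicle_id}}],
    )
    # The Data API returns rows as lists of typed value dicts; flatten to plain values.
    return [
        [list(field.values())[0] for field in record]
        for record in response['records']
    ]
```

&lt;p&gt;With boto3 the client would be boto3.client('rds-data'), passing the cluster ARN and the Secrets Manager ARN described later in this post.&lt;/p&gt;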

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fi8sw5kjawa3h7y5h59rt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fi8sw5kjawa3h7y5h59rt.png" alt="Alt Text" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the instance has been created, connect to it using the RDS query editor and run the following SQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table accident (
vehicle_id varchar,
accident_date date,
damage varchar,
cost integer);

insert into accident values ('123456', '2020-11-23 18:00:00', 'windscreen smashed', 100);
insert into accident values ('987654321', '2020-11-24 18:00:00', 'dent in front passenger door', 600);
commit;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order for AppSync to connect to RDS later we need to store database credentials in AWS Secrets Manager.&lt;/p&gt;

&lt;p&gt;Create a file named creds.json containing the database credentials.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "username": "xxxxxxxxxxxxxx",
    "password": "xxxxxxxxxxxxxx"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the credentials using the AWS CLI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws secretsmanager create-secret --name HttpRDSSecret --secret-string file://creds.json --region eu-west-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make a note of the ARN returned as this is needed later.&lt;/p&gt;

&lt;h1&gt;
  
  
  Create The GraphQL API
&lt;/h1&gt;

&lt;p&gt;From the AppSync console select build from scratch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F14vwjmgbxjeasafiah65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F14vwjmgbxjeasafiah65.png" alt="Alt Text" width="800" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Give your API a name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhs8em8f9o6dsju2verd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhs8em8f9o6dsju2verd2.png" alt="Alt Text" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Schema
&lt;/h2&gt;

&lt;p&gt;Click edit schema.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9hvz8xz2wof3o9ll45m2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9hvz8xz2wof3o9ll45m2.png" alt="Alt Text" width="800" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Add the following schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Query {
    #Get a single vehicle.
    singleVehicle(vehicle_id: String): Vehicle
}

type Vehicle {
    vehicle_id: String
    model: String
    year: String
    colour: String
    make: String
    fuel: String
    category: String
    accident_date: String
    accident_damage: String
    accident_cost: String
}

schema {
    query: Query
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
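&lt;p&gt;With this schema in place, a client asks for exactly the fields it needs. For example, a query against singleVehicle might look like this (the field selection is entirely up to the caller):&lt;/p&gt;

```graphql
query GetVehicle {
    singleVehicle(vehicle_id: "123456") {
        vehicle_id
        make
        model
        fuel
        accident_damage
        accident_cost
    }
}
```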



&lt;h2&gt;
  
  
  Data Sources
&lt;/h2&gt;

&lt;p&gt;Next we define the three data sources: DynamoDB, RDS, and Lambda. Click Data Sources and add them one by one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxqavleww0bq9srm37f9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxqavleww0bq9srm37f9t.png" alt="Alt Text" width="800" height="898"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwf4bdmaxblu2fbrwqh34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwf4bdmaxblu2fbrwqh34.png" alt="Alt Text" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu6dnkxroxnxu0gk7dk2p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu6dnkxroxnxu0gk7dk2p.png" alt="Alt Text" width="800" height="870"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Resolvers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DynamoDB
&lt;/h3&gt;

&lt;p&gt;Back on the Schema screen, select Attach for the resolver of "singleVehicle(...): Vehicle".&lt;/p&gt;

&lt;p&gt;Select vehicle_ddb as the data source and add the following for the request mapping template.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "version": "2017-02-28",
    "operation": "GetItem",
    "key": {
        "vehicle_id": $util.dynamodb.toDynamoDBJson($ctx.args.vehicle_id)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the following for the response template.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Pass back the result from DynamoDB.
$util.toJson($ctx.result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxyutwcnae6o7edkrap59.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fxyutwcnae6o7edkrap59.png" alt="Alt Text" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, some of the fields defined in the schema can already be queried. You can check this on the query screen in AppSync.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fn0ahhbixrrk6csbvyc72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fn0ahhbixrrk6csbvyc72.png" alt="Alt Text" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Lambda
&lt;/h3&gt;

&lt;p&gt;On the schema definition screen scroll down to the fuel field and click attach.&lt;/p&gt;

&lt;p&gt;Select the Lambda function created earlier, enable the response mapping template, and add the following.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$util.toJson($context.result.get("fuel"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzlec3xhmgbkarehtbz44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzlec3xhmgbkarehtbz44.png" alt="Alt Text" width="800" height="658"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Repeat these steps for the category field. The response mapping template should be defined as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$util.toJson($context.result.get("category"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbep94gmu98px2v0p7mby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbep94gmu98px2v0p7mby.png" alt="Alt Text" width="800" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  RDS
&lt;/h3&gt;

&lt;p&gt;On the schema definition screen scroll down to the accident_date field and click attach.&lt;/p&gt;

&lt;p&gt;Select the RDS database created earlier. Configure the request mapping template as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "version": "2018-05-29",
    "statements": [
            $util.toJson("select accident_date from accident WHERE vehicle_id = '$ctx.source.vehicle_id'")
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Specify the response mapping template as below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#if($ctx.error)
    $util.error($ctx.error.message, $ctx.error.type)
#end
#set($output = $utils.rds.toJsonObject($ctx.result)[0])
## Make sure to handle instances where fields are null
## or don't exist according to your business logic
#foreach( $item in $output )
    #set($accident_date = $item.get('accident_date'))
#end
$util.toJson($accident_date)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjzux8bs04x23qxcambl4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjzux8bs04x23qxcambl4.png" alt="Alt Text" width="800" height="734"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Repeat these steps for the accident_damage and accident_cost fields. The request and response mapping templates are shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "version": "2018-05-29",
    "statements": [
            $util.toJson("select damage from accident WHERE vehicle_id = '$ctx.source.vehicle_id'")
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#if($ctx.error)
    $util.error($ctx.error.message, $ctx.error.type)
#end
#set($output = $utils.rds.toJsonObject($ctx.result)[0])
## Make sure to handle instances where fields are null
## or don't exist according to your business logic
#foreach( $item in $output )
    #set($damage = $item.get('damage'))
#end
$util.toJson($damage)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "version": "2018-05-29",
    "statements": [
            $util.toJson("select cost from accident WHERE vehicle_id = '$ctx.source.vehicle_id'")
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#if($ctx.error)
    $util.error($ctx.error.message, $ctx.error.type)
#end
#set($output = $utils.rds.toJsonObject($ctx.result)[0])
## Make sure to handle instances where fields are null
## or don't exist according to your business logic
#foreach( $item in $output )
    #set($cost = $item.get('cost'))
#end
$util.toJson($cost)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Query
&lt;/h1&gt;

&lt;p&gt;The three data sources are now in place to resolve all the fields for our API. Go back to the query screen and check that the fields all get populated when you run a query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftbhg5nx2abhlx7osd94y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftbhg5nx2abhlx7osd94y.png" alt="Alt Text" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw4y9hgmq6i804zoxr9zu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw4y9hgmq6i804zoxr9zu.png" alt="Alt Text" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Tips
&lt;/h1&gt;

&lt;p&gt;Turn on CloudWatch Logs so you can see details of any errors. You can do this under settings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvtqyd3hrjqxqqcse9nv9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvtqyd3hrjqxqqcse9nv9.png" alt="Alt Text" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following webpages were useful to me when getting started with this demo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/appsync/latest/devguide/resolver-mapping-template-reference-programming-guide.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/appsync/latest/devguide/resolver-mapping-template-reference-programming-guide.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://adrianhall.github.io/cloud/2019/01/03/early-return-from-graphql-resolvers/" rel="noopener noreferrer"&gt;https://adrianhall.github.io/cloud/2019/01/03/early-return-from-graphql-resolvers/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://stackoverflow.com/questions/58031076/aws-appsync-rds-util-rds-tojsonobject-nested-objects" rel="noopener noreferrer"&gt;https://stackoverflow.com/questions/58031076/aws-appsync-rds-util-rds-tojsonobject-nested-objects&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/xai1983kbu/apollo-server/blob/pulumi_appsync_2/bff_pulumi/graphql/resolvers/Query.message.js" rel="noopener noreferrer"&gt;https://github.com/xai1983kbu/apollo-server/blob/pulumi_appsync_2/bff_pulumi/graphql/resolvers/Query.message.js&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>graphql</category>
      <category>database</category>
      <category>analytics</category>
    </item>
  </channel>
</rss>
