DEV Community

Peter Hanssens #BlackLivesMatter for AWS Heroes

My favourite re:Invent data announcements

Rahul Pathak, VP of Analytics, Amazon Web Services

Hi everyone,

What a re:Invent it has been so far, with so many announcements across the board. My name is Peter Hanssens and I am a Serverless Hero based out of Sydney, Australia, where I also run a Data Engineering meetup. I thought I'd spend some time talking about the announcements that are of interest to folks working within the data ecosystem.

Many of the announcements listed below are from Rahul Pathak's leadership session on harnessing the power of data with AWS analytics - well worth a watch if you haven't done so already.

Redshift

Redshift is a cloud data warehouse that, up until last re:Invent, coupled compute and storage. The RA3 instances have now been around for a year, but the new XLPlus instances are available at a much lower price point, which is great for established startups that want to take advantage of being able to scale compute and storage independently.

Here are my top announcements for Redshift:

Glue

AWS Glue Elastic Views

AWS Glue is a serverless (Yay!) ETL tool with a data catalogue baked in. There have been some wonderful announcements across re:Invent as well as pre:Invent!

  • Preview - Elastic Views - use SQL to query across source data from RDS, Aurora, and DynamoDB, and continuously surface the results as a materialised view in a variety of destinations including Redshift, S3 and Elasticsearch Service.

  • Pre:Invent - Schema Registry - this service helps teams collaborate on maintaining data schemas and supports schema evolution. It integrates with MSK, Kinesis and Lambda out of the box!

  • Pre:Invent - DataBrew - this service is all about making data preparation easier, the idea being that it tackles the challenge of data scientists spending 80% of their time on data prep - very much looking forward to exploring this service further.
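To make the schema evolution point from the Schema Registry bullet a little more concrete, here is a minimal sketch in plain Python of a simplified, Avro-style backward compatibility check. It is not the Glue Schema Registry API: the function name, the dict-based schema layout and the example fields are all hypothetical, and real compatibility rules cover more cases (type promotion, for example).

```python
def is_backward_compatible(old_fields, new_fields):
    """Return True if a reader using new_fields can still read old records.

    Each schema is a dict of field name -> {"type": str, optional "default"}.
    This layout is an assumption made for illustration only.
    """
    for name, spec in new_fields.items():
        if name in old_fields:
            # A field present in both versions must keep its type.
            if old_fields[name]["type"] != spec["type"]:
                return False
        elif "default" not in spec:
            # A newly added field needs a default to fill in for old records.
            return False
    # Fields dropped from the new schema are simply ignored by the reader.
    return True


# Hypothetical schema versions for a payments event.
v1 = {"user_id": {"type": "string"}, "amount": {"type": "double"}}
v2 = {**v1, "currency": {"type": "string", "default": "AUD"}}
v3 = {**v1, "currency": {"type": "string"}}

print(is_backward_compatible(v1, v2))
print(is_backward_compatible(v1, v3))
```

Here v2 passes because the new field carries a default, while v3 fails because a reader would have nothing to put in `currency` for records written with v1. A registry that enforces a rule like this at registration time is what lets producers and consumers evolve independently.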

Lake Formation

HealthLake

Lake Formation packages a set of best practices for rolling out a data lake on AWS, including security and governance.

  • Preview - Transactions, Row-level Security, and Acceleration - bringing lakehouse features to the data lake.

  • HealthLake - using the FHIR industry standard to bring together lots of disparate and unstructured data sources allowing for powerful querying and search capabilities.

EMR

EMR is a big data processing platform that gives you access to open source tools such as Presto, Spark, Flink and Hive to name a few.

  • EMR Studio - a fully managed Jupyter notebook environment with a rich feature set that you can log into using SSO and your corporate credentials.

  • EMR on EKS - now you can run Spark jobs on EKS with the rich feature set that EMR brings to the table.

  • Graviton2 instances - Graviton2 has been a revolution in compute performance and now it's doing its thing with EMR, with up to 30% lower cost and up to 15% improved performance.

AppFlow

AppFlow allows you to securely transfer data between SaaS apps (such as Salesforce, Marketo, and Slack) and AWS services (such as S3 and Redshift).

  • Lookout for Metrics integration - you can now detect anomalies and unexpected changes in your metrics without needing to have machine learning expertise.

Batch

Batch is a service that optimally provisions the type and quantity of compute for batch processes that you would like to run.

  • Fargate support - now you can submit your Batch jobs without needing to worry about patching your EC2 instances!

Neptune

Neptune is a fast and reliable managed graph database service - many data teams are using graph databases to store metadata and lineage for their data lakes.

  • ML Integration - this allows you to run Graph Neural Networks over your data and return results within hours as opposed to weeks with traditional tabular methods.

Managed Airflow

Airflow Dag

Last but definitely not least, we have Airflow, a workflow orchestration tool that allows data engineers to create DAGs (directed acyclic graphs) to manage dependencies across various data pipelines. Managing an Airflow cluster can easily require a lot of effort, so having this as a managed service is a huge win for data engineering teams already managing their own clusters.

  • Pre:Invent - MWAA (Amazon Managed Workflows for Apache Airflow) - a new managed service that allows you to deploy Airflow at scale rapidly.
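If the DAG idea is new to you, here is a minimal sketch of what "managing dependencies" means, using only Python's standard library (`graphlib`, available from Python 3.9). This is not Airflow's API: the task names and the `run_order` helper are hypothetical, and a real orchestrator adds scheduling, retries and parallelism on top of this ordering step.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "load_warehouse": {"join_orders_customers"},
    "refresh_dashboard": {"load_warehouse"},
}


def run_order(dag):
    """Return one valid execution order for the tasks in dag.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is exactly what the "acyclic" in DAG rules out.
    """
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
print(order)
```

The two extract tasks have no dependencies, so either may come first, but the join always runs after both of them and the dashboard refresh always runs last.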

Thanks for sticking with me for the long read - hope you enjoyed the wrap - and let me know what your pick of the lot is!
