Welcome, DataEngHack online!

Peter Hanssens #BlackLivesMatter — Thu, 28 Apr 2022 01:57:00 +0000

Hey folks,

Peter Hanssens here... welcome to the online DataEngHack blogging competition that we are running in the month of May!

This blogging competition is designed to get you hands-on with some cutting-edge data engineering technology and to earn you public exposure for your awesome work in the process. You will be featured on the DataEngAu website and you have the chance to win an awesome set of prizes. You can submit your blog right away; to be in the running for a prize, it must be submitted by the 31st of May.

Prizes

So first up, anyone who submits a blog (with a few caveats around it being an appropriate data engineering blog) will be sent a free DataEngHack t-shirt.

The top 10 blogs, as determined by weighted popularity, will each be given one of the following:

one of 5 Lego kits valued at around $100, including the Lego Vespa 125
Zhamak Dehghani's book on Data Mesh

Technology sponsors

This is a sponsored event and as such we encourage our participants to use at least one of our technology partners' technologies in their solution. These vendors are leaders in their space and often provide really fun and innovative ways of achieving great Data Engineering outcomes, so why not give them a go!

So our sponsors are:

How to get involved

See the comments below for the way to register!

This blog was originally published on DataEngAu

My favourite re:Invent data announcements

Peter Hanssens #BlackLivesMatter — Thu, 17 Dec 2020 03:18:28 +0000

Hi everyone,

What a re:Invent it has been so far with so many announcements across the board. My name is Peter Hanssens and I am a Serverless Hero based out of Sydney, Australia where I also run a Data Engineering meetup. I thought I'd spend some time talking about some announcements that are of interest to folks working within the data ecosystem.

Many of the announcements listed below are from Rahul Pathak's leadership session on harnessing the power of data with AWS analytics - well worth a watch if you haven't done so already.

Redshift

Redshift is a cloud data warehouse that, up until last re:Invent, coupled compute and storage. The RA3 instances have now been around for a year, but the new XLPlus instances are available at a much lower price point, which is great for established startups wanting to take advantage of being able to scale compute and storage independently.

Here are my top announcements for Redshift:

Amazon Redshift launches RA3.xlplus nodes with managed storage
Automatic Table Optimization - this is huge as you no longer need to think about distribution or sort keys!
Preview - AQUA for Redshift - game-changing query performance - this looks to be a quantum leap forward for Redshift.
Preview - native JSON support - JSON and semi-structured data are a feature of many modern data sources and being able to parse this natively within Redshift means less pre-work in landing data into your warehouse.
Preview - Federated query support for RDS and Aurora MySQL - this makes it even easier to ingest data into your data warehouse.
Preview - Amazon Redshift ML - another feature enabling data engineers to do more within the comforts of a data warehouse using SQL - very keen to see what folks can build with this great functionality.
Preview - Data Sharing - a great new feature that allows companies to share data with third parties.
Preview - Native console integration with partners - another preview aimed at making data integration much faster with third parties such as Salesforce and Slack.
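
To give a feel for the kind of "pre-work" that native JSON support can take off your plate, here is a rough sketch in plain Python of flattening a nested record into column-like keys before loading it. This is not Redshift code: the `flatten` helper and the sample `event` record are made up purely for illustration, and a real pipeline would also need a policy for arrays.

```python
from typing import Any


def flatten(record: dict[str, Any], parent: str = "", sep: str = "_") -> dict[str, Any]:
    """Flatten nested dicts into a single level of column-like keys."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        column = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, prefixing with the parent key.
            flat.update(flatten(value, column, sep))
        else:
            # Scalars and lists are kept as-is in this sketch.
            flat[column] = value
    return flat


# Hypothetical event record, invented for illustration only.
event = {"id": 1, "user": {"name": "sam", "geo": {"city": "Sydney"}}, "tags": ["a", "b"]}
row = flatten(event)
print(row)
```

With native semi-structured support, the idea is that the nested record can be landed as-is and navigated at query time, rather than reshaped up front like this.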

Glue

AWS Glue is a serverless (Yay!) ETL tool with a data catalogue baked in. There have been some wonderful announcements across re:Invent as well as pre:Invent!

Preview - Elastic Views - source data from RDS, Aurora, and DynamoDB using SQL to query across them and surface the results continuously in a materialised view to a variety of destinations including Redshift, S3 and Elasticsearch Service.
Pre:Invent - Schema Registry - this service allows better collaboration across teams maintaining data schemas which allows for schema evolution. It integrates with MSK, Kinesis and Lambda out of the box!
Pre:Invent - DataBrew - this service is all about making data preparation easier, aiming to address the often-cited challenge of data scientists spending 80% of their time on data prep - very much looking forward to exploring this service further.

Lake Formation

Lake Formation is a set of best practices for rolling out a data lake on AWS, including security and governance.

Preview - Transactions, Row-level Security, and Acceleration - bringing lakehouse features to the data lake.
HealthLake - using the FHIR industry standard to bring together lots of disparate and unstructured data sources allowing for powerful querying and search capabilities.

EMR

EMR is a big data processing platform that gives you access to open source tools such as Presto, Spark, Flink and Hive to name a few.

EMR Studio - a fully managed Jupyter Notebook environment with a rich feature set that you can log into using SSO and your corporate credentials.
EMR on EKS - now you can run Spark jobs on EKS with the rich feature set that EMR brings to the table.
Graviton2 instances - Graviton2 has been a revolution in compute performance and now it's doing its thing with EMR, with up to 30% lower cost and up to 15% improved performance.

AppFlow

AppFlow allows you to securely transfer data between SaaS apps such as Salesforce, Marketo, and Slack and AWS Services such as S3 and Redshift.

Lookout for Metrics integration - you can now detect anomalies and unexpected changes in your metrics without needing to have machine learning expertise.

Batch

Batch is a service that optimally provisions the type and quantity of compute for batch processes that you would like to run.

Fargate support - now you can submit your Batch jobs without needing to worry about patching your EC2 instances!

Neptune

Neptune is a fast and reliable managed graph database service - many data teams are using graph databases to store metadata and lineage for their data lakes.

ML Integration - this allows you to run Graph Neural Networks over your data and return results within hours as opposed to weeks with traditional tabular methods.
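
As a small illustration of why lineage fits a graph so naturally, here is a sketch in plain Python (not Neptune, and not a graph query language) that walks a lineage graph to find everything a dataset depends on. The dataset names and the `all_upstream` helper are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: dataset -> datasets it is derived from.
upstream = {
    "sales_dashboard": ["sales_mart"],
    "sales_mart": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["customers_raw"],
}


def all_upstream(dataset: str, edges: dict[str, list[str]]) -> set[str]:
    """Breadth-first walk collecting every dataset that feeds into `dataset`."""
    seen: set[str] = set()
    queue = deque(edges.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        queue.extend(edges.get(current, []))
    return seen


sources = all_upstream("sales_dashboard", upstream)
print(sorted(sources))
```

A graph database lets you ask this sort of "what feeds into this?" question directly, without hand-rolling the traversal.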

Managed Airflow

Last but definitely not least, we have Airflow, which is a workflow orchestration tool that allows data engineers to create DAGs, or directed acyclic graphs, to manage dependencies across various data pipelines. Managing an Airflow cluster can easily require a lot of effort, so having this in a managed service is a huge win for data engineering teams already managing their own clusters.

Pre:Invent - MWAA - a new serverless service that allows you to deploy Airflow at scale rapidly.
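
If the DAG idea is new to you, here is a minimal sketch of it using only Python's standard library `graphlib` module (Python 3.9+). To be clear, this is not Airflow code: the task names and the `run_order` helper are hypothetical, and it only shows how declaring dependencies lets a scheduler work out a valid run order.

```python
import graphlib

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "load_warehouse": {"join_orders_customers"},
}


def run_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order.

    Raises graphlib.CycleError if the graph contains a cycle,
    which is exactly why the graph has to be acyclic.
    """
    return list(graphlib.TopologicalSorter(dag).static_order())


order = run_order(pipeline)
print(order)
```

The two extract tasks have no dependencies, so they can come out in either order; what is guaranteed is that the join runs after both of them and the load runs last.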

Thanks for sticking with me for the long read - hope you enjoyed the wrap - and let me know which is your pick out of the lot!

DEV Community: Peter Hanssens #BlackLivesMatter