Want to build an end-to-end data pipeline in AWS?
You're in luck! In this post, I will introduce you to AWS' Big Data portfolio.
TL;DR (below image)
In my future posts, I will be going into details on these services, so make sure you watch out for that!
Before we dive into the suite of AWS Services, let's first define what Big Data is.
Big data can be described in terms of data management challenges that – due to increasing volume, velocity and variety of data – cannot be solved with traditional databases.
To put it more simply: a data set is considered big data when it is too big or complex to be stored or analyzed by traditional data systems.
Obviously, there are many definitions of Big Data around the web, but for me, this is the simplest one to understand.
Now that we've defined what Big Data is, we'll proceed with the AWS Services that will help you answer those challenges.
The collection of raw data has always been a challenge for many organizations, especially for us developers, because the data lives in different, complex source systems scattered across the company: ERP systems, CRM systems, transactional DBs, and so on.
You also have to think about how you would integrate the data between these systems to create a unified view of your data.
AWS helps make these steps easier, allowing us developers to ingest data of all kinds, from structured to unstructured and from real-time to batch.
AWS Direct Connect is a networking service that provides an alternative to using the internet to connect to AWS.
Using AWS Direct Connect, data that would have previously been transported over the internet is delivered through a private network connection between your facilities and AWS.
This is useful if you want consistent network performance or if you have workloads that are bandwidth-heavy. I personally haven't tried this yet. In most of our implementations, we just use AWS Site-to-Site VPN.
Easily collect, process, and analyze video and data streams in real time
Amazon Kinesis enables you to process and analyze data as it arrives and respond instantly instead of having to wait until all your data is collected before the processing can begin.
Amazon Kinesis is fully managed and runs your streaming applications without requiring you to manage any infrastructure.
Kinesis has four capabilities, namely:
- Kinesis Video Streams
- Kinesis Data Streams
- Kinesis Data Firehose
- Kinesis Data Analytics
Capture, process, and store video streams
Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.
Capture, process, and store data streams
Kinesis Data Streams is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second from hundreds of thousands of sources.
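To give you a feel for what producing to Kinesis Data Streams looks like from Python, here is a minimal sketch that shapes events into the entries boto3's `put_records` call expects and splits them into batches. The stream name `clickstream-demo`, the `device_id` field, and the helper names are all made up for illustration, and the batch size of 500 is the per-request limit I assume for `PutRecords`, so check it against the current AWS docs before relying on it.

```python
import json
from typing import Iterator

# Assumed PutRecords per-request limit; verify against the current AWS docs.
MAX_RECORDS_PER_CALL = 500


def to_kinesis_entry(event: dict, key_field: str = "device_id") -> dict:
    """Turn one event into a PutRecords entry (Data bytes + PartitionKey)."""
    return {
        "Data": json.dumps(event).encode("utf-8"),
        # Records that share a partition key are routed to the same shard.
        "PartitionKey": str(event[key_field]),
    }


def batches(entries: list, size: int = MAX_RECORDS_PER_CALL) -> Iterator[list]:
    """Yield consecutive slices of at most `size` entries."""
    for start in range(0, len(entries), size):
        yield entries[start:start + size]


def send(events: list, stream_name: str = "clickstream-demo") -> None:
    """Send events to a (hypothetical) stream; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("kinesis")
    entries = [to_kinesis_entry(e) for e in events]
    for batch in batches(entries):
        response = client.put_records(StreamName=stream_name, Records=batch)
        # A non-zero FailedRecordCount means some entries should be retried.
        if response["FailedRecordCount"]:
            print(f"{response['FailedRecordCount']} records failed; retry them")
```

The partition key is the design choice to think about here: keying by device keeps each device's events together, at the cost of uneven shards if one device is much chattier than the rest.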
Load data streams into AWS data stores
Kinesis Data Firehose is the easiest way to capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools.
Analyze data streams with SQL or Apache Flink
Kinesis Data Analytics is the easiest way to process data streams in real time with SQL or Apache Flink without having to learn new programming languages or processing frameworks.
An interesting way to move your data from on-premises to the AWS Cloud would be AWS Snowball, a service that provides secure, rugged devices so you can bring AWS computing and storage capabilities to your edge environments and transfer data into and out of AWS.
I personally haven't tried this yet but would love to do so in the future!
The most famous AWS Service would be Amazon S3, an object storage service built to store and retrieve any amount of data from anywhere.
It’s a simple storage service that offers industry-leading durability, availability, performance, security, and virtually unlimited scalability at very low costs.
Amazon S3 was one of the very first AWS services, launched back in 2006!
S3 Glacier is an extremely low-cost storage service that provides secure, durable, and flexible storage for data backup and archival.
This makes it excellent for businesses or organizations that need to retain their data for years or even decades!
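One common way to get data into Glacier is an S3 lifecycle rule that transitions older objects automatically. Below is a rough sketch of building such a rule as the dictionary that boto3's `put_bucket_lifecycle_configuration` accepts. The prefix `raw/logs/`, the day counts, and the function names are hypothetical values picked for illustration, not recommendations.

```python
from typing import Optional


def glacier_lifecycle(prefix: str, archive_after_days: int,
                      expire_after_days: Optional[int] = None) -> dict:
    """Build a lifecycle configuration that archives objects under `prefix`."""
    rule = {
        # Rule ID derived from the prefix, e.g. "raw/logs/" -> "archive-raw-logs".
        "ID": "archive-" + (prefix.strip("/").replace("/", "-") or "all"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
    }
    if expire_after_days is not None:
        # Optionally delete objects once the retention period is over.
        rule["Expiration"] = {"Days": expire_after_days}
    return {"Rules": [rule]}


def apply_lifecycle(bucket: str, config: dict) -> None:
    """Apply the configuration to a bucket; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```

For example, `glacier_lifecycle("raw/logs/", 90, expire_after_days=3650)` would archive objects under that prefix after 90 days and expire them after roughly ten years.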
I'm honestly a big fan of Amazon S3, given how scalable and how easy it is to use. I'll just say that if you aren't using Amazon S3 for your data lakes, then you are missing out on a lot of things lol.
There are obviously a lot of factors that need to be considered when building your Big Data project. Any big data platform needs a secure, scalable, and durable repository to store data before or even after processing tasks. AWS provides you with services depending on your specific requirements.
DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale.
It is one of the AWS Services that is fully managed, meaning that you don't have to worry about setting up the infrastructure or applying software updates. You just use the service.
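To make the key-value and document idea a bit more concrete, here is a small sketch of how a plain Python record maps to the typed attribute format that DynamoDB's low-level `put_item` call uses. The table name `orders-demo` and the record fields are invented for this example, and the converter only handles a few scalar types on purpose.

```python
def to_attribute(value) -> dict:
    """Map a Python scalar to DynamoDB's typed attribute-value format."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers travel as strings on the wire
    if isinstance(value, str):
        return {"S": value}
    raise TypeError(f"unsupported type for this sketch: {type(value).__name__}")


def to_item(record: dict) -> dict:
    """Convert a flat record into a put_item-style Item."""
    return {name: to_attribute(value) for name, value in record.items()}


def save(record: dict, table_name: str = "orders-demo") -> None:
    """Write the record to a (hypothetical) table; needs boto3 and credentials."""
    import boto3  # imported lazily so the converters above work without it

    boto3.client("dynamodb").put_item(TableName=table_name, Item=to_item(record))
```

In practice you rarely write this conversion yourself, since boto3's higher-level `Table` resource does it for you, but seeing the raw shape helps when reading API responses or debugging.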
Amazon RDS is a managed service that makes it easy to set up, operate, and scale a relational database in the cloud.
Amazon RDS supports Amazon Aurora, MySQL, MariaDB, Oracle, SQL Server, and PostgreSQL database engines.
Amazon Aurora is a relational database engine that combines the speed and reliability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.
This is the step where data is transformed from its raw state into a consumable format – usually by means of sorting, aggregating, joining and even performing more advanced functions and algorithms.
The resulting data sets are then stored for further processing or made available for consumption via business intelligence and data visualization tools.
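Before we get to the services, here is a tiny, service-agnostic sketch of what "joining, aggregating, and sorting" means in practice, using plain Python. The `orders` and `customers` records are made-up sample data. The services below do the same kind of work, just at a much larger scale.

```python
from collections import defaultdict

# Made-up raw data, as it might arrive from two different source systems.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 120.0},
    {"order_id": 2, "customer_id": "c2", "amount": 35.5},
    {"order_id": 3, "customer_id": "c1", "amount": 80.0},
]
customers = {"c1": {"region": "APAC"}, "c2": {"region": "EMEA"}}


def revenue_by_region(orders: list, customers: dict) -> list:
    """Join orders to customers, sum amounts per region, sort by revenue."""
    totals = defaultdict(float)
    for order in orders:
        region = customers[order["customer_id"]]["region"]  # join
        totals[region] += order["amount"]                   # aggregate
    # Sort descending so the highest-revenue region comes first.
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


print(revenue_by_region(orders, customers))
# [('APAC', 200.0), ('EMEA', 35.5)]
```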
Amazon Redshift is one of the most widely used cloud data warehouses.
It makes it fast, simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools.
It allows you to run complex analytic queries against terabytes to petabytes of structured and semi-structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution.
We've had some successful implementations on Redshift, and I can share some of my experiences with it in a future post, so watch out for that.
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
We've used Athena a lot in our implementations, and I must say that it really helped us with data exploration and data validation.
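As an illustration of the kind of validation query I mean, here is a hedged sketch that prepares the arguments for boto3's `start_query_execution`. The database `sales_db`, the `orders` table and its columns, and the results bucket are all hypothetical; swap in your own.

```python
# A typical sanity check: row count, duplicate keys, and missing values.
VALIDATION_SQL = """
SELECT COUNT(*)                                         AS total_rows,
       COUNT(DISTINCT order_id)                         AS distinct_orders,
       SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END)  AS null_amounts
FROM orders
WHERE order_date = DATE '2021-01-01'
"""


def athena_request(sql: str, database: str, output_location: str) -> dict:
    """Build the keyword arguments for Athena's start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results as files under this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run(request: dict) -> str:
    """Submit the query and return its execution ID; needs boto3 and credentials."""
    import boto3  # imported lazily so the builder above works without it

    athena = boto3.client("athena")
    return athena.start_query_execution(**request)["QueryExecutionId"]
```

Note that the call is asynchronous: it returns an execution ID right away, and you poll for the query status and fetch the results separately.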
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
AWS Glue has evolved significantly from its initial release 0.9 to AWS Glue 2.0. Along with that are enhancements that glue (pun intended) all your pipelines together. Definitely worth looking into.
Amazon EMR is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data.
It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
Glue is serverless, meaning you don't need to provision your own servers. EMR, by contrast, gives you more flexibility over the cluster itself, so you can size it depending on how "big" your data processing workloads are.
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud.
It's basically your virtual machine in the cloud, with a lot of use cases, living up to its name "Elastic".
Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly.
SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high quality models.
AWS re:Invent 2020 introduced a lot of significant improvements to Amazon SageMaker, such as Data Wrangler, Clarify, SageMaker Pipelines, and many more! I'll be doing a deep dive on these exciting features soon!
Big data is all about getting high value, actionable insights from your data assets.
Ideally, data is made available to stakeholders through self-service business intelligence and agile data visualization tools that allow for fast and easy exploration of datasets.
Depending on the type of analytics, end-users may also consume the resulting data in the form of statistical “predictions” – in the case of predictive analytics – or recommended actions – in the case of prescriptive analytics.
Amazon QuickSight is a very fast, easy-to-use, cloud-powered business analytics service that makes it easy for all employees within an organization to build visualizations, perform ad-hoc analysis, and quickly get business insights from their data, anytime, on any device.
QuickSight is easy to use and has made some major improvements since it was publicly released. It's still fairly new compared to other major BI tools, but I think it has potential. Look into QuickSight if you want a cost-effective BI solution.
That's it for me. Would love to hear your thoughts!