
Data Engineering Series #2: Cloud Services and FOSS in Data Engineer's world

Srinidhi ・4 min read

Let's Engineer This - The Data Engineering Series (2 Part Series)

1) Data Engineering Series #1: 10 Key tech skills you need, to become a competent Data Engineer. 2) Data Engineering Series #2: Cloud Services and FOSS in Data Engineer's world

Open Source Software (OSS) frameworks have improved the quality of Big Data processing, with a diverse set of tools addressing numerous use cases.

In fact, if you are part of a team building a modern data architecture, chances are high that you are using an open-source stack.


Similarly, Cloud Computing has enabled Big Data solutions that are scalable and cost-effective in the analytics space.


Open Source and Cloud: The Correlation

In the cloud ecosystem, many commercially available cloud services fall into one of three categories:

Similar to an OSS ➡ Offers similar features. (E.g., AWS Step Functions and Apache Airflow)

OR

Modeled after an OSS ➡ Follows or inherits the design principles of an existing open-source framework. (E.g., AWS Kinesis and Apache Kafka)

OR

Managed service of an OSS ➡ Takes care of deploying and maintaining the OSS framework so that it is ready to use. (E.g., AWS RDS for PostgreSQL and PostgreSQL)
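
The three categories can be written down as a small data structure, which makes it easy to tag and filter services later. This is only an illustrative sketch: the names `Relation`, `Mapping`, `EXAMPLES`, and `by_relation` are made up for this post and are not part of any AWS or OSS library.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """How a cloud service relates to an open-source project."""
    SIMILAR_TO = "Similar to"
    MODELED_AFTER = "Modeled after"
    MANAGED_SERVICE_OF = "Managed service of"


@dataclass(frozen=True)
class Mapping:
    aws_service: str
    relation: Relation
    oss: str


# The three example pairs named in the text above.
EXAMPLES = [
    Mapping("Step Functions", Relation.SIMILAR_TO, "Apache Airflow"),
    Mapping("Kinesis", Relation.MODELED_AFTER, "Apache Kafka"),
    Mapping("RDS for PostgreSQL", Relation.MANAGED_SERVICE_OF, "PostgreSQL"),
]


def by_relation(relation):
    """Return the AWS services that have the given relation to an OSS project."""
    return [m.aws_service for m in EXAMPLES if m.relation is relation]
```

For instance, `by_relation(Relation.MANAGED_SERVICE_OF)` picks out the services where the cloud provider runs the open-source software for you.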

To understand more, let's touch upon the basics...


Getting to know the cloud

The first hurdle many of us face when getting to know cloud services is figuring out where to start among the plethora of services available.

So, for ease of understanding, and irrespective of the cloud provider (AWS, Azure, GCP, etc.), let's group the big data related cloud services into these stages: Data Ingestion, Data Storage, Data Processing, Data Analysis & Visualization, and Deployment.


Now, let's try to understand the cloud ecosystem by comparing AWS cloud services with their equivalent open-source frameworks. (A similar comparison can be drawn for Azure and GCP as well.)

πŸ“ Data Ingestion:

AWS Service What it does Relation with OSS OSS Alternative
Kinesis Stream Processing Modelled After Apache Kafka
SQS Message Queue Similar to RabbitMQ
Managed Streaming for Kafka (MSK) Stream Processing Managed Service of Apache Kafka
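
One design idea that Kinesis shares with Kafka is splitting a stream into shards (Kinesis) or partitions (Kafka), so that records with the same key always land in the same place and keep their relative order. The sketch below illustrates only that idea. The hash-and-modulo rule is a simplified stand-in, not the actual partitioner of either system, and `pick_partition` and `route` are hypothetical names.

```python
import hashlib


def pick_partition(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index deterministically."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions


def route(records, num_partitions):
    """Group (key, value) records into per-partition lists, preserving arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[pick_partition(key, num_partitions)].append((key, value))
    return partitions
```

Because the mapping depends only on the key, every event for a given user or device ends up in one partition, which is what lets consumers process them in order.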

πŸ“ Data Storage:

AWS Service What it does Relation with OSS OSS Alternative
S3 Object store Similar to Minio, Swift, Ceph, ...
RDS Relational database Managed Service of MariaDB, MySQL, Postgres
DynamoDB NoSQL database Similar to Apache Cassandra
ElastiCache In-memory cache Managed Service of Memcached, Redis
Neptune Graph database Similar to Neo4j
Amazon QLDB Ledger database Modelled After Hyperledger
Amazon DocumentDB Document database Similar to MongoDB
AWS Lake Formation Data lake Similar to HDFS
EC2 EBS Block storage for EC2 Similar to OpenEBS, Portworx
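
For the "managed service of" rows, the practical consequence is that your client code talks to the same database engine either way, and usually only the endpoint changes. Here is a minimal sketch of that idea using a standard `postgresql://` connection URL. The helper `postgres_url`, the credentials, and both hostnames are hypothetical placeholders, and nothing here connects to a real server.

```python
from urllib.parse import quote


def postgres_url(host, database, user, password, port=5432):
    """Build a PostgreSQL connection URL; only `host` differs between deployments."""
    # Percent-encode credentials so special characters cannot break the URL.
    user_part = quote(user, safe="")
    password_part = quote(password, safe="")
    return f"postgresql://{user_part}:{password_part}@{host}:{port}/{database}"


# Hypothetical endpoints for illustration only.
SELF_HOSTED = postgres_url("localhost", "analytics", "etl", "secret")
MANAGED_RDS = postgres_url(
    "mydb.abc123.us-east-1.rds.amazonaws.com", "analytics", "etl", "secret"
)
```

Swapping one URL for the other is, in the simplest case, all it takes to move an application between a self-hosted database and the managed one.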

πŸ“ Data Processing:

AWS Service What it does Relation with OSS OSS Alternative
Elastic Map Reduce Hadoop Managed Service of Hadoop,
Step Functions Worflow Orchestrator Similar to Apache Airflow , Flyte
AWS Glue ETL Managed Service of Apache Spark
Lambda Serverless Similar to Knative, OpenFaaS, Fn
Batch Batch Job Computing Similar to Apache Airflow on Kubernetes
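
Workflow orchestrators such as Step Functions and Airflow differ in how you define a workflow, but both come down to running tasks in an order that respects their dependencies. The sketch below shows that core idea with Python's standard-library `graphlib` (Python 3.9+). It uses neither tool's API, and the task names in `PIPELINE` are invented for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "load_warehouse": {"aggregate"},
    "refresh_dashboard": {"load_warehouse"},
}


def run_order(dag):
    """Return one valid execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())
```

A real orchestrator adds scheduling, retries, and monitoring on top, but the dependency graph stays at the center.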

πŸ“ Data Analysis & Visualization:

AWS Service What it does Relation with OSS OSS Alternative
Amazon Redshift Data warehousing Similar to Spark SQL, Apache Hive, Presto
Athena Data warehousing Similar to Spark SQL, Apache Hive, Presto
CloudSearch Search Similar to Elasticsearch
Elasticsearch Service Search Managed Service of Elasticsearch
QuickSight Business analytics Similar to PowerBI

πŸ“ Deployment:

AWS Service What it does Relation with OSS OSS Alternative
Elastic Container Registry (ECR) Container registry Managed Service of Docker Registry, Quay
Elastic Container Service (ECS) Container orchestration Managed Service of Kubernetes, Marathon
Elastic Kubernetes Services (EKS) Container orchestration Managed Service of Kubernetes
Cloud Formation Infrastructure as a code Similar to Terraform
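
To make "infrastructure as code" concrete: with CloudFormation you describe resources in a JSON or YAML template instead of clicking through a console. Below is a minimal sketch that builds such a template as a Python dictionary and serializes it to JSON. The helper `bucket_template`, the logical ID `RawDataBucket`, and the description text are assumptions made for illustration.

```python
import json


def bucket_template(logical_id):
    """Return a minimal CloudFormation template declaring one S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative data lake bucket",
        "Resources": {
            # With no Properties given, CloudFormation applies default settings.
            logical_id: {"Type": "AWS::S3::Bucket"},
        },
    }


TEMPLATE_JSON = json.dumps(bucket_template("RawDataBucket"), indent=2)
```

The resulting JSON can be kept in version control and reviewed like any other code, which is the main appeal of this approach, and of Terraform as well.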

Some notable cloud adoption figures with respect to Big Data:

- To date, AWS users have launched more than 15 million Hadoop clusters (EMR or containerized versions).
- "Container-as-a-service" (EKS, ECS) and "database-as-a-service" (RDS, DynamoDB) are the most commonly used managed services in 2020.
- Database services usage is up 127% year over year.

Next Steps...

  1. You can see how these services are put to use in real-world use cases in this article.
  2. This whitepaper from AWS on Big Data is a good place to learn about its services.
  3. Then start getting hands-on by following this repo.

Going forward, I'll publish detailed posts on tools and frameworks used by Data Engineers day in and day out.

Follow for updates.

Posted on May 19 by:

Srinidhi

@srinidhi

I believe that the future belongs to polymaths who are skilled enough to combine their abilities in creative ways. Having worked in the field of Data Engineering, I would love to share my learning.

Discussion


Great job on your Data Engineer series so far!
I see lots of amazing talent coming out of Chennai

 

Good one for starters. Keep going...

 

Good write-up. Keep it going. 👍 I am just starting with Python and already loving it.

 

That is some good analysis right there! 💯