Simplifying an event-driven pipeline with AWS Lambda & EMR

A batch event-driven pipeline could contain the following steps:

  1. A file containing new records would be dropped into an S3 landing bucket directory by a source system.
  2. A Lambda function, triggered by new objects in the S3 landing directory, would either initialize a new transient EMR cluster with the new file as a Spark step, or add the file as a step to an already running cluster. The check for a running cluster could be performed against the EMR cluster name (see the sketch after this list).
  3. The EMR Spark process would perform the data transformations & enrichments using third-party data stored in an RDS instance. The produced file would be exported to an output bucket directory (a PySpark sketch follows below).
  4. An Athena table would point at the S3 output bucket directory so users can perform analytics on the produced data (an example table definition follows below).
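
For step 2, here is a minimal sketch of the Lambda handler using boto3, assuming a reusable cluster named batch-pipeline and a transform script at s3://scripts-bucket/transform.py (all names, instance types, and roles are illustrative):

```python
import boto3

emr = boto3.client("emr")

CLUSTER_NAME = "batch-pipeline"  # hypothetical cluster name used for the lookup


def spark_step(s3_path):
    # EMR step definition that runs spark-submit against the new file
    return {
        "Name": f"process {s3_path}",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://scripts-bucket/transform.py", s3_path],
        },
    }


def handler(event, context):
    # S3 put event from the landing directory (the key may need URL-decoding in practice)
    record = event["Records"][0]["s3"]
    s3_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    # Check for a live cluster with the expected name
    clusters = emr.list_clusters(ClusterStates=["WAITING", "RUNNING"])["Clusters"]
    match = next((c for c in clusters if c["Name"] == CLUSTER_NAME), None)

    if match:
        # Reuse the existing cluster: submit the file as an additional step
        emr.add_job_flow_steps(JobFlowId=match["Id"], Steps=[spark_step(s3_path)])
    else:
        # Launch a new transient cluster that terminates once its steps finish
        emr.run_job_flow(
            Name=CLUSTER_NAME,
            ReleaseLabel="emr-6.15.0",
            Applications=[{"Name": "Spark"}],
            Instances={
                "InstanceGroups": [
                    {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                    {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
                ],
                "KeepJobFlowAliveWhenNoSteps": False,
            },
            Steps=[spark_step(s3_path)],
            JobFlowRole="EMR_EC2_DefaultRole",
            ServiceRole="EMR_DefaultRole",
        )
```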
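For step 3, a sketch of what the transform script itself might look like as a PySpark job; the RDS endpoint, credentials, column names, and paths are placeholders, and the JDBC driver is assumed to be available on the cluster:

```python
import sys

from pyspark.sql import SparkSession

input_path = sys.argv[1]                       # passed by the Lambda as a step argument
output_path = "s3://output-bucket/processed/"  # directory the Athena table points at

spark = SparkSession.builder.appName("batch-enrichment").getOrCreate()

records = spark.read.json(input_path)

# Third-party reference data pulled from the RDS instance over JDBC
reference = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://rds-host:5432/refdb")
    .option("dbtable", "public.reference_data")
    .option("user", "etl_user")
    .option("password", "secret")  # in practice, fetch from Secrets Manager
    .load()
)

# Enrich the new records and write the result where Athena can query it
enriched = records.join(reference, on="customer_id", how="left")
enriched.write.mode("append").parquet(output_path)

spark.stop()
```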
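For step 4, the table only needs to be created once; one way is to run the DDL through boto3's Athena client. The database, schema, and bucket names here are illustrative and must match the Spark output:

```python
import boto3

athena = boto3.client("athena")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.processed_records (
    customer_id string,
    amount double,
    segment string
)
STORED AS PARQUET
LOCATION 's3://output-bucket/processed/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},
)
```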

EMR Configurations

Step Concurrency - should be greater than one only if the data pipeline can handle parallel step executions.
Auto-Termination - if enabled, the cluster is transient. This is suggested for unpredictable and unscheduled loads, as keeping an EMR cluster running indefinitely is not cost-effective. An EMR cluster incurs no charges while bootstrapping.
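
Both settings can also be applied to a running cluster through boto3; a minimal sketch, assuming an existing cluster (the cluster ID is a placeholder):

```python
import boto3

emr = boto3.client("emr")

# Placeholder; in practice this comes from run_job_flow or list_clusters
cluster_id = "j-XXXXXXXXXXXXX"

# Allow up to five steps to execute in parallel (1 disables concurrency)
emr.modify_cluster(ClusterId=cluster_id, StepConcurrencyLevel=5)

# Terminate the cluster after one hour of idleness instead of leaving it running
emr.put_auto_termination_policy(
    ClusterId=cluster_id,
    AutoTerminationPolicy={"IdleTimeout": 3600},
)
```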
