AWS EMR (Elastic MapReduce) Cheat Sheet for AWS Certified Data Engineer - Associate (DEA-C01)
Core Concepts and Building Blocks
Amazon EMR (Elastic MapReduce) is a cloud-based big data platform for processing and analyzing vast amounts of data using open-source tools such as Apache Hadoop, Apache Spark, HBase, Presto, and others.
Key Components:
Clusters: Collection of EC2 instances running open-source applications
Nodes: Individual EC2 instances in a cluster (master, core, task)
Steps: Work units submitted to a cluster
EMRFS: EMR File System for accessing S3 data directly
Instance Groups/Fleets: Ways to manage cluster capacity
EMR Architecture Mind Map
```
EMR Cluster
├── Master Node
│   ├── Resource Management
│   ├── Application Management
│   └── Cluster Management
├── Core Node
│   ├── HDFS Storage
│   └── Data Processing
├── Task Node
│   └── Additional Processing Capacity
└── EMR Services
    ├── EMR Notebooks
    ├── EMR Studio
    ├── EMR Serverless
    └── EMR on EKS
```
EMR Features and Details
| Feature | Description |
|---------|-------------|
| Deployment Options | EMR on EC2, EMR Serverless, EMR on EKS, EMR on Outposts |
| Release Types | Amazon EMR releases (e.g., 6.9.0), each bundling specific open-source application versions |
| Storage Options | HDFS, EMRFS (S3), local file system, EBS volumes |
| Security | IAM roles, security groups, encryption (at rest, in transit), Kerberos |
| Scaling | Manual scaling, automatic scaling, managed scaling |
| Purchasing Options | On-Demand, Reserved, Spot instances |
| Networking | Public subnet, or private subnet with NAT gateway |
| Monitoring | CloudWatch, Ganglia, Hadoop web interfaces |
| Integration | AWS Glue Data Catalog, Lake Formation, Step Functions |
| Cost Management | Instance fleets, Spot instances, auto-termination |
Node Types Comparison
| Node Type | Purpose | Minimum Required | Scaling | Data Loss Risk |
|-----------|---------|------------------|---------|----------------|
| Master Node | Manages the cluster, runs primary components | 1 (always) | Cannot scale | High (single point of failure) |
| Core Node | Runs tasks and stores data on HDFS | 1+ recommended | Can scale up/down (carefully) | High (HDFS data loss) |
| Task Node | Runs tasks only, no HDFS | 0+ (optional) | Can scale freely | None (no data storage) |
Instance Group vs Instance Fleet
| Feature | Instance Groups | Instance Fleets |
|---------|-----------------|-----------------|
| Instance Types | Same type per group | Multiple types per fleet |
| Purchasing Options | On-Demand or Spot per group | Mix of On-Demand and Spot |
| Capacity Strategy | Manual allocation | Target capacity in units |
| Availability Zones | Single AZ per group | Multiple subnets/AZs can be specified; EMR picks one at launch |
| Spot Behavior | Limited flexibility | Advanced allocation strategies |
| Complexity | Simpler to configure | More complex but flexible |
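To make the fleet column concrete, here is a sketch of a task instance fleet definition as it might appear in a boto3 `run_job_flow` call; the instance types, capacities, and timeout values are illustrative assumptions, not recommendations:

```python
# Hypothetical instance-fleet configuration for boto3 run_job_flow
# (instance types, weights, and target capacities are illustrative only).
task_fleet = {
    "Name": "task-fleet",
    "InstanceFleetType": "TASK",
    "TargetOnDemandCapacity": 2,   # capacity units satisfied by On-Demand
    "TargetSpotCapacity": 8,       # capacity units satisfied by Spot
    "InstanceTypeConfigs": [
        {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
        {"InstanceType": "r5.2xlarge", "WeightedCapacity": 2},
        {"InstanceType": "m5.4xlarge", "WeightedCapacity": 4},
    ],
    "LaunchSpecifications": {
        "SpotSpecification": {
            "TimeoutDurationMinutes": 10,
            "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            "AllocationStrategy": "capacity-optimized",
        }
    },
}
```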
Important EMR Pointers
EMR clusters can be launched in minutes and scaled to process petabytes of data.
Master node runs the primary components like ResourceManager, NameNode, and application clients.
Core nodes both store data (HDFS) and run tasks; removing core nodes risks data loss.
Task nodes only run tasks and don't store HDFS data; they can be added/removed freely.
EMR supports multiple open-source frameworks including Hadoop, Spark, HBase, Presto, and Hive.
EMR releases are versioned (e.g., EMR 6.9.0) and include specific versions of open-source applications.
EMRFS allows EMR clusters to directly use S3 as a data layer without loading into HDFS first.
S3DistCp utility helps efficiently move large amounts of data between HDFS and S3.
EMR clusters can be long-running or transient (auto-terminating after tasks complete).
Steps are work units that contain instructions for data processing (e.g., a Spark job); see the boto3 sketch after this list.
EMR Studio provides managed Jupyter notebooks for interactive data analysis.
EMR Serverless eliminates the need to configure clusters for Spark and Hive applications.
EMR on EKS allows running big data frameworks on Amazon EKS clusters.
Instance fleets provide more flexibility than instance groups for mixed instance types.
Spot instances can reduce costs by up to 90% but can be reclaimed with only a two-minute interruption notice.
Managed scaling automatically adjusts cluster capacity based on workload.
Maximum limit of 500 instances per cluster (can be increased via support ticket).
Bootstrap actions run custom scripts during cluster setup before applications start.
Security configurations provide encryption options for data at rest and in transit.
EMR supports VPC private subnets with NAT gateway for enhanced security.
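To tie several of the pointers above together (transient clusters, steps, and auto-termination), here is a minimal boto3 sketch that launches a cluster, runs one Spark step, and lets the cluster terminate itself; the bucket names, IAM roles, and region are placeholder assumptions:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

# Transient cluster: runs one Spark step, then terminates automatically.
response = emr.run_job_flow(
    Name="nightly-etl",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-log-bucket/emr-logs/",          # placeholder bucket
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,        # terminate after the last step
    },
    Steps=[
        {
            "Name": "spark-etl-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://my-code-bucket/etl_job.py"],  # placeholder script
            },
        }
    ],
)
print(response["JobFlowId"])
```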
EMR Storage Options
HDFS provides distributed storage across cluster nodes with data replication.
EMRFS extends Hadoop to directly access S3 data with consistency checking.
Local file system on EC2 instances provides temporary storage that's lost when instances terminate.
EBS volumes attached to EC2 instances provide additional storage capacity.
S3 is recommended for persistent storage as it survives cluster termination.
HDFS replication factor defaults to 3 but can be configured based on needs.
S3 data access from EMR can use EMRFS (s3://) or the open-source s3a:// connector; a PySpark example follows this list.
EMRFS consistent view tracked S3 object metadata in DynamoDB to detect listing inconsistencies; it is rarely needed now that S3 is strongly consistent.
Local disk encryption can be enabled for HDFS and local file system.
EMR supports SSE-S3, SSE-KMS, and CSE-KMS for S3 data encryption.
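A minimal PySpark sketch of the S3-first pattern described above, reading and writing through EMRFS `s3://` paths; bucket names and the dataset layout are assumptions:

```python
from pyspark.sql import SparkSession

# On EMR, s3:// paths resolve through EMRFS, so the cluster can read and
# write S3 directly without staging data in HDFS first.
spark = SparkSession.builder.appName("emrfs-demo").getOrCreate()

# Bucket and prefix names are placeholders.
events = spark.read.json("s3://my-raw-bucket/events/2024/")

daily = events.groupBy("event_date").count()

# Writing back to S3 keeps results available after the cluster terminates.
daily.write.mode("overwrite").parquet("s3://my-curated-bucket/daily_counts/")
```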
Performance Optimization
Use instance types optimized for your workload (compute, memory, or I/O intensive).
Prefer r5d.2xlarge or similar instances with local SSDs for better performance.
Configure appropriate EBS volumes for I/O-intensive workloads.
Use EMRFS S3-optimized committer for better Spark write performance to S3.
Enable EMRFS consistent view only when needed as it adds overhead.
Use appropriate compression codecs (Snappy for processing, Gzip for storage).
Partition data properly to enable partition pruning during queries.
Use columnar formats like Parquet or ORC for analytical workloads.
Configure Spark executor memory and cores based on instance size.
Use dynamic allocation in Spark to efficiently utilize resources.
Tune the number of partitions in Spark based on cluster size and data volume.
Broadcast small tables in Spark joins to reduce shuffle operations.
Use EMR instance fleets to optimize for both cost and availability.
Separate storage from compute by using S3 instead of HDFS when possible.
Use appropriate serialization formats (e.g., Kryo for Spark); a sample Spark configuration follows this list.
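A sample Spark session configuration reflecting several of these tips (Kryo serialization, dynamic allocation, executor sizing, and the EMRFS S3-optimized committer); all values are illustrative assumptions and should be tuned to the actual instance types and data volume:

```python
from pyspark.sql import SparkSession

# Illustrative tuning values only -- executor sizing depends on the chosen
# instance type, and the committer property below is an EMR-specific setting.
spark = (
    SparkSession.builder.appName("tuned-job")
    # Size executors to the node (assumed ~4 cores / 8 GB heap per executor)
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    # Let YARN grow and shrink the executor count with the workload
    .config("spark.dynamicAllocation.enabled", "true")
    # Kryo is usually faster and more compact than Java serialization
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # EMRFS S3-optimized committer for Parquet writes (on EMR releases that support it)
    .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
    # Shuffle partition count tuned to cluster size and data volume (placeholder)
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)
```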
Cost Optimization
Use Spot instances for task nodes to reduce costs by up to 90%.
Set auto-termination policies for clusters that don't need to run continuously (see the boto3 sketch after this list).
Use instance fleets with multiple instance types to optimize for cost.
Leverage Reserved Instances for master and core nodes in long-running clusters.
Scale clusters based on workload using automatic scaling.
Use EMR Managed Scaling to automatically adjust cluster size based on metrics.
Choose appropriate instance sizes to avoid over-provisioning.
Use S3 lifecycle policies to transition or expire EMR output data.
Compress and partition data to reduce storage costs and improve query performance.
Use EMR Notebooks instead of running a separate cluster for development.
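As one example of auto-termination, a minimal boto3 sketch that attaches an auto-termination policy to an existing cluster; the cluster ID and the one-hour idle timeout are placeholder assumptions:

```python
import boto3

emr = boto3.client("emr")

# Attach an auto-termination policy so an idle cluster shuts itself down.
emr.put_auto_termination_policy(
    ClusterId="j-XXXXXXXXXXXXX",                    # placeholder cluster ID
    AutoTerminationPolicy={"IdleTimeout": 3600},    # seconds idle before termination
)
```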
Security Best Practices
Use IAM roles to control access to EMR clusters and other AWS resources.
Configure security groups to restrict network access to clusters.
Enable encryption in transit using TLS for all communications.
Enable encryption at rest for HDFS, EBS volumes, and S3 data (see the security configuration sketch after this list).
Use Kerberos or LDAP for user authentication on long-running clusters.
Implement Apache Ranger for fine-grained access control.
Launch clusters in private subnets with NAT gateway for enhanced security.
Use block public access settings for S3 buckets used with EMR.
Enable logging and auditing for EMR operations.
Rotate credentials and keys regularly for long-running clusters.
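A sketch of creating an EMR security configuration that enables at-rest (SSE-KMS) and in-transit (TLS) encryption; the KMS key ARN, certificate bundle location, and configuration name are placeholder assumptions:

```python
import json
import boto3

emr = boto3.client("emr")

# Security configuration enabling at-rest and in-transit encryption.
security_config = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": True,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "SSE-KMS",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
            },
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
            },
        },
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                "S3Object": "s3://my-certs-bucket/emr-certs.zip",  # placeholder
            }
        },
    }
}

emr.create_security_configuration(
    Name="emr-encryption-config",
    SecurityConfiguration=json.dumps(security_config),
)
```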
Monitoring and Troubleshooting
Use CloudWatch metrics to monitor cluster and application performance.
Enable S3 access logging to track data access patterns.
Configure EMR to send logs to S3 for persistent storage.
Use Ganglia for real-time cluster monitoring.
Access Hadoop web interfaces via SSH tunneling for detailed monitoring.
Check for HDFS space issues when jobs fail with "No space left on device" errors.
Monitor garbage collection in Spark applications for memory-related issues.
Use EMR step failure action to determine cluster behavior on step failure.
Check bootstrap action logs in S3 for cluster startup issues.
Monitor instance state changes for Spot instance terminations.
Important CloudWatch Metrics for EMR
| Metric | Description | Threshold Recommendation |
|--------|-------------|--------------------------|
| IsIdle | Indicates whether the cluster is idle | Idle >30 minutes may indicate underutilization |
| HDFSUtilization | Percentage of HDFS storage used | Alert at >80% |
| MRTotalNodes | Total nodes in the cluster | Monitor for unexpected changes |
| MRActiveNodes | Number of active nodes | Should match expected cluster size |
| MRLostNodes | Number of lost nodes | Alert if >0 |
| MemoryAvailableMB | Available memory | Alert if consistently low |
| YARNMemoryAvailablePercentage | Available YARN memory percentage | Alert if <15% |
| ContainerAllocated | Number of containers allocated | Monitor for resource utilization |
| AppsCompleted | Number of completed applications | Track job completion rates |
| AppsFailed | Number of failed applications | Alert if >0 |
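As an example of acting on these metrics, a boto3 sketch that alarms when a cluster has been idle for roughly 30 minutes; the cluster ID and SNS topic ARN are placeholder assumptions:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on the IsIdle metric from the table above.
cloudwatch.put_metric_alarm(
    AlarmName="emr-cluster-idle",
    Namespace="AWS/ElasticMapReduce",
    MetricName="IsIdle",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],  # placeholder
    Statistic="Average",
    Period=300,                 # 5-minute periods
    EvaluationPeriods=6,        # ~30 minutes of continuous idleness
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:emr-alerts"],  # placeholder
)
```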
Open Source Components in EMR
EMR uses Apache Hadoop ecosystem components with AWS optimizations for better performance.
EMR's Hadoop distribution includes performance improvements for S3 integration.
EMR Spark includes optimizations for S3 access and improved shuffle performance.
EMR Hive includes predicate pushdown optimizations for S3 Select.
EMR HBase can use S3 as a storage layer through EMRFS.
EMR Presto includes AWS Glue Data Catalog integration for metastore.
EMR supports Jupyter notebooks through EMR Notebooks/Studio with SparkMagic.
EMR includes Hadoop ecosystem tools like Pig, Tez, Phoenix, and Zeppelin.
EMR Flink provides stream processing capabilities with Kinesis integration.
EMR supports custom AMIs for specialized requirements.
Data Ingestion and Processing
Use Spark Streaming or Flink on EMR for real-time data processing from Kinesis.
EMR can read directly from DynamoDB using EMR DynamoDB connector.
Use AWS Glue Data Catalog as a metastore for Hive, Spark, and Presto.
Implement throttling in Spark Streaming by adjusting batch intervals and parallelism.
Kinesis Data Streams caps writes at 1MB/sec per shard and reads at 2MB/sec per shard, which bounds EMR streaming consumers.
S3 supports at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix.
Implement exponential backoff for rate limit handling in EMR applications.
Use Kafka on EMR for buffering high-volume data streams before processing.
EMR can process data from Amazon MSK (Managed Streaming for Apache Kafka) for stream processing; a Structured Streaming sketch follows this list.
Use AWS Glue ETL jobs for lightweight transformations and EMR for complex processing.
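A Spark Structured Streaming sketch for ingesting a Kafka/MSK topic on EMR and landing it in S3; broker addresses, topic name, and S3 paths are assumptions, and the Kafka connector package must be available on the cluster:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("msk-ingest").getOrCreate()

# Read a Kafka/MSK topic as a stream (broker and topic are placeholders).
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "b-1.example.kafka.us-east-1.amazonaws.com:9092")
    .option("subscribe", "clickstream")
    .option("startingOffsets", "latest")
    .load()
)

parsed = stream.select(col("key").cast("string"), col("value").cast("string"))

# Land raw records in S3; the checkpoint enables recovery after failures.
query = (
    parsed.writeStream.format("parquet")
    .option("path", "s3://my-raw-bucket/clickstream/")
    .option("checkpointLocation", "s3://my-raw-bucket/checkpoints/clickstream/")
    .start()
)
query.awaitTermination()
```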
Data Pipeline Replayability
Store raw data in S3 to enable reprocessing when needed.
Use checkpointing in Spark Streaming to resume processing after failures (illustrated in the sketch after this list).
Implement idempotent operations to safely reprocess data without duplicates.
Use S3 versioning to maintain historical versions of processed data.
Store processing state in DynamoDB to track progress and enable resumption.
Implement dead-letter queues for messages that fail processing.
Use Kinesis enhanced fan-out for high-throughput consumers from EMR.
Configure appropriate Kinesis data retention (up to 365 days) for replay capability.
Use EMR step concurrency to process multiple workloads simultaneously.
Implement data quality checks before and after processing.
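A small PySpark sketch of the replayability pattern above: raw data stays in S3 and the job overwrites a single date partition, so reruns are idempotent; bucket names, column names, and the date parameter are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("replay-demo").getOrCreate()

# Only the partitions touched by this write are replaced, not the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

run_date = "2024-01-15"  # placeholder; normally passed in as a job parameter
raw = spark.read.json(f"s3://my-raw-bucket/events/date={run_date}/")

cleaned = (
    raw.filter(col("event_type").isNotNull())
       .withColumn("date", col("event_date"))
)

# Re-running for the same date simply replaces that partition -- no duplicates.
(cleaned.write.mode("overwrite")
    .partitionBy("date")
    .parquet("s3://my-curated-bucket/events_clean/"))
```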
EMR Serverless
EMR Serverless automatically provisions and manages compute capacity.
Supports Apache Spark and Apache Hive applications without managing clusters; see the start_job_run sketch after this list.
Pre-initialized capacity reduces job startup time for frequent workloads.
Maximum concurrency limit of 100 applications per AWS account.
Default worker timeout is 15 minutes of inactivity.
Maximum worker capacity is 400 vCPUs per application (can be increased).
Supports job runs with up to 30 days maximum duration.
Provides automatic scaling based on workload requirements.
Integrates with AWS Glue Data Catalog for table metadata.
Supports S3 and JDBC data sources but not HDFS.
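A minimal boto3 sketch of submitting a Spark job to an EMR Serverless application with `start_job_run`; the application ID, execution role, and S3 paths are placeholder assumptions:

```python
import boto3

serverless = boto3.client("emr-serverless")

# Submit a Spark job to an existing EMR Serverless application.
response = serverless.start_job_run(
    applicationId="00f1abcdexample",                                   # placeholder
    executionRoleArn="arn:aws:iam::111122223333:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-code-bucket/etl_job.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
    configurationOverrides={
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {"logUri": "s3://my-log-bucket/serverless-logs/"}
        }
    },
)
print(response["jobRunId"])
```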
EMR on EKS
Allows running Spark jobs on existing EKS clusters without provisioning separate EMR clusters (see the sketch after this list).
Uses Kubernetes Operators to manage Spark application lifecycle.
Supports virtual clusters that map to Kubernetes namespaces.
Maximum of 100 virtual clusters per AWS account.
Integrates with AWS Glue Data Catalog for table metadata.
Supports both job runs and interactive endpoints for development.
Uses pod templates for customizing Spark executor environments.
Enables sharing EKS clusters between EMR and other applications.
Supports auto-scaling of Kubernetes nodes based on demand.
Provides monitoring through CloudWatch Container Insights.
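A minimal boto3 sketch of submitting a Spark job to an EMR on EKS virtual cluster via the `emr-containers` API; the virtual cluster ID, execution role, release label, and S3 paths are placeholder assumptions:

```python
import boto3

containers = boto3.client("emr-containers")

# Submit a Spark job to a virtual cluster (an EKS namespace registered with EMR).
response = containers.start_job_run(
    name="spark-on-eks-demo",
    virtualClusterId="abcd1234examplecluster",                          # placeholder
    executionRoleArn="arn:aws:iam::111122223333:role/EMRContainersJobRole",
    releaseLabel="emr-6.9.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://my-code-bucket/etl_job.py",
            "sparkSubmitParameters": "--conf spark.executor.instances=4",
        }
    },
)
print(response["id"])
```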
EMR Studio
Provides a web-based IDE for notebook development against EMR clusters.
Supports Jupyter, JupyterLab, and Apache Zeppelin notebooks.
Enables collaboration through notebook sharing and version control.
Integrates with Git repositories for source control.
Supports SQL Explorer for interactive SQL queries.
Provides workspace isolation for different teams or projects.
Enables fine-grained access control through IAM and Lake Formation.
Supports connecting to multiple EMR clusters from a single workspace.
Provides pre-installed Python libraries and kernels.
Enables SQL, PySpark, Scala, and SparkR development.
Throughput and Latency Characteristics
S3 provides virtually unlimited throughput by partitioning data across prefixes.
Kinesis Data Streams provides 1MB/sec write and 2MB/sec read per shard.
EMR Spark Streaming typical latency is seconds to minutes depending on batch interval.
EMR Flink can achieve sub-second latency for stream processing.
DynamoDB throughput is limited by provisioned capacity or on-demand limits.
Kafka on EMR can handle millions of records per second with proper configuration.
S3 Select can reduce query latency by up to 80% by pushing filtering to S3.
EMR on EC2 provides lower latency than EMR Serverless for interactive workloads.
Network throughput varies by EC2 instance type (up to 100 Gbps for some types).
EMRFS read/write performance depends on S3 throughput and request rates.
Advanced EMR Features
EMR WAL (Write Ahead Logging) enables HBase recovery after cluster termination.
EMR supports custom AMIs for specialized software requirements.
EMR supports LDAP integration for user authentication.
EMR supports direct connection to on-premises data sources through Direct Connect.
EMR supports running Docker containers through YARN.
EMR supports GPU instances for machine learning workloads.
EMR supports Spot instance fleets with allocation strategy for cost optimization.
EMR supports clusters with three master nodes for high availability.
EMR supports custom security configurations for encryption and authentication.
EMR supports running on AWS Outposts for on-premises deployments.
EMR Integration with AWS Services
AWS Glue Data Catalog provides a central metadata repository for EMR applications.
AWS Lake Formation provides fine-grained access control for data lakes.
Amazon Redshift Spectrum can query data in S3 processed by EMR.
AWS Step Functions can orchestrate EMR workflows.
Amazon EventBridge can trigger EMR workflows based on events.
AWS Lambda can submit EMR steps or create clusters.
Amazon QuickSight can visualize data processed by EMR.
AWS CloudFormation can automate EMR cluster provisioning.
AWS Service Catalog can provide self-service EMR cluster templates.
Amazon SageMaker can use data processed by EMR for machine learning.
Example Calculations
EMR Cluster Size Calculation: 1TB of raw data with 3:1 compression shrinks to ~0.33TB; with 3x HDFS replication, 0.33TB × 3 ≈ 1TB of HDFS capacity is needed.
Spark Memory Calculation: For a node with 32GB RAM: 32GB × 0.9 (overhead allowance) × 0.6 (default spark.memory.fraction) ≈ 17GB available for Spark executors.
Kinesis Shard Calculation: For ingesting 50MB/sec: 50MB/sec ÷ 1MB/sec per shard = 50 shards needed.
Cost Calculation: 10-node EMR cluster of m5.2xlarge (8 vCPU, 32GB RAM) at $0.384/hour = $0.384 × 10 × 24 = $92.16/day (EC2 cost only; the per-instance EMR charge is additional).
Spot Savings Calculation: With Spot instances at a 70% discount: $92.16 × 0.3 = $27.65/day (saving $64.51/day).
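The same arithmetic expressed as a short Python sketch, useful for sanity-checking exam-style estimates:

```python
# Back-of-the-envelope EMR sizing and cost arithmetic from the examples above.
raw_tb = 1.0
compressed_tb = raw_tb / 3                     # 3:1 compression
hdfs_tb = compressed_tb * 3                    # 3x HDFS replication -> ~1.0 TB

node_ram_gb = 32
spark_executor_gb = node_ram_gb * 0.9 * 0.6    # ~17 GB usable by executors

shards = 50 / 1                                # 50 MB/s ingest / 1 MB/s per shard

ec2_daily = 0.384 * 10 * 24                    # $92.16/day for 10 x m5.2xlarge (EC2 only)
spot_daily = ec2_daily * 0.3                   # ~$27.65/day at a 70% Spot discount

print(hdfs_tb, round(spark_executor_gb, 1), shards, ec2_daily, round(spot_daily, 2))
```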
Data Processing Patterns
Use EMR for ETL processing of large datasets before loading into data warehouses.
Implement lambda architecture with batch processing (EMR) and stream processing (Kinesis).
Use EMR for data lake processing with S3 as the storage layer.
Implement medallion architecture (bronze, silver, gold) for progressive data refinement.
Use EMR for machine learning feature engineering at scale.
Implement data quality validation using EMR before downstream consumption.
Use EMR for regular data aggregation and reporting generation.
Implement slowly changing dimension processing using EMR and Hive/Spark.
Use EMR for sessionization and user behavior analysis of web logs.
Implement data archiving and compliance workflows using EMR.
Limits and Quotas
Maximum of 500 instances per EMR cluster (can be increased via support).
Maximum of 100 steps per cluster (can be increased via support).
Maximum of 256MB total size for all bootstrap action scripts.
Maximum of 10,000 concurrent step executions per AWS region.
Maximum of 500 active clusters per AWS region.
Maximum of 100 instance groups per cluster.
Maximum of 30 task instance groups per cluster.
Maximum of 50 tags per EMR cluster.
Maximum of 100 EMR Serverless applications per AWS account.
Maximum of 100 EMR Studio workspaces per AWS account.
Best Practices for EMR Exam
Understand the differences between EMR deployment options (EC2, Serverless, EKS).
Know the node types (master, core, task) and their specific purposes.
Understand storage options (HDFS, EMRFS, local) and when to use each.
Be familiar with instance purchasing options (On-Demand, Reserved, Spot).
Know how to optimize costs using instance fleets and Spot instances.
Understand security configurations including encryption and authentication.
Know how EMR integrates with other AWS services like Glue and Lake Formation.
Understand monitoring options including CloudWatch metrics and logs.
Know common troubleshooting approaches for EMR clusters and applications.
Understand data processing patterns and best practices for big data workloads.