AWS EMR (Elastic MapReduce) Cheat Sheet for AWS Certified Data Engineer - Associate (DEA-C01)
Core Concepts and Building Blocks
Amazon EMR (Elastic MapReduce) is a cloud-based big data platform for processing and analyzing vast amounts of data using open-source tools such as Apache Hadoop, Apache Spark, HBase, Presto, and others.
Key Components:
Clusters: Collection of EC2 instances running open-source applications
Nodes: Individual EC2 instances in a cluster (master, core, task)
Steps: Work units submitted to a cluster
EMRFS: EMR File System for accessing S3 data directly
Instance Groups/Fleets: Ways to manage cluster capacity
EMR Architecture Mind Map
```
EMR Cluster
├── Master Node
│   ├── Resource Management
│   ├── Application Management
│   └── Cluster Management
├── Core Node
│   ├── HDFS Storage
│   └── Data Processing
├── Task Node
│   └── Additional Processing Capacity
└── EMR Services
    ├── EMR Notebooks
    ├── EMR Studio
    ├── EMR Serverless
    └── EMR on EKS
```
EMR Features and Details
| Feature | Description |
|---------|-------------|
| Deployment Options | EMR on EC2, EMR Serverless, EMR on EKS, EMR on Outposts |
| Release Types | Amazon EMR releases (e.g., 6.9.0), each bundling specific open-source application versions |
| Storage Options | HDFS, EMRFS (S3), local file system, EBS volumes |
| Security | IAM roles, security groups, encryption (at rest, in transit), Kerberos |
| Scaling | Manual scaling, automatic scaling, managed scaling |
| Purchasing Options | On-Demand, Reserved, Spot instances |
| Networking | Public subnet, or private subnet with NAT gateway |
| Monitoring | CloudWatch, Ganglia, Hadoop web interfaces |
| Integration | AWS Glue Data Catalog, Lake Formation, Step Functions |
| Cost Management | Instance fleets, Spot instances, auto-termination |
Node Types Comparison
| Node Type | Purpose | Minimum Required | Scaling | Data Loss Risk |
|-----------|---------|------------------|---------|----------------|
| Master Node | Manages the cluster, runs primary components | 1 (always) | Cannot scale | High (single point of failure) |
| Core Node | Runs tasks and stores data on HDFS | 1+ recommended | Can scale up/down (carefully) | High (HDFS data loss) |
| Task Node | Runs tasks only, no HDFS | 0+ (optional) | Can scale freely | None (no data storage) |
Instance Group vs Instance Fleet
| Feature | Instance Groups | Instance Fleets |
|---------|-----------------|-----------------|
| Instance Types | Same type per group | Multiple types per fleet |
| Purchasing Options | On-Demand or Spot per group | Mix of On-Demand and Spot |
| Capacity Strategy | Manual allocation | Target capacity in units |
| Availability Zones | Single AZ per group | Multiple subnets/AZs can be specified; EMR picks one at launch |
| Spot Behavior | Limited flexibility | Advanced allocation strategies |
| Complexity | Simpler to configure | More complex but flexible |
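To make the fleet column concrete, here is a sketch of a task instance fleet definition as it might appear in a boto3 `run_job_flow` call; the instance types, capacities, and timeout values are illustrative assumptions, not recommendations:

```python
# Hypothetical instance-fleet configuration for boto3 run_job_flow
# (instance types, weights, and target capacities are illustrative only).
task_fleet = {
    "Name": "task-fleet",
    "InstanceFleetType": "TASK",
    "TargetOnDemandCapacity": 2,   # capacity units satisfied by On-Demand
    "TargetSpotCapacity": 8,       # capacity units satisfied by Spot
    "InstanceTypeConfigs": [
        {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
        {"InstanceType": "r5.2xlarge", "WeightedCapacity": 2},
        {"InstanceType": "m5.4xlarge", "WeightedCapacity": 4},
    ],
    "LaunchSpecifications": {
        "SpotSpecification": {
            "TimeoutDurationMinutes": 10,
            "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            "AllocationStrategy": "capacity-optimized",
        }
    },
}
```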
Important EMR Pointers
EMR clusters can be launched in minutes and scaled to process petabytes of data.
Master node runs the primary components like ResourceManager, NameNode, and application clients.
Core nodes both store data (HDFS) and run tasks; removing core nodes risks data loss.
Task nodes only run tasks and don't store HDFS data; they can be added/removed freely.
EMR supports multiple open-source frameworks including Hadoop, Spark, HBase, Presto, and Hive.
EMR releases are versioned (e.g., EMR 6.9.0) and include specific versions of open-source applications.
EMRFS allows EMR clusters to directly use S3 as a data layer without loading into HDFS first.
S3DistCp utility helps efficiently move large amounts of data between HDFS and S3.
EMR clusters can be long-running or transient (auto-terminating after tasks complete).
Steps are work units that contain instructions for data processing (e.g., a Spark job); see the boto3 sketch after this list.
EMR Studio provides managed Jupyter notebooks for interactive data analysis.
EMR Serverless eliminates the need to configure clusters for Spark and Hive applications.
EMR on EKS allows running big data frameworks on Amazon EKS clusters.
Instance fleets provide more flexibility than instance groups for mixed instance types.
Spot instances can reduce costs by up to 90% but can be reclaimed with only a two-minute interruption notice.
Managed scaling automatically adjusts cluster capacity based on workload.
Maximum limit of 500 instances per cluster (can be increased via support ticket).
Bootstrap actions run custom scripts during cluster setup before applications start.
Security configurations provide encryption options for data at rest and in transit.
EMR supports VPC private subnets with NAT gateway for enhanced security.
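To tie several of the pointers above together (transient clusters, steps, and auto-termination), here is a minimal boto3 sketch that launches a cluster, runs one Spark step, and lets the cluster terminate itself; the bucket names, IAM roles, and region are placeholder assumptions:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

# Transient cluster: runs one Spark step, then terminates automatically.
response = emr.run_job_flow(
    Name="nightly-etl",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-log-bucket/emr-logs/",          # placeholder bucket
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,        # terminate after the last step
    },
    Steps=[
        {
            "Name": "spark-etl-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://my-code-bucket/etl_job.py"],  # placeholder script
            },
        }
    ],
)
print(response["JobFlowId"])
```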
EMR Storage Options
HDFS provides distributed storage across cluster nodes with data replication.
EMRFS extends Hadoop to directly access S3 data with consistency checking.
Local file system on EC2 instances provides temporary storage that's lost when instances terminate.
EBS volumes attached to EC2 instances provide additional storage capacity.
S3 is recommended for persistent storage as it survives cluster termination.
HDFS replication factor defaults to 3 but can be configured based on needs.
S3 data access from EMR can use EMRFS (s3://) or the open-source s3a:// connector; a PySpark example follows this list.
EMRFS consistent view tracked S3 object metadata in DynamoDB to detect listing inconsistencies; it is rarely needed now that S3 is strongly consistent.
Local disk encryption can be enabled for HDFS and local file system.
EMR supports SSE-S3, SSE-KMS, and CSE-KMS for S3 data encryption.
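A minimal PySpark sketch of the S3-first pattern described above, reading and writing through EMRFS `s3://` paths; bucket names and the dataset layout are assumptions:

```python
from pyspark.sql import SparkSession

# On EMR, s3:// paths resolve through EMRFS, so the cluster can read and
# write S3 directly without staging data in HDFS first.
spark = SparkSession.builder.appName("emrfs-demo").getOrCreate()

# Bucket and prefix names are placeholders.
events = spark.read.json("s3://my-raw-bucket/events/2024/")

daily = events.groupBy("event_date").count()

# Writing back to S3 keeps results available after the cluster terminates.
daily.write.mode("overwrite").parquet("s3://my-curated-bucket/daily_counts/")
```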
Performance Optimization
Use instance types optimized for your workload (compute, memory, or I/O intensive).
Prefer r5d.2xlarge or similar instances with local SSDs for better performance.
Configure appropriate EBS volumes for I/O-intensive workloads.
Use EMRFS S3-optimized committer for better Spark write performance to S3.
Enable EMRFS consistent view only when needed as it adds overhead.
Use appropriate compression codecs (Snappy for processing, Gzip for storage).
Partition data properly to enable partition pruning during queries.
Use columnar formats like Parquet or ORC for analytical workloads.
Configure Spark executor memory and cores based on instance size.
Use dynamic allocation in Spark to efficiently utilize resources.
Tune the number of partitions in Spark based on cluster size and data volume.
Broadcast small tables in Spark joins to reduce shuffle operations.
Use EMR instance fleets to optimize for both cost and availability.
Separate storage from compute by using S3 instead of HDFS when possible.
Use appropriate serialization formats (e.g., Kryo for Spark); a sample Spark configuration follows this list.
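A sample Spark session configuration reflecting several of these tips (Kryo serialization, dynamic allocation, executor sizing, and the EMRFS S3-optimized committer); all values are illustrative assumptions and should be tuned to the actual instance types and data volume:

```python
from pyspark.sql import SparkSession

# Illustrative tuning values only -- executor sizing depends on the chosen
# instance type, and the committer property below is an EMR-specific setting.
spark = (
    SparkSession.builder.appName("tuned-job")
    # Size executors to the node (assumed ~4 cores / 8 GB heap per executor)
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    # Let YARN grow and shrink the executor count with the workload
    .config("spark.dynamicAllocation.enabled", "true")
    # Kryo is usually faster and more compact than Java serialization
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # EMRFS S3-optimized committer for Parquet writes (on EMR releases that support it)
    .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
    # Shuffle partition count tuned to cluster size and data volume (placeholder)
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)
```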
Cost Optimization
Use Spot instances for task nodes to reduce costs by up to 90%.
Set auto-termination policies for clusters that don't need to run continuously (see the boto3 sketch after this list).
Use instance fleets with multiple instance types to optimize for cost.
Leverage Reserved Instances for master and core nodes in long-running clusters.
Scale clusters based on workload using automatic scaling.
Use EMR Managed Scaling to automatically adjust cluster size based on metrics.
Choose appropriate instance sizes to avoid over-provisioning.
Use S3 lifecycle policies to transition or expire EMR output data.
Compress and partition data to reduce storage costs and improve query performance.
Use EMR Notebooks instead of running a separate cluster for development.
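As one example of auto-termination, a minimal boto3 sketch that attaches an auto-termination policy to an existing cluster; the cluster ID and the one-hour idle timeout are placeholder assumptions:

```python
import boto3

emr = boto3.client("emr")

# Attach an auto-termination policy so an idle cluster shuts itself down.
emr.put_auto_termination_policy(
    ClusterId="j-XXXXXXXXXXXXX",                    # placeholder cluster ID
    AutoTerminationPolicy={"IdleTimeout": 3600},    # seconds idle before termination
)
```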
Security Best Practices
Use IAM roles to control access to EMR clusters and other AWS resources.
Configure security groups to restrict network access to clusters.
Enable encryption in transit using TLS for all communications.
Enable encryption at rest for HDFS, EBS volumes, and S3 data (see the security configuration sketch after this list).
Use Kerberos or LDAP for user authentication on long-running clusters.
Implement Apache Ranger for fine-grained access control.
Launch clusters in private subnets with NAT gateway for enhanced security.
Use block public access settings for S3 buckets used with EMR.
Enable logging and auditing for EMR operations.
Rotate credentials and keys regularly for long-running clusters.
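A sketch of creating an EMR security configuration that enables at-rest (SSE-KMS) and in-transit (TLS) encryption; the KMS key ARN, certificate bundle location, and configuration name are placeholder assumptions:

```python
import json
import boto3

emr = boto3.client("emr")

# Security configuration enabling at-rest and in-transit encryption.
security_config = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": True,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "SSE-KMS",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
            },
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
            },
        },
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                "S3Object": "s3://my-certs-bucket/emr-certs.zip",  # placeholder
            }
        },
    }
}

emr.create_security_configuration(
    Name="emr-encryption-config",
    SecurityConfiguration=json.dumps(security_config),
)
```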
Monitoring and Troubleshooting
Use CloudWatch metrics to monitor cluster and application performance.
Enable S3 access logging to track data access patterns.
Configure EMR to send logs to S3 for persistent storage.
Use Ganglia for real-time cluster monitoring.
Access Hadoop web interfaces via SSH tunneling for detailed monitoring.
Check for HDFS space issues when jobs fail with "No space left on device" errors.
Monitor garbage collection in Spark applications for memory-related issues.
Use EMR step failure action to determine cluster behavior on step failure.
Check bootstrap action logs in S3 for cluster startup issues.
Monitor instance state changes for Spot instance terminations.
Important CloudWatch Metrics for EMR
| Metric | Description | Threshold Recommendation |
|--------|-------------|--------------------------|
| IsIdle | Indicates whether the cluster is idle | Idle >30 minutes may indicate underutilization |
| HDFSUtilization | Percentage of HDFS storage used | Alert at >80% |
| MRTotalNodes | Total nodes in the cluster | Monitor for unexpected changes |
| MRActiveNodes | Number of active nodes | Should match expected cluster size |
| MRLostNodes | Number of lost nodes | Alert if >0 |
| MemoryAvailableMB | Available memory | Alert if consistently low |
| YARNMemoryAvailablePercentage | Available YARN memory percentage | Alert if <15% |
| ContainerAllocated | Number of containers allocated | Monitor for resource utilization |
| AppsCompleted | Number of completed applications | Track job completion rates |
| AppsFailed | Number of failed applications | Alert if >0 |
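As an example of acting on these metrics, a boto3 sketch that alarms when a cluster has been idle for roughly 30 minutes; the cluster ID and SNS topic ARN are placeholder assumptions:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on the IsIdle metric from the table above.
cloudwatch.put_metric_alarm(
    AlarmName="emr-cluster-idle",
    Namespace="AWS/ElasticMapReduce",
    MetricName="IsIdle",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],  # placeholder
    Statistic="Average",
    Period=300,                 # 5-minute periods
    EvaluationPeriods=6,        # ~30 minutes of continuous idleness
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:emr-alerts"],  # placeholder
)
```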
Open Source Components in EMR
EMR uses Apache Hadoop ecosystem components with AWS optimizations for better performance.
EMR's Hadoop distribution includes performance improvements for S3 integration.
EMR Spark includes optimizations for S3 access and improved shuffle performance.
EMR Hive includes predicate pushdown optimizations for S3 Select.
EMR HBase can use S3 as a storage layer through EMRFS.
EMR Presto includes AWS Glue Data Catalog integration for metastore.
EMR supports Jupyter notebooks through EMR Notebooks/Studio with SparkMagic.
EMR includes Hadoop ecosystem tools like Pig, Tez, Phoenix, and Zeppelin.
EMR Flink provides stream processing capabilities with Kinesis integration.
EMR supports custom AMIs for specialized requirements.
Data Ingestion and Processing
Use Spark Streaming or Flink on EMR for real-time data processing from Kinesis.
EMR can read directly from DynamoDB using EMR DynamoDB connector.
Use AWS Glue Data Catalog as a metastore for Hive, Spark, and Presto.
Implement throttling in Spark Streaming by adjusting batch intervals and parallelism.
Kinesis Data Streams caps writes at 1MB/sec per shard and reads at 2MB/sec per shard, which bounds EMR streaming consumers.
S3 supports at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix.
Implement exponential backoff for rate limit handling in EMR applications.
Use Kafka on EMR for buffering high-volume data streams before processing.
EMR can process data from Amazon MSK (Managed Streaming for Apache Kafka) for stream processing; a Structured Streaming sketch follows this list.
Use AWS Glue ETL jobs for lightweight transformations and EMR for complex processing.
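A Spark Structured Streaming sketch for ingesting a Kafka/MSK topic on EMR and landing it in S3; broker addresses, topic name, and S3 paths are assumptions, and the Kafka connector package must be available on the cluster:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("msk-ingest").getOrCreate()

# Read a Kafka/MSK topic as a stream (broker and topic are placeholders).
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "b-1.example.kafka.us-east-1.amazonaws.com:9092")
    .option("subscribe", "clickstream")
    .option("startingOffsets", "latest")
    .load()
)

parsed = stream.select(col("key").cast("string"), col("value").cast("string"))

# Land raw records in S3; the checkpoint enables recovery after failures.
query = (
    parsed.writeStream.format("parquet")
    .option("path", "s3://my-raw-bucket/clickstream/")
    .option("checkpointLocation", "s3://my-raw-bucket/checkpoints/clickstream/")
    .start()
)
query.awaitTermination()
```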
Data Pipeline Replayability
Store raw data in S3 to enable reprocessing when needed.
Use checkpointing in Spark Streaming to resume processing after failures (illustrated in the sketch after this list).
Implement idempotent operations to safely reprocess data without duplicates.
Use S3 versioning to maintain historical versions of processed data.
Store processing state in DynamoDB to track progress and enable resumption.
Implement dead-letter queues for messages that fail processing.
Use Kinesis enhanced fan-out for high-throughput consumers from EMR.
Configure appropriate Kinesis data retention (up to 365 days) for replay capability.
Use EMR step concurrency to process multiple workloads simultaneously.
Implement data quality checks before and after processing.
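A small PySpark sketch of the replayability pattern above: raw data stays in S3 and the job overwrites a single date partition, so reruns are idempotent; bucket names, column names, and the date parameter are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("replay-demo").getOrCreate()

# Only the partitions touched by this write are replaced, not the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

run_date = "2024-01-15"  # placeholder; normally passed in as a job parameter
raw = spark.read.json(f"s3://my-raw-bucket/events/date={run_date}/")

cleaned = (
    raw.filter(col("event_type").isNotNull())
       .withColumn("date", col("event_date"))
)

# Re-running for the same date simply replaces that partition -- no duplicates.
(cleaned.write.mode("overwrite")
    .partitionBy("date")
    .parquet("s3://my-curated-bucket/events_clean/"))
```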
EMR Serverless
EMR Serverless automatically provisions and manages compute capacity.
Supports Apache Spark and Apache Hive applications without managing clusters; see the start_job_run sketch after this list.
Pre-initialized capacity reduces job startup time for frequent workloads.
Maximum concurrency limit of 100 applications per AWS account.
Default worker timeout is 15 minutes of inactivity.
Maximum worker capacity is 400 vCPUs per application (can be increased).
Supports job runs with up to 30 days maximum duration.
Provides automatic scaling based on workload requirements.
Integrates with AWS Glue Data Catalog for table metadata.
Supports S3 and JDBC data sources but not HDFS.
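A minimal boto3 sketch of submitting a Spark job to an EMR Serverless application with `start_job_run`; the application ID, execution role, and S3 paths are placeholder assumptions:

```python
import boto3

serverless = boto3.client("emr-serverless")

# Submit a Spark job to an existing EMR Serverless application.
response = serverless.start_job_run(
    applicationId="00f1abcdexample",                                   # placeholder
    executionRoleArn="arn:aws:iam::111122223333:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-code-bucket/etl_job.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
    configurationOverrides={
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {"logUri": "s3://my-log-bucket/serverless-logs/"}
        }
    },
)
print(response["jobRunId"])
```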
EMR on EKS
Allows running Spark jobs on existing EKS clusters without provisioning separate EMR clusters (see the sketch after this list).
Uses Kubernetes Operators to manage Spark application lifecycle.
Supports virtual clusters that map to Kubernetes namespaces.
Maximum of 100 virtual clusters per AWS account.
Integrates with AWS Glue Data Catalog for table metadata.
Supports both job runs and interactive endpoints for development.
Uses pod templates for customizing Spark executor environments.
Enables sharing EKS clusters between EMR and other applications.
Supports auto-scaling of Kubernetes nodes based on demand.
Provides monitoring through CloudWatch Container Insights.
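A minimal boto3 sketch of submitting a Spark job to an EMR on EKS virtual cluster via the `emr-containers` API; the virtual cluster ID, execution role, release label, and S3 paths are placeholder assumptions:

```python
import boto3

containers = boto3.client("emr-containers")

# Submit a Spark job to a virtual cluster (an EKS namespace registered with EMR).
response = containers.start_job_run(
    name="spark-on-eks-demo",
    virtualClusterId="abcd1234examplecluster",                          # placeholder
    executionRoleArn="arn:aws:iam::111122223333:role/EMRContainersJobRole",
    releaseLabel="emr-6.9.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://my-code-bucket/etl_job.py",
            "sparkSubmitParameters": "--conf spark.executor.instances=4",
        }
    },
)
print(response["id"])
```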
EMR Studio
Provides a web-based IDE for notebook development against EMR clusters.
Supports Jupyter, JupyterLab, and Apache Zeppelin notebooks.
Enables collaboration through notebook sharing and version control.
Integrates with Git repositories for source control.
Supports SQL Explorer for interactive SQL queries.
Provides workspace isolation for different teams or projects.
Enables fine-grained access control through IAM and Lake Formation.
Supports connecting to multiple EMR clusters from a single workspace.
Provides pre-installed Python libraries and kernels.
Enables SQL, PySpark, Scala, and SparkR development.
Throughput and Latency Characteristics
S3 provides virtually unlimited throughput by partitioning data across prefixes.
Kinesis Data Streams provides 1MB/sec write and 2MB/sec read per shard.
EMR Spark Streaming typical latency is seconds to minutes depending on batch interval.
EMR Flink can achieve sub-second latency for stream processing.
DynamoDB throughput is limited by provisioned capacity or on-demand limits.
Kafka on EMR can handle millions of records per second with proper configuration.
S3 Select can reduce query latency by up to 80% by pushing filtering to S3.
EMR on EC2 provides lower latency than EMR Serverless for interactive workloads.
Network throughput varies by EC2 instance type (up to 100 Gbps for some types).
EMRFS read/write performance depends on S3 throughput and request rates.
Advanced EMR Features
EMR WAL (Write Ahead Logging) enables HBase recovery after cluster termination.
EMR supports custom AMIs for specialized software requirements.
EMR supports LDAP integration for user authentication.
EMR supports direct connection to on-premises data sources through Direct Connect.
EMR supports running Docker containers through YARN.
EMR supports GPU instances for machine learning workloads.
EMR supports Spot instance fleets with allocation strategy for cost optimization.
EMR supports clusters with three master nodes for high availability.
EMR supports custom security configurations for encryption and authentication.
EMR supports running on AWS Outposts for on-premises deployments.
EMR Integration with AWS Services
AWS Glue Data Catalog provides a central metadata repository for EMR applications.
AWS Lake Formation provides fine-grained access control for data lakes.
Amazon Redshift Spectrum can query data in S3 processed by EMR.
AWS Step Functions can orchestrate EMR workflows.
Amazon EventBridge can trigger EMR workflows based on events.
AWS Lambda can submit EMR steps or create clusters.
Amazon QuickSight can visualize data processed by EMR.
AWS CloudFormation can automate EMR cluster provisioning.
AWS Service Catalog can provide self-service EMR cluster templates.
Amazon SageMaker can use data processed by EMR for machine learning.
Example Calculations
EMR Cluster Size Calculation: 1TB of raw data with 3:1 compression shrinks to ~0.33TB; with 3x HDFS replication, 0.33TB × 3 ≈ 1TB of HDFS capacity is needed.
Spark Memory Calculation: For a node with 32GB RAM: 32GB × 0.9 (overhead allowance) × 0.6 (default spark.memory.fraction) ≈ 17GB available for Spark executors.
Kinesis Shard Calculation: For ingesting 50MB/sec: 50MB/sec ÷ 1MB/sec per shard = 50 shards needed.
Cost Calculation: 10-node EMR cluster of m5.2xlarge (8 vCPU, 32GB RAM) at $0.384/hour = $0.384 × 10 × 24 = $92.16/day (EC2 cost only; the per-instance EMR charge is additional).
Spot Savings Calculation: With Spot instances at a 70% discount: $92.16 × 0.3 = $27.65/day (saving $64.51/day).
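The same arithmetic expressed as a short Python sketch, useful for sanity-checking exam-style estimates:

```python
# Back-of-the-envelope EMR sizing and cost arithmetic from the examples above.
raw_tb = 1.0
compressed_tb = raw_tb / 3                     # 3:1 compression
hdfs_tb = compressed_tb * 3                    # 3x HDFS replication -> ~1.0 TB

node_ram_gb = 32
spark_executor_gb = node_ram_gb * 0.9 * 0.6    # ~17 GB usable by executors

shards = 50 / 1                                # 50 MB/s ingest / 1 MB/s per shard

ec2_daily = 0.384 * 10 * 24                    # $92.16/day for 10 x m5.2xlarge (EC2 only)
spot_daily = ec2_daily * 0.3                   # ~$27.65/day at a 70% Spot discount

print(hdfs_tb, round(spark_executor_gb, 1), shards, ec2_daily, round(spot_daily, 2))
```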
Data Processing Patterns
Use EMR for ETL processing of large datasets before loading into data warehouses.
Implement lambda architecture with batch processing (EMR) and stream processing (Kinesis).
Use EMR for data lake processing with S3 as the storage layer.
Implement medallion architecture (bronze, silver, gold) for progressive data refinement.
Use EMR for machine learning feature engineering at scale.
Implement data quality validation using EMR before downstream consumption.
Use EMR for regular data aggregation and reporting generation.
Implement slowly changing dimension processing using EMR and Hive/Spark.
Use EMR for sessionization and user behavior analysis of web logs.
Implement data archiving and compliance workflows using EMR.
Limits and Quotas
Maximum of 500 instances per EMR cluster (can be increased via support).
Maximum of 100 steps per cluster (can be increased via support).
Maximum of 256MB total size for all bootstrap action scripts.
Maximum of 10,000 concurrent step executions per AWS region.
Maximum of 500 active clusters per AWS region.
Maximum of 100 instance groups per cluster.
Maximum of 30 task instance groups per cluster.
Maximum of 50 tags per EMR cluster.
Maximum of 100 EMR Serverless applications per AWS account.
Maximum of 100 EMR Studio workspaces per AWS account.
Best Practices for EMR Exam
Understand the differences between EMR deployment options (EC2, Serverless, EKS).
Know the node types (master, core, task) and their specific purposes.
Understand storage options (HDFS, EMRFS, local) and when to use each.
Be familiar with instance purchasing options (On-Demand, Reserved, Spot).
Know how to optimize costs using instance fleets and Spot instances.
Understand security configurations including encryption and authentication.
Know how EMR integrates with other AWS services like Glue and Lake Formation.
Understand monitoring options including CloudWatch metrics and logs.
Know common troubleshooting approaches for EMR clusters and applications.
Understand data processing patterns and best practices for big data workloads.