Wanda

Posted on • Originally published at apidog.com

How to Use EMR APIs?

TL;DR

AWS EMR (Elastic MapReduce) APIs allow you to manage big data clusters for Hadoop, Spark, Hive, and Presto. Use the API to programmatically create clusters, submit jobs as steps, scale resources automatically, and terminate clusters when done. Authentication is handled with AWS IAM. To validate cluster configs, test job submissions, and document your data pipelines, use Apidog.

Try Apidog today

Introduction

AWS EMR is AWS’s managed Hadoop/Spark service for processing petabytes of data in analytics, machine learning, and ETL pipelines. Instead of setting up Hadoop clusters manually, EMR handles the infrastructure on EC2 instances.

You configure:

  • Instance types (master, core, task nodes)
  • Applications (Spark, Hadoop, Hive, Presto, HBase)
  • Bootstrap actions (setup scripts)
  • Steps (jobs to run)

With the EMR API, you automate cluster creation, submit jobs, monitor them, and integrate with other AWS services.

💡 Tip: Use Apidog to test cluster configs, validate job definitions, and document EMR workflows before running costly jobs.

Test AWS APIs with Apidog - free

By the end of this guide, you’ll be able to:

  • Create and configure EMR clusters via API
  • Submit jobs as steps
  • Manage auto-scaling
  • Monitor cluster health and job progress
  • Optimize costs with instance fleets and spot instances

Authentication with AWS

EMR uses standard AWS IAM authentication.

AWS SDK approach (recommended)

import { EMRClient, RunJobFlowCommand } from '@aws-sdk/client-emr'

const client = new EMRClient({
  region: 'us-east-1',
  credentials: {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
  }
})

Direct API with SigV4

EMR requires AWS Signature Version 4. Use an AWS SDK (such as boto3 or the JavaScript SDK above), the AWS CLI, or generate SigV4 signatures manually.

aws emr list-clusters --region us-east-1

IAM permissions

Minimum policy for EMR management:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticmapreduce:*",
        "ec2:Describe*",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": "*"
    }
  ]
}

Creating a Cluster

Basic cluster creation

aws emr create-cluster \
  --name "My Spark Cluster" \
  --release-label emr-7.0.0 \
  --applications Name=Spark Name=Hadoop \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole

Via API (RunJobFlow)

{
  "Name": "Data Processing Cluster",
  "ReleaseLabel": "emr-7.0.0",
  "Applications": [
    { "Name": "Spark" },
    { "Name": "Hadoop" },
    { "Name": "Hive" }
  ],
  "Instances": {
    "MasterInstanceType": "m5.xlarge",
    "SlaveInstanceType": "m5.xlarge",
    "InstanceCount": 3,
    "KeepJobFlowAliveWhenNoSteps": true,
    "TerminationProtected": false
  },
  "Steps": [],
  "ServiceRole": "EMR_DefaultRole",
  "JobFlowRole": "EMR_EC2_DefaultRole",
  "LogUri": "s3://my-bucket/emr-logs/",
  "Tags": [
    { "Key": "Environment", "Value": "Production" }
  ]
}

Response:

{
  "JobFlowId": "j-ABC123DEF456"
}
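The RunJobFlow payload above can be assembled programmatically before sending it with the EMRClient from the authentication section. A minimal sketch — the bucket, role names, and instance settings are placeholders, not recommendations:

```javascript
// Sketch: build a RunJobFlow request payload.
// Pass the result to `client.send(new RunJobFlowCommand(params))`
// using the EMRClient created earlier.
function buildRunJobFlowInput({ name, instanceCount, logBucket }) {
  return {
    Name: name,
    ReleaseLabel: 'emr-7.0.0',
    Applications: [{ Name: 'Spark' }, { Name: 'Hadoop' }, { Name: 'Hive' }],
    Instances: {
      MasterInstanceType: 'm5.xlarge',
      SlaveInstanceType: 'm5.xlarge',
      InstanceCount: instanceCount,
      KeepJobFlowAliveWhenNoSteps: true,
      TerminationProtected: false
    },
    Steps: [],
    ServiceRole: 'EMR_DefaultRole',
    JobFlowRole: 'EMR_EC2_DefaultRole',
    LogUri: `s3://${logBucket}/emr-logs/`,
    Tags: [{ Key: 'Environment', Value: 'Production' }]
  }
}

const params = buildRunJobFlowInput({
  name: 'Data Processing Cluster',
  instanceCount: 3,
  logBucket: 'my-bucket'
})
console.log(params.LogUri) // s3://my-bucket/emr-logs/
```

Templating the payload this way keeps cluster definitions in version control and makes them easy to validate before a costly launch.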

Instance groups vs instance fleets

Instance groups: Fixed instance types per group (master, core, task).

Instance fleets: Multiple instance types/options per group. EMR chooses based on availability and price.

{
  "Instances": {
    "InstanceFleets": [
      {
        "Name": "MasterFleet",
        "InstanceFleetType": "MASTER",
        "TargetOnDemandCapacity": 1,
        "InstanceTypeConfigs": [
          { "InstanceType": "m5.xlarge" },
          { "InstanceType": "m4.xlarge" }
        ]
      },
      {
        "Name": "CoreFleet",
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 2,
        "TargetSpotCapacity": 4,
        "InstanceTypeConfigs": [
          { "InstanceType": "m5.2xlarge" },
          { "InstanceType": "m4.2xlarge" }
        ],
        "LaunchSpecifications": {
          "SpotSpecification": {
            "TimeoutDurationMinutes": 60,
            "TimeoutAction": "SWITCH_TO_ON_DEMAND"
          }
        }
      }
    ]
  }
}
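A small helper can generate the core fleet config above and split capacity between on-demand and spot. As a sketch — the instance types, capacities, and 60-minute spot timeout are illustrative:

```javascript
// Sketch: build a CORE instance fleet mixing on-demand and spot capacity.
// Drop the result into Instances.InstanceFleets of a RunJobFlow payload.
function buildCoreFleet({ onDemand, spot, instanceTypes }) {
  return {
    Name: 'CoreFleet',
    InstanceFleetType: 'CORE',
    TargetOnDemandCapacity: onDemand,
    TargetSpotCapacity: spot,
    InstanceTypeConfigs: instanceTypes.map(t => ({ InstanceType: t })),
    LaunchSpecifications: {
      SpotSpecification: {
        TimeoutDurationMinutes: 60,
        // Fall back to on-demand if spot capacity can't be provisioned in time
        TimeoutAction: 'SWITCH_TO_ON_DEMAND'
      }
    }
  }
}

const fleet = buildCoreFleet({
  onDemand: 2,
  spot: 4,
  instanceTypes: ['m5.2xlarge', 'm4.2xlarge']
})
```

Listing several instance types per fleet gives EMR more room to find capacity, which matters most when you lean heavily on spot.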

Submitting Jobs as Steps

EMR jobs are executed as “steps” in sequence.

Add a Spark step

aws emr add-steps \
  --cluster-id j-ABC123DEF456 \
  --steps '[
    {
      "Name": "Process Data",
      "ActionOnFailure": "CONTINUE",
      "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
          "spark-submit",
          "--deploy-mode",
          "cluster",
          "--class",
          "com.example.DataProcessor",
          "s3://my-bucket/jars/processor.jar",
          "s3://my-bucket/input/",
          "s3://my-bucket/output/"
        ]
      }
    }
  ]'

Via API (AddJobFlowSteps)

{
  "JobFlowId": "j-ABC123DEF456",
  "Steps": [
    {
      "Name": "Spark ETL Job",
      "ActionOnFailure": "CONTINUE",
      "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
          "spark-submit",
          "--executor-memory",
          "4g",
          "--executor-cores",
          "2",
          "s3://my-bucket/scripts/process.py",
          "--input",
          "s3://my-bucket/input/",
          "--output",
          "s3://my-bucket/output/"
        ]
      }
    }
  ]
}
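The step shape above can also be generated from a helper so every pipeline submits consistently formed spark-submit steps. A sketch — the S3 paths and tuning flags are placeholders:

```javascript
// Sketch: build one AddJobFlowSteps entry that runs a PySpark script
// via command-runner.jar. Memory/core settings are example values.
function buildSparkStep({ name, script, input, output }) {
  return {
    Name: name,
    ActionOnFailure: 'CONTINUE',
    HadoopJarStep: {
      Jar: 'command-runner.jar',
      Args: [
        'spark-submit',
        '--executor-memory', '4g',
        '--executor-cores', '2',
        script,
        '--input', input,
        '--output', output
      ]
    }
  }
}

const step = buildSparkStep({
  name: 'Spark ETL Job',
  script: 's3://my-bucket/scripts/process.py',
  input: 's3://my-bucket/input/',
  output: 's3://my-bucket/output/'
})
// Submit with:
// client.send(new AddJobFlowStepsCommand({ JobFlowId: 'j-ABC123DEF456', Steps: [step] }))
```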

ActionOnFailure options

  • TERMINATE_CLUSTER: Stop cluster on failure
  • CANCEL_AND_WAIT: Cancel remaining steps, keep cluster running
  • CONTINUE: Continue with next step

Hive step

{
  "Name": "Hive Query",
  "HadoopJarStep": {
    "Jar": "command-runner.jar",
    "Args": [
      "hive-script",
      "--run-hive-script",
      "--args",
      "-f",
      "s3://my-bucket/scripts/transform.q"
    ]
  }
}

Auto-scaling

EMR can scale task nodes based on workload.

Create auto-scaling policy

aws emr put-auto-scaling-policy \
  --cluster-id j-ABC123DEF456 \
  --instance-group-id ig-ABC123 \
  --auto-scaling-policy '{
    "Constraints": {
      "MinCapacity": 2,
      "MaxCapacity": 10
    },
    "Rules": [
      {
        "Name": "ScaleOut",
        "Description": "Add nodes when available memory is low",
        "Action": {
          "SimpleScalingPolicyConfiguration": {
            "AdjustmentType": "CHANGE_IN_CAPACITY",
            "ScalingAdjustment": 2,
            "CoolDown": 300
          }
        },
        "Trigger": {
          "CloudWatchAlarmDefinition": {
            "ComparisonOperator": "LESS_THAN",
            "EvaluationPeriods": 3,
            "MetricName": "MemoryAvailableMB",
            "Namespace": "AWS/ElasticMapReduce",
            "Period": 300,
            "Threshold": 2000,
            "Statistic": "AVERAGE"
          }
        }
      }
    ]
  }'
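The policy body can be templated too. A sketch of a rule builder — the threshold and adjustment values are examples, not recommendations:

```javascript
// Sketch: build one scale-out rule for put-auto-scaling-policy.
// Scales out when the chosen metric (e.g. MemoryAvailableMB) drops
// below the threshold.
function buildScaleOutRule({ metric, threshold, adjustment }) {
  return {
    Name: 'ScaleOut',
    Description: `Add ${adjustment} nodes when ${metric} drops below ${threshold}`,
    Action: {
      SimpleScalingPolicyConfiguration: {
        AdjustmentType: 'CHANGE_IN_CAPACITY',
        ScalingAdjustment: adjustment,
        CoolDown: 300 // seconds before the rule can fire again
      }
    },
    Trigger: {
      CloudWatchAlarmDefinition: {
        ComparisonOperator: 'LESS_THAN',
        EvaluationPeriods: 3,
        MetricName: metric,
        Namespace: 'AWS/ElasticMapReduce',
        Period: 300,
        Threshold: threshold,
        Statistic: 'AVERAGE'
      }
    }
  }
}

const rule = buildScaleOutRule({ metric: 'MemoryAvailableMB', threshold: 2000, adjustment: 2 })
```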

Metrics for scaling

  • MemoryAvailableMB: Free memory
  • MemoryTotalMB: Total memory
  • HDFSUtilization: HDFS space used
  • AppsRunning: YARN applications running
  • AppsPending: YARN applications waiting

Monitoring and Logging

List clusters

aws emr list-clusters --states RUNNING

Describe cluster

aws emr describe-cluster --cluster-id j-ABC123DEF456

Response:

{
  "Cluster": {
    "Id": "j-ABC123DEF456",
    "Name": "My Cluster",
    "Status": {
      "State": "RUNNING",
      "StateChangeReason": {},
      "Timeline": {
        "CreationDateTime": "2026-03-24T10:00:00.000Z"
      }
    },
    "Applications": [
      { "Name": "Spark", "Version": "3.5.0" }
    ],
    "InstanceCollectionType": "INSTANCE_GROUP",
    "LogUri": "s3://my-bucket/emr-logs/",
    "MasterPublicDnsName": "ec2-12-34-56-78.compute-1.amazonaws.com"
  }
}

List steps

aws emr list-steps --cluster-id j-ABC123DEF456

Step status

{
  "Id": "s-ABC123",
  "Name": "Process Data",
  "Status": {
    "State": "COMPLETED",
    "Timeline": {
      "StartDateTime": "2026-03-24T10:05:00.000Z",
      "EndDateTime": "2026-03-24T11:30:00.000Z"
    }
  }
}
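When automating a pipeline, you typically poll step status until it reaches a terminal state. A minimal sketch — the state names are the documented EMR step states:

```javascript
// Sketch: classify EMR step states so a polling loop knows when to stop.
// Terminal step states in the EMR API: COMPLETED, FAILED, CANCELLED, INTERRUPTED.
const TERMINAL_STATES = new Set(['COMPLETED', 'FAILED', 'CANCELLED', 'INTERRUPTED'])

function isStepFinished(state) {
  return TERMINAL_STATES.has(state)
}

// Polling loop outline (DescribeStepCommand is from @aws-sdk/client-emr):
//   while (true) {
//     const { Step } = await client.send(
//       new DescribeStepCommand({ ClusterId: 'j-ABC123DEF456', StepId: 's-ABC123' }))
//     if (isStepFinished(Step.Status.State)) break
//     await new Promise(r => setTimeout(r, 30_000)) // wait 30s between polls
//   }

console.log(isStepFinished('COMPLETED')) // true
console.log(isStepFinished('RUNNING'))   // false
```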

CloudWatch integration

EMR publishes metrics to CloudWatch:

  • JobsFailed
  • JobsRunning
  • MemoryAvailableMB
  • MemoryTotalMB
  • HDFSUtilization

Cost Optimization

Use spot instances

Task nodes are ideal for spot instances. If they're terminated, jobs continue on remaining nodes.

{
  "Name": "TaskGroup",
  "InstanceRole": "TASK",
  "InstanceType": "m5.2xlarge",
  "InstanceCount": 4,
  "Market": "SPOT",
  "BidPrice": "0.10"
}

Transient clusters

Create clusters, run jobs, and auto-terminate:

{
  "KeepJobFlowAliveWhenNoSteps": false,
  "Steps": [
    { ... step 1 ... },
    { ... step 2 ... }
  ]
}

Cluster terminates after all steps complete.
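To make a cluster transient programmatically, combine a RunJobFlow payload with inline steps and disable keep-alive. A sketch — note that in the RunJobFlow API, KeepJobFlowAliveWhenNoSteps lives under Instances:

```javascript
// Sketch: turn an existing RunJobFlow payload into a transient cluster request.
// The cluster auto-terminates once every step has finished.
function makeTransient(clusterParams, steps) {
  return {
    ...clusterParams,
    Steps: steps,
    Instances: {
      ...clusterParams.Instances,
      KeepJobFlowAliveWhenNoSteps: false
    }
  }
}

const transient = makeTransient(
  { Name: 'Nightly ETL', Instances: { InstanceCount: 3 } },
  [{ Name: 'step 1' }, { Name: 'step 2' }]
)
```

This pattern pairs well with a scheduler (EventBridge, Airflow, cron): each run launches a fresh cluster, runs its steps, and leaves nothing billing overnight.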

Instance fleets with multiple options

Let EMR select the cheapest available instance type:

{
  "InstanceTypeConfigs": [
    { "InstanceType": "m5.2xlarge", "BidPrice": "0.15" },
    { "InstanceType": "m4.2xlarge", "BidPrice": "0.12" },
    { "InstanceType": "c5.2xlarge", "BidPrice": "0.10" }
  ]
}

Testing with Apidog

EMR clusters are expensive. Test your configurations to avoid costly mistakes.


1. Validate cluster configurations

Save cluster templates in Apidog:

pm.test('Cluster has required applications', () => {
  const config = pm.request.body.toJSON()
  const apps = config.Applications.map(a => a.Name)
  pm.expect(apps).to.include('Spark')
})

pm.test('Instance types are valid', () => {
  const config = pm.request.body.toJSON()
  const types = ['m5.xlarge', 'm5.2xlarge', 'm4.xlarge']
  pm.expect(types).to.include(config.Instances.MasterInstanceType)
})

2. Test step definitions

pm.test('Spark step has valid args', () => {
  const step = pm.request.body.toJSON().Steps[0]
  const args = step.HadoopJarStep.Args
  pm.expect(args[0]).to.eql('spark-submit')
  pm.expect(args).to.include('--deploy-mode')
})

3. Environment variables

AWS_REGION: us-east-1
EMR_SERVICE_ROLE: EMR_DefaultRole
EMR_EC2_ROLE: EMR_EC2_DefaultRole
S3_LOG_BUCKET: my-emr-logs
S3_SCRIPTS_BUCKET: my-emr-scripts

Test AWS APIs with Apidog - free

Common Errors and Fixes

ValidationError: ServiceRole is not valid

Cause: IAM role doesn’t exist or isn’t configured for EMR.

Fix: Create the service role in the IAM console, run aws emr create-default-roles, or use the AWS default EMR_DefaultRole_V2.

Failed to provision EC2 instances

Cause: Instance type unavailable in your AZ, or service limits reached.

Fix:

  • Use instance fleets with multiple instance types
  • Request a service limit increase
  • Try different instance types

Step failed with Application exit code 1

Cause: The Spark/Hadoop job failed.

Fix: Check logs in S3 (LogUri path). Review stderr and stdout for the step.

Cluster stuck in STARTING state

Cause: Failing bootstrap actions or permission issues.

Fix: Check EC2 instance console output. Verify S3 access for bootstrap scripts.

Alternatives and Comparisons

| Feature | AWS EMR | Google Dataproc | Azure HDInsight | Databricks |
| --- | --- | --- | --- | --- |
| Managed engines | Hadoop/Spark | Hadoop/Spark | Hadoop/Spark | Spark only |
| AWS integration | Excellent | Limited | Limited | Good |
| Serverless option | EMR Serverless | Dataproc Serverless | Limited | Yes |
| Cost | Spot support | Preemptible VMs | Spot instances | Good |
| ML support | EMR Studio | Vertex AI | Synapse | MLflow built-in |

EMR offers the deepest AWS integration. Databricks provides advanced Spark tooling. Dataproc is cost-effective on GCP.

Real-world Use Cases

Data lake ETL: Retailers process daily sales data by ingesting CSVs from S3, transforming with Spark, and saving Parquet files. Clusters run for 2 hours daily and terminate.

Log analytics: SaaS companies aggregate logs from S3 with Spark, summarize metrics, and store results in a data warehouse. Auto-scaling manages spikes in log volume.

Machine learning pipeline: Data science teams use EMR for training MLlib models, then export to SageMaker for deployment.

Wrapping Up

Key actionable steps:

  • Create clusters with RunJobFlow API
  • Submit jobs as steps
  • Use auto-scaling for cost efficiency
  • Monitor with CloudWatch
  • Optimize costs with spot instances and transient clusters

Your next steps:

  1. Set up IAM roles for EMR
  2. Create a test cluster
  3. Submit a simple Spark job
  4. Review logs in S3
  5. Implement cost-saving strategies

Test AWS APIs with Apidog - free

FAQ

What’s the difference between master, core, and task nodes?

  • Master: Runs cluster manager (YARN ResourceManager, HDFS NameNode)
  • Core: Runs data processing and stores HDFS data
  • Task: Runs data processing only, no HDFS data (good for spot instances)

How do I SSH into the master node?

aws emr ssh --cluster-id j-ABC123DEF456 --key-pair-file my-key.pem

Can I run Jupyter notebooks on EMR?

Yes. Use EMR Studio, enable the JupyterHub application, or use EMR Notebooks (managed Jupyter).

What’s EMR Serverless?

A serverless option to submit Spark/Hive jobs without managing clusters. Pay per job run. Best for sporadic workloads.

How do I read from DynamoDB?

Use the DynamoDB connector:

spark-submit --conf spark.hadoop.dynamodb.servicename=dynamodb \
  --conf spark.hadoop.dynamodb.input.tableName=MyTable \
  --conf spark.hadoop.dynamodb.output.tableName=MyTable \
  --conf spark.hadoop.dynamodb.region=us-east-1 \
  my-job.jar

What release label should I use?

Use the latest stable (emr-7.x for Spark 3.x). Align versions across environments. Check compatibility in release notes.

How do I troubleshoot failed steps?

  1. Check step status: aws emr describe-step --cluster-id j-ABC123 --step-id s-DEF123
  2. View logs in S3: s3://your-log-bucket/logs/j-ABC123/steps/s-DEF123/
  3. SSH to master and check /mnt/var/log/
