
How to Use Spark Connect on EMR from Local Environment

Spark Connect allows you to run Spark jobs remotely, enabling local development against an EMR cluster. This guide covers setup, configuration, and common issues.

Prerequisites

  • AWS EMR cluster with Spark 3.4.0 or later
  • SSH access to EMR master node
  • Python environment on your local machine
  • Network access to EMR cluster (VPN or direct)

Reference

Spark Connect Official Documentation: https://spark.apache.org/docs/latest/spark-connect-overview.html

Setting Up Spark Connect Server on EMR

Spark Connect is available starting with Spark 3.4.0. Start the Connect server on the EMR master node:

sudo /usr/lib/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:{your-spark-version}

Note: Replace {your-spark-version} with your actual Spark version (e.g., 3.4.1).
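
To confirm the server came up, check that it is listening on the Spark Connect default port, 15002. A minimal reachability check from your local machine (a sketch; replace the placeholder with your master node address):

import socket

# the Spark Connect server listens on port 15002 by default
host = "{emr-cluster-master-ip}"  # placeholder: your EMR master address
with socket.create_connection((host, 15002), timeout=5):
    print("Spark Connect port is reachable")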

Configuring Local Environment

Version compatibility is critical: the local PySpark version must match the Spark version on the EMR cluster, and grpcio/protobuf must be releases compatible with that PySpark version. Mismatches cause confusing errors.

pip install pyspark==3.4.1
pip install grpcio-status==1.64.0
pip install grpcio==1.64.0
pip install protobuf==5.27.0
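
Before connecting, it helps to sanity-check what is actually installed locally. A quick sketch using only the standard library:

from importlib.metadata import version

# print the client-side versions pinned above
for pkg in ("pyspark", "grpcio", "grpcio-status", "protobuf"):
    print(pkg, version(pkg))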

Connecting to Spark Connect

Using PySpark Shell

pyspark --remote "sc://{emr-cluster-master-ip}"

The sc:// URL uses port 15002 by default; to be explicit, connect with sc://{emr-cluster-master-ip}:15002.

Using Python Script

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark Connect Example") \
    .remote("sc://{emr-cluster-master-ip}") \
    .getOrCreate()

# Now you can use spark as usual
df = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "id"])
df.show()

SparkContext Limitations

Spark Connect does not expose SparkContext on the client, so SparkContext-based functions are not available. Common ones and their alternatives:

  • sc.setCheckpointDir: set the checkpoint directory on the server side (e.g., via spark.sparkContext.setCheckpointDir() there)
  • sc.addPyFile: pre-install packages on the cluster
  • sc.install_pypi_package: pre-install packages on the cluster
  • sc.parallelize: use spark.createDataFrame(), as shown in the sketch below
  • sc.setLogLevel: configure the log level on the server side
  • sc.broadcast: use DataFrame operations (e.g., joins) instead
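
For example, code that used sc.parallelize can usually be rewritten with spark.createDataFrame. A minimal sketch, assuming the Connect server address from the examples above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://{emr-cluster-master-ip}").getOrCreate()

# replaces sc.parallelize(...).toDF(): build a DataFrame directly on the session
df = spark.createDataFrame([(i, i * i) for i in range(5)], ["n", "n_squared"])
df.show()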

Important: Only a single Connect server can run on the cluster at a time.

Troubleshooting

Error: [NOT_ITERABLE] Column is not iterable

pyspark.errors.exceptions.base.PySparkTypeError: [NOT_ITERABLE] Column is not iterable.

Cause: Protobuf version incompatibility

Solution: Ensure the client-side protobuf version matches what the server expects:

pip install protobuf==5.27.0
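
To see which protobuf runtime the client is actually importing, check it directly:

# print the protobuf runtime version in the current environment
from google.protobuf import __version__ as pb_version

print("protobuf:", pb_version)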

Connection Refused

Cause: Firewall or security group blocking port 15002

Solution:

  1. Add an inbound rule for port 15002 to the EMR master's security group, or
  2. Use an SSH tunnel:
ssh -L 15002:localhost:15002 hadoop@{emr-master-ip}
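
With the tunnel running, point the client at localhost; port 15002 is forwarded to the EMR master:

from pyspark.sql import SparkSession

# connects through the SSH tunnel rather than the master's address
spark = SparkSession.builder \
    .remote("sc://localhost:15002") \
    .getOrCreate()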

Version Mismatch Errors

Cause: The local PySpark version doesn't match the EMR Spark version

Solution: Install the exact same version:

# Check EMR Spark version
spark-submit --version

# Install matching local version
pip install pyspark=={same-version}
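
After connecting, you can compare both sides in one place; over Connect, spark.version reports the server's Spark version. A sketch, assuming the session setup shown earlier:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://{emr-cluster-master-ip}").getOrCreate()

# the client and server versions should match exactly
print("client:", pyspark.__version__)
print("server:", spark.version)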

Best Practices

  1. Use virtual environments: Isolate Spark Connect dependencies
  2. Match versions exactly: Minor version differences can cause issues
  3. Use SSH tunneling: More secure than opening ports
  4. Monitor server resources: Connect server adds overhead to master node

Originally published at https://dss99911.github.io
