DEV Community

vinoth chandar

Using Your Own Apache Spark/Hudi Versions With AWS EMR

Sometimes it's useful to run your own version of Apache Spark/Hudi on an AWS EMR cluster you provisioned. You get the best of both worlds: all the AWS tooling, plus the latest Spark and the latest Hudi.

This is a simple post on how you can accomplish this. First, create your EMR cluster; the steps below work for EMR 6.2.
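If you prefer the CLI to the console, a cluster along these lines should work (a sketch only; the name, key pair, and instance sizing are placeholders to adjust for your account):

```shell
# Sketch only; assumes the default EMR IAM roles already exist in your account.
aws emr create-cluster \
  --name "spark-hudi-dev" \
  --release-label emr-6.2.0 \
  --applications Name=Hadoop Name=Spark Name=Hive \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=<your_key_pair>
```

The command prints the cluster id; the master public DNS used below shows up on the cluster's status page once it is running.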

Step 1: Build Hudi and copy the spark-bundle over

On your local Mac/Linux box:

# You can get this from the cluster's status page
export EMR_MASTER=<your_emr_master_public_dns>
# So you can build your own bundles and deploy
export HUDI_REPO=/path/to/hudi/repo
cd ${HUDI_REPO}
# Build the Spark 3 bundle (Scala 2.12, to match Spark 3)
mvn clean package -DskipTests -Dspark3 -Dscala-2.12

export HUDI_SPARK_BUNDLE=hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar
scp -i /path/to/key.pem ${HUDI_REPO}/packaging/hudi-spark-bundle/target/${HUDI_SPARK_BUNDLE} hadoop@${EMR_MASTER}:~/ 
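Optionally, before copying, you can sanity-check that the bundle jar actually contains the Hudi classes (a quick grep over the jar listing; `org/apache/hudi` is simply the package prefix Hudi classes live under):

```shell
# Count Hudi class entries in the bundle; a non-zero count means the build is sane.
jar tf ${HUDI_REPO}/packaging/hudi-spark-bundle/target/${HUDI_SPARK_BUNDLE} | grep -c "org/apache/hudi"
```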


Step 2: Install Spark 3 with AWS Jars

ssh to the EMR master node.

ssh -i /path/to/key.pem hadoop@${EMR_MASTER}

Then proceed to download Spark 3.

# For hadoop-aws 3.2+, we also need the matching AWS SDK bundle jar.
export HADOOP_VERSION=3.2.0
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar -O hadoop-aws.jar
export AWS_SDK_VERSION=1.11.375
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar -O aws-java-sdk.jar

# Install plain Apache Spark on your own; the Hadoop version must match the one on the EMR cluster.
wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
tar -zxvf spark-3.0.1-bin-hadoop3.2.tgz
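Before moving on, it's worth confirming the versions line up (the exact Hadoop build on your EMR release may differ slightly, so treat this as a sanity check rather than a hard requirement):

```shell
# Spark should report 3.0.1, built for Hadoop 3.2.
spark-3.0.1-bin-hadoop3.2/bin/spark-submit --version
# EMR's own Hadoop version; the hadoop-aws/aws-sdk jars above should be compatible with it.
hadoop version
```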

Step 3: Fire up your spark-shell

Set the following environment variables:

export HADOOP_CONF_DIR=/etc/hadoop/conf 
export AWS_ACCESS_KEY_ID=<your key id>
export AWS_SECRET_ACCESS_KEY=<your access key>

Then launch the spark-shell:

export SCALA_VERSION=2.12
export SPARK_VERSION=3.0.1
export HUDI_JAR=~/hudi-spark-bundle_${SCALA_VERSION}-0.8.0-SNAPSHOT.jar
export AWS_JARS="${HOME}/hadoop-aws.jar,${HOME}/aws-java-sdk.jar"
export JARS="${HUDI_JAR},${AWS_JARS}"

bin/spark-shell \
    --driver-memory 8g --executor-memory 8g  \
    --master yarn --deploy-mode client \
    --num-executors 2 --executor-cores 4 \
    --conf spark.rdd.compress=true \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.hadoop.yarn.timeline-service.enabled=false \
    --conf spark.driver.userClassPathFirst=true \
    --conf spark.executor.userClassPathFirst=true \
    --conf spark.ui.proxyBase="" \
    --jars ${JARS} \
    --packages org.apache.spark:spark-avro_${SCALA_VERSION}:${SPARK_VERSION} \
    --conf "spark.memory.storageFraction=0.8" \
    --conf "spark.driver.extraJavaOptions=-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70" \
    --conf "spark.executor.extraJavaOptions=-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70" 

scala> // Spark UI is otherwise broken.
scala> sys.props.update("spark.ui.proxyBase", "")
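To verify the whole setup end to end, you can run a small write and read from the shell, loosely following the Hudi quickstart. The bucket name below is a placeholder; everything else is the stock quickstart data generator that ships in the bundle.

```scala
// Smoke test in the spark-shell; replace <your_bucket> with an S3 bucket you can write to.
import org.apache.hudi.QuickstartUtils._
import org.apache.spark.sql.SaveMode._
import scala.collection.JavaConverters._

val tableName = "hudi_smoke_test"
val basePath = "s3a://<your_bucket>/hudi_smoke_test"

// Generate 10 sample trip records and write them as a Hudi table.
val dataGen = new DataGenerator
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts.asScala, 2))
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  option("hoodie.table.name", tableName).
  mode(Overwrite).
  save(basePath)

// Read it back; a count of 10 means Spark, Hudi, and the S3 jars all work together.
spark.read.format("hudi").load(basePath + "/*/*/*/*").count()
```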
