Using Your Own Apache Spark/Hudi Versions With AWS EMR

#datascience #bigdata #aws #apachehudi

Sometimes it's useful to run your own version of Apache Spark/Hudi on an AWS EMR cluster you provisioned. You get the best of both worlds: all the AWS tooling, plus the latest Spark and the latest Hudi.

This is a simple post on how you can accomplish this. First, create your EMR cluster. The following steps work for EMR 6.2.

Step 1: Build Hudi and copy the spark-bundle over

On your local Mac/Linux box:

# You can get this from the cluster's status page
export EMR_MASTER=<your_emr_master_public_dns>
# So you can build your own bundles and deploy
export HUDI_REPO=/path/to/hudi/repo
cd ${HUDI_REPO}
mvn clean package -DskipTests -Dspark3

export HUDI_SPARK_BUNDLE=hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar
scp -i /path/to/key.pem ${HUDI_REPO}/packaging/hudi-spark-bundle/target/${HUDI_SPARK_BUNDLE} hadoop@${EMR_MASTER}:~/
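
The bundle file name changes with the Hudi and Scala versions you build, so hard-coding it is easy to get wrong. Here is a minimal sketch that looks the jar up instead, assuming it lands under packaging/hudi-spark-bundle/target as above. `find_hudi_bundle` is a hypothetical helper name, not something Hudi ships.

```shell
# Hypothetical helper: print the path of the first Hudi Spark bundle jar
# found under the given repo checkout, skipping -sources and -tests jars.
find_hudi_bundle() {
  local repo="$1"
  local target="${repo}/packaging/hudi-spark-bundle/target"
  local jar
  for jar in "${target}"/hudi-spark*-bundle_*.jar; do
    case "${jar}" in
      *-sources.jar|*-tests.jar) continue ;;
    esac
    # An unmatched glob stays a literal pattern, so confirm the file exists.
    if [ -f "${jar}" ]; then
      echo "${jar}"
      return 0
    fi
  done
  echo "no Hudi Spark bundle found in ${target}" >&2
  return 1
}
```

You could then run `BUNDLE=$(find_hudi_bundle "${HUDI_REPO}")` and pass `${BUNDLE}` to scp, which fails early if the build did not produce a jar.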

Step 2: Install Spark 3 with AWS Jars

ssh to the EMR master node.

ssh -i /path/to/key.pem hadoop@${EMR_MASTER}

Then download the AWS jars and Spark 3.

# For hadoop-aws 3.2 and later, we need the AWS SDK bundle jar.
export HADOOP_VERSION=3.2.0
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar -O hadoop-aws.jar
export AWS_SDK_VERSION=1.11.375
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar -O aws-java-sdk.jar 


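# Install your own Spark version; plain Apache Spark is enough. It needs to match the Hadoop version of the EMR cluster you created.
# Older releases such as 3.0.1 are served from archive.apache.org rather than downloads.apache.org.
wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
tar -zxvf spark-3.0.1-bin-hadoop3.2.tgz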
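
Since the Spark build has to match the cluster's Hadoop version, a quick check on the master node can save some debugging. This is only a sketch: `major_minor` is a hypothetical helper, and `SPARK_HADOOP_PROFILE` is an assumed variable holding the `3.2` from the `hadoop3.2` suffix in the tarball name.

```shell
# Hypothetical helper: reduce a version string such as "3.2.1-amzn-2" to "3.2".
major_minor() {
  echo "$1" | cut -d. -f1,2
}

# Assumed variable: the Hadoop line the downloaded Spark build targets.
SPARK_HADOOP_PROFILE=3.2

# Only run the comparison where the hadoop CLI is available (the EMR master).
if command -v hadoop >/dev/null 2>&1; then
  # The first line of `hadoop version` looks like "Hadoop <version>".
  CLUSTER_HADOOP=$(hadoop version | awk 'NR==1 {print $2}')
  if [ "$(major_minor "${CLUSTER_HADOOP}")" = "${SPARK_HADOOP_PROFILE}" ]; then
    echo "Hadoop ${CLUSTER_HADOOP} matches the hadoop${SPARK_HADOOP_PROFILE} Spark build"
  else
    echo "Hadoop ${CLUSTER_HADOOP} does not match hadoop${SPARK_HADOOP_PROFILE}" >&2
  fi
fi
```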

Step 3: Fire up your spark-shell

Set the following environment variables:

export HADOOP_CONF_DIR=/etc/hadoop/conf 
export AWS_ACCESS_KEY_ID=<your key id>
export AWS_SECRET_ACCESS_KEY=<your access key>
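
If one of these is missing, the failure tends to show up much later, when Spark first touches YARN or S3. A small preflight check can catch it up front. This is a sketch, and `require_env` is a hypothetical helper name.

```shell
# Hypothetical helper: report every named variable that is unset or empty,
# and return non-zero if any are missing.
require_env() {
  local name value missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

For example, `require_env HADOOP_CONF_DIR AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY || exit 1` stops a launch script before spark-shell starts.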

Set up and launch a Spark shell from the extracted Spark directory (spark-3.0.1-bin-hadoop3.2):

export SCALA_VERSION=2.12
export SPARK_VERSION=3.0.1
export HUDI_JAR=~/hudi-spark-bundle_${SCALA_VERSION}-0.8.0-SNAPSHOT.jar
export AWS_JARS="${HOME}/hadoop-aws.jar,${HOME}/aws-java-sdk.jar"
export JARS="${HUDI_JAR},${AWS_JARS}"

bin/spark-shell \
    --driver-memory 8g --executor-memory 8g  \
    --master yarn --deploy-mode client \
    --num-executors 2 --executor-cores 4 \
    --conf spark.rdd.compress=true \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.hadoop.yarn.timeline-service.enabled=false \
    --conf spark.driver.userClassPathFirst=true \
    --conf spark.executor.userClassPathFirst=true \
    --conf spark.ui.proxyBase="" \
    --jars ${JARS} \
    --packages org.apache.spark:spark-avro_${SCALA_VERSION}:${SPARK_VERSION} \
    --conf "spark.memory.storageFraction=0.8" \
    --conf "spark.driver.extraJavaOptions=-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70" \
    --conf "spark.executor.extraJavaOptions=-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70"

scala> // Spark UI is otherwise broken.
scala> sys.props.update("spark.ui.proxyBase", "")
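
A wrong path in the comma-separated jar list is easy to miss among the startup logs, so it can be worth checking the list in bash before launching spark-shell. A minimal sketch follows; `check_jars` is a hypothetical helper name.

```shell
# Hypothetical helper: verify that every path in a comma-separated jar list
# points at an existing file; return non-zero if any is missing.
check_jars() {
  local list="$1" jar status=0
  local old_ifs="${IFS}"
  IFS=','
  # Unquoted on purpose so the list is split on commas.
  for jar in ${list}; do
    if [ ! -f "${jar}" ]; then
      echo "jar not found: ${jar}" >&2
      status=1
    fi
  done
  IFS="${old_ifs}"
  return "${status}"
}
```

Calling `check_jars "${JARS}"` right after the exports in Step 3 would flag a mistyped Hudi or AWS jar path before the shell comes up.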