Installing and Running Hadoop and Spark on Windows
Andrew
Nov 5 '18
We recently got a big new server at work to run Hadoop and Spark (H/S) for a proof-of-concept test of some software we're writing for the biopharmaceutical industry, and I hit a few snags while trying to get H/S up and running on Windows Server 2016. I've documented here, step by step, how I managed to install and run this pair of Apache products directly in the Windows cmd prompt, without any need for Linux emulation.
Get the Software
The first step is to download Java, Hadoop, and Spark. Spark seems to have trouble working with newer versions of Java, so I'm sticking with Java 8 for now:
For Java, I download the "Windows x64" version of the JDK (jdk-8u191-windows-x64.exe); for Hadoop, the binary of v3.1.1 (hadoop-3.1.1.tar.gz); for Spark, v2.3.2 "Pre-built for Apache Hadoop 2.7 and later" (spark-2.3.2-bin-hadoop2.7.tgz).
Move all three of these files to C:\. Download 7-Zip to extract the *.gz archives. Once they're extracted (Hadoop takes a while), you should have two directories and the JDK installer.
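If you'd rather extract from the command line (assuming 7-Zip's 7z.exe is installed and on your PATH), note that the .tar.gz and .tgz archives take two passes, one to un-gzip and one to un-tar:
:: first pass un-gzips, second pass un-tars (repeat for the Spark archive)
7z x C:\hadoop-3.1.1.tar.gz -oC:\
7z x C:\hadoop-3.1.1.tar -oC:\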
Run the Java installer, but change the destination folder from the default C:\Program Files\Java\jdk1.8.0_191 to just C:\Java. (H/S can have trouble with directories that have spaces in their names.)
Another box will pop up asking for the "Destination Folder" again. This time, use C:\Java\jre1.8.0_191. Make two more directories in C:\ called C:\Hadoop and C:\Spark, and copy the hadoop-3.1.1 and spark-2.3.2-bin-hadoop2.7 directories into them, respectively:
If you get "name too long"-type warnings, skip those files. These are only *.html files and aren't critical to running H/S.
Set Up Your Environment Variables
Next, we need to set some environment variables. Go to Control Panel > System and Security > System > Advanced System Settings > Environment Variables...:
Add new System variables (bottom box) called:
- JAVA_HOME --> C:\Java
- HADOOP_HOME --> C:\Hadoop\hadoop-3.1.1
- SPARK_HOME --> C:\Spark\spark-2.3.2-bin-hadoop2.7
(Adjust according to the versions of Hadoop and Spark that you've downloaded.)
Then, edit the Path and add those variables with \bin appended (also \sbin for Hadoop):
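%JAVA_HOME%\bin
%HADOOP_HOME%\bin
%HADOOP_HOME%\sbin
%SPARK_HOME%\bin
(Those are the entries for the versions above. If you prefer the command line, the three variables themselves can also be created from an elevated cmd prompt with setx, e.g. setx /M JAVA_HOME "C:\Java", and likewise for HADOOP_HOME and SPARK_HOME.)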
If you echo %PATH% in cmd, you should now see these three directories somewhere in the middle of the path, because the User Path is appended to the System Path in the %PATH% variable. You should check now that java -version, hdfs -version, and spark-shell --version all return version numbers, as shown below. This means that they were correctly installed and added to your %PATH%:
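java -version
hdfs -version
spark-shell --version
(Each of these should print version information; a "'java' is not recognized as an internal or external command" style of error means the corresponding Path entry is wrong.)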
Configure Hadoop
Next, go to C:\Hadoop\hadoop-3.1.1\etc\hadoop and edit (or create) the file core-site.xml so it looks like the following:
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
In the same directory, edit (or create) mapred-site.xml with the following contents:
mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Next, edit (or create) hdfs-site.xml:
hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///C:/Hadoop/hadoop-3.1.1/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///C:/Hadoop/hadoop-3.1.1/datanode</value>
  </property>
</configuration>
...yes, they should be forward slashes, even though Windows uses backslashes. This is due to the way that Hadoop interprets these file paths. Finally, edit yarn-site.xml so it reads:
yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
The last thing we need to do is create the directories that we referenced in hdfs-site.xml:
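:: create the namenode and datanode directories referenced in dfs.namenode.name.dir and dfs.datanode.data.dir above
mkdir C:\Hadoop\hadoop-3.1.1\namenode
mkdir C:\Hadoop\hadoop-3.1.1\datanode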
Patch Hadoop
Now, you need to apply a patch created and posted to GitHub by user s911415. (Note that this patch is specific to the version of Hadoop that you're installing; patches for other versions can be found here and here.) In particular, we need hadoop.dll and winutils.exe from these patches.
Download the entire directory at that link as a *.zip and extract the bin directory from it. Replace C:\Hadoop\hadoop-3.1.1\bin with this new directory. (If you want, save the old directory by renaming it to bin.old.)
Now, if you run hdfs namenode -format in cmd, you should see:
One more thing to do: copy the hadoop-yarn-server-timelineservice jar (hadoop-yarn-server-timelineservice-3.1.1.jar for this version of Hadoop) from C:\Hadoop\hadoop-3.1.1\share\hadoop\yarn\timelineservice to C:\Hadoop\hadoop-3.1.1\share\hadoop\yarn (the parent directory).
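From cmd, that's a single copy command (double-check the exact jar name in your timelineservice folder first, since it tracks the Hadoop version):
copy C:\Hadoop\hadoop-3.1.1\share\hadoop\yarn\timelineservice\hadoop-yarn-server-timelineservice-3.1.1.jar C:\Hadoop\hadoop-3.1.1\share\hadoop\yarn\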
Boot HDFS
Finally, you can boot HDFS by running start-dfs.cmd and start-yarn.cmd in cmd:
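start-dfs.cmd
start-yarn.cmd
(Both scripts live in %HADOOP_HOME%\sbin, which the Path changes above already cover, so they can be run from any directory.)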
You should verify that the namenode, datanode, resourcemanager, and nodemanager are all running. You can also open localhost:8088 and localhost:9870 in your browser to monitor your shiny, new Hadoop Distributed File System.
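A quick way to check those four daemons (assuming the JDK's bin directory is on your Path, as set up earlier) is jps, which lists the running Java processes:
:: should show NameNode, DataNode, ResourceManager, and NodeManager (plus Jps itself)
jps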
Finally, test that you can edit the filesystem by running hadoop fs -mkdir /test, which will make a directory called test in the root directory:
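hadoop fs -mkdir /test
hadoop fs -ls /
(The fs -ls just lists the root of HDFS to confirm the new directory is there.)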
Testing Hadoop and Spark
We now know how to create directories (fs -mkdir) and list their contents (fs -ls) in HDFS, but what about creating and editing files? Well, files can be copied from the local file system to HDFS with fs -put. We can then read those files in the spark-shell with sc.textFile(...).
Note that you read a file from HDFS at hdfs://localhost:9000/ and not just hdfs://, because this is the defaultFS we defined in core-site.xml.
So there you have it! Spark running on Windows, reading files stored in HDFS. This took a bit of work to get going, and I owe a lot to the people who encountered the same bugs before me or wrote the tutorials I used as a framework for this walkthrough. Here are the blogs, GitHub repos, and SO posts I used to build this tutorial:
- Muhammad Bilal Yar's Hadoop 2.8.0 walkthrough
- java.net.URISyntaxException
- java.lang.UnsatisfiedLinkError
- FATAL resourcemanager.ResourceManager
- localhost:50070 error
- Kuldeep Singh's walkthrough and troubleshooting guide
- Jacek Laskowski's GitBook
- java.io.IOException: Incompatible clusterIDs
- HDFS basic commands
- Spark basic commands
Comments
But ... why? Just get Fedora and done ;)
Client-specified software that only runs on Windows Server :/
Well, that's sad. Have you thought about using smth. like an IIS container for those proprietary blobs?
I haven't, no... how would that work? Can you point me to any good resources?
See the Docker hub for more info, although I don't use it personally (I use & write FLOSS exclusively)
My god, I've spent an insane amount of time on this for an assignment, and this was the only thing I've gotten to work. Thank you for putting this together.
Happy to help!
Thanks for the guide! Just noticed a small typo with one port number: localhost:9087 instead of localhost:9870. (I should have looked at the image :)
Thanks for pointing that out! Typo is fixed :)