Andrew (he/him)

Posted on Nov 5, 2018 • Edited on Dec 18, 2019

Installing and Running Hadoop and Spark on Windows

#hadoop #spark #windows #tutorial

We recently got a big new server at work to run Hadoop and Spark (H/S) for a proof-of-concept test of some software we're writing for the biopharmaceutical industry. I hit a few snags while trying to get H/S up and running on Windows Server 2016 / Windows 10, so I've documented here, step by step, how I managed to install and run this pair of Apache products directly in the Windows cmd prompt, without any need for Linux emulation.

Update 16 Dec 2019: Software version numbers have been updated and the text has been clarified.

Get the Software

The first step is to download Java, Hadoop, and Spark. Spark seems to have trouble working with newer versions of Java, so I'm sticking with Java 8 for now:

I can't guarantee that this guide works with newer versions of Java. Please try with Java 8 if you're having issues. Also, with the new Oracle licensing structure (2019+), you may need to create an Oracle account to download Java 8. To avoid this, simply download from AdoptOpenJDK instead.

For Java, I downloaded the "Windows x64" version of the AdoptOpenJDK HotSpot JVM (jdk8u232-b09); for Hadoop, the binary of v3.1.3 (hadoop-3.1.3.tar.gz); for Spark, the v3.0.0 preview release "Pre-built for Apache Hadoop 2.7 and later" (spark-3.0.0-preview-bin-hadoop2.7.tgz). From this point on, I'll refer to these directories generically as hadoop-<version> and spark-<version>; please substitute your own version numbers throughout the rest of this tutorial.

Even though newer versions of Hadoop and Spark are currently available, there is a bug in Hadoop 3.2.1 on Windows that causes installation to fail. Until a patched version is available (3.3.0, 3.1.4, or 3.2.2), you must use an earlier version of Hadoop on Windows.

Next, download 7-Zip to extract the *.tar.gz and *.tgz archives. Note that you may need to extract twice (once to get from the compressed archive to a *.tar file, then a second time to "untar"). Once they're extracted (Hadoop takes a while), you can delete all of the *.tar, *.tar.gz, and *.tgz files. You should now have two directories and the JDK installer in your Downloads directory.

Note that the "Hadoop" directory and the "Spark" directory should each contain a LICENSE, NOTICE, and README file at the top level. With particular versions of Hadoop, you may extract and get a directory structure like

C:\Users\<username>\Downloads\hadoop-<version>\hadoop-<version>\...

...if this is the case, move the contents of the inner hadoop-<version> directory to the outer hadoop-<version> directory by copying-and-pasting, then delete the inner hadoop-<version> directory. The path to the LICENSE file, for example, should then be:

C:\Users\<username>\Downloads\hadoop-<version>\LICENSE

...and similar for the "Spark" directory.
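If you'd rather script the "move the inner directory up one level" fix than copy and paste by hand, here is a minimal Python sketch. The function name flatten_nested_dir is made up for this example, and it assumes the outer directory contains nothing except the identically named inner directory.

```python
import shutil
from pathlib import Path


def flatten_nested_dir(outer: Path) -> bool:
    """Move the contents of outer/<outer name>/ up into outer/.

    Returns True if a nested directory was found and flattened,
    False if there was nothing to do.
    """
    inner = outer / outer.name
    if not inner.is_dir():
        return False
    # Rename the inner directory first, so an entry inside it that has the
    # same name as the directory itself cannot collide during the move.
    staging = outer / (outer.name + ".flatten-tmp")
    inner.rename(staging)
    for entry in staging.iterdir():
        shutil.move(str(entry), str(outer / entry.name))
    staging.rmdir()  # the staging directory is empty at this point
    return True
```

For example, flatten_nested_dir(Path.home() / "Downloads" / "hadoop-3.1.3") would leave LICENSE directly inside the outer hadoop-3.1.3 directory (substitute your own version number).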

WARNING: If you see a message like "Can not create symbolic link : A required privilege is not held by the client" in 7-Zip, you MUST run 7-Zip in Administrator Mode, then extract the archives again. If you skip these files, you may end up with a broken Hadoop installation.

Move the Spark and Hadoop directories into the C:\ directory (you may need administrator privileges on your machine to do this). Then, run the Java installer but change the destination folder from the default C:\Program Files\AdoptOpenJDK\jdk-<version>\ to just C:\Java. (H/S can have trouble with directories with spaces in their names.)

Once the installation is finished, you can delete the Java *.msi installer. Make two new directories called C:\Hadoop and C:\Spark and copy the hadoop-<version> and spark-<version> directories into those directories, respectively, so that you end up with C:\Hadoop\hadoop-<version> and C:\Spark\spark-<version>.

If you get "name too long"-type warnings, skip those files. These are only *.html files and aren't critical to running H/S.

Set Up Your Environment Variables

Next, we need to set some environment variables. Go to Control Panel > System and Security > System > Advanced System Settings > Environment Variables...:

...and add new System variables (bottom box) called:

JAVA_HOME --> C:\Java
HADOOP_HOME --> C:\Hadoop\hadoop-<version>
SPARK_HOME --> C:\Spark\spark-<version>

(Adjust according to the versions of Hadoop and Spark that you've downloaded.)
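As a quick sanity check on these three variables, something like the following Python sketch can help. It is not part of the original setup; the function name check_home_variables is invented here, and the checks (set, no spaces, points at an existing directory) are just the conditions this guide relies on.

```python
import os
from pathlib import Path

REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME")


def check_home_variables(env=None):
    """Return a list of problems; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif " " in value:
            # H/S can have trouble with spaces in directory names.
            problems.append(f"{name} contains a space: {value}")
        elif not Path(value).is_dir():
            problems.append(f"{name} does not point to an existing directory: {value}")
    return problems


if __name__ == "__main__":
    for problem in check_home_variables():
        print(problem)
```

Open a new cmd window before running it, since environment variable changes are only picked up by newly started processes.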

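Then, edit the Path (again, in the System variables box at the bottom) and add those variables with \bin appended (also \sbin for Hadoop). That gives four new entries: %JAVA_HOME%\bin, %HADOOP_HOME%\bin, %HADOOP_HOME%\sbin, and %SPARK_HOME%\bin.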

If you echo %PATH% in cmd, you should now see these directories somewhere in the middle of the path, because the User Path is appended to the System Path for the %PATH% variable. You should check now that java -version, hdfs -version, and spark-shell --version each print a version number. This means that they were correctly installed and added to your %PATH%. (Note that hdfs -version reports the version of the Java runtime that Hadoop launches, as in the sample output further down; run hadoop version to see the Hadoop version itself.)

Please note that if you try to run the above commands from a location with any spaces in the path, the commands may fail. For example, if your username is "Firstname Lastname" and you try to check the Hadoop version, you may see an error message like:

C:\Users\Firstname Lastname>hdfs -version
Error: Could not find or load main class Lastname

To fix this, simply move to a working directory without any spaces in the path:

C:\Users\Firstname Lastname>cd ..

C:\Users>hdfs -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.232-b09, mixed mode)
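The "cd .. until the path has no spaces" workaround can also be expressed as a small Python helper. This is only an illustrative sketch; nearest_space_free_dir is a name invented for this example, and it uses PureWindowsPath so the logic can be tried on any operating system.

```python
from pathlib import PureWindowsPath


def nearest_space_free_dir(path: str) -> str:
    """Walk up from `path` and return the first directory with no spaces in it."""
    start = PureWindowsPath(path)
    for candidate in (start, *start.parents):
        if " " not in str(candidate):
            return str(candidate)
    # Only reachable if even the root contains a space (e.g. some UNC shares).
    raise ValueError(f"every ancestor of {path!r} contains a space")
```

For the example above, nearest_space_free_dir(r"C:\Users\Firstname Lastname") returns C:\Users, which is the directory the commands were run from.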

Configure Hadoop

Next, go to %HADOOP_HOME%\etc\hadoop and edit (or create) the file core-site.xml so it looks like the following:

core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

In the same directory, edit (or create) mapred-site.xml with the following contents:

mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Next, edit (or create) hdfs-site.xml:

hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///C:/Hadoop/hadoop-<version>/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///C:/Hadoop/hadoop-<version>/datanode</value>
  </property>
</configuration>

...yes, they should be forward slashes, even though Windows uses backslashes. This is because Hadoop interprets these values as file URIs rather than as native Windows paths. Also, be sure to replace <version> with the appropriate Hadoop version number. Finally, edit yarn-site.xml so it reads:

yarn-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
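All four files above share the same <configuration>/<property> layout, so they can also be generated rather than typed. The following Python sketch shows one way to do that, including the backslash-to-forward-slash conversion for the hdfs-site.xml paths. Both helper names are invented for this example, the install path is an assumption, and the sketch leaves out the optional xml-stylesheet line.

```python
import xml.etree.ElementTree as ET
from pathlib import PureWindowsPath


def windows_path_to_file_uri(path: str) -> str:
    """Turn a Windows path with a drive letter into a file:/// URI
    with forward slashes (no percent-encoding is applied)."""
    return "file:///" + PureWindowsPath(path).as_posix()


def build_site_xml(properties: dict) -> str:
    """Return a Hadoop-style <configuration> document as a string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
        ET.indent(root, space="  ")
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


if __name__ == "__main__":
    hadoop_home = r"C:\Hadoop\hadoop-3.1.3"  # assumed install location
    print(build_site_xml({
        "dfs.replication": "1",
        "dfs.namenode.name.dir": windows_path_to_file_uri(hadoop_home + r"\namenode"),
        "dfs.datanode.data.dir": windows_path_to_file_uri(hadoop_home + r"\datanode"),
    }))
```

Writing the returned string to %HADOOP_HOME%\etc\hadoop\hdfs-site.xml should give the same properties as the hand-edited file.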

The last thing we need to do is create the directories that we referenced in hdfs-site.xml, namely C:\Hadoop\hadoop-<version>\namenode and C:\Hadoop\hadoop-<version>\datanode.
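Creating the two directories takes only a couple of mkdir commands, but for completeness here is a small Python sketch that does the same thing. The function name create_hdfs_dirs is invented for this example; it assumes the namenode and datanode directories live directly under your Hadoop directory, as in the hdfs-site.xml above.

```python
from pathlib import Path


def create_hdfs_dirs(hadoop_home) -> list:
    """Create <hadoop_home>/namenode and <hadoop_home>/datanode if missing."""
    created = []
    for name in ("namenode", "datanode"):
        target = Path(hadoop_home) / name
        # exist_ok=True makes the call safe to repeat.
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created
```

A typical call would be create_hdfs_dirs(os.environ["HADOOP_HOME"]) after importing os, which relies on the HADOOP_HOME variable set earlier.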

Patch Hadoop

Now, you need to apply a patch created and posted to GitHub by user cdarlint. (Note that this patch is specific to the version of Hadoop that you're installing, but if the exact version isn't available, try the one just before the desired version... that works sometimes.)

Make a backup of your %HADOOP_HOME%\bin directory (copy it to \bin.old or similar), then copy the patched files (specific to your Hadoop version, downloaded from the above git repo) into the original %HADOOP_HOME%\bin directory, replacing the old files with the new ones.
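The backup-then-overwrite step can be sketched in Python as follows. This is an illustration rather than a required tool: apply_bin_patch is a name invented here, and patched_bin stands for wherever you saved the patched files for your Hadoop version.

```python
import shutil
from pathlib import Path


def apply_bin_patch(hadoop_home, patched_bin) -> list:
    """Back up <hadoop_home>/bin to bin.old, then copy the patched files over it.

    Returns the sorted names of the files that were copied into bin.
    """
    bin_dir = Path(hadoop_home) / "bin"
    backup_dir = Path(hadoop_home) / "bin.old"
    if backup_dir.exists():
        # Refuse to overwrite an earlier backup of the unpatched files.
        raise FileExistsError(f"backup already exists: {backup_dir}")
    shutil.copytree(bin_dir, backup_dir)
    copied = []
    for src in Path(patched_bin).iterdir():
        if src.is_file():
            shutil.copy2(src, bin_dir / src.name)  # replaces same-named files
            copied.append(src.name)
    return sorted(copied)
```

If something goes wrong afterwards, the untouched originals are still available in bin.old.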

Now, if you run hdfs namenode -format in cmd, you should see:

One more thing to do: copy the JAR file hadoop-yarn-server-timelineservice-<version>.jar from C:\Hadoop\hadoop-<version>\share\hadoop\yarn\timelineservice to C:\Hadoop\hadoop-<version>\share\hadoop\yarn (the parent directory). (These are short version numbers, like 3.1.3, and may not match between the JAR file name and the directory name.)
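Because the version in the JAR file name may differ from the one in the directory name, it can be easier to match the file by pattern than to type the name out. Here is a hedged Python sketch; the function name is invented, and the pattern assumes a purely numeric version such as 3.1.3 so that other JARs whose names merely start the same way are left alone.

```python
import re
import shutil
from pathlib import Path

# Matches e.g. hadoop-yarn-server-timelineservice-3.1.3.jar, but not JARs
# with extra words after "timelineservice-".
JAR_PATTERN = re.compile(r"hadoop-yarn-server-timelineservice-\d+(\.\d+)*\.jar")


def copy_timelineservice_jar(hadoop_home) -> list:
    """Copy the timeline service JAR from yarn/timelineservice up to yarn/."""
    yarn_dir = Path(hadoop_home) / "share" / "hadoop" / "yarn"
    copied = []
    for jar in sorted((yarn_dir / "timelineservice").iterdir()):
        if jar.is_file() and JAR_PATTERN.fullmatch(jar.name):
            shutil.copy2(jar, yarn_dir / jar.name)
            copied.append(jar.name)
    return copied
```

An empty return value means no file matched, in which case check the timelineservice directory by hand.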

Boot HDFS

Finally, you can boot HDFS and YARN by running start-dfs.cmd and start-yarn.cmd in cmd.

You should verify that the namenode, datanode, resourcemanager, and nodemanager are all running using the jps command:
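jps prints one line per running Java process, in the form "<process id> <class name>", so the check can be automated. The sketch below is one way to do it in Python; missing_daemons is a name invented for this example, and the process IDs in the usage note are made up.

```python
EXPECTED_DAEMONS = {"NameNode", "DataNode", "ResourceManager", "NodeManager"}


def missing_daemons(jps_output: str) -> set:
    """Return the expected Hadoop daemons that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # skip blank lines and entries without a name
            running.add(parts[1])
    return EXPECTED_DAEMONS - running
```

For example, missing_daemons("4120 NameNode\n5512 DataNode\n6044 Jps") returns the set containing ResourceManager and NodeManager, which would suggest that start-yarn.cmd has not been run (or failed). To feed it live data, pass in the stdout of subprocess.run(["jps"], capture_output=True, text=True).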

You can also open localhost:8088 (the YARN ResourceManager web UI) and localhost:9870 (the HDFS NameNode web UI) in your browser to monitor your shiny, new Hadoop Distributed File System:
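If no browser is handy (on a headless server, say), a plain TCP connection test is a rough substitute for opening the pages. This Python sketch only checks that something is listening on each port, not that the web UI itself is healthy; the function names are invented for this example.

```python
import socket

# Ports mentioned in this guide: 8088 and 9870.
WEB_UI_PORTS = (8088, 9870)


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, ...
        return False


def web_ui_status(host: str = "localhost") -> dict:
    """Map each web UI port to whether it currently accepts connections."""
    return {port: is_port_open(host, port) for port in WEB_UI_PORTS}
```

A result of False for either port right after start-dfs.cmd and start-yarn.cmd may just mean the daemons are still starting, so it is worth retrying after a few seconds.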

Finally, test that you can edit the filesystem by running hadoop fs -mkdir /test, which will make a directory called test in the root directory:

Testing Hadoop and Spark

We now know how to create directories (hadoop fs -mkdir) and list their contents (hadoop fs -ls) in HDFS, but what about creating and editing files? Files can be copied from the local file system to HDFS with hadoop fs -put. We can then read them in the spark-shell with sc.textFile(...):

Note that you read a file from HDFS at hdfs://localhost:9000/ and not just hdfs://, because hdfs://localhost:9000 is the fs.defaultFS that we defined in core-site.xml.
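To make the URI rule concrete, here is a tiny Python sketch that builds the full address from an HDFS path. The helper hdfs_uri is invented for this example, its defaults mirror the fs.defaultFS value from core-site.xml, and it deliberately accepts only absolute paths.

```python
def hdfs_uri(path: str, host: str = "localhost", port: int = 9000) -> str:
    """Build a full HDFS URI such as hdfs://localhost:9000/test/file.txt."""
    if not path.startswith("/"):
        raise ValueError("this sketch only accepts absolute HDFS paths")
    return f"hdfs://{host}:{port}{path}"
```

In the pyspark shell that ships with Spark (an assumption here, since this guide uses the Scala spark-shell), the result could be passed straight to sc.textFile, for example sc.textFile(hdfs_uri("/test/hello.txt")), where /test/hello.txt is a hypothetical file uploaded with hadoop fs -put.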

If you want to stop HDFS and YARN, you can run the commands:

C:\Users> stop-dfs.cmd

and

C:\Users> stop-yarn.cmd

So there you have it! Spark running on Windows, reading files stored in HDFS. This took a bit of work to get going and I owe a lot to people who previously encountered the same bugs as me, or previously wrote tutorials which I used as a framework for this walkthrough. Here are the blogs, GitHub repos, and SO posts I used to build this tutorial:

Oldest comments (60)

Felicitas Pojtinger • Nov 6 '18

But ... why? Just get Fedora and done ;)

Andrew (he/him) • Nov 6 '18

Client-specified software that only runs on Windows Server :/

Felicitas Pojtinger • Nov 6 '18

Well, that's sad. Have you thought about using smth. like an IIS container for those proprietary blobs?

Andrew (he/him) • Nov 6 '18

I haven't, no... how would that work? Can you point me to any good resources?

Felicitas Pojtinger • Nov 7 '18

See the Docker hub for more info, although I don't use it personally (I use & write FLOSS exclusively)

Mark Ferrall • Jan 28 '19

My god, I've spent an insane amount of time on this for an assignment, and this was the only thing I've gotten to work. Thank you for putting this together.

Andrew (he/him) • Jan 28 '19

Happy to help!

rfks • Feb 4 '19

Thanks for the guide! Just noticed a small typo with one port number:
localhost:9087 instead of localhost:9870 (I should have looked at the image:)

Andrew (he/him) • Feb 4 '19

Thanks for pointing that out! Typo is fixed :)

ParixitOdedara • Feb 24 '19

Thanks for putting this together and sharing knowledge. I tried to get Hadoop up and running on my Windows machine last year, and it was painful! Anywho, it encouraged me to put together a blog just like you - exitcondition.com/install-hadoop-w...

Keep Exploring!

پنوں پاکستانی • Mar 28 '19

hi Andrew

when I run start-dfs.cmd and start-yarn.cmd command it gives me an error

C:\Java\jdk1.8.0_201\bin\java -Xmx32m -classpath "C:\Hadoop\hadoop-3.1.2\etc\hadoop;C:\Hadoop\hadoop-3.1.2\share\hadoop\common;C:\Hadoop\hadoop-3.1.2\share\hadoop\common\lib*;C:\Hadoop\hadoop-3.1.2\share\hadoop\common*" org.apache.hadoop.util.PlatformName' is not recognized as an internal or external command,
operable program or batch file.
The system cannot find the file C:\Windows\system32\cmd.exe\bin.
The system cannot find the file C:\Windows\system32\cmd.exe\bin.

Please help me

Andrew (he/him) • Mar 28 '19

Hi پنوں,

It looks like your system variables are mis-configured. The path

C:\Windows\system32\cmd.exe\bin

Doesn't make any sense, as cmd.exe is an executable, not a directory. Double-check that you have the environment variables set correctly and let me know if you continue to have issues.

پنوں پاکستانی • Mar 28 '19

There was a problem with environment variables I was trying C:\Windows\system32\cmd.exe\bin but that was prompting an error. but when I changed the system variable with C:\Windows\system32\cmd.exe it was Running fine.

Thank you BOSS for your help Stay blessed.

Andrew (he/him) • Mar 28 '19

Happy to help!

OrakXaii • Apr 26 '19

Hi Andrew, Thank you.
I have followed all the steps on win 7 but when I run hdfs -version ; got an error hdfs is not recognized
please help

Andrew (he/him) • Jul 11 '19

Can you give me the exact error message you get? I haven't tried this guide on Windows 7 -- I'm not sure it will work on that OS.

Nikhil01ranjan • Jul 10 '19

Hi Andrew,

Thanks a lot for this.
This is the only thing that worked for me.

Andrew (he/him) • Jul 10 '19

Happy to help!

Michèle • Jul 20 '19

Thank you so much, after 4 tutorials and 3 days of trying it finally worked! Yay!!!

For those who might have the same problem as I did: When I used start-dfs.cmd and start-yarn.cmd it said the command couldn't be found. After a quick internet search I figured out that I needed to go to the sbin directory because it's in there and start it from there. Worked fine then.

Andrew (he/him) • Jul 20 '19

Glad it worked! I actually went back to follow this guide again recently and skipped over the part where I say to add \sbin to the PATH, too. No worries!

Nebrod666 • Aug 2 '19

Just signed to thank you for this tutorial. Well explained and very clear. Also, thanks for the link to the patch for bin files. I was only able to work with older versions of hadoop and almost tempted to try to build the bins on my own. Cheers!
