Installing and Running Hadoop and Spark on Windows
We recently got a big new server at work to run Hadoop and Spark (H/S) on for a proof-of-concept test of some software we're writing for the biopharmaceutical industry, and I hit a few snags while trying to get H/S up and running on Windows Server 2016 / Windows 10. I've documented here, step by step, how I managed to install and run this pair of Apache products directly in the Windows cmd prompt, without any need for Linux emulation.
Update 16 Dec 2019: Software version numbers have been updated and the text has been clarified.
Get the Software
The first step is to download Java, Hadoop, and Spark. Spark seems to have trouble working with newer versions of Java, so I'm sticking with Java 8 for now:
I can't guarantee that this guide works with newer versions of Java. Please try with Java 8 if you're having issues. Also, with the new Oracle licensing structure (2019+), you may need to create an Oracle account to download Java 8. To avoid this, simply download from AdoptOpenJDK instead.
For Java, I download the "Windows x64" version of the AdoptOpenJDK HotSpot JVM (jdk8u232-b09); for Hadoop, the binary of v3.1.3 (hadoop-3.1.3.tar.gz); for Spark, v3.0.0 "Pre-built for Apache Hadoop 2.7 and later" (spark-3.0.0-preview-bin-hadoop2.7.tgz). From this point on, I'll refer generally to these versions as hadoop-<version> and spark-<version>; please replace these with your version numbers throughout the rest of this tutorial.
Even though newer versions of Hadoop and Spark are currently available, there is a bug in Hadoop 3.2.1 on Windows that causes installation to fail. Until a patched version is available (3.1.4, 3.2.2, or 3.3.0), you must use an earlier version of Hadoop on Windows.
Next, download 7-Zip to extract the *.gz archives. Note that you may need to extract twice (once to move from *.gz to *.tar files, then a second time to "untar"). Once they're extracted (Hadoop takes a while), you can delete all of the *.tar and *.gz files. You should now have two directories and the JDK installer in your Downloads directory:
Note that -- as shown above -- the "Hadoop" directory and the "Spark" directory each contain a LICENSE, NOTICE, and README file. With particular versions of Hadoop, you may extract and get a nested directory structure like
C:\Users\<username>\Downloads\hadoop-<version>\hadoop-<version>\...
...if this is the case, move the contents of the inner hadoop-<version> directory to the outer hadoop-<version> directory by copying and pasting, then delete the inner hadoop-<version> directory. The path to the LICENSE file, for example, should then be:
C:\Users\<username>\Downloads\hadoop-<version>\LICENSE
...and similarly for the "Spark" directory.
WARNING: If you see a message like "Can not create symbolic link : A required privilege is not held by the client" in 7-Zip, you MUST run 7-Zip in Administrator Mode, then unzip the directories. If you skip these files, you may end up with a broken Hadoop installation.
Move the Spark and Hadoop directories into the C:\ directory (you may need administrator privileges on your machine to do this). Then, run the Java installer but change the destination folder from the default C:\Program Files\AdoptOpenJDK\jdk-<version>\ to just C:\Java. (H/S can have trouble with directories with spaces in their names.)
Once the installation is finished, you can delete the Java *.msi installer. Make two new directories called C:\Hadoop and C:\Spark and copy the hadoop-<version> and spark-<version> directories into those directories, respectively:
If you get "name too long"-type warnings, skip those files. These are only *.html files and aren't critical to running H/S.
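If you prefer to do this step in cmd rather than File Explorer, something like the following should work. This is a rough sketch: I'm assuming the archives were extracted into your Downloads folder, and I use move rather than copy-and-paste, which ends up with the same layout.
C:\Users\<username>\Downloads> mkdir C:\Hadoop C:\Spark
C:\Users\<username>\Downloads> move hadoop-<version> C:\Hadoop\
C:\Users\<username>\Downloads> move spark-<version> C:\Spark\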
Set Up Your Environment Variables
Next, we need to set some environment variables. Go to Control Panel > System and Security > System > Advanced System Settings > Environment Variables...:
...and add new System variables (bottom box) called:
- JAVA_HOME --> C:\Java
- HADOOP_HOME --> C:\Hadoop\hadoop-<version>
- SPARK_HOME --> C:\Spark\spark-<version>
(Adjust according to the versions of Hadoop and Spark that you've downloaded.)
Then, edit the Path variable (again, in the System variables box at the bottom) and add those variables with \bin appended (and also \sbin for Hadoop):
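If you'd rather set the three variables from an elevated cmd prompt instead of the GUI, a sketch with setx is below (setx with /M writes System variables and needs administrator rights; I'd still edit Path itself through the GUI as described above, since setx truncates long values):
C:\> setx JAVA_HOME "C:\Java" /M
C:\> setx HADOOP_HOME "C:\Hadoop\hadoop-<version>" /M
C:\> setx SPARK_HOME "C:\Spark\spark-<version>" /M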
If you echo %PATH% in cmd, you should now see these three directories somewhere in the middle of the path, because the User Path is appended to the System Path for the %PATH% variable. You should now check that java -version, hdfs -version, and spark-shell --version return version numbers, as shown below. This means that they were correctly installed and added to your %PATH%:
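For reference, the three checks are simply the following (run them from a directory without spaces in its path, for the reason explained next):
C:\Users> java -version
C:\Users> hdfs -version
C:\Users> spark-shell --version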
Please note that if you try to run the above commands from a location with any spaces in the path, the commands may fail. For example, if your username is "Firstname Lastname" and you try to check the Hadoop version, you may see an error message like:
C:\Users\Firstname Lastname>hdfs -version
Error: Could not find or load main class Lastname
To fix this, simply move to a working directory without any spaces in the path (as I did in the screenshot above):
C:\Users\Firstname Lastname>cd ..
C:\Users>hdfs -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.232-b09, mixed mode)
Configure Hadoop
Next, go to %HADOOP_HOME%\etc\hadoop and edit (or create) the file core-site.xml so it looks like the following:
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
In the same directory, edit (or create) mapred-site.xml with the following contents:
mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Next, edit (or create) hdfs-site.xml:
hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///C:/Hadoop/hadoop-<version>/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///C:/Hadoop/hadoop-<version>/datanode</value>
</property>
</configuration>
...yes, they should be forward slashes, even though Windows uses backslashes. This is due to the way that Hadoop interprets these file paths. Also, be sure to replace <version> with the appropriate Hadoop version number. Finally, edit yarn-site.xml so it reads:
yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
The last thing we need to do is create the directories that we referenced in hdfs-site.xml:
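You can do this in File Explorer or directly in cmd; for example (with <version> replaced by your Hadoop version number):
C:\Users> mkdir C:\Hadoop\hadoop-<version>\namenode
C:\Users> mkdir C:\Hadoop\hadoop-<version>\datanode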
Patch Hadoop
Now, you need to apply a patch created and posted to GitHub by user cdarlint. (Note that this patch is specific to the version of Hadoop that you're installing, but if the exact version isn't available, try the one just before the desired version; that sometimes works.)
Make a backup of your %HADOOP_HOME%\bin directory (copy it to \bin.old or similar), then copy the patched files (specific to your Hadoop version, downloaded from the above git repo) to the old %HADOOP_HOME%\bin directory, replacing the old files with the new ones.
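A rough sketch of the backup-and-replace in cmd (the source path here is a placeholder for wherever you downloaded cdarlint's files; adjust it to your setup):
C:\Users> robocopy %HADOOP_HOME%\bin %HADOOP_HOME%\bin.old /E
C:\Users> xcopy <path-to-downloaded-patch>\bin\* %HADOOP_HOME%\bin\ /Y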
Now, if you run hdfs namenode -format in cmd, you should see:
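For reference, the command is just the one below; if all is well, the tail end of the output should report that the namenode storage directory (the namenode folder we created above) was successfully formatted. If it errors out instead, double-check the patched bin files and the paths in hdfs-site.xml.
C:\Users> hdfs namenode -format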
One more thing to do: copy hadoop-yarn-server-timelineservice-<version> from C:\Hadoop\hadoop-<version>\share\hadoop\yarn\timelineservice to C:\Hadoop\hadoop-<version>\share\hadoop\yarn (the parent directory). (These are short version numbers, like 3.1.3, and may not match between the JAR file name and the directory name.)
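In cmd, a wildcard copy saves you from having to match the exact JAR version number (a sketch; replace <version> in the directory names as before):
C:\Users> copy C:\Hadoop\hadoop-<version>\share\hadoop\yarn\timelineservice\hadoop-yarn-server-timelineservice-*.jar C:\Hadoop\hadoop-<version>\share\hadoop\yarn\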
Boot HDFS
Finally, you can boot HDFS by running start-dfs.cmd and start-yarn.cmd in cmd:
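Run the two scripts one after the other; each should open new console windows for the daemons it starts (the namenode and datanode for the first, the resourcemanager and nodemanager for the second):
C:\Users> start-dfs.cmd
C:\Users> start-yarn.cmd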
You should verify that the namenode, datanode, resourcemanager, and nodemanager are all running, using the jps command:
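The output should look something like the following (the process IDs here are purely illustrative and will differ on your machine, and the exact list can vary slightly between setups):
C:\Users> jps
12345 NameNode
23456 DataNode
34567 ResourceManager
45678 NodeManager
56789 Jps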
You can also open localhost:8088 and localhost:9870 in your browser to monitor your shiny, new Hadoop Distributed File System:
Finally, test that you can edit the filesystem by running hadoop fs -mkdir /test, which will make a directory called test in the root directory:
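You can then confirm that the new directory exists by listing the root of the filesystem; test should show up in the listing:
C:\Users> hadoop fs -mkdir /test
C:\Users> hadoop fs -ls /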
Testing Hadoop and Spark
We now know how to create directories (fs -mkdir) and list their contents (fs -ls) in HDFS, but what about creating and editing files? Well, files can be copied from the local file system to HDFS with fs -put. We can then read those files in the spark-shell with sc.textFile(...):
Note that you read a file from HDFS on hdfs://localhost:9000/ and not just hdfs://. This is because this is the defaultFS we defined in core-site.xml.
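As a concrete example, suppose there's a plain-text file at C:\data.txt on the local machine (a hypothetical file, just for illustration); you could copy it into the /test directory we made earlier and count its lines in the spark-shell:
C:\Users> hadoop fs -put C:\data.txt /test/
C:\Users> spark-shell
scala> val lines = sc.textFile("hdfs://localhost:9000/test/data.txt")
scala> lines.count()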
If you want to stop the HDFS, you can run the commands:
C:\Users> stop-dfs.cmd
and
C:\Users> stop-yarn.cmd
So there you have it! Spark running on Windows, reading files stored in HDFS. This took a bit of work to get going and I owe a lot to people who previously encountered the same bugs as me, or previously wrote tutorials which I used as a framework for this walkthrough. Here are the blogs, GitHub repos, and SO posts I used to build this tutorial:
- Muhammad Bilal Yar's Hadoop 2.8.0 walkthrough
- java.net.URISyntaxException
- java.lang.UnsatisfiedLinkError
- FATAL resourcemanager.ResourceManager
- localhost:50070 error
- Kuldeep Singh's walkthrough and troubleshooting guide
- Jacek Laskowski's GitBook
- java.io.IOException: Incompatible clusterIDs
- HDFS basic commands
- Spark basic commands
Latest comments (60)
Hi Andrew,
Thank you for this tutorial. It has really been helpful.
I have been able to run the start-yarn.cmd command successfully, but whenever I run start-dfs.cmd it gives me an error message:
"WARN datanode.DataNode: Problem connecting to server: localhost/127.0.0.1:9000"
Can you please tell me what to do to resolve this issue?
Thank you.
Hi Andrew
This was so clear.
I had been getting problems installing Hadoop for a week, and this just made it a breeze.
Thank you indeed
hi Andrew,
I can't fix this problem:
issues.apache.org/jira/browse/YARN...
Hadoop version is 3.1.3.
When I start YARN, this folder gets created with insufficient permissions: /usercache. Running every script with sufficient permissions doesn't help.
Thanks a lot in advance!
Rodrigo
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Permissions incorrectly set for dir c:/Hadoop/hadoop-3.1.3/yarn/tmp-nm/usercache, should be rwxr-xr-x, actual value = rw-rw-rw-
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1665)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkAndInitializeLocalDirs(ResourceLocalizationService.java:1633)
Thanks Andrew. This post was really helpful.
simply great, it worked like a charm. Thanks for this tutorial Andrew!!
Hello Andrew,
Thank you so much for your tutorial. I just have some questions hope you can help:
after running start-dfs.cmd and start-yarn.cmd in cmd (the boot HDFS step), I noticed that YARN is working fine, but the NameNode and DataNode started for a few seconds and then both stopped working for some reason. Any idea what might cause this issue?
during the path-setting process, I couldn't run the command hdfs -version (I cd'd out to C:/User but I still had the same error: Error: Could not find or load main class Last Name), so I edited /etc/hadoop/hadoop-env.cmd and changed this line:
set HADOOP_IDENT_STRING=%USERNAME%
to
set HADOOP_IDENT_STRING=myuser
This allows me to run hdfs -version, but I don't know whether this change will affect anything; could you please clarify? Does this change make my NameNode and DataNode stop working?
Hi Andrew,
Try the Syncfusion BigData Studio and Syncfusion Cluster Manager products. They have built-in Hadoop ecosystems for the Windows platform, which makes it much easier to install and configure Hadoop on Windows.
Thanks Andrew for this tutorial, it was very helpful.
How does one address this encryption error:
INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
I get it after I put a file into HDFS with hadoop fs -put.
Hi Chinanu. I haven't encountered an error like this before, so unfortunately your guess is as good as mine. This site seems to suggest that it might be an issue with missing jar files? I'm not sure.
Hi,
I'm getting this error when I execute start-yarn.cmd.
thepracticaldev.s3.amazonaws.com/i...
Help me, please
Hi Andrew ,
It is me again. Now I am testing on my personal machine, but I am having another problem. On my local machine my user is "David Serrano". As you can see, it has one space in it. When I try to format the namenode with "hdfs namenode -format", I am getting this error:
Error: Could not find or load main class Serrano
Caused by: java.lang.ClassNotFoundException: Serrano
So, I guess the problem is the space in my user name. What can I do in this case?
Thanks in advance!
Hadoop doesn't like spaces in paths. I think the only thing you can do is put Java, Hadoop, and Spark in locations where there are no spaces in the path. I usually use:
Hi,
Well, all the files are in paths without spaces. However, Hadoop is executing something using my user "David Serrano" and that is generating the problem. I have not found the root cause of this.
Are there any spaces on your %PATH% at all?
Hi,
Here is my path:
C:\Users\David Serrano>echo %path%
C:\Program Files (x86)\Common Files\Oracle\Java\javapath;C:\Program Files\Microsoft MPI\Bin\;C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files\dotnet\;C:\Program Files\Microsoft SQL Server\130\Tools\Binn\;C:\Program Files (x86)\Microsoft SQL Server\Client SDK\ODBC\130\Tools\Binn\;C:\Program Files (x86)\Microsoft SQL Server\140\Tools\Binn\;C:\Program Files (x86)\Microsoft SQL Server\140\DTS\Binn\;C:\Program Files (x86)\Microsoft SQL Server\140\Tools\Binn\ManagementStudio\;C:\WINDOWS\System32\OpenSSH\;C:\Program Files\Git\cmd;C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\130\Tools\Binn\;C:\Program Files\Microsoft SQL Server\140\Tools\Binn\;C:\Program Files\Microsoft SQL Server\140\DTS\Binn\;C:\Program Files\Java\jdk-12.0.1\bin;C:\Program Files\MySQL\MySQL Shell 8.0\bin\;C:\Progra~1\Java\jdk-12.0.1;C:\BigData\hadoop-3.1.2;C:\BigData\hadoop-3.1.2\bin;C:\BigData\hadoop-3.1.2\sbin;
As you can see, there are a lot of spaces; however, in the configuration of the variables I am using C:\Progr~1 ... in order to avoid problems with spaces. But the problem is with my user "David Serrano". The error says:
Error: Could not find or load main class Serrano
Caused by: java.lang.ClassNotFoundException: Serrano
As you can see, in the PATH there is no "Serrano" word, so my conclusion is that the problem is my user name. But I don't know how I can avoid this.
Maybe it's doing something with your working directory path? Try cd-ing to C:\ first, then running Hadoop. I'm really not sure, though.
I already did that:
C:>hadoop version
Error: Could not find or load main class Serrano
Caused by: java.lang.ClassNotFoundException: Serrano
Do you know which Hadoop script calls the user profile? Do you know if Hadoop has some way to set the name of the user profile in the scripts?
I don't, sorry, David. I'm not sure why that would be hard-coded anywhere, if it's not in your %PATH% and you're not in that directory.
Well, here is some info (although it is a little bit old) that could give some clue about the problem:
blog.benhall.me.uk/2011/01/install...
I think I can do something similar to the advice in the above blog. However, I need to know which variable Hadoop uses to call Java, in order to change it in the config files.
If you have any info about it, please post it here so we can try to solve the problem.
Thanks in advance.
Hadoop uses JAVA_HOME to determine where your Java distribution is installed. In a Linux installation, there's a file called hadoop/etc/hadoop/hadoop-env.sh. It might be .cmd instead of .sh on Windows, but I'm not sure.
Check out my other article on installing Hadoop on Linux. (Search for "JAVA_HOME" to find the relevant bit.)
Yes, the JAVA_HOME variable is fine on my laptop. However, Hadoop must use the %USERNAME% or %USERPROFILE% variable in another part of its code. Those variables are the problematic thing. I need to locate that part of Hadoop and try to change it in some config file (if that is possible). Actually, I have another machine with Ubuntu and Hadoop works normally there. The idea was to install on Windows to do some specific work on both systems.
I appreciate your attention, and if you get any new info about this kind of problem (user names with spaces and YARN problems on Windows), please don't hesitate to post it here.
Thanks a lot.
Hey guys, I also have the same problem on my system, due to a space in my system user name. Did you find any solution?
Thanks in advance