Hadoop installation steps

In a real-world deployment, Hadoop is installed across a network of machines to form a cluster. In this article, we will walk through installing Hadoop step by step on a single Ubuntu system. The post assumes you already know what a NameNode, DataNode, HDFS, and the other basic Hadoop components are. If not, please read the previous articles to learn about these basic concepts.

  • Download the Hadoop version that you want to install to your local machine (the stable release is recommended).
  • Download Java 1.7 (preferred for recent Hadoop versions) to your local machine. Check whether your machine needs the 32-bit or the 64-bit build before downloading.
  • Assuming the downloaded tarballs are in the home directory of the logged-in user, extract them:
    1. ~$ tar -xvf hadoop-2.7.1.tar.gz
    2. ~$ tar -xvf jdk-7u79-linux-x86_64.gz
  • After the extraction completes, you will see these two folders under the home directory.
  • Now we have to set the paths for these items in the .bashrc file under the home directory. If you cannot see the file, press Ctrl+H to show hidden files. Open the file and add the lines below.
  • export JAVA_HOME=/home/user/jdk1.7.0_79
    export HADOOP_PREFIX=/home/user/hadoop-2.7.1
    export HADOOP_HOME=${HADOOP_PREFIX}
    export HADOOP_CONF_DIR=${HADOOP_PREFIX}/etc/hadoop
    export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
  • After appending these lines, save and close the file. For the values to take effect in the current shell, source the .bashrc file with ~$ source ~/.bashrc.
  • To confirm that Java and Hadoop are set up properly, run ~$ echo $JAVA_HOME and ~$ hadoop version and check that both print the expected values.
  • Now that we have Hadoop on our machine, we need to set its configuration to make it work. Navigate to “/home/user/hadoop-2.7.1/etc/hadoop” to see the Hadoop configuration files. We will set up the following:
    • Setting up core-site.xml.
    • Setting up hdfs-site.xml
    • Setting up yarn-site.xml
    • Setting up mapred-site.xml

Setting up core-site.xml

Open the core-site.xml in the path “/home/user/hadoop-2.7.1/etc/hadoop” using a text editor. Here you need to set the default file system and temp directory.

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:8020</value>
   </property>
   <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/user/tmp</value>
   </property>
</configuration>

Setting up hdfs-site.xml

Here you need to set the local directories used by the NameNode and the DataNode.

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>/home/user/name</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>/home/user/data</value>
   </property>
</configuration>
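Hadoop will normally create these directories when it first needs them, but creating them up front can rule out permission problems. A minimal sketch, assuming the paths configured in core-site.xml and hdfs-site.xml above (adjust /home/user to your own home directory):

~$ mkdir -p /home/user/tmp /home/user/name /home/user/data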

Setting up yarn-site.xml

Here you will set the properties related to the NodeManager and the ResourceManager (YARN/MR2).

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
   <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>localhost</value>
   </property>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
   <property>
      <name>yarn.log-aggregation-enable</name>
      <value>true</value>
   </property>
   <property>
      <name>yarn.nodemanager.remote-app-log-dir</name>
      <value>hdfs://localhost:8020/log/</value>
   </property>
</configuration>
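Note that yarn.nodemanager.remote-app-log-dir points to a directory inside HDFS, not on the local file system. YARN typically creates it when log aggregation first runs, but if you hit errors about the missing directory you can create it yourself once HDFS is up, for example:

~$ hdfs dfs -mkdir -p /log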

Setting up mapred-site.xml

You will see mapred-site.xml.template in the same folder. Create a copy of it and rename it to mapred-site.xml (see the command below). Here you set the MapReduce framework and the JobHistory server addresses.
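A quick way to make that copy, assuming you are already in the Hadoop configuration directory:

~$ cp mapred-site.xml.template mapred-site.xml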

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
   <property>
      <name>mapreduce.jobhistory.address</name>
      <value>localhost:10020</value>
   </property>
   <property>
      <name>mapreduce.jobhistory.webapp.address</name>
      <value>localhost:19888</value>
   </property>
</configuration>

Updating slaves file

After setting up the above configurations, you need to add localhost to the slaves file. This is because we are setting up a pseudo-distributed cluster, which means both the master and the slave are localhost.

Navigate to ~/hadoop-2.7.1/etc/hadoop and open the “slaves” file in an editor. Add “localhost” to the file and save it (or make the change from the command line as shown below).
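If you prefer the command line, the same change can be made in one step; this simply overwrites the file with a single localhost entry, which is all we need for a pseudo-distributed setup:

~$ echo "localhost" > slaves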

  • After setting all these configurations, Hadoop is ready to start in pseudo-distributed mode.
  • Now you have to format the NameNode. Remember that this is a one-time activity that should be done only during the initial setup; formatting again later will wipe the metadata of the existing HDFS. You can use the commands below.
cd $HADOOP_CONF_DIR
hdfs namenode -format
  • Next, we need to enable passwordless SSH so that the start scripts do not prompt for a password when launching the daemons. To accomplish that, execute the commands below in the order given; a quick verification step follows them.
~$ sudo apt-get install openssh-server
~$ sudo service ssh start
~$ ssh-keygen
~$ cd .ssh
~$ cat id_rsa.pub >> authorized_keys
~$ chmod 600 authorized_keys
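To confirm that passwordless SSH works before starting the daemons, log in to localhost; it should not ask for a password (you may be asked once to accept the host key):

~$ ssh localhost
~$ exit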

All set for launching Hadoop on your machine. You can now start and stop the Hadoop daemons using the following commands.

start-all.sh – starts all five Hadoop daemons: NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager.

start-dfs.sh – starts the first three daemons from the list above (the HDFS daemons).

start-yarn.sh – starts the last two daemons (the YARN daemons).

You can also start a specific daemon by running

hadoop-daemon.sh start namenode
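The YARN daemons can be started individually in the same way, for example:

yarn-daemon.sh start resourcemanager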

To stop the daemons, use the commands below.

stop-all.sh – stops all the daemons.

stop-dfs.sh – stops the HDFS daemons.
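stop-yarn.sh similarly stops the YARN daemons. At any point you can check which daemons are actually running with the jps tool that ships with the JDK; in a healthy pseudo-distributed setup it lists NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager:

~$ jps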