
Install Hadoop on Ubuntu 20.04 (2021)

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.


Step 1: Installing Java

Hadoop is written in Java, so we need to install Java before installing Hadoop, as all the Hadoop daemons run as JVM processes.

You can install OpenJDK 8 from the default apt repositories:

sudo apt-get update

sudo apt install openjdk-8-jdk

Once the installation is complete, you can verify it by executing the below command.

java -version
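
The exact output varies with the patch release, but it should report an OpenJDK 1.8 runtime. If you also want to confirm where the JDK was installed (the same path is used for JAVA_HOME in later steps), one way, assuming the default Ubuntu packaging, is:

readlink -f $(which java) # typically prints a path under /usr/lib/jvm/java-8-openjdk-amd64/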

 

Step 2: Installing and configuring SSH

Install OpenSSH server by executing the below command

 sudo apt-get install openssh-server

Once installed, generate a public/private key pair by executing the below command. When it asks for a file location, simply press "Enter"

  ssh-keygen -t rsa -P ""

Next, append the generated public key to the authorized_keys file

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys 
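
If passwordless login still prompts for a password in the next step, the usual culprit is overly open permissions on the .ssh directory; tightening them (an extra troubleshooting step, not strictly required) generally resolves it:

chmod 700 $HOME/.ssh

chmod 600 $HOME/.ssh/authorized_keys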

Next, verify passwordless SSH authentication by running the below command; when prompted, type "yes" and press "Enter"

ssh localhost

Step 3: Download and Install Hadoop 

Download the latest Hadoop binary release from the official Apache Hadoop website (this post uses Hadoop 3.3.1)

Apache Hadoop
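
For example, assuming version 3.3.1, the tarball can also be fetched directly from the Apache archive with wget:

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz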

Move the downloaded tar file to the home directory and untar it using the below command 

tar -xvzf hadoop-3.3.1.tar.gz

Once the file is extracted, rename the folder to a simpler name by executing the below command

mv hadoop-3.3.1 hadoop

Next, we need to edit the bashrc file and add references to the Hadoop folders

 nano ~/.bashrc

Add the below lines at the end of the bashrc file

export PDSH_RCMD_TYPE=ssh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

export PATH=$PATH:$JAVA_HOME/bin

export HADOOP_HOME="<your hadoop directory>"

export PATH=$PATH:$HADOOP_HOME/bin

export PATH=$PATH:$HADOOP_HOME/sbin 

export HADOOP_MAPRED_HOME=${HADOOP_HOME}

export HADOOP_COMMON_HOME=${HADOOP_HOME}

export HADOOP_HDFS_HOME=${HADOOP_HOME}

export YARN_HOME=${HADOOP_HOME}

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

export LD_LIBRARY_PATH=<your hadoop directory>/lib/native


*Replace <your hadoop directory> with the full path of the folder where you downloaded and extracted Hadoop
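
For example, if your username is hduser and you extracted the tarball in your home directory as shown above (a hypothetical path; adjust it to your own), the two placeholder lines would become:

export HADOOP_HOME="/home/hduser/hadoop"

export LD_LIBRARY_PATH=/home/hduser/hadoop/lib/native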



Save the changes by pressing Ctrl+X, then type "Y" and press "Enter".

Activate the changes by running the below command.


source ~/.bashrc


Step 4: Configuring Hadoop daemons

  1. Edit the hadoop-env.sh in the $HADOOP_HOME/etc/hadoop folder

nano hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"


  2. Edit the hdfs-site.xml in $HADOOP_HOME/etc/hadoop


nano hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
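
With dfs.replication set to 1, each HDFS block is kept on a single DataNode, which is appropriate for this single-machine setup. Optionally (an addition beyond the original steps), you can also pin the NameNode and DataNode storage directories to folders of your own rather than the default under hadoop.tmp.dir; add the following inside the same <configuration> block. The property names are standard Hadoop ones, the paths are placeholders:

    <property>
        <name>dfs.namenode.name.dir</name>
        <value><your hdata directory>/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value><your hdata directory>/datanode</value>
    </property>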


  3. Edit the core-site.xml in $HADOOP_HOME/etc/hadoop


nano core-site.xml


<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value><your hdata directory></value> 
    </property>
</configuration>
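
As a concrete example, assuming the hypothetical /home/hduser layout used earlier and a data folder named hdata (any writable directory owned by your user works), the hadoop.tmp.dir property could read:

    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hduser/hdata</value>
    </property>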


  4. Edit the mapred-site.xml in $HADOOP_HOME/etc/hadoop


nano mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
         <name>yarn.app.mapreduce.am.env</name>
         <value>HADOOP_MAPRED_HOME=<your hadoop directory></value>
    </property>
    <property>
         <name>mapreduce.map.env</name>
         <value>HADOOP_MAPRED_HOME=<your hadoop directory></value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=<your hadoop directory></value>
    </property>
</configuration>
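
Here too, replace <your hadoop directory> with the actual Hadoop path. With the hypothetical /home/hduser/hadoop path from earlier, each of the three env values would read:

    <value>HADOOP_MAPRED_HOME=/home/hduser/hadoop</value>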


  5. Edit the yarn-site.xml in $HADOOP_HOME/etc/hadoop


nano yarn-site.xml

<configuration>
<!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>


  6. Once all the configuration files are edited, we need to format the HDFS NameNode using the below command


hdfs namenode -format


        Once the NameNode is successfully formatted, you will see a SHUTDOWN_MSG at the end of the output, as below
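
        The hostname and IP will differ on your machine, but the tail of the output should look roughly like this:

/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at <your-hostname>/127.0.1.1
************************************************************/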



  7. Once the NameNode is successfully formatted, you can start the services separately or start all of them at once using the below commands


           start-dfs.sh # to start the HDFS daemons (NameNode, DataNode, Secondary NameNode)

           start-yarn.sh # to start the YARN daemons (ResourceManager, NodeManager)

           start-all.sh # to start both HDFS & YARN daemons


    To verify whether all the Hadoop daemons have started, you can use the jps command.
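
    On a healthy single-node setup, jps should list six Java processes: NameNode, DataNode and SecondaryNameNode from HDFS, ResourceManager and NodeManager from YARN, plus jps itself. The process IDs below are only an illustration:

jps

2721 NameNode
2866 DataNode
3071 SecondaryNameNode
3298 ResourceManager
3432 NodeManager
3761 Jps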




    Before shutting down the system, make sure to stop the Hadoop daemons.


           stop-all.sh
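
    If you prefer to stop HDFS and YARN separately, the matching stop scripts are also available:

           stop-yarn.sh # to stop the YARN daemons

           stop-dfs.sh # to stop the HDFS daemons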


    This is the end of the post; I hope you have successfully installed Hadoop on your Ubuntu machine. If you face any issues, let me know in the comments section.