
Install Hadoop on Ubuntu 20.04 (2021)

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.


Step 1: Installing Java

Hadoop is written in Java, so we need to install Java before installing Hadoop, as all the Hadoop daemons run as JVM processes.

You can install OpenJDK 8 from the default apt repositories:

sudo apt-get update

sudo apt install openjdk-8-jdk

Once the installation is complete, you can verify it by executing the below command.

java -version
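
The exact output varies with the patch release, but it should report an OpenJDK 1.8 runtime. If you also want to confirm where the JDK was installed (the same path is used for JAVA_HOME in later steps), one way, assuming the default Ubuntu packaging, is:

readlink -f $(which java) # typically prints a path under /usr/lib/jvm/java-8-openjdk-amd64/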

 

Step 2: Installing and configuring SSH

Install OpenSSH server by executing the below command

 sudo apt-get install openssh-server

Once installed, generate a public/private key pair by executing the below command. When it asks for a file location, simply press "Enter"

  ssh-keygen -t rsa -P ""

Next, append the generated public key to the authorized_keys file

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys 
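
If passwordless login still prompts for a password in the next step, the usual culprit is overly open permissions on the .ssh directory; tightening them (an extra troubleshooting step, not strictly required) generally resolves it:

chmod 700 $HOME/.ssh

chmod 600 $HOME/.ssh/authorized_keys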

Next, verify passwordless SSH authentication by running the below command; when prompted, type "yes" and press "Enter"

ssh localhost

Step 3: Download and Install Hadoop 

Download the latest Hadoop binary release from the official Apache Hadoop website (this post uses Hadoop 3.3.1)

Apache Hadoop
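
For example, assuming version 3.3.1, the tarball can also be fetched directly from the Apache archive with wget:

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz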

Move the downloaded tar file to the home directory and untar it using the below command 

tar -xvzf hadoop-3.3.1.tar.gz

Once the file is extracted, rename the folder to a simpler name by executing the below command

mv hadoop-3.3.1 hadoop

Next, we need to edit the bashrc file and add references to the Hadoop folders

 nano ~/.bashrc

Add the below lines at the end of the bashrc file

export PDSH_RCMD_TYPE=ssh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

export PATH=$PATH:$JAVA_HOME/bin

export HADOOP_HOME="<your hadoop directory>"

export PATH=$PATH:$HADOOP_HOME/bin

export PATH=$PATH:$HADOOP_HOME/sbin 

export HADOOP_MAPRED_HOME=${HADOOP_HOME}

export HADOOP_COMMON_HOME=${HADOOP_HOME}

export HADOOP_HDFS_HOME=${HADOOP_HOME}

export YARN_HOME=${HADOOP_HOME}

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

export LD_LIBRARY_PATH=<your hadoop directory>/lib/native


*Replace <your hadoop directory> with the full path of the folder where you downloaded and extracted Hadoop
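
For example, if your username is hduser and you extracted the tarball in your home directory as shown above (a hypothetical path; adjust it to your own), the two placeholder lines would become:

export HADOOP_HOME="/home/hduser/hadoop"

export LD_LIBRARY_PATH=/home/hduser/hadoop/lib/native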



Save the changes by pressing Ctrl+X, then type "Y" and press "Enter".

Activate the changes by running the below command.


source ~/.bashrc


Step 4: Configuring Hadoop daemons

  1. Edit the hadoop-env.sh in the $HADOOP_HOME/etc/hadoop folder

nano hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"


  2. Edit the hdfs-site.xml in $HADOOP_HOME/etc/hadoop


nano hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
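
With dfs.replication set to 1, each HDFS block is kept on a single DataNode, which is appropriate for this single-machine setup. Optionally (an addition beyond the original steps), you can also pin the NameNode and DataNode storage directories to folders of your own rather than the default under hadoop.tmp.dir; add the following inside the same <configuration> block. The property names are standard Hadoop ones, the paths are placeholders:

    <property>
        <name>dfs.namenode.name.dir</name>
        <value><your hdata directory>/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value><your hdata directory>/datanode</value>
    </property>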


  3. Edit the core-site.xml in $HADOOP_HOME/etc/hadoop


nano core-site.xml


<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value><your hdata directory></value> 
    </property>
</configuration>
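
As a concrete example, assuming the hypothetical /home/hduser layout used earlier and a data folder named hdata (any writable directory owned by your user works), the hadoop.tmp.dir property could read:

    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hduser/hdata</value>
    </property>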


  4. Edit the mapred-site.xml in $HADOOP_HOME/etc/hadoop


nano mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
         <name>yarn.app.mapreduce.am.env</name>
         <value>HADOOP_MAPRED_HOME=<your hadoop directory></value>
    </property>
    <property>
         <name>mapreduce.map.env</name>
         <value>HADOOP_MAPRED_HOME=<your hadoop directory></value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=<your hadoop directory></value>
    </property>
</configuration>
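
Here too, replace <your hadoop directory> with the actual Hadoop path. With the hypothetical /home/hduser/hadoop path from earlier, each of the three env values would read:

    <value>HADOOP_MAPRED_HOME=/home/hduser/hadoop</value>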


  5. Edit the yarn-site.xml in $HADOOP_HOME/etc/hadoop


nano yarn-site.xml

<configuration>
<!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>


  6. Once all the configuration files are edited, we need to format the HDFS NameNode using the below command


hdfs namenode -format


        Once the NameNode is successfully formatted, you will see a SHUTDOWN_MSG at the end of the output, as below
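
        The hostname and IP will differ on your machine, but the tail of the output should look roughly like this:

/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at <your-hostname>/127.0.1.1
************************************************************/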



  7. Once the NameNode is successfully formatted, you can start the services separately or start all of them at once using the below commands


           start-dfs.sh # to start the HDFS daemons (NameNode, DataNode, Secondary NameNode)

           start-yarn.sh # to start the YARN daemons (ResourceManager, NodeManager)

           start-all.sh # to start both HDFS & YARN daemons


    To verify whether all the Hadoop daemons have started, you can use the jps command.
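
    On a healthy single-node setup, jps should list six Java processes: NameNode, DataNode and SecondaryNameNode from HDFS, ResourceManager and NodeManager from YARN, plus jps itself. The process IDs below are only an illustration:

jps

2721 NameNode
2866 DataNode
3071 SecondaryNameNode
3298 ResourceManager
3432 NodeManager
3761 Jps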




    Before shutting down the system, make sure to stop the Hadoop daemons.


           stop-all.sh
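
    If you prefer to stop HDFS and YARN separately, the matching stop scripts are also available:

           stop-yarn.sh # to stop the YARN daemons

           stop-dfs.sh # to stop the HDFS daemons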


    This is the end of the post; I hope you have successfully installed Hadoop on your Ubuntu machine. If you face any issues, let me know in the comments section.