Install Hadoop on Ubuntu 20.04 (2021)
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Step 1: Installing Java
Hadoop is written in Java, so we need to install Java before installing Hadoop, as all the Hadoop daemons run as JVM processes.
You can install OpenJDK 8 from the default apt repositories:
sudo apt-get update
sudo apt install openjdk-8-jdk
Once the installation is complete, you can verify it by executing the below command.
java -version
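You will need the JDK's installation path later when setting JAVA_HOME. On a default Ubuntu install, OpenJDK 8 lands under /usr/lib/jvm/java-8-openjdk-amd64; if you want to double-check the path on your machine, the quick check below (using standard shell utilities) resolves the real location of the java binary:
readlink -f $(which java)   # the printed path should sit inside /usr/lib/jvm/java-8-openjdk-amd64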
Step 2: Installing and configuring SSH
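Hadoop's start/stop scripts manage the daemons over SSH (and the Hadoop 3.x helper scripts use pdsh), so an SSH server must be available on the machine. If it is not already installed, you can install it from the default repositories; the package names below assume Ubuntu's defaults:
sudo apt install openssh-server
sudo apt install pdsh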
Once SSH is installed, generate a public/private key pair by executing the below command. When it asks for a file location, simply press "Enter"
ssh-keygen -t rsa -P ""
Next, append the generated public key to the authorized_keys file by executing the below command
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Next, verify passwordless SSH authentication by running the below command; when prompted, type "yes" and press "Enter"
ssh localhost
Step 3: Download and Install Hadoop
Download the Hadoop 3.3.1 binary release (hadoop-3.3.1.tar.gz) from the Apache Hadoop downloads page.
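For example, you can fetch the release directly from the Apache archive with wget (the exact mirror/URL may differ depending on where you download it from):
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz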
Move the downloaded tar file to the home directory and untar it using the below command
tar -xvzf hadoop-3.3.1.tar.gz
Once the file is extracted, rename the folder to a simpler name by executing the below command
mv hadoop-3.3.1 hadoop
Next, we need to edit the .bashrc file and add references to the Hadoop folders
nano ~/.bashrc
Add the below lines at the end of the .bashrc file
export PDSH_RCMD_TYPE=ssh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME="<your hadoop directory>"
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export LD_LIBRARY_PATH=<your hadoop directory>/lib/native
*Replace <your hadoop directory> with the full path of the folder where you downloaded and extracted Hadoop
Save the changes by pressing Ctrl+X, then type "Y" and press "Enter".
Activate the changes by running the below command.
source ~/.bashrc
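Optionally, you can confirm that the Hadoop binaries are now on your PATH:
hadoop version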
Step 4: Configuring Hadoop daemons
Editing the hadoop-env.sh in $HADOOP_HOME/etc/hadoop (set JAVA_HOME as shown below)
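You can open the file with nano, just like the .bashrc edit above:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh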
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
Editing the hdfs-site.xml in $HADOOP_HOME/etc/hadoop
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Editing the core-site.xml in $HADOOP_HOME/etc/hadoop
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value><your hdata directory></value>
  </property>
</configuration>
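Replace <your hdata directory> with the full path of a folder that Hadoop can use for its temporary and HDFS data. For example, assuming a hypothetical folder named hdata in your home directory:
mkdir -p ~/hdata   # then use /home/<your user>/hdata as the value above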
Editing the mapred-site.xml in $HADOOP_HOME/etc/hadoop
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=<your hadoop directory></value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=<your hadoop directory></value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=<your hadoop directory></value>
  </property>
</configuration>
Editing the yarn-site.xml in $HADOOP_HOME/etc/hadoop
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
Once all the configuration files are edited, we need to format the HDFS NameNode using the below command
hdfs namenode -format
Once the NameNode is successfully formatted, you can start the services separately or start all of them at once using the below commands
start-dfs.sh    # to start the HDFS daemons (NameNode, DataNode, Secondary NameNode)
start-yarn.sh   # to start the YARN daemons (ResourceManager, NodeManager)
start-all.sh    # to start both HDFS & YARN daemons
To verify whether all the Hadoop daemons have started, you can use the jps command.
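On a single-node setup like this one, the output should list the five daemons started above (the process IDs will differ):
jps   # expect NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager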
Before shutting down the system, make sure to stop the Hadoop daemons
stop-all.sh
This is the end of the post, hope you have successfully installed Hadoop in your Ubuntu machine. If you face any issues, let me know in the comments section.
Note: the minimum system requirements are 2 GB of RAM, 20 GB of disk space, and a 2 GHz processor to run the bare-minimum Hadoop daemons. If you want to install more components such as Hive, Spark, Sqoop and Cassandra, you will need more than 2 GB of RAM. Please check the other post on how to set up the Ubuntu VM here:
https://k2ddna.blogspot.com/2021/08/install-ubuntu-2004-lts-on-oracles.html