Install Apache Spark on Ubuntu 20.04 LTS (2021)
Apache Spark™ is a unified analytics engine for large-scale data processing.
Spark can run up to 100x faster than Hadoop MapReduce when processing data in memory, and up to 10x faster when processing data on disk.
Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. As of this writing, Spark is the most actively developed open source engine for this task, making it a standard tool for any developer or data scientist interested in big data.
In this post, we will see how to install Apache Spark on Ubuntu 20.04 LTS.
Before installing Spark, make sure you have Java and Python installed on your system by executing the commands below.
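For example (assuming OpenJDK 8 and Python 3.8, the versions referenced later in this post):
java -version
python3 --version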
Step 1: Download Apache Spark
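For example, Spark 3.1.2 pre-built for Hadoop 3.2 (the release used throughout this post) can be downloaded from the Apache archive, assuming that release is still hosted there:
wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz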
Step 2: Untar the Spark binary file
tar -xvzf spark-3.1.2-bin-hadoop3.2.tgz
mv spark-3.1.2-bin-hadoop3.2 spark
Step 3: Configure Apache Spark
cd spark/conf
mv spark-env.sh.template spark-env.sh
nano spark-env.sh
export SPARK_LOCAL_IP=127.0.0.1
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
export SPARK_CLASSPATH=/home/karthik/hive/lib/mysql-connector-java-8.0.26.jar
export HIVE_HOME=/home/karthik/hive
export HADOOP_HOME=/home/karthik/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
Once you have edited spark-env.sh, save the changes by pressing Ctrl+X, then type "Y" and press Enter.
Change the logging level to WARN using the commands below:
mv log4j.properties.template log4j.properties
nano log4j.properties
Change log4j.rootCategory from INFO to WARN, then save the changes by pressing Ctrl+X, type "Y" and press Enter.
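After the edit, the line should look like this (the surrounding defaults may differ slightly between Spark versions):
log4j.rootCategory=WARN, console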
Copy the hive-site.xml file from the hive/conf folder to the spark/conf folder so that Spark can find the Hive metastore configuration, using the commands below:
cd
cp hive/conf/hive-site.xml spark/conf
Copy the mysql-connector-java-*.jar file from the hive/lib folder to the spark/jars folder so that Spark can connect to the MySQL-backed metastore, using the command below:
cp hive/lib/mysql-connector-java-*.jar spark/jars
Step 4: Add SPARK_HOME to ~/.bashrc
nano ~/.bashrc
export SPARK_HOME=/home/karthik/spark
export PATH=$PATH:$SPARK_HOME/bin
export SPARK_VERSION=3.1.2
export SPARK_CLASSPATH=/home/karthik/spark/jars
export PYSPARK_PYTHON=/usr/bin/python3.8
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.8
# PYTHONPATH does not expand wildcards, so list the py4j zip explicitly
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
source ~/.bashrc
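You can quickly confirm that the new variables are active in the current shell, for example:
echo $SPARK_HOME    # should print /home/karthik/spark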
Step 5: Verify Spark Installation
pyspark
Once the PySpark shell starts, verify that Spark can talk to the Hive metastore:
spark.sql("SHOW DATABASES").show()
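As an extra sanity check, you can run a small computation in the same shell; this is just a smoke test, not part of the original setup:
df = spark.range(100)   # DataFrame with ids 0..99
print(df.count())       # should print 100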
You can exit the PySpark shell by typing exit() and pressing Enter (the Scala spark-shell, by contrast, is exited with ":q").
That's it, we have successfully installed and configured Apache Spark on Ubuntu 20.04 LTS. If you face any issues, let me know in the comments section.