
Install Apache Spark on Ubuntu 20.04 LTS (2021)


Apache Spark™ is a unified analytics engine for large-scale data processing. 

Spark can run up to 100x faster than Hadoop MapReduce when processing data in memory, and up to 10x faster when processing data on disk.

 

Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. As of this writing, Spark is the most actively developed open source engine for this task, making it a standard tool for any developer or data scientist interested in big data. 

 
Spark supports multiple widely used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers. This makes it an easy system to start with on a small scale and scale up to incredibly large data processing. 

In this post we will see how to install Apache Spark on Ubuntu 20.04.

Before installing Spark, make sure you have Java and Python installed on your system. You can check by executing the commands below.
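
            java -version
            
            python3 --version
            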

 Step 1: Download Apache Spark

Download the Apache Spark binary from the official download page below.

Apache Spark
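
For example, assuming the Spark 3.1.2 / Hadoop 3.2 build used in the rest of this post, you can download it directly from the Apache archive:

            wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz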

Step 2: Untar Spark binary file

Untar the file and rename the extracted directory by executing the below commands

       
            
            tar -xvzf spark-3.1.2-bin-hadoop3.2.tgz 
            
            mv spark-3.1.2-bin-hadoop3.2 spark
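
You can confirm the extraction by listing the directory; you should see folders such as bin, conf, and jars:

            ls spark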
            

Step 3: Configure Apache Spark

We need to configure Spark using the below commands

       
            
            cd spark/conf
            
            
            mv spark-env.sh.template spark-env.sh
            
            
            nano spark-env.sh
            
            
            # Add the following lines to spark-env.sh:
            
            export SPARK_LOCAL_IP=127.0.0.1
            
                        
            export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
            
            
            export SPARK_CLASSPATH=/home/karthik/hive/lib/mysql-connector-java-8.0.26.jar
            
            
            export HIVE_HOME=/home/karthik/hive
            
            
            export HADOOP_HOME=/home/karthik/hadoop
            
            
            export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
            
            
            export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
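
Note that the JAVA_HOME, HIVE_HOME, and HADOOP_HOME paths above are specific to this machine; adjust them to match your environment. On Ubuntu, you can find your Java installation path with:

            update-alternatives --list java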

            

Once you have edited spark-env.sh, save the changes by pressing Ctrl+X, then type "Y" and press Enter.

Change the logging level to "WARN" using the below commands

       
            
            mv log4j.properties.template log4j.properties
            
            
                    
            nano log4j.properties
            
            

Change log4j.rootCategory from INFO to WARN and save the changes by pressing Ctrl+X, then type "Y" and press Enter.
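
After this change, the relevant line in log4j.properties should read:

            log4j.rootCategory=WARN, console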


Copy the hive-site.xml file from the hive/conf folder to the spark/conf folder using the below commands

       
            
            cd
            
            
            cp hive/conf/hive-site.xml spark/conf
            
            

Copy the mysql-connector-java-*.jar file from the hive/lib folder to the spark/jars folder using the below command 

       
            
                      
            cp hive/lib/mysql-connector-java-*.jar spark/jars
            
            

Step 4: Add SPARK_HOME to ~/.bashrc

Add the SPARK_HOME path to your ~/.bashrc file by running the below commands
 
            nano ~/.bashrc
            
            
            # Add the following lines at the end of ~/.bashrc:
            
            export SPARK_HOME=/home/karthik/spark
            
            
            export PATH=$PATH:$SPARK_HOME/bin
            
            
            export SPARK_VERSION=3.1.2
            
            
            export SPARK_CLASSPATH=/home/karthik/spark/jars
            
            
            export PYSPARK_PYTHON=/usr/bin/python3.8
            
            
            export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.8
            
            
            # Python does not expand wildcards in PYTHONPATH; list the py4j archive explicitly
            # (adjust the py4j version to match the file in $SPARK_HOME/python/lib)
            export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
            
            
           
            
Save the changes by pressing Ctrl+X, then type "Y" and press Enter.

Apply the changes by executing the below command.

                  
            
            
            source ~/.bashrc
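
You can confirm the environment variables are set correctly, for example:

            echo $SPARK_HOME
            
            spark-submit --version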
            

Step 5: Verify Spark Installation 

Spark ships with Python and Scala shells. You can start the Python Spark shell using the below command
                  
            
            
            pyspark
            
            
 

You can verify Hive integration with Apache Spark by executing the below command inside the PySpark or Scala Spark shell

                  
            
            
            spark.sql("SHOW DATABASES").show()
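
If the Hive metastore is reachable, this lists your Hive databases. You can also query a table from the shell; the database and table names below are placeholders for your own:

            spark.sql("SELECT * FROM mydatabase.mytable LIMIT 5").show()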
            
            
            
 

You can exit the PySpark session by typing quit() and pressing Enter.
 
You can start the Scala Spark shell using the below command

                  
            
            
            
            spark-shell
            
            
            

You can exit the Scala Spark shell by typing :q and pressing Enter.

That's it, we have successfully installed and configured Apache Spark on Ubuntu 20.04 LTS. If you face any issues, let me know in the comments section.