This article describes how to set up and configure Apache Spark to run on a single-node (pseudo-distributed) Hadoop cluster with the YARN resource manager. Apache Spark ships with its own Standalone resource manager by default. We can configure Spark to use the YARN resource manager instead, so that resource allocation is handled by YARN.
At the end of this tutorial, you’ll have Apache Spark set up on a single-node/pseudo-distributed Hadoop cluster on macOS.
You should have a single-node/pseudo-distributed Hadoop cluster set up on your Mac. Follow this guide to set it up if you haven’t already.
Install and configure Spark
Download Apache Spark Binary
Download the latest Apache Spark from the official website – https://spark.apache.org/downloads.html
spark-2.4.3-bin-hadoop2.7 was the latest version at the time of writing.
Unpack and move
Unpack the tar file. Update the location in the command if the tar file is in a different directory.
$ tar xzvf ~/Downloads/spark-2.4.3-bin-hadoop2.7.tgz -C ~/Downloads
Move the Spark binary directory to a preferred directory. We are using the ~/bin/ directory to store the Spark distribution. You can use any directory of your preference.
$ mkdir -p ~/bin
$ mv -f ~/Downloads/spark-2.4.3-bin-hadoop2.7 ~/bin/
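The unpack and move steps can also be combined, because tar's -C option extracts straight into a target directory. The following is a minimal sketch, not part of the original setup: the function name unpack_spark is hypothetical, and it assumes the destination defaults to ~/bin.

```shell
# Hypothetical helper: extract a Spark tarball directly into a destination directory.
unpack_spark() {
  tarball="$1"            # path to the downloaded .tgz file
  dest="${2:-$HOME/bin}"  # destination directory (assumed default: ~/bin)
  if [ ! -f "$tarball" ]; then
    echo "tarball not found: $tarball" >&2
    return 1
  fi
  # Create the destination if needed, then extract into it
  mkdir -p "$dest" && tar -xzf "$tarball" -C "$dest"
}
```

For example, unpack_spark ~/Downloads/spark-2.4.3-bin-hadoop2.7.tgz would leave the distribution in ~/bin/spark-2.4.3-bin-hadoop2.7 without a separate mv.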
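Add the following properties to the ~/.bash_profile file. YARN_CONF_DIR points Spark at Hadoop's configuration directory, so $HADOOP_HOME must already be set from the Hadoop setup.
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=$HOME/bin/spark-2.4.3-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin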
Source the .bash_profile file.
$ source ~/.bash_profile
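If you repeat the setup, appending the exports by hand can leave duplicate lines in the profile file. One way to avoid that is an idempotent append. This is only a sketch: append_once is a hypothetical helper name, not something Spark or Hadoop provides.

```shell
# Hypothetical helper: append a line to a file only if that exact line is not already there.
append_once() {
  line="$1"
  file="$2"
  touch "$file"  # make sure the file exists before grepping it
  # -x matches whole lines, -F treats the text literally (no regex)
  grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

A call such as append_once 'export SPARK_HOME=$HOME/bin/spark-2.4.3-bin-hadoop2.7' ~/.bash_profile writes the line once, no matter how often it is run. The single quotes keep $HOME unexpanded so the profile resolves it at login. On macOS versions where zsh is the default shell, the same lines would go in ~/.zshrc instead.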
Verify the variables are set
Verify $SPARK_HOME is set.
$ echo $SPARK_HOME
Verify Spark executable binaries are added to $PATH.
$ spark-submit --version
The output should look similar to the following.
$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_211
Branch
Compiled by user on 2019-05-01T05:08:38Z
Revision
Url
Type --help for more information.
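The checks in this section can be rolled into a single function that fails with a specific message when something is missing. This is a sketch under assumptions: check_spark_env is a hypothetical name, and it only inspects SPARK_HOME and PATH as configured earlier.

```shell
# Hypothetical helper: report whether the Spark environment looks usable.
check_spark_env() {
  if [ -z "${SPARK_HOME:-}" ]; then
    echo "SPARK_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$SPARK_HOME/bin/spark-submit" ]; then
    echo "spark-submit not found under $SPARK_HOME/bin" >&2
    return 1
  fi
  # Wrap PATH in colons so the entry matches at the start, middle, or end
  case ":$PATH:" in
    *":$SPARK_HOME/bin:"*) echo "OK: $SPARK_HOME/bin is on PATH" ;;
    *) echo "$SPARK_HOME/bin is not on PATH" >&2; return 1 ;;
  esac
}
```

Running check_spark_env after sourcing the profile gives a quick pass/fail before you move on to YARN.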
Start up Spark Shell
Start the Hadoop daemons.
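In a standard Hadoop installation, the start-dfs.sh and start-yarn.sh scripts under $HADOOP_HOME/sbin start HDFS and YARN, and the JDK's jps tool lists the running Java processes. The sketch below checks jps output for the daemons a pseudo-distributed cluster is expected to run. The function name missing_daemons and the exact daemon list are assumptions, so adjust them to match your Hadoop setup.

```shell
# Hypothetical helper: read `jps` output on stdin and print every expected
# Hadoop daemon that is not listed. Returns non-zero if any is missing.
missing_daemons() {
  jps_output=$(cat)
  missing=0
  for daemon in NameNode DataNode ResourceManager NodeManager; do
    # jps prints "<pid> <name>" per line; compare the name column exactly
    if ! printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qx "$daemon"; then
      echo "not running: $daemon"
      missing=1
    fi
  done
  return "$missing"
}
```

After starting the daemons, jps | missing_daemons prints nothing when all four are up. The exact match on the name column means SecondaryNameNode is not mistaken for NameNode.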
Start up spark-shell with yarn
$ spark-shell --master yarn
When spark-shell is started with the YARN resource manager in pseudo-distributed mode, the output should look similar to the following.
$ spark-shell --master yarn
2019-05-14 06:08:14,269 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-05-14 06:08:21,796 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://192.168.1.14:4040
Spark context available as 'sc' (master = yarn, app id = application_1557831968507_0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@533a8540

scala> spark
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@1ced1fb9

scala>
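For a non-interactive check that jobs actually run on YARN, you can submit the SparkPi example that ships with the binary distribution. The sketch below assumes the examples jar sits under $SPARK_HOME/examples/jars, as it does in the prebuilt spark-2.4.3-bin-hadoop2.7 package. The wrapper name run_spark_pi is hypothetical.

```shell
# Hypothetical helper: submit the bundled SparkPi example to YARN in client mode.
run_spark_pi() {
  # Pick the first examples jar shipped with the distribution, if any
  jar=$(ls "$SPARK_HOME"/examples/jars/spark-examples_*.jar 2>/dev/null | head -n 1)
  if [ -z "$jar" ]; then
    echo "no examples jar under $SPARK_HOME/examples/jars" >&2
    return 1
  fi
  # The trailing argument is the number of partitions SparkPi uses (default here: 10)
  spark-submit --master yarn --deploy-mode client \
    --class org.apache.spark.examples.SparkPi "$jar" "${1:-10}"
}
```

With the Hadoop daemons running, run_spark_pi 10 should end with a line starting with "Pi is roughly", and the job should appear in the YARN Resource Manager UI.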
Information about the Spark Job
YARN Resource Manager
Browse the YARN Resource Manager UI at http://localhost:8088/cluster.
Spark Application UI
From the YARN Resource Manager UI, click the ApplicationMaster hyperlink corresponding to the Spark shell to view the Spark shell application UI.
Congratulations! You have successfully set up Spark on a single-node/pseudo-distributed Hadoop cluster.