Apache Spark on a Single-node or Pseudo-distributed Hadoop Cluster on macOS


This article describes how to set up and configure Apache Spark to run on a single-node/pseudo-distributed Hadoop cluster with the YARN resource manager. Apache Spark ships with its own Spark Standalone resource manager by default. We can configure Spark to use YARN instead, so that resource allocation is handled by YARN.

At the end of this tutorial, you’ll have Apache Spark set up on a single-node/pseudo-distributed Hadoop cluster on macOS.


You should have a single-node/pseudo-distributed Hadoop cluster set up on your Mac. Follow this guide to set it up if you haven’t already.
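Before installing Spark, it helps to confirm that the Hadoop-related variables from that guide are in place. A minimal sanity check (assuming the guide exported HADOOP_HOME and JAVA_HOME in ~/.bash_profile):

```shell
# Sanity check: these variables should already be set by the Hadoop guide.
# The ${VAR:-fallback} form prints a reminder instead of an empty line.
echo "${HADOOP_HOME:-HADOOP_HOME is not set}"
echo "${JAVA_HOME:-JAVA_HOME is not set}"
```

If either line prints the "not set" reminder, revisit the Hadoop setup guide before continuing.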

Install and configure Spark

Download Apache Spark Binary

Download the latest Apache Spark from the official website – https://spark.apache.org/downloads.html

spark-2.4.3-bin-hadoop2.7 was the latest version at the time of writing.
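If you prefer the command line, the release can also be fetched with curl. The archive.apache.org URL below follows Apache's usual release-archive layout but is an assumption; cross-check it against the downloads page before relying on it.

```shell
# Assumed direct-download URL based on Apache's archive layout; falls back
# to a message if the fetch fails or times out.
curl -fL --max-time 60 -o ~/Downloads/spark-2.4.3-bin-hadoop2.7.tgz \
  https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz \
  || echo "Download failed; fetch the tarball from the downloads page instead."
```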

Unpack and move

Unpack the tar file. Update the location in the command if the tar file is in a different directory.

$ tar xzvf /Users/ash/Downloads/spark-2.4.3-bin-hadoop2.7.tgz

Move the Spark binary directory to a preferred location. We are using the ~/bin directory (/Users/ash/bin), where the Hadoop distribution is also stored. You can use any directory of your preference.

$ mkdir -p ~/bin
$ mv -f /Users/ash/Downloads/spark-2.4.3-bin-hadoop2.7 ~/bin/
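To confirm the move worked, list the relocated directory; you should see the standard Spark layout (bin, conf, jars, and so on). This is a sketch using the paths from this guide:

```shell
# List the relocated Spark directory; prints a hint if it is missing.
ls ~/bin/spark-2.4.3-bin-hadoop2.7 2>/dev/null \
  || echo "Spark directory not found under ~/bin"
```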

Set variables


Add the following lines to the ~/.bash_profile file.

export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=$HOME/bin/spark-2.4.3-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin

Source the .bash_profile file.

$ source ~/.bash_profile
Verify the variables are set

Verify $SPARK_HOME is set.

$ echo $SPARK_HOME

Verify Spark executable binaries are added to $PATH.

$ spark-submit --version

The output should look similar to the following.

$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3

Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_211
Compiled by user  on 2019-05-01T05:08:38Z
Type --help for more information.

Start up Spark Shell

Start the Hadoop daemons.

$ start-all.sh

Start up spark-shell with the YARN master.

$ spark-shell --master yarn

When spark-shell is started with the YARN resource manager in pseudo-distributed mode, the startup log shows the YARN application id and confirms that the Spark context and session are available.

$ spark-shell --master yarn
2019-05-14 06:08:14,269 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-05-14 06:08:21,796 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at
Spark context available as 'sc' (master = yarn, app id = application_1557831968507_0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@533a8540

scala> spark
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@1ced1fb9
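Beyond the interactive shell, you can smoke-test the YARN setup by submitting the SparkPi example that ships with the distribution. The class name and jar path below match the standard layout of the 2.4.3 binary release; the command -v guard is only there so the sketch degrades gracefully if spark-submit is not on the PATH.

```shell
# Submit the bundled SparkPi example to YARN (jar path per the standard
# spark-2.4.3-bin-hadoop2.7 layout). Guarded so it only runs when
# spark-submit is actually available on PATH.
if command -v spark-submit >/dev/null 2>&1; then
  spark-submit --master yarn \
    --class org.apache.spark.examples.SparkPi \
    "$SPARK_HOME/examples/jars/spark-examples_2.11-2.4.3.jar" 10
else
  echo "spark-submit not found on PATH"
fi
```

A successful run prints an approximation of Pi in the driver output, and the job appears in the YARN Resource Manager UI described below.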


Information about the Spark Job

Yarn Resource Manager

Browse the Yarn Resource Manager UI at http://localhost:8088/cluster.

Yarn Resource Manager UI

Spark Application UI

From the Yarn Resource Manager UI, click the ApplicationMaster link corresponding to the Spark shell to view the Spark application UI.

Congratulations! You have successfully set up Spark on a single-node/pseudo-distributed Hadoop cluster.
