Apache Spark on a Single Node/Pseudo Distributed Hadoop Cluster in macOS

This article describes how to set up and configure Apache Spark to run on a single node/pseudo distributed Hadoop cluster with the YARN resource manager. Apache Spark ships with its own Spark Standalone resource manager by default. We can configure Spark to use YARN instead of Spark's own resource manager, so that resource allocation is handled by YARN.

At the end of this tutorial, you’ll have Apache Spark set up on a single node/pseudo distributed Hadoop Cluster in macOS.

Prerequisites

You should have a single node/pseudo distributed Hadoop cluster set up on your Mac. Follow this guide to set it up if you haven’t already.


Install and configure Spark

Download Apache Spark Binary

Download the latest Apache Spark from the official website – https://spark.apache.org/downloads.html

spark-2.4.3-bin-hadoop2.7 was the latest version at the time of writing.
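You can also fetch the archive from the command line. The mirror URL below is one possibility (the Apache release archive); the download page may suggest a different mirror for you.

    # Fetch the Spark 2.4.3 binary built for Hadoop 2.7 (example mirror URL).
    curl -O https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz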

Unpack and move

Unpack the tar file. Update the location in the command if the tar file is in a different directory.
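For example, assuming the tar file is in ~/Downloads:

    tar -xzf ~/Downloads/spark-2.4.3-bin-hadoop2.7.tgz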

Move the Spark binary directory to a preferred location. We are using the /Users/ash/bin/ directory, which also holds the Hadoop distribution from the prerequisite guide. You can use any directory of your preference.
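A sketch of the move, assuming the archive was unpacked in ~/Downloads and the /Users/ash/bin/ location above:

    mv ~/Downloads/spark-2.4.3-bin-hadoop2.7 /Users/ash/bin/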

Set variables

Add the following properties to the ~/.bash_profile file.
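A minimal sketch, assuming the /Users/ash/bin/ location used above. Spark needs HADOOP_CONF_DIR to locate the YARN configuration; the $HADOOP_HOME variable is assumed to come from the prerequisite Hadoop guide, so skip that line if it is already exported there.

    export SPARK_HOME=/Users/ash/bin/spark-2.4.3-bin-hadoop2.7
    export PATH=$SPARK_HOME/bin:$PATH
    # Point Spark at the Hadoop/YARN configuration directory.
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop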

Source the .bash_profile file.
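    source ~/.bash_profile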

Verify the variables are set

Verify $SPARK_HOME is set.
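    echo $SPARK_HOME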

Verify Spark executable binaries are added to $PATH.
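    which spark-shell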

The output should look similar to the following.
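Assuming the install location used above:

    /Users/ash/bin/spark-2.4.3-bin-hadoop2.7/bin/spark-shell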


Start up Spark Shell

Start the Hadoop daemons.
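Assuming $HADOOP_HOME points at the Hadoop installation from the prerequisite guide:

    $HADOOP_HOME/sbin/start-dfs.sh
    $HADOOP_HOME/sbin/start-yarn.sh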

Start up spark-shell with YARN.
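    # Launch the Spark shell against the YARN resource manager.
    spark-shell --master yarn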

When spark-shell is started with the YARN resource manager in pseudo distributed mode, the shell registers an application with YARN before the Scala prompt appears, so startup takes a little longer than in local mode. The running application then shows up in the YARN Resource Manager UI, described in the next section.
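Once the prompt appears, a quick sanity check can confirm that jobs run through the YARN-managed executors; sc is the SparkContext that spark-shell creates for you. A session would look something like:

    scala> val data = sc.parallelize(1 to 1000)
    scala> data.sum()
    res0: Double = 500500.0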


Information about the Spark Job

YARN Resource Manager

Browse the YARN Resource Manager UI at http://localhost:8088/cluster.


Spark Application UI

From the YARN Resource Manager UI, click the ApplicationMaster link in the row for the Spark shell application to open the Spark application UI.


Congratulations! You have successfully set up Spark on a single node/pseudo distributed Hadoop cluster.
