Apache Hadoop: Single-node or Pseudo-distributed cluster on macOS

This article walks through setting up and configuring a single-node, pseudo-distributed Hadoop cluster on macOS. A single-node cluster is useful for development, since it lets you run quick tests without provisioning an actual cluster.

At the end of this tutorial, you’ll have a single-node Hadoop cluster running all the essential Hadoop daemons: NameNode, DataNode, NodeManager, ResourceManager, and SecondaryNameNode.

Prerequisites

The two prerequisites for setting up a single-node Hadoop cluster are Java and SSH.

Java

Java must be installed, and the $JAVA_HOME environment variable must be set.

Install Java

Install Java from the official website – https://java.com/en/download/ (Java 8 is recommended for Hadoop 3.1).

Verify Java is installed

Check the Java version in the terminal:

$ java -version

If Java is installed, the version will be printed, similar to the output below.

$ java -version
java version "1.8.0_211"
Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)

Set $JAVA_HOME

Add the $JAVA_HOME environment variable to the ~/.bash_profile file:

$ echo 'export JAVA_HOME="$(/usr/libexec/java_home)"' >> ~/.bash_profile

Source the .bash_profile file

$ source ~/.bash_profile

Verify that $JAVA_HOME is set up properly

$ echo $JAVA_HOME
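
The output should be the path of the JDK home directory. The exact path depends on the installed JDK; with JDK 8 it may look similar to the following.

$ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/jdk1.8.0_211.jdk/Contents/Home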

SSH

SSH (Remote Login) is disabled by default on macOS. SSH must be enabled, and SSH keys must be set up, so that the Hadoop scripts can start and manage the daemons.

Enable SSH

Open System Preferences and go to Sharing.

Select the Remote Login checkbox to enable SSH.
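
Alternatively, Remote Login can be enabled from the terminal using the built-in systemsetup utility (administrator privileges required):

$ sudo systemsetup -setremotelogin on
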
Set up SSH Key

Generate SSH Key

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Add the newly created public key to authorized SSH keys

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

Verify SSH

Verify that you can SSH to localhost without a passphrase:

$ ssh localhost
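
On the first connection, SSH may ask you to confirm the host key; answer yes. After that, you should get a shell without being prompted for a password. The first connect prints a prompt similar to the following.

$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
...
Are you sure you want to continue connecting (yes/no)? yes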

Install and configure Hadoop

Download Hadoop Distribution

Download the latest Hadoop distribution from the official website – https://hadoop.apache.org/releases.html

hadoop-3.1.2 was the latest distribution at the time of writing.
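
If you prefer the terminal, the release tarball can also be fetched directly from the Apache archive (the URL below assumes the archive's standard layout for the 3.1.2 release):

$ curl -O https://archive.apache.org/dist/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz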

Unpack and move

Unpack the tar file. Update the path in the command if the tar file is in a different directory.

$ tar xzvf ~/Downloads/hadoop-3.1.2.tar.gz

Move the Hadoop distribution directory to a preferred location. We are using the /Users/ash/bin/ directory to store the Hadoop distribution; you can use any directory you prefer.

$ mkdir -p /Users/ash/bin/
$ mv -f ~/Downloads/hadoop-3.1.2 /Users/ash/bin/

Set variables

hadoop-env.sh

Edit the ~/bin/hadoop-3.1.2/etc/hadoop/hadoop-env.sh file to define the following parameters. Set HADOOP_HOME to the Hadoop distribution location on your machine.

export JAVA_HOME="$(/usr/libexec/java_home)"
export HADOOP_HOME=/Users/ash/bin/hadoop-3.1.2

.bash_profile

Add the following environment variables to the ~/.bash_profile file:

export HADOOP_VERSION=3.1.2
export HADOOP_HOME=$HOME/bin/hadoop-$HADOOP_VERSION
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export YARN_HOME=$HADOOP_HOME
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

Source the .bash_profile file.

$ source ~/.bash_profile

Verify the variables are set

Verify $HADOOP_HOME is set.

$ echo $HADOOP_HOME

The output should look similar to the following.

$ echo $HADOOP_HOME
/Users/ash/bin/hadoop-3.1.2

Verify Hadoop executable binaries are added to $PATH.

$ hadoop version

The output should look similar to the following.

$ hadoop version
Hadoop 3.1.2
Source code repository https://github.com/apache/hadoop.git -r 1019dde65bcf12e05ef48ac71e84550d589e5d9a
Compiled by sunilg on 2019-01-29T01:39Z
Compiled with protoc 2.5.0
From source with checksum 64b8bdd4ca6e77cce75a93eb09ab2a9

Configure site.xml files

Modify the following site.xml files with the properties shown below.

mapred-site.xml

$HADOOP_HOME/etc/hadoop/mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>

    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>

    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>

    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
</configuration>

yarn-site.xml

$HADOOP_HOME/etc/hadoop/yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <property>
        <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
        <value>98.5</value>
    </property>

    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME,HDFS_HOME</value>
    </property>
</configuration>

hdfs-site.xml

$HADOOP_HOME/etc/hadoop/hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

core-site.xml

$HADOOP_HOME/etc/hadoop/core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

Start up the Hadoop cluster

Start the cluster

Format the HDFS filesystem.

$ hdfs namenode -format
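
The format command prints a series of log lines; near the end, you should see a confirmation similar to the following (the storage directory path varies by user and configuration).

INFO common.Storage: Storage directory /tmp/hadoop-ash/dfs/name has been successfully formatted.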

Start the Hadoop daemons.

$ start-all.sh
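
Note that start-all.sh is marked as deprecated in Hadoop 3; it simply delegates to the HDFS and YARN start scripts, which can also be run individually:

$ start-dfs.sh
$ start-yarn.sh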

Verify the cluster is up

Verify NameNode, DataNode, NodeManager, ResourceManager, and SecondaryNameNode are running.

$ jps

The output should look similar to the following.

33703 SecondaryNameNode
34376 ResourceManager
34537 Jps
34473 NodeManager
33466 NameNode
33567 DataNode
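
As an optional smoke test, run one of the MapReduce example jobs bundled with the distribution (adjust the version in the jar name if yours differs). The pi example needs no input data and should finish by printing an estimate of Pi.

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar pi 2 5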


Information about the cluster

Browse the following web pages to find information about the Hadoop cluster.

Hadoop Health

Browse the Hadoop health overview (the NameNode web UI) at http://localhost:9870.

Yarn Resource Manager

Browse Yarn Resource Manager UI at http://localhost:8088/cluster.

Voila! You have a single-node Hadoop cluster up and running on your Mac.
