Tag: big data

Find the version of Apache Hive from Command Line Interface (CLI)

The version of Apache Hive can be retrieved from the command line, without navigating through configuration files or browsing a user interface. There are two commands that can be used from the command line to obtain the version of Apache Hive. COMMAND #1 This command follows the popular convention used by other […]
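As a sketch of the kind of commands the post describes (assuming a reasonably recent Hive installation; the `version()` UDF is available from Hive 2.1 onward):

```shell
# Print the Hive version using the conventional --version flag.
hive --version

# Alternatively, query the version from within Hive itself (Hive 2.1+).
hive -e "SELECT version();"
```

Both commands require the `hive` binary to be on the `PATH` of the machine you run them on.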

What is Cloud Computing?

We hear the term “Cloud Computing” a lot in the media, advertisements, news, and even memes. Cloud Computing has been a trending term throughout the last decade. But what does cloud computing mean? Cloud computing is the offering of computing as a service. Consumers can pay the cloud computing service for on-demand use of […]

My Path To AWS Certified Big Data Specialty

Amazon Web Services certifications are among the most reputable in the field of software engineering. I successfully completed the AWS Big Data Specialty certification on Nov 25, 2019. This certification tests the candidate on two of the most in-demand skills right now: cloud and big data technologies. Prior to taking this certification, I […]

Execute Linux Commands from Spark Shell and PySpark Shell

Linux commands can be executed from the Spark Shell and the PySpark Shell. This comes in handy during development for running Linux commands such as listing the contents of an HDFS directory or a local directory. These methods are provided by the native libraries of the Scala and Python languages. Hence, we can even use these methods within […]
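As a minimal sketch of the Python side: the PySpark shell is an ordinary Python interpreter, so the standard `subprocess` module can run a Linux command directly (in the Scala Spark shell, `scala.sys.process` plays the same role):

```python
import subprocess

# Run a Linux command and capture its output; here we list the current directory.
result = subprocess.run(["ls", "-l"], capture_output=True, text=True)

print(result.returncode)  # 0 on success
print(result.stdout)      # the directory listing as a string
```

The same call works unchanged inside the PySpark shell, since it is plain Python.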

Apache Spark: Repartition vs Coalesce

Repartition can be used to increase or decrease the number of partitions, whereas Coalesce can only decrease it. Coalesce is a less expensive operation than Repartition because Coalesce reduces data movement between the nodes, while Repartition shuffles all data over the network. Partitions What are partitions? The dataset in […]
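A minimal PySpark sketch of the difference, assuming a Spark environment is available (e.g. the `pyspark` shell or a local `SparkSession`):

```python
from pyspark.sql import SparkSession

# Assumes a local Spark installation; names and sizes are illustrative.
spark = SparkSession.builder.master("local[4]").appName("partitions-demo").getOrCreate()

df = spark.range(1_000_000)               # a simple DataFrame with an 'id' column
print(df.rdd.getNumPartitions())          # the initial partition count

# repartition() performs a full shuffle; it can increase OR decrease partitions.
df_more = df.repartition(8)

# coalesce() only merges existing partitions; it can only decrease the count.
df_fewer = df.coalesce(2)

spark.stop()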

Apache Spark on a Single Node/Pseudo Distributed Hadoop Cluster in macOS

This article describes how to set up and configure Apache Spark to run on a single node/pseudo distributed Hadoop cluster with the YARN resource manager. Apache Spark comes with the Spark Standalone resource manager by default. We can configure Spark to use the YARN resource manager instead of Spark’s own resource manager so that the resource […]
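As an illustrative fragment of the kind of setup the article covers (paths assume a standard Hadoop layout with `HADOOP_HOME` already set):

```shell
# Point Spark at the Hadoop configuration so it can locate the YARN ResourceManager.
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"

# Launch the Spark shell on YARN instead of the default standalone resource manager.
spark-shell --master yarn --deploy-mode client
```

With `--master yarn`, Spark executors are requested from YARN rather than from a standalone Spark master.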

Single Node/Pseudo Distributed Hadoop Cluster on macOS

This article walks through setting up and configuring a single node Hadoop Cluster or pseudo-distributed cluster on macOS. A single node cluster is very useful for development as it reduces the need for an actual cluster for running quick tests. At the end of this tutorial, you’ll have a single node Hadoop cluster with all […]
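As a rough sketch of the setup flow on macOS (assuming Homebrew is installed; the daemon scripts ship with Hadoop):

```shell
# Install Hadoop via Homebrew.
brew install hadoop

# Format the NameNode once before first use (this erases any existing HDFS data).
hdfs namenode -format

# Start the HDFS and YARN daemons for the pseudo-distributed cluster.
start-dfs.sh
start-yarn.sh

# List the running Java daemons (NameNode, DataNode, ResourceManager, NodeManager).
jps
```

The pseudo-distributed configuration itself (`core-site.xml`, `hdfs-site.xml`, etc.) still has to be edited as the article describes before the daemons will start.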
