Execute Linux Commands from Spark Shell and PySpark Shell

Linux commands can be executed from Spark Shell and PySpark Shell. This comes in handy during development for tasks such as listing the contents of an HDFS directory or a local directory. These methods are provided by the native libraries of the Scala and Python languages. Hence, we can even use these methods within […]
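As a rough illustration of the Python side, the sketch below uses the standard-library `subprocess` module, which is also available inside a PySpark shell. The helper name `run_command` is hypothetical and not taken from the article, and the HDFS command in the comment assumes the `hdfs` CLI is on the PATH.

```python
import subprocess


def run_command(args):
    """Run a command given as a list of arguments.

    Returns a (exit_code, stdout_text) tuple. `run_command` is an
    illustrative helper, not part of Spark or PySpark.
    """
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout


# List the contents of a local directory.
exit_code, output = run_command(["ls", "/"])
print(output)

# Listing an HDFS directory would look the same, assuming the `hdfs`
# command-line tool is installed and configured on this machine:
# exit_code, output = run_command(["hdfs", "dfs", "-ls", "/"])
```

Passing the command as a list avoids shell quoting issues; a non-zero exit code signals that the command failed.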

Course Review – Machine Learning A-Z: Hands-On Python & R In Data Science

I completed the Machine Learning A-Z: Hands-On Python & R In Data Science course on Udemy on Aug 1, 2019. I would say “Machine Learning A-Z for Programmers” is a more apt title for the course. It’s a beginner-friendly course aimed at programmers that covers a wide range of topics with hands-on programming with Python […]

Amazon EC2 Instances: M5 vs M5d vs M5a vs M5ad

Amazon Elastic Compute Cloud (EC2) is a service that offers compute capacity in the Amazon Web Services (AWS) cloud. Amazon EC2 M5 instances are fifth-generation EC2 instances that are well suited to general-purpose computing, as they offer a balance of compute, memory, and networking resources. M5 can be used as servers, caching fleets, app […]

Amazon EC2 Spot Instances: Most and Least Interrupted Instance Types

Amazon EC2 Spot Instances are one of three ways to purchase EC2 instances, the other two being on-demand and reserved instances. Spot instances are the cheapest of the three and are cost-effective for running fault-tolerant workloads. Before starting to use Spot instances, it’s important to understand that Spot instances will be […]

Apache Sqoop: Import data from RDBMS to HDFS in ORC Format

The Apache Sqoop import tool can import data from an RDBMS (MySQL, Oracle, SQL Server, etc.) table to HDFS. Sqoop import natively supports storing data as text files as well as in binary formats such as Avro and Parquet. There’s no native support for importing in ORC format. However, it’s still possible to import in […]

Cloudera CCA Spark and Hadoop Developer (CCA175) Certification – Preparation Guide

Cloudera’s CCA Spark and Hadoop Developer (CCA175) exam validates the candidate’s ability to employ various Big Data tools such as Hadoop, Spark, Hive, Impala, Sqoop, Flume, Kafka, etc. to solve hands-on problems. I passed the CCA175 certification exam on May 13, 2019 and wanted to share my experience. This article has everything you should know about […]

Apache Spark: Repartition vs Coalesce

Repartition can be used for increasing or decreasing the number of partitions, whereas Coalesce can only be used for decreasing the number of partitions. Coalesce is a less expensive operation than Repartition because Coalesce reduces data movement between the nodes, while Repartition shuffles all data over the network. Partitions What are partitions? The dataset in […]
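The difference can be pictured with a toy model in plain Python. This is only a sketch of the idea, not how Spark implements either operation: partitions are modelled as lists, and the function bodies below are assumptions made for illustration. In PySpark itself, the corresponding calls are `repartition(n)` and `coalesce(n)`.

```python
def repartition(partitions, n):
    """Toy full shuffle: every record is redistributed round-robin
    across n new partitions, so n may be larger or smaller than before."""
    records = [record for part in partitions for record in part]
    new_parts = [[] for _ in range(n)]
    for i, record in enumerate(records):
        new_parts[i % n].append(record)
    return new_parts


def coalesce(partitions, n):
    """Toy merge without a shuffle: whole existing partitions are glued
    together, so the partition count can only stay the same or shrink."""
    n = min(n, len(partitions))  # asking for more partitions has no effect
    new_parts = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        new_parts[i % n].extend(part)  # a partition moves as one unit
    return new_parts


parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(coalesce(parts, 2))     # existing partitions merged in place
print(repartition(parts, 3))  # individual records spread across 3 partitions
print(len(coalesce(parts, 8)), len(repartition(parts, 8)))
```

In the toy model, `coalesce` never splits an existing partition, which is why it moves less data, while `repartition` touches every record.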

Apache Spark on a Single Node/Pseudo Distributed Hadoop Cluster in macOS

This article describes how to set up and configure Apache Spark to run on a single node/pseudo distributed Hadoop cluster with the YARN resource manager. Apache Spark comes with a Spark Standalone resource manager by default. We can configure Spark to use the YARN resource manager instead of Spark’s own resource manager so that the resource […]

Single Node/Pseudo Distributed Hadoop Cluster on macOS

This article walks through setting up and configuring a single-node Hadoop cluster, or pseudo-distributed cluster, on macOS. A single-node cluster is very useful for development as it reduces the need for an actual cluster for running quick tests. At the end of this tutorial, you’ll have a single-node Hadoop cluster with all […]
