Amazon Elastic Compute Cloud (EC2) is a service that offers compute capacity in the Amazon Web Services (AWS) cloud. Amazon EC2 M5 instances are the fifth-generation EC2 instances, ideal for general-purpose computing as they offer a balance of compute, memory, and networking resources. M5 instances can be used as servers, caching fleets, app […]
Amazon EC2 Spot Instances: Most and Least Interrupted Instance Types
Amazon EC2 Spot Instances are one of the three ways of purchasing EC2 instances, the other two being On-Demand and Reserved Instances. Spot Instances are the cheapest of the three and are cost-effective for running fault-tolerant workloads. Before you start using Spot Instances, it's important to understand that Spot Instances will be […]
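Since Spot pricing varies by instance type and region, a common first step is to look at recent Spot prices before bidding. A minimal sketch using the AWS CLI, assuming the `m5.large` instance type and the `us-east-1` region as placeholders:

```shell
# Sketch: check recent Spot prices for an instance type before requesting capacity.
# The instance type and region below are assumptions; substitute your own.
aws ec2 describe-spot-price-history \
  --instance-types m5.large \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --region us-east-1
```

The output lists the current Spot price per Availability Zone, which helps in choosing where a fault-tolerant workload is least likely to be interrupted.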
Apache Sqoop: Import data from RDBMS to HDFS in ORC Format
The Apache Sqoop import tool offers the capability to import data from an RDBMS (MySQL, Oracle, SQL Server, etc.) table into HDFS. Sqoop import provides native support for storing data in text files as well as binary formats such as Avro and Parquet. There is no native support for importing in ORC format. However, it's still possible to import in […]
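One way to work around the missing native ORC support is to route the import through HCatalog, letting Hive's storage handlers write the ORC files. A minimal sketch, assuming a hypothetical MySQL host, database, and table names:

```shell
# Sketch: import an RDBMS table into a Hive-managed ORC table via HCatalog.
# The connection string, credentials, and table names are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username sqoop_user -P \
  --table orders \
  --hcatalog-database default \
  --hcatalog-table orders_orc \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  -m 4
```

The `--hcatalog-storage-stanza` option is appended to the generated `CREATE TABLE` statement, which is what makes the target table ORC-backed rather than plain text.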
Cloudera CCA Spark and Hadoop Developer (CCA175) Certification – Preparation Guide
Cloudera's CCA Spark and Hadoop Developer (CCA175) exam validates the candidate's ability to employ various Big Data tools such as Hadoop, Spark, Hive, Impala, Sqoop, Flume, and Kafka to solve hands-on problems. I passed the CCA175 certification exam on May 13, 2019 and wanted to share my experience. This article has everything you should know about […]
Apache Spark: Repartition vs Coalesce
Repartition can be used to either increase or decrease the number of partitions, whereas Coalesce can only decrease it. Coalesce is a less expensive operation than Repartition because it reduces data movement between nodes, while Repartition shuffles all data over the network. What are partitions? The dataset in […]
Apache Spark on a Single Node/Pseudo Distributed Hadoop Cluster in macOS
This article describes how to set up and configure Apache Spark to run on a single-node/pseudo-distributed Hadoop cluster with the YARN resource manager. Apache Spark comes with the Spark Standalone resource manager by default. We can configure Spark to use the YARN resource manager instead of Spark's own so that the resource […]
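At a high level, pointing Spark at YARN comes down to telling Spark where the Hadoop configuration lives and then choosing `yarn` as the master at submit time. A minimal sketch, where the Hadoop configuration path and executor sizes are assumptions for a small single-node setup:

```shell
# Sketch: running Spark on YARN instead of Spark Standalone.
# 1. Tell Spark where the Hadoop config (core-site.xml, yarn-site.xml) lives.
#    The path below is a placeholder for your Hadoop installation.
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

# 2. Submit with YARN as the master; sizes are kept small for a single node.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 2 \
  --executor-memory 1g \
  "$SPARK_HOME"/examples/src/main/python/pi.py 10
```

With `--master yarn`, the executors are launched in YARN containers, so Spark shares the cluster's resources with other YARN applications instead of managing its own workers.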
Single Node/Pseudo Distributed Hadoop Cluster on macOS
This article walks through setting up and configuring a single-node, or pseudo-distributed, Hadoop cluster on macOS. A single-node cluster is very useful for development, as it reduces the need for an actual cluster when running quick tests. At the end of this tutorial, you'll have a single-node Hadoop cluster with all […]
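The heart of a pseudo-distributed setup is a pair of small configuration changes: pointing the default filesystem at a local NameNode and reducing HDFS replication to one, since a single node can only hold one copy of each block. A minimal sketch (the port is the customary default; adjust to your installation):

```xml
<!-- core-site.xml: point the default filesystem at a local HDFS NameNode. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

```xml
<!-- hdfs-site.xml: one node can only store a single replica of each block. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```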