Did you know that every minute:
50,000 photos are posted on Instagram,
500,000 photos are shared on Snapchat,
1,000,000 swipes are done on Tinder and
4,00,000 videos are watched on YouTube
Category: Apache Spark
Execute Linux Commands from Spark Shell and PySpark Shell
Linux commands can be executed from Spark Shell and PySpark Shell. This comes in handy during development, for example to list the contents of an HDFS directory or a local directory. These methods are provided by the native libraries of the Scala and Python languages. Hence, we can even use these methods within […]
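As a minimal sketch of the idea on the Python side, the standard `subprocess` module can run a Linux command from the PySpark Shell (or any Python session). The helper name `run_command` and the HDFS path in the comment are hypothetical, chosen only for illustration, and the example assumes a Unix-like system where `ls` is available.

```python
import subprocess


def run_command(args):
    """Run a command given as a list of arguments and return its standard output as text."""
    # check=True raises CalledProcessError if the command exits with a non-zero status.
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


# List the contents of the current local directory.
print(run_command(["ls", "-l", "."]))

# Listing an HDFS directory would follow the same pattern, assuming the
# `hdfs` client is on the PATH. The path below is a hypothetical example:
# print(run_command(["hdfs", "dfs", "-ls", "/user/hypothetical_user"]))
```

Passing the command as a list of arguments, rather than a single string run through a shell, avoids quoting problems when a path contains spaces.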
Cloudera CCA Spark and Hadoop Developer (CCA175) Certification – Preparation Guide
Cloudera’s CCA Spark and Hadoop Developer (CCA175) exam validates the candidate’s ability to employ various Big Data tools such as Hadoop, Spark, Hive, Impala, Sqoop, Flume, Kafka, etc., to solve hands-on problems. I passed the CCA175 certification exam on May 13, 2019 and wanted to share my experience. This article has everything you should know about […]
Apache Spark: Repartition vs Coalesce
Repartition can be used to increase or decrease the number of partitions, whereas Coalesce can only be used to decrease it. Coalesce is a less expensive operation than Repartition because it reduces data movement between the nodes, while Repartition shuffles all data over the network. Partitions What are partitions? The dataset in […]
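The difference described above can be illustrated with a small plain-Python toy model. This is not Spark code and not how Spark implements these operations: partitions are modelled as lists of records, and the functions `coalesce` and `repartition` below are hypothetical stand-ins that only mirror the behaviour summarised in the excerpt.

```python
def coalesce(partitions, n):
    """Toy model: merge whole partitions into n groups without splitting any of them.

    If n is not smaller than the current count, the partitioning is left unchanged,
    since this operation can only decrease the number of partitions.
    """
    if n >= len(partitions):
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring partitions are appended to the same target as a block,
        # so records that started together stay together.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Toy model: redistribute every record round-robin across n new partitions."""
    out = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for record in part:
            # Every single record is reassigned, which stands in for a full shuffle.
            out[i % n].append(record)
            i += 1
    return out


parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(coalesce(parts, 2))      # fewer partitions, whole partitions merged
print(coalesce(parts, 8))      # still four partitions
print(repartition(parts, 8))   # more partitions, records spread out
print(repartition(parts, 2))   # fewer partitions, records interleaved
```

In the toy model, `coalesce` never breaks a partition apart, while `repartition` touches every record, which is the intuition behind Coalesce being the cheaper operation.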
Apache Spark on a Single Node/Pseudo Distributed Hadoop Cluster in macOS
This article describes how to set up and configure Apache Spark to run on a single node/pseudo distributed Hadoop cluster with the YARN resource manager. Apache Spark comes with a Spark Standalone resource manager by default. We can configure Spark to use the YARN resource manager instead of Spark’s own resource manager so that the resource […]