Amazon EMR and Google Cloud Dataproc: Top 10 Common Features

Amazon Web Services and Google Cloud Platform are two of the three market leaders in cloud computing. Both offer similar cloud-native big data platforms to filter, transform, aggregate and process data at scale. Amazon EMR and Google Cloud Dataproc are the managed big data platforms of Amazon Web Services and Google Cloud Platform respectively. Essentially, both EMR and Dataproc are on-demand managed Hadoop cluster services. While each offers exclusive features, there are many useful features that are offered by both services. These features can be leveraged to make clusters cost-effective, reliable, efficient and secure. In this blog post, we will discuss the top 10 common features.

EMR and Dataproc: Top 10 Features

Managed Hadoop Cluster

EMR and Dataproc are managed Hadoop and Spark services that can be used for big data processing, streaming, interactive analysis and machine learning. Users do not have to worry about setting up Hadoop and Spark because the clusters come preconfigured. EMR and Dataproc clusters can be created with many of the popular Apache Hadoop ecosystem components installed, and the services take care of installing and configuring these applications. The applications include Apache Spark, Apache Hadoop, Apache Pig, Apache Hive, Presto, Jupyter and Zeppelin.

Resizing

One of the biggest challenges of on-premises clusters is scalability. Scaling them is time-consuming and expensive. If a cluster is scaled out to meet the requirements of peak loads, it may be under-utilized during normal or smaller loads, and we would be paying for the entire cluster even when we are not using all of it. With EMR and Dataproc, the size of the cluster can be specified at creation time and adjusted at any time afterwards. For instance, we can create the cluster with a specific size and, while it is running, scale it out to meet a heavier workload. Once the heavier workload is completed, the cluster can be scaled back in to its original size.
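To make the cost argument concrete, here is a minimal sketch that compares a cluster fixed at peak size with one that is scaled out only for the peak. The node counts, hours and hourly rate are made-up assumptions for illustration, not real EMR or Dataproc prices.

```python
def node_hours(schedule):
    """Total node-hours for a list of (hours, node_count) periods."""
    return sum(hours * nodes for hours, nodes in schedule)


# Hypothetical price per node per hour; not a real EMR or Dataproc rate.
HOURLY_RATE = 0.25

# Option 1: keep 20 nodes running all day to cover the peak.
fixed_at_peak = [(24, 20)]

# Option 2: run 5 nodes for 20 hours and scale out to 20 nodes for a 4-hour peak.
resized = [(20, 5), (4, 20)]

fixed_cost = node_hours(fixed_at_peak) * HOURLY_RATE
resized_cost = node_hours(resized) * HOURLY_RATE

print(f"fixed at peak: {node_hours(fixed_at_peak)} node-hours, cost {fixed_cost}")
print(f"resized:       {node_hours(resized)} node-hours, cost {resized_cost}")
```

Under these assumed numbers, the resized cluster uses 180 node-hours instead of 480 for the same day.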

Autoscaling

Estimating the right cluster size for a workload is challenging. If we predefine the size of the cluster, there’s a high probability that the cluster will either be under-utilized or slowed down because its resources are fully utilized. Both EMR and Dataproc offer features to scale the cluster automatically while jobs are in flight. We can specify a scaling policy with EMR and an autoscaling policy with Dataproc to describe how the cluster should scale out or scale in based on cluster metrics such as memory utilization. For example, we can specify a policy that adds more instances when memory utilization reaches 80%. If the cluster is launched with this policy attached, more nodes are added while the cluster is running once memory utilization reaches 80%.
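The 80% rule described above can be sketched as a simple threshold function. This is only an illustration of the decision logic; it is not the policy format used by EMR or Dataproc. The scale-in threshold, step size and node limits are assumptions.

```python
def desired_node_count(current_nodes, memory_utilization,
                       min_nodes=2, max_nodes=20,
                       scale_out_at=0.80, scale_in_at=0.30, step=2):
    """Return the node count a simple threshold policy would ask for.

    memory_utilization is a fraction between 0.0 and 1.0.
    All thresholds and limits are hypothetical defaults.
    """
    if memory_utilization >= scale_out_at:
        target = current_nodes + step      # under pressure: add nodes
    elif memory_utilization <= scale_in_at:
        target = current_nodes - step      # mostly idle: remove nodes
    else:
        target = current_nodes             # within the comfortable band
    # Never go below the minimum or above the maximum cluster size.
    return max(min_nodes, min(max_nodes, target))


for utilization in (0.85, 0.50, 0.20):
    print(utilization, desired_node_count(10, utilization))
```

The minimum and maximum bounds matter as much as the thresholds: they cap the cost of a runaway scale-out and keep enough nodes alive for the cluster to stay usable.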

On-demand Ephemeral Clusters

Unlike on-premises clusters, clusters in the cloud do not have to run 24x7 if there’s no need. For batch processing, clusters can be created on demand and terminated when the processing is completed. This is one of the main selling points of cloud clusters, as you only pay for the resources used while the jobs are active. Both EMR and Dataproc offer automatic termination of the cluster. EMR’s step execution and Dataproc Workflow Templates can be leveraged to terminate a cluster automatically after a job or a series of jobs is completed.
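The create, run, terminate lifecycle of an ephemeral cluster can be sketched with a context manager. `FakeClusterClient` and its methods are hypothetical in-memory stand-ins, not the EMR or Dataproc SDK. The point is only that termination happens even when a job fails.

```python
from contextlib import contextmanager


class FakeClusterClient:
    """In-memory stand-in for a cloud SDK; not a real EMR or Dataproc API."""

    def __init__(self):
        self.running = set()
        self._next_id = 0

    def create_cluster(self, name):
        self._next_id += 1
        cluster_id = f"{name}-{self._next_id}"
        self.running.add(cluster_id)
        return cluster_id

    def terminate_cluster(self, cluster_id):
        self.running.discard(cluster_id)


@contextmanager
def ephemeral_cluster(client, name):
    """Create a cluster for the duration of the block, then terminate it."""
    cluster_id = client.create_cluster(name)
    try:
        yield cluster_id
    finally:
        # Runs on success and on failure, so no cluster is left billing.
        client.terminate_cluster(cluster_id)


client = FakeClusterClient()
with ephemeral_cluster(client, "nightly-batch") as cluster_id:
    was_running = cluster_id in client.running   # jobs would be submitted here
still_running = cluster_id in client.running

print(was_running, still_running)
```

With the managed services, the same effect is achieved declaratively through step execution or workflow templates rather than through client-side code like this.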

Long-running Clusters

Long-running clusters are required for use cases such as streaming jobs and interactive or ad hoc analysis using Jupyter notebooks and Zeppelin. Both EMR and Dataproc clusters can be kept running for long periods. Furthermore, these long-running clusters don’t have to stay the same size, as they can be scaled out and scaled in based on the workload. For instance, a cluster can run throughout the week to process real-time streaming data.

Kerberos Authentication

Kerberos is a network authentication protocol that uses secret-key cryptography to provide strong authentication for nodes in a network. It allows nodes communicating over an insecure network to prove their identities to one another in a secure manner, and it can also encrypt their communications to protect against eavesdropping and replay attacks. Both EMR and Dataproc clusters can be configured with Kerberos authentication enabled.

Flexible Virtual Machine Instance Types

Not all workloads require the same kind of resources. Some workloads are memory-intensive while others are CPU-intensive. It’s imperative to choose a virtual machine type whose configuration satisfies the needs of the workload. Both Amazon and Google offer various instance families, each with a wide range of sizes. For example, Amazon’s m5 instance family is geared towards general-purpose computing. The smallest instance size in the m5 family (m5.large) has 2 vCPUs and 8 GiB of memory, while the largest (m5.24xlarge) has 96 vCPUs and 384 GiB of memory. Similarly, the N1 machine type family is Google’s general-purpose machine type. The smallest instance size in the N1 family (n1-standard-1) has 1 vCPU and 3.75 GB of memory, while the largest (n1-standard-96) has 96 vCPUs and 360 GB of memory. Amazon and Google offer other instance type families optimized for use cases such as compute-intensive, memory-intensive and I/O-intensive workloads. With this wide range of instance families and sizes, EMR and Dataproc clusters can be created with the virtual machine instance type best suited to the workload.
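As a rough illustration of matching an instance type to a workload, the sketch below picks the smallest type in a family that satisfies a vCPU and memory requirement. The catalog contains only the four sizes quoted above, and `smallest_fit` is a hypothetical helper, not part of any cloud SDK.

```python
# (name, vCPUs, memory) using the figures quoted in the text.
# Memory is in GiB for m5 and GB for n1, as the providers list them.
M5_FAMILY = [("m5.large", 2, 8), ("m5.24xlarge", 96, 384)]
N1_FAMILY = [("n1-standard-1", 1, 3.75), ("n1-standard-96", 96, 360)]


def smallest_fit(family, vcpus_needed, memory_needed):
    """Return the name of the smallest type meeting both requirements, or None."""
    candidates = [
        (vcpus, memory, name)
        for name, vcpus, memory in family
        if vcpus >= vcpus_needed and memory >= memory_needed
    ]
    if not candidates:
        return None
    # Tuples sort by vCPUs first, then memory, so min() is the smallest fit.
    return min(candidates)[2]


print(smallest_fit(M5_FAMILY, 2, 8))
print(smallest_fit(N1_FAMILY, 1, 4))
print(smallest_fit(N1_FAMILY, 128, 1))
```

A real catalog would list every size in the family, along with the compute-optimized and memory-optimized families, so the chosen type would usually sit much closer to the actual requirement.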

Custom Virtual Machine Images

Virtual machine images contain the information required to launch a virtual machine instance. They are templates that capture details such as the operating system, installed software and users. Amazon’s virtual machine images are called Amazon Machine Images (AMI), and on Google Cloud user-built images are called custom images. Both EMR and Dataproc clusters can be provisioned with custom virtual machine images: EMR clusters can be launched with a custom AMI for the instances, while Dataproc clusters can be launched with a custom image. With these images, applications can be pre-installed and other customizations applied before the cluster starts.

Cost-effective Instances with Interruptions

Cloud providers sell their unused compute capacity at steep discounts, up to 80% cheaper than regular instances. The catch is that the provider can reclaim these instances at any time. These instances are well suited to optional tasks, batch jobs and fault-tolerant workloads. AWS and Google offer this cost-effective choice of virtual machine instances as EC2 Spot Instances and Preemptible Virtual Machine instances respectively. An EMR cluster can be launched with a combination of Spot and On-Demand instances, while a Dataproc cluster can be launched with a combination of preemptible and regular instances. One use case is to run clusters with Spot or preemptible instances for fault-tolerant, time-insensitive batch processing of big data workloads.
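The effect of mixing regular and discounted instances can be estimated with a few lines of arithmetic. The sketch below assumes a hypothetical hourly rate and the 80% discount mentioned above. Real Spot and preemptible prices vary by instance type, region and time.

```python
def blended_hourly_cost(regular_nodes, discounted_nodes, regular_rate, discount):
    """Hourly cost of a cluster mixing regular and discounted instances.

    discount is a fraction, e.g. 0.80 for "80% cheaper".
    """
    discounted_rate = regular_rate * (1 - discount)
    return regular_nodes * regular_rate + discounted_nodes * discounted_rate


# Hypothetical numbers: 20 nodes at 0.20 per node-hour.
all_regular = blended_hourly_cost(20, 0, regular_rate=0.20, discount=0.80)
mixed = blended_hourly_cost(4, 16, regular_rate=0.20, discount=0.80)

print(f"all regular: {all_regular:.2f} per hour")
print(f"4 regular + 16 discounted: {mixed:.2f} per hour")
```

Keeping a few regular instances in the mix is a common design choice, because the cluster keeps a stable core when discounted instances are reclaimed.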

Instances with Block Storage

By default, storage has to be attached separately to the virtual machine instances, typically as network-attached volumes. In this case, the latency between the instance and the storage is higher. For latency-sensitive or high-throughput workloads, it’s vital to have the storage disks directly attached to the virtual machine instances. Amazon’s instance store and Google’s Local SSD provide disks that are physically attached to the host running the instance, whereas Google’s Persistent Disk is network-attached. EMR clusters can be created with instance types that include instance store volumes, and Cloud Dataproc clusters can be created with local SSDs. For example, clusters whose instances have directly attached storage are well suited to I/O-intensive workloads that access HDFS.

Final Words

These are just some of the most important common features offered by Amazon EMR and Google Cloud Dataproc. These features should be leveraged to make clusters cost-effective, efficient and secure. There are many more common features, as the two services have similar service models. In addition, each service offers exclusive features that the other does not.
