Amazon EMR and Google Cloud Dataproc: 10 Powerful Shared Features

Amazon Web Services and Google Cloud Platform are two of the three market leaders in cloud computing. Both offer cloud-native big data platforms to filter, transform, aggregate, and process data at scale: Amazon EMR on Amazon Web Services and Google Cloud Dataproc on Google Cloud Platform. Essentially, both EMR and Dataproc are on-demand managed Hadoop cluster services. While each offers some exclusive features, many useful features are common to both. These features can be leveraged to make the clusters cost-effective, reliable, efficient, and secure. In this blog post, we will discuss 10 of the top common features.

Managed Hadoop Cluster

EMR and Dataproc are managed Hadoop and Spark services that can be used for big data processing, streaming, interactive analysis, and machine learning. Users do not have to worry about setting up Hadoop and Spark, as the clusters come preconfigured. Clusters can be created with many of the popular Apache Hadoop ecosystem components installed, and EMR and Dataproc take care of installing and configuring them. The available applications include Apache Spark, Apache Hadoop, Apache Pig, Apache Hive, Presto, Jupyter, and Zeppelin.
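To make "preconfigured" a little more concrete, here is a minimal sketch of what a cluster request with preinstalled applications might look like on the EMR side. It only builds the dictionary of arguments one could pass to the run_job_flow call of boto3's EMR client; nothing is sent to AWS. The cluster name, release label, instance type, and node count are assumptions chosen for illustration, and the role names are the usual EMR defaults, which may differ in your account.

```python
# Sketch only: builds keyword arguments for boto3's EMR `run_job_flow`.
# No AWS call is made; all concrete values below are illustrative assumptions.

def build_emr_request(name, applications, instance_type="m5.xlarge", node_count=3):
    """Return a request dict for a cluster with the given applications installed."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick one available to you
        # EMR installs and configures every application listed here.
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": node_count,  # total nodes, including the primary node
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default role names; may differ
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request("demo-cluster", ["Spark", "Hive", "Presto"])
print([app["Name"] for app in request["Applications"]])
```

With credentials in place, a dictionary like this could be unpacked into the client call. Dataproc exposes the same idea through optional components selected at cluster creation.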

Resizing

One of the biggest challenges of on-premises clusters is scalability. It’s time-consuming and expensive to scale them. If a cluster is scaled out to meet the requirements of peak loads, it may be under-utilized during normal or smaller loads, and we would be paying for the entire cluster even when we are not using all of it. With EMR and Dataproc, the size of the cluster can be specified at the time of creation and adjusted at any time afterwards. For instance, we can create the cluster with a specific size and then, while the cluster is running, scale it out to meet a heavier workload. Once the heavier workload is completed, the cluster can be scaled back to its original size.
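A quick back-of-the-envelope calculation shows why resizing matters. The sketch below compares a cluster kept at peak size all day with one that is scaled out only for a four-hour peak. The node counts, durations, and the price per node-hour are all hypothetical and are not actual EMR or Dataproc prices.

```python
# Hypothetical numbers throughout; this only illustrates the arithmetic.

def cluster_cost(phases, price_per_node_hour):
    """phases is a list of (node_count, hours) pairs; returns the total cost."""
    node_hours = sum(nodes * hours for nodes, hours in phases)
    return node_hours * price_per_node_hour


PRICE = 0.25  # assumed price per node-hour

# Sized for the peak the whole day: 20 nodes for 24 hours.
fixed = cluster_cost([(20, 24)], PRICE)

# Resized: 5 nodes for 20 hours, scaled out to 20 nodes for a 4-hour peak.
resized = cluster_cost([(5, 20), (20, 4)], PRICE)

savings_percent = 100 * (fixed - resized) / fixed
print(fixed, resized, savings_percent)  # 120.0 45.0 62.5
```

Under these assumptions the resized cluster uses 180 node-hours instead of 480, which is where the 62.5% saving comes from.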

Autoscaling

Estimating the right cluster size for a workload is challenging. If we predefine the size of the cluster, there’s a high probability that it will either be underutilized or slowed down because its resources are exhausted. Both EMR and Dataproc offer features to automatically scale the cluster while jobs are in flight. We can specify a Scaling Policy with EMR and an Autoscaling Policy with Dataproc to describe how the cluster should scale out or scale in based on metrics such as memory, CPU, and storage. For example, we can specify a policy to add more instances when memory utilization reaches 80%. If the cluster is launched with this policy attached, nodes are added while the cluster is running whenever memory utilization reaches 80%.
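The 80% rule described above can be sketched as a small threshold function. This is a simplified illustration of how such a policy behaves, not the actual algorithm used by EMR or Dataproc; real policies also involve cooldown periods and service-specific metrics. The ScalingPolicy class, its field names, and the sample utilization readings are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical policy shape; not the schema of either service."""
    min_nodes: int
    max_nodes: int
    scale_out_threshold: float  # add nodes at or above this memory utilization
    scale_in_threshold: float   # remove nodes at or below this memory utilization
    step: int                   # nodes added or removed per adjustment


def next_node_count(current, memory_utilization, policy):
    """Return the node count after applying the policy to one metric reading."""
    if memory_utilization >= policy.scale_out_threshold:
        target = current + policy.step
    elif memory_utilization <= policy.scale_in_threshold:
        target = current - policy.step
    else:
        target = current
    # Never leave the configured bounds.
    return max(policy.min_nodes, min(policy.max_nodes, target))


policy = ScalingPolicy(min_nodes=2, max_nodes=10,
                       scale_out_threshold=0.80, scale_in_threshold=0.30, step=2)

nodes = 4
history = []
for utilization in [0.55, 0.82, 0.91, 0.95, 0.97, 0.40, 0.20]:
    nodes = next_node_count(nodes, utilization, policy)
    history.append(nodes)

print(history)  # [4, 6, 8, 10, 10, 10, 8]
```

Note how the cluster stops growing at max_nodes even though utilization stays above 80%, and only shrinks once utilization drops to the scale-in threshold.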

On-demand Ephemeral clusters

Unlike on-premises clusters, clusters in the cloud do not have to run around the clock if there’s no need. For batch processing, clusters can be created on demand and terminated when the processing is completed. This is one of the main selling points of clusters in the cloud, as you only pay for the resources that are in use while the jobs are active. Both EMR and Dataproc offer automatic termination of the cluster. EMR’s step execution and Dataproc Workflow Templates can be leveraged to terminate the cluster automatically after a job or a series of jobs is completed.
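As an illustration of the ephemeral pattern on the Dataproc side, the sketch below assembles an inline Workflow Template as a plain dictionary. With a managed cluster placement, Dataproc creates the cluster when the workflow starts and deletes it once the jobs finish. The field names follow the shape used by the google-cloud-dataproc Python client as far as I know it, so treat them as assumptions to verify; the cluster name and gs:// paths are hypothetical, and nothing is submitted here.

```python
# Sketch only: builds an inline workflow template dict; no API call is made.
# Cluster name and job URIs are hypothetical placeholders.

def build_workflow_template(cluster_name, job_uris):
    """Chain PySpark jobs so that each step runs after the previous one."""
    jobs = []
    previous_step = None
    for index, uri in enumerate(job_uris):
        step_id = f"step-{index + 1}"
        job = {"step_id": step_id, "pyspark_job": {"main_python_file_uri": uri}}
        if previous_step is not None:
            job["prerequisite_step_ids"] = [previous_step]
        jobs.append(job)
        previous_step = step_id
    return {
        "placement": {
            # A managed cluster exists only for the lifetime of the workflow.
            "managed_cluster": {
                "cluster_name": cluster_name,
                # An empty zone is assumed to request automatic zone placement.
                "config": {"gce_cluster_config": {"zone_uri": ""}},
            }
        },
        "jobs": jobs,
    }


template = build_workflow_template(
    "nightly-batch",
    ["gs://example-bucket/jobs/extract.py", "gs://example-bucket/jobs/aggregate.py"],
)
print([job["step_id"] for job in template["jobs"]])
```

On EMR, a comparable effect comes from submitting steps with the cluster configured to shut down when no steps remain.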

Long-running clusters

Long-running clusters are required for use cases such as streaming jobs and interactive or ad hoc analysis using Jupyter notebooks and Zeppelin. Both EMR and Dataproc support long-running clusters. Furthermore, these clusters don’t have to stay the same size, as they can be scaled out and scaled in based on the workload. For instance, a cluster can run throughout the week to process real-time streaming data.

Kerberos Authentication

Kerberos is a network authentication protocol that uses secret-key cryptography to provide strong authentication for nodes in a network. It allows nodes communicating across an insecure network to prove their identities to one another in a secure manner, and it can also protect their communications against eavesdropping and replay attacks. Both EMR and Dataproc clusters can be configured with Kerberos authentication enabled.

Flexible Virtual Machine Instance Types

Not all workloads require the same kind of resources. Some workloads may be memory-intensive while others may be CPU-intensive. It’s imperative to choose a virtual machine type whose configuration satisfies the needs of the workload. Both Amazon and Google offer various instance families, each with a wide range of sizes. For example, Amazon’s m5 instance family is geared towards general-purpose computing. The smallest instance size in the m5 family (m5.large) has 2 vCPUs and 8 GiB of memory, while the largest instance size (m5.24xlarge) has 96 vCPUs and 384 GiB of memory. Similarly, the N1 machine type family is Google’s general-purpose machine type. The smallest standard instance size in the N1 family (n1-standard-1) has 1 vCPU and 3.75 GB of memory, while the largest (n1-standard-96) has 96 vCPUs and 360 GB of memory. Amazon and Google offer other instance-type families optimized for various use cases such as compute-intensive, memory-intensive, and I/O-intensive workloads. With this wide range of instance families and sizes, EMR and Dataproc clusters can be created with the virtual machine instance type best suited to the workload.
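Choosing an instance type largely comes down to matching vCPU and memory requirements against a catalog. The sketch below does this for the four sizes quoted above; the smallest_fit helper is hypothetical, and a real selection would cover the full catalog and also weigh price, network, and disk. Memory is kept in the units each provider quotes (GiB for m5, GB for n1), so the figures are not strictly comparable across families.

```python
# Only the four sizes mentioned in the text: name -> (vCPUs, memory).
INSTANCE_TYPES = {
    "m5.large": (2, 8.0),
    "m5.24xlarge": (96, 384.0),
    "n1-standard-1": (1, 3.75),
    "n1-standard-96": (96, 360.0),
}


def smallest_fit(family_prefix, min_vcpus, min_memory):
    """Return the smallest listed type in the family that meets both minimums."""
    candidates = [
        (vcpus, memory, name)
        for name, (vcpus, memory) in INSTANCE_TYPES.items()
        if name.startswith(family_prefix)
        and vcpus >= min_vcpus
        and memory >= min_memory
    ]
    # Tuples sort by vCPUs first, then memory, so min() picks the smallest fit.
    return min(candidates)[2] if candidates else None


print(smallest_fit("m5", 2, 4))    # m5.large
print(smallest_fit("n1", 4, 16))   # n1-standard-96
print(smallest_fit("m5", 128, 1))  # None
```

The second result jumps straight to the largest size only because the intermediate sizes are not in this tiny table.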

Custom Virtual Machine Images

Virtual machine images are templates that contain the information required to launch a virtual machine instance, such as the operating system, software, and users. Amazon’s virtual machine images are called Amazon Machine Images (AMIs), while user-built images on Google Cloud are called Custom Images. Both EMR and Dataproc clusters can be provisioned with custom virtual machine images: EMR clusters can be launched with a custom AMI for the instances, while Dataproc clusters can be launched with a Custom Image. With these images, applications can be pre-installed on the cluster nodes and other customizations can be applied.

Cost-effective Instances with Interruptions

Cloud providers sell their unused compute capacity at steep discounts, up to 80% cheaper than regular instances. The catch is that the provider can reclaim these instances at any time. These instances are well suited to optional tasks, batch jobs, and fault-tolerant workloads. AWS and Google offer these cost-effective virtual machine instances as EC2 Spot Instances and Preemptible Virtual Machine instances respectively. An EMR cluster can be launched with a combination of Spot and On-Demand instances, while a Dataproc cluster can be launched with a combination of preemptible and regular instances. One use case is to run clusters with Spot or preemptible instances for fault-tolerant, time-insensitive batch processing of big data workloads.
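To see what mixing the two kinds of instances does to the bill, here is a small worked calculation. The hourly price and node counts are hypothetical, and the 80% figure is used as the best-case discount mentioned above; actual Spot and preemptible prices vary by instance type, region, and time.

```python
# Hypothetical prices and node counts; illustrates the blended-cost arithmetic.

def hourly_cost(regular_nodes, discounted_nodes, regular_price, discount):
    """Hourly cost of a cluster mixing regular and discounted instances."""
    discounted_price = regular_price * (1 - discount)
    return regular_nodes * regular_price + discounted_nodes * discounted_price


REGULAR_PRICE = 0.25  # assumed price per regular node-hour
DISCOUNT = 0.80       # best case quoted in the text

all_regular = hourly_cost(10, 0, REGULAR_PRICE, DISCOUNT)
mixed = hourly_cost(2, 8, REGULAR_PRICE, DISCOUNT)
savings_percent = 100 * (all_regular - mixed) / all_regular

print(f"{all_regular:.2f} {mixed:.2f} {savings_percent:.0f}%")  # 2.50 0.90 64%
```

Keeping a couple of regular nodes in the mix is a common way to make sure the cluster survives when the discounted nodes are reclaimed.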

Instances with Block Storage

By default, storage is attached to virtual machine instances separately, over the network (for example, Amazon EBS volumes or Google Persistent Disks). In this case, the latency between the instance and its storage is higher. For latency-sensitive or high-throughput workloads, it’s vital to have the storage disks directly attached to the virtual machine instances. Amazon’s Instance Store and Google’s Local SSD provide disks that are physically attached to the host running the virtual machine instance. EMR clusters can be created with instance types that include Instance Store volumes, and Cloud Dataproc clusters can be created with Local SSDs. For example, clusters whose instances have directly attached storage are well suited to I/O-intensive workloads that access HDFS.

Final Words

These are just some of the most important common features offered by Amazon EMR and Google Cloud Dataproc. These features should be leveraged to make the clusters cost-effective, efficient, and secure. There are many more common features as these services have similar service models. In addition, each of these two services offers exclusive features that are not offered by the other service.