Provisioners in Cloud Data Fusion

A provisioner is responsible for creating and tearing down the cloud cluster where the pipeline runs. Different provisioners create different types of clusters on various clouds.

Each provisioner exposes a set of configuration settings that control the type of cluster created for a run. For example, the Dataproc and Amazon EMR provisioners have cluster size settings. Provisioners also have settings for the credentials needed to communicate with their respective clouds and provision the required compute nodes.
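To make this concrete, the following sketch shows how a Dataproc compute profile's settings could be represented as plain key-value properties and overridden for a single run. It is a minimal illustration, not an official API: the property names (masterNumNodes, workerNumNodes, serviceAccountKey), the provisioner identifier, and the system.profile.properties. runtime-argument prefix are assumptions and may differ in your instance.

```python
# Minimal sketch (assumed names, not an official API): the shape of a
# Dataproc compute profile's provisioner settings as key-value properties,
# and a helper that turns them into per-run overrides.

dataproc_profile = {
    "name": "small-dataproc",
    "provisioner": "gcp-dataproc",  # assumed provisioner identifier
    "properties": {
        "projectId": "my-project",           # project that hosts the ephemeral cluster
        "region": "us-central1",
        "masterNumNodes": "1",               # cluster size settings
        "workerNumNodes": "4",
        "serviceAccountKey": "auto-detect",  # credentials used to provision compute nodes
    },
}


def as_runtime_overrides(properties: dict) -> dict:
    """Prefix each profile property so it can be supplied as a per-run
    runtime argument instead of editing the saved profile."""
    return {f"system.profile.properties.{key}": value
            for key, value in properties.items()}


print(as_runtime_overrides(dataproc_profile["properties"]))
```

Passing overrides this way keeps the saved profile unchanged and lets individual runs request larger or smaller clusters without creating a new profile.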

Supported provisioners in Cloud Data Fusion

Cloud Data Fusion supports the following provisioners:

Dataproc
A fast, easy-to-use, and fully managed cloud service for running Apache Spark and Apache Hadoop clusters.
Amazon Elastic MapReduce (EMR)
Provides a managed Hadoop framework that processes vast amounts of data across dynamically scalable Amazon EC2 instances.
Remote Hadoop
Runs jobs on a pre-existing Hadoop cluster, either on-premises or in the cloud.
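Which of these provisioners a pipeline uses is determined by the compute profile the run executes with, so switching from an ephemeral Dataproc cluster to an existing Hadoop cluster is a matter of selecting a different profile. The sketch below is illustrative only: the profile names, provisioner identifiers, the USER: scope prefix, and the system.profile.name runtime argument are assumptions and should be checked against your instance.

```python
# Minimal sketch (assumed identifiers): mapping compute profiles to the
# provisioners listed above, and selecting one for a single pipeline run.

PROFILES = {
    "ephemeral-dataproc": "gcp-dataproc",  # ephemeral Dataproc cluster per run
    "emr-cluster": "aws-emr",              # dynamically scalable Amazon EMR cluster
    "existing-hadoop": "remote-hadoop",    # pre-existing on-premises or cloud cluster
}


def select_profile(profile_name: str) -> dict:
    """Return the runtime argument that points one pipeline run at the
    cluster type provisioned by the chosen profile."""
    if profile_name not in PROFILES:
        raise ValueError(f"unknown profile: {profile_name}")
    return {"system.profile.name": f"USER:{profile_name}"}


# Example: run on a pre-existing Hadoop cluster instead of an ephemeral
# Dataproc cluster.
print(select_profile("existing-hadoop"))
```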