This page shows you how to set up a multi-regional Dataproc Metastore service. For more information about how multi-regional Dataproc Metastore services work, see Dataproc Metastore regions.
Before you begin
- Enable Dataproc Metastore in your project.
- Understand networking requirements specific to your project.
- Learn about Dataproc Metastore regions and choose an appropriate region.
Required roles
To get the permission that you need to create a multi-regional Dataproc Metastore service, ask your administrator to grant you the following IAM roles on your project, based on the principle of least privilege:

- Grant full control of Dataproc Metastore resources (roles/metastore.editor)
- Grant full access to all Dataproc Metastore resources, including IAM policy administration (roles/metastore.admin)

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the metastore.services.create permission, which is required to create a multi-regional Dataproc Metastore service.

You might also be able to get this permission with custom roles or other predefined roles.

For more information about specific Dataproc Metastore roles and permissions, see Manage access with IAM.

About multi-regional Dataproc Metastore services
Multi-regional Dataproc Metastore services store your data in two different regions and use the two regions to run your workloads. For example, the nam7 multi-region contains the us-central1 and us-east4 regions.
A multi-regional Dataproc Metastore service replicates metadata across two regions and exposes the relevant endpoints to access the Hive Metastore. For gRPC, one endpoint per region is exposed. For Thrift, one endpoint per subnetwork is exposed.
A multi-regional Dataproc Metastore service provides an active-active high availability (HA) cluster configuration. This configuration means that workloads can access either region when running jobs. It also provides a failover mechanism for your service. For example, if your primary regional endpoint goes down, your workloads are automatically routed to the secondary region. This helps prevent disruptions to your Dataproc jobs.
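If you want to confirm which endpoints your service exposes after you create it, you can describe the service with the gcloud CLI. The following is a minimal sketch; SERVICE and MULTI_REGION are placeholders for your own service name and multi-region, and the exact output fields depend on the endpoint protocol you chose:

# Display the service details, including its exposed endpoint information.
gcloud metastore services describe SERVICE \
    --location=MULTI_REGION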
Considerations
The following considerations apply to multi-regional Dataproc Metastore services.
Multi-regional services only support the Spanner database type. Check the supported feature list before you create your multi-regional service.
Multi-regional services only support Dataproc Metastore 2 configurations.
Multi-regional services create artifact buckets in a Cloud Storage multi-region location. For example, nam7 buckets are created in the US multi-region location.
Create a multi-regional Dataproc Metastore service
Choose one of the following tabs to learn how to create a multi-regional service using either the Thrift or gRPC endpoint protocol, with a Dataproc Metastore 2 service.
gRPC
When creating a multi-regional service that uses the gRPC endpoint protocol, you don't have to set any specific network settings. The gRPC protocol handles the network routing for you.
Console
In the Google Cloud console, go to the Dataproc Metastore page.
In the navigation bar, click +Create.
The Create Metastore service dialog opens.
Select Dataproc Metastore 2.
In the Pricing and Capacity section, select Enterprise Plus - Dual region.
For the Endpoint protocol, select gRPC.
To create and start the service, click Submit.
Your new metastore service appears on the Dataproc Metastore page. The status displays Creating until the service is ready to use. When it's ready, the status changes to Active. Provisioning the service might take a few minutes.
gcloud CLI
To create a multi-regional Dataproc Metastore service, run the following gcloud metastore services create command. This command creates Dataproc Metastore version 3.1.2.

gcloud metastore services create SERVICE \
    --location=MULTI_REGION \
    { --instance-size=INSTANCE_SIZE | --scaling-factor=SCALING_FACTOR } \
    --endpoint-protocol=grpc
Replace the following:

- SERVICE: the name of your Dataproc Metastore service.
- MULTI_REGION: the multi-region that you're creating your Dataproc Metastore service in.
- INSTANCE_SIZE: the instance size of your multi-regional Dataproc Metastore. For example, small, medium, or large. If you specify a value for INSTANCE_SIZE, don't specify a value for SCALING_FACTOR.
- SCALING_FACTOR: the scaling factor of your Dataproc Metastore service. For example, 0.1. If you specify a value for SCALING_FACTOR, don't specify a value for INSTANCE_SIZE.
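For example, a hypothetical invocation with sample values filled in (the service name my-grpc-metastore and the nam7 multi-region are placeholders, not recommendations) might look like the following:

# Create a medium-sized multi-regional service in nam7 with the gRPC protocol.
gcloud metastore services create my-grpc-metastore \
    --location=nam7 \
    --instance-size=medium \
    --endpoint-protocol=grpc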
Thrift
When creating a multi-regional service that uses the Thrift endpoint protocol, you must set the appropriate subnetwork settings. In this case, for each VPC network you are using, you must provide at least one subnetwork from each region.
For example, to create a service in the nam7 multi-region, you must provide subnetworks in both the us-central1 and us-east4 regions.
Console
In the Google Cloud console, go to the Dataproc Metastore page.
In the navigation bar, click +Create.
The Create Metastore service dialog opens.
Select Dataproc Metastore 2.
In the Pricing and Capacity section, select Enterprise Plus - Dual region.
For more information, see pricing plans and scaling configurations.
In the Service name field, enter a unique name for your service.
For information on naming conventions, see Resource naming convention.
For the Endpoint protocol, select Thrift.
For Network Config, provide the subnetworks that form your chosen multi-regional configuration.
For the remaining service configuration options, use the provided defaults.
To create and start the service, click Submit.
Your new metastore service appears on the Dataproc Metastore page. The status displays Creating until the service is ready to use. When it's ready, the status changes to Active. Provisioning the service might take a few minutes.
gcloud CLI
To create a multi-regional Dataproc Metastore service, run the following gcloud metastore services create command. This command creates Dataproc Metastore version 3.1.2.

gcloud metastore services create SERVICE \
    --location=MULTI_REGION \
    --consumer-subnetworks="projects/PROJECT_ID/regions/LOCATION1/subnetworks/SUBNET1,projects/PROJECT_ID/regions/LOCATION2/subnetworks/SUBNET2" \
    { --instance-size=INSTANCE_SIZE | --scaling-factor=SCALING_FACTOR } \
    --endpoint-protocol=thrift
Or you can store your network settings in a file, as shown in the following command.

gcloud metastore services create SERVICE \
    --location=MULTI_REGION \
    --network-config-from-file=NETWORK_CONFIG_FROM_FILE \
    { --instance-size=INSTANCE_SIZE | --scaling-factor=SCALING_FACTOR } \
    --endpoint-protocol=thrift
Replace the following:

- SERVICE: the name of your Dataproc Metastore service.
- MULTI_REGION: the multi-region that you're creating your Dataproc Metastore service in.
- PROJECT_ID: the Google Cloud project ID that you're creating your Dataproc Metastore service in.
- SUBNET1, SUBNET2: a list of subnetworks that form a multi-regional configuration. You can use the ID, fully qualified URL, or relative name of the subnetwork. You can specify up to six subnetworks.
- LOCATION1, LOCATION2: a list of locations that form a multi-regional configuration. You can use the ID of the location. For example, for a nam7 multi-region, you use us-central1 and us-east4.
- NETWORK_CONFIG_FROM_FILE: the path to a YAML file containing your network configuration (an example file is sketched after this list).
- INSTANCE_SIZE: the instance size of your multi-regional Dataproc Metastore. For example, small, medium, or large. If you specify a value for INSTANCE_SIZE, don't specify a value for SCALING_FACTOR.
- SCALING_FACTOR: the scaling factor of your Dataproc Metastore service. For example, 0.1. If you specify a value for SCALING_FACTOR, don't specify a value for INSTANCE_SIZE.
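The contents of the network configuration file aren't shown in this guide. The following is a minimal sketch of what such a file might contain, assuming it mirrors the consumers list of the network_config object shown in the REST tab; the file name network-config.yaml and the subnetwork paths are placeholders:

# Write a network configuration file with one subnetwork per region of the multi-region.
cat > network-config.yaml <<EOF
consumers:
- subnetwork: projects/PROJECT_ID/regions/us-central1/subnetworks/SUBNET1
- subnetwork: projects/PROJECT_ID/regions/us-east4/subnetworks/SUBNET2
EOF

You would then pass the file path to the command with --network-config-from-file=network-config.yaml.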
REST
To learn how to create a multi-regional Dataproc Metastore service, follow the instructions to create a service by using the Google APIs Explorer.
To configure a multi-regional service, provide the following information in the network_config and scaling_config objects.

"network_config": {
  "consumers": [
    {"subnetwork": "projects/PROJECT_ID/regions/LOCATION/subnetworks/SUBNET1"},
    {"subnetwork": "projects/PROJECT_ID/regions/LOCATION/subnetworks/SUBNET2"}
  ]
},
"scaling_config": {
  "scaling_factor": SCALING_FACTOR
}
Replace the following:

- PROJECT_ID: the Google Cloud project ID of the project that contains your Dataproc Metastore service.
- LOCATION: the Google Cloud region that your Dataproc Metastore service resides in.
- SUBNET1, SUBNET2: a list of subnetworks that form a multi-regional configuration. You can use the ID, fully qualified URL, or relative name of the subnetwork. You can specify up to five subnetworks.
- SCALING_FACTOR: the scaling factor that you want to use for your service.
Connect Dataproc Metastore to a Dataproc cluster
Choose one of the following tabs to learn how to connect to a multi-regional Dataproc Metastore service from a Dataproc cluster.
gRPC
To connect a Dataproc cluster, choose the tab that corresponds with the version of Dataproc Metastore that you're using.
Dataproc Metastore 3.1.2
Create the following variables for your Dataproc cluster:
CLUSTER_NAME=CLUSTER_NAME
PROJECT_ID=PROJECT_ID
MULTI_REGION=MULTI_REGION
DATAPROC_IMAGE_VERSION=DATAPROC_IMAGE_VERSION
PROJECT=PROJECT
SERVICE_ID=SERVICE_ID
Replace the following:

- CLUSTER_NAME: the name of your Dataproc cluster.
- PROJECT_ID: the Google Cloud project that contains your Dataproc cluster. Make sure that the subnet you're using has the appropriate permissions to access this project.
- MULTI_REGION: the Google Cloud multi-region that you want to create your Dataproc cluster in.
- DATAPROC_IMAGE_VERSION: the Dataproc image version that you are using with your Dataproc Metastore service. You must use an image version of 2.0 or higher.
- PROJECT: the project that contains your Dataproc Metastore service.
- SERVICE_ID: the service ID of your Dataproc Metastore service.
To create your cluster, run the following gcloud dataproc clusters create command. The --enable-kerberos flag is optional. Only include this option if you are using Kerberos with your cluster.

gcloud dataproc clusters create ${CLUSTER_NAME} \
    --project ${PROJECT_ID} \
    --region ${MULTI_REGION} \
    --image-version ${DATAPROC_IMAGE_VERSION} \
    --scopes "https://www.googleapis.com/auth/cloud-platform" \
    --dataproc-metastore projects/${PROJECT}/locations/${MULTI_REGION}/services/${SERVICE_ID} \
    [ --enable-kerberos ]
Dataproc Metastore 2.3.6
Create the following variables for your Dataproc Metastore service:
METASTORE_PROJECT=METASTORE_PROJECT
METASTORE_ID=METASTORE_ID
MULTI_REGION=MULTI_REGION
SUBNET=SUBNET
Replace the following:

- METASTORE_PROJECT: the Google Cloud project that contains your Dataproc Metastore service.
- METASTORE_ID: the service ID of your Dataproc Metastore service.
- MULTI_REGION: the multi-region location that you want to use for your Dataproc Metastore service.
- SUBNET: one of the subnets that you're using for your Dataproc Metastore service, or any subnetwork in the parent VPC network of the subnetworks used for your service.
Create the following variables for your Dataproc cluster:

CLUSTER_NAME=CLUSTER_NAME
DATAPROC_PROJECT=DATAPROC_PROJECT
DATAPROC_REGION=DATAPROC_REGION
HIVE_VERSION=HIVE_VERSION
IMAGE_VERSION=IMAGE_VERSION

Replace the following:

- CLUSTER_NAME: the name of your Dataproc cluster.
- DATAPROC_PROJECT: the Google Cloud project that contains your Dataproc cluster. Make sure that the subnet you're using has the appropriate permissions to access this project.
- DATAPROC_REGION: the Google Cloud region that you want to create your Dataproc cluster in.
- HIVE_VERSION: the version of Hive that your Dataproc Metastore service uses.
- IMAGE_VERSION: the Dataproc image version you are using with your Dataproc Metastore service.
  - For Hive Metastore version 2.0, use image version 1.5.
  - For Hive Metastore version 3.1.2, use image version 2.0.
Retrieve the warehouse directory of your Dataproc Metastore service and store it in a variable.
WAREHOUSE_DIR=$(gcloud metastore services describe "${METASTORE_ID}" --project "${METASTORE_PROJECT}" --location "${MULTI_REGION}" --format="get(hiveMetastoreConfig.configOverrides[hive.metastore.warehouse.dir])")
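The cluster creation command in the next step references ${DPMS_NAME}, the full resource name of your Dataproc Metastore service, which isn't defined by the variables above. The following is a minimal sketch of constructing it, assuming the same projects/PROJECT/locations/LOCATION/services/SERVICE format used in the gRPC instructions:

# Build the full resource name of the Dataproc Metastore service.
DPMS_NAME=projects/${METASTORE_PROJECT}/locations/${MULTI_REGION}/services/${METASTORE_ID}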
Create a Dataproc cluster configured with a multi-regional Dataproc Metastore.
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --project "${DATAPROC_PROJECT}" \
    --region ${DATAPROC_REGION} \
    --scopes "https://www.googleapis.com/auth/cloud-platform" \
    --subnet "${SUBNET}" \
    --optional-components=DOCKER \
    --image-version ${IMAGE_VERSION} \
    --metadata "hive-version=${HIVE_VERSION},dpms-name=${DPMS_NAME}" \
    --properties "hive:hive.metastore.uris=thrift://localhost:9083,hive:hive.metastore.warehouse.dir=${WAREHOUSE_DIR}" \
    --initialization-actions gs://metastore-init-actions/mr-metastore-grpc-proxy/metastore-grpc-proxy.sh
Thrift
Option 1: Edit the hive-site.xml file
- Find the endpoint URI and warehouse directory of your Dataproc Metastore service. You can pick any one of the endpoints exposed.
- In the Google Cloud console, go to the VM Instances page.
- In the list of virtual machine instances, click SSH in the row of the Dataproc primary node (.*-m). A browser window opens in your home directory on the node.
- Open the /etc/hive/conf/hive-site.xml file:
  sudo vim /etc/hive/conf/hive-site.xml
You see output similar to the following:
<property>
  <name>hive.metastore.uris</name>
  <value>ENDPOINT_URI</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>WAREHOUSE_DIR</value>
</property>
Replace the following:

- ENDPOINT_URI: the endpoint URI of your Dataproc Metastore service.
- WAREHOUSE_DIR: the location of your Hive warehouse directory.
Restart HiveServer2:
sudo systemctl restart hive-server2.service
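After the restart, you might want to confirm that the cluster can reach the metastore. The following is a minimal, hypothetical check run from the primary node, assuming the Hive CLI is available on the Dataproc image; it simply lists databases through the configured metastore:

# List databases through the newly configured Hive Metastore endpoint.
hive -e "SHOW DATABASES;"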
Option 2: Use the gcloud CLI
- Find the endpoint URI and warehouse directory of your Dataproc Metastore service. You can pick any one of the endpoints exposed.
- Run the following gcloud dataproc clusters create command:

gcloud dataproc clusters create CLUSTER_NAME \
    --network NETWORK \
    --project PROJECT_ID \
    --scopes "https://www.googleapis.com/auth/cloud-platform" \
    --image-version IMAGE_VERSION \
    --properties "hive:hive.metastore.uris=ENDPOINT,hive:hive.metastore.warehouse.dir=WAREHOUSE_DIR"
Replace the following:

- CLUSTER_NAME: the name of your Dataproc cluster.
- NETWORK: the VPC network that contains your Dataproc cluster. Make sure that the subnet you're using has the appropriate permissions to access your Dataproc Metastore service.
- PROJECT_ID: the Google Cloud project that contains your Dataproc cluster.
- IMAGE_VERSION: the Dataproc image version you are using with your Dataproc Metastore service.
  - For Hive Metastore version 2.0, use image version 1.5.
  - For Hive Metastore version 3.1.2, use image version 2.0.
- ENDPOINT: the Thrift endpoint that your Dataproc Metastore service uses.
- WAREHOUSE_DIR: the warehouse directory of your Dataproc Metastore service.
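For example, a hypothetical invocation with sample values filled in (the cluster name, project, network, endpoint host, and warehouse bucket are placeholders for illustration; this sketch also adds --region, which recent gcloud versions require unless a default Dataproc region is configured):

# Create a cluster whose Hive configuration points at the metastore's Thrift endpoint.
gcloud dataproc clusters create my-cluster \
    --region us-central1 \
    --network default \
    --project my-project \
    --scopes "https://www.googleapis.com/auth/cloud-platform" \
    --image-version 2.0 \
    --properties "hive:hive.metastore.uris=thrift://metastore-host:9083,hive:hive.metastore.warehouse.dir=gs://my-warehouse-bucket/hive-warehouse"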
Custom region configurations
You can configure Dataproc Metastore services to use a custom region configuration.
A custom region configuration lets your service run workloads from two separate regions. This provides redundancy across regions, meaning that workloads can access either region when running jobs. It also provides a failover mechanism for your service. For example, if one of the regional endpoints goes down, your workloads are automatically routed to the other region. This helps prevent disruptions to your workloads and jobs.
Custom region configurations also let you control where you are storing metadata and where to expose your Hive Metastore endpoints. This can improve performance when processing workloads.
Considerations
The following considerations apply to Dataproc Metastore services configured with a custom region configuration:
- Region/Pairing Restrictions: Not all regions and combinations are allowed.
- Read-Only Limitations: Read-only regions cannot accept write operations. If a read-only region is chosen and the read-write region is unreachable, then the write fails to process.
- Configuration Immutability: Once set, the region configuration cannot be changed.
- US Stack Only: Custom dual regions only support the US stack and are limited to the US boundary.
Create a custom region service
To set up a custom region, choose two adjacent regions when you create your service. This combination can be either two read-write regions or one read-write and one read-only region.
Console
In the Google Cloud console, go to the Dataproc Metastore page.
In the navigation bar, click +Create.
The Create Metastore service dialog opens.
Select Dataproc Metastore 2.
In the Pricing and Capacity section, select Enterprise Plus - Dual region.
In the Service name field, enter a unique name for your service.
For data location, select US (continent).
The Custom regions section appears.
Under Custom regions, select a Read-write region and a Read-only region.
For the remaining service configuration options, use the provided defaults.
To create and start the service, click Submit.
Your new metastore service appears on the Dataproc Metastore page. The status displays Creating until the service is ready to use. When it's ready, the status changes to Active. Provisioning the service might take a few minutes.
gcloud
To create a Dataproc Metastore service with custom regions, run the following gcloud beta metastore services create command.

gcloud beta metastore services create SERVICE \
    --read-write-regions=READ_WRITE_REGIONS \
    --read-only-regions=READ_ONLY_REGIONS
Replace the following:

- SERVICE: the name of your Dataproc Metastore service.
- READ_WRITE_REGIONS: a supported read-write region that is part of your custom region configuration.
- READ_ONLY_REGIONS: a supported read-only region that is part of your custom region configuration.
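For example, a hypothetical invocation that pairs one read-write region with one read-only region (the region names below are placeholders only; whether a specific pairing is allowed depends on the region and pairing restrictions noted earlier):

# Create a custom dual-region service with one read-write and one read-only region.
gcloud beta metastore services create my-custom-region-service \
    --read-write-regions=us-central1 \
    --read-only-regions=us-east1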