The Cloud Storage connector open source Java library lets you run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage.
Benefits of the Cloud Storage connector
- Direct data access: Store your data in Cloud Storage and access it directly. You don't need to transfer it into HDFS first.
- HDFS compatibility: You can access your data in Cloud Storage using the gs:// prefix instead of hdfs:// (see the example after this list).
- Interoperability: Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and Google services.
- Data accessibility: When you shut down a Hadoop cluster, unlike HDFS, you continue to have access to your data in Cloud Storage.
- High data availability: Data stored in Cloud Storage is highly available and globally replicated without a loss of performance.
- No storage management overhead: Unlike HDFS, Cloud Storage requires no routine maintenance, such as checking the file system, or upgrading or rolling back to a previous version of the file system.
- Quick startup: In HDFS, a MapReduce job can't start until the NameNode is out of safe mode, a process that can take from a few seconds to many minutes depending on the size and state of your data. With Cloud Storage, you can start your job as soon as the task nodes start, which leads to significant cost savings over time.
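For example, because the connector presents Cloud Storage as a Hadoop-compatible file system, changing the path is typically all that is needed to move a command from HDFS to Cloud Storage. The bucket and directory names below are placeholders.

    # List job input on the cluster's HDFS file system.
    hadoop fs -ls hdfs:///user/example/input/

    # List the equivalent data in Cloud Storage; only the scheme and path change.
    hadoop fs -ls gs://example-bucket/input/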
Connector setup on Dataproc clusters
The Cloud Storage connector is installed by default on all Dataproc cluster nodes in the /usr/local/share/google/dataproc/lib/ directory. The following subsections describe steps you can take to complete connector setup on Dataproc clusters.
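For example, to check which connector version your image ships, you can list the preinstalled jar on a cluster node; this sketch assumes SSH access to a node, and the exact jar file name varies by image version.

    # On a Dataproc cluster node, show the preinstalled connector jar.
    ls -l /usr/local/share/google/dataproc/lib/gcs-connector*.jar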
VM service account
When running the connector on Dataproc cluster nodes and other Compute Engine VMs, the google.cloud.auth.service.account.enable property is set to false by default, which means you don't need to configure the VM service account credentials for the connector; VM service account credentials are provided by the VM metadata server.
The Dataproc VM service account must have permission to access your Cloud Storage bucket.
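One way to grant that permission is a bucket-level IAM binding, sketched below; the bucket name and service account email are placeholders, and your organization may prefer a different role or scope.

    # Grant the Dataproc VM service account access to objects in the bucket.
    gcloud storage buckets add-iam-policy-binding gs://example-bucket \
        --member="serviceAccount:VM_SERVICE_ACCOUNT_EMAIL" \
        --role="roles/storage.objectUser"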
User-selected connector versions
The default Cloud Storage connector versions used in the latest images installed on Dataproc clusters are listed in the image version pages. If your application depends on a non-default connector version deployed on your cluster, you can perform one of the following actions to use your selected connector version:
- Create a cluster with the --metadata=GCS_CONNECTOR_VERSION=x.y.z flag, which updates the connector used by applications running on the cluster to the specified connector version (see the example command after this list).
- Include and relocate the connector classes and connector dependencies for the version you are using into your application's jar. Relocation is necessary to avoid a conflict between your deployed connector version and the default connector version installed on the Dataproc cluster. Also see the Maven dependencies relocation example.
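For example, a cluster creation command that pins the connector version with this metadata flag might look like the following; the cluster name, region, and version number are placeholders.

    # Create a cluster that deploys a specific Cloud Storage connector version.
    gcloud dataproc clusters create example-cluster \
        --region=REGION \
        --metadata=GCS_CONNECTOR_VERSION=x.y.z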
Connector setup on non-Dataproc clusters
You can take the following steps to set up the Cloud Storage connector on a non-Dataproc cluster, such as an Apache Hadoop or Spark cluster that you use to move on-premises HDFS data to Cloud Storage.
Download the connector.
To download the Cloud Storage connector:
- To use a latest version located in the Cloud Storage bucket (using a latest version is not recommended for production applications):
- To use a specific version from your Cloud Storage bucket, substitute the Hadoop and Cloud Storage connector versions in the gcs-connector-HADOOP_VERSION-CONNECTOR_VERSION.jar name pattern, for example, gs://hadoop-lib/gcs/gcs-connector-hadoop2-2.1.1.jar (see the example command after this list).
- To use a specific version from the Apache Maven repository, download a shaded jar that has the -shaded suffix in its name.
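For example, assuming you want the specific Hadoop 2 build named above, you could copy it from the Cloud Storage bucket to your working directory:

    # Download a specific connector version from the Cloud Storage bucket.
    gcloud storage cp gs://hadoop-lib/gcs/gcs-connector-hadoop2-2.1.1.jar .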
Install the connector.
Follow the GitHub instructions to install, configure, and test the Cloud Storage connector.
Connector usage
You can use the connector to access Cloud Storage data in the following ways:
- In a Spark, PySpark, or Hadoop application with the gs:// prefix
- In a Hadoop shell with hadoop fs -ls gs://bucket/dir/file
- In the Cloud Storage browser in the Google Cloud console
- Using Google Cloud SDK commands, such as:
  - gcloud storage cp
  - gcloud storage rsync
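As a brief illustration, the following commands combine two of the access methods above on the same object; the bucket name and file are placeholders.

    # Upload a local file to Cloud Storage with the Google Cloud SDK.
    gcloud storage cp ./report.csv gs://example-bucket/data/report.csv

    # List the same object through the connector from a Hadoop shell.
    hadoop fs -ls gs://example-bucket/data/report.csv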
Java usage
The Cloud Storage connector requires Java 8.
The following is a sample Maven POM dependency management section for the Cloud Storage connector. For additional information, see Dependency Management.
    <dependency>
        <groupId>com.google.cloud.bigdataoss</groupId>
        <artifactId>gcs-connector</artifactId>
        <version>hadoopX-X.X.X</version> <!-- CONNECTOR VERSION -->
        <scope>provided</scope>
    </dependency>
For a shaded version:
    <dependency>
        <groupId>com.google.cloud.bigdataoss</groupId>
        <artifactId>gcs-connector</artifactId>
        <version>hadoopX-X.X.X</version> <!-- CONNECTOR VERSION -->
        <scope>provided</scope>
        <classifier>shaded</classifier>
    </dependency>
Connector support
The Cloud Storage connector is supported by Google Cloud for use with Google Cloud products and use cases. When used with Dataproc, it is supported at the same level as Dataproc. For more information, see Get support.
Connect to Cloud Storage using gRPC
By default, the Cloud Storage connector on Dataproc uses the Cloud Storage JSON API. This section shows you how to enable the Cloud Storage connector to use gRPC.
Usage considerations
Using the Cloud Storage connector with gRPC includes the following considerations:
- Regional bucket location: gRPC can improve read latencies only when Compute Engine VMs and Cloud Storage buckets are located in the same Compute Engine region.
- Read-intensive jobs: gRPC can offer improved read latencies for long-running reads, and can help read-intensive workloads. It is not recommended for applications that create a gRPC channel, run a short computation, and then close the channel.
- Unauthenticated requests: gRPC does not support unauthenticated requests.
Requirements
The following requirements apply when using gRPC with the Cloud Storage connector:
Your Dataproc cluster VPC network must support direct connectivity. This means that the network's routes and firewall rules must allow egress traffic to reach 34.126.0.0/18 and 2001:4860:8040::/42.
- If your Dataproc cluster uses IPv6 networking, you must set up an IPv6 subnet for VM instances. For more information, see Configuring IPv6 for instances and instance templates.
When creating a Dataproc cluster, you must use Cloud Storage connector version 2.2.23 or later with image version 2.1.56+, or Cloud Storage connector version v3.0.0 or later with image version 2.2.0+. The Cloud Storage connector version installed on each Dataproc image version is listed in the Dataproc image version pages.
- If you create and use a Dataproc on GKE virtual cluster for your gRPC Cloud Storage requests, GKE version 1.28.5-gke.1199000 with gke-metadata-server 0.4.285 is recommended. This combination supports direct connectivity.
You or your organization administrator must grant Identity and Access Management roles that include the permissions necessary to set up and make gRPC requests to the Cloud Storage connector. These roles can include the following:
- User role: Dataproc Editor role granted to users to allow them to create clusters and submit jobs
- Service account role: Storage Object User role granted to the Dataproc VM service account to allow applications running on cluster VMs to view, read, create, and write Cloud Storage objects.
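For example, an administrator might grant these roles with IAM policy bindings like the following sketch; the project ID, user email, and service account email are placeholders, and your organization's role choices can differ.

    # Allow a user to create Dataproc clusters and submit jobs.
    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="user:USER_EMAIL" \
        --role="roles/dataproc.editor"

    # Allow the Dataproc VM service account to view, read, create, and write objects.
    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="serviceAccount:VM_SERVICE_ACCOUNT_EMAIL" \
        --role="roles/storage.objectUser"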
Enable gRPC on the Cloud Storage connector
You can enable gRPC on the Cloud Storage connector at the cluster or job level. Once enabled at the cluster level, Cloud Storage connector read requests use gRPC. If enabled on a job instead of at the cluster level, Cloud Storage connector read requests use gRPC for that job only.
Enable a cluster
To enable gRPC on the Cloud Storage connector at the cluster level, set the core:fs.gs.client.type=STORAGE_CLIENT property when you create a Dataproc cluster. Once gRPC is enabled at the cluster level, Cloud Storage connector read requests made by jobs running on the cluster use gRPC.
gcloud CLI example:
    gcloud dataproc clusters create CLUSTER_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --properties=core:fs.gs.client.type=STORAGE_CLIENT
Replace the following:
- CLUSTER_NAME: Specify a name for your cluster.
- PROJECT_ID: The project ID of the project where the cluster is located. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
- REGION: Specify a Compute Engine region where the cluster will be located.
Enable a job
To enable gRPC on the Cloud Storage connector for a specific job, include --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT when you submit a job.
Example: Run a job on an existing cluster that uses gRPC to read from Cloud Storage.
Create a local /tmp/line-count.py PySpark script that uses gRPC to read a Cloud Storage text file and output the number of lines in the file.

    cat <<EOF >"/tmp/line-count.py"
    #!/usr/bin/python
    import sys
    from pyspark.sql import SparkSession
    path = sys.argv[1]
    spark = SparkSession.builder.getOrCreate()
    rdd = spark.read.text(path)
    lines_counter = rdd.count()
    print("There are {} lines in file: {}".format(lines_counter,path))
    EOF
Create a local /tmp/line-count-sample.txt text file.

    cat <<EOF >"/tmp/line-count-sample.txt"
    Line 1
    Line 2
    line 3
    EOF
Upload local /tmp/line-count.py and /tmp/line-count-sample.txt to your bucket in Cloud Storage.

    gcloud storage cp /tmp/line-count* gs://BUCKET
Run the line-count.py job on your cluster. Set --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT to enable gRPC for Cloud Storage connector read requests.

    gcloud dataproc jobs submit pyspark gs://BUCKET/line-count.py \
        --cluster=CLUSTER_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT \
        -- gs://BUCKET/line-count-sample.txt
Replace the following:
- CLUSTER_NAME: The name of an existing cluster.
- PROJECT_ID: Your project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
- REGION: The Compute Engine region where the cluster is located.
- BUCKET: Your Cloud Storage bucket.
Generate gRPC client-side metrics
You can configure the Cloud Storage connector to generate gRPC-related metrics in Cloud Monitoring. The gRPC-related metrics can help you to do the following:
- Monitor and optimize the performance of gRPC requests to Cloud Storage
- Troubleshoot and debug issues
- Gain insights into application usage and behavior
For information about how to configure the Cloud Storage connector to generate gRPC-related metrics, see Use gRPC client-side metrics.
Resources
- See Connect to Cloud Storage using gRPC to use the Cloud Storage connector with client libraries, VPC Service Controls, and other scenarios.
- Learn more about Cloud Storage.
- See Use the Cloud Storage connector with Apache Spark.
- Understand the Apache Hadoop file system.
- View the Javadoc reference.