Cloud Storage connector

The Cloud Storage connector is an open source Java library that lets you run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage.

Benefits of the Cloud Storage connector

  • Direct data access: Store your data in Cloud Storage and access it directly. You don't need to transfer it into HDFS first.
  • HDFS compatibility: You can access your data in Cloud Storage using the gs:// prefix instead of hdfs:// (see the example after this list).
  • Interoperability: Storing data in Cloud Storage enables seamless interoperability between Spark, Hadoop, and Google services.
  • Data accessibility: Unlike HDFS, your data in Cloud Storage remains accessible after you shut down a Hadoop cluster.
  • High data availability: Data stored in Cloud Storage is highly available and globally replicated without a loss of performance.
  • No storage management overhead: Unlike HDFS, Cloud Storage requires no routine maintenance, such as checking the file system, or upgrading or rolling back to a previous version of the file system.
  • Quick startup: In HDFS, a MapReduce job can't start until the NameNode is out of safe mode, a process that can take from a few seconds to many minutes depending on the size and state of your data. With Cloud Storage, you can start your job as soon as the task nodes start, which leads to significant cost savings over time.
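
For example, once the connector is installed, standard Hadoop tooling can address Cloud Storage objects by their gs:// paths. The bucket, directory, and file names below are placeholders:

# List objects in a Cloud Storage bucket with the Hadoop filesystem shell.
hadoop fs -ls gs://BUCKET/dir/

# Copy a file from HDFS into Cloud Storage.
hadoop fs -cp hdfs:///user/example/data.csv gs://BUCKET/dir/data.csv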

Connector setup on Dataproc clusters

The Cloud Storage connector is installed by default on all Dataproc cluster nodes in the /usr/local/share/google/dataproc/lib/ directory. The following subsections describe steps you can take to complete connector setup on Dataproc clusters.
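
For example, you can confirm the installed connector on a cluster node; the exact jar file name varies by connector and image version:

# List the connector jar installed by default on Dataproc cluster nodes.
ls -l /usr/local/share/google/dataproc/lib/gcs-connector*.jar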

VM service account

When running the connector on Dataproc cluster nodes and other Compute Engine VMs, the google.cloud.auth.service.account.enable property is set to false by default, which means you don't need to configure the VM service account credentials for the connector—VM service account credentials are provided by the VM metadata server.

The Dataproc VM service account must have permission to access your Cloud Storage bucket.
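
For example, one way to grant that access is a bucket-level IAM binding like the following sketch; the bucket name and service account email are placeholders, and your organization may prefer a narrower role:

# Grant the Dataproc VM service account object read/write access to a bucket.
gcloud storage buckets add-iam-policy-binding gs://BUCKET \
    --member=serviceAccount:SERVICE_ACCOUNT_EMAIL \
    --role=roles/storage.objectAdmin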

User-selected connector versions

The default Cloud Storage connector versions used in the latest images installed on Dataproc clusters are listed in the image version pages. If your application depends on a non-default connector version deployed on your cluster, you can perform one of the following actions to use your selected connector version:

  • Create a cluster with the --metadata=GCS_CONNECTOR_VERSION=x.y.z flag, which updates the connector used by applications running on the cluster to the specified connector version (see the example after this list).
  • Include and relocate the connector classes and connector dependencies for the version you are using into your application's jar. Relocation is necessary to avoid a conflict between your deployed connector version and the default connector version installed on the Dataproc cluster. Also see the Maven dependencies relocation example.
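
For example, a cluster-creation command that pins the connector version might look like the following; the cluster name, region, and version are placeholders:

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --metadata=GCS_CONNECTOR_VERSION=x.y.z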

Connector setup on non-Dataproc clusters

You can take the following steps to set up the Cloud Storage connector on a non-Dataproc cluster, such as an Apache Hadoop or Spark cluster that you use to move on-premises HDFS data to Cloud Storage (see the DistCp sketch after these steps).

  1. Download the connector.

  2. Install the connector.

    Follow the GitHub instructions to install, configure, and test the Cloud Storage connector.
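
After the connector is installed, configured, and tested, one common way to move on-premises HDFS data into Cloud Storage is Hadoop DistCp. This is a minimal sketch that assumes a working Hadoop client with the connector configured; the NameNode host, source directory, and bucket name are placeholders:

# Copy a directory tree from on-premises HDFS to Cloud Storage.
hadoop distcp hdfs://NAMENODE/SOURCE_DIR gs://BUCKET/SOURCE_DIR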

Connector usage

You can use the connector to access Cloud Storage data from your Hadoop and Spark applications by referencing files with the gs:// prefix. The following section describes how to declare the connector as a dependency in a Java application.

Java usage

The Cloud Storage connector requires Java 8.

The following is a sample Maven POM dependency management section for the Cloud Storage connector. For additional information, see Dependency Management.

<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoopX-X.X.X</version> <!-- Cloud Storage connector version -->
    <scope>provided</scope>
</dependency>

For a shaded version:

<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoopX-X.X.X</version> <!-- Cloud Storage connector version -->
    <scope>provided</scope>
    <classifier>shaded</classifier>
</dependency>
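
Because the dependency is declared with provided scope, the connector jar must be available on the runtime classpath. One way to supply it when running outside Dataproc is to pass the jar at submit time; the jar path, application class, application jar, and input path below are placeholder examples:

# Supply the shaded connector jar to a Spark application at submit time.
spark-submit \
    --jars /path/to/gcs-connector-hadoopX-X.X.X-shaded.jar \
    --class com.example.WordCount \
    my-spark-app.jar gs://BUCKET/input/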

Connector support

The Cloud Storage connector is supported by Google Cloud for use with Google Cloud products and use cases. When used with Dataproc, it is supported at the same level as Dataproc. For more information, see Get support.

Connect to Cloud Storage using gRPC

By default, the Cloud Storage connector on Dataproc uses the Cloud Storage JSON API. This section shows you how to enable the Cloud Storage connector to use gRPC.

Usage considerations

Using the Cloud Storage connector with gRPC includes the following considerations:

  • Regional bucket location: gRPC can improve read latencies only when Compute Engine VMs and Cloud Storage buckets are located in the same Compute Engine region.
  • Read-intensive jobs: gRPC can offer improved read latencies for long-running reads, and can help read-intensive workloads. It is not recommended for applications that create a gRPC channel, run a short computation, and then close the channel.
  • Unauthenticated requests: gRPC does not support unauthenticated requests.

Requirements

The following requirements apply when using gRPC with the Cloud Storage connector:

  • Your Dataproc cluster VPC network must support direct connectivity. This means that the network's routes and firewall rules must allow egress traffic to reach 34.126.0.0/18 and 2001:4860:8040::/42.

  • When creating a Dataproc cluster, you must use Cloud Storage connector version 2.2.23 or later with image version 2.1.56 or later, or Cloud Storage connector version 3.0.0 or later with image version 2.2.0 or later. The Cloud Storage connector version installed on each Dataproc image version is listed in the Dataproc image version pages.

    • If you create and use a Dataproc on GKE virtual cluster for your gRPC Cloud Storage requests, GKE version 1.28.5-gke.1199000 with gke-metadata-server 0.4.285 is recommended. This combination supports direct connectivity.
  • You or your organization administrator must grant Identity and Access Management roles that include the permissions needed to set up and make gRPC requests with the Cloud Storage connector. These roles can include the following (see the example grants after this list):

    • User role: Dataproc Editor role granted to users to allow them to create clusters and submit jobs
    • Service account role: Storage Object User role granted to the Dataproc VM service account to allow applications running on cluster VMs to view, read, create, and write Cloud Storage objects.
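
For example, project-level grants for these roles might look like the following sketch; the project ID, user email, and service account email are placeholders:

# Grant the Dataproc Editor role to a user who creates clusters and submits jobs.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=user:USER_EMAIL \
    --role=roles/dataproc.editor

# Grant the Storage Object User role to the Dataproc VM service account.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=serviceAccount:SERVICE_ACCOUNT_EMAIL \
    --role=roles/storage.objectUser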

Enable gRPC on the Cloud Storage connector

You can enable gRPC on the Cloud Storage connector at the cluster or job level. Once enabled on the cluster, Cloud Storage connector read requests use gRPC. If enabled on a job instead of at the cluster level, Cloud Storage connector read requests use gRPC for that job only.

Enable gRPC on a cluster

To enable gRPC on the Cloud Storage connector at the cluster level, set the core:fs.gs.client.type=STORAGE_CLIENT property when you create a Dataproc cluster. Once gRPC is enabled at the cluster level, Cloud Storage connector read requests made by jobs running on the cluster use gRPC.

gcloud CLI example:

gcloud dataproc clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --region=REGION \
    --properties=core:fs.gs.client.type=STORAGE_CLIENT

Replace the following:

  • CLUSTER_NAME: Specify a name for your cluster.
  • PROJECT_ID: The ID of the project where the cluster is located. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
  • REGION: Specify a Compute Engine region where the cluster will be located.
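
To confirm that the property was applied, you can inspect the cluster's software configuration; output format may vary by gcloud version:

gcloud dataproc clusters describe CLUSTER_NAME \
    --region=REGION \
    --format="value(config.softwareConfig.properties)"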

Enable gRPC on a job

To enable gRPC on the Cloud Storage connector for a specific job, include --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT when you submit a job.

Example: Run a job on an existing cluster that uses gRPC to read from Cloud Storage.

  1. Create a local /tmp/line-count.py PySpark script that uses gRPC to read a Cloud Storage text file and output the number of lines in the file.

    cat <<EOF >"/tmp/line-count.py"
    #!/usr/bin/python
    import sys
    from pyspark.sql import SparkSession

    path = sys.argv[1]
    spark = SparkSession.builder.getOrCreate()
    # Read the Cloud Storage text file into a DataFrame and count its lines.
    df = spark.read.text(path)
    lines_counter = df.count()
    print("There are {} lines in file: {}".format(lines_counter, path))
    EOF
    
  2. Create a local /tmp/line-count-sample.txt text file.

    cat <<EOF >"/tmp/line-count-sample.txt"
    Line 1
    Line 2
    line 3
    EOF
    
  3. Upload local /tmp/line-count.py and /tmp/line-count-sample.txt to your bucket in Cloud Storage.

    gcloud storage cp /tmp/line-count* gs://BUCKET
    
  4. Run the line-count.py job on your cluster. Set --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT to enable gRPC for Cloud Storage connector read requests.

    gcloud dataproc jobs submit pyspark gs://BUCKET/line-count.py \
    --cluster=CLUSTER_NAME \
    --project=PROJECT_ID  \
    --region=REGION \
    --properties=spark.hadoop.fs.gs.client.type=STORAGE_CLIENT \
    -- gs://BUCKET/line-count-sample.txt
    

    Replace the following:

    • CLUSTER_NAME: The name of an existing cluster.
    • PROJECT_ID: Your project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
    • REGION: The Compute Engine region where the cluster is located.
    • BUCKET: Your Cloud Storage bucket.
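
When the job finishes, the driver output should include a line similar to the following, with your bucket name:

There are 3 lines in file: gs://BUCKET/line-count-sample.txt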

Generate gRPC client-side metrics

You can configure the Cloud Storage connector to generate gRPC related metrics in Cloud Monitoring. The gRPC related metrics can help you to do the following:

  • Monitor and optimize the performance of gRPC requests to Cloud Storage
  • Troubleshoot and debug issues
  • Gain insights into application usage and behavior

For information about how to configure the Cloud Storage connector to generate gRPC related metrics, see Use gRPC client-side metrics.
