The WordCount template is a batch pipeline that reads text from Cloud Storage, tokenizes the text lines into individual words, and performs a frequency count on each of the words. For more information about WordCount, see WordCount Example Pipeline.
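The tokenize-and-count steps described above can be sketched in plain Python. This is only an illustration of the logic, not the template's actual implementation: the function name `count_words` and the tokenization rule (runs of ASCII letters and apostrophes, case-sensitive) are assumptions made for this example.

```python
import re
from collections import Counter


def count_words(lines):
    """Split each text line into words and count how often each word occurs.

    Assumption for this sketch: a "word" is a run of ASCII letters or
    apostrophes, and counting is case-sensitive.
    """
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[A-Za-z']+", line))
    return counts


if __name__ == "__main__":
    sample = ["to be or not to be", "that is the question"]
    # Print one "word: count" pair per line, sorted by word.
    for word, n in sorted(count_words(sample).items()):
        print(f"{word}: {n}")
```

The template performs the same kind of work as a distributed batch job, so it can handle inputs far larger than a single machine could process this way.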
If the Cloud Storage bucket is outside of your service perimeter, create an egress rule that allows access to the bucket.
Template parameters
Parameter | Description |
---|---|
inputFile | The Cloud Storage input file's path. |
output | The Cloud Storage output file's path and prefix. |
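The output parameter is a path and prefix rather than a single file name because the job can write its results as several sharded files. The sketch below assumes the default Apache Beam text-output shard naming (`PREFIX-SSSSS-of-NNNNN`). The helper `shard_names` is hypothetical, and the actual number of shards is chosen by the service at run time.

```python
def shard_names(prefix, num_shards):
    """Return the file names that a given output prefix expands to.

    Assumption: shards follow the pattern PREFIX-SSSSS-of-NNNNN, where
    SSSSS is the zero-padded shard index and NNNNN the shard count.
    """
    return [f"{prefix}-{i:05d}-of-{num_shards:05d}" for i in range(num_shards)]


if __name__ == "__main__":
    # BUCKET_NAME is a placeholder, as elsewhere on this page.
    for name in shard_names("gs://BUCKET_NAME/output/my_output", 3):
        print(name)
```

When you read the results, match on the prefix (for example, with a wildcard) instead of expecting a file named exactly like the prefix.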
Run the WordCount template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the WordCount template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates/latest/Word_Count \
    --region REGION_NAME \
    --parameters inputFile=gs://dataflow-samples/shakespeare/kinglear.txt,output=gs://BUCKET_NAME/output/my_output
Replace the following:
- JOB_NAME: a unique job name of your choice
- REGION_NAME: the region where you want to deploy your Dataflow job, for example, us-central1
- BUCKET_NAME: the name of your Cloud Storage bucket
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates/latest/Word_Count
{
  "jobName": "JOB_NAME",
  "parameters": {
    "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
    "output": "gs://BUCKET_NAME/output/my_output"
  },
  "environment": { "zone": "us-central1-f" }
}
Replace the following:
- PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- LOCATION: the region where you want to deploy your Dataflow job, for example, us-central1
- BUCKET_NAME: the name of your Cloud Storage bucket
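As one possible way to issue the request above from a script, the following sketch builds the POST request with Python's standard library. The function name `build_launch_request` and its arguments are hypothetical. It assumes that you already have an OAuth 2.0 access token (for example, from `gcloud auth print-access-token`). It only constructs the request, so nothing is sent until you pass it to `urllib.request.urlopen`.

```python
import json
import urllib.parse
import urllib.request

# Template location, as used in the request shown above.
TEMPLATE_PATH = "gs://dataflow-templates/latest/Word_Count"


def build_launch_request(project_id, location, job_name, bucket_name, access_token):
    """Build (but do not send) the templates:launch POST request."""
    # URL-encode the template path so it is safe to use as a query parameter.
    query = urllib.parse.urlencode({"gcsPath": TEMPLATE_PATH})
    url = (
        "https://dataflow.googleapis.com/v1b3/"
        f"projects/{project_id}/locations/{location}/templates:launch?{query}"
    )
    body = {
        "jobName": job_name,
        "parameters": {
            "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
            "output": f"gs://{bucket_name}/output/my_output",
        },
        "environment": {"zone": "us-central1-f"},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; replace them before sending a real request.
    request = build_launch_request(
        "PROJECT_ID", "us-central1", "JOB_NAME", "BUCKET_NAME", "ACCESS_TOKEN"
    )
    print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it possible to inspect the URL and body before a job is launched.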