The File Format Conversion template is a batch pipeline that converts files stored on Cloud Storage from one supported format to another.
The following format conversions are supported:
- CSV to Avro
- CSV to Parquet
- Avro to Parquet
- Parquet to Avro
Pipeline requirements
- The output Cloud Storage bucket must exist before running the pipeline.
Template parameters
Parameter | Description |
---|---|
inputFileFormat | The input file format. Must be one of [csv, avro, parquet]. |
outputFileFormat | The output file format. Must be one of [avro, parquet]. |
inputFileSpec | The Cloud Storage path pattern for input files. For example, gs://bucket-name/path/*.csv |
outputBucket | The Cloud Storage folder to write output files. This path must end with a slash. For example, gs://bucket-name/output/ |
schema | The Cloud Storage path to the Avro schema file. For example, gs://bucket-name/schema/my-schema.avsc. A sample schema is shown after this table. |
containsHeaders | (Optional) Whether the input CSV files contain a header record (true/false). The default value is false. Required only when reading CSV files. |
csvFormat | (Optional) The CSV format specification to use for parsing records. The default value is Default. See Apache Commons CSV Format for more details. |
delimiter | (Optional) The field delimiter used by the input CSV files. |
outputFilePrefix | (Optional) The output file prefix. The default value is output. |
numShards | (Optional) The number of output file shards. |
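The schema parameter points to a standard Avro schema (.avsc) file. The following is a minimal sketch of such a schema for a hypothetical three-column CSV input; the record name, field names, and types are illustrative and must match your own data:

```json
{
  "type": "record",
  "name": "Transaction",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "transaction_time", "type": "string"}
  ]
}
```

A hypothetical CSV row matching this schema could look like `tx-001,42.50,2023-09-12T10:00:00Z`.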
Run the template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Convert file formats template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
```
gcloud dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/File_Format_Conversion \
    --parameters \
       inputFileFormat=INPUT_FORMAT,\
       outputFileFormat=OUTPUT_FORMAT,\
       inputFileSpec=INPUT_FILES,\
       schema=SCHEMA,\
       outputBucket=OUTPUT_FOLDER
```
Replace the following:
- PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- REGION_NAME: the region where you want to deploy your Dataflow job, for example us-central1
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- INPUT_FORMAT: the file format of the input files; must be one of [csv, avro, parquet]
- OUTPUT_FORMAT: the file format of the output files; must be one of [avro, parquet]
- INPUT_FILES: the path pattern for input files
- OUTPUT_FOLDER: your Cloud Storage folder for output files
- SCHEMA: the path to the Avro schema file
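As an illustration, the following invocation converts CSV files to Parquet using the latest template version. The project, bucket, and schema path (my-project, gs://my-bucket/...) are placeholders; substitute your own values:

```
gcloud dataflow flex-template run csv-to-parquet-job \
    --project=my-project \
    --region=us-central1 \
    --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/File_Format_Conversion \
    --parameters \
       inputFileFormat=csv,\
       outputFileFormat=parquet,\
       inputFileSpec=gs://my-bucket/input/*.csv,\
       schema=gs://my-bucket/schemas/transaction.avsc,\
       outputBucket=gs://my-bucket/output/,\
       containsHeaders=true
```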
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.locations.flexTemplates.launch.
```
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
  "launch_parameter": {
    "jobName": "JOB_NAME",
    "parameters": {
      "inputFileFormat": "INPUT_FORMAT",
      "outputFileFormat": "OUTPUT_FORMAT",
      "inputFileSpec": "INPUT_FILES",
      "schema": "SCHEMA",
      "outputBucket": "OUTPUT_FOLDER"
    },
    "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/File_Format_Conversion"
  }
}
```
Replace the following:
- PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- LOCATION: the region where you want to deploy your Dataflow job, for example us-central1
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- INPUT_FORMAT: the file format of the input files; must be one of [csv, avro, parquet]
- OUTPUT_FORMAT: the file format of the output files; must be one of [avro, parquet]
- INPUT_FILES: the path pattern for input files
- OUTPUT_FOLDER: your Cloud Storage folder for output files
- SCHEMA: the path to the Avro schema file
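For example, one way to issue this request from a shell is with curl, authenticating with an access token from the gcloud CLI. This is a sketch only; my-project, my-bucket, the schema path, and the job name are placeholders:

```
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "launch_parameter": {
          "jobName": "csv-to-parquet-job",
          "parameters": {
            "inputFileFormat": "csv",
            "outputFileFormat": "parquet",
            "inputFileSpec": "gs://my-bucket/input/*.csv",
            "schema": "gs://my-bucket/schemas/transaction.avsc",
            "outputBucket": "gs://my-bucket/output/"
          },
          "containerSpecGcsPath": "gs://dataflow-templates-us-central1/latest/flex/File_Format_Conversion"
        }
      }' \
  "https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1/flexTemplates:launch"
```

On success, the response describes the launched job, including its ID.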
What's next
- Learn about Dataflow templates.
- See the list of Google-provided templates.