This page explains how to remove common errors from a dataset when you prepare data in the Wrangler workspace of the Cloud Data Fusion Studio.
The following types of errors can occur in datasets:
- Systemic errors, such as service or instance failures
- Logical errors, such as pipeline run failures
- Data errors, such as invalid credit card numbers, invalid date formats, or invalid zip codes
Wrangler provides a set of over 50 directives to help you remove common errors from a dataset.
To send records to error, follow these steps:
- Go to the Wrangler workspace in Cloud Data Fusion.
- On the Data tab, go to a column name and click the expander arrow.
- Select Send to error, and then select the condition that sends bad records to error.
Wrangler removes values that match the specified condition from the sample and adds the `send-to-error` directive to the recipe. When you run the data pipeline, the transformation is applied to all values in the column.
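For example, flagging rows whose zip code column is empty or non-numeric might produce a recipe like the following sketch. The column name `zip_code` is hypothetical, and the exact condition Wrangler generates depends on the option you select in the UI; `dq:isNumber` is one of Wrangler's data-quality functions:

```
send-to-error empty(zip_code)
send-to-error !dq:isNumber(zip_code)
```

Each `send-to-error` directive takes a boolean condition; rows for which the condition evaluates to true are routed to the pipeline's error output instead of continuing downstream.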
Add an error collector plugin to a data pipeline
When you add a Wrangler transformation with a recipe that includes the `send-to-error` directive to a data pipeline, you can choose to connect it to the Error Collector plugin. The Error Collector plugin is usually connected to a downstream sink plugin, such as a BigQuery sink.
When you run the pipeline, the records flagged by the `send-to-error` directive flow from the Wrangler transformation step in your pipeline to the Error Collector step, and then to the sink step. When the run finishes, you can examine the flagged records written to the sink.
If your recipe includes the `send-to-error` directive, but the pipeline doesn't include the Error Collector plugin, the flagged records are dropped during the pipeline run.
What's next
- Learn more about Wrangler directives.