Wrangler is a visual data preparation tool within the Cloud Data Fusion Studio interface. It lets you clean and transform data before using it in Extract, Transform, Load (ETL) pipelines. Wrangler applies transformations to a sample of your data in one place (called a Preview) before you run the logic on the entire dataset. The preview lets you try out transformations and understand how they affect the entire dataset.
Wrangler directives
A directive is a single instruction used within Wrangler. Directives specify how to manipulate your data, such as transforming, filtering, or pivoting individual records.
The following concepts are related to directives:
- Recipe
- A recipe is an ordered list of one or more directives.
- Transformation step
- A transformation step is an implementation of a data transformation directive that operates on a single record or a set of records. Applying a directive can generate zero or more records. Wrangler applies the transformation steps in the order listed in the recipe.
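The relationship between a recipe and its transformation steps can be sketched in a few lines of Python. This is only an illustrative model, not Wrangler's implementation: the step functions (`drop_blank_name`, `uppercase_name`, `split_tags`), the `apply_recipe` helper, and the sample records are all hypothetical. It shows that each step may emit zero, one, or several records, and that steps run in recipe order.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
# A step takes one record and returns zero or more records.
Step = Callable[[Record], List[Record]]


def drop_blank_name(record: Record) -> List[Record]:
    """Hypothetical filter step: emits zero records when 'name' is blank."""
    return [record] if record.get("name", "").strip() else []


def uppercase_name(record: Record) -> List[Record]:
    """Hypothetical transform step: emits exactly one changed record."""
    return [{**record, "name": record["name"].upper()}]


def split_tags(record: Record) -> List[Record]:
    """Hypothetical step that emits one record per comma-separated tag."""
    return [{**record, "tags": tag} for tag in record["tags"].split(",")]


def apply_recipe(recipe: List[Step], records: Iterable[Record]) -> List[Record]:
    """Apply each step, in recipe order, to the output of the previous step."""
    current = list(records)
    for step in recipe:
        current = [out for rec in current for out in step(rec)]
    return current


sample = [
    {"name": "ada", "tags": "x,y"},
    {"name": " ", "tags": "z"},
]
recipe = [drop_blank_name, uppercase_name, split_tags]
result = apply_recipe(recipe, sample)
print(result)
```

Reordering the list changes the outcome, which is why the order of steps in a recipe matters.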
Wrangler components
The following sections explain components of Wrangler in the Cloud Data Fusion Studio.
Wrangler workspace
The Wrangler workspace is a page in the Cloud Data Fusion Studio interface where you parse, blend, cleanse, and transform datasets. On the Workspace page, you can do the following:
- Add transformation steps to a recipe using the drop-down menu in each column.
- View or delete steps in a recipe by selecting the Transformation steps tab.
- Discover columns with blank fields and other information by checking the Data quality bar.
- View the schema for the dataset by clicking More.
- Create a data pipeline that contains a source plugin for the dataset and a Wrangler transformation with the recipe. The transformation steps in the recipe are executed when the pipeline runs.
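As a rough idea of the kind of information a data quality indicator conveys, the following sketch computes the share of blank values per column over a small sample. It is an assumption-laden illustration, not how the Data quality bar is computed: the `blank_ratio` function and the sample rows are hypothetical, and "blank" is taken here to mean missing, empty, or whitespace-only.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def blank_ratio(rows: List[Row], columns: List[str]) -> Dict[str, float]:
    """Return, per column, the fraction of rows whose value is missing or blank."""
    if not rows:
        return {col: 0.0 for col in columns}
    ratios = {}
    for col in columns:
        # Treat None, "", and whitespace-only strings as blank.
        blanks = sum(1 for row in rows if not (row.get(col) or "").strip())
        ratios[col] = blanks / len(rows)
    return ratios


rows = [
    {"id": "1", "email": "a@example.com"},
    {"id": "2", "email": ""},
    {"id": "3", "email": None},
    {"id": "4", "email": "d@example.com"},
]
quality = blank_ratio(rows, ["id", "email"])
print(quality)
```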
Wrangler Power Mode (CLI)
To specify directives using declarative syntax, use Power Mode (CLI). It's useful for the following tasks:
- Using directives that aren't available in the Studio interface
- Adding user-defined directives
- Applying a directive to multiple columns
To use Wrangler Power Mode, enter directives in the black bar at the bottom of the Wrangler Data tab.
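When the same directive needs to be applied to several columns, it can help to generate the directive lines and paste them into Power Mode. The sketch below does that in Python. The directive name `trim`, the `:column` argument form, and the column names are assumptions used for illustration; check the Wrangler directive reference for the exact syntax of the directive you need.

```python
from typing import List


def directive_for_columns(directive: str, columns: List[str]) -> List[str]:
    """Build one directive line per column, for example 'trim :first_name'."""
    return [f"{directive} :{column}" for column in columns]


# Hypothetical column names; 'trim' is assumed to be an available directive.
lines = directive_for_columns("trim", ["first_name", "last_name", "city"])
recipe_text = "\n".join(lines)
print(recipe_text)
```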
Wrangler Insights tab
You can use the Insights tab on the Wrangler page to perform data discovery on a dataset.
Limitations
- Wrangler is supported only for batch ETL pipelines.
- Wrangler applies transformations only to the sample data, which is limited to the first 1000 records.
- Wrangler requires connections to be created with the source. For more information, see Create and manage connections.
- Wrangler always requires at least one Wrangler workspace to be open.
- Clicking the Wrangle button in the Wrangler transformation isn't supported.
Navigate to Wrangler in Cloud Data Fusion
You can access Wrangler in two ways from the Cloud Data Fusion Studio interface:
- To open the Cloud Data Fusion Wrangler workspace, go to the Cloud Data Fusion Studio and click Wrangler.
- To configure Wrangler properties, go to the Cloud Data Fusion Studio, and click Studio > Transformations > Wrangler.
Connect to a data source
Wrangler supports various data sources, such as BigQuery, Cloud Storage, and external databases (with additional configuration). To use Wrangler, you must create a connection with the source.
To create the connection, go to the Connections list and select the connection to your data source. For more information, see Create and manage connections.
Explore and preview data
Wrangler displays a sample of your data (up to the first 1000 records) for inspection. You can get an overview of the data schema, including data types and basic statistics.
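The idea of profiling a bounded sample instead of the full dataset can be sketched as follows. This is a minimal illustration under stated assumptions, not Wrangler's sampling or type-inference logic: `take_sample`, `infer_type`, `profile`, and `generate_records` are hypothetical helpers, and the 1000-record limit is the figure given in this document.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

SAMPLE_LIMIT = 1000  # sample size stated in this document


def take_sample(records: Iterable[Dict[str, str]],
                limit: int = SAMPLE_LIMIT) -> List[Dict[str, str]]:
    """Keep only the first `limit` records, without reading the rest."""
    return list(islice(records, limit))


def infer_type(values: List[str]) -> str:
    """Very rough type guess: 'int', 'float', or 'string'."""
    for name, cast in (("int", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "string"


def profile(sample: List[Dict[str, str]]) -> Dict[str, Dict[str, object]]:
    """Summarize each column with a guessed type and a distinct-value count."""
    columns = list(sample[0].keys()) if sample else []
    return {
        col: {
            "type": infer_type([row[col] for row in sample]),
            "distinct": len({row[col] for row in sample}),
        }
        for col in columns
    }


def generate_records(count: int) -> Iterator[Dict[str, str]]:
    """Hypothetical source that yields `count` synthetic records."""
    for i in range(count):
        yield {"id": str(i), "price": f"{i % 7}.5",
               "city": "Oslo" if i % 2 else "Lima"}


sample = take_sample(generate_records(5000))
summary = profile(sample)
print(len(sample), summary)
```

Because the statistics come from the sample only, values that first appear later in the dataset are not reflected in the summary.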
Apply directives
Wrangler offers a variety of built-in directives for common data wrangling tasks.
- Drag the chosen directive onto a specific column or the data preview window.
- Each directive has configuration options to customize its behavior.
For more information, see Wrangler command-line directives.
Preview transformation results
As you apply directives, the data preview window dynamically updates to reflect the changes. This lets you see the immediate impact of each transformation on your data.
Refine and iterate
To refine your data wrangling process, continue adding directives, modifying configurations, and reviewing the preview.
Wrangler's visual interface helps you experiment and ensure that your transformations produce the expected outcome.
Add transformations to a pipeline
Wrangler itself doesn't store your work persistently, but Cloud Data Fusion offers the following ways to capture your wrangling logic in a pipeline:
Create a pipeline. From the Wrangler workspace, convert your Wrangler transformations into a Cloud Data Fusion pipeline by following these steps:
- Click Create pipeline.
- Select Batch pipeline. The Pipeline Studio page opens with a pipeline that has a source and a Wrangler transformation.
Apply transformations. If you're using the Wrangler plugin on the Studio page, convert your Wrangler transformations into a Cloud Data Fusion pipeline by clicking Apply.
Edit recipes
If you created a Wrangler transformation in the Wrangler workspace and then added it to a pipeline, it's recommended that you use the Wrangler interface to add or edit its recipes.
If you manually edit the recipe or add new steps to it in the Wrangler transformation, and the changes affect the output schema, you must manually update the output schema in the Wrangler transformation to match the recipe. The output schema is created and updated automatically only for recipes that you create or edit in the Wrangler workspace.
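A mismatch between a manually edited recipe and the output schema can be thought of as a difference between two sets of field names. The following sketch illustrates that check. It is not part of Cloud Data Fusion: the `schema_drift` function, the field names, and the example record are hypothetical.

```python
from typing import Dict, List, Set


def schema_drift(declared_fields: List[str],
                 produced_record: Dict[str, object]) -> Dict[str, Set[str]]:
    """Compare declared output schema fields with the fields a recipe produced."""
    declared = set(declared_fields)
    produced = set(produced_record)
    return {
        # Fields the recipe emits that the schema does not declare.
        "missing_from_schema": produced - declared,
        # Fields the schema declares that the recipe no longer emits.
        "missing_from_output": declared - produced,
    }


# Hypothetical case: the recipe was edited by hand to rename 'fname' to
# 'first_name' and to drop 'body', but the output schema was not updated.
declared_fields = ["id", "fname", "body"]
produced_record = {"id": 7, "first_name": "Ada"}
drift = schema_drift(declared_fields, produced_record)
print(drift)
```

Both sets are empty when the schema and the recipe output agree on field names. This sketch does not compare field types.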
To edit a recipe in the Wrangler transformation that was created in the Wrangler web interface, follow these steps:
- Go to the Wrangler node in your pipeline and click Properties.
- Click Wrangle.
- Edit or add a new recipe.
- Click Apply.
What's next
- Learn more about Wrangler CLI directives.