This page explains how to extract and transform data from a field (a cell) when you prepare data in the Wrangler workspace of the Cloud Data Fusion Studio.
To perform transformations on this data, you split it into separate columns. In Wrangler, you can extract data from a column and create new columns for the extracted data. You can extract values based on patterns, delimiters, or positions.
Extract data using patterns
You can extract data from fields in columns of the string data type with the following patterns:
- Credit cards
- Date
- Date time
- URLs from HTML anchors
- IPv4 address
- ISBN codes
- Mac address
- N digits number
- SSN
- Start and End pattern
- Time
To extract data based on a pattern, follow these steps:
- Go to Wrangler workspace in Cloud Data Fusion.
- On the Data tab, go to a column name and click the arrow_drop_down expander arrow.
- Select Extract fields > Using patterns and select an option—for example, URL.
- Optional: click Show pattern to view the regular expression for the pattern.
- Click Extract.
Wrangler extracts the fields based on the chosen pattern and adds the
extract-regex-groups
directive to the recipe. When you run the data pipeline,
Cloud Data Fusion applies the transformation to all rows in the column.
In the following example, a column contains a number, followed by an email address:
Emails |
---|
1 222larabrown@gmail.com |
2 cloudysanfrancisco@gmail.com |
To extract the email address, select the Email pattern. When you click Extract, Wrangler retains the original column and creates a new column containing only the email addresses:
Emails | Emails_1 |
---|---|
1 222larabrown@gmail.com | 222larabrown@gmail.com |
2 cloudysanfrancisco@gmail.com | cloudysanfrancisco@gmail.com |
Extract data with delimiters
You can extract data into two or more columns based on the following delimiters:
- Comma
- Tab
- Pipe
- Whitespace
- Custom separator
If a value doesn't have the delimiter, no value is added to corresponding field in the new column.
To extract values based on a delimiter:
- Go to Wrangler workspace in Cloud Data Fusion.
- On the Data tab, go to a column name and click the arrow_drop_down expander arrow.
- Select Extract fields > Using delimiters and select an option—for example, Comma.
- Click Extract.
Wrangler extracts the fields based on the selected delimiter and adds the
split-to-columns
directive to the recipe. When you run the data pipeline,
Cloud Data Fusion transforms all values in the column.
In the following example, a column contains multiple names separated by commas:
ID | Name |
---|---|
1 | Lee,Lucian,Luka |
2 | Mahan,Noam,Nur |
In this example, using the comma delimiters pattern extracts the values in the
original Name
column into three new columns:
ID | Name | Name_1 | Name_2 | Name_3 |
---|---|---|---|---|
1 | Lee,Lucian,Luka | Lee | Lucian | Luka |
2 | Mahan,Noam,Nur | Mahan | Noam | Nur |
Extract data by position
You can extract part of a string based on its position in the string.
To extract data based on its position:
- Go to Wrangler workspace in Cloud Data Fusion.
- On the Data tab, go to a column name and click the arrow_drop_down expander arrow.
- Select Extract fields > Using positions. Column values you can extract appear with a blue background.
- In any cell of the column, select the characters to extract.
- In the Name of destination column field, enter a name.
- Click Apply.
The chosen portion of the value is extracted from each row in the column.
Wrangler extracts the fields based on the selected pattern and adds the
cut-character
directive to the recipe. When you run the data pipeline,
Cloud Data Fusion applies the transformation to all values in the column.
What's next
- Learn more about Wrangler directives.