Transform SQL translations using configuration YAML files

This document shows you how to use configuration YAML files to transform SQL code while migrating it to BigQuery. It provides guidelines to create your own configuration YAML files, and provides examples for various translation transformations that are supported by this feature.

When using the BigQuery interactive SQL translator, using the BigQuery Migration API, or performing a batch SQL translation, you can provide configuration YAML files to modify a SQL query translation. Using configuration YAML files allows for further customization when translating SQL queries from your source database.

You can specify a configuration YAML file to use in a SQL translation in the following ways:

The interactive SQL translator, BigQuery Migration API, the batch SQL translator, and the batch translation Python client supports the use of multiple configuration YAML files in a single translation job. See Applying multiple YAML configurations for more information.

Configuration YAML file requirements

Before creating a configuration YAML file, review the following information to ensure that your YAML file is compatible to use with the BigQuery Migration Service:

  • You must upload the configuration YAML files to the root directory of the Cloud Storage bucket that contains your SQL translation input files. For information on how to create buckets and upload files to Cloud Storage, see Create buckets and Upload objects from a filesystem.
  • The file size for a single configuration YAML file must not exceed 1 MB.
  • The total file size of all configuration YAML files used in a single SQL translation job must not exceed 4 MB.
  • If you are using regex syntax for name matching, use RE2/J.
  • All configuration YAML file names must include a .config.yaml extension—for example, change-case.config.yaml.
    • config.yaml alone is not a valid name for the configuration file.

Guidelines to create a configuration YAML file

This section provides some general guidelines to create a configuration YAML file:

Each configuration file must contain a header specifying the type of configuration. The object_rewriter type is used to specify SQL translations in a configuration YAML file. The following example uses the object_rewriter type to transform a name case:

type: object_rewriter
global:
  case:
    all: UPPERCASE

Entity selection

To perform entity-specific transformations, specify the entity in the configuration file. All match properties are optional; only use the match properties needed for a transformation. The following configuration YAML exposes properties to be matched in order to select specific entities:

match:
  db: <literal_name>
  schema: <literal_name>
  relation: <literal_name>
  attribute: <literal_name>
  dbRegex: <regex>
  schemaRegex: <regex>
  relationRegex: <regex>
  attributeRegex: <regex>

Description of each match property:

  • db: the project_id component.
  • schema: the dataset component.
  • relation: the table component.
  • attribute: the column component. Only valid for attribute selection
  • dbRegex: matches a db property with a regular expression (Preview).
  • schemaRegex: matches schema properties to regular expressions (Preview).
  • relationRegex: matches relation properties with regular expressions (Preview).
  • attributeRegex: matches attribute properties with regular expressions. Only valid for attribute selection (Preview).

For example, the following configuration YAML specifies the match properties to select the testdb.acme.employee table for a temporary table transformation.

type: object_rewriter
relation:
-
  match:
    db: testdb
    schema: acme
    relation: employee
  temporary: true

You can use the dbRegex, schemaRegex, relationRegex, and attributeRegex properties to specify regular expressions in order to select a subset of entities. The following example changes all relations from tmp_schema schema in testdb to temporary, as long as their name starts with tmp_:

type: object_rewriter
relation:
-
  match:
    schema: tmp_schema
    relationRegex: "tmp_.*"
  temporary: true

Both literal and regex properties are matched in a case-insensitive manner. You can enforce case-sensitive matching by using a regex with a disabled i flag, as seen in the following example:

match:
  relationRegex: "(?-i:<actual_regex>)"

You can also specify fully-qualified entities using an equivalent short-string syntax. A short-string syntax expects exactly 3 (for relation selection) or 4 (for attribute selection) name segments delimited with dots, as the example testdb.acme.employee. The segments are then internally interpreted as if they were passed as db, schema, relation and attribute respectively. This means that names are matched literally, thus regular expressions are not allowed in short syntax. The following example shows the use of short-string syntax to specify a fully-qualified entity in a configuration YAML file:

type: object_rewriter
relation:
-
  match : "testdb.acme.employee"
  temporary: true

If a table contains a dot in the name, you cannot specify the name using a short syntax. In this case, you must use an object match. The following example changes the testdb.acme.stg.employee table to temporary:

type: object_rewriter
relation:
-
  match:
    db: testdb
    schema: acme
    relation: stg.employee
  temporary: true

The configuration YAML accepts key as an alias to match.

Default database

Some input SQL dialects, notably Teradata, do not support database-name in the qualified name. In this case, the easiest way to match entities is to omit db property in match.

However, you can set the default_database property of the BigQuery Migration Service and use that default database in the match.

Supported target attribute types

You can use the configuration YAML file to perform attribute type transformations, where you transform the data type of a column from the source type to a target type. The configuration YAML file supports the following target types:

  • BOOLEAN
  • TINYINT
  • SMALLINT
  • INTEGER
  • BIGINT
  • FLOAT
  • DOUBLE
  • NUMERIC (Supports optional precision and scale, such as NUMERIC(18, 2))
  • TIME
  • TIMETZ
  • DATE
  • DATETIME
  • TIMESTAMP
  • TIMESTAMPTZ
  • CHAR (Supports optional precision, such as CHAR(42))
  • VARCHAR (Supports optional precision, such as VARCHAR(42))

Configuration YAML examples

This section provides examples to create various configuration YAML files to use with your SQL translations. Each example outlines the YAML syntax to transform your SQL translation in specific ways, along with a brief description. Each example also provides the contents of a teradata-input.sql or hive-input.sql file and a bq-output.sql file so that you can compare the effects of a configuration YAML on a BigQuery SQL query translation.

The following examples use Teradata or Hive as the input SQL dialect and BigQuery SQL as the output dialect. The following examples also use testdb as the default database, and testschema as the schema search path.

Change object-name case

The following configuration YAML changes the upper or lower-casing of object names:

type: object_rewriter
global:
  case:
    all: UPPERCASE
    database: LOWERCASE
    attribute: LOWERCASE

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
      create table x(a int);
      select * from x;
    
bq-output.sql
      CREATE TABLE testdb.TESTSCHEMA.X
      (
        a INT64
      )
      ;
      SELECT
          X.a
        FROM
          testdb.TESTSCHEMA.X
      ;
    

Make table temporary

The following configuration YAML changes a regular table to a temporary table:

type: object_rewriter
relation:
  -
    match: "testdb.testschema.x"
    temporary: true

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
    create table x(a int);
    
bq-output.sql
    CREATE TEMPORARY TABLE x
    (
      a INT64
    )
    ;
    

Make table ephemeral

The following configuration YAML changes a regular table to an ephemeral table with a 60 second expiration.

type: object_rewriter
relation:
  -
    match: "testdb.testschema.x"
    ephemeral:
      expireAfterSeconds: 60

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
    create table x(a int);
    
bq-output.sql
    CREATE TABLE testdb.testschema.x
    (
      a INT64
    )
    OPTIONS(
      expiration_timestamp=timestamp_add(current_timestamp(), interval 60 SECOND)
    );
    

Set partition expiration

The following configuration YAML changes the expiration of a partitioned table to 1 day:

type: object_rewriter
relation:
  -
    match: "testdb.testschema.x"
    partitionLifetime:
      expireAfterSeconds: 86400

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
    create table x(a int, b int) partition by (a);
    
bq-output.sql
    CREATE TABLE testdb.testschema.x
    (
      a INT64,
      b INT64
    )
    CLUSTER BY a
    OPTIONS(
      partition_expiration_days=1
    );
    

Change external location or format for a table

The following configuration YAML changes the external location and formation for a table:

type: object_rewriter
relation:
  -
    match: "testdb.testschema.x"
    external:
      locations: "gs://path/to/department/files"
      format: ORC

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
    create table x(a int);
    
bq-output.sql
    CREATE EXTERNAL TABLE testdb.testschema.x
    (
      a INT64
    )
    OPTIONS(
      format='ORC',
      uris=[
        'gs://path/to/department/files'
      ]
    );
    

Set or change table description

The following configuration YAML sets the description of a table:

type: object_rewriter
relation:
  -
    match: "testdb.testschema.x"
    description:
      text: "Example description."

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
    create table x(a int);
    
bq-output.sql
    CREATE TABLE testdb.testschema.x
    (
      a INT64
    )
    OPTIONS(
      description='Example description.'
    );
    

Set or change table partitioning

The following configuration YAML changes the partitioning scheme of a table:

type: object_rewriter
relation:
  -
    match: "testdb.testschema.x"
    partition:
      simple:
        add: [a]
  -
    match: "testdb.testschema.y"
    partition:
      simple:
        remove: [a]

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
    create table x(a date, b int);
    create table y(a date, b int) partition by (a);
    
bq-output.sql
    CREATE TABLE testdb.testschema.x
    (
      a DATE,
      b INT64
    )
    PARTITION BY a;
    CREATE TABLE testdb.testschema.y
    (
      a DATE,
      b INT64
    )
    ;
    

Set or change table clustering

The following configuration YAML changes the clustering scheme of a table:

type: object_rewriter
relation:
  -
    match: "testdb.testschema.x"
    clustering:
      add: [a]
  -
    match: "testdb.testschema.y"
    clustering:
      remove: [b]

A SQL translation with this configuration YAML file might look like the following:

hive-input.sql
    create table x(a int, b int);
    create table y(a int, b int) clustered by (b) into 16 buckets;
    
bq-output.sql
    CREATE TABLE testdb.testschema.x
    (
      a INT64,
      b INT64
    )
    CLUSTER BY a;
    CREATE TABLE testdb.testschema.y
    (
      a INT64,
      b INT64
    )
    ;
    

Change type of a column attribute

The following configuration YAML changes the data type for an attribute of a column:

type: object_rewriter
attribute:
  -
    match:
      db: testdb
      schema: testschema
      attributeRegex: "a+"
    type:
      target: NUMERIC(10,2)

You can transform the source data type to any of the supported target attribute types.

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
    create table x(a int, b int, aa int);
    
bq-output.sql
    CREATE TABLE testdb.testschema.x
    (
      a NUMERIC(31, 2),
      b INT64,
      aa NUMERIC(31, 2)
    )
    ;
    

Add connection to external data lake

The following configuration YAML marks the source table as being an external table that points to data stored in an external data lake, specified by a data lake connection.

type: object_rewriter
relation:
-
  key: "testdb.acme.employee"
  external:
    connection_id: "connection_test"

A SQL translation with this configuration YAML file might look like the following:

hive-input.sql
    CREATE TABLE x
    (
      a VARCHAR(150),
      b INT
    );
    
bq-output.sql
    CREATE EXTERNAL TABLE x
    (
      a STRING,
      b INT64
    )
    WITH CONNECTION `connection_test`
    OPTIONS(
    );
    

Change the character encoding of an input file

By default, the BigQuery Migration Service attempts to automatically detect the character encoding of input files. In cases where BigQuery Migration Service might misidentify the encoding of a file, you can use a configuration YAML to specify the character encoding explicitly.

The following configuration YAML specifies the explicit character encoding of the input file as ISO-8859-1.

type: experimental_input_formats
formats:
- source:
    pathGlob: "*.sql"
  contents:
    raw:
      charset: iso-8859-1

Global type conversion

The following configuration YAML changes a data type to another across all scripts, and specifies a source data type to avoid in the transpiled script. This is different from the Change type of a column attribute configuration, where only the data type for a single attribute is changed.

BigQuery supports the following data type conversions:

  • DATETIME to TIMESTAMP
  • TIMESTAMP to DATETIME (accepts optional time zone)
  • TIMESTAMP WITH TIME ZONE to DATETIME (accepts optional time zone)
  • CHAR to VARCHAR

In the following example, the configuration YAML converts a TIMESTAMP data type to DATETIME.

type: experimental_object_rewriter
global:
  typeConvert:
    timestamp: DATETIME

In dialects like Teradata, datetime-related functions such as current_date, current_time, or current_timestamp return timestamps based on the configured time zone, either local or session. BigQuery, on the other hand, always returns timestamps in UTC. To ensure consistent behavior between the two dialects, it is necessary to configure the time zone accordingly.

In the following example, the configuration YAML converts a TIMESTAMP and a TIMESTAMP WITH TIME ZONE data type to DATETIME, with the target time zone set to Europe/Paris.

type: experimental_object_rewriter
global:
  typeConvert:
    timestamp:
      target: DATETIME
      timezone: Europe/Paris
    timestamptz:
      target: DATETIME
      timezone: Europe/Paris

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
      create table x(a timestamp);
      select a from x where a > current_timestamp(0);
    
bq-output.sql
      CREATE TABLE x
      (
        a TIMESTAMP
      )
      ;
      SELECT
          x.a
        FROM
          test.x
        WHERE x.a > datetime_trunc(current_datetime('Europe/Paris'), SECOND)
      ;
    

Select statement modification

The following configuration YAML changes the star projection, GROUP BY, and ORDER BY clauses in SELECT statements.

starProjection supports the following configurations:

  • ALLOW
  • PRESERVE (default)
  • EXPAND

groupBy and orderBy support the following configurations:

  • EXPRESSION
  • ALIAS
  • INDEX

In the following example, the configuration YAML configures the star projection to EXPAND.

type: experimental_statement_rewriter
select:
  starProjection: EXPAND

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
      create table x(a int, b TIMESTAMP);
      select * from x;
    
bq-output.sql
      CREATE TABLE x
      (
        a INT64,
        b DATETIME
      )
      ;
      SELECT
          x.a
          x.b
        FROM
          x
      ;
    

UDF specification

The following configuration YAML specifies the signature of user-defined functions (UDFs) that are used in the source scripts. Much like metadata zip files, UDF definitions can help to produce a more accurate translation of input scripts.

type: metadata
udfs:
  - "date parse_short_date(dt int)"

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
      create table x(dt int);
      select parse_short_date(dt) + 1 from x;
    
bq-output.sql
      CREATE TABLE x
      (
        dt INT64
      )
      ;
      SELECT
          date_add(parse_short_date(x.dt), interval 1 DAY)
        FROM
          x
      ;
    

Setting decimal precision strictness

By default, BigQuery Migration Service increases numeric precision to the highest precision available for a given scale. The following configuration YAML overrides this behavior by configuring the precision strictness to retain the decimal precision of the source statement.

type: experimental_statement_rewriter
common:
  decimalPrecision: STRICT

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
      create table x(a decimal(3,0));
    
bq-output.sql
      CREATE TABLE x
      (
        a NUMERIC(3)
      )
      ;
    

Output name mapping

You can use configuration YAML to map SQL object names. You can change different parts of the name depending on the object being mapped.

Static name mapping

Use static name mapping to map the name of an entity. If you only want to change specific parts of the name while keeping other parts of the name the same, then only include the parts that need to change.

The following configuration YAML changes the name of the table from my_db.my_schema.my_table to my_new_db.my_schema.my_new_table.

type: experimental_object_rewriter
relation:
-
  match: "my_db.my_schema.my_table"
  outputName:
    database: "my_new_db"
    relation: "my_new_table"

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
      create table my_db.my_schema.my_table(a int);
    
bq-output.sql
      CREATE TABLE my_new_db.my_schema.my_new_table
      (
        a INT64
      )
    

Dynamic name mapping

Use dynamic name mapping to change several objects at the same time, and create new names based on the mapped objects.

The following configuration YAML changes the name of all tables by adding the prefix stg_ to those that belong to the staging schema, and then moves those tables to the production schema.

type: experimental_object_rewriter
relation:
-
  match:
    schema: staging
  outputName:
    schema: production
    relation: "stg_${relation}"

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
      create table staging.my_table(a int);
    
bq-output.sql
      CREATE TABLE production.stg_my_table
      (
        a INT64
      )
      ;
    

Specifying default database and schema search path

The following configuration YAML specifies a default database and schema search path.

type: environment
session:
  defaultDatabase: myproject
  schemaSearchPath: [myschema1, myschema2]

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
      SELECT * FROM database.table
      SELECT * FROM table1
    
bq-output.sql
      SELECT * FROM myproject.database.table.
      SELECT * FROM myproject.myschema1.table1
    

Global output name rewrite

The following configuration YAML changes the output names of all objects (database, schema, relation, and attributes) in the script according to the configured rules.

type: experimental_object_rewriter
global:
  outputName:
    regex:
      - match: '\s'
        replaceWith: '_'
      - match: '>='
        replaceWith: 'gte'
      - match: '^[^a-zA-Z_].*'
        replaceWith: '_$0'

A SQL translation with this configuration YAML file might look like the following:

teradata-input.sql
      create table "test special chars >= 12"("42eid" int, "custom column" varchar(10));
    
bq-output.sql
      CREATE TABLE test_special_chars_employees_gte_12
      (
        _42eid INT64,
        custom_column STRING
      )
      ;
    

Optimize and improve the performance of translated SQL

Optional transformations can be applied to translated SQL in order to introduce changes that can improve the query in terms of performance or cost. These optimizations are strictly case dependent and should be evaluated against unmodified SQL output to assess their actual effect on performance.

The following configuration YAML enables optional transformations. The configuration accepts a list of optimizations and, for optimizations which accept parameters, a section with optional parameter values.

type: experimental_optimizer
transformations:
  - name: PRECOMPUTE_INDEPENDENT_SUBSELECTS
  - name: REWRITE_CTE_TO_TEMP_TABLE
    parameters:
      threshold: 1
Optimization Optional parameter Description
PRECOMPUTE_INDEPENDENT_SUBSELECTS scope: [PREDICATE, PROJECTION] Rewrites the query by adding a DECLARE statement to replace an expression in either PREDICATE clauses or PROJECTION with a precomputed variable. This will be identified as a static predicate allowing for a reduction of the amount of data read. If the scope is omitted, the default value is PREDICATE (i.e. WHERE and JOIN-ON clause).

Extracting a scalar subquery to a DECLARE statement will make the original predicate static and therefore qualify for improved execution planning. This optimization will introduce new SQL statements.
REWRITE_CTE_TO_TEMP_TABLE threshold: N Rewrites common table expressions (CTE) to temporary tables when there are more than N references to the same common table expression. This reduces query complexity and forces single execution of the common table expression. If N is omitted, the default value is 4.

We recommend using this optimization when non-trivial CTEs are referenced multiple times. Introducing temporary tables has an overhead that might be larger than eventual multiple executions of a low complexity or low cardinality CTE. This optimization will introduce new SQL statements.
REWRITE_ZERO_SCALE_NUMERIC_AS_INTEGER bigint: N Rewrites zero-scale NUMERIC/BIGNUMERIC attributes to INT64 type if the precision is within N. If N is omitted, the default value is 18.

We recommend using this optimization when translating from source dialects that don't have integer types. Changing column types requires reviewing all downstream uses for type compatibility and semantic changes. For example, fractional divisions becoming integer divisions, code expecting numeric values
DROP_TEMP_TABLE Adds DROP TABLE statements for all temporary tables created in a script and not dropped by the end of it. This reduces the storage billing period for the temporary table from 24 hours to the script running time. This optimization will introduce new SQL statements.

We recommend using this optimization when temporary tables are not accessed for any further processing after the end of script execution. This optimization will introduce new SQL statements.
REGEXP_CONTAINS_TO_LIKE Rewrites some categories of REGEXP_CONTAINS matching patterns to LIKE expressions.

We recommend using this optimization when no other process, such as macro replacement, relies on the regular expression pattern literals being preserved unchanged in output SQL.
ADD_DISTINCT_TO_SUBQUERY_IN_SET_COMPARISON Adds DISTINCT clause to subqueries used as value set for [NOT] IN operator.

We recommend using this optimization when the cardinality (distinct number of values) of the subquery result is significantly lower than the number of values. When this precondition is not met this transformation can have negative effects on performance.

Applying multiple YAML configurations

When specifying a configuration YAML file in a batch or interactive SQL translation, you can select multiple configuration YAML files in a single translation job to reflect multiple transformations. If multiple configurations conflict, one transformation might override another. We recommend using different types of configuration settings in each file to avoid conflicting transformations in the same translation job.

The following example lists two separate configuration YAML files that were provided for a single SQL translation job, one to change a column's attribute, and the other to set the table as temporary:

change-type-example.config.yaml:

type: object_rewriter
attribute:
  -
    match: "testdb.testschema.x.a"
    type:
      target: NUMERIC(10,2)

make-temp-example.config.yaml:

type: object_rewriter
relation:
  -
    match: "testdb.testschema.x"
    temporary: true

A SQL translation with these two configuration YAML files might look like the following:

teradata-input.sql
    create table x(a int);
    
bq-output.sql
    CREATE TEMPORARY TABLE x
    (
      a NUMERIC(31, 2)
    )
    ;