About an incremental table
Dataform updates tables differently based on the table type. During each execution of a table or a view, Dataform rebuilds the whole table or view from scratch.
When you define an incremental table, Dataform builds the incremental table from scratch only for the first time. During subsequent executions, Dataform only inserts or merges new rows into the incremental table according to the conditions that you configure.
Dataform inserts new rows only into columns that already exist in the incremental table. If you make changes to the incremental table definition query — for example, add a new column — you must rebuild the table from scratch. To do so, the next time you trigger an execution of the table, select the Run with full refresh option.
Here are some common use cases for incremental tables:
- Performance optimization
- For some kinds of data, such as web logs or analytics data, you might want to only process new records instead of reprocessing the entire table.
- Latency reduction
- You can use incremental tables to execute workflows quickly but frequently, reducing the downstream latency of the output tables.
- Daily snapshots
- You can configure an incremental table to create daily snapshots of the table data, for example, for longitudinal analysis of user settings stored in a production database.
Before you begin
In the Google Cloud console, go to the Dataform page.
Select or create a repository.
Select or create a development workspace.
Create a table of the
incremental
table type.
Required roles
To get the permissions that you need to configure an incremental table,
ask your administrator to grant you the
Dataform Editor (roles/dataform.editor
) IAM role on workspaces.
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Process a subset of rows in an incremental table
To determine a subset of rows for Dataform to process during each
execution, add a conditional WHERE
clause to the incremental table SQLX
definition file. In the WHERE
clause,
you can specify an incremental condition and a non-incremental condition.
Dataform applies the incremental condition during table execution
without a full refresh, and the non-incremental condition during execution
with a full refresh.
To configure an incremental table, follow these steps:
- Go to your development workspace.
- In the Files pane, expand
definitions/
. - Open an incremental table definition SQLX file.
Enter a
WHERE
clause in the following format:config { type: "incremental" } SELECT_STATEMENT ${when(incremental(), `WHERE INCREMENTAL_CONDITION`, `WHERE NON_INCREMENTAL_CONDITION`) }
Replace the following:
- SELECT_STATEMENT: the
SELECT
statement that defines your table INCREMENTAL_CONDITION: the condition you specify in the
WHERE
clause to select rows for Dataform to process during table execution without a full refresh.NON_INCREMENTAL_CONDITION: the condition you specify in the
WHERE
clause to select rows for Dataform to process during table execution with a full refresh.
- SELECT_STATEMENT: the
Optional: Click Format.
The following code sample shows an incremental table that incrementally
processes rows of the productiondb.logs
table:
config { type: "incremental" }
SELECT timestamp, message FROM ${ref("productiondb", "logs")}
${when(incremental(),
`WHERE date > (SELECT MAX(date) FROM ${self()}) AND country = "UK"`,
`WHERE country = "UK"`)}
The following code sample shows an incremental table that creates a snapshot
of the productiondb.customers
table:
config { type: "incremental" }
SELECT CURRENT_DATE() AS snapshot_date, customer_id, name, account_settings FROM ${ref("productiondb", "customers")}
${when(incremental(), `WHERE snapshot_date > (SELECT MAX(snapshot_date) FROM ${self()})`) }
Merge rows in an incremental table
To ensure that an incremental table contains only one row corresponding to a
selected combination of columns, set the selected columns as uniqueKey
to
merge rows that have the same uniqueKey
. When updating the table,
Dataform merges rows with uniqueKey
instead of appending them.
To configure merging in an incremental table, follow these steps:
- Go to your development workspace.
- In the Files pane, expand
definitions/
. - Select an incremental table definition SQLX file
In the
config
block, set the selected columns asuniqueKey
in the following format:uniqueKey: ["COLUMN_NAME"]
Replace COLUMN_NAME with the name of a selected column.
Optional: Click Format.
The following code sample shows an incremental table with the transaction_id
column set as uniqueKey
to ensure it always contains one row:
config {
type: "incremental",
uniqueKey: ["transaction_id"]
}
SELECT timestamp, action FROM weblogs.user_actions
${ when(incremental(), `WHERE timestamp > (SELECT MAX(timestamp) FROM ${self()})`) }
Filter rows in an incremental table
In an incremental partitioned table, to avoid Dataform scanning the
whole table to find matching rows, set updatePartitionFilter
to only consider
a subset of records.
The following code sample shows an incremental partitioned table with merging
configured by setting the uniqueKey
and updatePartitionFilter
properties:
config {
type: "incremental",
uniqueKey: ["transaction_id"],
bigquery: {
partitionBy: "DATE(timestamp)",
updatePartitionFilter:
"timestamp >= timestamp_sub(current_timestamp(), interval 24 hour)"
}
}
SELECT timestamp, action FROM weblogs.user_actions
${ when(incremental(), `WHERE timestamp > (SELECT MAX(timestamp) FROM ${self()})`) }
Avoid full table scans when ingesting from a partitioned table
When you create an incremental table that references a partitioned table, we recommend that you build your table query to avoid full table scans of the partitioned table during each incremental update.
You can limit the number of partitions that BigQuery scans to update
the incremental table by using a constant expression in your
table query. To turn a value from the partitioned table into a constant
expression, use BigQuery scripting to
declare the value as a variable in the pre_operations
block.
Then, use the variable as a constant expression in a WHERE
clause in
the SELECT
query.
With this configuration, Dataform updates the incremental table based on the most recent partitions of the referenced partitioned table, without scanning the entire table.
To configure an incremental table that references a partitioned table and avoids full table scans, follow these steps:
- Go to your development workspace.
- In the Files pane, expand
definitions/
. - Select an incremental table definition SQLX file
- In the
pre_operations
block, declare a variable with BigQuery scripting. - Filter the
SELECT
statement that defines the table with aWHERE
clause that references the declared variable. - Optional: Click Format.
The following code sample shows an incremental table in which the referenced
raw_events
table is partitioned by event_timestamp
:
config {
type: "incremental",
}
pre_operations {
DECLARE event_timestamp_checkpoint DEFAULT (
${when(incremental(),
`SELECT max(event_timestamp) FROM ${self()}`,
`SELECT timestamp("2000-01-01")`)}
)
}
SELECT
*
FROM
${ref("raw_events")}
WHERE event_timestamp > event_timestamp_checkpoint
In the preceding code sample, the event_timestamp_checkpoint
variable is
defined in the pre_operations
block.
The event_timestamp_checkpoint
variable is then used as a constant expression
in the WHERE
clause.
Rebuild an incremental table from scratch with full refresh
You can force an incremental table to be rebuilt from scratch using either the
command-line interface with the --full-refresh
option
or with the Run with full refresh option when
triggering a workflow execution.
When you select the full refresh option, in your development workspace or by
using the Dataform CLI, Dataform ignores the
${when(incremental(), ... }
parameter during execution and recreates the
table with a CREATE OR REPLACE
statement.
Protect an incremental table from full refresh
To protect an incremental table from being rebuilt from scratch and potential
data loss, you can set the incremental table as protected
. You might want to
prevent an incremental table from being rebuilt if your data source is temporary.
To mark an incremental table as protected
, follow these steps:
- Go to your development workspace.
- In the Files pane, expand
definitions/
. - Select an incremental table definition SQLX file.
- In the
config
block, enterprotected: true
. - Optional: Click Format.
The following code sample shows an incremental tables marked as protected
:
config {
type: "incremental",
protected: true
}
SELECT ...
What's next
- To learn how to define a table, see Create a table.
- To learn how to use the Dataform command line interface, see Use the Dataform CLI.
- To learn to manually trigger executions, see Trigger execution.