The ML.ENTITY_FEATURES_AT_TIME function
This document describes the ML.ENTITY_FEATURES_AT_TIME
function, which lets
you use multiple point-in-time cutoffs for multiple entities when retrieving
features, because features can have time dependencies if they include
time-sensitive data. To avoid
data leakage, use
point-in-time features when training models and running inference.
Use this function to retrieve features from multiple entities for multiple
points in time. For example, you could retrieve features created at or before
three different points in time for entity 1, and features created at or before
yet another different point in time for entity 2. Use the
ML.FEATURES_AT_TIME
function
to use the same point-in-time cutoff for all entities when retrieving features.
Syntax
ML.ENTITY_FEATURES_AT_TIME( { TABLE feature_table | (feature_query_statement) }, { TABLE entity_time_table | (entity_time_query_statement) } [, num_rows => INT64][, ignore_feature_nulls => BOOL])
Arguments
ML.ENTITY_FEATURES_AT_TIME
takes the following arguments:
feature_table
: aSTRING
value that specifies the name of the BigQuery table that contains the feature data. The feature table must contain the following columns:entity_id
: aSTRING
column that contains the ID of the entity related to the features.- One or more feature columns.
feature_timestamp
: aTIMESTAMP
column that identifies when the feature data was last updated.
The column names are case-insensitive. For example, you can use a column named
Entity_ID
instead ofentity_id
.The feature table must be in wide format, with one column for each feature.
feature_query_statement
: aSTRING
value that specifies a GoogleSQL query that returns the feature data. This query must return the same columns asfeature_table
. See GoogleSQL query syntax for the supported SQL syntax of thefeature_query_statement
clause.entity_time_table
: aSTRING
value that specifies the name of the BigQuery table that maps entity IDs to feature lookup times. The entity time table must contain the following columns:entity_id
: aSTRING
column that contains the entity ID.time
: aTIMESTAMP
column that identifies a point in time to use as a cutoff time when selecting features for the entity represented by the entity ID.
The column names are case-insensitive. For example, you can use a column named
Entity_ID
instead ofentity_id
.The table identified by
entity_time_table
must be no larger than 100 MB.entity_time_query_statement
: aSTRING
value that specifies a GoogleSQL query that returns the entity time data. This query must return the same columns asentity_time_table
. See GoogleSQL query syntax for the supported SQL syntax of theentity_time_query_statement
clause.num_rows
: anINT64
value that specifies the number of rows to return for each row inentity_time_table
. Defaults to1
.ignore_feature_nulls
: aBOOL
value that indicates whether to replace aNULL
value in a feature column with the feature column value from the row for the same entity that immediately precedes it in time. For example, for the following feature table and entity time table:Feature table
+-----------+------+------+--------------------------+ | entity_id | f1 | f2 | feature_timestamp | +-----------+------+------+--------------------------+ | '2' | 5.0 | 8.0 | '2022-06-10 09:00:00+00' | +-----------+------+------+--------------------------+ | '2' | 2.0 | 4.0 | '2022-06-10 12:00:00+00' | +-----------+------+------+--------------------------+ | '2' | 7.0 | NULL | '2022-06-11 10:00:00+00' | +-----------+------+------+--------------------------+
Entity time table
+-----------+--------------------------+ | entity_id | time | +-----------+--------------------------+ | '2' | '2022-06-11 10:00:00+00' | +-----------+--------------------------+
Running this query:
SELECT * FROM ML.ENTITY_FEATURES_AT_TIME( TABLE mydataset.feature_table, TABLE mydataset.entity_time_table, num_rows => 1, ignore_feature_nulls => TRUE);
Results in the following output, where the
f2
value from the row for entity ID 2 that is timestamped'2022-06-10 12:00:00+00'
is substituted for theNULL
value in the row timestamped'2022-06-11 10:00:00+00'
:+-----------+------+------+--------------------------+ | entity_id | f1 | f2 | feature_timestamp | +-----------+------+------+--------------------------+ | '2' | 7.0 | 4.0 | '2022-06-11 10:00:00+00' | +-----------+------+------+--------------------------+
If there is no available replacement value, for example, where there is no earlier row for that entity ID, a
NULL
value is returned.Defaults to
FALSE
.
Output
ML.ENTITY_FEATURES_AT_TIME
returns the input table rows that meet the
point-in-time cutoff criteria, with the feature_timestamp
column showing the
timestamp from the time
column of the entity time table.
Because you can specify multiple points in time from which to retrieve features
for the same entity, it is possible to return duplicate rows, depending on the
timestamps in the feature and entity time tables, and the num_rows
value
you specify. For example, if the only row in the feature table for entity ID 1
is timestamped 2022-06-11 10:00:00+00
, and you have two rows for entity ID 1
in the entity time table that both have later timestamps, the function output
has 2 rows with the same feature data for entity ID 1.
If either of the following conditions are true:
- No entity ids from the entity time table are found in the feature table.
- The rows in the feature table whose entity ids match those in the entity time table don't meet the point-in-time criteria.
Then the function doesn't return any output for that entity time table row.
Examples
Example 1
This example shows a how to retrain a model using only features that were
created or updated before the timestamps identified in
mydataset.entity_time_table
:
CREATE OR REPLACE `mydataset.mymodel` OPTIONS (WARM_START = TRUE) AS SELECT * EXCEPT (feature_timestamp, entity_id) FROM ML.ENTITY_FEATURES_AT_TIME( TABLE `mydataset.feature_table`, TABLE `mydataset.entity_time_table`, num_rows => 1, ignore_feature_nulls => TRUE);
Example 2
This example shows a how to get predictions from a model based on features
that were created or updated before the timestamps identified in
mydataset.entity_time_table
:
SELECT * FROM ML.PREDICT( MODEL `mydataset.mymodel`, ( SELECT * EXCEPT (feature_timestamp, entity_id) FROM ML.ENTITY_FEATURES_AT_TIME( TABLE `mydataset.feature_table`, TABLE `mydataset.entity_time_table`, num_rows => 1, ignore_feature_nulls => TRUE) ) );
Example 3
This is a contrived example that you can use to see the output of the function:
WITH feature_table AS ( SELECT * FROM UNNEST( ARRAY<STRUCT<entity_id STRING, f_1 FLOAT64, f_2 FLOAT64, feature_timestamp TIMESTAMP>>[ ('id1', 1.0, 1.0, TIMESTAMP '2022-06-10 12:00:00+00'), ('id2', 12.0, 24.0, TIMESTAMP '2022-06-11 12:00:00+00'), ('id1', 11.0, NULL, TIMESTAMP '2022-06-11 12:00:00+00'), ('id1', 6.0, 12.0, TIMESTAMP '2022-06-11 10:00:00+00'), ('id2', 2.0, 4.0, TIMESTAMP '2022-06-10 12:00:00+00'), ('id2', 7.0, NULL, TIMESTAMP '2022-06-11 10:00:00+00')]) ), entity_time_table AS ( SELECT * FROM UNNEST( ARRAY<STRUCT<entity_id STRING, time TIMESTAMP>>[ ('id1', TIMESTAMP '2022-06-12 12:00:00+00'), ('id2', TIMESTAMP '2022-06-11 10:00:00+00'), ('id1', TIMESTAMP '2022-06-10 13:00:00+00')]) ) SELECT * FROM ML.ENTITY_FEATURES_AT_TIME( TABLE feature_table, TABLE entity_time_table, num_rows => 1, ignore_feature_nulls => TRUE);