Module model_selection (1.25.0)

Functions for test/train split and model tuning. This module is styled after scikit-learn's model_selection module: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection.

Classes

KFold

KFold(n_splits: int = 5, *, random_state: typing.Optional[int] = None)

K-Fold cross-validator.

Splits data into train/test sets by dividing the dataset into k consecutive folds.

Each fold is then used once as the validation set while the remaining k - 1 folds form the training set.

Parameters
Name	Description
`n_splits`	`int` Number of folds. Must be at least 2. Defaults to 5.
`random_state`	`Optional[int]` A seed to use for randomly choosing the rows of the split. If not set, a random split will be generated each time. Defaults to None.
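
The fold logic described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name `kfold_indices` is hypothetical, and the fold-sizing rule (earlier folds absorb any remainder rows) is an assumption borrowed from the scikit-learn convention.

```python
import random
from typing import Iterator, List, Optional, Tuple


def kfold_indices(
    n_rows: int, n_splits: int = 5, random_state: Optional[int] = None
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs, one pair per fold.

    Hypothetical helper for illustration only.
    """
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    if n_splits > n_rows:
        raise ValueError("n_splits cannot exceed the number of rows")

    indices = list(range(n_rows))
    # A fixed seed gives a reproducible row order; None gives a new order each call.
    random.Random(random_state).shuffle(indices)

    # Assumption: the first (n_rows % n_splits) folds get one extra row.
    base, extra = divmod(n_rows, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        # Every row outside the current fold goes to the training set.
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(kfold_indices(10, n_splits=5, random_state=42))
for train, test in folds:
    print(f"train rows: {len(train)}, test rows: {len(test)}")
```

Each row lands in exactly one test fold, so every row is used for validation once and for training k - 1 times.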

Module Functions

cross_validate

cross_validate(
    estimator,
    X: typing.Union[
        bigframes.dataframe.DataFrame,
        bigframes.series.Series,
        pandas.core.frame.DataFrame,
        pandas.core.series.Series,
    ],
    y: typing.Optional[
        typing.Union[
            bigframes.dataframe.DataFrame,
            bigframes.series.Series,
            pandas.core.frame.DataFrame,
            pandas.core.series.Series,
        ]
    ] = None,
    *,
    cv: typing.Optional[typing.Union[int, bigframes.ml.model_selection.KFold]] = None
) -> dict[str, list]

Evaluate metric(s) by cross-validation and also record fit/score times.

Parameters
Name	Description
`X`	`bigframes.dataframe.DataFrame or bigframes.series.Series` The data to fit.
`y`	`bigframes.dataframe.DataFrame, bigframes.series.Series or None` The target variable to try to predict in the case of supervised learning. Defaults to None.
`cv`	`int, bigframes.ml.model_selection.KFold or None` Determines the cross-validation splitting strategy. Possible inputs for `cv` are: None, to use the default 5-fold cross-validation; an int, to specify the number of folds in a `KFold`; or a `bigframes.ml.model_selection.KFold` instance.

Returns
Type	Description
`Dict[str, List]`	A dict of lists holding the scores and timings for each cv split. The keys of this `dict` are: `test_score`, the test score for each cv split; `fit_time`, the time taken to fit the estimator on the train set for each cv split; `score_time`, the time taken to score the estimator on the test set for each cv split.
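
To make the shape of the returned dict concrete, here is a minimal pure-Python sketch of a cross-validation loop that records a score and two timings per fold. It is not the library's implementation: `MeanRegressor` and `cross_validate_sketch` are hypothetical names, the folds are taken consecutively without shuffling, and scores are plain floats, whereas the real function may return richer objects per split.

```python
import time
from typing import Dict, List, Optional, Sequence


class MeanRegressor:
    """Hypothetical toy estimator: predicts the mean of the training targets."""

    def fit(self, X: Sequence[float], y: Sequence[float]) -> "MeanRegressor":
        self.mean_ = sum(y) / len(y)
        return self

    def score(self, X: Sequence[float], y: Sequence[float]) -> float:
        # Negative mean squared error, so that a higher score is better.
        return -sum((value - self.mean_) ** 2 for value in y) / len(y)


def cross_validate_sketch(
    estimator, X: Sequence[float], y: Sequence[float], cv: Optional[int] = None
) -> Dict[str, List[float]]:
    """Illustrative loop only; assumes cv is None or an int."""
    n_splits = 5 if cv is None else cv  # None falls back to 5 folds
    X, y = list(X), list(y)
    results: Dict[str, List[float]] = {"test_score": [], "fit_time": [], "score_time": []}

    base, extra = divmod(len(X), n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        X_train, y_train = X[:start] + X[stop:], y[:start] + y[stop:]
        X_test, y_test = X[start:stop], y[start:stop]

        t0 = time.perf_counter()
        estimator.fit(X_train, y_train)
        t1 = time.perf_counter()
        score = estimator.score(X_test, y_test)
        t2 = time.perf_counter()

        results["fit_time"].append(t1 - t0)
        results["score_time"].append(t2 - t1)
        results["test_score"].append(score)
        start = stop
    return results


X = [float(i) for i in range(10)]
y = [2.0 * value for value in X]
results = cross_validate_sketch(MeanRegressor(), X, y)
print(sorted(results))
print(len(results["test_score"]))
```

Each list has one entry per fold, so a 5-fold run yields three lists of length 5 that can be aggregated (for example, averaged) by the caller.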

train_test_split

train_test_split(
    *arrays: typing.Union[
        bigframes.dataframe.DataFrame,
        bigframes.series.Series,
        pandas.core.frame.DataFrame,
        pandas.core.series.Series,
    ],
    test_size: typing.Optional[float] = None,
    train_size: typing.Optional[float] = None,
    random_state: typing.Optional[int] = None,
    stratify: typing.Optional[bigframes.series.Series] = None
) -> typing.List[typing.Union[bigframes.dataframe.DataFrame, bigframes.series.Series]]

Splits dataframes or series into random train and test subsets.

Parameters
Name	Description
`*arrays`	`bigframes.dataframe.DataFrame or bigframes.series.Series` A sequence of BigQuery DataFrames or Series that can be joined on their indexes.
`test_size`	`default None` The proportion of the dataset to include in the test split. If None, this defaults to the complement of train_size. If both are None, it is set to 0.25.
`train_size`	`default None` The proportion of the dataset to include in the train split. If None, this defaults to the complement of test_size.
`random_state`	`default None` A seed to use for randomly choosing the rows of the split. If not set, a random split will be generated each time.

Returns
Type	Description
`List[Union[bigframes.dataframe.DataFrame, bigframes.series.Series]]`	A list of BigQuery DataFrames or Series.
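
The defaulting rules for `test_size` and `train_size` described in the parameter table can be sketched as a small helper. This is an illustration under stated assumptions: `resolve_split_sizes` is a hypothetical name, not part of the module, and the validation checks (each proportion strictly between 0 and 1, sum no greater than 1) are assumptions rather than documented behavior.

```python
from typing import Optional, Tuple


def resolve_split_sizes(
    test_size: Optional[float] = None, train_size: Optional[float] = None
) -> Tuple[float, float]:
    """Return (train_size, test_size) after applying the documented defaults.

    Hypothetical helper for illustration only.
    """
    if test_size is None and train_size is None:
        test_size = 0.25  # documented default when neither size is given
    if test_size is None:
        test_size = 1.0 - train_size  # complement of train_size
    if train_size is None:
        train_size = 1.0 - test_size  # complement of test_size

    # Assumed validation: both values are proportions of the dataset.
    for name, value in (("train_size", train_size), ("test_size", test_size)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    if train_size + test_size > 1.0 + 1e-9:
        raise ValueError("train_size and test_size must not sum to more than 1")
    return train_size, test_size


print(resolve_split_sizes())
print(resolve_split_sizes(test_size=0.2))
print(resolve_split_sizes(train_size=0.6))
```

When both sizes are passed explicitly and sum to less than 1, this sketch leaves the remaining rows out of both subsets.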