Class TabularDataset (1.44.0)

TabularDataset(
    dataset_name: str,
    project: typing.Optional[str] = None,
    location: typing.Optional[str] = None,
    credentials: typing.Optional[google.auth.credentials.Credentials] = None,
)

A managed tabular dataset resource for Vertex AI.

Use this class to work with tabular datasets. You can use a CSV file, BigQuery, or a pandas DataFrame to create a tabular dataset. For more information about paging through BigQuery data, see Read data with BigQuery API using pagination. For more information about tabular data, see Tabular data.

The following code shows you how to create and import a tabular dataset with a CSV file.

my_dataset = aiplatform.TabularDataset.create(
    display_name="my-dataset", gcs_source=['gs://path/to/my/dataset.csv'])

The following code shows you how to create and import a tabular dataset in two distinct steps.

my_dataset = aiplatform.TextDataset.create(
    display_name="my-dataset")

my_dataset.import_data(
    gcs_source=['gs://path/to/my/dataset.csv'],
    import_schema_uri=aiplatform.schema.dataset.ioformat.text.multi_label_classification
)

If you create a tabular dataset with a pandas DataFrame, you need to use a BigQuery table to stage the data for Vertex AI:

my_dataset = aiplatform.TabularDataset.create_from_dataframe(
    df_source=my_pandas_dataframe,
    staging_path=f"bq://{bq_dataset_id}.table-unique"
)

Properties

column_names

Retrieves the column names of the dataset by extracting them from the Google Cloud Storage or Google BigQuery source.

Exceptions
Type Description
RuntimeError When no valid source is found.
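
To illustrate what extracting column names from a CSV source amounts to, the following sketch reads only the header row of some CSV text with Python's standard csv module. It is not the library's implementation: the function name header_columns and the sample data are hypothetical, and the RuntimeError is raised here only to echo the documented exception.

```python
import csv
import io
from typing import List


def header_columns(csv_text: str) -> List[str]:
    """Return the column names found in the first row of CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    try:
        # The first row of a CSV file is treated as the header.
        return next(reader)
    except StopIteration:
        # Empty input: there is no header row to read.
        raise RuntimeError("No header row found in the CSV source.") from None


# Hypothetical sample data standing in for a file in Cloud Storage.
sample = "age,income,label\n34,52000,1\n"
print(header_columns(sample))
```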

create_time

Time this resource was created.

display_name

Display name of this resource.

encryption_spec

Customer-managed encryption key options for this Vertex AI resource.

If this is set, then all resources created by this Vertex AI resource will be encrypted with the provided encryption key.

gca_resource

The underlying resource proto representation.

labels

User-defined labels containing metadata about this resource.

Read more about labels at https://goo.gl/xmQnxf

metadata_schema_uri

The metadata schema uri of this dataset resource.

name

Name of this resource.

resource_name

Fully qualified resource name.

update_time

Time this resource was last updated.

Methods

TabularDataset

TabularDataset(
    dataset_name: str,
    project: typing.Optional[str] = None,
    location: typing.Optional[str] = None,
    credentials: typing.Optional[google.auth.credentials.Credentials] = None,
)

Retrieves an existing managed dataset given a dataset name or ID.

Parameters
Name Description
dataset_name str

Required. A fully-qualified dataset resource name or dataset ID. Example: "projects/123/locations/us-central1/datasets/456" or "456" when project and location are initialized or passed.

project str

Optional. The project to retrieve the dataset from. If not set, the project set in aiplatform.init is used.

location str

Optional. The location to retrieve the dataset from. If not set, the location set in aiplatform.init is used.

credentials auth_credentials.Credentials

Custom credentials to use to retrieve this Dataset. Overrides credentials set in aiplatform.init.
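
The two accepted forms of dataset_name can be related with a small helper. The sketch below expands a bare dataset ID into the fully qualified form shown in the example above. The function to_dataset_resource_name is hypothetical and only mirrors the documented naming pattern; it is not part of the SDK.

```python
import re
from typing import Optional

# Pattern of a fully qualified dataset resource name, as in the example above.
_FULL_NAME = re.compile(r"^projects/[^/]+/locations/[^/]+/datasets/[^/]+$")


def to_dataset_resource_name(
    dataset_name: str,
    project: Optional[str] = None,
    location: Optional[str] = None,
) -> str:
    """Return a fully qualified dataset resource name (hypothetical helper)."""
    if _FULL_NAME.match(dataset_name):
        # Already fully qualified: nothing to do.
        return dataset_name
    if not project or not location:
        raise ValueError("A bare dataset ID needs both a project and a location.")
    return f"projects/{project}/locations/{location}/datasets/{dataset_name}"


print(to_dataset_resource_name("456", project="123", location="us-central1"))
```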

create

create(
    display_name: typing.Optional[str] = None,
    gcs_source: typing.Optional[typing.Union[str, typing.Sequence[str]]] = None,
    bq_source: typing.Optional[str] = None,
    project: typing.Optional[str] = None,
    location: typing.Optional[str] = None,
    credentials: typing.Optional[google.auth.credentials.Credentials] = None,
    request_metadata: typing.Optional[typing.Sequence[typing.Tuple[str, str]]] = (),
    labels: typing.Optional[typing.Dict[str, str]] = None,
    encryption_spec_key_name: typing.Optional[str] = None,
    sync: bool = True,
    create_request_timeout: typing.Optional[float] = None,
) -> google.cloud.aiplatform.datasets.tabular_dataset.TabularDataset

Creates a tabular dataset.

Parameters
Name Description
display_name str

Optional. The user-defined name of the dataset. The name must contain 128 or fewer UTF-8 characters.

gcs_source Union[str, Sequence[str]]

Optional. The URI of one or more files in Google Cloud Storage that contain your dataset. For example, str: "gs://bucket/file.csv" or Sequence[str]: ["gs://bucket/file1.csv", "gs://bucket/file2.csv"].

bq_source str

Optional. The URI to a BigQuery table that's used as an input source. For example, bq://project.dataset.table_name.

project str

Optional. The name of the Google Cloud project to which this TabularDataset is uploaded. This overrides the project that was set by aiplatform.init.

location str

Optional. The Google Cloud region where this dataset is uploaded. This region overrides the region that was set by aiplatform.init.

credentials auth_credentials.Credentials

Optional. The credentials that are used to upload the TabularDataset. These credentials override the credentials set by aiplatform.init.

request_metadata Sequence[Tuple[str, str]]

Optional. Strings that contain metadata that's sent with the request.

labels Dict[str, str]

Optional. Labels with user-defined metadata to organize your datasets. The maximum length of a key and of a value is 64 Unicode characters. Keys and values can contain only lowercase letters, numeric characters, underscores, and dashes. International characters are allowed. No more than 64 user labels can be associated with one dataset (system labels are excluded). For more information and examples of using labels, see Using labels to organize Google Cloud Platform resources. System reserved label keys are prefixed with aiplatform.googleapis.com/ and are immutable.

encryption_spec_key_name Optional[str]

Optional. The Cloud KMS resource identifier of the customer-managed encryption key that's used to protect the dataset. The format of the key is projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key. The key needs to be in the same region as the one where the compute resource is created. If encryption_spec_key_name is set, this TabularDataset and all of its sub-resources are secured by this key. This encryption_spec_key_name overrides the encryption_spec_key_name set by aiplatform.init.

sync bool

If true, the create method creates a tabular dataset synchronously. If false, the create method creates a tabular dataset asynchronously.

create_request_timeout float

Optional. The number of seconds for the timeout of the create request.

Returns
Type Description
tabular_dataset (TabularDataset) An instantiated representation of the managed TabularDataset resource.
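
The constraints listed for the labels parameter can be checked locally before calling create. The following is a minimal sketch under stated assumptions: the helper label_problems is hypothetical, and "lowercase letters" together with "international characters" is interpreted here as any letter that is not uppercase. The service remains the authority on what it accepts.

```python
from typing import Dict, List

MAX_USER_LABELS = 64
MAX_LENGTH = 64
SYSTEM_PREFIX = "aiplatform.googleapis.com/"


def _allowed_char(ch: str) -> bool:
    # Underscores, dashes, numeric characters, and letters that are not uppercase.
    if ch in "_-" or ch.isnumeric():
        return True
    return ch.isalpha() and not ch.isupper()


def label_problems(labels: Dict[str, str]) -> List[str]:
    """Return human-readable problems found in user labels (hypothetical helper)."""
    problems = []
    # System labels are excluded from the user-label checks.
    user_labels = {k: v for k, v in labels.items() if not k.startswith(SYSTEM_PREFIX)}
    if len(user_labels) > MAX_USER_LABELS:
        problems.append(f"more than {MAX_USER_LABELS} user labels")
    for key, value in user_labels.items():
        for kind, text in (("key", key), ("value", value)):
            if len(text) > MAX_LENGTH:
                problems.append(f"{kind} {text!r} is longer than {MAX_LENGTH} characters")
            if not all(_allowed_char(ch) for ch in text):
                problems.append(f"{kind} {text!r} contains a disallowed character")
    return problems


print(label_problems({"team": "ml-platform", "Env": "prod"}))
```

An empty list means that no problem was found by these local checks.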

create_from_dataframe

create_from_dataframe(
    df_source: pd.DataFrame,
    staging_path: str,
    bq_schema: typing.Optional[
        typing.Union[str, google.cloud.bigquery.schema.SchemaField]
    ] = None,
    display_name: typing.Optional[str] = None,
    project: typing.Optional[str] = None,
    location: typing.Optional[str] = None,
    credentials: typing.Optional[google.auth.credentials.Credentials] = None,
) -> TabularDataset

Creates a new tabular dataset from a pandas DataFrame.

Parameters
Name Description
staging_path str

Required. The BigQuery table used to stage the data for Vertex AI. Because Vertex AI maintains a reference to this source to create the TabularDataset, you shouldn't delete this BigQuery table. For example: bq://my-project.my-dataset.my-table. If the specified BigQuery table doesn't exist, then the table is created for you. If the provided BigQuery table already exists, and the schemas of the BigQuery table and your DataFrame match, then the data in your local DataFrame is appended to the table. The location of the BigQuery table must conform to the BigQuery location requirements.

bq_schema Optional[Union[str, bigquery.SchemaField]]

Optional. If not set, BigQuery autodetects the schema using the column types of your DataFrame. If set, BigQuery uses the schema you provide when the staging table is created. For more information, see the BigQuery LoadJobConfig.schema property.

display_name str

Optional. The user-defined name of the Dataset. The name must contain 128 or fewer UTF-8 characters.

project str

Optional. The project to upload this dataset to. This overrides the project set using aiplatform.init.

location str

Optional. The location to upload this dataset to. This overrides the location set using aiplatform.init.

credentials auth_credentials.Credentials

Optional. The custom credentials used to upload this dataset. This overrides credentials set using aiplatform.init.

df_source pd.DataFrame

Required. A pandas DataFrame containing the source data for ingestion as a TabularDataset. This method uses the data types from the provided DataFrame when the TabularDataset is created.

Returns
Type Description
tabular_dataset (TabularDataset) An instantiated representation of the managed TabularDataset resource.
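
The staging_path value follows the bq://project.dataset.table pattern shown above. The sketch below builds and splits such a path. Both helper names are hypothetical, and the sketch assumes that none of the three parts contains a dot, which might not hold for every project ID.

```python
from typing import Tuple

_PREFIX = "bq://"


def bq_staging_path(project: str, dataset: str, table: str) -> str:
    """Build a bq:// URI from its three parts (hypothetical helper)."""
    for name, part in (("project", project), ("dataset", dataset), ("table", table)):
        # Assumption of this sketch: parts are non-empty and contain no dots.
        if not part or "." in part:
            raise ValueError(f"Invalid {name}: {part!r}")
    return f"{_PREFIX}{project}.{dataset}.{table}"


def split_bq_path(path: str) -> Tuple[str, str, str]:
    """Split a bq:// URI into (project, dataset, table)."""
    if not path.startswith(_PREFIX):
        raise ValueError("A BigQuery path must start with bq://")
    parts = path[len(_PREFIX):].split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("Expected the form bq://project.dataset.table")
    return parts[0], parts[1], parts[2]


path = bq_staging_path("my-project", "my-dataset", "my-table")
print(path)
print(split_bq_path(path))
```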

delete

delete(sync: bool = True) -> None

Deletes this Vertex AI resource. WARNING: This deletion is permanent.

Parameter
Name Description
sync bool

Whether to execute this deletion synchronously. If False, this method is executed in a concurrent Future, and any downstream object is returned immediately and synced when the Future has completed.

export_data

export_data(output_dir: str) -> typing.Sequence[str]

Exports data to the output directory in Google Cloud Storage.

Parameter
Name Description
output_dir str

Required. The Google Cloud Storage location where the output is written. In the given directory, a new directory is created with the name export-data-<dataset-display-name>-<timestamp-of-export-call>, where the timestamp is in YYYYMMDDHHMMSS format. All export output is written into that directory. Inside that directory, annotations with the same schema are grouped into subdirectories that are named with the corresponding annotations' schema title. Inside these subdirectories, a schema.yaml file is created to describe the output format. If the URI doesn't end with '/', a '/' is automatically appended. The directory is created if it doesn't exist.

Returns
Type Description
exported_files (Sequence[str]) All of the files that are exported in this export operation.
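
Two details of output_dir are easy to show in code: the trailing '/' that is appended when missing, and the YYYYMMDDHHMMSS timestamp format. The following sketch is purely illustrative. The helper names are hypothetical, and the service itself performs both steps.

```python
from datetime import datetime


def normalize_output_dir(uri: str) -> str:
    """Append a trailing '/' to a Cloud Storage URI if it is missing."""
    if not uri.startswith("gs://"):
        raise ValueError("output_dir must be a Google Cloud Storage URI (gs://...)")
    return uri if uri.endswith("/") else uri + "/"


def format_export_timestamp(moment: datetime) -> str:
    """Format a datetime as YYYYMMDDHHMMSS."""
    return moment.strftime("%Y%m%d%H%M%S")


print(normalize_output_dir("gs://my-bucket/exports"))
print(format_export_timestamp(datetime(2024, 1, 2, 3, 4, 5)))
```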

export_data_for_custom_training

export_data_for_custom_training(
    output_dir: str,
    annotation_filter: typing.Optional[str] = None,
    saved_query_id: typing.Optional[str] = None,
    annotation_schema_uri: typing.Optional[str] = None,
    split: typing.Optional[
        typing.Union[typing.Dict[str, str], typing.Dict[str, float]]
    ] = None,
) -> typing.Dict[str, typing.Any]

Exports data to the output directory in Google Cloud Storage for the custom training use case.

Example annotation_schema_uri (image classification): gs://google-cloud-aiplatform/schema/dataset/annotation/image_classification_1.0.0.yaml

Example split (filter split):

{
    "training_filter": "labels.aiplatform.googleapis.com/ml_use=training",
    "validation_filter": "labels.aiplatform.googleapis.com/ml_use=validation",
    "test_filter": "labels.aiplatform.googleapis.com/ml_use=test",
}

Example split (fraction split):

{
    "training_fraction": 0.7,
    "validation_fraction": 0.2,
    "test_fraction": 0.1,
}
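
The two split shapes above can also be built programmatically. The sketch below uses two hypothetical helpers, filter_split and fraction_split. The check that the fractions add up to 1 is an assumption made for this sketch so that every item lands in exactly one set; it is not a rule stated on this page.

```python
import math
from typing import Dict

ML_USE_LABEL = "labels.aiplatform.googleapis.com/ml_use"
ROLES = ("training", "validation", "test")


def filter_split() -> Dict[str, str]:
    """Build a filter split keyed on the ml_use label (hypothetical helper)."""
    return {f"{role}_filter": f"{ML_USE_LABEL}={role}" for role in ROLES}


def fraction_split(training: float, validation: float, test: float) -> Dict[str, float]:
    """Build a fraction split (hypothetical helper)."""
    fractions = (training, validation, test)
    if any(f < 0 or f > 1 for f in fractions):
        raise ValueError("Each fraction must be between 0 and 1.")
    # Assumption of this sketch: the three fractions cover the whole dataset.
    if not math.isclose(sum(fractions), 1.0):
        raise ValueError("Fractions are expected to add up to 1.")
    return {f"{role}_fraction": f for role, f in zip(ROLES, fractions)}


print(filter_split())
print(fraction_split(0.7, 0.2, 0.1))
```

Either dictionary has the shape expected by the split parameter of export_data_for_custom_training.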

Parameters
Name Description
output_dir str

Required. The Google Cloud Storage location where the output is written. In the given directory, a new directory is created with the name export-data-<dataset-display-name>-<timestamp-of-export-call>, where the timestamp is in YYYYMMDDHHMMSS format. All export output is written into that directory. Inside that directory, annotations with the same schema are grouped into subdirectories that are named with the corresponding annotations' schema title. Inside these subdirectories, a schema.yaml file is created to describe the output format. If the URI doesn't end with '/', a '/' is automatically appended. The directory is created if it doesn't exist.

annotation_filter str

Optional. An expression for filtering what part of the Dataset is to be exported. Only Annotations that match this filter are exported. The filter syntax is the same as in DatasetService.ListAnnotations.

saved_query_id str

Optional. The ID of a SavedQuery (annotation set) under this Dataset used for filtering Annotations for training. Only used for custom training data export use cases. Only applicable to Datasets that have SavedQueries. Only Annotations that are associated with this SavedQuery are used for training. When used in conjunction with annotation_filter, the Annotations used for training are filtered by both saved_query_id and annotation_filter. Specify only one of saved_query_id and annotation_schema_uri, because both represent the same thing: the problem type.

annotation_schema_uri str

Optional. The Cloud Storage URI that points to a YAML file describing the annotation schema. The schema is defined as an OpenAPI 3.0.2 Schema Object. The schema files that can be used here are found in gs://google-cloud-aiplatform/schema/dataset/annotation/. Note that the chosen schema must be consistent with the metadata_schema_uri of this Dataset. Only used for custom training data export use cases. Only applicable if this Dataset has DataItems and Annotations. Only Annotations that both match this schema and belong to DataItems not ignored by the split method are used in the training, validation, or test role, respectively, depending on the role of the DataItem they are on. When used in conjunction with annotation_filter, the Annotations used for training are filtered by both annotation_filter and annotation_schema_uri.

split Union[Dict[str, str], Dict[str, float]]

Instructions for how the exported data should be split between the training, validation, and test sets.

Returns
Type Description
export_data_response (Dict) Response message for DatasetService.ExportData in Dictionary format.

import_data

import_data()

Uploads data to an existing managed dataset.

Parameters
Name Description
gcs_source Union[str, Sequence[str]]

Required. The Google Cloud Storage URI or URIs of the input file or files. May contain wildcards. For more information on wildcards, see https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames. Examples: str: "gs://bucket/file.csv" or Sequence[str]: ["gs://bucket/file1.csv", "gs://bucket/file2.csv"].

import_schema_uri str

Required. Points to a YAML file stored on Google Cloud Storage describing the import format. Validation is done against the schema. The schema is defined as an OpenAPI 3.0.2 Schema Object (https://tinyurl.com/y538mdwt).

data_item_labels Dict

Labels that are applied to newly imported DataItems. If a DataItem identical to one being imported already exists in the Dataset, these labels are appended to the labels of the existing DataItem, and if a label with the same key was imported before, the old label value is overwritten. If two DataItems in the same import operation are identical, their labels are combined, and if a key collision occurs, one of the values is picked at random. Two DataItems are considered identical if their content bytes are identical (for example, image bytes or PDF bytes). These labels are overridden by Annotation labels specified in the index file referenced by import_schema_uri (for example, a JSONL file). This argument doesn't specify the annotation name or the training target of your data; it specifies global labels for the dataset. For example, data_item_labels={"aiplatform.googleapis.com/ml_use":"training"} specifies that all of the uploaded data is used for training.

sync bool

Whether to execute this method synchronously. If False, this method is executed in a concurrent Future, and any downstream object is returned immediately and synced when the Future has completed.

import_request_timeout float

Optional. The timeout for the import request in seconds.

Returns
Type Description
dataset (Dataset) Instantiated representation of the managed dataset resource.

list

list(
    filter: typing.Optional[str] = None,
    order_by: typing.Optional[str] = None,
    project: typing.Optional[str] = None,
    location: typing.Optional[str] = None,
    credentials: typing.Optional[google.auth.credentials.Credentials] = None,
) -> typing.List[google.cloud.aiplatform.base.VertexAiResourceNoun]

List all instances of this Dataset resource.

Example Usage:

aiplatform.TabularDataset.list(
    filter='labels.my_key="my_value"',
    order_by='display_name'
)

Parameters
Name Description
filter str

Optional. An expression for filtering the results of the request. For field names both snake_case and camelCase are supported.

order_by str

Optional. A comma-separated list of fields to order by, sorted in ascending order. Use "desc" after a field name for descending. Supported fields: display_name, create_time, update_time

project str

Optional. Project to retrieve list from. If not set, project set in aiplatform.init will be used.

location str

Optional. Location to retrieve list from. If not set, location set in aiplatform.init will be used.

credentials auth_credentials.Credentials

Optional. Custom credentials to use to retrieve list. Overrides credentials set in aiplatform.init.
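
The order_by syntax described above (comma-separated fields, with "desc" after a field name for descending order) can be assembled from structured input. The helper build_order_by below is hypothetical and only formats the string; it does not call the API.

```python
from typing import Iterable, Tuple

# Fields listed as supported in the order_by description.
SUPPORTED_FIELDS = ("display_name", "create_time", "update_time")


def build_order_by(fields: Iterable[Tuple[str, bool]]) -> str:
    """Build an order_by string from (field, descending) pairs."""
    parts = []
    for name, descending in fields:
        if name not in SUPPORTED_FIELDS:
            raise ValueError(f"Unsupported order_by field: {name!r}")
        # Ascending is the default, so only descending needs a suffix.
        parts.append(f"{name} desc" if descending else name)
    return ",".join(parts)


# Newest datasets first, ties broken by display name.
print(build_order_by([("create_time", True), ("display_name", False)]))
```

The resulting string could then be passed as the order_by argument of aiplatform.TabularDataset.list.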

to_dict

to_dict() -> typing.Dict[str, typing.Any]

Returns the resource proto as a dictionary.

update

update(
    *,
    display_name: typing.Optional[str] = None,
    labels: typing.Optional[typing.Dict[str, str]] = None,
    description: typing.Optional[str] = None,
    update_request_timeout: typing.Optional[float] = None
) -> google.cloud.aiplatform.datasets.dataset._Dataset

Update the dataset. Updatable fields:

  • display_name
  • description
  • labels

Parameters
Name Description
display_name str

Optional. The user-defined name of the Dataset. The name can be up to 128 characters long and can consist of any UTF-8 characters.

labels Dict[str, str]

Optional. Labels with user-defined metadata to organize your datasets. Label keys and values can be no longer than 64 characters (Unicode codepoints) and can contain only lowercase letters, numeric characters, underscores, and dashes. International characters are allowed. No more than 64 user labels can be associated with one dataset (system labels are excluded). See https://goo.gl/xmQnxf for more information and examples of labels. System reserved label keys are prefixed with "aiplatform.googleapis.com/" and are immutable.

description str

Optional. The description of the Dataset.

update_request_timeout float

Optional. The timeout for the update request in seconds.

Returns
Type Description
dataset (Dataset) Updated dataset.

wait

wait()

Helper method that blocks until all futures are complete.
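
The idea behind sync=False and wait() is a general one: start work in the background, keep going, and block only when the results are needed. The sketch below shows that pattern with Python's standard concurrent.futures module. It is a generic illustration, not the SDK's internals, and slow_operation is a hypothetical stand-in for a long-running request.

```python
import concurrent.futures
import time


def slow_operation(name: str) -> str:
    """Stand-in for a long-running request (hypothetical)."""
    time.sleep(0.1)
    return f"{name} done"


with concurrent.futures.ThreadPoolExecutor() as pool:
    # Each submit call returns immediately with a Future.
    futures = [pool.submit(slow_operation, name) for name in ("create", "update")]
    # Other work could happen here while the operations run.
    # Block until every Future has completed.
    concurrent.futures.wait(futures)
    results = [future.result() for future in futures]

print(results)
```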