Class Session (0.11.0)

Session(
    context: typing.Optional[bigframes._config.bigquery_options.BigQueryOptions] = None,
    clients_provider: typing.Optional[bigframes.session.clients.ClientsProvider] = None,
)

Establishes a BigQuery connection to capture a group of job activities related to DataFrames.

Parameters
Name	Description
`context`	`bigframes._config.bigquery_options.BigQueryOptions` Configuration adjusting how to connect to BigQuery and related APIs. Note that some options are ignored if `clients_provider` is set.
`clients_provider`	`bigframes.session.clients.ClientsProvider` An object providing client library objects.
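
As a rough sketch of how the `context` parameter might be used, the function below builds a `BigQueryOptions` object and passes it to `Session`. The `project` and `location` keyword arguments, the import paths, and the helper name `make_session` are assumptions for illustration and may differ between releases. The imports are deferred so the definition loads even where bigframes is not installed.

```python
def make_session(project_id: str, location: str = "US"):
    """Create a Session bound to one project and location (illustrative only)."""
    # Deferred imports: nothing below runs until the function is called.
    from bigframes._config.bigquery_options import BigQueryOptions
    from bigframes.session import Session

    # Assumption: BigQueryOptions accepts `project` and `location` keywords.
    options = BigQueryOptions(project=project_id, location=location)

    # `context` is the documented constructor parameter of Session.
    return Session(context=options)
```

Calling `make_session("my-project")` would need valid Google Cloud credentials; `"my-project"` is a placeholder.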

Properties

bqclient

API documentation for bqclient property.

bqconnectionclient

API documentation for bqconnectionclient property.

bqstoragereadclient

API documentation for bqstoragereadclient property.

cloudfunctionsclient

API documentation for cloudfunctionsclient property.

resourcemanagerclient

API documentation for resourcemanagerclient property.

Methods

close

close()

Terminates the BigQuery session. If it is not closed explicitly, the session is terminated automatically after 24 hours of inactivity or after 7 days.
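
One way to make sure `close()` is always called is to wrap the session in `contextlib.closing`, which works with any object exposing a `close()` method. The sketch below uses a hypothetical stand-in class (`FakeSession`) rather than a real `Session`, so it runs without BigQuery access; `run_with_session` is likewise an illustrative helper, not part of the library.

```python
from contextlib import closing


class FakeSession:
    """Hypothetical stand-in that only mimics Session.close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def run_with_session(session, work):
    """Run work(session) and call session.close() even if work raises."""
    with closing(session):
        return work(session)


fake = FakeSession()
result = run_with_session(fake, lambda s: "done")
```

With a real `Session`, the same helper would end the BigQuery session as soon as the work finishes instead of waiting for the automatic timeout.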

read_csv

read_csv(
    filepath_or_buffer: str | IO["bytes"],
    *,
    sep: Optional[str] = ",",
    header: Optional[int] = 0,
    names: Optional[
        Union[MutableSequence[Any], np.ndarray[Any, Any], Tuple[Any, ...], range]
    ] = None,
    index_col: Optional[
        Union[int, str, Sequence[Union[str, int]], Literal[False]]
    ] = None,
    usecols: Optional[
        Union[
            MutableSequence[str],
            Tuple[str, ...],
            Sequence[int],
            pandas.Series,
            pandas.Index,
            np.ndarray[Any, Any],
            Callable[[Any], bool],
        ]
    ] = None,
    dtype: Optional[Dict] = None,
    engine: Optional[
        Literal["c", "python", "pyarrow", "python-fwf", "bigquery"]
    ] = None,
    encoding: Optional[str] = None,
    **kwargs
) -> dataframe.DataFrame

Loads a DataFrame from a comma-separated values (CSV) file stored locally or in Cloud Storage.

The CSV file data will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.

Note: using engine="bigquery" will not guarantee the same ordering as the file. Instead, set a serialized index column as the index and sort by that in the resulting DataFrame.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
>>> df = bpd.read_csv(filepath_or_buffer=gcs_path)
>>> df.head(2)
      name post_abbr
0  Alabama        AL
1   Alaska        AK
<BLANKLINE>
[2 rows x 2 columns]
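
The ordering note above suggests carrying a serialized index column in the file itself. A minimal sketch of that preparation step, using only the standard library, is shown below; the helper name `add_row_index` and the column name `rowindex` are assumptions chosen for illustration.

```python
import csv
import io


def add_row_index(csv_text: str, index_name: str = "rowindex") -> str:
    """Prepend a 0-based row counter column to CSV text that has a header row."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line_number, row in enumerate(reader):
        if line_number == 0:
            # Header row: add the name of the new index column.
            writer.writerow([index_name] + row)
        else:
            # Data rows: number them in file order, starting at 0.
            writer.writerow([str(line_number - 1)] + row)
    return out.getvalue()


indexed = add_row_index("name,post_abbr\nAlabama,AL\nAlaska,AK\n")
```

After writing the result to a file, something like `bpd.read_csv(path, engine="bigquery", index_col="rowindex")` followed by `sort_index()` should recover the file order, assuming the file is readable from the given path.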

Parameters
Name	Description
`filepath_or_buffer`	`str` A local or Google Cloud Storage (`gs://`) path when `engine="bigquery"`; otherwise the value is passed to `pandas.read_csv`.
`sep`	`Optional[str], default ","` The separator for fields in a CSV file. For the BigQuery engine, the separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF-8. Both engines support `sep="\t"` to specify the tab character as separator. The default engine supports any number of spaces as separator by specifying `sep="\s+"`. Separators longer than 1 character are interpreted as regular expressions by the default engine. The BigQuery engine only supports single-character separators.
`header`	`Optional[int], default 0` row number to use as the column names. - `None`: Instructs autodetect that there are no headers and data should be read starting from the first row. - `0`: If using `engine="bigquery"`, Autodetect tries to detect headers in the first row. If they are not detected, the row is read as data. Otherwise data is read starting from the second row. When using default engine, pandas assumes the first row contains column names unless the `names` argument is specified. If `names` is provided, then the first row is ignored, second row is read as data, and column names are inferred from `names`. - `N > 0`: If using `engine="bigquery"`, Autodetect skips N rows and tries to detect headers in row N+1. If headers are not detected, row N+1 is just skipped. Otherwise row N+1 is used to extract column names for the detected schema. When using default engine, pandas will skip N rows and assumes row N+1 contains column names unless the `names` argument is specified. If `names` is provided, row N+1 will be ignored, row N+2 will be read as data, and column names are inferred from `names`.
`names`	`default None` a list of column names to use. If the file contains a header row and you want to pass this parameter, then `header=0` should be passed as well so the first (header) row is ignored. Only to be used with default engine.
`index_col`	`default None` column(s) to use as the row labels of the DataFrame, either given as string name or column index. `index_col=False` can be used with the default engine only to enforce that the first column is not used as the index. Using column index instead of column name is only supported with the default engine. The BigQuery engine only supports having a single column name as the `index_col`. Neither engine supports having a multi-column index.
`usecols`	`default None` List of column names to use. The BigQuery engine only supports a list of string column names. Column indices and callable functions are only supported with the default engine. Using the default engine, the column names in `usecols` can be defined to correspond to column names provided with the `names` parameter (ignoring the document's header row of column names). The order of the column indices/names in `usecols` is ignored with the default engine. The order of the column names provided with the BigQuery engine will be consistent in the resulting dataframe. If using a callable function with the default engine, only column names that evaluate to True by the callable function will be in the resulting dataframe.
`dtype`	`Optional[Dict], default None` Data type for data or columns. Only to be used with default engine.
`engine`	`Optional[Literal["c", "python", "pyarrow", "python-fwf", "bigquery"]], default None` Type of engine to use. If `engine="bigquery"` is specified, then BigQuery's load API will be used. Otherwise, the engine will be passed to `pandas.read_csv`.
`encoding`	`Optional[str], default None` The character encoding of the data. The default encoding is `UTF-8` for both engines. The default engine accepts a wide range of encodings; refer to the Python documentation for a comprehensive list (https://docs.python.org/3/library/codecs.html#standard-encodings). The BigQuery engine only supports `UTF-8` and `ISO-8859-1`.
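
The restrictions that the table above lists for `engine="bigquery"` can be summarized as a small pre-flight check. This is only a sketch of the documented rules, not the library's own validation; the function name `check_bigquery_csv_args` and its messages are hypothetical.

```python
from typing import Any, List, Optional


def check_bigquery_csv_args(
    sep: Optional[str] = ",",
    names: Any = None,
    index_col: Any = None,
    usecols: Any = None,
    dtype: Any = None,
    encoding: Optional[str] = None,
) -> List[str]:
    """Return a list of problems for arguments the BigQuery engine rejects."""
    problems = []
    if sep is not None and len(sep) != 1:
        problems.append("sep must be a single character")
    if names is not None:
        problems.append("names is only supported by the default engine")
    if dtype is not None:
        problems.append("dtype is only supported by the default engine")
    if index_col is not None and not isinstance(index_col, str):
        # Column positions, lists, and index_col=False need the default engine.
        problems.append("index_col must be a single column name")
    if usecols is not None:
        is_name_list = (
            not callable(usecols)
            and not isinstance(usecols, str)
            and all(isinstance(col, str) for col in usecols)
        )
        if not is_name_list:
            problems.append("usecols must be a list of column names")
    if encoding is not None and encoding.upper() not in {"UTF-8", "ISO-8859-1"}:
        problems.append("encoding must be UTF-8 or ISO-8859-1")
    return problems


ok = check_bigquery_csv_args(sep="\t", index_col="name", usecols=["name"])
bad = check_bigquery_csv_args(sep=r"\s+", index_col=0, usecols=[0, 1])
```

Here `ok` is empty, while `bad` collects one message per violated rule.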

Returns
Type	Description
`bigframes.dataframe.DataFrame`	A BigQuery DataFrame.

read_gbq

read_gbq(
    query_or_table: str,
    *,
    index_col: Iterable[str] | str = (),
    col_order: Iterable[str] = (),
    max_results: Optional[int] = None
) -> dataframe.DataFrame

Loads a DataFrame from BigQuery.

BigQuery tables are an unordered, unindexed data source. By default, the DataFrame will have an arbitrary index and ordering.

Set the index_col argument to one or more columns to choose an index. The resulting DataFrame is sorted by the index columns. For the best performance, ensure the index columns don't contain duplicate values.

Note: By default, even SQL query inputs with an ORDER BY clause create a DataFrame with an arbitrary ordering. Use

ROW_NUMBER() OVER (ORDER BY ...) AS rowindex

in your SQL query and set index_col='rowindex' to preserve the desired ordering.

If your query doesn't have an ordering, select

GENERATE_UUID() AS rowindex

in your SQL and set index_col='rowindex' for the best performance.
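
For the unordered case, one possible way to add the `GENERATE_UUID()` column without editing the original query is to wrap it in an outer `SELECT`. The helper below is a hypothetical convenience, not part of bigframes, and only builds the SQL string.

```python
def with_uuid_index(sql: str, index_name: str = "rowindex") -> str:
    """Wrap a query so that every row gets a unique GENERATE_UUID() column."""
    # Drop surrounding whitespace and a trailing semicolon before nesting.
    inner = sql.strip().rstrip(";")
    return f"SELECT GENERATE_UUID() AS {index_name}, * FROM (\n{inner}\n)"


wrapped = with_uuid_index(
    "SELECT name, post_abbr FROM `my-project.my_dataset.us_states`;"
)
```

The wrapped string could then be passed as `bpd.read_gbq(wrapped, index_col="rowindex")`. The table name in this sketch is a placeholder.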

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

If the input is a table ID:

>>> df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
>>> df.head(2)
                                     species island  culmen_length_mm  \
0        Adelie Penguin (Pygoscelis adeliae)  Dream              36.6
1        Adelie Penguin (Pygoscelis adeliae)  Dream              39.8
<BLANKLINE>
   culmen_depth_mm  flipper_length_mm  body_mass_g     sex
0             18.4              184.0       3475.0  FEMALE
1             19.1              184.0       4650.0    MALE
<BLANKLINE>
[2 rows x 7 columns]

Preserve ordering in a query input.

>>> df = bpd.read_gbq('''
...    SELECT
...       -- Instead of an ORDER BY clause on the query, use
...       -- ROW_NUMBER() to create an ordered DataFrame.
...       ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
...         AS rowindex,
...
...       pitcherFirstName,
...       pitcherLastName,
...       AVG(pitchSpeed) AS averagePitchSpeed
...     FROM `bigquery-public-data.baseball.games_wide`
...     WHERE year = 2016
...     GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
         pitcherFirstName pitcherLastName  averagePitchSpeed
rowindex
1                Albertin         Chapman          96.514113
2                 Zachary         Britton          94.591039
<BLANKLINE>
[2 rows x 3 columns]

Parameters
Name	Description
`query_or_table`	`str` A SQL string to be executed or a BigQuery table to be read. The table must be specified in the format of `project.dataset.tablename` or `dataset.tablename`.
`index_col`	`Iterable[str] or str` Name of result column(s) to use for index in results DataFrame.
`col_order`	`Iterable[str]` List of BigQuery column names in the desired order for results DataFrame.
`max_results`	`Optional[int], default None` If set, limit the maximum number of rows to fetch from the query results.

Returns
Type	Description
`bigframes.dataframe.DataFrame`	A DataFrame representing results of the query or table.

read_gbq_function

read_gbq_function(function_name: str)

Loads a BigQuery function from BigQuery.

Then it can be applied to a DataFrame or Series.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> function_name = "bqutil.fn.cw_lower_case_ascii_only"
>>> func = bpd.read_gbq_function(function_name=function_name)
>>> func.bigframes_remote_function
'bqutil.fn.cw_lower_case_ascii_only'

Parameter
Name	Description
`function_name`	`str` the function's name in BigQuery in the format `project_id.dataset_id.function_name`, or `dataset_id.function_name` to load from the default project, or `function_name` to load from the default project and the dataset associated with the current session.
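
The three accepted name formats can be illustrated with a small resolver that expands a short name into its fully qualified form. This is a sketch of the rule described above, not the library's implementation; `qualify_function_name` and its arguments are hypothetical.

```python
def qualify_function_name(
    function_name: str, default_project: str, session_dataset: str
) -> str:
    """Expand a 1-, 2-, or 3-part function name to project.dataset.function."""
    parts = function_name.split(".")
    if any(part == "" for part in parts) or len(parts) > 3:
        raise ValueError(f"Unexpected function name: {function_name!r}")
    if len(parts) == 3:
        # Already project_id.dataset_id.function_name.
        return function_name
    if len(parts) == 2:
        # dataset_id.function_name: fill in the default project.
        return f"{default_project}.{function_name}"
    # Bare function_name: use the default project and the session's dataset.
    return f"{default_project}.{session_dataset}.{function_name}"


full = qualify_function_name("cw_lower_case_ascii_only", "bqutil", "fn")
```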

Returns
Type	Description
`callable`	A function object pointing to the BigQuery function read from BigQuery. The object is similar to the one created by the `remote_function` decorator, including the `bigframes_remote_function` property, but not including the `bigframes_cloud_function` property.

read_gbq_model

read_gbq_model(model_name: str)

Loads a BigQuery ML model from BigQuery.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

Read an existing BigQuery ML model.

>>> model_name = "bigframes-dev.bqml_tutorial.penguins_model"
>>> model = bpd.read_gbq_model(model_name)

Parameter
Name	Description
`model_name`	`str` the model's name in BigQuery in the format `project_id.dataset_id.model_id`, or just `dataset_id.model_id` to load from the default project.

read_gbq_query

read_gbq_query(
    query: str,
    *,
    index_col: Iterable[str] | str = (),
    col_order: Iterable[str] = (),
    max_results: Optional[int] = None
) -> dataframe.DataFrame

Turn a SQL query into a DataFrame.

Note: Because the results are written to a temporary table, the ordering produced by an ORDER BY clause is not preserved. A unique index_col is recommended. Use ROW_NUMBER() OVER () if there is no natural unique index or you want to preserve ordering.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

Simple query input:

>>> df = bpd.read_gbq_query('''
...    SELECT
...       pitcherFirstName,
...       pitcherLastName,
...       pitchSpeed,
...    FROM `bigquery-public-data.baseball.games_wide`
... ''')
>>> df.head(2)
  pitcherFirstName pitcherLastName  pitchSpeed
0                                            0
1                                            0
<BLANKLINE>
[2 rows x 3 columns]

Preserve ordering in a query input.

>>> df = bpd.read_gbq_query('''
...    SELECT
...       -- Instead of an ORDER BY clause on the query, use
...       -- ROW_NUMBER() to create an ordered DataFrame.
...       ROW_NUMBER() OVER (ORDER BY AVG(pitchSpeed) DESC)
...         AS rowindex,
...
...       pitcherFirstName,
...       pitcherLastName,
...       AVG(pitchSpeed) AS averagePitchSpeed
...     FROM `bigquery-public-data.baseball.games_wide`
...     WHERE year = 2016
...     GROUP BY pitcherFirstName, pitcherLastName
... ''', index_col="rowindex")
>>> df.head(2)
         pitcherFirstName pitcherLastName  averagePitchSpeed
rowindex
1                Albertin         Chapman          96.514113
2                 Zachary         Britton          94.591039
<BLANKLINE>
[2 rows x 3 columns]

read_gbq_table

read_gbq_table(
    query: str,
    *,
    index_col: Iterable[str] | str = (),
    col_order: Iterable[str] = (),
    max_results: Optional[int] = None
) -> dataframe.DataFrame

Turn a BigQuery table into a DataFrame.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

Read a whole table, with arbitrary ordering or ordering corresponding to the primary key(s).

>>> df = bpd.read_gbq_table("bigquery-public-data.ml_datasets.penguins")
>>> df.head(2)
                                     species island  culmen_length_mm  \
0        Adelie Penguin (Pygoscelis adeliae)  Dream              36.6
1        Adelie Penguin (Pygoscelis adeliae)  Dream              39.8
<BLANKLINE>
   culmen_depth_mm  flipper_length_mm  body_mass_g     sex
0             18.4              184.0       3475.0  FEMALE
1             19.1              184.0       4650.0    MALE
<BLANKLINE>
[2 rows x 7 columns]

read_json

read_json(
    path_or_buf: str | IO["bytes"],
    *,
    orient: Literal[
        "split", "records", "index", "columns", "values", "table"
    ] = "columns",
    dtype: Optional[Dict] = None,
    encoding: Optional[str] = None,
    lines: bool = False,
    engine: Literal["ujson", "pyarrow", "bigquery"] = "ujson",
    **kwargs
) -> dataframe.DataFrame

Convert a JSON string to DataFrame object.

Note: using engine="bigquery" will not guarantee the same ordering as the file. Instead, set a serialized index column as the index and sort by that in the resulting DataFrame.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> gcs_path = "gs://bigframes-dev-testing/sample1.json"
>>> df = bpd.read_json(path_or_buf=gcs_path, lines=True, orient="records")
>>> df.head(2)
   id   name
0   1  Alice
1   2    Bob
<BLANKLINE>
[2 rows x 2 columns]

Parameters
Name	Description
`path_or_buf`	`a valid JSON str, path object or file-like object` A local or Google Cloud Storage (`gs://`) path when `engine="bigquery"`; otherwise the value is passed to `pandas.read_json`.
`orient`	`str, optional` Indication of expected JSON string format. With `engine="bigquery"`, only `"records"` is supported. Compatible JSON strings can be produced by `to_json()` with a corresponding orient value. The set of possible orients is: - `'split'` : dict like `{index -> [index], columns -> [columns], data -> [values]}` - `'records'` : list like `[{column -> value}, ... , {column -> value}]` - `'index'` : dict like `{index -> {column -> value}}` - `'columns'` : dict like `{column -> {index -> value}}` - `'values'` : just the values array
`dtype`	`bool or dict, default None` If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don't infer dtypes at all, applies only to the data. For all `orient` values except `'table'`, default is True.
`encoding`	`str, default is 'utf-8'` The encoding to use to decode py3 bytes.
`lines`	`bool, default False` Read the file as a JSON object per line. If using `engine="bigquery"`, `lines` only supports True.
`engine`	`{"ujson", "pyarrow", "bigquery"}, default "ujson"` Type of engine to use. If `engine="bigquery"` is specified, then BigQuery's load API will be used. Otherwise, the engine will be passed to `pandas.read_json`.
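
Since `engine="bigquery"` only accepts `orient="records"` with `lines=True`, the input needs to be newline-delimited JSON with one record per line. A minimal sketch of producing that layout with the standard library follows; the helper name `to_json_lines` is an assumption for illustration.

```python
import json
from typing import Any, Dict, Iterable


def to_json_lines(records: Iterable[Dict[str, Any]]) -> str:
    """Serialize records as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


text = to_json_lines([{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}])

# Each line parses on its own, which is what a line-oriented reader expects.
parsed = [json.loads(line) for line in text.splitlines()]
```

A file with this content is the kind of input the `read_json` example above reads with `lines=True, orient="records"`.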

Returns
Type	Description
`bigframes.dataframe.DataFrame`	The DataFrame representing JSON contents.

read_pandas

read_pandas(
    pandas_dataframe: pandas.core.frame.DataFrame,
) -> bigframes.dataframe.DataFrame

Loads DataFrame from a pandas DataFrame.

The pandas DataFrame will be persisted as a temporary BigQuery table, which can be automatically recycled after the Session is closed.

Examples:

>>> import bigframes.pandas as bpd
>>> import pandas as pd
>>> bpd.options.display.progress_bar = None

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> pandas_df = pd.DataFrame(data=d)
>>> df = bpd.read_pandas(pandas_df)
>>> df
   col1  col2
0     1     3
1     2     4
<BLANKLINE>
[2 rows x 2 columns]

Parameter
Name	Description
`pandas_dataframe`	`pandas.DataFrame` a pandas DataFrame object to be loaded.

Returns
Type	Description
`bigframes.dataframe.DataFrame`	The BigQuery DataFrame.

read_parquet

read_parquet(path: str | IO["bytes"]) -> dataframe.DataFrame

Load a Parquet object from the file path (local or Cloud Storage), returning a DataFrame.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> gcs_path = "gs://cloud-samples-data/bigquery/us-states/us-states.parquet"
>>> df = bpd.read_parquet(path=gcs_path)
>>> df.head(2)
      name post_abbr
0  Alabama        AL
1   Alaska        AK
<BLANKLINE>
[2 rows x 2 columns]

Parameter
Name	Description
`path`	`str` Local or Cloud Storage path to Parquet file.

Returns
Type	Description
`bigframes.dataframe.DataFrame`	A BigQuery DataFrame.

read_pickle

read_pickle(
    filepath_or_buffer: FilePath | ReadPickleBuffer,
    compression: CompressionOptions = "infer",
    storage_options: StorageOptions = None,
)

Load pickled BigFrames object (or any object) from file.

Examples:

>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None

>>> gcs_path = "gs://bigframes-dev-testing/test_pickle.pkl"
>>> df = bpd.read_pickle(filepath_or_buffer=gcs_path)
>>> df.head(2)
                                     species island  culmen_length_mm  \
0        Adelie Penguin (Pygoscelis adeliae)  Dream              36.6
1        Adelie Penguin (Pygoscelis adeliae)  Dream              39.8
<BLANKLINE>
   culmen_depth_mm  flipper_length_mm  body_mass_g     sex
0             18.4              184.0       3475.0  FEMALE
1             19.1              184.0       4650.0    MALE
<BLANKLINE>
[2 rows x 7 columns]

Parameters
Name	Description
`filepath_or_buffer`	`str, path object, or file-like object` String, path object (implementing os.PathLike[str]), or file-like object implementing a binary readlines() function. Also accepts URL. URL is not limited to S3 and GCS.
`compression`	`str or dict, default 'infer'` For on-the-fly decompression of on-disk data. If 'infer' and 'filepath_or_buffer' is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', '.xz', '.zst', '.tar', '.tar.gz', '.tar.xz' or '.tar.bz2' (otherwise no compression). If using 'zip' or 'tar', the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary compression={'method': 'zstd', 'dict_data': my_compression_dict}.
`storage_options`	`dict, default None` Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with "s3://" and "gcs://") the key-value pairs are forwarded to fsspec.open. Please see the fsspec and urllib documentation for more details and further examples of storage options.
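
The `'infer'` behaviour of `compression` amounts to a lookup from file extension to decompression method. The sketch below mirrors the extension list given above; the function name `infer_compression` and the returned method names (in particular `"xz"`) are assumptions, and the actual logic lives in the underlying reader.

```python
from typing import Optional

# Extension list taken from the description of compression='infer' above.
_EXTENSION_TO_METHOD = {
    ".tar": "tar",
    ".tar.gz": "tar",
    ".tar.bz2": "tar",
    ".tar.xz": "tar",
    ".gz": "gzip",
    ".bz2": "bz2",
    ".zip": "zip",
    ".xz": "xz",
    ".zst": "zstd",
}


def infer_compression(path: str) -> Optional[str]:
    """Guess the decompression method from a path, or None if uncompressed."""
    lowered = path.lower()
    # Try longer extensions first so ".tar.gz" is not mistaken for ".gz".
    for extension in sorted(_EXTENSION_TO_METHOD, key=len, reverse=True):
        if lowered.endswith(extension):
            return _EXTENSION_TO_METHOD[extension]
    return None


method = infer_compression("gs://bigframes-dev-testing/test_pickle.pkl")
```

For the path used in the example above, no known extension matches, so `method` is `None` and the file would be read without decompression.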

Returns
Type	Description
`bigframes.dataframe.DataFrame or bigframes.series.Series`	same type as object stored in file.

remote_function

remote_function(
    input_types: typing.List[type],
    output_type: type,
    dataset: typing.Optional[str] = None,
    bigquery_connection: typing.Optional[str] = None,
    reuse: bool = True,
    name: typing.Optional[str] = None,
    packages: typing.Optional[typing.Sequence[str]] = None,
)

Decorator to turn a user defined function into a BigQuery remote function. Check out the code samples at: https://cloud.google.com/bigquery/docs/remote-functions#bigquery-dataframes.

Have the below APIs enabled for your project:
- BigQuery Connection API
- Cloud Functions API
- Cloud Run API
- Cloud Build API
- Artifact Registry API
- Cloud Resource Manager API

This can be done from the cloud console (change PROJECT_ID to yours): https://console.cloud.google.com/apis/enableflow?apiid=bigqueryconnection.googleapis.com,cloudfunctions.googleapis.com,run.googleapis.com,cloudbuild.googleapis.com,artifactregistry.googleapis.com,cloudresourcemanager.googleapis.com&project=PROJECT_ID

Or from the gcloud CLI:

$ gcloud services enable bigqueryconnection.googleapis.com cloudfunctions.googleapis.com run.googleapis.com cloudbuild.googleapis.com artifactregistry.googleapis.com cloudresourcemanager.googleapis.com

Have the following IAM roles enabled for you:
- BigQuery Data Editor (roles/bigquery.dataEditor)
- BigQuery Connection Admin (roles/bigquery.connectionAdmin)
- Cloud Functions Developer (roles/cloudfunctions.developer)
- Service Account User (roles/iam.serviceAccountUser) on the service account PROJECT_NUMBER-compute@developer.gserviceaccount.com
- Storage Object Viewer (roles/storage.objectViewer)
- Project IAM Admin (roles/resourcemanager.projectIamAdmin) (Only required if the BigQuery connection being used is not pre-created and is created dynamically with user credentials.)

Either the user has the setIamPolicy privilege on the project, or a BigQuery connection is pre-created with the necessary IAM role set:
1. To create a connection, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#create_a_connection
2. To set up IAM, follow https://cloud.google.com/bigquery/docs/reference/standard-sql/remote-functions#grant_permission_on_function
  
  Alternatively, the IAM could also be set up via the gcloud CLI:
  
  $ gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:CONNECTION_SERVICE_ACCOUNT_ID" --role="roles/run.invoker"

Parameters
Name	Description
`input_types`	`list(type)` List of input data types in the user defined function.
`output_type`	`type` Data type of the output in the user defined function.
`dataset`	`str, Optional` Dataset in which to create a BigQuery remote function. It should be in `<project_id>.<dataset_name>` or `<dataset_name>` format. If this parameter is not provided then session dataset id is used.
`bigquery_connection`	`str, Optional` Name of the BigQuery connection. You should either have the connection already created in the `location` you have chosen, or you should have the Project IAM Admin role to enable the service to create the connection for you if you need it. If this parameter is not provided then the BigQuery connection from the session is used.
`reuse`	`bool, Optional` Reuse the remote function if it already exists. `True` by default, which results in reusing an existing remote function and corresponding cloud function (if any) that was previously created for the same UDF. Setting it to `False` forces creation of a unique remote function. If the required remote function does not exist, it is created irrespective of this parameter.
`name`	`str, Optional` Explicit name of the persisted BigQuery remote function. Use it with caution, because two users working in the same project and dataset could overwrite each other's remote functions if they use the same persistent name.
`packages`	`str[], Optional` Explicit name of the external package dependencies. Each dependency is added to the `requirements.txt` as is, and can be of the form supported in https://pip.pypa.io/en/stable/reference/requirements-file-format/.

Returns
Type	Description
`callable`	A remote function object pointing to the cloud assets created in the background to support the remote execution. The cloud assets can be located through the following properties set in the object: `bigframes_cloud_function` - The google cloud function deployed for the user defined code. `bigframes_remote_function` - The bigquery remote function capable of calling into `bigframes_cloud_function`.
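
As an illustration of how the decorator form might be used, the sketch below keeps the user-defined function as plain Python and applies `remote_function` in a separate step, using only the `input_types`, `output_type`, and `reuse` parameters listed above. The names `double` and `deploy_double` are hypothetical, and `session` is assumed to be an existing `Session` with the APIs and IAM roles above in place.

```python
def double(x: float) -> float:
    """Plain Python function to be deployed as a remote function."""
    return x * 2.0


def deploy_double(session):
    """Wrap `double` with session.remote_function and return the result."""
    # remote_function returns a decorator, which is then applied to `double`.
    decorator = session.remote_function(
        input_types=[float],
        output_type=float,
        reuse=True,
    )
    return decorator(double)
```

The returned object should expose the `bigframes_cloud_function` and `bigframes_remote_function` properties described above. Keeping `double` undecorated also means it can be unit tested locally before any cloud assets are created.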
