Package bigquery (1.9.0)

API documentation for bigquery package.

Packages Functions

array_agg

array_agg(
    obj: groupby.SeriesGroupBy | groupby.DataFrameGroupBy,
) -> series.Series | dataframe.DataFrame

Group data and create arrays from selected columns, omitting NULLs to avoid BigQuery errors (NULLs not allowed in arrays).

Examples:

>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> import numpy as np
>>> bpd.options.display.progress_bar = None

For a SeriesGroupBy object:

>>> lst = ['a', 'a', 'b', 'b', 'a']
>>> s = bpd.Series([1, 2, 3, 4, np.nan], index=lst)
>>> bbq.array_agg(s.groupby(level=0))
a    [1. 2.]
b    [3. 4.]
dtype: list<item: double>[pyarrow]

For a DataFrameGroupBy object:

>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = bpd.DataFrame(l, columns=["a", "b", "c"])
>>> bbq.array_agg(df.groupby(by=["b"]))
         a      c
b
1.0    [2]    [3]
2.0  [1 1]  [3 2]
<BLANKLINE>
[2 rows x 2 columns]
Parameter
Name Description
obj

A GroupBy object to be applied the function.

array_length

array_length(series: series.Series) -> series.Series

Compute the length of each array element in the Series.

Examples:

>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> bpd.options.display.progress_bar = None

>>> s = bpd.Series([[1, 2, 8, 3], [], [3, 4]])
>>> bbq.array_length(s)
0    4
1    0
2    2
dtype: Int64

You can also apply this function directly to Series.

>>> s.apply(bbq.array_length, by_row=False)
0    4
1    0
2    2
dtype: Int64
Parameter
Name Description
series

A Series with array columns.

array_to_string

array_to_string(series: series.Series, delimiter: str) -> series.Series

Converts array elements within a Series into delimited strings.

Examples:

>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> import numpy as np
>>> bpd.options.display.progress_bar = None

>>> s = bpd.Series([["H", "i", "!"], ["Hello", "World"], np.nan, [], ["Hi"]])
>>> bbq.array_to_string(s, delimiter=", ")
    0         H, i, !
    1    Hello, World
    2
    3
    4              Hi
    dtype: string
Parameters
Name Description
series

A Series containing arrays.

delimiter

The string used to separate array elements.

vector_search(
    base_table: str,
    column_to_search: str,
    query: Union[dataframe.DataFrame, series.Series],
    *,
    query_column_to_search: Optional[str] = None,
    top_k: Optional[int] = 10,
    distance_type: Literal["euclidean", "cosine"] = "euclidean",
    fraction_lists_to_search: Optional[float] = None,
    use_brute_force: bool = False
) -> dataframe.DataFrame

Conduct vector search which searches embeddings to find semantically similar entities.

Examples:

>>> import bigframes.pandas as bpd
>>> import bigframes.bigquery as bbq
>>> bpd.options.display.progress_bar = None

DataFrame embeddings for which to find nearest neighbors. The ARRAY<FLOAT64> column is used as the search query:

>>> search_query = bpd.DataFrame({"query_id": ["dog", "cat"],
...                               "embedding": [[1.0, 2.0], [3.0, 5.2]]})
>>> bbq.vector_search(
...             base_table="bigframes-dev.bigframes_tests_sys.base_table",
...             column_to_search="my_embedding",
...             query=search_query,
...             top_k=2)
  query_id  embedding  id my_embedding  distance
1      cat  [3.  5.2]   5    [5.  5.4]  2.009975
0      dog    [1. 2.]   1      [1. 2.]       0.0
0      dog    [1. 2.]   4    [1.  3.2]       1.2
1      cat  [3.  5.2]   2      [2. 4.]   1.56205
<BLANKLINE>
[4 rows x 5 columns]

Series embeddings for which to find nearest neighbors:

>>> search_query = bpd.Series([[1.0, 2.0], [3.0, 5.2]],
...                            index=["dog", "cat"],
...                            name="embedding")
>>> bbq.vector_search(
...             base_table="bigframes-dev.bigframes_tests_sys.base_table",
...             column_to_search="my_embedding",
...             query=search_query,
...             top_k=2)
     embedding  id my_embedding  distance
dog    [1. 2.]   1      [1. 2.]       0.0
cat  [3.  5.2]   5    [5.  5.4]  2.009975
dog    [1. 2.]   4    [1.  3.2]       1.2
cat  [3.  5.2]   2      [2. 4.]   1.56205
<BLANKLINE>
[4 rows x 4 columns]

You can specify the name of the column in the query DataFrame embeddings and distance type. If you specify query_column_to_search_value, it will use the provided column which contains the embeddings for which to find nearest neighbors. Otherwiese, it uses the column_to_search value.

>>> search_query = bpd.DataFrame({"query_id": ["dog", "cat"],
...                               "embedding": [[1.0, 2.0], [3.0, 5.2]],
...                               "another_embedding": [[0.7, 2.2], [3.3, 5.2]]})
>>> bbq.vector_search(
...             base_table="bigframes-dev.bigframes_tests_sys.base_table",
...             column_to_search="my_embedding",
...             query=search_query,
...             distance_type="cosine",
...             query_column_to_search="another_embedding",
...             top_k=2)
  query_id  embedding another_embedding  id my_embedding  distance
1      cat  [3.  5.2]         [3.3 5.2]   2      [2. 4.]  0.005181
0      dog    [1. 2.]         [0.7 2.2]   4    [1.  3.2]  0.000013
1      cat  [3.  5.2]         [3.3 5.2]   1      [1. 2.]  0.005181
0      dog    [1. 2.]         [0.7 2.2]   3    [1.5 7. ]  0.004697
<BLANKLINE>
[4 rows x 6 columns]
Parameters
Name Description
base_table

The table to search for nearest neighbor embeddings.

column_to_search

The name of the base table column to search for nearest neighbor embeddings. The column must have a type of ARRAY. All elements in the array must be non-NULL.

query

A Series or DataFrame that provides the embeddings for which to find nearest neighbors.

query_column_to_search

Specifies the name of the column in the query that contains the embeddings for which to find nearest neighbors. The column must have a type of ARRAY. All elements in the array must be non-NULL and all values in the column must have the same array dimensions as the values in the column_to_search column. Can only be set when query is a DataFrame.

top_k

Sepecifies the number of nearest neighbors to return. Default to 10.

distance_type

Specifies the type of metric to use to compute the distance between two vectors. Possible values are "euclidean" and "cosine". Default to "euclidean".

fraction_lists_to_search

Specifies the percentage of lists to search. Specifying a higher percentage leads to higher recall and slower performance, and the converse is true when specifying a lower percentage. It is only used when a vector index is also used. You can only specify fraction_lists_to_search when use_brute_force is set to False.

use_brute_force

Determines whether to use brute force search by skipping the vector index if one is available. Default to False.