Perform similarity vector search in Bigtable by finding the K-nearest neighbors

Similarity vector search can help you identify similar concepts and contextual meaning in your Bigtable data, which means it can yield more relevant results when filtering for data stored within a specified key range. Example use cases might include the following:

  • Inbox search, where you want to perform semantic matching of messages for a particular user
  • Anomaly detection within a range of sensors
  • Retrieving the most relevant documents within a set of known keys for retrieval augmented generation (RAG)

This page describes how to perform similarity vector search in Bigtable by using the cosine distance and Euclidean distance vector functions in GoogleSQL for Bigtable to find K-nearest neighbors. Before you read this page, it's important that you understand the following concepts:

Bigtable supports the COSINE_DISTANCE() and EUCLIDEAN_DISTANCE() functions, which operate on vector embeddings, letting you find the KNN of the input embedding.

You can use the Vertex AI text embeddings API to generate vector embeddings from your Bigtable data and store them in Bigtable. You can then provide a vector embedding as an input parameter in your query to find the nearest vectors in N-dimensional space and search for semantically similar or related items.

Both distance functions take the arguments vector1 and vector2, which are arrays and must have the same dimensionality and length. For more details about these functions, see the GoogleSQL for Bigtable reference documentation.
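
For example, after embeddings are stored (as shown later on this page), a query like the following sketch returns the five nearest neighbors of a hypothetical three-element query vector. It assumes the knn_intro table and docs column family used in this page's examples; TO_VECTOR32, described in a later section, decodes the stored bytes. Real embeddings in this example have 128 dimensions:

SELECT _key,
       EUCLIDEAN_DISTANCE(TO_VECTOR32(docs['embedding']), [0.1, 0.2, 0.3]) AS distance
FROM knn_intro
ORDER BY distance
LIMIT 5;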

The code on this page demonstrates how to create embeddings, store them in Bigtable, and then perform a KNN search.

The example on this page uses EUCLIDEAN_DISTANCE() and the Bigtable client library for Python. However, you can also use COSINE_DISTANCE() and any client library that supports GoogleSQL for Bigtable, such as the Bigtable client library for Java.

Before you begin

Complete the following before you try the code samples.

Required roles

To get the permissions that you need to read and write to Bigtable, ask your administrator to grant you the following IAM role.

  • Bigtable User (roles/bigtable.user) on the Bigtable instance that you want to send requests to

Set up your environment

  1. Download and install the Bigtable client library for Python. To use GoogleSQL for Bigtable functions, you must use python-bigtable version 2.26.0 or later. Instructions, including how to set up authentication, are at Python hello world.

  2. If you don't have a Bigtable instance, follow the steps at Create an instance.

  3. Identify your resource IDs. When you run the code, replace the following placeholders with the IDs of your Google Cloud project, Bigtable instance, and table:

    • PROJECT_ID
    • INSTANCE_ID
    • TABLE_ID
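
For example, you might define the placeholder values as variables at the top of your script. The values shown here are hypothetical, except that the table ID matches the knn_intro table name used in the queries later on this page:

Python

PROJECT_ID = "my-project"  # your Google Cloud project ID
INSTANCE_ID = "my-instance"  # your Bigtable instance ID
TABLE_ID = "knn_intro"  # your Bigtable table ID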

Create a table to store the text, embeddings, and search phrase

Create a table with two column families.

Python

from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project=PROJECT_ID, admin=True)
instance = client.instance(INSTANCE_ID)
table = instance.table(TABLE_ID)
column_families = {
  "docs": column_family.MaxVersionsGCRule(2),
  "search_phrase": column_family.MaxVersionsGCRule(2),
}

if not table.exists():
  table.create(column_families=column_families)
else:
  print("Table already exists")

Embed texts with a pre-trained, foundational model from Vertex AI

Generate the text and embeddings to store in Bigtable along with the associated keys. For additional documentation, see Get text embeddings or Get multimodal embeddings.

Python

from typing import List, Optional

from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel
from vertexai.generative_models import GenerativeModel

# Define which LLM to use to generate the text.
# If your environment isn't already configured, call vertexai.init() first.
model = GenerativeModel("gemini-1.5-pro-001")

# First, use generative AI to create a list of 10 chunks for phrases.
# This can be replaced with a static list of text items or your own data.
chunks = []
for i in range(10):
  response = model.generate_content(
      "Generate a paragraph between 10 and 20 words that is about either"
      " Bigtable or Generative AI"
  )
  chunks.append(response.text)
  print(response.text)

# Create embeddings for the chunks of text
def embed_text(
  texts: List[str] = chunks,
  task: str = "RETRIEVAL_DOCUMENT",
  model_name: str = "text-embedding-004",
  dimensionality: Optional[int] = 128,
) -> List[List[float]]:
  """Embeds texts with a pre-trained, foundational model."""
  model = TextEmbeddingModel.from_pretrained(model_name)
  inputs = [TextEmbeddingInput(text, task) for text in texts]
  kwargs = dict(output_dimensionality=dimensionality) if dimensionality else {}
  embeddings = model.get_embeddings(inputs, **kwargs)
  return [embedding.values for embedding in embeddings]

embeddings = embed_text()
print("embeddings created for text phrases")

Define functions that convert embeddings into byte objects

Bigtable is optimized for key-value pairs and generally stores data as byte objects. For more information about designing your data model for Bigtable, see Schema design best practices.

You need to convert the embeddings that come back from Vertex AI, which Python stores as lists of floating-point numbers. Convert each element to its big-endian IEEE 754 floating-point representation, and then concatenate the results. The following function achieves this.

Python

import struct

def floats_to_bytes(float_list):
  """
  Convert a list of floats to a bytes object, where each float is represented
  by 4 big-endian bytes.

  Parameters:
  float_list (list of float): The list of floats to be converted.

  Returns:
  bytes: The resulting bytes object with concatenated 4-byte big-endian
  representations of the floats.
  """
  byte_array = bytearray()

  for value in float_list:
    # Pack each float as 4 big-endian bytes (IEEE 754 single precision)
    packed_value = struct.pack('>f', value)
    byte_array.extend(packed_value)

  # Convert bytearray to bytes
  return bytes(byte_array)
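
To verify the encoding, you can decode a byte object back into floats. The following inverse function is a minimal sketch for illustration only (the rest of this page doesn't use it), and it assumes the 4-byte big-endian layout that floats_to_bytes produces:

Python

def bytes_to_floats(byte_data):
  """
  Convert a bytes object of concatenated 4-byte big-endian floats back
  into a list of Python floats.
  """
  # '>f' mirrors the format string used in floats_to_bytes
  return [
      struct.unpack('>f', byte_data[i:i + 4])[0]
      for i in range(0, len(byte_data), 4)
  ]

# Round-trip check: these values are exactly representable in 32-bit floats
assert bytes_to_floats(floats_to_bytes([0.5, -1.25])) == [0.5, -1.25]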

Write the embeddings to Bigtable

Convert the embeddings to byte objects, create a mutation, and then write the data to Bigtable.

Python

mutations = []

# Reuse the embeddings generated in the previous step
for i, embedding in enumerate(embeddings):
  print(embedding)

  # Convert each embedding into a byte object
  vector = floats_to_bytes(embedding)

  # Set the row key, which is used to pull a range of documents (for example, doc type or user ID)
  row_key = f"doc_{i}"

  row = table.direct_row(row_key)

  # Set the column for the embedding to the byte object format of the embedding
  row.set_cell("docs", "embedding", vector)
  # Store the text associated with the vector in the same row
  row.set_cell("docs", "text", chunks[i])
  mutations.append(row)

# Write the rows to Bigtable
table.mutate_rows(mutations)

Perform a KNN search

The vectors are stored as binary-encoded data that can be read from Bigtable by using a conversion function from the BYTES type to ARRAY<FLOAT32>.

For example, the following query reads the embeddings that were written in the previous step:

SELECT _key, TO_VECTOR32(docs['embedding']) AS embedding
FROM knn_intro WHERE _key LIKE 'doc_%';

In Python, you can use the GoogleSQL COSINE_DISTANCE function to find the similarity between your text embeddings and the search phrases that you provide. Because this computation can take time to process, use the Python client library's asynchronous data client to execute the SQL query.

Python

from google.cloud.bigtable.data import BigtableDataClientAsync

# First, embed the search phrase
search_embedding = embed_text(texts=["Apache HBase"])

query = """
      SELECT _key, docs['text'] AS description
      FROM knn_intro
      ORDER BY COSINE_DISTANCE(TO_VECTOR32(docs['embedding']), {search_embedding})
      LIMIT 1;
      """

async def execute_query():
  async with BigtableDataClientAsync(project=PROJECT_ID) as client:
    async for row in await client.execute_query(
        query.format(search_embedding=search_embedding[0]), INSTANCE_ID
    ):
      return (row["_key"], row["description"])

await execute_query()
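
Top-level await works in notebook environments such as Colab. If you run this code as a standard Python script instead, wrap the coroutine with asyncio, as in this sketch:

Python

import asyncio

result = asyncio.run(execute_query())
print(result)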

Because the search phrase "Apache HBase" is semantically closest to the Bigtable paragraphs, the response that's returned is a generated text description about Bigtable.

What's next