Evaluate an agent

After developing an agent, you can use Gen AI evaluation service to evaluate the agent's ability to complete tasks and goals for a given use case.

Define evaluation metrics

Begin with an empty list of metrics and append the metrics that are relevant to your use case, as described in the following sections.
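
For example, initialize the list that the snippets in the following sections append to:

metrics = []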

Final response

Final response evaluation follows the same process as model-based evaluation. For details, see Define your evaluation metrics.
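
For example, a minimal sketch that scores the agent's final response with a prebuilt model-based metric (the "coherence" metric name is an assumption; substitute whichever prebuilt or custom metric fits your use case):

metrics.append("coherence")  # assumed prebuilt model-based metric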

Exact match

metrics.append("trajectory_exact_match")

The trajectory_exact_match metric returns a score of 1 if the predicted trajectory is identical to the reference trajectory, with the exact same tool calls in the exact same order; otherwise, it returns 0.

Input parameters:

  • predicted_trajectory: The list of tool calls used by the agent to reach the final response.
  • reference_trajectory: The expected tool use for the agent to satisfy the query.

In-order match

metrics.append("trajectory_in_order_match")

The trajectory_in_order_match metric returns a score of 1 if the predicted trajectory contains all the tool calls from the reference trajectory in the same order (extra tool calls are allowed); otherwise, it returns 0.

Input parameters:

  • predicted_trajectory: The list of tool calls used by the agent to reach the final response.
  • reference_trajectory: The expected tool use for the agent to satisfy the query.

Any-order match

metrics.append("trajectory_any_order_match")

The trajectory_any_order_match metric returns a score of 1 if the predicted trajectory contains all the tool calls from the reference trajectory, regardless of order (extra tool calls are allowed); otherwise, it returns 0.

Input parameters:

  • predicted_trajectory: The list of tool calls used by the agent to reach the final response.
  • reference_trajectory: The expected tool use for the agent to satisfy the query.
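
To illustrate how the three match metrics differ, the following is a minimal sketch that approximates their scoring logic on hand-written trajectories. It is illustrative only, not the service's implementation, and it compares tool names only for brevity:

reference = ["get_user_preferences", "set_temperature"]

# Same calls, same order, plus one extra call in the middle.
predicted_extra = ["get_user_preferences", "get_device_info", "set_temperature"]
# Same calls, different order.
predicted_reordered = ["set_temperature", "get_user_preferences"]

def in_order_match(predicted, reference):
  # All reference calls appear in the predicted trajectory, in the same order.
  remaining = iter(predicted)
  return int(all(call in remaining for call in reference))

def any_order_match(predicted, reference):
  # All reference calls appear in the predicted trajectory, in any order.
  return int(all(call in predicted for call in reference))

print(int(predicted_extra == reference))                # exact match: 0
print(in_order_match(predicted_extra, reference))       # in-order match: 1
print(any_order_match(predicted_reordered, reference))  # any-order match: 1
print(in_order_match(predicted_reordered, reference))   # in-order match: 0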

Precision

metrics.append("trajectory_precision")

The trajectory_precision metric measures how many of the tool calls in the predicted trajectory are actually relevant or correct according to the reference trajectory. It is a float value in the range [0, 1]: the higher the score, the more precise the predicted trajectory.

Precision is calculated as follows: Count how many actions in the predicted trajectory also appear in the reference trajectory. Divide that count by the total number of actions in the predicted trajectory.

Input parameters:

  • predicted_trajectory: The list of tool calls used by the agent to reach the final response.
  • reference_trajectory: The expected tool use for the agent to satisfy the query.

Recall

metrics.append("trajectory_recall")

The trajectory_recall metric measures how many of the essential tool calls from the reference trajectory are actually captured in the predicted trajectory. It is a float value in the range of [0, 1]: the higher the score, the better the recall of the predicted trajectory.

Recall is calculated as follows: Count how many actions in the reference trajectory also appear in the predicted trajectory. Divide that count by the total number of actions in the reference trajectory.

Input parameters:

  • predicted_trajectory: The list of tool calls used by the agent to reach the final response.
  • reference_trajectory: The expected tool use for the agent to satisfy the query.
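
The following minimal sketch shows the two calculations above side by side. It is illustrative only (it compares tool names and ignores duplicates); the service's implementation may differ:

reference = ["get_user_preferences", "set_temperature"]
predicted = ["get_user_preferences", "get_device_info", "set_temperature"]

# Precision: matched predicted calls divided by total predicted calls.
precision = sum(call in reference for call in predicted) / len(predicted)  # 2/3

# Recall: matched reference calls divided by total reference calls.
recall = sum(call in predicted for call in reference) / len(reference)     # 2/2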

Single tool use

# Import under an alias so the module doesn't shadow the metrics list defined earlier.
from vertexai.preview.evaluation import metrics as evaluation_metrics

metrics.append(evaluation_metrics.TrajectorySingleToolUse(tool_name='tool_name'))

The trajectory_single_tool_use metric checks if a specific tool that is specified in the metric spec is used in the predicted trajectory. It doesn't check the order of tool calls or how many times the tool is used, just whether it's present or not. It is a value of 0 if the tool is absent, 1 otherwise.

Input parameters:

  • predicted_trajectory: The list of tool calls used by the agent to reach the final response.

Custom

You can define a custom metric as follows:

# Import under an alias so the module doesn't shadow the metrics list defined earlier.
from vertexai.preview.evaluation import metrics as evaluation_metrics

def word_count(instance):
  # Each instance is a dictionary containing one row of the evaluation dataset;
  # the function must return a dictionary keyed by the metric name.
  response = instance["response"]
  score = len(response.split(" "))
  return {"word_count": score}

metrics.append(
  evaluation_metrics.CustomMetric(name="word_count", metric_function=word_count)
)

The following two performance metrics are always included in the results. You don't need to specify them in EvalTask:

  • latency (float): Time taken (in seconds) by the agent to respond.
  • failure (bool): 0 if the agent invocation succeeded, 1 otherwise.

Prepare evaluation dataset

To prepare your dataset for final response or trajectory evaluation:

Final response

The data schema for final response evaluation is similar to that of model response evaluation.
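
For example, a minimal final response dataset might look like the following. The prompt and reference column names are assumptions based on the model response schema; follow the schema described in Define your evaluation metrics, and let the agent generate the response column at evaluation time:

import pandas as pd

eval_dataset = pd.DataFrame({
  "prompt": ["Turn off the living room lights."],
  "reference": ["The living room lights are now off."],
})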

Exact match

The evaluation dataset needs to provide the following inputs:

Input parameters:

  • predicted_trajectory: The list of tool calls used by the agent to reach the final response.
  • reference_trajectory: The expected tool use for the agent to satisfy the query.

In-order match

The evaluation dataset needs to provide the following inputs:

Input parameters:

  • predicted_trajectory: The list of tool calls used by the agent to reach the final response.
  • reference_trajectory: The expected tool use for the agent to satisfy the query.

Any-order match

The evaluation dataset needs to provide the following inputs:

Input parameters:

  • predicted_trajectory: The list of tool calls used by the agent to reach the final response.
  • reference_trajectory: The expected tool use for the agent to satisfy the query.

Precision

The evaluation dataset needs to provide the following inputs:

Input parameters:

  • predicted_trajectory: The list of tool calls used by the agent to reach the final response.
  • reference_trajectory: The expected tool use for the agent to satisfy the query.

Recall

The evaluation dataset needs to provide the following inputs:

Input parameters:

  • predicted_trajectory: The list of tool calls used by the agent to reach the final response.
  • reference_trajectory: The expected tool use for the agent to satisfy the query.

Single tool use

The evaluation dataset needs to provide the following inputs:

Input parameters:

  • predicted_trajectory: The list of tool calls used by the agent to reach the final response.

For illustration purposes, the following is an example of an evaluation dataset.

  import pandas as pd

  eval_dataset = pd.DataFrame({
    "predicted_trajectory": [
      [ # example 1
        {
          "tool_name": "set_device_info",
          "tool_input": {"device_id": "device_3", "updates": {"status": "OFF"}}
        },
      ],
      [ # example 2
        {
          "tool_name": "get_user_preferences",
          "tool_input": {"user_id": "user_z"},
        }, {
          "tool_name": "set_temperature",
          "tool_input": {"location": "Living Room", "temperature": 23},
        },
      ]
    ],
    "reference_trajectory": [
      [ # example 1
        {
          "tool_name": "set_device_info",
          "tool_input": {"device_id": "device_2", "updates": {"status": "OFF"}},
        },
      ],
      [ # example 2
        {
          "tool_name": "get_user_preferences",
          "tool_input": {"user_id": "user_y"},
        }, {
          "tool_name": "set_temperature",
          "tool_input": {"location": "Living Room", "temperature": 23},
        },
      ],
    ],
  })

Example datasets

We have provided the following example datasets to demonstrate how you can evaluate agents:

  • "on-device": Evaluation dataset for an On-Device Home Assistant. The agent helps with queries such as "Schedule the air conditioning in the bedroom so that it is on between 11pm and 8am, and off the rest of the time."

  • "customer-support": Evaluation dataset for a Customer Support Agent. The agent helps with queries such as "Can you cancel any pending orders and escalate any open support tickets?"

  • "content-creation": Evaluation dataset for a Marketing Content Creation Agent. The agent helps with queries such as "Reschedule campaign X to be a one-time campaign on social media site Y with a 50% reduced budget, only on December 25, 2024."

To import the example datasets:

  1. Install and initialize the gcloud CLI.

  2. Download the evaluation dataset.

    On Device

    gcloud storage cp gs://cloud-ai-demo-datasets/agent-eval-datasets/on-device/eval_dataset.json .

    Customer Support

    gcloud storage cp gs://cloud-ai-demo-datasets/agent-eval-datasets/customer-support/eval_dataset.json .

    Content Creation

    gcloud storage cp gs://cloud-ai-demo-datasets/agent-eval-datasets/content-creation/eval_dataset.json .
  3. Load the dataset examples.

    import json

    with open('eval_dataset.json') as f:
      eval_dataset = json.load(f)
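
    Depending on the metrics you use, you may want the examples in a pandas DataFrame. Assuming the downloaded JSON decodes to a column-oriented mapping like the in-memory example shown earlier (an assumption about the file's structure), the conversion is a one-liner:

    import pandas as pd

    eval_dataset = pd.DataFrame(eval_dataset)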
    

Generate evaluation results

To generate evaluation results, run the following code:

from vertexai.preview.evaluation import EvalTask

eval_task = EvalTask(dataset=eval_dataset, metrics=metrics)
eval_result = eval_task.evaluate(runnable=agent)
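
For example, a quick way to inspect the output (the attribute names below are the ones the preview SDK commonly exposes; adjust them if your SDK version differs):

# Aggregate scores, such as the mean and standard deviation for each metric.
print(eval_result.summary_metrics)

# Per-instance results: responses, trajectories, scores, and explanations.
print(eval_result.metrics_table)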

View and interpret results

The evaluation results are displayed in tables for the agent evaluation metrics and contain the following information:

Final response metrics

Row-wise metrics:

  • response: Final response generated by the agent.
  • latency_in_seconds: Time taken (in seconds) to generate the response.
  • failure: Indicates whether a valid response was generated or not.
  • score: A score calculated for the response specified in the metric spec.
  • explanation: The explanation for the score specified in the metric spec.

Summary metrics:

  • mean: Average score for all instances.
  • standard deviation: Standard deviation for all the scores.

Trajectory metrics

Row-wise metrics:

  • predicted_trajectory: Sequence of tool calls followed by the agent to reach the final response.
  • reference_trajectory: Sequence of expected tool calls.
  • score: A score calculated for the predicted trajectory and reference trajectory specified in the metric spec.
  • latency_in_seconds: Time taken (in seconds) to generate the response.
  • failure: Indicates whether a valid response was generated or not.

Summary metrics:

  • mean: Average score for all instances.
  • standard deviation: Standard deviation for all the scores.

What's next

Try the following notebooks: