View and interpret evaluation results

This page describes how to view and interpret your evaluation results after you run a model evaluation.

View evaluation results

After you define your evaluation task, run the task to get evaluation results, as follows:

eval_result: EvalResult = eval_task.evaluate(
  model=MODEL,
)
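
For context, eval_task and MODEL need to be defined before this call. The following is a minimal sketch of one possible setup, assuming the vertexai.preview.evaluation module (the import path may be vertexai.evaluation in newer SDK versions); the dataset columns, model name, metric, and experiment name are illustrative assumptions, not part of this page's example.

import pandas as pd
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask, MetricPromptTemplateExamples

# vertexai.init(project=..., location=...) is assumed to have been called already.

# Illustrative evaluation dataset; use your own prompts (and references, if needed).
eval_dataset = pd.DataFrame({
    "prompt": [
        "Summarize the following text in a way that a five-year-old can understand: ...",
        "Can you answer where France is based on the following paragraph? ...",
    ],
})

# Candidate model to evaluate; the model name here is only an example.
MODEL = GenerativeModel("gemini-1.5-pro")

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[MetricPromptTemplateExamples.Pointwise.TEXT_QUALITY],
    experiment="my-eval-experiment",  # assumed experiment name
)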

The EvalResult class represents the result of an evaluation run with the following attributes:

  • metrics_table: the per-instance results.
  • summary_metrics: the aggregate results across all instances for each requested metric.
  • metadata: the experiment name and experiment run name for the evaluation run.
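
For example, you can read these attributes directly from the eval_result object returned by the call above, assuming the run was configured with an experiment and produced a metrics table:

# Aggregate results for all requested metrics.
print(eval_result.summary_metrics)

# Per-instance results as a pandas DataFrame.
print(eval_result.metrics_table.head())

# Experiment name and experiment run name for this evaluation run.
print(eval_result.metadata)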

The EvalResult class is defined as follows:

import dataclasses
from typing import Dict, Optional

import pandas as pd


@dataclasses.dataclass
class EvalResult:
    """Evaluation result.

    Attributes:
      summary_metrics: The summary evaluation metrics for an evaluation run.
      metrics_table: A table containing eval inputs, ground truth, and metric
        results per row.
      metadata: The metadata for the evaluation run.
    """

    summary_metrics: Dict[str, float]
    metrics_table: Optional["pd.DataFrame"] = None
    metadata: Optional[Dict[str, str]] = None

You can use helper functions to display the evaluation results in a Colab notebook, as follows:

Tables for summary metrics and row-based metrics
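
The display helpers aren't part of the SDK. A minimal sketch of one such helper, assuming a Colab or Jupyter environment with pandas and IPython available, might look like the following; the function name display_eval_result is hypothetical.

import pandas as pd
from IPython.display import Markdown, display


def display_eval_result(eval_result, title=None):
    """Display summary metrics and row-based metrics as notebook tables."""
    if title:
        display(Markdown(f"## {title}"))

    # Summary metrics: a one-row table of aggregate scores.
    display(Markdown("### Summary metrics"))
    display(pd.DataFrame([eval_result.summary_metrics]))

    # Row-based metrics: per-instance inputs, responses, scores, and explanations.
    display(Markdown("### Row-based metrics"))
    display(eval_result.metrics_table)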

Visualize evaluation results

You can plot summary metrics in a radar or bar chart to visualize and compare results from different evaluation runs. This visualization is helpful for evaluating different models and different prompt templates.

In the following example, four metrics (coherence, fluency, instruction following, and overall text quality) are visualized for responses generated using four different prompt templates. From the radar and bar charts, you can infer that prompt template #2 consistently outperforms the other templates across all four metrics, which is particularly evident in its significantly higher scores for instruction following and text quality. Based on this analysis, prompt template #2 appears to be the most effective choice among the four options.

Radar chart showing the coherence, instruction_following, text_quality, and fluency scores for all prompt templates

Bar chart showing the mean for coherence, instruction_following, text_quality, and fluency for all prompt templates
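
A radar chart like the one above can be drawn from the summary metrics of each run. The sketch below assumes plotly is installed, that eval_results is a dict mapping each prompt-template label to its EvalResult, and that the summary metric keys follow a <metric>/mean naming pattern; adjust the keys to whatever your results actually contain.

import plotly.graph_objects as go

# eval_results: dict mapping a prompt-template label to its EvalResult (assumption).
# Summary metric keys to plot; these names are assumptions.
metrics = ["coherence/mean", "fluency/mean", "instruction_following/mean", "text_quality/mean"]

fig = go.Figure()
for template_name, eval_result in eval_results.items():
    fig.add_trace(
        go.Scatterpolar(
            r=[eval_result.summary_metrics.get(metric, 0) for metric in metrics],
            theta=metrics,
            fill="toself",
            name=template_name,
        )
    )
fig.update_layout(polar=dict(radialaxis=dict(range=[0, 5])), showlegend=True)
fig.show()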

Understand metric results

The following tables list the components of the instance-level results included in metrics_table and the aggregate results included in summary_metrics for PointwiseMetric, PairwiseMetric, and computation-based metrics:

PointwiseMetric

Instance-level results

  • response: The response generated for the prompt by the model.
  • score: The rating given to the response according to the criteria and rating rubric. The score can be binary (0 and 1), on a Likert scale (1 to 5, or -2 to 2), or a float (0.0 to 1.0).
  • explanation: The judge model's reason for the score. Chain-of-thought reasoning is used to guide the judge model to explain the rationale behind each verdict; forcing the judge model to reason has been shown to improve evaluation accuracy.

Aggregate results

  • mean score: Average score for all instances.
  • standard deviation: Standard deviation for all the scores.
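
For example, with a pointwise metric named text_quality, you could inspect these columns and relate the aggregates back to the per-instance scores. The column and key names below (text_quality/score and so on) are assumptions; check eval_result.metrics_table.columns for the exact names your SDK version produces.

# Per-instance score and explanation columns for a pointwise metric.
per_instance = eval_result.metrics_table[
    ["response", "text_quality/score", "text_quality/explanation"]
]
print(per_instance.head())

# The aggregate mean summarizes the per-instance scores; the standard
# deviation can be recomputed from the score column as a sanity check.
print(eval_result.summary_metrics.get("text_quality/mean"))
print(eval_result.metrics_table["text_quality/score"].std())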

PairwiseMetric

Instance-level results

  • response: The response generated for the prompt by the candidate model.
  • baseline_model_response: The response generated for the prompt by the baseline model.
  • pairwise_choice: The model with the better response. Possible values are CANDIDATE, BASELINE, or TIE.
  • explanation: The judge model's reason for the choice.

Aggregate results

  • candidate_model_win_rate: The ratio of times the judge model decided that the candidate model had the better response, out of the total number of responses. Ranges from 0 to 1.
  • baseline_model_win_rate: The ratio of times the judge model decided that the baseline model had the better response, out of the total number of responses. Ranges from 0 to 1.
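
The win rates follow directly from the per-instance pairwise_choice values. A sketch of that calculation is shown below; the column name pairwise_text_quality/pairwise_choice is an assumption and depends on the pairwise metric you used.

# Column name is an assumption; it depends on the pairwise metric's name.
choices = eval_result.metrics_table["pairwise_text_quality/pairwise_choice"]

# Win rate: fraction of instances where the judge picked each model; ties count for neither.
candidate_model_win_rate = (choices == "CANDIDATE").mean()
baseline_model_win_rate = (choices == "BASELINE").mean()
print(candidate_model_win_rate, baseline_model_win_rate)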

Computation-based metrics

Instance-level results

  • response: The model's response being evaluated.
  • reference: The reference response.
  • score: The score calculated for each response and reference pair.

Aggregate results

  • mean score: Average score for all instances.
  • standard deviation: Standard deviation for all the scores.

Examples

The examples in this section demonstrate how to read and understand the evaluation results.

Example 1

In the first example, consider a pointwise evaluation of an instance for the TEXT_QUALITY metric. The score is 4 (on a scale of 1 to 5), which means the response is good. The explanation in the evaluation result shows why the judge model thinks the response deserves a score of 4, and not a higher or lower score.

Dataset

  • prompt: "Summarize the following text in a way that a five-year-old can understand: Social Media Platform Faces Backlash Over Content Moderation Policies\nA prominent social media platform finds itself embroiled in controversy as users and content creators express discontent over its content moderation policies. Allegations of biased censorship, inconsistent enforcement, and suppression of certain viewpoints have sparked outrage among users who claim that the platform is stifling free speech. On the other hand, the platform asserts that its policies are designed to maintain a safe and inclusive online environment. This controversy raises broader questions about the role of social media in shaping public discourse and the responsibilities of platforms in curating online content."

  • response: "People are upset with a website because they think it's not being fair with what people can post. Some people say the website is stopping them from saying what they want. But the website says it's trying to keep everyone safe. This makes people wonder if websites should control what people can say online."

Result

  • score: 4,
  • explanation: The summary in response follows the instruction to summarize the context in a way that a five-year old can understand. It is grounded in the context and provides important details in its summarization. However, the language used in the response is a bit verbose.

Example 2

The second example is a pairwise comparison evaluation with the PAIRWISE_QUESTION_ANSWERING_QUALITY metric. The pairwise_choice result shows that the judge model prefers the candidate response, "France is a country located in Western Europe.", over the baseline response, "France is a country.", as an answer to the question in the prompt. As with the pointwise results, an explanation is provided that describes why the candidate response is better than the baseline response (in this case, the candidate response is more helpful).

Dataset

  • prompt: "Can you answer where France is based on the following paragraph? France is a country located in Western Europe. It's bordered by Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco, Spain, and Andorra. France's coastline stretches along the English Channel, the North Sea, the Atlantic Ocean, and the Mediterranean Sea. Known for its rich history, iconic landmarks like the Eiffel Tower, and delicious cuisine, France is a major cultural and economic power in Europe and throughout the world."

  • response: "France is a country located in Western Europe.",

  • baseline_model_response: "France is a country.",

Result

  • pairwise_choice: CANDIDATE,
  • explanation: BASELINE response is grounded but does not fully answer the question. CANDIDATE response, however, is correct and provides helpful details on the location of France.

What's next