This document outlines Google's recommendations for how to get the best results from Quality AI. Following these guidelines helps ensure that Quality AI provides the most accurate and useful information for your business needs.
Scorecards
Scorecards provide access to agent performance metrics and detailed instructions for answering questions about a conversation. You must enter your conversation data, questions, and possible answer options, along with instructions for how to interpret those answers. For best results, use the Scorecards page in the Quality AI console to upload your example conversations.
Conversation data
Conversation data are transcripts of either voice or chat conversations with personally identifiable information redacted. Upload at least 2,000 conversations for each business unit or call center.
You can also upload audio recordings of voice conversations. For best results, record audio using the following specifications:
- Two channels
- 16,000 Hz sampling rate (or 8,000-48,000 Hz)
- Lossless encoding: FLAC or LINEAR16
- Lossless encoding for WAV audio files: LINEAR16 or MULAW
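If you want to sanity-check recordings against these specifications before uploading, a minimal Python sketch such as the following can help. It assumes the third-party soundfile library; the file name is a placeholder, and this check is not part of Quality AI itself.

```python
# Minimal pre-upload check (an illustration, not part of Quality AI):
# flags recordings that don't match the recommended specifications.
import soundfile as sf  # pip install soundfile

LOSSLESS_WAV_SUBTYPES = {"PCM_16", "ULAW"}  # LINEAR16 or MULAW

def check_recording(path: str) -> list[str]:
    """Return warnings for any recommended spec the file doesn't meet."""
    info = sf.info(path)
    warnings = []
    if info.channels != 2:
        warnings.append(f"expected 2 channels, found {info.channels}")
    if not 8_000 <= info.samplerate <= 48_000:
        warnings.append(f"sampling rate {info.samplerate} Hz is out of range")
    elif info.samplerate != 16_000:
        warnings.append(f"{info.samplerate} Hz is accepted, but 16,000 Hz is recommended")
    if info.format != "FLAC" and not (
        info.format == "WAV" and info.subtype in LOSSLESS_WAV_SUBTYPES
    ):
        warnings.append(f"encoding {info.format}/{info.subtype} may be lossy")
    return warnings

print(check_recording("call_recording.wav"))  # hypothetical file name
```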
The metadata for audio recordings of a voice call should include the following information:
- Channel labels to identify the agent and customer
- Agent ID, name, location, team, and CSAT
- Audio language as a BCP-47 language tag, such as en-US
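As a concrete illustration, the metadata for one recording might look like the following. The field names here are hypothetical, not a documented Quality AI schema; use the names your ingestion pipeline requires.

```python
# Hypothetical metadata for one voice recording; field names are
# illustrative only.
call_metadata = {
    "channel_labels": {"1": "AGENT", "2": "CUSTOMER"},  # who is on each channel
    "agent_id": "agent_1234",
    "agent_name": "Alex Doe",
    "agent_location": "Atlanta",
    "agent_team": "Returns",
    "agent_csat": 4.6,            # agent's CSAT score
    "language_code": "en-US",     # BCP-47 language tag
}
```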
Questions
Within each scorecard, the questions and instructions for answering them provide valuable information for Quality AI to evaluate conversations and agent performance. To maximize the accuracy of automatic evaluations, write questions and instructions with the following concepts in mind:
- Clarity: Write questions that are clear and that a human can readily understand.
- Specificity: Add answer options and instructions that are as specific as possible.
- Details: Include instructions that provide enough details for a human to confidently and reliably evaluate the conversations.
- Examples: Quality AI is even more accurate if you provide examples from real conversations that illustrate each answer to your questions.
Questions can take a variety of forms. The following are some useful question templates:
- "Did the agent…?" with a specific action. This format indicates that the evaluator must look for something the agent said.
- "Did the customer…?" with a specific action. This format indicates that the evaluator must look for something the customer said.
- Questions beginning with question words, such as "what" or "why", encourage evaluation of the whole conversation.
Questions with multiple answers
Users often write questions with only yes and no answers. However, a question might not apply to the conversation, which warrants N/A.
Alternatively, the question could be interpreted as yes or no in a variety of circumstances, which leads to inconsistent responses with only two options. Including questions that require other types of answers gives the AI model a greater depth of understanding of the conversation.
Acoustic analysis
Quality AI evaluates conversation transcripts and cannot perform acoustic analysis. Exclude questions that require acoustic analysis. For example, neither a person nor Quality AI can answer the question "Did the agent use a greeting with an upbeat tone of voice?" solely by reading a transcript of the conversation.
Tags
The optional tag groups related questions into a smaller category. For a single conversation, Quality AI calculates an overall conversation score. You can group questions using one of three tags: business, customer, or compliance. For each tag, Quality AI also calculates a score that includes only the questions with that tag applied.
Instructions
Instructions define how each answer is interpreted, so they must be specific and leave no room for interpretation. Precise definitions ensure that each evaluation of a conversation produces the same answer.
Format
Include a brief description of the question's purpose followed by a description of the criteria for each possible answer choice. This means you must define the precise circumstance in which someone would give each answer choice.
For example, the following instructions apply to a yes/no question that asks, "Did the agent address the customer's primary concern before cross-selling?"
Instructions:
The purpose of this question is to understand if the agent addressed the customer's primary concern before trying to sell an additional product. This creates a more positive experience for our brand.
Score "Yes" if the agent resolved the primary issue and then attempted sales. Example: "I just updated your account information. I see that you marked your smart home device as broken. Would you like to order a replacement?"
Score "No" if the agent tried to sell a product before resolving the primary issue. Example: "Before I update your account information, I see that you bought a laptop from us five years ago. Do you want to try our new model?"
Score "N/A" if there was no sales attempt.
Answer types
The answer type depends on the structure of the question. This section provides suggestions to help you get started, but not an exhaustive list of uses.
Yes/No
Yes/No is the most common answer type because you can quickly evaluate these questions, and the answers are often more intuitive than other answer types. Questions that benefit from a yes/no answer type often begin with "Did..." and ask if a specific action took place. These questions can also be written as true or false questions.
In example conversations, yes/no answers are recorded as a true or false value with the following formats:
- A Yes answer is `true`.
- A No answer is `false`.
Numbers
Numerical answers are useful for questions that ask for a count of something, a dollar amount, or ask you to rate something on a scale. Questions that benefit from this answer type often begin with "How many...", "How much...", or "On a scale of..." and ask you to determine a single answer.
In example conversations, numeric answers have the following format:
- An answer of 40.5 is `40.5`.
Text
Text answers require the most work from a human annotator. Questions that benefit from text answers often begin with question words, such as "What..." or "Why...", and often require evaluation of the conversation as a whole. Text answers encourage more variety in the responses, so the instructions must clearly explain how to interpret the question and when to assign each answer choice.
In example conversations, text answers have the following format:
- An answer of Concluded is `"CONCLUDED"`.
Assign scores
When you create a question, you can assign a numerical score to each answer choice. These scores represent the importance of each answer choice for the overall conversation score calculation.
A useful range for answer choice scores is 0-10. This range provides some variation for specificity and is comparable to a percentage. An answer choice with a score of 0 has no effect on the conversation score calculation. An answer choice with a score of 10 has the most impact on the conversation score. In other words, the answer choice with a 10 raises the conversation score more than any answer choice with a lower score. An answer choice with a score of 5 raises the conversation score by half as much as the answer with 10.
N/A
Click the checkbox to enable N/A as an answer choice when a question doesn't apply to a conversation. When Quality AI chooses N/A as the answer, the question is removed from the conversation score calculation.
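The following Python sketch illustrates this intuition with a simplified calculation. It assumes the conversation score is the percentage of points earned out of points possible, with N/A answers excluded; the exact formula Quality AI uses may differ.

```python
# Illustrative only: a simplified conversation-score calculation.
# Assumes score = points earned / points possible, with N/A answers
# excluded from the calculation, as described above. This is not
# necessarily the exact formula Quality AI uses.

def conversation_score(answers: list[dict]) -> float:
    """answers: one dict per question, e.g.
    {"earned": 5, "max": 10, "na": False}."""
    scored = [a for a in answers if not a["na"]]  # N/A drops the question
    if not scored:
        return 0.0
    earned = sum(a["earned"] for a in scored)
    possible = sum(a["max"] for a in scored)
    return 100.0 * earned / possible

answers = [
    {"earned": 10, "max": 10, "na": False},  # best answer choice chosen
    {"earned": 5,  "max": 10, "na": False},  # raises the score half as much
    {"earned": 0,  "max": 10, "na": True},   # N/A: excluded entirely
]
print(f"{conversation_score(answers):.1f}%")  # 75.0%
```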
Example scorecard inputs
The following examples illustrate how to add all the information needed for a useful scorecard. Each scorecard requires the following information:
- Any questions about the conversation.
- Instructions for interpreting the question and defining each answer choice.
- Answer type (can be text, number, or yes/no).
- Answer choices that define the possible answers based on answer type (can be yes and no, a list of numbers, or some text responses).
- Score to set the points earned for each answer choice. The maximum score for a single question is determined by the highest score among all the answer choices.
You can include the following to help organize the questions on your scorecard, but they aren't required:
- Tag to group the questions into categories (can be business, customer, or compliance).
Example 1
- Question: What was the outcome of the conversation?
- Tag: Customer
Instructions: The goal of any conversation is to reach a resolution or outcome that falls into one of four possible categories: concluded, transferred, redirected, or escalated.
Concluded conversations are those that have been successfully resolved and don't require any further action. The customer's issue has been addressed, and the conversation has been concluded.
Transferred conversations are those that need to be handled by a different department or agent. The customer might have been transferred to a specialist who can better assist them with their issue.
Redirected conversations are those that need to be handled by a different channel. For example, a customer might have been redirected from a phone call to an online chat session.
Escalated conversations are those that require the involvement of a manager or supervisor. The customer might have been escalated due to the severity of their issue or because they are not satisfied with the resolution offered by the initial agent.
Answer type: Text
Answer choices and scores:
Answer choice | Score |
---|---|
Concluded | 1 |
Transferred | 1 |
Redirected | 1 |
Escalated | 0 |
Add N/A as an answer choice. If selected, the question won't be included in the total score calculation.
Example 2
- Question: On a scale of 0-5, how effective was the communication between the agent and the customer?
- Tag: Business, Compliance, Customer
Instructions: Scale and Criteria
0, Extremely Poor: No communication or complete misunderstanding. Offensive, abusive, or harmful language. Total lack of respect or empathy.
1, Very Poor: Significant communication difficulties. Frequent interruptions or talking over each other. Minimal effort to understand or connect. Dismissive or disrespectful behavior.
2, Poor: Some communication challenges. Occasional misunderstandings or lack of clarity. Limited engagement or interest. Occasional disrespect or insensitivity.
3, Average: Basic communication achieved. Some effort to understand and be understood. Moderate level of engagement and connection. Generally respectful, but with room for improvement.
4, Good: Clear and effective communication. Active listening and understanding. Meaningful engagement and connection. Mutual respect and empathy demonstrated.
5, Excellent: Exceptional communication and understanding. Deep engagement and connection. Strong sense of collaboration and mutual support. High levels of respect, empathy, and compassion.
Factors to consider when evaluating:
Clarity: Was the communication clear and easy to understand?
Understanding: Did participants demonstrate active listening and understanding of each other's perspectives?
Engagement: Were participants actively engaged in the conversation and interested in what others had to say?
Respect: Was there mutual respect and consideration shown throughout the conversation?
Empathy: Did participants demonstrate empathy and understanding of each other's feelings?
Collaboration: Was there a sense of collaboration and teamwork, or did participants feel like they were competing against each other?
Outcome: Did the conversation achieve its intended goals or lead to a positive outcome?
Remember: Context matters. Consider the context and purpose of the conversation. What might be appropriate in one setting might not be in another.
Subjectivity: Evaluation can be subjective. Different people might have slightly different interpretations of the same conversation.
Focus on improvement: Use evaluations as a tool for learning and improvement rather than just a way to judge or criticize.
This framework provides a basic guide for evaluating conversations, but you can adapt and adjust the criteria based on your specific needs and goals.
Answer type: Number
Answer choices and scores:
Answer choice | Score |
---|---|
0 | 0 |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
Add N/A as an answer choice. If selected, the question won't be included in the total score calculation.
Example 3
- Question: Did the representative (agent) greet the customer with a proper opening?
- Tag: Customer
- Instructions: The representative (agent) should always start the conversation with a proper opening and greeting. This serves as a crucial step in establishing a positive and professional rapport with the customer. The opening should be warm, friendly, and welcoming, setting a tone that makes the customer feel valued and respected. The representative (agent) should also ensure that the greeting is appropriate for the context and the customer's cultural background. By starting the conversation with a proper opening and greeting, the representative can create a positive first impression, build rapport, and lay the foundation for a successful interaction with the customer.
- Answer type: Yes/No
Answer choices and scores:
Answer choice Score "Yes" 1 "No" 0
Add N/A as an answer choice. If selected, the question won't be included in the total score calculation.
Add example conversations
Example conversations are useful for clarifying question interpretation. Calibrating and customizing the AI model requires example conversations with answers assigned for each question. The AI model learns from real conversation data, so take examples from your existing conversations in Conversational Insights. If you don't provide any example conversations, Quality AI uses a foundational model, which doesn't know the expected answers for your questions.
To improve the AI model's performance, include at least the following:
- 100 example conversations per question
- 40 example conversations per answer choice
If you provide fewer than 100 example conversations for a single question, the AI model won't learn how to accurately score that specific question. Your example conversations are stored, and the model begins learning once you have provided enough of them. A single conversation can teach the model how to score multiple questions, and you can further improve scoring accuracy for any question by adding more example conversations.
For each question in your scorecard, include conversations that illustrate each answer choice, split across the possible answers. The following example shows how many conversations you might include to illustrate two possible answer choices; this specific split isn't required.
If a question on a scorecard is "Did the agent exhibit empathy towards the customer?" and the response to that question can be yes or no, include both of the following:
Question | Possible answers | Share of conversations |
---|---|---|
Did the agent exhibit empathy towards the customer? | "Yes" | 75% |
Did the agent exhibit empathy towards the customer? | "No" | 25% |
Example conversation format
Example conversations must, at a minimum, include identifiers for each conversation, scorecard, and question, as well as the expected answer. Your example conversations can also include the answer choices, scores, and instructions.
CSV
You must upload example conversations in a CSV file. The first line of your CSV file must be the header, and the file must contain the following columns:
- `ConversationId`
- `QaScorecardId`
- `QaQuestionId`
- `QaAnswerLabel`, or individual fields such as `QaAnswerScore` and `QaAnswerValue`
Quality AI can automatically create an example conversation template with the preceding IDs filled in. You can choose which scorecard to use for your example conversations and filter the template to include only some of your conversations. For instructions on creating a template and uploading example conversations, see the Quality AI setup guide.
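If you build the file by hand instead of generating a template, a minimal sketch like the following writes a CSV with the required header. The IDs and file name are placeholders.

```python
# Sketch: write an example-conversation CSV with the required header.
# The IDs and file name below are placeholders.
header = "ConversationId,QaScorecardId,QaQuestionId,QaAnswerValue"
rows = [
    'convo_id,scorecard_test_id,question_id_q3,"NO"',  # text answer
    "convo_id,scorecard_test_id,question_id_q6,true",  # yes/no answer
]
with open("example_conversations.csv", "w") as f:
    f.write("\n".join([header, *rows]) + "\n")
```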
CSV example conversation files can have a variety of formats. Yes/no answers correspond to a `true` or `false` value, numbers remain as-is, and text answers are surrounded by quotation marks. This means that `true` is interpreted as a yes/no answer type with Yes as the selected answer choice, whereas `"Yes"` is interpreted as a text answer type with Yes as the selected answer choice. The following examples illustrate some possible CSV formats.
- The individual header `QaAnswerValue` does not assign a score:

```
ConversationId,QaScorecardId,QaQuestionId,QaAnswerValue
convo_id,scorecard_test_id,question_id_q3,"NO"
convo_id,scorecard_test_id,question_id_q6,"YES"
convo_id,scorecard_test_id,question_id_q6,true
convo_id,scorecard_test_id,question_id_q6,false
convo_id,scorecard_test_id,question_id_q6,40.5
```
- Includes both `QaAnswerValue` and `QaAnswerScore` headers:

```
ConversationId,QaScorecardId,QaQuestionId,QaAnswerValue,QaAnswerScore
convo_id,scorecard_test_id,question_id_q3,"NO",score: 1.0
convo_id,scorecard_test_id,question_id_q6,"YES",score: 1.0
```
- The `QaAnswerLabel` header encompasses both a score and an answer but does not separate them with a comma:

```
ConversationId,QaScorecardId,QaQuestionId,QaAnswerLabel
convo_id,scorecard_test_id,question_id_q3,score: 1.0 "NO"
convo_id,scorecard_test_id,question_id_q6,score: 0.5 40.5
convo_id,scorecard_test_id,question_id_q6,na_value:true
convo_id,scorecard_test_id,question_id_q3,true
```
Table
Within a spreadsheet, the visual format for your example conversations is a table in which each row identifies a single answer and each column contains a separate identifier, as shown in the following table:
Conversation ID | Scorecard ID | Question ID | Answer |
---|---|---|---|
44748735396 | 5727080762913918243 | 4097398336657302301 | "YES" |
44748735396 | 5727080762913918243 | 3576133206121890384 | "NO" |
3495523396 | 5727080762913918243 | 4097398336657302301 | "YES" |
3495523396 | 5727080762913918243 | 3576133206121890384 | "NO" |
Evaluating a conversation
Human annotators use scorecard questions and instructions to manually evaluate conversations and determine the correct answers to each question in example conversations. When multiple people evaluate the same conversation, they sometimes provide different answers to each question. This inconsistency between evaluations introduces noise and confusion to the machine learning process. Within a conversation, if the same or a similar question is associated with multiple different answers, Quality AI has no way to learn the mapping between questions and answers.
Any of the following can cause inconsistency when multiple people answer the same questions for a single conversation:
- Subjective questions that lead to different interpretations between annotators.
- Rubrics with insufficient details or unclear guidelines.
- Different versions of a question, answer options, or instructions, for example:
- You might begin with only yes/no answer options and later change to a more fine-grained approach with no-a, no-b, and no-c options.
- Combining examples from the yes/no approach with the no-a, no-b, and no-c options confuses the model.
- An evaluation task that requires a large cognitive load.
Measure consistency
To measure consistency in your example conversations, ask multiple annotators to independently evaluate the same conversation. Then compute agreement between them using the Cohen's kappa coefficient, as shown in the sketch after this list. Aim for a Cohen's kappa coefficient of at least 0.2. If consistency is low, try one of the following options:
- Refine the question and instructions to provide less room for interpretation.
- Communicate between annotators so they can resolve discrepancies and agree on a single grading standard.
- Continuously monitor consistency among annotators.
- Provide additional training to annotators whose answers frequently differ from the grading standard.
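A minimal sketch of the agreement computation, assuming you have labels for one question from two annotators over the same conversations (scikit-learn provides cohen_kappa_score):

```python
# Sketch: measure inter-annotator agreement on one question with
# Cohen's kappa. Labels here are made-up illustrations.
from sklearn.metrics import cohen_kappa_score  # pip install scikit-learn

# One label per conversation, in the same order for both annotators.
annotator_a = ["YES", "YES", "NO", "NA", "YES", "NO"]
annotator_b = ["YES", "NO",  "NO", "NA", "YES", "YES"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
if kappa < 0.2:
    print("Low agreement: refine the question and instructions.")
```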