Ten Criteria for Measuring Effective Voice User Interfaces


November/December 2005
By James A. Larson et al.

A Toolkit of Metrics for Evaluating VUIs

Investors use standard metrics such as stock price and projected revenue per share to choose investment opportunities. Likewise, consumers use standard metrics such as floor space, number of bedrooms, or number of bathrooms when purchasing houses. This paper presents a toolkit of specific metrics for evaluating voice user interfaces (VUIs). The speech industry should use criteria from this toolkit to:

- Judge which of several VUIs for the same application, from competing vendors, is the most efficient.
- Determine whether a change to a VUI is worthwhile by comparing metrics from before the change with metrics from after the change.
- Avoid misunderstandings about the meaning of frequently used criteria such as “ease of use” and “completion rate” by carefully defining those criteria.

The toolkit contains 10 metrics for evaluating VUIs, categorized into two classes: subjective and objective.

Subjective Metrics

Caller opinions matter! If a VUI presents a poor experience, callers will not use it. If callers have a good experience, they will be more likely to use the VUI again and again, and will be more “forgiving” if they experience problems in the future.1 It is very important that user interface experts use post-call surveys and questionnaires to collect callers’ subjective opinions about a VUI. While we often think of subjective metrics as “fuzzy,” they become objective data when a statistically significant number of properly chosen individuals is surveyed. A Likert scale2 is often used in questionnaires and surveys. For each item on the questionnaire, respondents specify their level of agreement with a statement such as “The voice was understandable.” Callers respond using a five-point Likert scale:

1. Strongly disagree
2. Disagree
3. Neither agree nor disagree
4. Agree
5. Strongly agree

Contributing Authors

Jonathan Bloom, ScanSoft
Juan E. Gilbert, Auburn University
Tom Houwing, VoiceObjects
Susan Hura, Intervoice
Sunil Issar, Convergys Corporation
Lizanne Kaiser, Genesys Telecommunications Laboratories
James A. Larson, Intel (organizer and editor)
David Leppik, Vocal Laboratories
Stephen Mailey, Voice Partners
Amir Mané, Voice Advantage
Frances McTernan, Nortel
Michael McTear, University of Ulster
Steve Pollock, TuVox
Phil Shinn, Genesys Telecommunications Laboratories
Lisa Stifelman, Tellme Networks
Dale-Marie Wilson, Auburn University

This five-point Likert scale may be used to solicit caller input on subjective3 criteria such as the following:

1. Caller satisfaction
2. Ease of use
3. Quality of audio output
4. Perceived first-call resolution rate

The mean score of a large number of subjects represents callers’ subjective evaluation of the VUI.
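As an illustration of how such ratings might be aggregated, the sketch below computes a mean score per statement. The statements come from the tables in this article, but the response data are hypothetical.

    # Sketch: aggregating five-point Likert responses into a mean score per statement.
    # The response data below are hypothetical.
    from statistics import mean

    responses = {
        "My expectations were satisfied during this call": [4, 5, 3, 4, 2, 5, 4],
        "The application is easy to use": [3, 4, 4, 2, 5, 3, 4],
        "The voice was understandable": [5, 5, 4, 4, 5, 3, 5],
    }

    for statement, scores in responses.items():
        # Mean of a 1-5 Likert scale; higher is better.
        print(f"{statement}: mean = {mean(scores):.2f} (n = {len(scores)})")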

Three Over-arching Attributes

To gauge the health of a voice user interface, VUI designers consider three over-arching attributes:

- Effectiveness — a measure of whether callers can complete their tasks.
- Efficiency — a measure of the amount of time and effort required to complete tasks.
- Caller satisfaction — a measure of how callers perceive the quality of their interaction with the VUI.

These attributes are intertwined, and a change in any one of them can potentially affect the other two. Adding copious instructions at every prompt may increase the effectiveness of a VUI, but at the expense of efficiency and satisfaction. If the option to transfer to a live agent is offered at every dialog state, callers will be more satisfied, but the VUI becomes less effective and potentially less efficient. Organizations deploying speech applications need to recognize these tradeoffs and prioritize the goals of the application at the very beginning of a new speech project. Both efficiency and effectiveness can be measured objectively,8 but caller satisfaction is purely subjective and based upon the opinions of callers.


1. Caller Satisfaction

Caller satisfaction measures the degree to which the VUI meets the caller’s expectations. This metric is widely used, but requires some interpretation. Satisfaction does not correlate perfectly with task completion. For example, satisfaction is relative to caller expectations: mediocre service may be expected from Yugo, but not Mercedes-Benz.


Criteria: Caller satisfaction
Definition: The degree to which the VUI meets callers’ expectations
Calculation: Callers rate the VUI using a Likert scale with the statement: “My expectations were satisfied during this call”

2. Ease of Use

Ease of use measures callers’ perceptions of using the application with little or no training. Ease of use depends on many factors including navigation, intelligibility, effectiveness, and error recovery. Related criteria include intuitiveness of choices, ability of the IVR to satisfy callers’ needs, and providing help when needed.

Criteria: Ease of use
Definition: Callers’ perceptions of using the application
Calculation: Callers rate the application using a Likert scale with the statement: “The application is easy to use”

3. Quality of Output

System audio can be divided into speech and non-speech elements. Speech covers the quality of the spoken system output, including synthesized speech and pre-recorded audio.

Criteria: Voice intelligibility
Definition: Callers’ subjective ratings of voice intelligibility
Calculation: Callers rate the voice using a Likert scale with the statement: “The voice was understandable”

Criteria: Voice quality
Definition: Callers’ subjective ratings of voice quality
Calculation: Callers rate the voice using a Likert scale with the statement: “The voice sounded good”

Non-speech audio includes earcons—sounds that convey a message (e.g., a ticking clock indicating that the computer is busy or a doorbell when a message has arrived)—and audio logos—music or sound effects used for branding (e.g., the “bong” heard when first connected to AT&T or the four tones for Intel Inside heard in commercials about Intel chips).

Criteria: Earcon recognition
Definition: Callers’ recognition of the message (semantics) associated with the earcon
Calculation: Callers rate the earcon using a Likert scale with a statement such as: “I understood what the non-verbal sounds (sound effects) signified”

Criteria: Audio logo recognition
Definition: Callers’ recognition of the brand associated with the audio logo
Calculation: Callers are able to recognize the brand associated with the audio logo

4. Perceived First-Call Resolution Rate

This subjective criterion measures whether callers accomplish their goals on the first call, including both the interaction with the VUI and possible interaction with a human agent. This criterion is also known as first customer service resolution, first-time final, and once-and-done. According to Nederlof and Anton, this metric is the most predictive criterion for positive customer satisfaction.4

Criteria: Perceived first-call resolution rate
Definition: Perceived successful completion rate on the first call, including both the VUI and possible interaction with a human agent
Calculation: Callers give a yes/no answer to the question: “Did you accomplish your goal?”
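As a minimal sketch, this rate can be computed as the proportion of “yes” answers; the survey responses below are hypothetical.

    # Sketch: perceived first-call resolution rate from yes/no survey answers.
    # The answers below are hypothetical.
    answers = ["yes", "yes", "no", "yes", "no", "yes"]  # "Did you accomplish your goal?"

    rate = answers.count("yes") / len(answers)
    print(f"perceived first-call resolution rate: {rate:.0%}")  # 67%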

Objective Metrics

Objective metrics are measurements of time or activity that do not involve subjective judgments by the callers. To capture objective data, VUI developers must (1) insert logging instructions at strategic points in the dialog code, (2) record the times of specific activities to a log file, and (3) summarize the recorded times using a scoring program that aggregates and calculates scores for a variety of objective metrics (a minimal logging sketch follows the list below). Objective criteria include:

5. Time-to-task
6. Task rate
7. Task completion time
8. Correct transfers
9. Abandonment rate
10. Containment rate
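The sketch below illustrates this logging approach under stated assumptions: the event names, the tab-separated log format, and the file name are hypothetical, not prescribed by the toolkit.

    # Sketch: writing timestamped dialog events to a log file so that a separate
    # scoring program can compute objective metrics. Event names, the tab-separated
    # format, and the file name are hypothetical.
    import time

    def log_event(log_file, call_id, event):
        # One line per event: timestamp (seconds), call identifier, event name.
        log_file.write(f"{time.time():.3f}\t{call_id}\t{event}\n")

    with open("vui_events.log", "a") as log:
        log_event(log, "call-001", "call_answered")
        log_event(log, "call-001", "task_start:account_balance")
        log_event(log, "call-001", "task_end:account_balance")
        log_event(log, "call-001", "call_ended")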

5. Time-to-Task

When customers call an airline, bank, or other business, they generally have a task in mind or a problem to solve. Time-to-task measures the amount of time it takes for a caller to begin the task he called about. Lengthy instructions, references to a Web site, untargeted marketing messages, or other irrelevant information at the top of a call delay callers from their tasks.

Criteria: Time-to-task
Definition: Time it takes from answering the call to the time the caller starts performing the desired task
Calculation: The time elapsed from the beginning of the call until the first prompt or relevant information is presented to the caller
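A rough sketch of this calculation, assuming the hypothetical event log from the previous example, where the first task_start event marks the caller beginning a task:

    # Sketch: time-to-task per call, computed from the hypothetical event log above.
    answered = {}
    first_task_start = {}

    with open("vui_events.log") as log:
        for line in log:
            timestamp, call_id, event = line.rstrip("\n").split("\t")
            t = float(timestamp)
            if event == "call_answered":
                answered[call_id] = t
            elif event.startswith("task_start:") and call_id not in first_task_start:
                first_task_start[call_id] = t

    for call_id, start in first_task_start.items():
        if call_id in answered:
            print(f"{call_id}: time-to-task = {start - answered[call_id]:.1f} seconds")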

6. Task Rate

Callers typically rate their automated experience quite high if they are able to accomplish their task. An automated application can be divided into different conceptual tasks (e.g., authentication, account balance, forms request, payment locations), and each task can be flagged with start- and end-points. Task rate comprises two related measurements:

1. Task Initiation Rate (TIR) — Appropriate for evaluating all types of tasks. For informational tasks, TIR is the more suitable measurement because there is a clear starting point, but not necessarily a well-defined end-point. For instance, if callers request a summary of insurance benefits and hang up or opt out before hearing the full summary, it is uncertain whether they received the information they were looking for.
2. Task Completion Rate (TCR, also known as transaction completion rate) — Appropriate for transactional tasks, which have clearly defined end-points (e.g., changing an address, transferring funds).

Criteria: Task Initiation Rate (TIR)
Definition: Percentage of calls that trigger a specific task start-point
Calculation: Number of times a specific task start-point is triggered divided by the number of calls

Criteria: Task Completion Rate (TCR)
Definition: Percentage of calls that trigger a specific task end-point
Calculation: Number of times a specific task end-point is triggered divided by the number of calls where this task was initiated
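A minimal sketch of both rates, again assuming the hypothetical task_start/task_end log events used earlier; the sample data are invented.

    # Sketch: Task Initiation Rate and Task Completion Rate for one task, assuming
    # the hypothetical task_start/task_end log events sketched earlier.
    def task_rates(events, task):
        # events: iterable of (call_id, event_name) tuples covering all calls.
        calls, initiations, initiated_calls, completions = set(), 0, set(), 0
        for call_id, event in events:
            calls.add(call_id)
            if event == f"task_start:{task}":
                initiations += 1
                initiated_calls.add(call_id)
            elif event == f"task_end:{task}":
                completions += 1
        tir = initiations / len(calls) if calls else 0.0
        tcr = completions / len(initiated_calls) if initiated_calls else 0.0
        return tir, tcr

    # Invented example: three calls, two start the task, one completes it.
    events = [
        ("call-001", "call_answered"), ("call-001", "task_start:account_balance"),
        ("call-001", "task_end:account_balance"),
        ("call-002", "call_answered"), ("call-002", "task_start:account_balance"),
        ("call-003", "call_answered"),
    ]
    print(task_rates(events, "account_balance"))  # TIR = 2/3, TCR = 1/2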

Other Important VUI Criteria

The workshop attendees felt that the following criteria are important, but they do not lend themselves easily to a metric that can measure their successful application:

- Cognitive Load — The mental effort required to use the VUI should not exceed the mental capabilities of callers.
- Branding — The VUI should build the brand value of a company in the mind of the caller.
- Perceived Affordance — The degree to which the system effectively communicates to the caller how it may be used: what functionality the system offers and what the caller must say to have the system perform that functionality.
- Error Handling — The successful application of strategies for recovering from problems that occur during human-computer interactions.

7. Task Completion Time

Task completion time (also known as transaction duration) measures the time a caller takes to complete a specific task. Generally, a shorter task completion time is desirable for the caller and for the service provider. Gupta and Gilbert have recommended two target task completion times5:

1. Maximum Task Completion Time — The maximum acceptable duration for a task in a specific application. The time taken for callers to complete the task can be compared against this metric.
2. Expected Task Completion Time — The time taken by expert callers to complete a task in a specific application. This can be used as the basis for comparison with the time taken by all callers of the voice user interface. Over time, this metric can be adjusted to reflect the task completion time of typical callers using the interface.

Task completion time is highly correlated with TIR minus TCR.

Criteria: Task completion time
Definition: Time to complete a specific task
Calculation: Time between the start of a specific task and the end of the same task
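As a sketch, measured completion times for a task could be compared against these two targets; all numbers below are hypothetical.

    # Sketch: comparing measured task completion times (in seconds) against the two
    # target times described above. All numbers here are hypothetical.
    MAX_COMPLETION_TIME = 90.0       # maximum acceptable duration for this task
    EXPECTED_COMPLETION_TIME = 45.0  # time taken by expert callers

    measured = [38.2, 51.0, 47.5, 112.4, 63.9]  # completion times observed for callers

    average = sum(measured) / len(measured)
    over_max = [t for t in measured if t > MAX_COMPLETION_TIME]
    print(f"average completion time: {average:.1f} s "
          f"(expected {EXPECTED_COMPLETION_TIME:.0f} s, maximum {MAX_COMPLETION_TIME:.0f} s)")
    print(f"{len(over_max)} of {len(measured)} callers exceeded the maximum target")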

8. Correct Transfer Rate

Callers may be redirected from an automated system to a live agent if either (1) the caller cannot proceed with the automated dialog or (2) the caller requests to be transferred. If a call is misrouted, the agent must redirect it, delaying the caller’s task and increasing costs.

Criteria: Correct transfer rate
Definition: Number of calls successfully transferred to the correct party
Calculation: Divide the number of correctly routed calls by the number of routed calls

Twice Is Not Better Than Once

It is critical for human agents to be able to access information previously provided to the automated system. Both callers and live agents waste time when the caller must provide the same information twice. This impacts agent costs, frustrates callers, and may decrease adoption of the automated system.

Conversely, the VUI should be able to provide callers with information about the live agents, including the estimated waiting time until a live agent is available. This suggests another metric: the caller should be able to access information about the availability of a live agent.

9. Abandonment Rate

Abandonment rate has traditionally been used in call centers to determine the percentage of callers who hang up while waiting in queue to speak with an agent. Similarly, this measurement can be applied to VUIs; namely, the percentage of callers who hang up before carrying out a task in an automated system.6 Abandonment rates will be higher in situations where there are frequently misdialed calls or where the introduction asks callers for information they may not currently have (e.g., an account number), causing callers to hang up to find that information before calling back. To make this metric precise, it should be associated with a specific task rather than the entire telephone call.

Criteria: Abandonment rate
Definition: Percentage of callers who hang up before carrying out a specific task in an automated system
Calculation: Number of callers who hang up before completing a specific task divided by the total number of callers beginning the task
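A minimal sketch of this calculation, assuming the hypothetical log events used earlier plus an invented "caller_hung_up" event written when the caller disconnects:

    # Sketch: abandonment rate for one task. A call counts as abandoned if the task
    # was started, never ended, and the caller hung up. The "caller_hung_up" event
    # is a hypothetical addition to the log format sketched earlier.
    def abandonment_rate(events, task):
        # events: iterable of (call_id, event_name) tuples.
        started, finished, hung_up = set(), set(), set()
        for call_id, event in events:
            if event == f"task_start:{task}":
                started.add(call_id)
            elif event == f"task_end:{task}":
                finished.add(call_id)
            elif event == "caller_hung_up":
                hung_up.add(call_id)
        abandoned = (started - finished) & hung_up
        return len(abandoned) / len(started) if started else 0.0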

10. Containment Rate

Since a common objective of voice applications is to reduce call center costs, automation success is frequently assessed in terms of containment—the percentage of calls not transferred to human agents. However, concentrating on containment rates may result in an application design that blocks or hides the exit, so callers cannot access a human agent easily. This can have disastrous consequences because callers quickly learn alternative ways of escaping from the automated system and transferring to a human agent (e.g., pressing keys randomly, “playing possum” until the system transfers them). This also adversely impacts customer satisfaction, with callers spending valuable minutes venting their displeasure to the human agent. Based on data from 60 studies conducted by Vocal Laboratories, Inc., difficulty reaching an agent accounted for 61 percent of the variance in caller satisfaction levels and 49 percent of the variance in first-call resolution rates.7 Companies that made it harder to reach a human saw much lower first-call completion rates and more repeat calls, as well as frustrated callers. While a high containment rate is desirable, this goal should not prevent callers with complex and difficult requests from connecting to a human agent.

Containment is often gauged using the reverse measurement, opt-out rate (i.e., the percentage of calls that transfer to an agent). Some VUI specialists count callers rather than calls, treating multiple calls from a single person as a single event. This accounts for situations in which the caller gets lost and redials in order to start over, or hangs up to find additional information before redialing.

Criteria: Containment rate
Definition: Percentage of calls not transferred to human agents
Calculation: Calls completed within the IVR divided by the total number of calls
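The sketch below computes containment per call and per caller, using an invented caller identifier to group repeat calls; the data are hypothetical.

    # Sketch: containment rate computed per call and per caller. Each record is
    # (caller_id, contained); the caller identifiers and data are hypothetical.
    calls = [
        ("555-0101", True),   # completed within the IVR
        ("555-0102", False),  # transferred to an agent
        ("555-0103", True),
        ("555-0103", False),  # same caller redials and ends up with an agent
    ]

    per_call = sum(contained for _, contained in calls) / len(calls)

    # Per caller: treat multiple calls from one person as a single event; a caller
    # counts as contained only if none of their calls reached an agent.
    by_caller = {}
    for caller, contained in calls:
        by_caller[caller] = by_caller.get(caller, True) and contained
    per_caller = sum(by_caller.values()) / len(by_caller)

    print(f"containment per call:   {per_call:.0%}")    # 50%
    print(f"containment per caller: {per_caller:.0%}")  # 33%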

This report focuses on 10 widely used, measurable VUI metrics. The authors have agreed upon the definitions and calculations presented in this toolkit. Apply the criteria from this toolkit to measure your VUIs, both to see whether changes actually improve your VUIs and to compare your VUI with similar VUIs. By using the same terminology and performing the same calculations, practitioners will reduce the number of misunderstandings and misrepresentations in the speech industry.

References:

1. Norman, Donald A. 2004. Emotional Design: Why We Love (Or Hate) Everyday Things. New York, NY: Basic Books.
2. The Likert scale was named for Rensis Likert, who invented the scale in 1932.


3. A Likert scale may also be used to solicit non-subjective criteria from callers, such as “Did you complete the task?”
4. Nederlof, Ad & Jon Anton. 2002. Customer Obsession: Your Roadmap to Profitable CRM. Santa Maria, CA: The Anton Press, pp. 186–189.
5. Gupta, Priyanka and Juan Gilbert. “Usability Metrics for Spoken Language Systems.” International Journal of Speech Technology.
6. Abandonment rates can measure either hang-ups before a task has been initiated or hang-ups within a given task.
7. Leppik, Peter. 2005. “Does forcing callers to use self-service work?” http://www.vocalabs.com/resources/newsletter/newsletter22.html.
8. For call routers, effectiveness is measured according to whether the VUI routes to the appropriate destination, and efficiency is the speed with which the caller reaches the appropriate destination.

© Copyright 2005 AmComm Holdings LLC. All rights reserved. All product names contained herein are the trademarks of their respective holders.
