EYES-FREE TEXT ENTRY ON MOBILE DEVICES
HUSSAIN TINWALA
A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
GRADUATE PROGRAMME IN COMPUTER SCIENCE AND ENGINEERING YORK UNIVERSITY TORONTO, ONTARIO
NOVEMBER 2009
EYES-FREE TEXT ENTRY ON MOBILE DEVICES

by Hussain Tinwala

A thesis submitted to the Faculty of Graduate Studies of York University in partial fulfilment of the requirements for the degree of

MASTER OF SCIENCE

© 2009

Permission has been granted to: a) YORK UNIVERSITY LIBRARIES to lend or sell copies of this thesis in paper, microform or electronic formats, and b) LIBRARY AND ARCHIVES CANADA to reproduce, lend, distribute, or sell copies of this thesis anywhere in the world in microform, paper or electronic formats and to authorise or procure the reproduction, loan, distribution or sale of copies of this thesis anywhere in the world in microform, paper or electronic formats. The author reserves other publication rights, and neither the thesis nor extensive extracts from it may be printed or otherwise reproduced without the author's written permission.
EYES-FREE TEXT ENTRY ON MOBILE DEVICES
by Hussain Tinwala
By virtue of submitting this document electronically, the author certifies that this is a true electronic equivalent of the copy of the thesis approved by York University for the award of the degree. No alteration of the content has occurred and if there are any minor variations in formatting, they are as a result of the conversion to Adobe Acrobat format (or similar software application).
Examination Committee Members:
1. Doctor I. Scott MacKenzie
2. Doctor Wolfgang Stuerzlinger
3. Doctor Melanie Baljko
4. Professor Sandra Gabriele
ABSTRACT

This thesis explores eyes-free text entry solutions for mobile devices. For devices with physical buttons, a wheel-based technique is proposed, with a focus on visually impaired users. The technique, called LetterScroll, uses a mouse wheel to manoeuvre a cursor across a linear sequence of characters. Text input is accompanied by speech feedback. Similarly, for touch sensitive devices, a stroke-based input technique is presented. Entry progresses by inking Graffiti strokes with a finger. The strokes are accompanied by speech and non-speech sounds. Three user studies assess the value of both techniques. In particular, various aspects of the interaction are measured, such as speed and accuracy. The results suggest that eyes-free text entry is possible on both classes of hardware. Although throughput rates for LetterScroll are not as high as expected, they are within the range of those found for visually impaired persons. The stroke-based technique reached entry rates equal to those found for walk-up Graffiti use (7.0 wpm). Results from this evaluation were used to develop an enhanced version of the technique that incorporates an error correction algorithm. A follow-up study found entry rates averaging 10.0 wpm with an overall accuracy of 95.7%.
ACKNOWLEDGEMENTS

Many thanks to my supervisor, Dr. Scott MacKenzie, for providing his expertise and support, and mostly for his flexibility and understanding. Thanks also for funding me and the user studies, which made this research possible. I would also like to thank all my colleagues at the Interactive Systems Research Group for their valuable support and comments during the development of this work. Thanks to Dr. Wolfgang Stuerzlinger, Dr. Melanie Baljko, and Professor Sandra Gabriele for taking time out of their schedules to serve on my committee. Most importantly, I would like to thank my parents, Najam and Malika, for their patience, tolerance, love, and encouragement throughout my time as a graduate student.
TABLE OF CONTENTS
CHAPTER 1 Introduction ..... 15
  1.1 Thesis Objectives ..... 17
  1.2 Thesis Contributions ..... 18
  1.3 Thesis Outline ..... 19
CHAPTER 2 Previous and Related Work ..... 21
  2.1 Mobile Text Entry ..... 21
    2.1.1 Mobile Keypads ..... 21
    2.1.2 Common Text Entry Methods ..... 25
  2.2 Eyes-free Text Entry Methods ..... 27
    2.2.1 Physical Button Devices ..... 27
    2.2.2 Touch Sensitive Devices ..... 30
CHAPTER 3 Eyes-free Text Entry using a Wheel ..... 33
  3.1 LetterScroll Interaction Strategies ..... 34
    3.1.1 Navigation ..... 34
    3.1.2 Jumping ..... 35
    3.1.3 Character Selection and Word Advancement ..... 36
    3.1.4 Feedback ..... 37
    3.1.5 Reviewing Text ..... 37
  3.2 Modeling the Interaction Using KSPC ..... 38
    3.2.1 The Basic Approach ..... 39
    3.2.2 Jumping with Vowels ..... 41
  3.3 Method ..... 44
    3.3.1 Participants ..... 44
    3.3.2 Apparatus ..... 44
    3.3.3 Procedure ..... 45
    3.3.4 Design ..... 47
  3.4 Results and Discussion ..... 47
    3.4.1 Speed ..... 47
    3.4.2 Accuracy ..... 49
    3.4.3 Keystrokes per Character ..... 50
    3.4.4 Two-handed Interaction ..... 56
    3.4.5 Stimulus and Input Review ..... 57
    3.4.6 Limitations of the Apparatus ..... 58
    3.4.7 Participant Questionnaire ..... 59
    3.4.8 Summary ..... 59
CHAPTER 4 Eyes-free Text Entry on Touchscreen Devices ..... 61
  4.1 The Design Space ..... 63
  4.2 Finding the Right Mix ..... 65
  4.3 Design Considerations ..... 66
  4.4 Graffiti Input Using a Finger ..... 67
  4.5 Text Entry Interaction ..... 68
  4.6 Implementation ..... 70
    4.6.1 Hardware Infrastructure ..... 70
    4.6.2 Host and Device Software Architecture ..... 71
  4.7 Methodology ..... 73
    4.7.1 Participants ..... 73
    4.7.2 Apparatus ..... 74
    4.7.3 Procedure and Design ..... 74
  4.8 Results and Discussion ..... 77
    4.8.1 Speed ..... 77
    4.8.2 Accuracy and KSPC ..... 79
    4.8.3 Unrecognized Strokes and Corrections ..... 83
    4.8.4 Stroke Size ..... 85
    4.8.5 Informal Feedback and Questionnaire ..... 86
  4.9 Apparatus Limitations ..... 88
  4.10 Summary ..... 89
CHAPTER 5 Enhancing the Eyes-free Touchscreen Interaction ..... 91
  5.1 The Redesign Process ..... 91
    5.1.1 Issues Found ..... 91
    5.1.2 Improvements and Problems ..... 92
  5.2 Exploring the Enhancements ..... 93
    5.2.1 Interaction Modifications ..... 93
    5.2.2 Dictionary-based Error Correction ..... 95
    5.2.3 Tying it Together with an Auditory Display ..... 101
  5.3 Feedback Modes ..... 103
  5.4 Methodology ..... 104
    5.4.1 Participants ..... 104
    5.4.2 Apparatus ..... 105
    5.4.3 Procedure ..... 105
    5.4.4 Design ..... 106
  5.5 Results and Discussion ..... 106
    5.5.1 Entry Speed ..... 107
    5.5.2 Accuracy ..... 110
    5.5.3 KSPC Analysis ..... 113
    5.5.4 System Help ..... 115
    5.5.5 Word Level Analysis ..... 119
  5.6 Summary ..... 123
CHAPTER 6 Conclusion ..... 126
  6.1 Physical Button Devices ..... 126
    6.1.1 Future Work ..... 127
  6.2 Touch Sensitive Devices ..... 129
  6.3 Comparisons with Other Research ..... 132
Bibliography ..... 134
LIST OF TABLES
Table 3-1. Four variations of LetterScroll. ..... 39
Table 3-2. KSPW for each interaction by method. ..... 55
Table 3-3. Applying Guiard's model of bimanual control to Method #4. ..... 57
Table 4-1. Participant questionnaire. ..... 88
Table 5-1. Sample list of search words and results. ..... 101
Table 5-2. Potential error rate frequency. ..... 113
Table 5-3. Presented text compared with entered and corrected text. ..... 113
LIST OF FIGURES
Figure 2-1. An example of a standard 12-key mobile keypad. ..... 22
Figure 2-2. Mobile phone keypad variants. ..... 23
Figure 2-3. Two Qwerty keypad variants: SureType (left); Blackberry Bold (right). ..... 25
Figure 2-4. Touchscreen related interaction techniques. ..... 31
Figure 3-1. Conceptual model of the LetterScroll character set. ..... 35
Figure 3-2. Traversing the alphabet using the scroll wheel. ..... 35
Figure 3-3. A mapping of left hand fingers to keys representing vowels (not to scale). ..... 42
Figure 3-4. The user interface in the experiment (scaled down). ..... 45
Figure 3-5. Experimental apparatus used for LetterScroll. ..... 46
Figure 3-6. Results for text entry speed (wpm) by method; three phrases constitute one block. ..... 48
Figure 3-7. Modeled and observed KSPC by method. ..... 51
Figure 3-8. Keystroke distribution by method. (a) Method #1 (b) Method #4. ..... 52
Figure 3-9. Modeled and observed KSPW by method. ..... 54
Figure 4-1. Exploring the design space. Is eyes-free text entry possible on a touchscreen phone? ..... 64
Figure 4-2. The Graffiti alphabet. ..... 65
Figure 4-3. Text entry interface – the Graffiti alphabet is overlaid on the screen to aid learning. The translucent stroke map has been enhanced for clarity. ..... 68
Figure 4-4. A delete stroke is formed by a "left swipe" gesture. ..... 70
Figure 4-5. Hardware used. The iPhone and the host system communicated over a wireless link. ..... 71
Figure 4-6. The host interface. Participants were only allowed to see the zoomed in area of the interface. The black region recreates the participant's digitized ink as it is received. ..... 72
Figure 4-7. Ink trails of an unrecognized (left) and recognized (right) "G" stroke. ..... 76
Figure 4-8. Participants entering text on the iPhone eyes-on (left) and eyes-free (right). In eyes-free mode, participants could not see the iPhone display. ..... 76
Figure 4-9. Entry speed (wpm) by entry mode and block. ..... 78
Figure 4-10. MSD error rate (%) by block and entry mode. ..... 80
Figure 4-11. Keystrokes per character (KSPC) by block and entry mode. ..... 82
Figure 4-12. Unrecognized stroke frequency (%) by character. ..... 83
Figure 4-13. Stroke traces for the letter "O" for eyes-on (top two) and eyes-free (bottom two) modes. ..... 84
Figure 4-14. Amount of screen space used (%) for strokes by entry mode and block. ..... 86
Figure 5-1. Modified interface with word highlighting. ..... 95
Figure 5-2. Entry speed (wpm) by entry mode and block. ..... 107
Figure 5-3. Interaction line plot for entry speed (wpm). ..... 110
Figure 5-4. Corrected error rate (%) by entry mode and block. ..... 112
Figure 5-5. Keystrokes per character (KSPC) by block and entry mode. ..... 114
Figure 5-6. The life-cycle of transcribed text. Accepted text is the final result of text entry. ..... 116
Figure 5-7. System help (%) by entry mode and block. ..... 117
Figure 5-8. Raw and corrected error rates. ..... 118
Figure 5-9. Mean list size per word by block. ..... 120
Figure 5-10. Word position in candidate list by block and mode. ..... 121
Figure 5-11. Word position frequency by entry mode. ..... 122
Chapter 1  Introduction

The widespread use of computing devices over the past four decades has resulted in a plethora of text entry research. In its infancy, text entry research focused on word processing and document management for large computers. The advent of WIMP (Windows, Icons, Menus, and Pointers) shifted the community's focus to graphical user interfaces, object manipulation, and mouse input. Over the years, computing power increased in magnitude and decreased in physical size. This advancement fuelled the birth of a new communications-based industry that produced small and portable devices such as pagers and mobile phones. The smaller form factor was the main limiting factor for text entry. The search for efficient and usable text entry interfaces began once again as researchers and scientists explored alternative ways to enter text within these constraints. The term 'mobile phone' has been used extensively to refer to a range of different devices. In actuality, most of the devices this term refers to are much more than just a portable telephone. Even the most basic mobile phone provides support for SMS (Short Message Service), more popularly known as texting: sending a text message from one cellular device to another device or system. In addition to the standard voice and SMS functions, current mobile phones may support many additional services, such as email, online presence, instant messaging, web browsing, MMS (Multimedia Messaging
Service), Bluetooth connectivity, infrared, digital camera, media playback, and GPS. This broad range of functionality and connectivity instantly transforms a mobile phone into a device that serves its users in a variety of everyday tasks. As digital computing becomes ever more common, a large number of text-based tasks are shifting from paper to electronic channels. Mobile phones, PDAs, and multimedia Internet devices are a few of the many devices driving the shifts in the way we communicate and exchange information. As these devices become increasingly affordable, their popularity has risen and acceptance is widespread. Although the range of available services is vast, SMS is the single service that has acted as a catalyst in the growth of text input on mobile devices. Various text entry techniques have been implemented to support text entry on mobile devices (see Chapter 2 for an overview). However, one factor that is understudied is eyes-free text entry on these devices. In general, text entry is highly reliant on vision, and this decreases the mobility of the task. Furthermore, the heavy reliance on vision creates a barrier that makes this technology inaccessible to persons with visual impairments. Mobile situations often involve multitasking. Users engage in mobile interactions while carrying out other tasks, such as shopping or walking. Almost every new iteration of mobile advancement results in a need for increased attention on the part of the user. As device capability increases, so does complexity. Unfortunately, this added complexity often translates to user-device engagement for longer periods of time.
1.1 Thesis Objectives

At its heart, this thesis asks how current devices can be enabled for eyes-free text entry in order to increase multitasking abilities and improve accessibility. It explores the domain of eyes-free text entry on mobile devices by defining two broad classes of mobile device hardware: physical button devices and touch sensitive devices. Mobile devices that use physical buttons for interaction have pre-configured hardware such as a number pad (with one or more language characters associated with each button), navigation buttons, a joystick, a scroll wheel, and other similar controls. Touch sensitive devices, on the other hand, respond to finger or stylus touch. The interactive elements are software-based buttons and other widgets displayed on a screen. This allows flexibility in the way screen real estate is used. However, the lack of tactile distinction between these screen widgets makes them difficult to operate by the sense of touch alone.

These two classes are inherently different. Touch sensitive devices allow for direct object manipulation as well as gesture recognition using swiping, tapping, flicking, and even pinching (for devices supporting multi-touch). These natural interactions are the product of software-based interfaces, making them highly customizable. On the other hand, physical button devices lack this flexibility but are able to engage the user's tactile, kinaesthetic, and proprioceptive sensory channels during interaction. Consequently, each class of hardware needs to be approached differently.
Differences in hardware result in differences in interaction. This thesis studies both hardware types separately with an investigation of potential software-based solutions for enabling eyes-free text entry. The primary objective is to determine whether eyes-free text entry is possible on each class of hardware and, if so, to what extent. The latter question is answered by collecting data in controlled scientific evaluations and gauging various metrics, such as speed and accuracy, in a comparative analysis. Another goal of this work is to identify ways in which the visual attention demands imposed by current mobile devices can be reduced.
1.2 Thesis Contributions

For each class of hardware, this thesis proposes a separate software-based solution that enables eyes-free use. Note that there are two positive consequences of a software-based solution. First, incorporating these solutions into existing devices is cheap and requires no hardware modifications. Second, eyes-free entry opens the door to making text entry accessible to persons with visual impairments, and this is the focus of the solution proposed for physical button devices in Chapter 3.

Although many eyes-free text entry techniques exist for physical button devices (see Chapter 2 for an overview), they are often associated with bulky or expensive hardware, or require learning a new input language. Some solutions in the literature seem promising; however, the lack of empirical evaluations leaves the value of those implementations inconclusive. Visually impaired users would benefit most from viable eyes-free text entry techniques. Currently, such users must either invest time learning to
touch-type or use other text entry techniques. These techniques often have a long learning curve, and the initial hardship or frustration often prevents easy adoption. This thesis presents an alternative interface for physical button devices that builds upon existing hardware. The design rationale, models, and implementation are supported with an empirical evaluation of the system. In a similar vein, no scientific evaluations were found for eyes-free text entry on touch sensitive devices. This thesis presents a prototype that enables such an interaction, followed by an initial experiment comparing eyes-on and eyes-free use. Following this experiment, an enhanced version is developed building upon the results of the prior evaluation. A final experiment comparing three different feedback modes is carried out to identify which elements of the interaction aid text entry performance (in terms of speed and accuracy), and which elements hinder it.
1.3 Thesis Outline

This thesis first presents an overview of related work and general issues in text entry for mobile devices. Interfaces that support, enable, or are specifically designed for eyes-free use are also covered. Thus, mobile text entry systems for persons with visual impairments are also discussed. Following this review, Chapter 3 presents LetterScroll: an eyes-free text entry interaction for physical button devices specifically targeted towards persons with visual impairments. The problem is first approached theoretically, where a priori analyses are conducted using models. This is followed by a user study using blindfolded participants.
Chapter 4 shifts the focus to touch sensitive devices. The lack of sufficient tactile feedback on these devices increases the difficulty of enabling eyes-free text entry. An initial design is described, followed by a discussion of the hardware and software used for the implementation. The prototype is then used in an initial user study comparing eyes-on and eyes-free use. This chapter lays the groundwork for later enhancements to the system. Chapter 5 presents an enhancement of the system presented in the previous chapter. Various observations from the initial experiment are incorporated into the new system, including the addition of an error correction algorithm. Subsequently, a formal evaluation is conducted to determine which variation of the interaction is best suited for eyes-free use. In addition, the data analysis toolkit used is richer in features, allowing for better word- and character-level analyses of the data. Finally, Chapter 6 concludes the thesis with a brief summary of the experimental results and an overview of topics for future work.
Chapter 2  Previous and Related Work

This thesis addresses the mobile text entry problem within the scope of eyes-free use. In doing so, various aspects of mobile text entry design are discussed, including text ambiguity, interaction methods, hardware type, and so on. This chapter presents previous and related work from these areas.
2.1 Mobile Text Entry

This section reviews the status quo of mobile text entry. Text entry interaction consists of two parts: hardware and software. Section 2.1.1 first reviews a range of mobile keypads that are either discussed in the literature or used commercially. Following this, Section 2.1.2 reviews a variety of interaction methods that use these keypads.
2.1.1 Mobile Keypads

The foremost factor that makes text entry on mobile devices a challenge is the limiting form factor. Constrained space has resulted in a variety of "reduced-key" keyboards for text entry on mobile devices. The most popular of these is illustrated in Figure 2-1.
Figure 2-1. An example of a standard 12-key mobile keypad.

Each of the eight keys from "2" to "9" is mapped to three or four characters of the English language. The "0" key is mapped to a SPACE character. Some layouts may use the "#" key instead. Such an arrangement allows for larger keys while representing the entire character set. An arrangement that dedicates a single key to each character within the same physical space requires a decrease in overall key size, making it hard to target specific keys. The popularity of SMS messaging and the issues in text entry on mobile devices have generated a substantial amount of research on reduced-key keyboards. In their book, MacKenzie and Tanaka-Ishii (2007) present an overview of the various approaches found in the literature. Figure 2-2 shows some of these.
Figure 2-2. Mobile phone keypad variants (MacKenzie and Tanaka-Ishii, 2007). (a) Qwerty-like phone keypad (Hwang and Lee, 2005) (b) LetterEase (Ryu and Cruz, 2005) (c) FLpK: Fewer Letters per Key (Ryu and Cruz, 2005) (d) LessTap (Pavlovych and Stuerzlinger, 2003) (e) ACD: Alphabetically Constrained Design (Gong and Tarasewich, 2005) (f) EQ3 (eatoni.com) (g) QP10213 (MacKenzie and Tanaka-Ishii, 2007)

Hwang and Lee present an arrangement where the letters are placed on 9 keys instead of 8 (Figure 2-2a). The arrangement is said to mimic the Qwerty keyboard, hence the name "QWERTY-Like". The intent is to decrease visual scan time. Similarly, Ryu and Cruz use a few more keys and rearrange the letters to improve disambiguation (Figure 2-2b). They also proposed a second design where the letters strictly follow alphabetic order (Figure 2-2c). In Figure 2-2d, the key and letter groupings of the original keypad are maintained, but the letter orderings on each key are rearranged to improve disambiguation. Figure 2-2e alters the letter grouping on the first two rows of the keypad. EQ3 in Figure 2-2f uses ten keys to spread the letters across the keypad. MacKenzie's QP10213 does the same, except that it uses 9 keys and the arrangement strictly adheres to the Qwerty layout.
Overall, the various keypads differ primarily in their assignment and distribution of the alphabet across the keys. These differences are motivated by different design intents, such as using familiar layouts (e.g., the Qwerty keyboard) to build on previous experience, or using language characteristics (such as the frequency of specific letters). The incorporation of this last point is easily distinguishable in many of the keypad design variants. The letter "e" is the most frequent letter in the English language. Of the seven keypads illustrated in Figure 2-2, five (a, b, c, d, and e) place "e" as the first character on its key, requiring only a single press to access it. Note that this is not the case with the standard mobile keypad shown in Figure 2-1, where "e" is on the "3" key in the second position (def). Two different keypads designed by Research in Motion (RIM) are illustrated in Figure 2-3. These keypads resemble the layout of a Qwerty keyboard. The keypad on the left (SureType) assigns the 26 letters of the English language to 14 buttons. The keypad on the right instead dedicates a key to each letter, but as a consequence the keys become smaller and harder to target. One approach to dealing with shrinking key size is to make the device wider. However, pushing this too far can make the device bulky.
Figure 2-3. Two Qwerty keypad variants: SureType (left); Blackberry Bold (right).

2.1.2 Common Text Entry Methods

The most common keypad layout is the one first shown in Figure 2-1. Various text entry methods can be used with this keypad. Among them, the most popular is multitap. In this mode, repeatedly pressing the same key cycles through the letters on the key. For instance, pressing the "3" key twice would indicate the letter "e". Pausing for a set period of time or pressing a different key automatically selects the letter. The following entry sequence describes the entry for the word "hello" using multitap:

4433555N555666  (keystrokes: 14)

The N keystroke stands for a "next" key that indicates letter selection. Users can select a letter with this key, overriding the timeout, and reuse the same key for the next letter without having to wait (h-e-l-N-l-o). This is needed when the following letter of a word resides on the same key as the current letter. One such example is the word "feed". All letters of this word reside on the "3" key. The keystroke sequence for entering this word is f-N-e-N-e-N-d, which in numbers is:

333N33N33N3  (keystrokes: 11)
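To make the mechanics concrete, below is a minimal sketch of a multitap encoder (illustrative only, not software described in this thesis). It reproduces both sequences above from the standard layout of Figure 2-1.

```python
# Minimal multitap encoder for the standard 12-key layout (Figure 2-1).
MULTITAP = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

def encode_multitap(word):
    """Return the key sequence for a word, inserting 'N' (next) when
    consecutive letters share a key."""
    keys, prev_key = [], None
    for ch in word:
        key = next(k for k, letters in MULTITAP.items() if ch in letters)
        if key == prev_key:
            keys.append("N")                          # explicit letter selection
        keys.append(key * (MULTITAP[key].index(ch) + 1))  # repeat presses
        prev_key = key
    return "".join(keys)

print(encode_multitap("hello"))  # 4433555N555666 (14 keystrokes)
print(encode_multitap("feed"))   # 333N33N33N3 (11 keystrokes)
```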
Multitap is the only method that can support some level of eyes-free text input due to its explicit entry of each letter. By remembering the positions of the letters, users can enter text by counting the number of times they press a key. However, there is still a need to visually monitor input to comprehend the evolution of text, and this makes error correction difficult. Two other modes of entry are T9 (by Tegic Communications; www.tegic.com) and iTap (by the Lexicus division of Motorola; www.motorola.com/lexicus). These modes use a dictionary to resolve ambiguity. In these modes, each key is pressed only once. Using the dictionary of a target language, these keystrokes are mapped to a word. Naturally, the disambiguation is not always perfect, as two or more words may have the same key sequence. Such events are called "collisions". When collisions occur, the most probable word is suggested as the default choice. If the resulting word is not what the user intended, a NEXT key is used to traverse the words that match the same key sequence. The following entry sequence describes the entry for the word "hello" using T9:

43556N  (keystrokes: 6)
In the above sequence of keystrokes, the letter N represents the press of the next key, which displays the next potential word obtained with the sequence of keystrokes entered. Unfortunately, the use of word frequencies in this manner requires visual inspection of text. Predicted words and/or lists of ambiguous words are presented as entry proceeds. The reliance on this visual inspection eliminates the potential for entering text eyes-free.
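The dictionary-based approach can be sketched in the same spirit: hash each dictionary word by its key sequence and rank colliding words by frequency. The sketch below is illustrative only; the four-word frequency dictionary is hypothetical, and commercial products such as T9 use far larger lexicons and more sophisticated ranking.

```python
# Sketch of dictionary-based disambiguation in the style of T9.
T9_KEYS = {c: k for k, s in {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}.items() for c in s}

def key_sequence(word):
    return "".join(T9_KEYS[c] for c in word)

def candidates(sequence, dictionary):
    """Words colliding on the same key sequence, most frequent first."""
    matches = [w for w in dictionary if key_sequence(w) == sequence]
    return sorted(matches, key=dictionary.get, reverse=True)

# Hypothetical word frequencies; 'good', 'home', 'gone', and 'hood'
# all collide on the sequence 4663.
freq = {"good": 1000, "home": 800, "gone": 500, "hood": 100}
print(candidates("4663", freq))  # ['good', 'home', 'gone', 'hood']
```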
Variations of multitap and T9 have been proposed that change the letter arrangement on each key, or alter the layout entirely, to obtain fewer keystrokes per character (KSPC) and improve disambiguation (see Figure 2-2). However, they all retain an implicit reliance on vision; to use any of them in an eyes-free manner would require substantial time learning the layout, adding to the difficulty of an already challenging task.
2.2 Eyes-free Text Entry Methods

2.2.1 Physical Button Devices

The quest for an eyes-free text entry method on devices with physical buttons was motivated by the needs of users with visual impairments. Consequently, the literature studied includes text entry solutions for such users. This section presents previous work in eyes-free text entry that addresses the needs of visually capable or visually impaired users. Currently, there are many text entry solutions for people with visual impairments. Arato et al. (2004) describe a software solution called the Braille Slate Talker. In this device, a fixed plastic guide (a 3 by 3 matrix) is placed on top of a touch sensitive PDA (Personal Digital Assistant) to allow Braille input. The method uses the plastic guide as a virtual slate representing six Braille cells, plus cells for space, backspace, and carriage return. For input, each Braille code (multiple dots) is converted into an ASCII character.
The user receives non-speech audio feedback and each character is spoken using a TTS system. No performance tests were conducted with participants. Six-In-Braille is a text entry system proposed by Blenkhorn and Evans (2004). It uses the Qwerty keyboard to translate contracted Braille (a compact form of regular Braille using less space and several abbreviations) into text. Their system is for visually impaired users who prefer to type using Braille instead of a standard Qwerty keyboard. The system lies between the operating system and application and provides a general framework for accepting text encoded in Braille. Input uses six keys, three on the left (S, D, F) and three on the right (J, K, L). The homing keys F and J help users locate the keys using tactile feedback. Again, no human performance tests were conducted. Twiddler (Lyons et al., 2004), a one-handed chording keyboard, was used to investigate eyes-free typing. Entry speeds for some participants reached as high as 67 wpm. The difficulty with Twiddler is its steep learning curve. This can frustrate users and may result in low acceptability of the technique. It is important to remember that a mobile phone is, essentially, a consumer product; investing substantial time to learn the basic operating modes of the device is generally unacceptable. Another study compared multitap text entry with the use of an isometric joystick using EdgeWrite, a stroke-based text entry technique (Wobbrock et al., 2007). The study involved entering EdgeWrite strokes using gestures on a joystick. The mobile device used included two isometric joysticks, one on the front and one on the back. Of specific
interest is the finding that the front joystick allowed eyes-free entry at 80% of the normal-use speed (≈7.5 wpm). Wigdor and Balakrishnan (2003) present TiltText, a text entry technique that uses the orientation of the device to disambiguate characters. Instead of multiple key presses, the user presses a button and tilts the device in the direction of the character on the mobile phone keypad. For instance, to enter "A", the user presses the 2 key and simultaneously tilts the device left. This is a potentially eyes-free technique (like multitap) but no evaluation was carried out to test this possibility. Mobile ADVICE (Amar et al., 2003) uses tactile feedback and an auditory display to assist users navigating across menus on a mobile device. The evaluation tested two applications: playing music and checking email. The findings revealed that sighted users performed better than non-sighted users. Subjective measures between groups revealed that non-sighted users found the device easy to use. Li et al. (2008) describe BlindSight, a prototype that modifies a phone's visual in-call menu to improve interaction during a call. Users interacted with an inverted phone (keypad away from the face), allowing their fingers to press buttons while on a call. The auditory feedback was only heard by the user and not by the other party. In addition, the device used tactile feedback. Subjective analysis of the eyes-free mode revealed that users found it easy to listen and press buttons, and felt it useful to hear content without looking.
Perhaps the work of Guerreiro et al. comes closest to this research. The authors present BloNo, a mobile text entry interface for visually impaired users (Guerreiro et al., 2008). BloNo navigates a cursor over a matrix of the alphabet. Each row of the matrix begins with a vowel, segmenting the alphabet into five rows. An evaluation with five blind participants involved sending a one-word text message to a contact in the address book. The authors report improved text entry rates but do not qualify this with any measurements. Finally, we should mention TTS (Text-To-Speech), which dates back to at least the 1930s with work at Bell Labs. Most operating systems today provide built-in TTS services. These "describe" the GUI using speech as the cursor is moved around the screen. The Apple Macintosh provides a powerful and customizable TTS engine. In the context of text entry, each character of input is accompanied by spoken feedback; selection of text fragments automatically begins narration. However, users who cannot touch-type must learn the layout of keys. The complexity of the system makes it less portable across smaller devices.
2.2.2 Touch Sensitive Devices

A literature search revealed very little on eyes-free text entry solutions for touchscreen devices. Although touchscreen devices themselves have been studied extensively, eyes-free text entry on them remains largely unexplored. Below we present some related work in this area.
FreePad (Bharath and Madhvanath, 2008) investigated pure handwriting recognition on a touchpad (Figure 2-4a). Subjective ratings found that the overall experience of entering text was much better than predictive text entry (aka T9). Text entry speeds were not measured, however.
Figure 2-4. Touchscreen related interaction techniques. (a) In-place handwriting recognition (Bharath and Madhvanath, 2008) (b) In-stroke word completion (Wobbrock et al., 2006) (c) SlideRule gestures for menu navigation and selection (Kane et al., 2008) (d) Text entry using gestures and pie menus (Yfantidis and Evreinov, 2006)

Wobbrock et al. (2006) proposed an enhancement to EdgeWrite called Fisch, which provides in-stroke word completion (Figure 2-4b). After entering a stroke for a letter, users can extend the stroke to one of the four corners of the touchpad to select a word. Thus, multiple characters are entered in one stroke. The authors duly note that the mechanism provides potential benefits for eyes-free mobile use. However, no investigation was carried out to test this.
SlideRule (Kane et al., 2008) leverages multi-touch and gestures to let users navigate menus and make selections (Figure 2-4c). An evaluation was carried out on an Apple iPhone and an ASUS MyPal. The findings showed that SlideRule was significantly faster, but more error prone, than a comparative button-based system. Yfantidis and Evreinov (2006) present a text entry solution using gesture recognition on a touchscreen. Upon contact, a pie menu is displayed with eight characters around it (Figure 2-4d). An advantage of pie menus is that all characters are equidistant from the point of contact. Dwelling on a character updates the pie menu with another set of characters by displaying another "layer". This interaction technique is also complemented with auditory feedback, allowing for eyes-free entry. Although the layout of layers is customizable, the letter positions in each layer must be learned. In an empirical study, participants reached 7 wpm (words per minute) after two hours of practice. Sánchez and Maureira (Sánchez, 2006) proposed a text entry technique that places nine virtual keys on a touchscreen device. Consequently, text entry is similar to multitap. The target users are people with visual impairments. The primary mode of feedback is synthesized speech. Unfortunately, the work focused on design and implementation, and did not include an evaluation. Overall, we were unable to find any literature on eyes-free text entry on a touchscreen device using a stroke-based entry mechanism. While occasionally mentioned, no controlled evaluations have been conducted.
Chapter 3  Eyes-free Text Entry using a Wheel [1]

Eyes-free text entry manifests a two-fold advantage. First, it encourages multitasking in mobile contexts. Second, a clear consequence of eyes-free use is that the technology becomes accessible to visually impaired users. Although current mobile technology has advanced significantly, a key question is: are current devices usable by persons with visual impairments and, if so, to what extent? In general, text entry is highly reliant on vision, leaving visually impaired users at a disadvantage. Small devices such as mobile phones rarely have any form of TTS (Text-To-Speech) built in, and this makes the simple task of sending a text message (SMS) almost impossible for visually impaired users. The primary goal of studying and exploring this area of text entry is to develop text entry solutions that provide an alternative for persons with visual impairments. An essential criterion is to utilize the form factor of the device in order to eliminate the need for additional hardware.
[1] The work presented in this chapter has been published in the ACM Conference on Human Factors in Computing Systems: Tinwala, H. and MacKenzie, I. S. (2008). LetterScroll: Text entry using a wheel for visually impaired users. Extended Abstracts of the ACM Conference on Human Factors in Computing Systems – CHI 2008, pp. 3153-3158. New York: ACM.
3.1 LetterScroll Interaction Strategies

LetterScroll is a text entry technique that extends a three-key technique to include visually impaired users. The underlying concept is the date stamp method (MacKenzie, 2002b). As a text entry method, increment or decrement operations manoeuvre a cursor sequentially through the character set. A select operation inputs the character, and consecutive increment or decrement operations allow the user to traverse the alphabet to the next character. Depending on the configuration, LetterScroll can be used in multiple ways. This section describes the basic elements involved in entering text using LetterScroll.
3.1.1 Navigation

In the interest of simplicity, the character set used is the 26 letters of the English alphabet and a SPACE character. The set can be expanded to include numbers, diacritics, and punctuation. LetterScroll uses a scroll wheel on a standard wheel mouse to perform the increment and decrement operations that cycle through the character set. Figure 3-1 represents a conceptual model that may assist users.
Figure 3-1. Conceptual model of the LetterScroll character set.

An increment operation (scroll forward) constitutes rotating the wheel by one segment or "tick" towards the user in a downward direction (see Figure 3-2). For example, one increment operation from o will advance the cursor to p. Continuous downward rotations advance the cursor to z and back to a, resulting in a cycle. Similarly, a decrement operation (scroll backward) constitutes rotating the wheel by one tick away from the user in an upward direction. For example, one decrement operation from o will shift the cursor to n.
Figure 3-2. Traversing the alphabet using the scroll wheel.

3.1.2 Jumping

LetterScroll allows jumping to characters using additional keys on the device (such as a mobile keypad or a keyboard). The keys form a mapping that provides random access to
certain characters. The implementation is device specific. The jump operation moves the cursor from its current position to a specific character belonging to a small and customizable set of characters, such as the English vowels a, e, i, o, and u. Figure 3-1 illustrates this with the vowels placed in rectangular boxes. Five keys are selected to represent these vowels. The keys used are specific to the implementation. From letter frequencies gleaned from the British National Corpus, the five most frequent letters are e, t, a, o, and i, in decreasing order (BNC, 2009). The intersection of the vowels with the most frequent letters captures four (a, e, i, o) out of five vowels. Missing is u, which can potentially replace t. However, since vowels are easily recognized and recalled, giving them special treatment may help novices. Overall, the use of vowels for jumping should decrease the cognitive load on the user.
3.1.3 Character Selection and Word Advancement

To select the character at the current cursor position, users "click" the primary mouse button. Upon selection, the cursor persists at the last position; the user navigates forward or backward from that position. For example, entering the word "is" requires navigating to i, followed by a select operation, followed by navigating to s from i, and so on. Upon completing text entry for a word, the user clicks the secondary mouse button to insert a space. Any further text entry requires navigating from the last cursor position (s, in the example for "is").
3.1.4 Feedback

Today, even the most ordinary mobile device supports audio playback. This feature is extensively exploited by LetterScroll in the form of speech-based auditory feedback. As users navigate the character set, a speech synthesizer speaks the character at the current cursor position. As the cursor shifts from its current position, the synthesizer speaks the character at the new position. Similar feedback is provided when a key is pressed to jump to a vowel. This feedback provides a one-to-one mapping of each character to an auditory sound. A potential problem is that the speech synthesizer may be slower than the movement of the cursor. To accommodate this, the speech synthesizer empties its buffer upon invocation and speaks the character at the current position of the cursor, aborting the sound if the cursor moves to a new character. Upon character selection, the letter is spoken (again). When a SPACE is inserted, the speech synthesizer speaks the word "space".
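This flush-on-move policy can be sketched as follows. The sketch is illustrative only, not the thesis implementation; `speak` and `stop` are placeholder callables for whatever synthesizer API is available (they do not correspond to specific FreeTTS calls).

```python
# Minimal sketch of flush-on-move speech feedback: cancel the stale
# utterance whenever the cursor moves so audio never lags the cursor.
import threading

class CursorSpeech:
    def __init__(self, speak, stop):
        self._speak = speak   # placeholder: starts speaking a string asynchronously
        self._stop = stop     # placeholder: aborts any utterance in progress
        self._lock = threading.Lock()

    def on_cursor_moved(self, letter):
        with self._lock:
            self._stop()            # empty the buffer (abort stale speech)
            self._speak(letter)     # announce the new cursor position

    def on_selected(self, letter):
        with self._lock:
            self._stop()
            self._speak(letter)     # the selected letter is spoken again

    def on_space(self):
        with self._lock:
            self._stop()
            self._speak("space")

# Usage with trivial stand-ins:
feedback = CursorSpeech(speak=print, stop=lambda: None)
feedback.on_cursor_moved("p")
```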
3.1.5 Reviewing Text The user can review the evolution of transcribed text by pressing the initiates a character-by-character narration of the entered text. The
ESC
ESC
key. This
key was selected
due to its boundary position on the keyboard, affording a quick and easy method for locating using the tactile sense. Since the target users are visually impaired, it is equally important to review the presented text for experimentation. A pilot study revealed that users occasionally forget the stimulus. For this purpose, the space key was employed 37
initiating a word-by-word review of the presented text. Similar to the ESC key, the space key is wide and easily located using the kinesthetic sense alone.
3.2 Modeling the Interaction Using KSPC

Design involves the construction of a viable solution within constraints. Armed with the ideas above on the interaction method and implementation details, the next step was to build a model of the interaction to explore design scenarios. An important statistic for this purpose is keystrokes per character (KSPC) – a weighted average of the number of keystrokes required to generate a character in a given language using a given text entry technique (MacKenzie, 2002a). For instance, KSPC for the QWERTY keyboard is 1.00 since each character has a dedicated key; KSPC for multitap is about 2.03. There are many ways to implement LetterScroll. Small differences in the technique can have strong influences on KSPC, and this makes a priori analyses of the interaction worthwhile. Earlier work suggests an inverse relationship between KSPC and throughput (words per minute). If other factors (e.g., motor sensory limits, attention demands) are kept constant, then KSPC provides a reasonable means of theoretically evaluating a technique based on its mechanics. With this in mind, four variations of LetterScroll are presented in Table 3-1. In determining KSPC for LetterScroll, each increment or decrement operation, character selection operation (one primary button click on the mouse), space insertion (one secondary button click on the mouse), and jump operation (a press of a key mapped to a vowel) is counted as a single keystroke.
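Stated as a formula, KSPC is a frequency-weighted ratio over a word list; the notation below follows the standard formulation (MacKenzie, 2002a), with the trailing SPACE counted in both keystrokes and characters.

```latex
% K_w: keystrokes to enter word w (navigation, selections, trailing SPACE)
% C_w: characters produced by w (its letters plus the trailing SPACE)
% F_w: frequency of word w in the corpus word list
\mathrm{KSPC} = \frac{\sum_{w} K_w \, F_w}{\sum_{w} C_w \, F_w}
```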
Table 3-1. Four variations of LetterScroll.

Method   Character Navigation                     KSPC
#1       Scrolling both ways                      6.38
#2       Scrolling and acceleration               4.61
#3       Scrolling and jumping [~, 1]             2.95
#4       Scrolling and jumping [~, 1, 2, 3, 4]    2.39

In all four methods, character navigation is cyclical. Increment operations (advancing forwards, such as a, b, c, and so on) are mapped to scrolling downward and decrement operations (regressing backwards, such as z, y, x, and so on) are mapped to scrolling upward. This inverse mapping is chosen because scrolling downwards is predominantly associated with advancing forward in document-based applications. Thus, the aim is to exploit the past experiences of users in a productive way.
3.2.1 The Basic Approach

The basic mode is Method #1. Interaction proceeds by moving the cursor using the mouse's scroll wheel; character selection occurs by clicking the primary mouse button. Clicking the secondary mouse button enters a SPACE. At all times, the cursor persists at its last position. The keystrokes to enter a character depend only on the preceding character, as this determines the cursor position. The language model uses a word-frequency reduction of the British National Corpus (Silfverberg et al., 2000). Below are the three most frequent words in the list, appended with their frequency and keystrokes for Method #1:
the_   5776384   ZZZZZZZZCBBBBBBBBBBBBCBBBCS
of_    2789403   ZZZZZZZCBBBBBBBBBCS
and_   2421302   ZZZZZZZZCFFFFFFFFFFFFFCBBBBBBBBBBCS
For each word, initial navigation appears as "Z". These keystrokes are based on a weighted average using trigram frequencies crossing word boundaries. Essentially, they provide an estimate of how many keystrokes are needed, on average, to navigate to the first letter in a word when using Method #1. "C" is the select keystroke. "B" and "F" are backward and forward keystrokes, respectively. "S" is the final SPACE to select a word.

The word "the" requires 8 initial keystrokes (increment or decrement) to navigate to t. This is followed by one select keystroke to select t. Navigating from t to h requires 12 decrement keystrokes and 1 select keystroke. This is followed by 3 decrement keystrokes and 1 select keystroke to navigate from h to e. Finally, the word is terminated with a SPACE keystroke by clicking the secondary mouse button. KSPC is computed by summing these weighted keystroke counts. There are at least two paths to reach the desired character (backwards or forwards); this analysis uses the shortest path possible and therefore yields a lower bound for KSPC. For Method #1, the result is

KSPC = 6.38    (Method #1)
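The per-word keystroke counts above can be reproduced with a short sketch (illustrative, not the analysis software used in this chapter). The wheel distance between two letters is the shorter direction around the 26-letter cycle; each selection adds one click and the word ends with one SPACE click. The trigram-weighted initial position is approximated here by an explicit start letter.

```python
# Minimal sketch of the Method #1 keystroke count for a single word.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def wheel_ticks(src, dst):
    """Fewest scroll ticks between two letters on the cyclic alphabet."""
    d = (ALPHABET.index(dst) - ALPHABET.index(src)) % 26
    return min(d, 26 - d)   # shorter of the forward/backward paths

def keystrokes_method1(word, start):
    """Ticks + select clicks + trailing SPACE, from cursor position start."""
    total, cursor = 0, start
    for ch in word:
        total += wheel_ticks(cursor, ch) + 1   # scroll, then select click
        cursor = ch
    return total + 1                           # SPACE via secondary button

# Matches the breakdown in the text: 12 ticks for t->h, 3 for h->e.
print(wheel_ticks("t", "h"), wheel_ticks("h", "e"))   # 12 3
```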
In an attempt to improve the interaction, an additional feature was introduced – acceleration. This is Method #2. In this method, holding down the CTRL key while scrolling allows the cursor to move two characters at a time, instead of one. Thus, scrolling from a to c would require one increment operation instead of two. When acceleration is used, an additional keystroke is added for pressing the accelerator key. There is a reasonable improvement:

KSPC = 4.61    (Method #2)
3.2.2 Jumping with Vowels

Although Method #2 is promising, many increment/decrement operations are still needed to move from one character to another. A potential improvement is to jump: to use keys to access a subset of the character set directly. This allows the interaction to support a kind of random access. This is Method #3. Engaging the tactile sense for this, the two top-left keys above the QWERTY row are chosen. On the keyboard used, these are the back quote (`) and 1 keys. They are associated with backward and forward navigation, respectively. The characters they access are the vowels. Consecutive presses of 1 traverse a, e, i, o, u and back to a. The back quote key exhibits the same behaviour in reverse (u, o, i, e, a, and back to u).

A similar technique is a 1:1 mapping of vowels to keys. Instead of stepping through the vowels, five keys on a keyboard are dedicated to the vowels. This is Method #4. For the purposes of this research, five horizontally aligned keys were selected: back quote, 1, 2, 3, and 4. Figure 3-3 shows the mapping of vowels to keys and keys to fingers. The index finger is shared between o and u.
Figure 3-3. A mapping of left hand fingers to keys representing vowels (not to scale).

The back quote (`) key is a boundary key, which assists users in quickly locating it by sensing the edge of the keyboard. Locating this key provides enough information to successfully home the remaining fingers on neighbouring keys.

In both methods (Method #3 and Method #4), there are multiple ways to navigate the character set. For example, given that the cursor is currently at a, navigating to the letter t can be done in at least four ways:

I. Scrolling forward to t (a b c d … s t).
II. Scrolling backward (a z y x w v u t).
III. Jumping to the vowel o followed by scrolling forward (o p q r s t).
IV. Jumping to the vowel u followed by scrolling backward (u t).
Of the four techniques described, interaction IV requires the fewest keystrokes. For both Method #3 and Method #4, KSPC is calculated using the shortest path possible. Given the interaction techniques in Method #3 and Method #4, and the keystroke-appended word-frequency lists, KSPC for each was computed. The results are

KSPC = 2.95 (Method #3)
KSPC = 2.39 (Method #4)
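A minimal sketch of the corresponding lower bound for Method #4 follows (illustrative, not the analysis software): a jump costs one keystroke and relocates the cursor to a vowel, after which scrolling proceeds one letter at a time.

    // Sketch: lower-bound keystrokes to enter one character under Method #4.
    // A vowel jump costs one keystroke and moves the cursor to that vowel;
    // scrolling costs one keystroke per letter over the circular alphabet;
    // selecting the character costs one more keystroke.
    public class VowelJumpModel {
        static final char[] VOWELS = {'a', 'e', 'i', 'o', 'u'};

        static int circularDistance(char from, char to) {
            int d = Math.abs(from - to);
            return Math.min(d, 26 - d);
        }

        static int keystrokesToEnter(char cursor, char target) {
            int best = circularDistance(cursor, target);      // scroll only
            for (char v : VOWELS)                             // jump, then scroll
                best = Math.min(best, 1 + circularDistance(v, target));
            return best + 1;                                  // select
        }

        public static void main(String[] args) {
            // a -> t: jump to u (1), scroll back to t (1), select (1) = 3
            System.out.println(keystrokesToEnter('a', 't'));
        }
    }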
This preliminary analysis provides a good idea of which text entry methods to explore further. Although the KSPC value suggests that Method #4 requires 62% fewer keystrokes than Method #1, it is not certain that Method #4 will result in higher text entry rates. The preceding analysis does not take into account other factors such as cognitive load and attention demands introduced by jumping. All four variations were implemented and each one was informally tested with two users. Overall, it was determined that Method #1 was easy to use and Method #4 provided substantial benefit in terms of KSPC when compared to the other methods. The following sections present the methodology and results of an empirical evaluation comparing Method #1 and Method #4.
3.3 Method

3.3.1 Participants

Twelve paid volunteer participants (9 male, 3 female) were recruited from the local university campus. Participants ranged from 19 to 27 years (mean = 25, SD = 3.6). All were experienced computer users and identified themselves as proficient mouse users. Computer usage was reported as 4 to 12 hours per day (mean = 6.9). All participants were right handed and held the mouse in their right hand during the experiment.
3.3.2 Apparatus

An Apple Macintosh notebook computer with a 13” screen was used to conduct the experiment. The pointing device was a standard two-button USB mouse with a scroll wheel between the two buttons. Tracking features of the mouse and the built-in touchpad were disabled, as the interaction did not require participants to move a cursor.

Software for the experiment was written in Java. An open source implementation of the Java Speech API, FreeTTS (Free Text-To-Speech), was used for speech synthesis. The synthesized voice was male with a slight American accent. The synthesizer could speak individual letters or complete words and phrases. All mouse and keyboard events were recorded. All sessions were run in full screen mode.

Figure 3-4 depicts the interface used. The first line displays the presented text and the second line displays the transcribed text. The presented text phrases were drawn randomly from a standard 500-phrase set designed to have a high correlation with English (MacKenzie and Soukoreff, 2003). The letter in the middle displays the current location of the cursor in the alphabet. Data collection begins once the “Begin Data Collection” button is pressed. The speech synthesizer speaks the phrase to enter at the start of each phrase. Timing begins with the first keystroke received after the phrase is spoken.
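As an aside, speaking a letter with FreeTTS is straightforward. The following is an illustrative sketch, not the experiment's code; it assumes the FreeTTS jars and the bundled “kevin16” voice are on the classpath.

    import com.sun.speech.freetts.Voice;
    import com.sun.speech.freetts.VoiceManager;

    public class SpeakLetter {
        public static void main(String[] args) {
            // "kevin16" is a voice bundled with FreeTTS distributions.
            Voice voice = VoiceManager.getInstance().getVoice("kevin16");
            voice.allocate();                    // load synthesizer resources
            voice.speak("e");                    // speak a single letter
            voice.speak("the quick brown fox");  // or a whole phrase
            voice.deallocate();                  // release resources
        }
    }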
Figure 3-4. The user interface in the experiment (scaled down).

3.3.3 Procedure

The experiment was performed in a quiet room. Participants sat in front of the apparatus on a (standardized) study chair with the screen positioned at eye level. The mouse and keyboard were at the same level as each other. Testing both conditions of the experiment took about 50 minutes to 1 hour per participant. All participants were blindfolded with a sleep shield and wore earphones for the speech feedback generated by the synthesizer. The keyboard keys were modified with two layers of thinly cut sticky notes to improve tactile feedback (see Figure 3-5).
Figure 3-5. Experimental apparatus used for LetterScroll.

Prior to data collection, participants completed a questionnaire soliciting demographic data. The experimenter then explained the task and the operation of Method #1. Participants were instructed to enter text “at a pace comfortable to them, and to do so as accurately as possible”. If an error occurred, they were to continue without correcting it. Similar to Method #1, the software jumping features were explained and demonstrated before testing Method #4. Each participant was allowed approximately two minutes of practice without the blindfold before the start of the experiment. During this time, they were allowed to ask questions about the procedure. When ready, they were blindfolded and asked to press the ENTER key to begin the experiment. After entering the text for a phrase, participants were instructed to press the same ENTER key to move to the next phrase. For each method, there were two blocks of three phrases each. The software recorded keystrokes, timestamps, and other information for follow-up analyses. After the experiment, participants completed a short post-test questionnaire (discussed later).
3.3.4 Design

The experiment was a 2 x 2 within-subjects design. The two independent variables were Method (#1, #4) and block (1, 2). The order of conditions was not counterbalanced. This was done to give participants sufficient experience with the scrolling technique before investigating how the added ability to jump to vowels affects overall performance. The blocks were administered one after the other without breaks. Participants took an optional break of about two minutes between the two conditions.
3.4 Results and Discussion

Overall, participants entered 72 phrases of text using Method #1 and 72 phrases using Method #4. Throughput, accuracy, and other observations are reported next.
3.4.1 Speed

The overall mean for entry speed was 3.6 wpm. Method #4, with a mean entry speed of 4.3 wpm, was about 53% faster than Method #1 at 2.8 wpm (see Figure 3-6). Although the difference was statistically significant (F1,11 = 33.2, p < .0001), this higher speed with Method #4 is at least partly due to learning while using Method #1.
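For reference, entry speed in wpm is presumably computed in the standard way for text entry studies. With S the entry time in seconds (timing beginning at the first keystroke) and |T| the length of the transcribed phrase, a common formulation is

    \mathrm{wpm} = \frac{|T| - 1}{S} \times \frac{60}{5}

where 5 is the conventional word length in characters, including spaces.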
Figure 3-6. Results for text entry speed (wpm) by method; three phrases constitute one block.

There was an increase in entry speed for both methods from one block to the next, as evidenced by the significant effect of block on entry speed (F1,11 = 10.5, p < .01). Aside from learning effects, the primary reason for the increasing entry rates was that participants began to build a strategy. In the first phrase, participants would listen to every letter announced by the synthesizer; thus, they scrolled forward or backward one letter at a time. By the second phrase, they began to scroll across multiple letters and then inspect the current position of the cursor with the audio feedback. This allowed faster navigation of the character set, wherein participants would “accelerate” the process by scrolling to the vicinity of the desired character.
3.4.2 Accuracy

Overall, error rates averaged 3.2% for Method #1 and 5.2% for Method #4. Although the average error rate for Method #4 is slightly higher than for Method #1, error rates varied substantially across participants, and the difference proved not statistically significant (F1,11 = 0.5, ns). Some errors were caused by overshoot or undershoot (selecting a character neighbouring the target character), but these were a small percentage of the overall error rate.

Although the overall error rates are very low, the use of an auditory display gave rise to interesting observations. Unlike interfaces with visual feedback, the errors exhibited in this experiment were of a different nature. The observed errors fall into the following categories.

Invalid recall. The presented text was recalled differently. For instance, the phrase “three two one zero blast off” was entered as “three two one blast off”.

Alternative recall. Some participants entered words that were completely different from the presented text. One participant transcribed the phrase “for your information only” as “for your eyes only”. The effect of previous experience is evident here.
Spelling mistakes. Participants misspelled words at times. Although the interface allowed users to inspect the spelling of a word in the phrase, this feature was used sparingly.

The cause of errors in the first two categories is not clear. Varying cognitive demands across the methods may have caused the observed behaviours.
3.4.3 Keystrokes per Character

The earlier discussion of KSPC was useful for comparing design factors and deciding which LetterScroll methods to test empirically. The KSPC values given previously were modeled. In the experiment, the actual keystrokes were collected, and based on these observed data, keystrokes per character was calculated with KSPC serving as a dependent variable. Overall, the observed keystrokes per character were 40.3% greater than the modeled values. Figure 3-7 shows the modeled and observed KSPC for Method #1 and Method #4. The chart uses a KSPC value of 1.00, corresponding to QWERTY text entry, as a baseline.
Figure 3-7. Modeled and observed KSPC by method. (Error bars span one standard deviation.)

The observed KSPC for Method #1 was 8.33, which is 30.6% higher than the modeled KSPC (6.38). The observed KSPC for Method #4 was 3.58, which is 50% higher than the modeled KSPC (2.39). An analysis of variance revealed a significant main effect of input method on keystrokes per character (F1,11 = 947.8, p < .0001).

Similar analyses (MacKenzie, 2002b) suggest that such differences could arise from linguistic variation between the data used to compute the modeled KSPC (the language corpus) and the data underlying the observed KSPC (the entered phrases). However, it was noted earlier that there is a high correlation in letter frequencies between the phrases and the language corpus, so this explanation carries little weight. The strongest influence on the inflated KSPC values is likely non-optimal entry. Humans learn the alphabet forwards, from a to z. As a result, participants frequently took a longer path to reach the desired character because it imposed a smaller cognitive load; recalling the alphabet in descending order induces a higher mental demand. (Try to say the alphabet backwards, and the point is made.) Another factor is the way each participant mentally chunks the alphabet. Different chunks lead to different decisions when attempting to locate the next character.

Keystroke Distribution
Figure 3-8 shows the distribution of the various keystrokes entered by the participants for both methods.

Figure 3-8. Keystroke distribution by method: (a) Method #1; (b) Method #4.

In Method #1, participants used scroll forward 45% of the time. This decreased significantly, to 16%, when using Method #4. A significant decrease was also found for the scroll backward feature, dropping from 43% to 28%. With the introduction of jumping in Method #4, 33% of keystrokes were attributed to the vowels a, e, i, o, and u. These numbers are clear indicators that participants found the vowels convenient and made extensive use of them to reach the desired character.

Word-level Analysis of Keystrokes
KSPC provides valuable insight into the overall effort needed to enter text. However, it does not highlight how much each type of interaction contributes to this number (e.g., Scroll Forward, Select, Space, etc.). To better understand which elements of the interaction induced the increases in KSPC, the collected keystrokes were analyzed by type, and each type’s contribution was calculated at the word level. The metric used is keystrokes per word (KSPW), which represents the number of keystrokes needed, on average, to produce a word of text in a given language using a given text entry technique (MacKenzie and Tanaka-Ishii, 2007). KSPW is more revealing for LetterScroll, as it expresses the contribution of each type of interaction per word.

Figure 3-9 compares the modeled and observed KSPW by type for each method of entry. Notice that KSPW for Select is consistently between 4 and 5, reflecting the average word length in English (MacKenzie, 2002a). The observed KSPW was greater than the modeled KSPW for Scroll Forward, Scroll Backward, and Vowels; they were approximately equal for the Select keystrokes.
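A minimal sketch of how such a per-type KSPW tally might be computed from a keystroke log follows. The type labels here are hypothetical, not the logging software's actual identifiers.

    // Sketch: tallying observed keystrokes per word (KSPW) by keystroke type.
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class KspwByType {
        static Map<String, Double> kspw(List<String> keystrokeLog, int wordCount) {
            Map<String, Integer> counts = new HashMap<>();
            for (String type : keystrokeLog)      // one label per logged keystroke
                counts.merge(type, 1, Integer::sum);
            Map<String, Double> result = new HashMap<>();
            counts.forEach((type, n) -> result.put(type, (double) n / wordCount));
            return result;
        }

        public static void main(String[] args) {
            List<String> log = List.of("SCROLL_FWD", "SCROLL_FWD", "SELECT", "SPACE");
            System.out.println(kspw(log, 1)); // {SPACE=1.0, SELECT=1.0, SCROLL_FWD=2.0}
        }
    }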
Figure 3-9. Modeled and observed KSPW by method.

Table 3-2 identifies the overhead observed for these keystrokes. Overall, the overhead in Method #1 was 2.4 times higher than in Method #4. This is apparent from the high numbers for Scroll Forward and Scroll Backward in Method #1. In comparison, Method #4 segmented the alphabet into five parts, allowing participants to navigate the characters easily.
Method #1
  Keystroke          Model   Observed   Overhead
  Scroll Forward     13.63   18.28       4.65
  Scroll Backward    12.17   17.48       5.31
  Space               1.00    0.98      -0.02

Method #4
  Keystroke          Model   Observed   Overhead
  Scroll Forward      2.63    2.82       0.19
  Scroll Backward     2.72    4.75       2.03
  Vowels              2.80    4.74       1.94
  Space               1.00    0.99      -0.01

Table 3-2. KSPW for each interaction by method.

For Method #4, there was an overhead of 74% (1.94) vowel keystrokes per word. One explanation for this high overhead is that participants frequently relied on the audio feedback to know which vowel they had pressed. Instead of recalling the vowel beneath their finger, they relied on recognition based on the audio feedback. Consequently, participants often hit multiple vowel keys in an effort to locate the one they needed. This behaviour is possibly related to a comparatively higher cognitive load in Method #4. Additionally, the poor quality of the speech synthesizer resulted in multiple presses of the same vowel key to confirm the choice, adding to this overhead.

The negative overhead for the Space keystroke in both methods is due to participants occasionally forgetting to enter a SPACE character at the end of a word. This behaviour was observed most notably with short, highly frequent words such as “the” and “of”.
3.4.4 Two-handed Interaction

In Method #1, participants identified the cursor’s location by retaining it in memory, or by scrolling to a neighbouring character for audio feedback. In contrast, participants had a higher degree of control in Method #4, as they actively decided on jumping to a specific vowel. The task of locating and entering a given character (targetCharacter) in Method #4 was composed of three steps:
1. Identify and jump to the vowel closest to the targetCharacter.
2. Use the wheel on the mouse to scroll to the targetCharacter.
3. Select the character.

Participants frequently used a strategy that is well captured by Guiard’s model of bimanual control (Guiard, 1987). Table 3-3 presents an adaptation of this model for entering a character using Method #4.
Hand           Role and Action
Non-preferred  • leads the preferred hand
               • sets the frame of reference by navigating to the vowel closest to the desired character
               • performs coarse movements within the alphabet by jumping to one of the five vowels
Preferred      • follows the non-preferred hand
               • works within the vicinity of the vowel selected by the non-preferred hand
               • performs fine movements by scrolling forward or backward one character at a time until the desired character is reached

Table 3-3. Applying Guiard's model of bimanual control to Method #4.

Based on in-experiment observations, the keystroke data were analyzed to investigate the frequency of this approach. In total, 1200 characters of text were entered in Method #4. Of these, 76% were transcribed using the method outlined in Table 3-3, i.e., using the vowels to get close to the intended character with the non-preferred hand, followed by navigating forward or backward to the intended character with the preferred hand.
3.4.5 Stimulus and Input Review

Since the feedback was auditory, participants had the ability to inspect the stimulus (presented text) and the input (transcribed text) as often as needed. Even so, these features were not used extensively. Overall, participants reviewed the stimulus 0.9 times per phrase for Method #1, and 0.5 times per phrase for Method #4. Participants reviewed the input 0.6 times per phrase for Method #1, and 0.3 times per phrase for Method #4.

Interestingly, the overall review frequency for Method #4 was less than for Method #1. Clearly, some of this behaviour is attributable to learning effects. However, there is some additional influence of the entry method: since the entry rate was higher for Method #4, participants spent less time entering a character. This allowed participants to retain the stimulus and input longer, thus requiring fewer reviews of the stimulus and input.
3.4.6 Limitations of the Apparatus

Although the apparatus was functional, there were some limitations. The voice synthesizer was of poor quality; replacing the synthesized letters and phrases with clearer, pre-recorded audio files would perhaps be a better option. Approximately half the participants reviewed their input at least once. Input review narrated the entire stream of transcribed text character by character, yet the primary goal was usually to identify the last few characters entered. Participants could potentially be less distracted and demonstrate better retention if they had the ability to inspect just the last word or last few characters instead of the entire phrase.
3.4.7 Participant Questionnaire

A post-test questionnaire collected participants' subjective preferences. All participants showed a preference for Method #4. Eight of twelve participants suggested that the quality of the speech feedback could be better. All participants found Method #1 “very easy” and Method #4 “moderately easy”.

Many comments were also received. One participant felt that the direction of alphabet navigation should be customizable. Another stated that it would be better to have additional letters to jump to, specifically the letter t. This makes sense, as t is the top-ranked consonant in English; replacing the letter u in the set of jump characters with t is one approach that could improve the interaction. Some participants suggested that there should be additional synthesized voices to choose from, in terms of accent and gender.
3.5 Summary

Four variations of text entry using the mouse, keyboard, and spoken text were explored. In all four, the scroll wheel on the mouse was used to navigate a cursor along a linear sequence of characters from a to z. A character was selected by clicking the primary mouse button, and a space was inserted by clicking the secondary button. The modeled keystrokes per character (KSPC) ranged from 6.38 to 2.39.

A formal experiment comparing Method #1 and Method #4 was carried out with twelve blindfolded participants who entered a total of 144 phrases of text. Overall, text entry rates were 2.8 wpm for Method #1 and 4.3 wpm for Method #4.

Method #4 engaged the keyboard in navigating the cursor: five letters (the vowels) were chosen to provide random access for moving the cursor. There was a significant improvement in entry rate, but it was not as large as expected, suggesting that other factors are at play; specifically, the attention demands in Method #4 of determining which vowel is closest to the desired character.

During entry, the visual channel was completely blocked. Accuracy was defined as the degree to which the transcribed text matches the presented text, but this definition does not capture the true errors. Other mental processes resulted in participants entering phrases that differed from the presented phrase, yet were similar or reasonable. Further research is needed to explore how to define accuracy in such circumstances.

Post-experiment analysis revealed that the interaction technique had a significant effect on KSPC: the observed KSPC was, on average, 40.3% greater than the modeled KSPC. Keystroke distributions revealed that the jumping extensions introduced in Method #4 were extensively exploited by all participants. An analysis was also performed at the word level comparing modeled and observed keystrokes per word (KSPW). This breakdown shed light on the specific interactions contributing to the overhead found between methods. The keystroke overhead in Method #1 was almost 2.4 times higher than in Method #4, with the scrolling interactions the dominant contributors.
Chapter 4
Eyes-free Text Entry on Touchscreen Devices (2)

(2) The work presented in this chapter has previously been published as: Tinwala, H. and MacKenzie, I. S. (2009). Eyes-free text entry on a touchscreen phone. Proceedings of the IEEE Toronto International Conference - Science and Technology for Humanity – TICSTH 2009, pp. 83-88. New York: IEEE.

Recently, the mobile industry has seen a rise in the use of touchscreen phones. One market prediction is that approximately 40% of mobile phones will incorporate a touchscreen by the end of 2012 (strategyanalytics.com). Although touchscreen devices are not new, interest in the media and in the HCI community at large has increased since the arrival of the Apple iPhone. Following the iPhone’s release, a wide array of competing products emerged, such as LG's Prada, Samsung's D988, Nokia's N95, and RIM's BlackBerry Storm.

Early touchscreen phones were trimmed-down versions of desktop computers. They were operated with a stylus, necessitating two-handed interaction and a high level of accuracy. Such accuracy is not always possible in a mobile context. As a result, current touch-based mobile phones use the finger for input. The devices allow direct object manipulation as well as gesture recognition using swiping, tapping, flicking, and even pinching (for devices supporting multi-touch). Such novel styles of interaction afford a naturalness that is unparalleled by previous indirect methods (e.g., using a stylus, mouse, or joystick).

Phones with physical buttons are constrained since all interaction involves hardware pre-configured to support the application software. Once built, the hardware is fixed and cannot be customized further, which limits the scope of possible interactions. Touchscreen phones use software interfaces, making them highly customizable and multipurpose. Use of screen space is more flexible, and since there is no physical keypad, the screen can be bigger. As a result, touchscreen phones typically offer a myriad of applications on a single device: phone, texting, e-mail, web browsing, calendar, multimedia playback, games, and more.

However, the benefits of touchscreen phones come at a cost. Without physical keys, a user's ability to engage the tactile, kinesthetic, and proprioceptive sensory channels during interaction is reduced. The demand on the visual channel is increased, and this compromises the “mobile” in “mobile phone”. One goal of this research is to examine ways to reduce the visual demand of interactions with touchscreen phones.

The rest of this chapter explores the design of the system, followed by an empirical evaluation of eyes-free text input. The subsequent chapter presents the enhancements added to the system to improve the interaction, including an algorithm for error correction. These improvements are supplemented with a second study investigating the potential of eyes-free input with the new enhancements in place.
4.1 The Design Space

The primary purpose of mobile devices is communication, so it is important to support alphanumeric entry, even if only to enter a phone number or save a contact’s information. With physical buttons, users develop a sense of the buttons, feel them, and over time remember their locations. This tactile and proprioceptive feedback is invaluable. Users build a spatial motor-memory map, which allows them to carry out basic phone tasks eyes-free, such as making a call, receiving a call, or putting calls on hold. With experience, many mobile phone users carry out text entry tasks (e.g., text messaging) eyes-free by feeling and knowing their way around the device.

Touchscreen phones are different. The feedback afforded by physical buttons is gone. As a result, touchscreen phones are more visually demanding, and users are unable to meet this demand when operating the device in a mobile context. The overload of the visual channel makes it difficult to use the device in a public setting or when engaged in a secondary task, such as walking, attending a meeting, or shopping. Furthermore, the inability to use the device in an eyes-free manner affects people with visual impairments (McGookin et al., 2008).

The natural interaction properties of a large touch-sensitive display allow a rich assortment of applications for touchscreen phones, but the increased visual attention demand prevents the added features from translating into mobile contexts. This creates a gap between what was possible with a physical-button phone and what is possible with current touchscreen phones. Figure 4-1 serves as a descriptive model for thinking about the design space and the elements to consider in bridging this gap.
Figure 4-1. Exploring the design space. Is eyes-free text entry possible on a touchscreen phone?

Combining some form of physical feedback, finger tracking, and other feedback modalities to guide the user through tasks can provide the elements required for eyes-free text entry on a touchscreen phone. The main purpose here is to explore whether eyes-free text entry is even possible on a touchscreen phone, and if so, to what extent. The next sections discuss the proposed solution, including the interface and implementation, followed by a detailed description of the evaluation carried out and a discussion of the results.
4.2 Finding the Right Mix

Technology to enable eyes-free text entry on a touchscreen already exists; it is just a matter of knitting it together. The critical requirement is to support text entry without the need to visually monitor or verify input. Obviously, non-visual feedback modalities are the key. Ideally, the technique should also have a short learning curve so that it is usable seamlessly across various devices and application domains.

Goldberg and Richardson (1993) were the first to propose eyes-free text entry using a stroke-based alphabet. Their alphabet, Unistrokes, was designed to be fast for experts. A follow-on commercial instantiation, Graffiti, was designed to be easy to learn for novices (MacKenzie and Zhang, 1997). However, a longitudinal study (Castellucci and MacKenzie, 2008) found no significant difference in entry speeds between the two methods. To maintain a short learning curve, the Graffiti alphabet was used (see Figure 4-2). The similarity of most characters to the Roman alphabet allows users to build upon previous experience. This encourages quick learning and aids retention.
Figure 4-2. The Graffiti alphabet.

Since the device will be occluded from view, the feedback must be non-visual. Given the capabilities of current mobile devices, it is easy and computationally cheap to incorporate speech synthesis. For certain interactions, such as strokes that are unrecognized, the built-in actuator of the device can be used to provide a short pulse of vibrotactile feedback.
4.3 Design Considerations

Many previous studies have investigated the performance of single-stroke text entry using a stylus or pen on touch-sensitive devices (Bharath and Madhvanath, 2008; Wobbrock et al., 2004; Yatani and Truong, 2007). In this work, the interaction uses the finger. The writing experience predominantly involves a pen-shaped object (pen, pencil, chalk, etc.), and it is unclear to what extent this experience transfers to finger-based text entry.

There are noticeable differences between the way humans use a finger and a stylus. Previous experience with writing translates directly to styli: pen-like devices are generally held between the fingers and thumb, and motor movement is generated from the wrist. This is where finger-based input differs. The amount of wrist movement when drawing with the index finger is minimal. For large strokes, the entire arm is moved; for finer strokes, the wrist merely sets the frame of reference, beyond which input involves motor movements of the finger.

Another issue is the drawing surface. Will it be a region of the touchscreen? Will it be the entire screen? Also, how will the design be incorporated so as to minimize interference with existing widgets on the screen? These are all open questions. The prototype presented here uses the entire screen as the drawing surface. This eliminates the need to home in on a specific area of the device, which is especially suited to touchscreen devices given their lack of physical buttons. The drawing surface is overlaid on existing widgets, so inked strokes do not interfere with other UI elements. Perhaps a mechanism to enter a “text entry mode” could be useful, which would lock the interface and accept text as input, but this was not implemented.

It is useful to set the context of the evaluation at this point. Text entry speeds are not expected to rival those attainable on a soft keyboard, but the ability to enter any amount of text eyes-free on a touchscreen phone is noteworthy. Furthermore, this technique is not intended for lengthy text entry tasks. Rather, the focus is on tasks that require shorter amounts of text, such as text messaging, adding calendar entries, or making lists and notes. Earlier work in this area found that novice users reach text entry speeds of 7 wpm with Graffiti, while experts reach 21 wpm (Fleetwood, 2002).
4.4 Graffiti Input Using a Finger

Text involves many chunks: phrases, words, and characters. The prototype application allows users to enter text one character at a time. Once a word is completed, it is terminated with a SPACE and appended to the message. Each tier of this organization provides a different form of feedback.

Figure 4-3 illustrates the interface along with the evolution of the phrase “the deadline is very close” on the Apple iPhone. The Graffiti alphabet is overlaid on the screen to promote learning. In previous studies, the strokes were displayed on a wall chart or away from the interface (MacKenzie et al., 2006; Wobbrock et al., 2004). This demanded visual attention at a location away from the interface and could potentially affect throughput.
Figure 4-3. Text entry interface – the Graffiti alphabet is overlaid on the screen to aid learning. The translucent stroke map has been enhanced for clarity.
4.5 Text Entry Interaction

To enter text, users touch the phone and draw a stroke on the screen. Digitized ink follows the user’s finger until it is lifted. At the end of the stroke, the application attempts to recognize the stroke and identify the intended character. At this point, there are two possible outcomes. If the stroke is recognized, the iPhone speaks the character and appends it to the word entered so far. Alternatively, if the stroke is unrecognized, the iPhone’s vibration actuator pulses once. This response communicates the result of the user’s last action with no dependency on audio or visual feedback.

At the end of a word, the user double-taps. This enters a SPACE after the current message and appends the word entered to the message. The system responds with a soft beep to signal that the word has been appended. Although a single tap would suffice, a double-tap was chosen to prevent accidental input that may occur while users hold the phone in their hands, or when they touch and lift before entering a stroke; a common behaviour when users hesitate.

The application also leveraged the iPhone’s built-in accelerometer to allow users to signal message completion to the system: shaking the phone ends the message and allows the application to proceed. The utility of this gesture is noteworthy. A physical motion of the entire device translates to a significant signal to the application (end of message), with no reliance on the visual or auditory channel.

Lastly, a mechanism for error correction was included. Figure 4-4 depicts a ‘delete stroke’. The basic idea is to swipe the finger from right to left, drawing a stroke moving west. This gesture is accompanied by a non-speech sound akin to an eraser rubbing against paper when correcting pencil marks. This way, users can go back and correct an error.
Figure 4-4. A delete stroke is formed by a “left swipe” gesture.

The following is a summary of the interactions and their accompanying non-visual feedback:

  Recognized stroke:      character is spoken
  Double-tap for SPACE:   soft beep
  Unrecognized stroke:    vibration
  Delete stroke (←):      erasure sound
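The following Java sketch mirrors this feedback summary and adds a simple left-swipe test for the delete stroke. It is illustrative only: the actual prototype ran on the iPhone in Objective-C, and speak(), playSound(), vibrateOnce(), and the swipe thresholds are assumptions, not the prototype's code.

    public class StrokeFeedback {
        enum Outcome { RECOGNIZED, UNRECOGNIZED, SPACE, DELETE }

        // A delete stroke is a right-to-left swipe: mostly horizontal,
        // ending well to the left of where it started (assumed thresholds).
        static boolean isDeleteStroke(int startX, int startY, int endX, int endY) {
            return (startX - endX) > 100 && Math.abs(endY - startY) < 40;
        }

        static void giveFeedback(Outcome outcome, char recognized) {
            switch (outcome) {
                case RECOGNIZED:   speak(String.valueOf(recognized)); break; // character is spoken
                case SPACE:        playSound("soft_beep");            break; // word appended
                case UNRECOGNIZED: vibrateOnce();                     break; // vibration pulse
                case DELETE:       playSound("eraser");               break; // erasure sound
            }
        }

        static void speak(String s)         { /* platform text-to-speech */ }
        static void playSound(String name)  { /* platform audio */ }
        static void vibrateOnce()           { /* platform vibration actuator */ }
    }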
4.6 Implementation

4.6.1 Hardware Infrastructure

The hardware was an Apple iPhone and an Apple MacBook host system (2.4 GHz Intel Core 2 Duo with 2 GB of RAM). The host was used for data collection and processing. The two devices communicated via a wireless link over a private, encrypted, ad hoc network (see Figure 4-5). The wireless link allowed participants to move the device freely in their hands.

Figure 4-5. Hardware used. The iPhone and the host system communicated over a wireless link.

Existing networks were not used, to avoid reliance on a network infrastructure that could not be controlled. Furthermore, test connections using a router gave rise to latency issues and degraded the quality of the interaction.
4.6.2 Host and Device Software Architecture

A host application was developed using Cocoa and Objective-C in Apple's Xcode development environment. The application listened for incoming connections from the iPhone. Upon receiving a request and establishing a connection, the program read a set of 500 phrases ranging from 16 to 43 characters (mean = 28.6) (MacKenzie and Soukoreff, 2003). During execution, the host application randomly selected a phrase and presented it to the participant for input.

The host application guided the participant through the tasks by presenting stimuli and responding to user events. All events generated by interaction on the iPhone (finger tracking, tapping, shaking, swiping, etc.) were received and processed by the host machine. As the experiment progressed from one phrase to the next, the stimulus on the host device was updated to alert participants of the text to enter (see Figure 4-6).

Figure 4-6. The host interface. Participants were only allowed to see the zoomed-in area of the interface. The black region recreates the participant’s digitized ink as it is received.

Apart from the area highlighted in green, the remaining user interface elements of the host application were used for debugging and statistical information. The region in black is an OpenGL rendering view that recreated the digitized ink as it was drawn on the device. The interface also displayed a copy of the word and message currently entered, along with an {x, y} formatted trace of the stroke. It is important to note that during the interaction, the interface was resized so as to hide all elements except the highlighted ones, to prevent any bias in the results. This is especially important for the eyes-free mode.

The device application was developed using OpenGL ES in the same environment as the host application (Xcode). There were two modes. Upon start-up, the application attempts to connect to the host. A successful connection is signalled by a soft non-speech tone. After this, all interaction is sent over the wireless link to the host application as user events. If a connection is not made (WiFi off, host application not running, or other issues), a high-pitched tone is sounded and a “rollback” mode is entered. In this mode, no events are transmitted, but the application still functions, allowing a user to interact with the device. This was particularly useful for informal evaluations and demonstrations during the design phase.

With this setup in place, an empirical evaluation was carried out to test whether eyes-free text entry on touchscreen devices is viable. The next section presents the methodology and results of an initial study.
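As an illustration of the event stream, the device-to-host traffic can be pictured as lines of text over a TCP socket. The actual wire protocol is not documented here, so the message format, address, and port below are assumptions.

    import java.io.PrintWriter;
    import java.net.Socket;

    public class EventSender {
        public static void main(String[] args) throws Exception {
            // Assumed host address and port on the ad hoc network.
            try (Socket socket = new Socket("192.168.2.1", 5000);
                 PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
                out.println("TOUCH_DOWN 120 310"); // finger touches the screen
                out.println("MOVE 128 298");       // digitized ink samples...
                out.println("TOUCH_UP 190 240");   // stroke complete; host recreates the ink
            }
        }
    }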
4.7 Methodology

4.7.1 Participants

Twelve paid volunteer participants (8 male, 4 female) were recruited from the local university campus, from varying disciplines. Participants ranged from 18 to 31 years (mean = 23, SD = 4.9). All were daily computer users, reporting 2 to 8 hours of usage per day (mean = 5.5, SD = 2.2). Two indicated prior experience with Graffiti, describing their expertise as “novice”. Three participants used a touchscreen phone every day, while the others had either no experience or less than fifteen minutes of usage. Eleven participants were right handed.
4.7.2 Apparatus

The apparatus was described in Section 4.6.
4.7.3 Procedure and Design Participants
completed
a
pre-test
questionnaire
soliciting
demographic
and
computer/phone usage information (results cited above). The experiment was divided into three parts: training, eyes-on, and eyes-free. The training phase involved entering the alphabet A to Z three times; entering the phrase “the quick brown fox jumps over the lazy dog” twice; and entering one random phrase from the phrase set (MacKenzie and Soukoreff, 2003). The goal was to bring participants up to speed with the Graffiti alphabet. Following this, there were two sessions of entry: 4 blocks × 5 phrases/block eyes-on, followed by 4 blocks × 5 phrases/block eyes-free. The eyes-on and eyes-free modes were not counterbalanced. The experiment was designed as an evaluation of eyes-free text entry, rather than as a comparison of two competing interaction techniques. Learning during the eyes-on mode was seen as a prerequisite and therefore, was understood to contribute to performance on the eyes-free mode. More about this later. 74
Prior to data collection, the experimenter explained the task and demonstrated the software, including the method to enter a SPACE (double tap), terminate entry at the end of a phrase (shake the device), and correct errors. Error correction was restricted to the previous character only; this was done to prevent users from going back to correct errors made at the beginning of a phrase. The Graffiti alphabet was displayed on the device as an overlay so that users would not have to look away and home in again. Participants were instructed to enter text “as quickly and accurately as possible” and were encouraged to take a short break between phrases if they wished.

The software recorded time stamps for each stroke, per-character data, per-phrase data, ink trails of each stroke, and other statistics for follow-up analyses. Timing for each phrase began when the display was touched to draw the first stroke and ended with the shake of the phone.

Certain characters posed difficulty, such as the letter “G”. For this and other such characters, alternative entry methods were demonstrated. Figure 4-7 shows two ways of entering “G”. Preliminary tests had revealed that entering G as on the left was harder than the alternative of drawing a six.
Figure 4-7. Ink trails of an unrecognized (left) and recognized (right) “G” stroke.

Participants sat on a standard study chair in front of a desk with the host machine at eye level (see Figure 4-8).

Figure 4-8. Participants entering text on the iPhone eyes-on (left) and eyes-free (right). In eyes-free mode, participants could not see the iPhone display.

The interaction was two-handed, requiring participants to hold the device in their non-dominant hand and enter strokes with their dominant hand. Informal tests suggested that one-handed interaction using the thumb for input is possible, but more error prone. However, this mode was not tested.
4.8 Results and Discussion

In all, participants entered 240 phrases in the eyes-on mode and 240 phrases in the eyes-free mode.
4.8.1 Speed

Raw Entry Speed

The results for entry speed are shown in Figure 4-9. As expected, entry speed increased significantly with practice (F1,3 = 17.3, p < .0001). There was also a difference in speed by entry mode (F1,11 = 6.8, p < .05). The average entry speed for the eyes-on mode was 7.00 wpm, which is the same novice speed reported by Fleetwood et al. (2002). The average entry speed for the eyes-free mode was 8% faster, at 7.60 wpm. The maximum entry speed reached by one participant was 12.00 wpm.

As noted earlier, the eyes-free and eyes-on modes were not counterbalanced; thus, the higher entry speed in the eyes-free mode is in part due to practice in the eyes-on mode. However, this explanation is a minor one. What is notable is that there was no degradation in performance in the eyes-free mode; in fact, there was an increasing trend from one block to the next. This, alone, is an excellent result and attests to the potential benefits of non-visual feedback modalities for touchscreen phones.
Figure 4-9. Entry speed (wpm) by entry mode and block. (See text for details of the adjusted results.)

Adjusted Entry Speed

Figure 4-9 also reports adjusted entry speed. For this metric, the time for entering strokes that were unrecognized was removed. Note that this does not remove errors from the transcribed text, just the time for unrecognized strokes. Overall, adjusted entry speed differed between modes (F1,11 = 13.7, p < .005). There was a significant effect of block on adjusted entry speed between modes as well (F1,11 = 21.8, p < .0001). Participants reached an overall adjusted entry speed of 9.50 wpm in the eyes-free mode, an improvement of 15% over the 8.30 wpm for eyes-on. On one trial, a maximum adjusted entry speed of 16 wpm was reached.

One flaw in the design of the experiment yielded results that are on the conservative side: the time measurement included the interval between the stroke for the last character in a phrase and the final shake of the device to terminate the trial. This extra time tended to push the reported entry speed down somewhat. Overall, the entry speeds were likely about 5-10% faster than reported above. Furthermore, participants sometimes inspected their input before shaking the device in the eyes-on mode, which adds time, further decreasing the reported entry speed.
4.8.2 Accuracy and KSPC

Participants were not required to maintain synchrony with the presented text. Doing so would be problematic for the eyes-free mode, as it would require visual inspection by the participant to verify the current cursor position. As a result, accuracy was analysed in two ways. One analysis used the conventional error rate, computed from the “minimum string distance” (MSD) between the presented and transcribed text phrases (Soukoreff and MacKenzie, 2001). The other analysis used keystrokes per character (KSPC): the number of keystrokes used to generate each character in the transcribed text. KSPC is both a characteristic measure and a dependent measure; in this experiment, it is used as a dependent measure. Of course, “keystrokes” here means “finger strokes”.
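For reference, MSD is the classic edit distance, computed by dynamic programming. A minimal sketch follows, with the error rate normalized by the longer of the two strings as in Soukoreff and MacKenzie (2001); this is illustrative, not the experiment's analysis code.

    // Minimum string distance: the fewest character insertions,
    // deletions, and substitutions needed to turn the presented
    // text into the transcribed text.
    public class Msd {
        static int msd(String presented, String transcribed) {
            int m = presented.length(), n = transcribed.length();
            int[][] d = new int[m + 1][n + 1];
            for (int i = 0; i <= m; i++) d[i][0] = i;
            for (int j = 0; j <= n; j++) d[0][j] = j;
            for (int i = 1; i <= m; i++) {
                for (int j = 1; j <= n; j++) {
                    int cost = presented.charAt(i - 1) == transcribed.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,    // deletion
                                                d[i][j - 1] + 1),   // insertion
                                       d[i - 1][j - 1] + cost);     // substitution
                }
            }
            return d[m][n];
        }

        // Error rate as a percentage, normalized by the longer string.
        static double errorRate(String p, String t) {
            return 100.0 * msd(p, t) / Math.max(p.length(), t.length());
        }
    }

MSD Error Rates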
Figure 4-10 shows the results of the MSD error rate analysis. Overall, error rates were very low, at 0.4% (accuracy > 99%). An analysis of variance showed no significant effect for error rate, suggesting that errors are attributable to variation in participant behaviour rather than to differences between the modes under test.
Figure 4-10. MSD error rate (%) by block and entry mode.

One interesting observation is that errors in both modes decreased by block. This suggests that users got better with the technique as they progressed. However, it is equally probable that users corrected more of their errors in later blocks. Since the error rate is so low, an inspection of the raw data was done to determine the sorts of behaviours present. A closer look revealed that the most common error was forgetting to enter a SPACE at the end of a word (double tap). With practice, users became more alert to this, thus decreasing the error rate across blocks. The SPACE character was forgotten after the most common words, such as “the”, “of”, and “is”. This behaviour was also found in the LetterScroll experiment for physical devices, as reported in Section 3.4.3 under Word-level Analysis of Keystrokes. Furthermore, if a stroke was unrecognized, it was not logged as an error. Such strokes are accounted for in the KSPC measures, discussed next.
KSPC Analysis

The MSD error analysis only reflects errors remaining in the transcribed text. It is important to also consider the strokes used to correct errors. For this, the KSPC metric provides another sense of accuracy. Every correction adds two strokes: one for the delete operation and one for the corrected character. Strokes that were not recognized by the recognizer are also added. If entry were perfect, the number of strokes would equal the number of characters, i.e., KSPC = 1.0. (Note: the double-tap finger gesture for a SPACE was counted as one stroke.) However, participants make errors and corrections, and the recognizer occasionally fails to recognize a stroke. These behaviours take time, which decreases throughput. Thus, KSPC is useful in judging the accuracy of, and effort in, entering text.

Results of this analysis are shown in Figure 4-11. Since the chart uses a baseline of 1.0 and perfect input occurs at KSPC = 1.0, the entire magnitude of each bar represents the overhead of unrecognized strokes and corrected errors.
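A minimal sketch of this observed-KSPC accounting for a single phrase follows; the parameter names are illustrative, not the logging software's identifiers.

    // Sketch: observed KSPC for one phrase. Every finger stroke counts:
    // recognized strokes, unrecognized strokes, delete strokes, and the
    // double-tap for SPACE (counted as one stroke).
    public class ObservedKspc {
        static double kspc(int recognized, int unrecognized, int deletes,
                           int spaceTaps, String transcribed) {
            int totalStrokes = recognized + unrecognized + deletes + spaceTaps;
            return (double) totalStrokes / transcribed.length();
        }
    }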
Figure 4-11. Keystrokes per character (KSPC) by block and entry mode.

Overall, the average KSPC in the eyes-free entry mode was 9% higher, at 1.36, compared to 1.24 for eyes-on. The trend was consistent and significant between entry modes (F1,11 = 38.3, p < .0001), but not within blocks. The 4th block is the most interesting: the difference in KSPC between modes is greatest there. Also, the eyes-on KSPC in the 4th block is the lowest among blocks, while the eyes-free KSPC in the 4th block is the highest. These opposite extremes suggest that participants were getting better as they progressed in the eyes-on mode, while apparently investing more effort in the 4th block of the eyes-free mode. The exact reason is unclear, but it may be speculated that fatigue toward the end of the experiment caused participants to be slightly more reckless or error prone.
4.8.3 Unrecognized Strokes and Corrections

The KSPC analysis considered unrecognized strokes at the phrase level. A deeper character-level analysis was also performed to find problematic characters. Results are shown in Figure 4-12, with letters sorted by their relative frequency in the English language. Each stroke was captured as an image, allowing an examination of traces that were unrecognized. Although a stroke might be unrecognized, it was easy to tell what was intended by looking at the images (see Figure 4-13).
Figure 4-12. Unrecognized stroke frequency (%) by character.

Error frequency is defined as the number of occurrences of an unrecognized stroke for a given character relative to the total number of unrecognized strokes. This gives a sense of the frequency of an error relative to the remaining unrecognized strokes. These data must be interpreted with care: one may argue that a character was unrecognized more frequently than others solely because it occurs more frequently in the English language. For this reason, the focus is on characters that correlate with the subjective ratings provided by participants in the post-test questionnaire.

In Figure 4-12, the most frequently unrecognized stroke is the character “O”. This character is of special interest because 9 of 12 participants reported it as problematic. A closer look at the stroke traces reveals why.

Figure 4-13. Stroke traces for the letter “O” for eyes-on (top two) and eyes-free (bottom two) modes.

As is apparent, variations in the stroke trace are problematic for the recognizer: there is overshoot, undershoot, or displaced begin-end locations. Participants have the correct mental model, but are unaware of the spatial progress of their finger. Furthermore, as Figure 4-13 illustrates, the problem worsens in the eyes-free mode, as participants had difficulty judging where drawing begins and ends. Other characters reported as troublesome in the post-test questionnaire included “T”, “E”, and “N”.
The number of corrections did not differ significantly between blocks; however, there was a significant main effect of entry mode (F1,11 = 8.8, p < .05). Overall, the mean number of corrections was 1.64 per phrase. As expected, the number of corrections in the eyes-free mode was 45% higher, at 2.00 per phrase, while the eyes-on mode averaged 1.30 corrections per phrase.
4.8.4 Stroke Size

A difference was also found in the size of the strokes users made. In general, most participants started off drawing large strokes covering much of the screen. However, stroke size decreased as they progressed through the experiment. Although the recognizer performs better with a larger number of sample points, larger strokes appeared to deviate more in overall structure from the desired shape. This was confirmed by the observation that the one participant who made very small strokes had the lowest average unrecognized stroke count per phrase (3.10). (The overall unrecognized stroke count per phrase was 5.40 across both modes.)

To investigate a possible “stroke size” effect, a separate analysis of the amount of screen space used during entry was carried out. For this, “screen space used” is defined as the area of a stroke's bounding rectangle as a percentage of the total screen area. The progression of screen space used by entry mode and block is shown in Figure 4-14. While the main effect of entry mode was not significant (F1,11 = 0.014, ns), the main effect of block was (F3,33 = 4.068, p < .05). During the 1st block, participants sized their finger gestures such that strokes occupied 11.7% of the available screen space, on average. The same statistic for the 4th block is 10.3%, representing a reduction of 14.3% in the amount of screen space used for strokes.

Figure 4-14. Amount of screen space used (%) for strokes by entry mode and block.

The effect by block in Figure 4-14 is not a large one, however. This attests to the considerable variation participants exhibit in their personal style and proficiency in entering finger gestures. However, the effect is interesting and may prove more dramatic as expertise develops. No existing research was found that reports such an effect; hence, it is included here.
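A minimal sketch of this metric for a single stroke follows (illustrative only).

    // "Screen space used" for one stroke: the area of the stroke's
    // bounding rectangle as a percentage of the total screen area.
    public class ScreenSpace {
        static double screenSpaceUsed(int[][] points, int screenW, int screenH) {
            int minX = Integer.MAX_VALUE, minY = Integer.MAX_VALUE;
            int maxX = Integer.MIN_VALUE, maxY = Integer.MIN_VALUE;
            for (int[] p : points) {              // p = {x, y} ink sample
                minX = Math.min(minX, p[0]);  maxX = Math.max(maxX, p[0]);
                minY = Math.min(minY, p[1]);  maxY = Math.max(maxY, p[1]);
            }
            double boundingArea = (double) (maxX - minX) * (maxY - minY);
            return 100.0 * boundingArea / ((double) screenW * screenH);
        }
    }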
4.8.5 Informal Feedback and Questionnaire

At the end of testing, participants were asked for observations and feedback on their experience with the entry modes. The most provocative comment was that some participants adopted partially eyes-free behaviour in the eyes-on mode. Apparently, they quickly learned the strokes and did not need to inspect them on the overlaid map. Instead, their gaze fixated on the presented text displayed on the host machine and followed the characters as they progressed. They would occasionally glance at the device, for example to visually check a character that was unrecognized. When asked about this behaviour, participants responded by saying that “the audio feedback was sufficient to let me know where I am”.

Many participants felt they could go faster, but that doing so might result in unrecognized strokes, which would require additional time to re-enter. In an effort to prevent unrecognized strokes, participants became more careful in their drawing. This is a classic example of the speed-accuracy tradeoff: speed up and accuracy suffers. Certainly, this effect would be mitigated with a better recognizer or with continued practice.

It was also interesting to note that one participant drew at an angle, despite holding the device in the correct orientation. This can be explained by their handwriting style, which translated into the way they inked strokes on the device. To correct this, the participant had to make a conscious effort to ensure that strokes were drawn straight. This behaviour could perhaps be accounted for with a customizable setup parameter to spatially rotate sample points before recognition.

The post-test questionnaire solicited comments and responses to five questions. The results are shown in Table 4-1.
Question                                                                            Mean (SD)
How satisfactory was the stroke recognition? (Poor = 1; Excellent = 5)              3.3 (0.49)
How fatigued were you when entering text without looking?
  (Not fatigued = 1; Very fatigued = 5)                                             2.5 (0.6)
How difficult was it to enter text without looking? (Not hard = 1; Very hard = 5)   2.3 (0.77)
How likely are you to use eyes-free text entry? (Not likely = 1; Very likely = 5)   3.4 (1.3)

Table 4-1. Participant questionnaire.

Participants also provided ratings for the most problematic characters. The highest rated character was “O” (9 out of 12), followed by “T” (7 out of 12) and “E” (6 out of 12). A predominant comment was that the experiment was enjoyable, and that the eyes-free aspect of the interaction was interesting and thought provoking. Other comments were generally about the recognizer. One participant stated that “there is a severe problem with the letter ‘E’ in the eyes-free mode”.
4.9 Apparatus Limitations

As discussed earlier, the poor performance of the recognizer with strokes that were visually correct but technically insufficient was troublesome. This suggests that the recognition engine needs to be improved, or perhaps supplemented with other techniques that make it more robust.

Additionally, there was an orientation issue in the eyes-free mode. The shake gesture required shaking the device and placing it back at knee level, followed by homing in on the device with the preferred hand. Occasionally, participants would unknowingly alter the orientation of the device. This altered the set of strokes sent to the recognition engine, which caused an increase in unrecognized strokes. If participants voiced this issue, they were asked to reorient the device and continue. A potential solution may be to use the iPhone’s built-in accelerometer to detect the device orientation and “normalize” the strokes so that slight deviations in orientation are automatically handled; however, the accuracy of such a mechanism is uncertain.
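To make the idea concrete, such normalization amounts to a standard 2-D rotation of the stroke's sample points about the stroke centroid. The sketch below is illustrative; the tilt angle is assumed to come from an accelerometer estimate or a per-user setting, neither of which was implemented in the prototype.

    // Sketch: normalizing a stroke by rotating its sample points
    // about the stroke centroid by a given angle.
    public class StrokeNormalizer {
        static double[][] rotate(double[][] points, double angleRadians) {
            double cx = 0, cy = 0;
            for (double[] p : points) { cx += p[0]; cy += p[1]; }
            cx /= points.length;
            cy /= points.length;                                  // centroid
            double cos = Math.cos(angleRadians), sin = Math.sin(angleRadians);
            double[][] out = new double[points.length][2];
            for (int i = 0; i < points.length; i++) {
                double dx = points[i][0] - cx, dy = points[i][1] - cy;
                out[i][0] = cx + dx * cos - dy * sin;             // standard 2-D rotation
                out[i][1] = cy + dx * sin + dy * cos;
            }
            return out;
        }
    }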
4.10 Summary

This chapter presented a finger-based text entry technique for touchscreen devices combining a single-stroke text entry method (Graffiti) with auditory and vibrotactile feedback. A key feature is the support for eyes-free text entry. No previous work was found that evaluates such an interaction. The primary goal was to explore this design space and to ask whether eyes-free text entry is even possible.

From the evaluation, an overall entry speed of 7.30 wpm was found, with speeds 8% higher (7.60 wpm) in the eyes-free mode. This was contrary to expected outcomes: it was expected that the difficulty of occluding the device would bring down the overall text entry speed in the eyes-free mode, but the learning gained during the training and eyes-on sessions contributed to the eyes-free entry speed, despite higher overall error rates in the eyes-free mode.

The error rates in the transcribed text did not differ significantly between entry modes. Participants entered text with an accuracy of 99.6%. KSPC analyses revealed that eyes-free text entry required an average of 9% more finger strokes per phrase. Deeper character-level stroke analyses revealed that certain characters had inherent problems with stroke formation or recognition. These point to the limitations and weaknesses of the recognizer, but also to behavioural factors in stroke formation in the absence of visual feedback. There was also a 14.3% reduction in the amount of screen space used for finger strokes from the 1st to the 4th block of the experiment.

Overall, the results are promising. The next chapter explores an enhanced version of this application that adds support for error correction. The primary intent is to improve the robustness of the interaction and to determine the user’s intention using a dictionary, providing a resilient mechanism for dealing with unrecognized strokes.
Chapter 5
Enhancing the Eyes-free Touchscreen Interaction

The previous chapter focused on the design, implementation, and experimental evaluation of a stroke-based interaction for eyes-free text entry on touchscreen devices. The initial analysis and user feedback highlighted some shortcomings of the interaction. This chapter explores these problems and provides improvements that enhance the interaction. The redesigned interaction is followed by an empirical evaluation.
5.1 The Redesign Process 5.1.1 Issues Found The first issue with the interaction was speech feedback. Although it was well delivered, informing the user of every character with speech was more than what was needed – even for the eyes-free mode. In addition, users invested time in confirming the character that was entered after each stroke. This added significantly to the time taken to complete text entry for a phrase, thus affecting throughput. Further, there is no feedback at the end of a word, making it impossible to determine what word was entered in the eyes-free mode. This can be an issue during the evolution of text, increasing the potential of forgetting where the user is in a sentence. Second, the interaction provided vibrotactile feedback to users when an unrecognized stroke was encountered. So, when a stroke is entered that has no match in 91
the alphabet, the device vibrated to alert the user and allowed for repeated attempts. Users acknowledged this as a useful feature during training, but found it frustrating and time-consuming during the experiment. The constant vibrations generated by a lack of experience with the alphabet could be useful for training, but are cumbersome for actual text entry. Lastly, the lack of any dictionary-based support meant that the system was not able to provide any assistance to the user when entering text. Such a feature can potentially correct many of the errors made and enhance the robustness of the interaction. Such ‘system help’ would be invaluable for text entry in both the eyes-on and eyes-free modes.
5.1.2 Improvements and Problems
Results from the previous experiment made it clear that certain changes to the interaction could improve performance and ease of use. Primarily, user feedback had to be approached differently. As the first enhancement, speech feedback must be shifted from the character-level to the word-level; i.e., users are alerted to the word that has been entered, via speech, when a SPACE character is inserted (double-tap). This is a viable alternative, as users are able to learn and retain the Graffiti alphabet fairly easily and quickly. In addition, the vibrotactile feedback had to be reduced or eliminated such that its absence would not affect the overall experience of the interaction. In other words, a
different approach is needed that either drastically decreases the need for such feedback or relocates it so that its frequency is reduced to an acceptable level. One of the problems with rethinking the interaction is that new problems emerge. For instance, consider the removal of vibrotactile feedback. This raises the question: how is the user informed of an unrecognized stroke? Secondly, variations in handwriting can result in invalid recognition, or misrecognition. Shifting feedback from the character-level to the word-level takes away this information, as users may not be aware that a stroke they entered has been misrecognized as some other letter. One approach is to provide speech feedback every time a character is input that differs from the expected character, but this is problematic because it breaks the natural flow of text entry. Furthermore, it is not possible to know for certain what the expected character is when there is no stimulus text, e.g., during real-world use. The next section addresses these issues and discusses the solutions employed to overcome the potential drawbacks of these changes.
5.2 Exploring the Enhancements
5.2.1 Interaction Modifications
Speech and Vibrotactile Feedback
The new interaction provides speech feedback at the word-level. As each character is entered, users receive a short, non-speech ‘click’ sound. The purpose of this sound is to register that a stroke was received, regardless of whether it was recognized. Text entry progresses with a sequence of characters that are acknowledged with this sound. Once the word is complete, users double-tap to enter a SPACE character. At this point, the system speaks out the last word entered. As a result, users are able to progress through text entry without interruption. This chunks the text at the word level, alerting users to the last entered word instead of the last entered character. It is expected that this will improve the flow of the interaction and improve throughput. The vibrotactile feedback was useful but troublesome when it fired constantly, especially for characters that posed a problem. For instance, the letter “O” was not recognized well in the eyes-free mode, and this triggered many vibrations. As a result, the vibrotactile feedback was removed. A correction algorithm was employed to handle such situations. An in-depth analysis of this algorithm is provided in Section 5.2.2.
Phrase and Word Navigation
In the last experiment, a shake gesture was used to signal message completion. The shake gesture moved users from one phrase to the next. However, this was problematic because it did not capture the true text entry time: users tended to check the text that was entered before shaking the device. The resulting delay is minor but decreases throughput by an average of 5-10%. Consequently, the shake gesture was eliminated. In this version, the system automatically moves the user from one phrase to the next when the last character of the phrase is received. Furthermore, text entry data are now collected in word chunks. In the previous experiment, data were collected at the phrase level. This made it difficult to analyze how
individual characters affected the interaction. By dividing data collection into individual words, it is possible to compare the presented character with the entered character. Consider the following example:

Presented: physics and chemistry are hard
Entered:   physsicsamd chegistrx arehard
When analysing these data at the phrase level, it is difficult to determine where one word ends and the next one begins. This occurs when users forget to enter a SPACE, and is common when entering text eyes-free. Although it is not possible to completely eliminate this problem, its likelihood can be decreased. The host application was modified to guide users through phrase entry as well as word entry. The application now highlights the word that must be entered. Figure 5-1 illustrates this by highlighting the word “apartments”. This prompts and reminds users to enter a SPACE character to move the highlight to the next word.
Figure 5-1. Modified interface with word highlighting.
5.2.2 Dictionary-based Error Correction
Section 5.1.2 outlined some of the problems associated with these modifications, especially the lack of feedback for unrecognized and misrecognized characters. To deal
with this, an algorithm was designed that achieves two goals. First, it handles errors that occur when a character is not recognized or invalid recognition takes place. Second, it assists users in finding the right word using a dictionary if multiple words match the entered text. Several methods are incorporated into the algorithm to deal with such errors and provide suggestions.
Handling Stroke Errors
When an unrecognized stroke is encountered, it is replaced with a period symbol. As an example, consider the word “hello” where the first “l” is unrecognized. The user is not interrupted at all. Instead, the unrecognized letter “l” is replaced with a period: “he.lo”. In the event of a misrecognized stroke, no changes are made. The application simply accepts the stroke as is, because at this point there is no way to tell whether the stroke has been misrecognized. For instance, consider again the word “hello”, where the second occurrence of the letter “l” is misrecognized as an “i”. In this case, the text looks like “helio”. If we combine the two errors, the text looks like “he.io”. The next step is to handle these errors. Once the user completes entry for a word and double-taps to insert a space, the word is passed on to the algorithm for error checking and handling. The algorithm’s routines fire to check and correct any errors that may have taken place. The correction algorithm takes two elements of data as input: first, a 9,000 word dictionary with word frequencies; and second, the word that was entered by the user, with unrecognized strokes replaced by a period and misrecognized strokes left as
is. The dictionary was obtained from the British National Corpus (BNC, 2009) and has the following form:

the 5776384
of 2789403
and 2421302
a 1939617
in 1695860
Each line contains a word and its frequency in the language corpus, separated by a space. The operations that take place are discussed next.
Regular Expression Matching
The first task involves narrowing the search space. It is assumed that the user has entered the correct length of the word. Based on this, the search is conducted over all words in the dictionary that have a matching length. So, for the word ‘hello’, the search considers all words of length 5. Suppose the word ‘hello’ was entered as ‘hel.o’, indicating that one ‘l’ was unrecognized. The algorithm searches for all words that match the pattern ‘hel.o’, where the period matches any character. This results in a single match: the word ‘hello’. Any other word with unrecognized characters is dealt with similarly. If the spelling is correct and some of the characters are unrecognized, regular expression matching provides a resilient mechanism for identifying the correct word.
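To make this concrete, the following sketch shows how the search might be implemented. Only the dictionary format and the period-as-wildcard rule come from the text; the function names are illustrative, not the thesis code:

import re

def load_dictionary(path):
    # Parse the BNC-style list: one "word frequency" pair per line.
    freq = {}
    with open(path) as f:
        for line in f:
            word, count = line.split()
            freq[word] = int(count)
    return freq

def regex_matches(entered, dictionary):
    # The entered word serves directly as the pattern: a period (an
    # unrecognized stroke) is the regex wildcard for one character,
    # and fullmatch enforces the same-length assumption.
    pattern = re.compile(entered)
    return [word for word in dictionary
            if len(word) == len(entered) and pattern.fullmatch(word)]

# regex_matches('hel.o', {'hello': 971, 'helps': 83})  ->  ['hello']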
However, what if there are spelling mistakes or misrecognized characters in the text? In these cases, an alternative technique based on minimum string distance must be employed.
Minimum String Distance (MSD) Searching
The minimum string distance (MSD) between any two strings is the minimum number of primitives (insertions, deletions, or substitutions) needed to transform one string into the other. Using this metric, it is possible to detect misrecognized characters and find matching words. Consider the following example where the word ‘heggo’ is transformed into the word ‘hello’:

heggo
helgo
hello
Done.
The above transformation requires two substitution operations to transform ‘heggo’ into ‘hello’; hence, it has an MSD value of 2. Assuming the user entered the correct length of the word, it is possible to narrow the search space drastically and find a viable match. Note that this algorithm focuses on substitution primitives, owing to the assumption that the user has entered the correct length of the word.
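Because the algorithm assumes the entered word has the correct length and restricts itself to substitutions, the MSD computation reduces to counting mismatched positions. A minimal sketch under that assumption (illustrative names, not the thesis code):

def substitution_distance(a, b):
    # With equal lengths and substitution-only primitives, the MSD is
    # just the number of positions where the strings differ; a period
    # (unrecognized stroke) simply never matches a letter.
    assert len(a) == len(b)
    return sum(ca != cb for ca, cb in zip(a, b))

def msd_matches(entered, dictionary, max_msd):
    # All same-length dictionary words within max_msd substitutions.
    return [word for word in dictionary
            if len(word) == len(entered)
            and substitution_distance(entered, word) <= max_msd]

# substitution_distance('heggo', 'hello')  ->  2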
However, the problem with this method is determining the bounds of the MSD value. For instance, if we search for all words that have an MSD value of 5 or less for the input word ‘heggo’, the result is every 5-letter word in the dictionary, because the word ‘heggo’ is itself 5 characters long. In other words, it is impossible to tell how many misrecognized characters exist in the entered text. An MSD value of 1 may find nothing. On the other hand, searching for all words within an MSD value of 4 or more may return too many inappropriate matches. To tackle this issue, data from the first experiment were used to develop a heuristic. The resulting MSD bound is a function of the word length, as follows:

if wordLength <= 4          // 1, 2, 3, or 4 characters
    set MSD value to 1
else if wordLength <= 6     // 5 or 6 characters
    set MSD value to 2
else if wordLength <= 8     // 7 or 8 characters
    set MSD value to 3
else
    set MSD value to FLOOR(wordLength / 2)
For words that are up to 4 characters in length, the algorithm searches for 1 misrecognized character of text. For words that are either 5 or 6 characters long, the algorithm allows for 2 misrecognized characters of text, and so on. For words that are 9 characters or more, the number of allowable misrecognized characters is the floor of half the word length. There is one caveat. In the event no matching words are found during
this search, the MSD value is incremented by one and the search is repeated. This modifies the mapping for words shorter than 9 characters as follows:

if wordLength <= 4          // 1, 2, 3, or 4 characters
    set MSD value to 2
else if wordLength <= 6     // 5 or 6 characters
    set MSD value to 3
else if wordLength <= 8     // 7 or 8 characters
    set MSD value to 4
else
    set MSD value to FLOOR(wordLength / 2)

Combining the Results
One final step remains. The algorithm performs two searches: regular expression matching and MSD searching. Once the results of both searches are obtained, they need to be merged. The merge operation is done as follows:

listA = words found using regular expression matching
listB = words found using MSD matching
resultList = listA U listB, sorted by word frequencies
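Pulling the pieces together, a sketch of the whole correction step might read as follows. It reuses the regex_matches and msd_matches sketches from earlier and adds an msd_bound helper implementing the word-length heuristic; all names are illustrative:

def msd_bound(length, retry=False):
    # The word-length heuristic above; on a failed search, words under
    # nine letters get the bound raised by one and the search repeats.
    if length <= 4:
        bound = 1            # 1, 2, 3, or 4 characters
    elif length <= 6:
        bound = 2            # 5 or 6 characters
    elif length <= 8:
        bound = 3            # 7 or 8 characters
    else:
        return length // 2   # FLOOR(wordLength / 2)
    return bound + 1 if retry else bound

def correct_word(entered, dictionary):
    # Union of the two searches; duplicates disappear in the set, and
    # sorting by corpus frequency puts the most probable word first.
    candidates = set(regex_matches(entered, dictionary))
    candidates |= set(msd_matches(entered, dictionary,
                                  msd_bound(len(entered))))
    if not candidates:       # no match: retry once with the relaxed bound
        candidates = set(msd_matches(entered, dictionary,
                                     msd_bound(len(entered), retry=True)))
    return sorted(candidates, key=dictionary.get, reverse=True)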
Recall that the dictionary in use also contains the frequency of each word. Based on this frequency, the results are sorted so that the word with the highest frequency (and thus the highest probability) appears at the top of the list. Duplicates are also eliminated. Table 5-1 presents a sample of words and the suggestions found by the correction algorithm, sorted by frequency.
Search Key [word]     Matches Found
hel.o [hello]         hello, helen, helps
compu..r [computer]   computer, composer
begauze [because]     because
ap.lg [apple]         apply, apple
.uitas [guitar]       guitar, quotas
siz..rs [sisters]     sisters, singers
poeans [oceans]       poland, romans, oceans
chs..er [chapter]     chapter, chamber, charter, cheaper, chester
Table 5-1. Sample list of search words and results.

5.2.3 Tying it Together with an Auditory Display
The modified interaction and dictionary-supported assistance provide the backbone of the word-level interaction described. These modules need to be connected to the interaction in a seamless manner. Text entry proceeds character by character. Each stroke produces a non-speech ‘click’ sound confirming that the character was registered. The character may be recognized, unrecognized, or misrecognized. Regardless of the outcome, the feedback remains the same and the user is allowed to proceed. Upon completion of a word, the user taps twice on the surface. This results in a SPACE character. Before the word is appended to the message, the correction algorithm validates the word. If the word exists in the dictionary as entered by the user, it is accepted. This is supported by speech feedback, where the system speaks the word entered. If the word is not found in the dictionary, then it may contain one or more errors in the form of
unrecognized strokes or misrecognized strokes. Misrecognized strokes can occur due to a spelling error or an invalid stroke. The recognizer is susceptible to such errors if the orientation of the device changes. The input word is provided to the correction algorithm, which attempts to rectify the error(s). After matching, searching, and consolidating the results, the system speaks the word aloud and appends it to the message if there is only one word in the result set. In the event that there are multiple words, the device enters a “PLAYBACK MODE”, providing an auditory display. During this mode, the words in the result set are spoken aloud one after the other. Stroke recognition for the Graffiti alphabet is suspended. Only three strokes are supported during playback, summarized below:

North stroke (upward): restart playback
Delete stroke (left swipe): clear the word and attempt re-entry
Single tap (touch): accept the last played word
Each word is followed by a 600 millisecond period of silence to give the user sufficient time to decide whether to accept or reject the last spoken word. Word playback is cyclical. The start of each cycle is signalled with a two-tone bell. While in this mode, users can restart playback by drawing a north stroke, discard the word with a delete stroke, or select the word with a single tap. Word selection is confirmed by replaying the selected word.
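A minimal sketch of one possible control loop for this mode is shown below. The speak(), bell(), and poll_gesture() callables stand in for the real audio and touch APIs, which are not detailed in the text; only the three strokes, the 600 ms pause, the cyclical playback, and the confirmation replay come from the description above:

import time

PAUSE = 0.6  # the 600 ms of silence after each spoken word

def playback_mode(candidates, speak, bell, poll_gesture):
    # poll_gesture() is assumed to return 'north', 'delete', 'tap',
    # or None if no stroke arrived during the pause.
    while True:
        bell()                       # two-tone bell marks each cycle
        for word in candidates:
            speak(word)
            time.sleep(PAUSE)        # time to accept or reject the word
            gesture = poll_gesture()
            if gesture == 'north':   # upward stroke: restart playback
                break
            if gesture == 'delete':  # left swipe: discard, re-enter word
                return None
            if gesture == 'tap':     # single tap: accept this word
                speak(word)          # confirmation replay
                return word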
Given the error correction algorithm and interaction improvements, an experiment was carried out that tested eyes-free text entry interaction with three different feedback modes. These conditions are described next, followed by the methodology and results of the experiment.
5.3 Feedback Modes
To test the enhancements and the correction algorithm, three different feedback modes were devised. The first is called Immediate. In this mode, users receive speech feedback for each character of transcribed text. This is the same as the character-level feedback in the previous experiment, with the addition of word-level audio when a SPACE is inserted. This mode serves as a baseline for the other two conditions.

The second feedback mode is called OneLetter. In this mode, users are required to enter a valid first stroke; the system will not accept a word whose first stroke is unrecognized. For example, the word ‘hello’ requires the letter ‘h’ as the first character. OneLetter does not force the user to enter an ‘h’; however, it does require that a valid stroke is entered, so any letter of the alphabet is acceptable. If a stroke is input that is not recognized, the system prevents the user from proceeding and signals this with a pulse of vibration. Once the first letter has been input, entry of the remaining letters proceeds as described, with only a ‘click’ sound for each character irrespective of the outcome of recognition (unrecognized or misrecognized), until a SPACE is inserted. The motivation behind this mode is to narrow the search space and improve the probability of finding the correct word. A sample entry of the word ‘hello’ without errors would be supported by speech and non-speech feedback as follows: ‘h----’. A space insertion would then check that ‘hello’ exists in the dictionary. Since it does, the system would speak the word: ‘hello’.

Finally, the last mode is called Delayed. In this mode, there are no restrictions on the user. The user begins text entry for a word and proceeds all the way until a SPACE is inserted. At this point, the correction algorithm tests the entry and provides potential matches in the event of errors or collisions. Feedback for the word ‘hello’ would be the same as in OneLetter, except the first ‘h’ would also be replaced by a ‘click’ to give: ‘-----’. The SPACE character terminating the word would then cause the system to speak the word ‘hello’.
5.4 Methodology
5.4.1 Participants
Twelve paid volunteer participants (10 male, 2 female) were recruited from the local university campus. Participants ranged from 18 to 40 years (mean = 26.6, SD = 6.8). All were daily users of computers, reporting 2 to 12 hours of usage per day (mean = 6.7, SD = 2.7). Six of the participants used a touchscreen phone regularly (“several times a week” or “everyday”). Nine participants were right-handed. Participants had no prior experience with the system. Eight participants had tried Graffiti before, but none was an experienced user.
5.4.2 Apparatus
The apparatus was the same as in the previous chapter, with the enhancements described in Sections 5.2 and 5.3.
5.4.3 Procedure
The experiment was performed in a quiet room. A standardized height-adjustable study chair was used. Participants were asked to adjust the height as low as needed to comfortably place the device and their hands under the table. Prior to data collection, participants completed a pre-test questionnaire soliciting demographic data. The experiment began with training similar to that described in Section 4.7.3. The goal was to bring participants up to speed with the Graffiti alphabet. The software was demonstrated and questions were answered before the start of testing. Training was followed by four blocks of entry in each of three test conditions: Immediate, OneLetter, and Delayed, all eyes-free. As in the previous experiment, error correction was restricted to the last character only, to reduce variability across participants. Participants were asked to enter text “as quickly and accurately as possible”. Each participant took between 50 minutes and 1 hour to complete all three conditions. The interaction was two-handed, requiring participants to hold the device in their non-dominant hand and perform text entry with their dominant hand. At all times, the device was occluded from view, except when participants took a break.
In this experiment, the data analysis toolkit was expanded. Data were collected at various levels of detail: raw, stroke level, character level, word level, phrase level, and block level. Timing for each phrase began when the finger first touched the surface to draw the first stroke and ended with the insertion of a SPACE character following the last word.
5.4.4 Design
The experiment was a 3 x 3 within-subjects design. The two independent variables were feedback mode (Immediate, OneLetter, Delayed) and block (1, 2, 3). All three conditions were carried out eyes-free and were counterbalanced. Participants were allowed to take short breaks between phrases or between conditions, if needed. Blocks were administered one after the other with optional breaks. Aside from training, the total amount of entry was 12 participants x 3 entry modes x 3 blocks x 4 phrases/block = 432 phrases. Phrases were drawn from MacKenzie and Soukoreff’s phrase set (MacKenzie and Soukoreff, 2003).
5.5 Results and Discussion
A range of dependent variables was measured over the course of the experiment. The basic metrics for speed and accuracy are presented first. This is followed by investigations of the utility of the correction algorithm and a closer look at the quality of the generated word suggestions.
5.5.1 Entry Speed
The results for entry speed are shown in Figure 5-2. As expected, entry speed increased significantly across blocks with practice (F2,18 = 6.2, p < .05). There was also a significant difference in speed by entry mode (F2,18 = 32.3, p < .0001).
Figure 5-2. Entry speed (wpm) by entry mode and block.
The average entry speed for the Immediate mode was 8.34 wpm, which is 19% higher than the 7.00 wpm novice speed reported by Fleetwood et al. (2002). Average entry speed for the OneLetter mode was 27% faster, at 10.62 wpm. The highest average entry speed was obtained in the Delayed mode at 11.05 wpm, which is 32.5% faster than Immediate and 4% faster than OneLetter. The effect of group on entry speed was not statistically significant (F2,9 = .131, ns), indicating that counterbalancing worked. Post
hoc analysis using the Scheffé test revealed that the differences were significant for the Immediate-OneLetter and Immediate-Delayed mode pairings (p < .0001). The maximum entry speeds also shed light on how well the interaction can perform. The OneLetter mode obtained the highest rate at 21.5 wpm, followed closely by the Delayed mode at 20.76 wpm. The Immediate mode was further behind at 16.92 wpm. This result alone attests to the strong gains provided by the new enhancements. The entry rates in Figure 5-2 reveal some interesting findings. Unlike the entry rate for the Immediate mode, the OneLetter and Delayed modes do not show a smoothly increasing rate. It is difficult to assess why from the graph alone. At first, it seemed that the variation in suggestions provided by the algorithm caused the somewhat sporadic rates. Note that Figure 5-2 also provides two other text entry rates: Adjusted OneLetter and Adjusted Delayed. These rates are produced by removing all time spent in playback mode. In other words, the adjusted rates pretend that there was always a single word matching the participant’s input, so no time was invested in dealing with errors or collisions. If the algorithm were responsible for the unexpected entry rate curves, the adjusted rates should have been smoother, increasing curves. But, as the graph shows, they are just amplified versions of the actual rates.
After looking more closely at the data, it was found that these rates were a result of the counterbalancing. The conditions were split into three different orders, so there were three groups of participants. The order of conditions presented to each group was as follows:

Group 1: Immediate, OneLetter, Delayed
Group 2: OneLetter, Delayed, Immediate
Group 3: Delayed, Immediate, OneLetter
There was one critical aspect of the Immediate condition: it provided character-level feedback plus word-level feedback. Unrecognized strokes were supported with vibrotactile feedback, and misrecognized strokes were instantly identified. As a result, the experience of entering text in the Immediate condition served as a training ground for the other two conditions. Based on the division of the groups, 8 participants entered text in the Delayed mode before entering text in the Immediate mode (Groups 2 and 3). In other words, these two groups were less experienced with Graffiti than Group 1. This lack of experience resulted in higher error rates. To better illustrate this effect, Figure 5-3 presents the progression of the observed entry rates from one block to the next for each entry mode, divided by group.
Figure 5-3. Interaction line plot for entry speed (wpm). Effect: Mode x Block x Group.
The blue line (Group 1) shows the most predictable and expected growth in each mode from one block to the next: each block is better than the previous one. Further, the rise from Immediate to OneLetter shows that participants were getting better. The same effect is not found in the red line (Group 2) or the green line (Group 3). Although overall rates increased as expected, within-block fluctuations are not as consistent. This variation reflects the lack of experience with the alphabet, with lower text entry rates from one block to the next.
5.5.2 Accuracy
This section discusses the accuracy of text entry based on error rates computed using minimum string distance (MSD) values. During entry, the text transforms as follows. First, a text stimulus is presented to the participant. This is the presented text. Next, the
participant transcribes the text by performing text entry. This is the entered text. Finally, the correction algorithm performs error correction on the entered text on a word-by-word basis. The string that results from this correction becomes the accepted text. Corrected error rate is computed using the following formula:

Corrected Error Rate = Corrected MSD / Length of Phrase
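With the terms as defined in the next paragraph, a small sketch of this computation might look as follows. A full MSD routine with all three primitives is used here, since presented and accepted text may differ in length; taking “Length of Phrase” to mean the length of the presented phrase is an assumption, and the names are illustrative:

def msd(a, b):
    # Standard minimum string distance (insertions, deletions, and
    # substitutions) computed with a single-row dynamic programme.
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, cb in enumerate(b, 1):
            prev, row[j] = row[j], min(row[j] + 1,         # delete ca
                                       row[j - 1] + 1,     # insert cb
                                       prev + (ca != cb))  # substitute
    return row[-1]

def corrected_error_rate(presented, accepted):
    # Corrected MSD over the phrase length, as in the formula above.
    return msd(presented, accepted) / len(presented)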
Corrected MSD is the minimum string distance between the presented text and the text accepted after error correction is carried out by the correction algorithm. Raw MSD is the minimum string distance between the presented text and the entered text. Note that Corrected MSD is the same as Raw MSD for the Immediate condition, as no error correction was performed in this mode. Figure 5-4 illustrates the error rates obtained for each mode by block. Overall, error rates were very low at 4.3% (accuracy > 95%). An analysis of variance showed that entry mode had a significant effect on error rate (F2,18 = 8.2, p < .005). Delayed had the highest mean error rate at 7.04%. This was twice as high as OneLetter, at 3.53%, and 2.8 times as high as Immediate, at 2.50%. The effect of group on error rate was not statistically significant (F2,9 = 1.17, ns), indicating that counterbalancing worked. Although a block effect was expected, none was found. This is partially because participants did not have a direct influence on the correction algorithm. The correction algorithm is designed to cater to individual differences, which prevented any block effect from being observed.
Post hoc analysis using the Scheffé test revealed that the differences were significant for the Immediate-Delayed and OneLetter-Delayed mode pairings (p < .0001). The variation in error rates for the Immediate-OneLetter pairing was not significant. This suggests that the correction algorithm was more successful at correcting errors when the first letter of a word was a valid stroke. This is best observed in the differences between OneLetter and Immediate in block 3 of the figure; they are marginal.
Figure 5-4. Corrected error rate (%) by entry mode and block.
To get a clearer picture of how frequently errors occurred, a potential error rate frequency can be calculated. The average length of a phrase in the experiment was 30 characters, so multiplying each mode’s error rate by 30 gives the expected number of errors per phrase. Table 5-2 summarizes the potential errors for each condition.
Mode        Potential
Immediate   1 error every 40 characters (or 0.74 errors per phrase)
OneLetter   1 error every 28 characters (or 1.06 errors per phrase)
Delayed     1 error every 14 characters (or 2.11 errors per phrase)
Overall     1 error every 23 characters (or 1.30 errors per phrase)
Table 5-2. Potential error rate frequency.
Table 5-3 presents a sample of the presented phrases along with the transcribed (entered) and accepted (i.e., corrected) text gleaned from the study. The variation between the transcribed and corrected text gives a sense of the strength of the correction algorithm.

Presented                        Entered                          Accepted                         Errors (entered / accepted)
elephants are afraid of mice     e.e.hancs are a.ratd .. m..e     elephants are afraid of mice     9 / 0
question that must be answered   ....tion tha. must be answered   question that must be answered   5 / 0
the fax machine is broken        th. fax machin. is brpken        the fax machine is broken        3 / 0
three two one zero blast off     three .w. .ne zer. b.ast of.     three two one zero blast off     6 / 0
fall is my favorite season       fal. is m. fau.ritg seas..       fall is my favorite season       7 / 0
do not walk too quickly          d. n.t wa.. too quic.lo          do not walk too quickly          6 / 0
stability of the nation          stadilit. .. the nati.n          stability of the nation          5 / 0
Table 5-3. Presented text compared with entered and corrected text.
5.5.3 KSPC Analysis
As in the previous experiment, an analysis of KSPC was carried out to identify the effect of error correction strokes on the overall effort of text entry. If entry were perfect, the number of strokes would equal the number of characters, i.e., KSPC = 1.0. (Note: the double-tap finger gesture for a SPACE was counted as one stroke.) Results of this analysis are shown in Figure 5-5. Since the chart uses a baseline of 1.0 and perfect input occurs with KSPC = 1.0, the entire magnitude of each bar represents the overhead due to unrecognized strokes or to errors that were corrected.
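The computation itself is a simple ratio; a tiny sketch, with an illustrative function name and a hypothetical stroke count:

def kspc(stroke_count, transcribed):
    # Every finger stroke counts, including re-entries for unrecognized
    # input; the double-tap for SPACE is logged as a single stroke.
    return stroke_count / len(transcribed)

# e.g., 38 strokes for a 30-character phrase -> kspc == 1.27 (rounded)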
Figure 5-5. Keystrokes per character (KSPC) by block and entry mode.
Overall, the average KSPC was 1.27. KSPC for the Immediate mode was highest at 1.45. OneLetter was 16.4% lower at 1.21, while Delayed was 19.2% lower than Immediate, at 1.17. The trend was consistent and significant between entry modes (F2,18 = 51.8, p < .0001), but not within blocks. Post hoc analysis using the Scheffé test revealed that differences were significant for the Immediate-OneLetter and Immediate-Delayed mode pairings (p < .0001). The variation in KSPC for the OneLetter-Delayed pairing was not significant. This is expected, as these two modes are very similar and vary only in the requirement of one valid stroke per word. Also, KSPC remained the same or decreased from one block to the next in the Immediate and OneLetter modes. For
the Delayed mode, KSPC increased, albeit slightly, from the first block to the second, and remained at that level in block three. The OneLetter and Delayed modes were designed to decrease the effort required on the user’s part. This is clearly reflected in the results. One interesting observation is that the average KSPC found for the eyes-free mode in the previous experiment was 1.36. This is about 7% less than the value found in the Immediate mode of this experiment. (Note that the Immediate mode was very similar to the eyes-free mode of the previous experiment; the only difference was the addition of word-level feedback in this experiment.) There are two reasons for this. First, counterbalancing the conditions resulted in 8 out of 12 participants being tested first on a mode other than Immediate. As a result, the impact of learning was reduced, since OneLetter and Delayed do not provide full character-level feedback (OneLetter provides character-level feedback for the first letter only, which does not contribute as much to learning). Second, the previous experiment did not counterbalance its two conditions. In Chapter 4, experience with Graffiti in the eyes-on mode was a prerequisite for testing in the eyes-free mode. As a result, the learning from the eyes-on mode carried forward to the eyes-free mode, producing a lower KSPC value than the one obtained here.
5.5.4 System Help
The idea of the correction algorithm was to enhance the text entry experience by providing a robust mechanism for dealing with text entry errors. In other words, the burden of correcting invalid strokes or re-entering unrecognized strokes was shifted from the user to
the system. Transcribed text took two forms: entered text, and accepted text. The difference between these two forms of text is the result of the correction algorithm (see Figure 5-6).
Figure 5-6. The life-cycle of transcribed text.
Accepted text is the final result of text entry. System help is the percentage of text that was entered incorrectly but successfully corrected by the correction algorithm; that is, the percentage of the phrase that was formed with unrecognized or misrecognized characters but was eventually corrected by the algorithm. The results are depicted in Figure 5-7.
Figure 5-7. System help (%) by entry mode and block.
Note that the Immediate mode is not shown, as it did not support correction. As shown, the correction algorithm played a major role in rectifying participants’ text entry. The average of the OneLetter and Delayed modes is 14.90%, suggesting that 14.90% of the raw text was corrected by the correction algorithm. For a 30 character phrase, this is equivalent to 4.5 characters, or about one word of text. In the Delayed mode, 15.56% of text was corrected; the value was 14.23% for the OneLetter condition. The amount of raw error is not small. This is expected, as the lack of character-level audio feedback in the OneLetter and Delayed modes makes it impossible for participants to verify input. Once again, a side effect of counterbalancing is reduced learning of the Graffiti alphabet. Although an ANOVA across all three entry modes would show statistical significance, the result is not very meaningful since the Immediate mode is always at
0%. However, post hoc analysis using Scheffé tests revealed no significant difference in system help for the OneLetter-Delayed pairing either. Figure 5-8 presents one final illustration of how errors were tackled. Raw error rate refers to the errors in the entered text. Corrected error rate refers to the errors left in the accepted text after the system has attempted to make corrections.
Figure 5-8. Raw and corrected error rates.
The Immediate mode had no system-specific error correction, so its raw and corrected error rates are equal. However, the stark difference in magnitude between the raw and corrected error rates for the other two modes signifies the degree to which errors were corrected by the algorithm. The key is to ensure that corrected error rates are low. From the graph, error correction worked best in the Delayed mode, with a net improvement of 12.51%. The net improvement in the OneLetter mode was 11.61%. The difference is minor, and the final corrected error rate is more significant than the net
improvement in error rate. According to this approach, error correction was most valuable in the OneLetter mode. This is likely due to the requirement of a valid first character. Getting one character correct dramatically narrows the search space and increases the likelihood of finding the intended word.
5.5.5 Word Level Analysis
The OneLetter and Delayed modes rely on the correction algorithm to fix user errors. As a result, multiple matches are often found for each word, forming a list. This discussion examines the size of these lists and the position of the intended word. Since the Immediate mode had no such correction mechanism, there are no list size or word position data associated with it.
Candidate List Size
Figure 5-9 presents the average candidate list size per word by block and mode. Overall, list size averaged 2.43 words. List size for OneLetter averaged 1.89 words. Not requiring a valid stroke for the first character (the Delayed mode) resulted in a 56% larger list size, at 2.96 words per list. Similar to the system help metric, results for list size are significant by entry mode (F2,18 = 30.5, p < .0001), but this is largely due to the fact that list size for the Immediate mode is always zero. The comparison of interest is whether list size differed significantly between the OneLetter and Delayed modes. Post-hoc analysis using the Scheffé test revealed that this pairing was indeed significant (p < .05).
Figure 5-9. Mean list size per word by block.
OneLetter fluctuated by small amounts, decreasing in value across blocks. The implication of a shorter list is that a higher percentage of characters in each word were entered correctly, so a slight learning effect is visible. The audio feedback for the first character reinforced the alphabet and allowed users to learn the nuances when a stroke was unrecognized. The same trend is not visible in the Delayed mode. Most likely, this is due to the lack of any character-level feedback, which inhibited any meaningful learning.
Word Position
A second word-level metric of interest is word position. For the candidate lists that are generated, word position provides a sense of how well the results are organized; see Figure 5-10.
Figure 5-10. Word position in candidate list by block and mode. The candidate list size for the Immediate mode was always one. Thus, the magnitude of the bars in Figure 5-10 represents the overhead induced by OneLetter and Delayed.
Entry mode had a significant effect on word position (F2,18 = 17.2, p < .0001). Post-hoc analysis using a Scheffé test revealed that the OneLetter-Delayed pairing was not significant (p > .05). Furthermore, word position was not significant across blocks. Overall, word position averaged 1.10. Although the average word position provides an idea of where the target word was positioned in the list, it is not a very powerful metric as an ordinary count. Perhaps a better approach is to ask: how frequently was the intended word in position 1 (or 2, or 3, and so on)? Since the overall average word position is just 1.10, most words obviously occurred close to the top of the list. For this reason,
Figure 5-11 deconstructs the word position frequencies for positions 1 through 3, and accumulates the frequencies for words at position 4 or greater into one data element.
Figure 5-11. Word position frequency by entry mode.
The results in Figure 5-11 show that about 77% of the words in OneLetter were at position 1, while this figure for the Delayed mode sits at 82%. The relationship reverses for words at position 2: for OneLetter, 19% of the words were at position 2, compared to 11% for the Delayed mode. Perhaps the most important finding is that 94.2% of the intended words were in either position 1 or position 2 of the candidate list (95.3% for OneLetter and 93.2% for Delayed). Only 2.7% of all words were at position 3, and 3.1% were at position 4 or higher. This is strong evidence that the regular expression matching and MSD searching components of the correction algorithm work well to provide a resilient mechanism for handling unrecognized and misrecognized strokes.
5.6 Summary
This chapter presented an enhanced version of the eyes-free text entry interface first outlined in Chapter 4. The interaction was redesigned by modifying the audio feedback delivery. Feedback was shifted from the character-level to the word-level, providing speech output at the end of each word. This shift is particularly problematic for text entry in an eyes-free context, as it takes away a user’s ability to tell what stroke was entered last. As a result, the potential for errors increases. An alternative mechanism was devised to deal with this issue. At the end of each word, there is a chance that some of the transcribed strokes were unrecognized or misrecognized. This can be due to weaknesses of the recognizer or faults in user input. The transcribed text (with the errors) is passed through a dictionary-based error correction algorithm. The primary purpose of this algorithm is to use regular expression matching and a heuristically determined minimum string distance search to generate a candidate list of possible words based on the transcribed word. The generated candidate list contains words sorted by their frequency in the English language. The candidate list was supported with a method to navigate the suggestions by entering a “PLAYBACK MODE” when there were multiple results. Since the task is carried out eyes-free, the display is auditory. It provides basic gesture-based functionality such as restarting playback of the suggestions, cancelling playback (and thus re-entering the word), and selecting the last spoken word.
To test this new interaction, three text entry modes were designed for a comparative analysis. The first was called Immediate. In this mode, text entry progressed with character-level feedback plus word-level feedback. The correction algorithm was not employed, so a “PLAYBACK MODE” was never engaged. The second mode was called OneLetter. In this mode, users were required to enter a valid stroke for the first character of the word. There were two reasons for this approach. First, the character-level feedback for the first letter of each word promotes learning and allows users to improve as they progress. Second, getting the first stroke correct gives the correction algorithm an advantage by narrowing the search space. Finally, the third mode was called Delayed. In this mode, no character-level feedback is provided. There is a higher degree of reliance on the correction algorithm, as no reinforced learning occurs. Feedback is at the word-level. A 3 x 3 within-subjects design was used to conduct the experiment. The test conditions were counterbalanced across participants. The overall text entry speed achieved was 10.00 wpm. The Delayed mode had the highest mean entry speed (11.72 wpm), followed by OneLetter (11.02 wpm) and, finally, Immediate (8.34 wpm). Error rates were also measured. These were relatively low for the Immediate and OneLetter conditions at 2.46% and 3.53%, respectively. Delayed was higher at 7.04%. Finally, an analysis of “system help”, i.e., the percentage of text corrected by the system, and a word-level analysis of candidate list size and target word position were also carried out. The trend of results suggests that the OneLetter mode provides a
reasonable balance between speed and accuracy. However, the Delayed mode can allow users to reach their maximum potential. Ideally, the OneLetter mode can be engaged for initial training and use until users have reached sufficient skill. Once proficient, the Delayed mode can be used instead.
Chapter 6
Conclusion
This thesis explored eyes-free text entry interactions for mobile devices. The work was divided into two branches: physical button devices and touch sensitive devices.
6.1 Physical Button Devices
Chapter 3 studied eyes-free interactions for mobile devices with physical buttons. This investigation was driven by the goal of incorporating appropriate feedback mechanisms while using existing hardware to enable eyes-free use. In this work, potential interaction techniques were coupled with language models to carry out a priori analyses. The resulting technique, called LetterScroll, focuses on text entry for visually impaired users. As a result of the a priori analyses, four variations of LetterScroll were implemented. In all four, the scroll wheel on the mouse was used for navigating the character set. Three of the methods required additional keys on the keyboard. The keystrokes per character (KSPC) metric was used to determine the effort needed in each variation of LetterScroll. Out of the four versions, two were selected for formal evaluation (Method #1, Method #4) with twelve blindfolded participants. Method #1 traversed the characters using the mouse wheel and buttons with no additional keys. In Method #4, cursor navigation was supported by jumping to vowels
using keys on the keyboard, in addition to the mouse. Traversing each character resulted in speech feedback. Character selection and word advancement were supported by similar feedback. Overall, the observed KSPC was 40.3% higher than the modeled KSPC. The ability to jump was extensively utilized by all participants, as it required less physical effort. However, this did not immediately translate into increased throughput, as participants spent time deciding what vowel to jump to, or whether to scroll the mouse wheel forward or backward. Although overall text entry rates were not high, at 2.8 wpm for Method #1 and 4.3 wpm for Method #4, they are within the range of entry rates usually observed for persons with visual impairments. The lower KSPC of Method #4 in the pre-experiment models suggested a larger improvement in entry rate between the two conditions, but this did not manifest itself in the experiment. This is likely due to the lack of cognitive parameters, such as attention demands, in the model.
6.1.1 Future Work
Visual feedback can take many different forms, which increases the richness of the medium. The use of colour, highlighting, flashing, shaking, and other animated screen stimuli allows large amounts of data to be communicated through short and quick feedback mechanisms. Appreciating the power of such feedback is paramount to understanding the limitations of auditory displays. Primarily, audio feedback takes the form of speech or non-speech sound. Earcons, sounds of varying pitch and frequency, offer one way to
provide the feedback needed. However, auditory feedback is limited by the temporal nature of how humans process sound: if the rate of speech is too high, comprehension degrades. One approach is to build enhancements into the existing system (LetterScroll) so that the reliance on speech is decreased. This also decreases the cognitive load imposed on the user. A language model derived from a corpus of the target task can be used to add word completion. For instance, if the system will be used for SMS messaging, a language model can be built from a corpus of text messages to determine the most likely word to follow a stem of text. A highly frequent phrase may be ‘will call you later’. So, when the user enters the word ‘will’, the system can provide a suggestion for the word ‘call’, if the user invokes it; a small sketch of this idea appears at the end of this section. Also, a recurring task is the need to check the last word or last character entered. This allows the user to re-establish her position in the phrase. Another area of improvement would allow users to customize and expand the jump-able characters to their liking. If this technique were implemented on a mobile phone, it could provide a hybrid form of text entry with some or all buttons of the keypad assigned to individual letters, supported by a scroll wheel or joystick to traverse the alphabet. Such a design would balance scrolling and jumping, potentially decreasing the cognitive load on the user and increasing throughput, since many of the characters could be accessed by remembering their positions on the device.
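As promised above, here is a minimal sketch of the word completion idea, using a simple bigram count over a task-specific corpus. The function names are illustrative and the corpus is a toy example, not an implemented component of LetterScroll:

from collections import Counter, defaultdict

def build_bigram_model(corpus_words):
    # Count which word follows which in a task-specific corpus
    # (e.g., a collection of SMS messages).
    following = defaultdict(Counter)
    for prev, nxt in zip(corpus_words, corpus_words[1:]):
        following[prev][nxt] += 1
    return following

def suggest_next(model, word, n=3):
    # The n most likely continuations of the entered word, best first.
    return [w for w, _ in model[word].most_common(n)]

# model = build_bigram_model('will call you later i will call you'.split())
# suggest_next(model, 'will')  ->  ['call']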
6.2 Touch Sensitive Devices
The second branch explored eyes-free interaction in the domain of touch sensitive devices. The user studies presented in Chapters 4 and 5 show that eyes-free text entry is possible and can reach throughput levels that equal or exceed eyes-on use. In the first attempt at this problem, Graffiti was used to provide a stroke-based input mechanism for text. Performed with a finger, this is akin to touch gestures. These gestures were supported by vibrotactile feedback and by auditory speech and non-speech feedback. Each recognized stroke resulted in speech announcing the character that was input. Unrecognized strokes caused a pulse of vibration through the device, giving users additional opportunities to re-enter text. A key feature of such a design is the support for eyes-free text entry. An evaluation comparing eyes-on and eyes-free text entry found that eyes-free entry rates were 8% higher than eyes-on rates, at 7.60 wpm. These rates are in the neighbourhood of those for walk-up Graffiti use. This was partially due to practice, but the expectation that device occlusion would inhibit throughput proved invalid. Participants did as well or better when entering text in the eyes-free mode, suggesting that the feedback mechanisms provided were sufficient. Although eyes-free entry had an error rate 1.5 times greater than eyes-on, this was a non-significant increase of only 0.17%. The work presented in Chapter 4 shed light on the pros and cons of the interaction, providing insights that formed the basis of the enhanced version presented in Chapter 5. For instance, speech feedback was adequately used for text input.
However, character-level speech feedback seemed to be overwhelming. Similarly, participants initially identified vibrotactile feedback as a great feature during training, but as a frustrating consequence of unrecognized strokes in their post-experiment comments. Clearly, this is due to a lack of experience with the Graffiti alphabet. One approach to solving this problem is to use a language model and a dictionary to deal with user errors. Armed with these observations and ideas, an enhanced version of the interaction was conceived. The first and most notable modification was the shift of speech feedback from the character-level to the word-level. This modification decreases the time for each character, as users need not confirm each character entered. However, it also introduced new problems in dealing with unrecognized and misrecognized characters. To fill this gap, a correction algorithm was applied at the end of each word to correct any user errors. This algorithm used regular expression matching and minimum string distance searching against a 9,000 word dictionary to find likely matches. If a single match was found, the system accepted the word and spoke it aloud. In the event of multiple matches, a “PLAYBACK MODE” was engaged, wherein users were presented with an auditory display of suggestions. The auditory display provided three basic functions: repeat playback from the beginning, cancel entry for the word, and accept the last played word. To test the enhanced version, a repeated-measures user study was carried out. There were three conditions: Immediate, OneLetter, and Delayed. The Immediate
condition formed a baseline that was very similar to the interaction presented in Chapter 4. The OneLetter condition provided character-level speech feedback for the first character of each word and required that this first character was a valid letter. The intent was to improve search results in the event of collisions and to provide some amount of Graffiti learning. Finally, the Delayed condition provided only word-level feedback. The overall text entry speed attained was 10.00 wpm, which was 34% higher than the rate obtained in the previous study. Entry rate and error rate results were significant across conditions, with the baseline (Immediate) condition attaining 8.34 wpm and an error rate of 2.46%. The Delayed mode garnered the highest entry rate, at 11.72 wpm. However, it also had the highest error rate, at 7.04%. The OneLetter mode provided a reasonable balance, with an entry rate of 11.0 wpm and an error rate of 3.53%. The data analysis toolkit used in this study had additional features allowing a variety of other metrics to be observed. One such metric of interest was “system help”, which indicated the extent to which the system was responsible for correcting errors in user transcribed text. Overall, the system corrected 14.9% of user transcribed text. This metric was lower for the OneLetter mode, at 14.2%, compared to 15.6% for the Delayed mode. Finally, a word-level analysis of candidate list size and word position was performed. It was found that list size for the Delayed mode was 56% higher than for the OneLetter mode, at 2.96 words per list. The mean list size for the OneLetter mode was 1.96 words per list. However, the most interesting finding concerned word position in the
list. Overall, the target word was at position 1.1 in the list. A breakdown by position showed that in 94.2% of cases, the intended word was at either position 1 or position 2. In fact, 80% of the time, the intended word was the first word in the list, pointing to the effectiveness of the correction algorithm. The results of this study are highly promising, suggesting that a viable eyes-free text entry solution can be built for touch sensitive devices. Further, the Immediate and OneLetter modes can be used effectively as training agents. Once users become aware of the nuances of Graffiti and have sufficient experience with the alphabet, the Delayed mode can be engaged to reach maximum throughput.
6.3 Comparisons with Other Research
For each of the two interaction paradigms, a unique eyes-free approach was presented. As a closing note, it is useful to view the results of this work with respect to related text entry techniques on mobile devices. The most common of these is multitap. Arif and Stuerzlinger (2009) present a survey outlining the results of text entry experiments that investigate various interaction methods. In their survey, multitap averages about 9.9 wpm. However, this figure is derived from all types of user populations: novice, average, and expert. Text entry rates for novices alone using multitap are about 7.5 wpm. LetterScroll averaged 4.8 wpm for Method #4, which is more than half the throughput of multitap. This is not bad considering that text entry with LetterScroll is carried out eyes-free and users are partially limited by the speed of the audio feedback.
In a similar vein, comparing the text entry rates observed in the second and third experiments with other research allows this work to be viewed through a broader lens. In the same survey, Arif and Stuerzlinger also report an average text entry rate for stylus-based input of 11.6 wpm. The final experiment of this thesis found entry speeds between 11 and 12 wpm for the interactions supporting error correction, reflecting the fact that the alternative modalities and error correction used in the third experiment were sufficient to allow users to reach comparable throughput levels, despite performing the task eyes-free.
Bibliography Amar, R., Dow, S., Gordon, R., Hamid, M. R. and Sellers, C. (2003). Mobile advice: An accessible device for visually impaired capability enhancement. Extended Abstracts of the ACM Conference on Human Factors in Computing Systems – CHI 2003, pp. 918-919. New York: ACM. Arato, A., Juhasz, Z. and Blenkhorn, P. (2004). Java powered braille slate talker. Proceedings of the Ninth International Conference on Computers Helping People with Special Needs – ICCHP 2004, pp. 506-513. Berlin: Springer. Arif, A. S. and Stuerzlinger, W. (2009). Analysis of text entry performance metrics. Proceedings of the IEEE Toronto International Conference - Science and Technology for Humanity – TIC-STH 2009, pp. 100-105. New York: IEEE. Bharath, A. and Madhvanath, S. (2008). Freepad: A novel handwriting-based text input for pen and touch interfaces. Proceedings of the ACM Conference on Intelligent User Interfaces – IUI 2008, pp. 297-300. New York: ACM. Blenkhorn, P. and Evans, G. (2004). Six-in-braille input from a qwerty keyboard. Proceedings of the Ninth International Conference on Computers Helping People with Special Needs – ICCHP 2004, p. 624. Berlin: Springer. BNC
(2009). British National Corpus of the English language. BNC (ftp://ftp.itri.bton.ac.uk/).
Castellucci, S. J. and MacKenzie, I. S. (2008). Graffiti vs. Unistrokes: An empirical comparison. Proceeding of the ACM Conference on Human Factors in Computing Systems – CHI 2008, pp. 305-308. New York: ACM. Fleetwood, M. D., Byrne, M. D., Centgraf, P., Dudziak, K. Q., Lin, B. and Mogilev, D. (2002). An evaluation of text-entry in palm os - graffiti and the virtual keyboard. Proceedings of the 46th Annual Meeting of the Human Factors and Ergonomics Society – HFES 2002, pp. 617-621. Santa Monica, CA: Human Factors and Ergonomics Society. Goldberg, D. and Richardson, C. (1993). Touch-typing with a stylus. Proceedings of the ACM Conference on Human Factors in Computing Systems - CHI 1993, pp. 80-87. New York: ACM. Gong, J. and Tarasewich, P. (2005). Alphabetically constrained keypad designs for text entry on mobile devices. Proceedings of the ACM Conference on Human Factors in Computing Systems – CHI 2005, pp. 211-220. New York: ACM. Guerreiro, T., Lagoa, P., Nicolau, H., Santana, P. and Jorge, J. (2008). Mobile text-entry models for people with disabilities. Proceedings of the 15th European Conference on Cognitive Ergonomics, pp. 1-4. New York: ACM. Guiard, Y. (1987). Asymmetric division of labor in human skilled bimanual action: The kinematic chain as a model, Journal of Motor Behavior, 19, 486-517.
Hwang, S. and Lee, G. (2005). Qwerty-like 3x4 keypad layouts for mobile phone. Extended Abstracts of the ACM Conference on Human Factors in Computing Systems – CHI 2005, pp. 1479-1482. New York: ACM.

Kane, S. K., Bigham, J. P. and Wobbrock, J. O. (2008). Slide rule: Making mobile touch screens accessible to blind people using multi-touch interaction techniques. Proceedings of the ACM Conference on Computers and Accessibility – ASSETS 2008, pp. 73-80. New York: ACM.

Li, K. A., Baudisch, P. and Hinckley, K. (2008). Blindsight: Eyes-free access to mobile phones. Proceedings of the ACM Conference on Human Factors in Computing Systems – CHI 2008, pp. 1389-1398. New York: ACM.

Lyons, K., Starner, T., Plaisted, D., Fusia, J., Lyons, A., Drew, A. and Looney, E. W. (2004). Twiddler typing: One-handed chording text entry for mobile phones. Proceedings of the ACM Conference on Human Factors in Computing Systems – CHI 2004, pp. 671-678. New York: ACM.

MacKenzie, I. S. (2002a). KSPC (keystrokes per character) as a characteristic of text entry techniques. Proceedings of the Fourth International Symposium on Human-Computer Interaction with Mobile Devices, pp. 195-210. Heidelberg, Germany: Springer-Verlag.

MacKenzie, I. S. (2002b). Mobile text entry using three keys. Proceedings of the Second Nordic Conference on Human-Computer Interaction – NordiCHI 2002, pp. 27-34. New York: ACM.
MacKenzie, I. S., Chen, J. and Oniszczak, A. (2006). UniPad: Single-stroke text entry with language-based acceleration. Proceedings of the Fourth Nordic Conference on Human-Computer Interaction – NordiCHI 2006, pp. 78-85. New York: ACM.

MacKenzie, I. S. and Soukoreff, R. W. (2003). Phrase sets for evaluating text entry techniques. Extended Abstracts of the ACM Conference on Human Factors in Computing Systems – CHI 2003, pp. 754-755. New York: ACM.

MacKenzie, I. S. and Tanaka-Ishii, K. (2007). Text entry using a small number of buttons. In MacKenzie, I. S. and Tanaka-Ishii, K. (Eds.), Text entry systems: Mobility, accessibility, universality, pp. 105-121. San Francisco: Morgan Kaufmann.

MacKenzie, I. S. and Zhang, S. X. (1997). The immediate usability of Graffiti. Proceedings of Graphics Interface 1997, pp. 129-137. Toronto: Canadian Information Processing Society.

McGookin, D., Brewster, S. and Jiang, W. (2008). Investigating touchscreen accessibility for people with visual impairments. Proceedings of the Fifth Nordic Conference on Human-Computer Interaction – NordiCHI 2008, pp. 298-307. New York: ACM.

Pavlovych, A. and Stuerzlinger, W. (2003). Less-Tap: A fast and easy-to-learn text input technique for phones. Proceedings of Graphics Interface 2003, pp. 97-104. Toronto: Canadian Information Processing Society.
Ryu, H. and Cruz, K. (2005). LetterEase: Improving text entry on a handheld device via letter reassignment. Proceedings of the Australian Conference on Computer-Human Interaction – OZCHI 2005, pp. 1-10. Narrabundah, Australia: Computer-Human Interaction Special Interest Group (CHISIG) of Australia.

Sánchez, J. and Aguayo, F. (2006). Mobile messenger for the blind. Proceedings of the Ninth ERCIM Workshop on User Interfaces for All, pp. 369-385. Heidelberg, Germany: Springer.

Silfverberg, M., MacKenzie, I. S. and Korhonen, P. (2000). Predicting text entry speed on mobile phones. Proceedings of the ACM Conference on Human Factors in Computing Systems – CHI 2000, pp. 9-16. New York: ACM.

Soukoreff, R. W. and MacKenzie, I. S. (2001). Measuring errors in text entry tasks: An application of the Levenshtein string distance statistic. Extended Abstracts of the ACM Conference on Human Factors in Computing Systems – CHI 2001, pp. 319-320. New York: ACM.

Tinwala, H. and MacKenzie, I. S. (2008). LetterScroll: Text entry using a wheel for visually impaired users. Extended Abstracts of the ACM Conference on Human Factors in Computing Systems – CHI 2008, pp. 3153-3158. New York: ACM.

Tinwala, H. and MacKenzie, I. S. (2009). Eyes-free text entry on a touchscreen phone. Proceedings of the IEEE Toronto International Conference – Science and Technology for Humanity – TIC-STH 2009, pp. 83-88. New York: IEEE.
Wigdor, D. and Balakrishnan, R. (2003). TiltText: Using tilt for text input to mobile phones. Proceedings of the ACM Symposium on User Interface Software and Technology – UIST 2003, pp. 81-90. New York: ACM.

Wobbrock, J. O., Chau, D. H. and Myers, B. A. (2007). An alternative to push, press, and tap-tap-tap: Gesturing on an isometric joystick for mobile phone text entry. Proceedings of the ACM Conference on Human Factors in Computing Systems – CHI 2007, pp. 667-676. New York: ACM.

Wobbrock, J. O., Myers, B. A., Aung, H. H. and LoPresti, E. F. (2004). Text entry from power wheelchairs: EdgeWrite for joysticks and touchpads. Proceedings of the ACM Conference on Computers and Accessibility – ASSETS 2004, pp. 110-117. New York: ACM.

Wobbrock, J. O., Myers, B. A. and Chau, D. H. (2006). In-stroke word completion. Proceedings of the ACM Symposium on User Interface Software and Technology – UIST 2006, pp. 333-336. New York: ACM.

Yatani, K. and Truong, K. N. (2007). An evaluation of stylus-based text entry methods on handheld devices in stationary and mobile settings. Proceedings of the ACM Conference on Human Computer Interaction with Mobile Devices and Services – MobileHCI 2007, pp. 487-494. New York: ACM.

Yfantidis, G. and Evreinov, G. (2006). Adaptive blind interaction technique for touchscreens. Universal Access in the Information Society (UAIS), 4, 328-337.