THE UNIVERSITY OF CALGARY CALGARY, ALBERTA JANUARY 2006
Abstract

This research is concerned with the design and development of technologies to support multimodal co-located collaboration, a largely unexplored area in Human-Computer Interaction. People naturally perform multimodal interactions in everyday real world settings as they collaborate over visual surfaces for both mundane and critical tasks. For example, ethnographic studies of military command and control, air traffic control, airplane flight decks and underground subway routing have shown that team members often use both hands, gestures and speech simultaneously in their communications and interactions. While a new generation of research technologies now supports co-located collaboration, these technologies do not yet directly leverage such rich multimodal interaction. Even though a rich behavioural foundation is emerging that informs the design of co-located collaborative technologies, most systems still typically limit input to a single finger or pointer. While a few now consider richer touch input such as gestural interaction, speech is ignored. This problem is partly caused by the fact that single point interaction is the only input easily accessible to researchers through existing toolkits and input APIs. Thus, researchers would have to “reinvent the wheel” to achieve rich gesture and speech interactions for multiple people. Finally, co-located collaborative systems designers do not have a corpus of systematic guidelines to inform the use of rich multimodal interaction over a large shared display. In this research, I will distil existing theories, models and ethnographic studies on co-present collaboration into behavioural foundations that describe the individual and group benefits of using gesture and speech multimodal input in a large display co-located setting. Next, I will develop a toolkit that will facilitate rapid prototyping of multimodal co-located applications over large digital displays. Finally, using these applications, I will conduct a number of studies exploring design in a multimodal setting. I will use the study results to validate or refute my design premises and to refine the guidelines for designers of future multimodal co-located systems. Anticipated contributions are: a distillation of behavioural foundations outlining the individual and group benefits of multimodal interaction in a co-located setting; an input toolkit allowing researchers to rapidly explore rich multimodal interactions in a co-located environment; and the development and evaluation of several multimodal co-located applications, built both atop commercial applications and from the ground up. These evaluations will be used to form and/or validate a set of design implications.
1 Research Proposal
This research is concerned with the design and development of information technologies to support multimodal co-located collaboration over large wall and table displays. By multimodal input, I mean interaction using rich hand gestures and speech. By co-located collaboration, I mean small groups of two to four people working together. By large display interaction, I mean display technologies designed to be viewed by multiple people (e.g., projectors, plasma displays).

Consider everyday life. Co-located collaborators often work on artefacts placed atop physical tabletops, such as maps containing rich geospatial information. Their work is very nuanced, where people use gestures and speech in subtle ways as they interact with artefacts on the table and communicate with one another. With the advent of large multi-touch surfaces, researchers are now applying knowledge of co-located tabletop interaction to create appropriate technical innovations in digital table design. My research focus is on advancing our understanding of multimodal co-located interaction, specifically the feasibility and the potential benefits and problems of multimodal co-located input. My motivation for this thesis can be summarized as follows:
1. Co-located collaborators can leverage the power of digital displays for saving and distributing annotations, for receiving real-time updates, and for exploring and updating large amounts of data in real time.
2. Multimodal interaction allows people to interact with a digital surface using the same hand gestures and speech utterances that they use in the physical environment.
3. An important side effect of multimodal interaction is that it provides improved awareness to people working together in a co-located environment.
To investigate this thesis, my research will first examine the work practices and findings reported in various ethnographic studies of safety critical environments, e.g., air traffic control, military command and control, underground subway traffic management and hospital emergency rooms. I will examine what types of speech and gesture interactions are used, how they are performed (e.g., simultaneously vs. sequentially), how conflicts are handled (e.g., turn taking protocols), and how these natural activities can be supported by technology. Ultimately, this research will involve investigating and bridging a range of
perspectives: human-computer interaction, human factors, social factors/psychology, cognitive psychology and technological applications.

My research addresses the following fundamental limitations now found in the co-located setting:
1. Traditional desktop computers are unsatisfying for highly collaborative situations involving multiple co-located people exploring and problem-solving over rich digital spatial information.
2. Even if a large high resolution display is available, one person’s standard window/icon/mouse interaction, optimized for small screens and individual performance, becomes awkward for others involved in the collaboration to see and comprehend.
3. Ethnographic studies illustrate how the ‘single user’ assumptions inherent in current large display input devices limit collaborators who are accustomed to using multiple fingers and two-handed gestures, often in concert with speech.
In this research proposal, I first set the scene by briefly summarizing existing research on multimodal and co-located collaboration through several ethnographic studies and technology implementations. Second, I describe the context of this research as it relates to the field of human-computer interaction. Third, I outline the specific research problems that I will investigate and a corresponding list of objectives that I will address.
I then conclude with a discussion of my progress so far, and the anticipated
significance of this research.
1.1 Terminology

This section clarifies terms and phrases used in this research proposal to avoid ambiguities.

Interaction: When I use the term interaction, I am referring to the actions people use to communicate and collaborate with each other in small groups of two to four people.

Co-Located Interaction: These are interactions that occur in the same enclosed space over a plurality of digital wall and table surfaces.

Multimodal Interaction: This describes the explicit natural actions performed by people working together in a co-located environment. Examples include hand and arm gestures, speech acts, eye gaze, and torso orientation.
Multimodal Input: This term specifically refers to a computing system being aware of and responding to the natural multimodal interactions of multiple people through any type of sensing technology.

Postures: Postures are the natural hand and arm configurations that can be observed at any instant in time. For example, a fist is a hand posture that can be recognized at some moment in time.

Gestures: While most gesture recognition engines focus on supporting the complex movements of a single point, gestures in this thesis represent the simple movement of postures. The focus is on simple affine transformations of postures (scale, rotation, and translation as postures are placed on or lifted from a digital surface). More complex gesture movements can be achieved by creating sequences of simple gestures.

Speech: This term specifically refers to the natural verbal communication that occurs in small group interaction.

Gaze: The locative references that are not covered by hand and arm gestures. Examples include eye gaze, torso orientation and head movements.
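To make these definitions concrete, the following is a minimal sketch (in Python) of how posture and gesture events might be represented. This is an illustration only, not the data model of any existing toolkit; all type and field names are my own assumptions.

    # Hypothetical event model: a gesture is a simple affine change
    # (translation, rotation, scale) of a recognized posture over time.
    from dataclasses import dataclass
    from enum import Enum, auto

    class Posture(Enum):
        FINGER = auto()
        TWO_FINGERS = auto()
        FIST = auto()
        FLAT_HAND = auto()
        ARM = auto()

    @dataclass
    class PostureEvent:
        user_id: int      # which collaborator touched (if the table can tell)
        posture: Posture
        x: float          # centroid of the contact on the table surface
        y: float
        timestamp: float

    @dataclass
    class GestureEvent:
        # The simple movement of a posture since the previous event.
        user_id: int
        posture: Posture
        dx: float         # translation
        dy: float
        dtheta: float     # rotation, in radians
        dscale: float     # e.g., two fingers moving apart
        timestamp: float

Under this model, a more complex gesture is simply a recognized sequence of these simple events, as described above.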
1.1.1 What This Thesis is Not Exploring

The focus of this thesis is on the interaction issues of multimodal input, not on multimodal output. While it is likely that I will use both visual displays and audio as output, I will not explore the use of haptic force-feedback or olfactory output devices. This thesis is about the interaction possibilities of multimodal input rather than the development of robust commercial recognition systems. Thus, whenever possible I will use existing recognition systems rather than building and refining my own recognition engine from the ground up. For example, I will likely use an existing speech recognition engine rather than build one myself. Also, I will only explore the natural gestures that occur in everyday conversations and collaborations. This means that I will avoid complex gesture patterns that would be out of place in a regular conversation setting. Finally, this thesis is a research exploration into multimodal co-located interaction. I will not attempt to deliver products that are up to the standards of a commercial or industrial system.
Figure 2. Paper maps preferred over electronic displays in military command and control [McGee, 2001]. Left: state of the art military command and control systems in action; right: what commanders prefer.
1.2 Background

People naturally use speech and gestures in their everyday communications over artefacts. Consequently, researchers are now becoming interested in exploiting speech and gestures in computer supported cooperative work systems. In this section, I provide a brief background on some of the ethnographic studies, mostly drawn from observations of safety critical environments, that form the motivations and foundations for my work in multimodal co-located interaction. Next, I extract design implications for multimodal co-located system development. Finally, I review technological explorations of multimodal and co-located systems.
1.2.1 Ethnographic and Empirical Studies

Ethnographic studies of mission critical environments such as military command posts, air traffic control centers and hospital emergency rooms have shown that paper media such as maps and flight strips are preferred even when digital counterparts are available [Cohen, 2002, Cohen, 1997, Chin, 2003, Hutchins, 2000]. For example, Cohen et al.’s ethnographic studies illustrate why paper maps on a tabletop were preferred over electronic displays by Brigadier Generals in military command and control situations [Cohen, 2002]. The ‘single user’ assumptions inherent in the electronic display’s input device
and its software limited commanders, as they were accustomed to using multiple fingers and two-handed gestures to mark (or pin) points and areas of interest with their fingers and hands, often in concert with speech [Cohen, 2002, McGee, 2001].

Figure 1. Brigadier Generals using a map simultaneously with rich hand postures. From [Cohen, 2002]

Several ethnographic researchers have focused on how gesture and speech provide improved awareness to group members in a co-located environment. Proponents of multimodal interfaces argue that the standard windows/icons/menu/pointing interaction style does not reflect how people work with highly visual interfaces in the everyday world [Cohen, 2002]. Results of empirical studies indicate that the combination of gesture and speech is more efficient and natural. For example, comparisons of speech/gestures vs. speech-only interaction by individuals performing map-based tasks showed that multimodal input resulted in more efficient use of speech (23% fewer spoken words), 35% fewer disfluencies (content self corrections, false starts, verbatim repetitions, spoken pauses, etc.), 36% fewer task performance errors, and 10% faster task performance [Oviatt, 1997]. These empirical and ethnographic studies provide motivation for multimodal support in co-located environments and have consequently led to specific design implications.
1.2.2 Implications for Design

In this thesis I focus on group interaction theories that specifically address issues of group communication, gesture and speech activity, and I apply them to the design of a digital tabletop. This section begins with low level implications that deal specifically with the mechanics of gesture and speech input and then moves to high level theories influencing group work.

Deixis: speech refined by gestures. Deictic references are speech terms (‘this’, ‘that’, etc.) whose meanings are disambiguated by spatial gestures (e.g., pointing to a location). A typical deictic utterance is
“Put that…” (points to item) “there…” (points to location) [Bolt, 1980]. Deixis often makes communication more efficient since complex locations and object descriptions can be replaced in speech by a simple gesture. For example, contrast the ease of understanding a person pointing to this sentence while saying ‘this sentence here’ to the utterance ‘the 5th sentence in the paragraph starting with the word deixis located in the middle of page 3’. Furthermore, when speech and gestures are used as multimodal input to a computer, Bolt states [1980] and Oviatt confirms [1997] that such input provides individuals with a briefer, syntactically simpler and more fluent means of input than speech alone.

Complementary modes. Speech and gestures are strikingly distinct in the information each transmits. For example, studies show that speech is less useful for describing locations and objects that are perceptually accessible to the user, with other modes such as pointing and gesturing being far more appropriate [Bolt, 1980, Oviatt, 1999]. Similarly, speech is more useful than gestures for specifying abstract or discrete actions (e.g., “Fly to Boston”).

Simplicity, efficiency, and errors. Empirical studies of speech/gestures vs. speech-only interaction by individuals performing map-based tasks showed that parallel speech/gestural input yields a higher likelihood of correct interpretation than recognition based on a single input mode [Oviatt, 1999]. This includes more efficient use of speech (23% fewer spoken words), 35% fewer disfluencies (content self corrections, false starts, verbatim repetitions, spoken pauses, etc.), 36% fewer task performance errors, and 10% faster task performance [Oviatt, 1997].

Natural interaction. During observations of people using highly visual surfaces such as maps, people were seen to interact with the map very heavily through both speech and gestures. The symbiosis between speech and gestures is verified in the strong user preferences stated by people performing map-based tasks: 95% preferred multimodal interaction vs. 5% who preferred pen only. No one preferred a speech-only interface [Oviatt, 1999].

Gaze awareness. People monitor the gaze of a collaborator [Heath, 1991, Gutwin, 2004]. It lets one know where others are looking and where they are directing their attention. It helps monitor what others are doing. It serves as visual evidence to confirm that others are looking in the right place or are attending to one’s own acts. It even serves as a deictic reference by functioning as an implicit pointing act [Clark, 1996]. Gaze awareness happens easily and naturally in a co-located tabletop setting, as people are seated such that they can see each other’s eyes and determine where they are looking on the tabletop.
Mechanics of collaboration. In terms of the low level mechanics, Wu breaks gestural interaction into three phases: gesture registration (starting posture), gesture relaxation (dynamic phase) and gesture termination [Wu, 2006]. Thus gestural interaction must not require rigid postures to be held continuously; rather, systems should be flexible about what happens between the starting posture and gesture termination. McNeill argues, on cognitive science grounds, that gesture and speech originate from the same cognitive system in the human mind and that there are several types of gestures: deictic, iconic, cohesive, beat and metaphoric [McNeill, 1992]. This shows that the deictic pointing gestures supported by current point-and-click interfaces cover only a small portion of the gestures that people use in everyday conversation. Consequently, a system needs to understand how rich gestures are used in accordance with speech so that the gesture type can be determined.

Consequential communication. Gutwin describes how speech and gestural acts provide awareness to group members through consequential communication [Gutwin, 2004, Segal, 1994]. For example, researchers have noticed that people will often verbalize their current actions aloud (e.g., “I am moving this box”) for a variety of reasons [Hutchins, 1997, Heath, 1991, Segal, 1994]:
• to make others aware of actions that may otherwise be missed,
• to forewarn others about the action they are about to take,
• to serve as an implicit request for assistance,
• to allow others to coordinate their actions with one’s own,
• to reveal the course of reasoning,
• to contribute to a history of the decision making process.

Distributed cognition. Clark presents a theoretical foundation of communication, in which communication serves as an activity for building and using common ground [Clark, 1996]. While much of human-computer interaction is focused on understanding cognition and factors within an individual, both Clark and Hollan emphasize a need to understand distributed cognition in a team setting [Hollan, 2000]. This means that researchers should consider the whole team as a cognitive system and use its communicative acts to understand patterns of information flow within the system [Hutchins, 2000].
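To ground the deixis discussion above, the following minimal sketch pairs deictic speech terms with pointing events that fall within a short time window. The window length, event format and pairing rule are illustrative assumptions, not values drawn from the literature.

    # Fuse deictic speech ("put that there") with pointing gestures by time.
    DEIXIS_WINDOW = 1.5  # seconds; an assumed integration window

    def fuse(speech_tokens, pointing_events):
        """Bind each deictic token to the pointing event nearest in time."""
        bindings = []
        for token, t_speech in speech_tokens:
            if token not in ("this", "that", "here", "there"):
                continue
            nearby = [p for p in pointing_events
                      if abs(p["time"] - t_speech) <= DEIXIS_WINDOW]
            if nearby:
                nearest = min(nearby, key=lambda p: abs(p["time"] - t_speech))
                bindings.append((token, (nearest["x"], nearest["y"])))
        return bindings

    # "Put that there" with two touches on the table:
    speech = [("put", 0.0), ("that", 0.4), ("there", 1.1)]
    points = [{"x": 120, "y": 80, "time": 0.5},
              {"x": 300, "y": 210, "time": 1.2}]
    print(fuse(speech, points))  # [('that', (120, 80)), ('there', (300, 210))]

A real integrator must also handle unpaired tokens, competing speakers and recognition errors; the point here is only the time-window pairing at the heart of deixis.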
1.2.3 Technological Explorations

Multimodal Single User: Technological explorations of single user multimodal interaction began as early as 1980 with Bolt’s Put That There multimodal system. Individuals could interact with a large
display via speech commands qualified by deictic reference, e.g., “Put that…” (points to item) “there…” (points to location) [Bolt, 1980]. Bolt argues and Oviatt confirms [Oviatt, 1999] that this multimodal input provides individuals with a briefer, syntactically simpler and more fluent means of input than speech alone. While Bolt’s Put That There system focused on deictic gestures in mid air, researchers have also explored direct touch manipulation using single touch surfaces such as the Smart Board by Smart Technologies (www.smarttech.com). McGee explored single point multimodal interaction over physical maps laid over a vertical Smart Board [McGee, 2001]. Similarly, Magerkurth (2004) explored single point multimodal interaction in a game environment over a Smart Board laid horizontally on a table surface. Although the figures in both these papers show multiple people, each system only supported a single input point at any time, thus collaborators had to take turns using the system.

Figure 3. Multimodal technological explorations: (left) Put That There; (right) STARS tabletop gaming

Single Display Groupware Toolkits: The increased interest in developing applications supporting multiple users over a single shared display has led to the development of several toolkits to support the rapid prototyping of Single Display Groupware (SDG) applications. Collaboration on a single display using multiple mice and keyboards allows researchers to rapidly explore co-located collaboration using low cost input devices. The Multiple Input Devices (MID) Toolkit allowed input from multiple mice connected to the same computer to be recognized as separate streams; this is the most basic task required in all multi-user applications [Bederson, 1998]. The SDG Toolkit extended the principles of the MID Toolkit by automatically drawing multiple cursors and providing a way to rapidly prototype multi-user analogues of single user widgets [Tse, 2005].
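As an illustration of the basic stream-separation task that MID and the SDG Toolkit address, the following sketch demultiplexes a single raw event queue into per-user handlers. The event format and callback interface are assumptions for illustration, not the actual MID or SDG APIs.

    # Route each raw input event to the handlers of the device/user
    # it came from, instead of collapsing everything into one cursor.
    from collections import defaultdict

    class PerUserDispatcher:
        def __init__(self):
            self.handlers = defaultdict(list)  # user_id -> [callback, ...]

        def subscribe(self, user_id, callback):
            # e.g., a callback that draws that user's own cursor
            self.handlers[user_id].append(callback)

        def dispatch(self, event):
            for callback in self.handlers[event["user_id"]]:
                callback(event)

    dispatcher = PerUserDispatcher()
    dispatcher.subscribe(0, lambda e: print("user 0 at", e["x"], e["y"]))
    dispatcher.subscribe(1, lambda e: print("user 1 at", e["x"], e["y"]))
    dispatcher.dispatch({"user_id": 1, "x": 55, "y": 40})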
Recent interest in large displays has led to the development of input toolkits to support rapid prototyping over large surfaces (e.g., the Diamond Touch Toolkit and DViT Toolkit [Tse, 2005]) and toolkits to support the manipulation of objects over a table surface (e.g., DiamondSpin [Vernier, 2004], the Buffer Framework [Miede, 2006]).

Tabletop Input Devices: While most touch-sensitive display surfaces only allow a single point of contact, rich gestural interaction can only be achieved through digital surfaces that support richer multi-touch interactions. However, the few surfaces that do provide multi-touch have limitations. Some, like SmartSkin [Rekimoto, 2002], are generally unavailable. Some limit what is sensed: Smart Board’s DViT (www.smarttech.com/dvit) and Han’s Frustrated Total Internal Reflectance system [Han, 2005] recognize multiple touches, but cannot identify which touch is associated with which person. Others have display constraints: MERL’s DiamondTouch [Dietz, 2001] identifies multiple people, knows the areas of the table they are touching, and can approximate the relative force of their touches, but is currently limited to front projection. My research will need to work around these limitations to explore rich multi-user gesture and speech interactions on a digital surface.

Rich Gesture Input: Beyond the simple deictic reference, researchers have explored multi-finger and whole hand rich gestural input, often without the use of speech. Baudel explored the remote control of digital artefacts using a single hand connected to a data glove [Baudel, 1993]. By performing a sideways pulling gesture with the data glove, a person could advance to the next slide of a presentation. Recently, researchers have explored rich gestural interaction directly on a tabletop surface; this provides the added benefit of gestures that are augmented with spatial references. Wu’s tabletop gestures included: a whole arm to sweep artefacts aside, a hand to rotate the table, a two finger rotation gesture, two arms moving together to gather artefacts and the back of a hand to show hidden information [Wu, 2003]. These gestures form the basis of the design of the rich multimodal gestures in my thesis, as each gesture results in meaningful actions on the digital surface and thus produces improved awareness for group members.

Distributed Co-located Systems: Researchers have also explored applications that span multiple computers in the same co-located area. Roomware environments enhance typical artefacts in a room, such as tables, walls and chairs, with digital interactive surfaces. The BEACH architecture supported communication between multiple input devices (e.g., interactive wall and table displays, Personal Digital Assistants and personal tablets) [Tandler, 2003]. Tang also explored the concept of mixed presence groupware, which allows groups of co-located individuals to work with other co-located groups over a distance [Tang, 2005].

There are important distinctions between previous work and my future research directions.
Keyboard and mouse interaction is unsatisfying for highly collaborative applications, especially those that occur in co-located environments [Gutwin, 2004]. While single point touch interaction over a Smart Board is an improvement, because people move their arms and bodies over the display, it still reduces the expressive capabilities of people’s hands and arms to deictic pointers. Designers of multimodal co-located systems need to consider multiple people simultaneously using rich bimanual, multi-postured gesture interaction, and the system must be aware of alouds: the meaningful speech phrases not directed to any individual member but used to provide awareness of users’ current actions and intentions. However, the fundamental problem is that current input technologies either do not support these rich multimodal interactions or they require programmers to develop such software from the ground up. These hurdles must be overcome before even the most basic multimodal co-located applications can be developed and the empirical work carried out to understand the nuances and potential solutions for multimodal co-located interaction.
1.3 Research Context

This research investigates multimodal co-located collaboration. Figure 2 illustrates how this research fits into the broader context of human-computer interaction (HCI). Within HCI, my research is contained in computer-supported cooperative work (CSCW). The next refinement narrows my primary focus to technologies that support co-located collaborative work. My focus can be further narrowed to include only those co-located collaborative work practices that use multimodal gesture and speech input. My research builds upon the lessons learned from ethnographic studies of safety critical applications and applies these lessons to the design of general multimodal co-located systems. Initially, I will narrow my research to multimodal co-located interaction over interactive tabletops, leaving open the possibility of exploring multimodal co-located wall or tablet interaction.
Figure 2: The context of my research
1.4 Research Objectives

I will address the above-mentioned problems by linking previous literature and ethnographic studies to my own studies of co-located multimodal environments. This synthesis will be used to inform the design and development of several multimodal co-located systems. However, as the research problems described above are inter-related and their respective research findings unknown, the findings obtained for each research problem may affect what the next research step is and how it should proceed. Therefore, the objectives described below are subject to revision depending on the outcome of the preceding research stage.

1. I will distil existing theories and ethnographic studies into a set of behavioural foundations that inform the design of multimodal co-located systems and list individual and group benefits.
This objective will be achieved by performing a survey of existing theories of team work and ethnographic research in safety critical environments. I will examine interaction in real world situations, paying particular attention to the speech and gesture acts used, to produce a list of individual and group benefits of multimodal co-located interaction. This summary will outline some of the benefits and provide
motivation for adding multimodal interaction to co-located environments. It will form the basis of the design of our multimodal co-located applications and will be used in the evaluation of the multimodal co-located systems in this thesis.

2. I will develop a toolkit that allows the rapid prototyping of multimodal co-located interactive systems.
Using the experience gained from my Master’s thesis [Tse, 2005] on building toolkits to support application development using multiple mice and keyboards, I will develop a software toolkit that facilitates the rapid prototyping of responsive and demonstrative multimodal gesture and speech applications in a co-located environment. This objective consists of three sub-goals:
1. I will develop a gesture recognizer that recognizes different hand postures (e.g., arm, hand, five fingers, fist) and their respective dynamic movements (e.g., two fingers moving apart) for multiple people (up to four) on a co-located tabletop display (a minimal sketch of this sub-goal appears at the end of this objective).
2. I will develop a multimodal integrator that accepts both speech and gesture commands from multiple people and integrates these commands on a single computer. I will use existing speech recognition technology to recognize voice commands. This toolkit will support multiple computers over a network, because hardware and software limitations often require that multiple large displays or input devices be controlled by separate computers.
3. I will develop tools to simplify the adaptation of commercial single user applications to a multimodal co-located environment. This will allow one to rapidly prototype rich multimodal applications without the need to develop a working commercial system from the ground up, and will facilitate the exploration and further understanding of multimodal co-located application development.
To begin, this toolkit will be designed to support gestures on existing tabletop input devices (e.g., Diamond Touch, Smart DViT); this infrastructure may be extended to support other multimodal input devices at a later time. To evaluate the toolkit, I will build the applications described in Objectives 3 and 4, and I will ask others to develop multimodal co-located systems using my toolkit.
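The sketch below illustrates the posture-recognition sub-goal. A real engine (such as the Diamond Touch Gesture Engine described in Section 1.5) works from the table’s raw sensor signal; the bounding-box heuristic and thresholds here are illustrative assumptions only.

    # Classify a table contact into a coarse hand posture by its geometry.
    def classify_posture(width_cm, height_cm, num_contacts):
        longest = max(width_cm, height_cm)
        area = width_cm * height_cm
        if num_contacts >= 2:
            return "%d fingers" % num_contacts   # e.g., two fingers apart
        if area < 4:
            return "finger"                      # small point contact
        if longest > 30:
            return "arm"                         # long contact region
        if area < 40:
            return "fist"
        return "flat hand"

    print(classify_posture(1.5, 1.5, 1))    # finger
    print(classify_posture(12.0, 9.0, 1))   # flat hand
    print(classify_posture(40.0, 10.0, 1))  # arm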
3. I will develop and evaluate multimodal co-located wrappers over existing commercial applications to further my understanding and inform the design of true multi-user multimodal interactive systems.
Using the design implications and behavioural foundations developed in Objective One and the prototyping toolkit developed in Objective Two, I will develop several multi-user, multimodal co-located interface wrappers atop existing commercial applications on an interactive tabletop display. By building on existing commercial applications, I will be able to rapidly prototype rich multimodal applications that would otherwise be impossible for me to develop from the ground up. User studies of these systems will also provide an opportunity to observe how people naturally mitigate interference and turn taking when interacting with single user applications over a multimodal co-located tabletop display. They will also be used to evaluate how the design implications provided in Objective One hold when moved out of the physical world into the realm of a digital tabletop. All of these observations will be used to inform the design of a true multi-user multimodal system.

4. I will develop true multi-user multimodal co-located systems and evaluate the technical and behavioural nuances of multimodal co-located systems development. I will develop and evaluate different techniques to deal with these nuances to inform the design of future multimodal co-located systems.
Again, using the toolkit developed in Objective Two, I will create several applications that explore new interaction possibilities available exclusively in a multi-user multimodal co-located environment. I will have groups of people perform collaborative tasks in this environment, paying particular attention to the inter-person behaviours and technical nuances of multimodal co-located application development. Using the list of nuances, I will evaluate the techniques used in the existing literature to mitigate these problems. I may also develop my own interaction techniques and compare them against commonly accepted approaches. For example, if one of the nuances of multimodal co-located interaction turns out to be the need to manage when speech recognition is activated, I will evaluate existing techniques for activating speech recognition (e.g., push to talk, look to talk) and possibly develop a new technique to mitigate the issue in the co-located environment.

Some of the research directions that I am considering for Objective Three include examining how a multimodal co-located interactive system will influence or affect the natural interactions that occur in co-present meetings. For example, alouds are high level spoken utterances made by the performer of an action, meant for the benefit of the group but not directed to any one individual in the group [Heath, 1991]. Since spoken commands are directed to the computer, they are not truly alouds; thus we do not know which behavioural affordances normally provided by alouds will be preserved in the multimodal speech recognition environment. Furthermore, I am interested in examining techniques to manage when speech
utterances are meant for the computer versus when they are meant for collaborators. The research that I am capable of exploring will depend heavily on the capabilities provided by the multimodal co-located toolkit described in Objective 2.
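As one concrete example of such a technique, the sketch below gates speech recognition with push to talk, so an utterance is treated as a command only while its speaker holds a talk control; everything else is left as person-to-person conversation. The interface is a stand-in for illustration, not Microsoft SAPI’s actual API.

    # Per-user push-to-talk gating of speech commands.
    class PushToTalk:
        def __init__(self):
            self.active_users = set()

        def press(self, user_id):
            self.active_users.add(user_id)

        def release(self, user_id):
            self.active_users.discard(user_id)

        def on_speech(self, user_id, utterance):
            # Only gated-open speech is forwarded as a command.
            if user_id in self.active_users:
                return ("command", user_id, utterance)
            return ("conversation", user_id, utterance)

    ptt = PushToTalk()
    ptt.press(0)
    print(ptt.on_speech(0, "fly to Boston"))    # ('command', 0, ...)
    print(ptt.on_speech(1, "should we zoom?"))  # ('conversation', 1, ...)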
1.5 Current Status

Much of Objective One has been completed. I have written a paper that outlines a list of behavioural foundations describing the individual and group benefits, and the design implications, of gesture and speech interaction in an interactive co-located environment (see Appendix B). This initial list will be further expanded in my thesis to include ethnographic studies of other collaborative environments (e.g., NASA control centres, hospital surgery rooms) and my own anecdotal experiences from application development in Objectives Three and Four.

Parts of Objective Two have been completed. I have developed a toolkit called the Diamond Touch Gesture Engine that allows different hand postures (e.g., hand, five fingers, fist) and their respective movements (e.g., two fingers moving together) to be detected from multiple people on a tabletop display. I have used this gesture engine and the Microsoft Speech Application Programming Interface (Microsoft SAPI) to prototype speech and gesture applications that interact with existing commercial applications (e.g., Google Earth and Blizzard’s Warcraft III; see Appendix B). This toolkit needs to be improved to provide more reliable recognition of gesture movement (e.g., rotation of five fingers) and input across multiple computers. I have begun work on the Centralized External Input (CEXI) Toolkit, which allows input to be sent between different computers over a local area network. The next step will be to combine the Diamond Touch Gesture Engine, Microsoft SAPI and the CEXI Toolkit into a unified infrastructure for exploring multimodal co-located interaction.

I have explored multi-user multimodal co-located wrappers for three commercial single user applications. My experiences from adapting Google Earth, Warcraft III and The Sims have allowed me to focus my efforts on providing rich multimodal gesture and speech interactivity rather than building a truly useful application. These application wrappers have generated significant interest in my research from industry (e.g., PB Faradyne) and government agencies (e.g., Disaster Services of the City of Calgary), as they are a compelling way of illustrating how existing commercial applications would work on a tabletop surface using gesture and speech.
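The essence of these wrappers can be sketched as follows: a fused multimodal command is replayed to the unmodified single-user application as the conventional mouse and keyboard events it already understands, one command at a time. The command names and the send_* functions below are hypothetical placeholders for a platform input-injection facility, not the wrappers’ actual code.

    # Replay one fused multimodal command as conventional input events.
    def send_mouse_drag(x1, y1, x2, y2):
        print("mouse drag (%d,%d) -> (%d,%d)" % (x1, y1, x2, y2))  # placeholder

    def send_keystrokes(text):
        print("keystrokes: %r" % text)                             # placeholder

    def execute_on_single_user_app(command):
        if command["name"] == "pan":        # e.g., a flat hand dragged
            (x1, y1), (x2, y2) = command["from"], command["to"]
            send_mouse_drag(x1, y1, x2, y2)
        elif command["name"] == "fly to":   # e.g., a spoken place name
            send_keystrokes(command["argument"] + "\n")

    execute_on_single_user_app({"name": "pan", "from": (100, 100), "to": (300, 180)})
    execute_on_single_user_app({"name": "fly to", "argument": "Boston"})

Because the underlying application still assumes a single user, the wrapper must serialize commands from multiple collaborators, which is exactly the turn taking behaviour the studies in Objective Three will observe.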
A large amount of work remains to be completed in Objective Four. In particular, I am actively searching for instances of multi-user multimodal gestures in the current literature, and I am looking for an application domain in which to develop a prototype true multi-user application. My hope is to engage industry partners who can tell me about the interactions and issues that they are facing. Using this information I will begin the exploration of true multi-user multimodal systems and their subsequent evaluations.

I anticipate several avenues that can be explored as this work unfolds. While it is unlikely that all of these avenues will be pursued in the context of this research, they describe a broader research agenda. Mixed presence groupware explores groups of co-located individuals working with remote groups; such systems could leverage multimodal interaction to improve awareness for remote participants. Gaze, head and torso tracking would allow a richer set of deictic gestures, where one could specify areas of interest by orienting one’s head and torso to a digital representation of a remote participant. Finally, multimodal co-located interaction could be used to explore the movement of digital information from a tabletop display to a peripheral wall display and vice versa, thus allowing digital content to move seamlessly in the digital work environment.
1.6 Conclusion

This thesis argues that single point touch interaction on a large display reduces the expressive capabilities of people’s hands and arms to simple deictic pointers. Richer multimodal interaction that is aware of the hand postures, movements and speech acts that people naturally perform in a co-located environment will not only provide improved group awareness, it will also improve the accuracy and effectiveness of collaborations in co-located environments. The related work has shown that there is a wealth of ethnographic, theoretical and technical research that has investigated and argued for the benefits of multimodal interaction in a co-located environment. This proposal has identified a largely unexplored area in Human-Computer Interaction: multimodal co-located collaboration. The research I propose in this document aims to ground the individual and group benefits of multimodal interaction in the co-located setting. The contributions offered by this research are: an improved understanding of the benefits and tradeoffs of multimodal input in a co-located setting; a toolkit that allows rapid prototyping of multimodal interactive systems; the exploration of multi-user multimodal co-located wrappers around existing single user
applications; and the design and evaluation of true multimodal co-located systems, with the goal of understanding the nuances and design implications of effective multimodal co-located interaction.
1.7 References

1. Baudel, T. and Beaudouin-Lafon, M. (1993). Charade: Remote control of objects using free-hand gestures. Communications of the ACM, 36(7), 28-35.
2. Bederson, B. and Hourcade, J. (1999). Architecture and implementation of a Java package for Multiple Input Devices (MID). HCIL Technical Report No. 9908, http://www.cs.umd.edu/hcil.
3. Bolt, R.A. (1980). Put-that-there: Voice and gesture at the graphics interface. Proc. ACM Conf. Computer Graphics and Interactive Techniques, Seattle, 262-270.
4. Chin, T. (2003). Doctors pull plug on paperless system. American Medical News, Feb 17, 2003, http://www.ama-assn.org/amednews/2003/02/17/bil20217.htm.
5. Clark, H. (1996). Using Language. Cambridge University Press.
6. Cohen, P. (2000). Speech can’t do everything: A case for multimodal systems. Speech Technology Magazine, 5(4).
7. Cohen, P.R., Johnston, M., McGee, D., Oviatt, S., Pittman, J., Smith, I., Chen, L. and Clow, J. (1997). QuickSet: Multimodal interaction for distributed applications. Proc. ACM Multimedia, 31-40.
8. Cohen, P.R., Coulston, R. and Krout, K. (2002). Multimodal interaction during multiparty dialogues: Initial results. Proc. IEEE Int’l Conf. Multimodal Interfaces, 448-452.
9. Dietz, P. and Leigh, D. (2001). DiamondTouch: A multi-user touch technology. Proc. ACM UIST, 219-226.
10. Gutwin, C. and Greenberg, S. (2004). The importance of awareness for team cognition in distributed collaboration. In E. Salas and S. Fiore (Eds.), Team Cognition: Understanding the Factors that Drive Process and Performance, APA Press, 177-201.
11. Han, J. (2005). Low-cost multi-touch sensing through frustrated total internal reflection. Proc. ACM UIST, 115-118.
12. Heath, C.C. and Luff, P. (1991). Collaborative activity and technological design: Task coordination in London Underground control rooms. Proc. ECSCW, 65-80.
13. Hollan, J., Hutchins, E. and Kirsh, D. (2000). Distributed cognition: Toward a new foundation for human-computer interaction. ACM TOCHI, 7(2), 174-196.
14. Hutchins, E. and Palen, L. (1997). Constructing meaning from space, gesture, and speech. In Discourse, Tools, and Reasoning: Essays on Situated Cognition. Springer-Verlag, Heidelberg, 23-40.
15. Hutchins, E. (2000). The cognitive consequences of patterns of information flow. Intellectica, 30, 53-74.
16. Isenberg, T., Miede, A. and Carpendale, S. (2006). A buffer framework for supporting responsive interaction in information visualization interfaces. Proc. C5 2006, Berkeley, California.
17. Magerkurth, C., Memisoglu, M., Engelke, T. and Streitz, N. (2004). Towards the next generation of tabletop gaming experiences. Proc. Graphics Interface 2004, London, Ontario, 73-80.
18. McGee, D.R. and Cohen, P.R. (2001). Creating tangible interfaces by augmenting physical objects with multimodal language. Proc. ACM Conf. Intelligent User Interfaces, 113-119.
19. McNeill, D. (1992). Hand and Mind: What Gestures Reveal About Thought. University of Chicago Press, Chicago.
20. Oviatt, S. (1997). Multimodal interactive maps: Designing for human performance. Human-Computer Interaction, 12.
21. Oviatt, S.L. (1999). Ten myths of multimodal interaction. Communications of the ACM, 42(11), 74-81.
22. Rekimoto, J. (2002). SmartSkin: An infrastructure for freehand manipulation on interactive surfaces. Proc. ACM CHI.
23. Segal, L. (1994). Effects of checklist interface on non-verbal crew communications. NASA Ames Research Center, Contractor Report 177639.
24. Shen, C., Vernier, F.D., Forlines, C. and Ringel, M. (2004). DiamondSpin: An extensible toolkit for around-the-table interaction. Proc. ACM CHI, 167-174.
25. Tandler, P. (2003). The BEACH application model and software framework for synchronous collaboration in ubiquitous computing environments. Journal of Systems & Software, special edition on application models and programming tools for ubiquitous computing, October 2003.
26. Tang, A. (2005). Embodiments in Mixed Presence Groupware. MSc Thesis, Department of Computer Science, University of Calgary, Calgary, Alberta, Canada.
27. Tse, E. (2004). The Single Display Groupware Toolkit. MSc Thesis, Department of Computer Science, University of Calgary, Calgary, Alberta, Canada.
28. Wu, M. and Balakrishnan, R. (2003). Multi-finger and whole hand gestural interaction techniques for multi-user tabletop displays. Proc. ACM UIST, 193-202.
29. Wu, M., Shen, C., Ryall, K., Forlines, C. and Balakrishnan, R. (2006). Gesture registration, relaxation, and reuse for multi-point direct-touch surfaces. Proc. IEEE Tabletop 2006, Adelaide, South Australia, 183-190.
Appendix A. PhD Timeline

Included is a rough schedule of upcoming events, indicating anticipated timelines and when deliverables will be completed.
2006
February: Final version of the research proposal completed and submitted to the committee.
March: Written candidacy examination; begin work on a tool to simplify the process of creating multimodal wrappers around existing single user applications (Thesis Objective 3).
April: Oral candidacy examination. Objective: submit a paper to the User Interface Software and Technology (UIST) conference regarding the multimodal wrappers around existing single user applications.
May: Begin internship at Mitsubishi Electric Research Laboratories (MERL). Work on true multi-user multimodal systems (Thesis Objective 4) and begin to study the use of true multimodal systems.
May – September: Objective: submit a paper to the conference on Human Factors in Computing Systems (CHI) regarding studies of the usage of true multi-user multimodal systems. I plan to present my current work to the supervisory committee for approval and future directions.
September – December: Continue the future work and directions provided by the supervisory committee at MERL.

2007
January: Directions meeting with the supervisory committee to examine current progress and directions. Begin work on writing my PhD thesis, tie up loose ends in my research and publish any papers that remain to be published about my work.
February: Begin final studies and systems to complete the requirements and goals of Thesis Objective 4.
September: Begin writing the PhD thesis.

2008
March: PhD thesis completed and submitted to committee for approval.
April: PhD oral defense.
Appendix B. Multimodal Co-Located Wrappers Paper
Reference: Tse, E., Shen, C., Greenberg, S. and Forlines, C. (2006). Enabling Interaction with Single User Applications through Speech and Gestures on a Multi-User Tabletop. Proceedings of AVI 2006, Venice, Italy. To appear.
Enabling Interaction with Single User Applications through Speech and Gestures on a Multi-User Tabletop

Edward Tse (1,2), Chia Shen (1), Saul Greenberg (2) and Clifton Forlines (1)
(1) Mitsubishi Electric Research Laboratories, 201 Broadway, Cambridge, MA, 02139, USA, +1 617 621-7500
(2) University of Calgary, 2500 University Dr. N.W., Calgary, Alberta, T2N 1N4, Canada, +1 403 220-6087
[shen, forlines]@merl.com and [tsee, saul]@cpsc.ucalgary.ca

ABSTRACT
Co-located collaborators often work over physical tabletops with rich geospatial information. Previous research shows that people use gestures and speech as they interact with artefacts on the table and communicate with one another. With the advent of large multi-touch surfaces, developers are now applying this knowledge to create appropriate technical innovations in digital table design. Yet they are limited by the difficulty of building a truly useful collaborative application from the ground up. In this paper, we circumvent this difficulty by: (a) building a multimodal speech and gesture engine around the Diamond Touch multi-user surface, and (b) wrapping existing, widely-used off-the-shelf single-user interactive spatial applications with a multimodal interface created from this engine. Through case studies of two quite different geospatial systems – Google Earth and Warcraft III – we show the new functionalities, feasibility and limitations of leveraging such single-user applications within a multi user, multimodal tabletop. This research informs the design of future multimodal tabletop applications that can exploit single-user software conveniently available in the market. We also contribute (1) a set of technical and behavioural affordances of multimodal interaction on a tabletop, and (2) lessons learnt from the limitations of single user applications.
Categories and Subject Descriptors
H5.2 [Information interfaces and presentation]: User Interfaces – Interaction Styles.
General Terms
Design, Human Factors

Keywords
Tabletop interaction, visual-spatial displays, multimodal speech and gesture interfaces, computer supported cooperative work.
1. INTRODUCTION
Traditional desktop computers are unsatisfying for highly collaborative situations involving multiple co-located people exploring and problem-solving over rich spatial information. These situations include mission critical environments such as military command posts and air traffic control centers, in which paper media such as maps and flight strips are preferred even when digital counterparts are available [4][5]. For example, Cohen et al.’s ethnographic studies illustrate why paper maps on
a tabletop were preferred over electronic displays by Brigadier Generals in military command and control situations [4]. The ‘single user’ assumptions inherent in the electronic display’s input device and its software limited commanders, as they were accustomed to using multiple fingers and two-handed gestures to mark (or pin) points and areas of interest with their fingers and hands, often in concert with speech [4][16]. While there are many factors promoting rich information use on physical tables over desktop computers, e.g., insufficient screen real estate and low image resolution of monitors, an often overlooked problem with a personal computer is that most digital systems are designed within single-user constraints. Only one person can easily see and interact with information at a given time. While another person can work with it through turn-taking, the system is blind to this fact. Even if a large high resolution display is available, one person’s standard window/icon/mouse interaction – optimized for small screens and individual performance – becomes awkward and hard to see and comprehend by others involved in the collaboration [12]. For a computer system to be effective in such collaborative situations, the group needs at least: (a) a large and convenient display surface, (b) input methods that are aware of multiple people, and (c) input methods that leverage how people interact and communicate over the surface via gestures and verbal utterances [4][18]. For point (a), we argue that a digital tabletop display is a conducive form factor for collaboration since it lets people easily position themselves in a variety of collaborative postures (side by side, kitty-corner, round table, etc.) while giving all equal and simultaneous opportunity to reach into and interact over the surface. For points (b+c), we argue that multimodal gesture and speech input benefits collaborative tabletop interaction: reasons will be summarized in Section 2. The natural consequence of these arguments is that researchers are now concentrating on specialized multi-user, multimodal digital tabletop applications affording visual-spatial interaction. However, several limitations make this a challenging goal: 1. Hardware Limitations. Most touch-sensitive display surfaces only allow a single point of contact. The few surfaces that do provide multi-touch have serious limitations. Some, like SmartSkin [20], are generally unavailable. Others limit what is sensed: SmartBoard’s DViT (www.smarttech.com/dvit) currently recognizes a maximum of 2 touches and the touch point size, but cannot identify which touch is associated with which person. Some have display constraints: MERL’s DiamondTouch [6] identifies multiple people, knows the areas of the table they are touching, and can approximate the relative force of their touches; however, the technology is currently limited to front projection and their surfaces are
relatively small. Consequently, most research systems limit interaction to a single touch/user, or by having people interact indirectly through PDAs, mice, and tablets (e.g., [16]). 2. Software Limitations. It is difficult and expensive to build a truly useful collaborative multimodal spatial application from the ground up (e.g., Quickset [5]). As a consequence, most research systems are ‘toy’ applications that do not afford the rich information and/or interaction possibilities expected in well-developed commercial products. The focus of this paper is on wrapping existing single user geospatial applications within the multi-user, multimodal tabletop setting. Just as screen/window sharing systems let distributed collaborators share views and interactions with existing familiar single user applications [9], we believe that embedding familiar single-user applications within a multi-user multimodal tabletop setting – if done suitably – can benefit co-located workers. The remainder of this paper develops this idea in three ways. First, we analyze and summarize the behavioural foundations motivating why collaborators should be able to use both speech and gestures atop tables. Second, we briefly present our Gesture Speech Infrastructure used to add multimodal, multi user functionality to existing commercial spatial applications. Third, through case studies of two different systems – Google Earth and Warcraft III – we analyze the feasibility and limitations of leveraging such single-user applications within a multi-user, multimodal tabletop.
2. BEHAVIOURAL FOUNDATIONS
This section reviews related research and summarizes it in the form of a set of behavioural foundations.
2.1 Individual Benefits
Proponents of multimodal interfaces argue that the standard windows/icons/menu/pointing interaction style does not reflect how people work with highly visual interfaces in the everyday world [4]. They state that the combination of gesture and speech is more efficient and natural. We summarize below some of the many benefits gesture and speech input provides to individuals.

Deixis: speech refined by gestures. Deictic references are speech terms (‘this’, ‘that’, etc.) whose meanings are qualified by spatial gestures (e.g., pointing to a location). This was exploited in the Put-That-There multimodal system [1], where individuals could interact with a large display via speech commands qualified by deictic reference, e.g., “Put that…” (points to item) “there…” (points to location). Bolt argues [1] and Oviatt confirms [18] that this multimodal input provides individuals with a briefer, syntactically simpler and more fluent means of input than speech alone. Studies also show that parallel recognition of two input signals by the system yields a higher likelihood of correct interpretation than recognition based on a single input mode [18].

Complementary modes. Speech and gestures are strikingly distinct in the information each transmits, how it is used during communication, the way it interoperates with other communication modes, and how it is suited to particular interaction styles. For example, studies clearly show performance benefits when people indicate spatial objects and locations – points, paths, areas, groupings and containment – through gestures instead of speech [17][18][5][3]. Similarly, speech is more useful than gestures for specifying abstract actions.
Simplicity, efficiency, and errors. Empirical studies of speech/gestures vs. speech-only interaction by individuals performing map-based tasks showed that multimodal input resulted in more efficient use of speech (23% fewer spoken words), 35% fewer disfluencies (content self corrections, false starts, verbatim repetitions, spoken pauses, etc.), 36% fewer task performance errors, and 10% faster task performance [18].

Rich gestures and hand postures. Unlike the current deictic ‘pointing’ style of mouse-based and pen-based systems, observations of people working over maps showed that people used different hand postures as well as both hands coupled with speech in very rich ways [4].

Natural interaction. During observations of people using highly visual surfaces such as maps, people were seen to interact with the map very heavily through both speech and gestures. The symbiosis between speech and gestures is verified in the strong user preferences stated by people performing map-based tasks: 95% preferred multimodal interaction vs. 5% who preferred pen only. No one preferred a speech-only interface [18].
2.2 Group Benefits
Spatial information placed atop a table typically serves as a conversational prop to the group, creating a common ground that informs and coordinates their joint actions [2]. Rich collaborative interactions over this information often occur as a direct result of workspace awareness: the up-to-the-moment understanding one person has of another person’s interaction with the shared workspace [11]. This includes awareness of people, how they interact with the workspace, and the events happening within the workspace over time. As outlined below, many behavioural factors comprising the mechanics of collaboration [19] require speech and gestures to contribute to how collaborators maintain and exploit workspace awareness over tabletops.

Alouds. These are high level spoken utterances made by the performer of an action meant for the benefit of the group but not directed to any one individual in the group [13]. This ‘verbal shadowing’ becomes the running commentary that people commonly produce alongside their actions. For example, a person may say something like “I am moving this box” for a variety of reasons:
• to make others aware of actions that may otherwise be missed,
• to forewarn others about the action they are about to take,
• to serve as an implicit request for assistance,
• to allow others to coordinate their actions with one’s own,
• to reveal the course of reasoning,
• to contribute to a history of the decision making process.
When working over a table, alouds can help others decide when and where to direct their attention, e.g., by glancing up and looking to see what that person is doing in more detail [11].

Gestures as intentional communication. In observational studies of collaborative design involving a tabletop drawing surface, Tang noticed that over one third of all activities consisted of intentional gestures [23]. These intentional gestures serve many communication roles [19], including:
• pointing to objects and areas of interest within the workspace,
• drawing of paths and shapes to emphasise content,
• giving directions,
• indicating sizes or areas,
• acting out operations.
Deixis also serves as a communicative act, since gestures disambiguate one’s spoken references to objects and spatial locations [19]. An example is one person telling another “This one” while pointing to a specific object. Deixis often makes communication more efficient, since complex location and object descriptions can be replaced in speech by a simple gesture. For example, contrast the ease of understanding a person pointing to this sentence while saying ‘this sentence here’ with the utterance ‘the 4th sentence in the paragraph starting with the word deixis located in the middle of the column on page 3’.

Gestures as consequential communication. Consequential communication happens as one watches the bodies of others moving around the work surface [22][19]. Many gestures are consequential rather than intentional communication. For example, as one person moves her hand in a grasping posture towards an object, others can infer where her hand is heading and what she likely plans to do. Gestures are also produced as part of many mechanical actions, e.g., grasping, moving, or picking up an object; these also serve to emphasize actions atop the workspace. If accompanied by speech, they further reinforce one’s understanding of what that person is doing.

Simultaneous activity. Given good proximity to the work surface, participants often work simultaneously over tables. For example, Tang observed that approximately 50-70% of people’s activities around the tabletop involved simultaneous access to the space by more than one person [23].

Gaze awareness. People monitor the gaze of their collaborators [13][14][11]. Gaze lets one know where others are looking and where they are directing their attention. It helps one check what others are doing. It serves as visual evidence to confirm that others are looking at the right place or are attending to one’s own acts. It can even serve as a deictic reference, functioning as an implicit pointing act. While gaze awareness is difficult to support in distributed groupware [14], it happens easily and naturally in the co-located tabletop setting [13][11].
2.3 Implications
The above points clearly suggest the benefits of supporting multimodal gesture and speech input on a multi-user digital table. Not only is this a good way to support individual work over spatially located visual artefacts, but intermixed speech and gestures comprise part of the glue that makes tabletop collaboration effective. Taken together, gestures and speech coupled with gaze awareness support a rich multi-person choreography of often simultaneous collaborative acts over visual information. Collaborators’ intentional and consequential gestures, gaze movements and verbal alouds indicate intentions, reasoning, and actions. Participants monitor these acts to help coordinate actions and to regulate their access to the table and its artefacts. Participants’ simultaneous activities promote interaction ranging from loosely coupled, semi-independent tabletop activities to a tightly coordinated dance of dependent activities. While supporting these acts is a worthy goal for digital table design, they will clearly be compromised if we restrict a group to traditional single-user mouse and keyboard interaction. In the next section, we describe an infrastructure that lets us create a speech and gesture multimodal, multi-user wrapper around such single-user systems. As we will see in the following case studies, these wrappers afford a subset of the benefits of multimodal interaction.
3. GESTURE SPEECH INFRASTRUCTURE
Our infrastructure is illustrated in Fig. 1. A standard Windows computer drives our infrastructure software, as described below.

Figure 1. The Gesture Speech Infrastructure

The table is a 42” MERL DiamondTouch surface [6] with a 4:3 aspect ratio; a digital projector casts a 1280x1024 pixel image onto the table’s surface. The table is multi-touch sensitive, where contact is presented through the DiamondTouch SDK as an array of horizontal and vertical signals, touch points and bounding boxes (Fig. 1, row 5). The table is also multi-user, as it distinguishes signals from up to four people. While our technology uses the DiamondTouch, the theoretical motivations, strategies developed, and lessons learnt should apply to other touch- or vision-based surfaces that offer similar multi-user capabilities.

Speech Recognition. For speech recognition, we exploit readily available technology: noise-cancelling headset microphones for capturing speech input, and the Microsoft Speech Application Programming Interface (Microsoft SAPI) (Fig. 1, rows 4+5). SAPI provides an n-best list of matches for the current recognition hypothesis. Due to the one-user-per-computer limitation in Microsoft SAPI, only one headset can be attached to our main computer. We add an additional computer for each additional headset, which collects and sends speech commands to the primary computer (Fig. 1, right side, showing a 2nd headset).

Gesture Engine. Since recognizing gestures from multiple people on a tabletop is still an emerging research area [25][26], we could not use existing 3rd party gesture recognizers. Consequently, we developed our own DiamondTouch gesture recognition engine, which converts the raw touch information produced by the DiamondTouch SDK into a number of rotation- and table-size-independent features (Fig. 1, rows 4+5 middle). Using a univariate Gaussian clustering algorithm, features from a single input frame are compared against a number of pre-trained hand and finger postures. By examining multiple frames over time, we also capture dynamic information, such as a hand moving up or two fingers moving closer together or farther apart. This allows applications to understand both different hand postures and dynamic movements over the DiamondTouch.
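As a concrete illustration of the posture matching step, the sketch below scores a frame’s feature vector against per-posture univariate Gaussians learned from labelled training frames, and picks the best-scoring posture. The feature set, training data and function names are assumptions for illustration; this is not the actual engine code.

```python
# A minimal sketch of univariate Gaussian posture classification, assuming
# each touch frame has already been reduced to a fixed-length feature vector
# (here: contact width, contact height, touch-point count; all hypothetical).
import math

def train_posture(frames: list[list[float]]) -> list[tuple[float, float]]:
    """Per-feature mean and variance, treating features as independent
    univariate Gaussians."""
    n, dims = len(frames), len(frames[0])
    stats = []
    for d in range(dims):
        mean = sum(f[d] for f in frames) / n
        var = sum((f[d] - mean) ** 2 for f in frames) / n or 1e-6  # floor zero variance
        stats.append((mean, var))
    return stats

def log_likelihood(features, stats):
    """Sum of per-feature Gaussian log-densities for one input frame."""
    ll = 0.0
    for x, (mean, var) in zip(features, stats):
        ll += -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
    return ll

def classify(features, postures: dict[str, list[tuple[float, float]]]) -> str:
    """Return the pre-trained posture whose Gaussians best explain the frame."""
    return max(postures, key=lambda name: log_likelihood(features, postures[name]))

# Hypothetical postures trained offline from labelled frames:
postures = {
    "one finger": train_posture([[0.02, 0.02, 1], [0.03, 0.02, 1]]),
    "flat hand":  train_posture([[0.30, 0.25, 5], [0.28, 0.27, 5]]),
}
print(classify([0.29, 0.26, 5], postures))  # -> "flat hand"
```

Tracking the winning posture across successive frames would then expose the dynamic movements (e.g., fingers spreading apart) described above.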
Input Translation and Mapping. To interact with existing single user applications, we first use the GroupLab WidgetTap toolkit [8] to determine the location and size of the GUI elements within the application. We then use the Microsoft SendInput facility to relay the gesture and speech input actions to the locations of the mapped UI elements (Fig. 1, rows 1, 2 and 3). Speech and gestures are thus mapped and transformed into one or more traditional GUI actions, as if the user had performed the interaction sequence via the mouse and keyboard. The consequence is that the application appears to directly understand the spoken commands and gestures. Section 5.5 elaborates further on how this mapping is done. If the application allows us to do so, we also hide the GUI elements so they do not clutter the display. Of importance is that application source code is neither required nor modified.
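The sketch below illustrates this mapping idea under stated assumptions: find_widget and click are hypothetical stand-ins for the WidgetTap lookup and SendInput injection, and the command table is invented for illustration.

```python
# A minimal sketch of speech-to-GUI mapping. The stubs below stand in for
# WidgetTap (widget lookup) and SendInput (event injection); the real wrapper
# uses those Windows facilities, not these functions.
from typing import Callable

def find_widget(window: str, widget: str) -> tuple[int, int]:
    """Stub: would ask WidgetTap for the centre of a named GUI element."""
    return {"layers_panel.roads_checkbox": (950, 412)}[f"{window}.{widget}"]

def click(x: int, y: int) -> None:
    """Stub: would synthesize a mouse click via SendInput at (x, y)."""
    print(f"click at ({x}, {y})")

# Each speech command expands to a sequence of ordinary GUI actions, so the
# unmodified application behaves as if a mouse user had performed them.
COMMANDS: dict[str, Callable[[], None]] = {
    "layer roads": lambda: click(*find_widget("layers_panel", "roads_checkbox")),
}

def on_speech(utterance: str) -> None:
    action = COMMANDS.get(utterance.lower())
    if action:
        action()

on_speech("Layer roads")  # -> click at (950, 412)
```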
4. GOOGLE EARTH and WARCRAFT III
Our case studies leverage the power of two commercial single-user geospatial applications: Google Earth (earth.google.com) and Blizzard’s Warcraft III (www.blizzard.com/war3). The following sections briefly describe their functionality and how our multimodal interface interacts with them. While the remainder of this paper primarily focuses on two people working over these applications, many of the points raised apply equally to groups of three or four.
4.1 Google Earth
Google Earth is a free desktop geospatial application that allows one to search, navigate, bookmark, and annotate satellite imagery of the entire planet using a keyboard and mouse. Its database contains detailed satellite imagery with layered geospatial data (e.g., roads, borders, accommodations). It is highly interactive, with compelling real-time feedback during panning, zooming and ‘flying’ actions, as well as the ability to tilt and rotate the scene and view 3D terrain and buildings. Previously visited places can be bookmarked, saved, exported and imported using the Places feature. One can also measure the distance between any two points on the globe. Table 1 provides a partial list of how we mapped Google Earth onto our multimodal speech and gesture system, while Fig. 2 illustrates Google Earth running on our multimodal, multi-user table. For reasons that will be explained in §5.4, almost all speech and gesture actions are independent of one another and immediately invoke an action after being issued. The exceptions are ‘Create a path / region’ and ‘Measure distance’, where the system waits for finger input and an ‘ok’ or ‘cancel’ utterance (Table 1); a sketch of this sequencing appears below the table.
Table 1. The Speech/Gesture interface to Google Earth

Speech commands:
  Fly to [location] (e.g., Boston, Paris): navigates to the named location
  Places [name] (e.g., MERL): flies to a custom-created place
  Navigation panel: toggles the 3D navigation controls, e.g., rotate
  Layer [name] (e.g., bars, banks): toggles a layer
  Undo layer: removes the last layer
  Reorient: returns to the default upright orientation
  Create a path [points] Ok: creates a path that can be travelled in 3D
  Tour last path: does a 3D flyover of the previously drawn path
  Bookmark: pins and saves the current location
  Last bookmark: flies to the last bookmark
  Next bookmark: flies to the next bookmark
  Create a region [points] Ok: highlights an area via a semi-transparent region
  Measure distance [point] [point]: measures the shortest distance between two points

Gesture commands:
  One finger move / flick: pans the map directly / continuously
  One finger double tap: zooms in 2x at the tapped location
  Two fingers, spread apart: zoom in
  Two fingers, spread together: zoom out
  Either spread action done rapidly: continuous zoom in / out until release
  One hand: 3D tilt down
  Five fingers: 3D tilt up

Figure 2. Google Earth on a table.
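To illustrate how a multi-step command such as ‘Create a path’ might be sequenced, the sketch below models it as a small state machine that buffers finger points until an ‘ok’ or ‘cancel’ utterance arrives. Class and method names are hypothetical, not the wrapper’s actual code.

```python
# A minimal sketch of the multi-step 'create a path' interaction: the system
# collects finger points while the command is active, then commits on 'ok'
# or discards on 'cancel'.
class PathCommand:
    def __init__(self):
        self.active = False
        self.points: list[tuple[float, float]] = []

    def on_speech(self, utterance: str):
        u = utterance.lower()
        if u == "create a path":
            self.active, self.points = True, []   # start buffering touches
        elif self.active and u == "ok":
            self.active = False
            return self.points                    # hand the finished path over
        elif self.active and u == "cancel":
            self.active = False
            self.points = []                      # discard the partial path

    def on_touch(self, x: float, y: float):
        if self.active:
            self.points.append((x, y))

cmd = PathCommand()
cmd.on_speech("Create a path")
cmd.on_touch(10, 20); cmd.on_touch(30, 40)
print(cmd.on_speech("ok"))  # -> [(10, 20), (30, 40)]
```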
4.2 Warcraft III
Warcraft III is a real-time strategy game implementing a command and control scenario over a geospatial landscape. The landscape is presented in two ways: a detailed view that can be panned, and a small inset overview. No continuous zooming features are available like those in Google Earth. Within this setting, a person can create units comprising semi-autonomous characters, and direct characters and units to perform a variety of actions (e.g., move, build, attack). While Google Earth is about navigating an extremely large and detailed map, Warcraft is about giving people the ability to manage, control and reposition different units over a geospatial area.
Table 2 shows how we mapped Warcraft III onto speech and gestures, while Fig. 3 illustrates two people interacting with it on a table. Unlike Google Earth, and again for reasons that will be discussed in §5.4, Warcraft’s speech and gesture commands are often intertwined. For example, a person may tell a unit to attack, where the object to attack can be specified before, during or even after the speech utterance.
5. ANALYSIS and GUIDELINES
From our experiences implementing multi-user multimodal wrappers for Google Earth and Warcraft III, we encountered a number of limitations that influenced our wrapper design, as outlined below. When possible, we present solutions that mitigate these limitations; these can also guide the design of future multi-user multimodal interactions built atop single user applications. This section is loosely structured as follows. The first three subsections raise issues that are primarily a consequence of constraints on how the single user application produces visual output: upright orientation, full screen views, and feedthrough. The remaining subsections are a consequence of how the single user application accepts input.
Table 2. The Speech/Gesture interface to Warcraft III

Speech commands:
  Attack / attack here [point]: selected units attack the pointed-to location

Gesture commands:
  One hand: pans the map directly