Affective Multimodal Mirror: Sensing and Eliciting Laughter

Willem A. Melder, Khiet P. Truong, David A. van Leeuwen
Department of Human Interfaces, TNO Human Factors, P.O. Box 37, 3769 ZG Soesterberg
[email protected]

Mark A. Neerincx
Department of Human Interfaces, TNO Human Factors, P.O. Box 37, 3769 ZG Soesterberg, and Faculty of EEMCS, Delft University of Technology
[email protected]

Marten den Uyl
VicarVision, Singel 160, 1015 AH Amsterdam
[email protected]

Lodewijk R. Loos
Waag Society, Nieuwmarkt 4, 1012 CR Amsterdam
[email protected]

B. Stock Plum
V2_ Lab, Eendrachtsstraat 10, 3012 XL Rotterdam
[email protected]


ABSTRACT
In this paper, we present a multimodal affective mirror that senses and elicits laughter. Currently, the mirror contains a vocal and a facial affect-sensing module, a component that fuses the output of these two modules into a user-state assessment, a user-state transition model, and a component that presents audiovisual affective feedback intended to keep or bring the user into the intended state. Interaction with this intelligent interface involves a full cyclic process of sensing, interpreting, reacting, sensing (of the reaction effects), interpreting, and so on. The intention of the mirror is to evoke positive emotions, to make people laugh, and to intensify that laughter. The first user-experience tests showed that users display cooperative behavior, resulting in mutual user-mirror action-reaction cycles. Most users enjoyed the interaction with the mirror and became immersed in the experience.

Categories and Subject Descriptors
H.1.2 [Models and Principles]: User/Machine Systems – Human factors; H.5.1 [Multimedia Information Interfaces and Presentation]: Multimedia Information Systems – Audiovisual input/output; H.5.2 [Multimedia Information Interfaces and Presentation]: User Interfaces – User-centered design; I.5.5 [Pattern Recognition]: Implementation – Interactive systems, special architecture

General Terms
Algorithms, Design, Experimentation, Human Factors.

Keywords
Multimodal laughter recognition, facial and vocal expression of emotion, affective mirror.

1. INTRODUCTION
Nowadays, computers are becoming increasingly embedded in our daily lives. This leads to a growing need for effective and natural human-computer interaction. Traditionally, a mouse or keyboard is used to communicate with a computer. For example, when giving a presentation, we have to click a button on a device in order to move to the next slide. In a more natural environment, we might want to use speech to control the computer and say “next slide” to proceed to the next slide, or make a “move on” gesture. In addition, we can use a combination of these modalities to interact with the computer: we can highlight an area on the slide during a presentation by pointing at that area and saying “highlight text”.


These types of interfaces, in which new modalities are used and combined with each other to interact with machines in a natural way, are growing in number. Multimodal interaction has become a key factor in developing new, effective forms of natural human-machine interaction. In many ways, multimodal interfaces are preferred over unimodal interfaces. Oviatt [25] describes that multimodal interfaces satisfy higher levels of user preference when interacting with these systems, probably because people experience a greater degree of flexibility, expressiveness and control when using such interfaces. The preference for multimodal interaction systems was especially apparent in a visual-spatial domain [13]. Furthermore, studies have reported enhanced performance when using multimodal instead of unimodal interfaces. For example, during a line-art drawing task, speech and mouse input improved task efficiency [17].

Multimodality has also become an important topic of interest to researchers in the field of automatic emotion recognition. An additional way to improve the effectiveness and naturalness of human-computer interaction is to enable computers to sense human affect. The emotion recognition community is increasingly focusing on fusing unimodal emotion recognition systems: recognition performance may improve if we use information from different modalities that provide complementary information about a person's emotional state. Most multimodal emotion recognition studies have focused on a fusion of vocal and facial expressions [e.g., 31, 44]: usually, recognition performance increased in the audiovisual recognition condition. Other researchers have used bio-physiological measures such as heart rate and skin response [e.g., 15, 16]. Developers should consider what type of measurements to use, because some measurements can be perceived as obtrusive.

The development of affect-sensing systems offers opportunities in different application domains. In the domain of customer-oriented service providers, such as call centers, companies are interested in using emotion recognition technology to detect angry or frustrated customers [e.g., 42, 38]. In the educational domain, emotion recognition in intelligent tutoring and e-learning systems can support the student by, for example, identifying the student's frustration level and, if necessary, adjusting the level of difficulty accordingly [e.g., 19]. The automatic identification of stress [e.g., 45, 12] or fatigue [43] can play an important role in professional domains where drivers or operators need to stay alert. Further, in surveillance environments there is a growing demand for automatic detection of aggression [6]. Finally, with emerging innovative game-play elements and “ambient intelligent” environments, the focus is on creating excellent and fun user experiences in which emotions can play an important role [1]. For example, games can be made even more interactive by adapting the game play to the player's emotional state.

Although many researchers believe that multimodal interaction can profit from affect recognition and vice versa, few efforts [21, 18, 10, 5] have been made to integrate these two technologies into a working, real-time affective multimodal interactive application. Lisetti and Nasoz [18] developed a multimodal affective user interface, MAUI, that consists of three subsystems: a visual system, a kinesthetic system and an auditory system. Based on these subsystems an affect prediction is made, and descriptive feedback is given to the user about his/her current state. Feedback is also given in the form of avatars that can mirror the user's facial expressions. Gaze-X [21] is a context-aware affective multimodal interface that can adapt itself to the user's emotional state and context in an office scenario. It uses speech, eye-gaze direction, facial expressions, keystrokes and mouse movements as input modalities. One of the major advantages of having such a real-life working application is that it can be used as a research tool to perform real-life experiments on affect or usability aspects.

1.1 Spontaneous emotions
For a long time, the use of actors has been a popular method to acquire emotion data for research on emotional expressions. Using actors is a relatively easy way to collect clean, noise-free speech signals for emotional speech analysis. However, it is unclear to what extent acted emotions are representative of real emotions and how ecologically valid it is to use actors. We can imagine that spontaneous emotions are less extreme and are expressed in more subtle ways than acted emotions, which are produced by people who were asked to act emotionally. Several studies have shown that there are indeed significant differences between acted and real emotional speech [41, 39]. For instance, differences have been found in the production and perception of emotional (visual) speech: acted emotional speech was not felt by the speakers, while it was perceived more strongly by observers than real emotional visual speech [41]. In the current study, in which we present our real-time multimodal affective application, one of the aims is to sense and evoke natural, spontaneous emotions in a real-life environment.

1.2 Affective mirror
In this paper, we describe how we developed our own affective, multimodal interactive interface that we call the “Affective Mirror” (AM). Note that the concept of this affective mirror is different from the one proposed in [29]. Picard introduced the term “affective mirror” to refer to a concept in which a machine is able to literally mirror a user's emotions, i.e., the machine first interprets the user's emotion and then reflects this interpretation back to the user (e.g., via an avatar). It gives the user visual feedback on how his/her behavior and emotional expressions are perceived by other humans. Users could learn from such reflections, for example, to prepare for a job interview. Prendinger et al. [30] developed such an “affective mirror” with emotion recognition based on physiological measures. However, these concepts and prototypes do not include real-time adaptation of the affect projection, based on the sensed emotional state, to elicit specific user experiences.

Another related concept is that of a “persuasive mirror” [2]. The persuasive mirror can monitor the behavior of a person ubiquitously and can help people to improve their lifestyle. Personal reflection (providing the user with continuous visual awareness of his/her behavior) is used to motivate people to keep up a healthy lifestyle.

Other (multimodal) virtual distortion mirrors have been developed, such as a “multimodal caricatural mirror” in which a user's emotional expressions are enlarged in the visual and auditory expressions of an avatar [24]. A “classical” virtual distortion mirror that transforms faces in order to make the viewer laugh was developed in [8]; however, this mirror does not contain emotion recognition. To our knowledge, none of these mirrors combine the technologies and concepts we are aiming at, namely 1) multimodal affect sensing, and 2) completing the cycle of sensing, interpreting, acting and sensing of the effects, comprising an interaction in which the machine is also able to influence and trigger a user's state. The intention of our AM is to elicit positive emotions, to make people laugh and to intensify that laughter. Its uniqueness lies in the fact that the AM is able to influence the user's state by first interpreting the user's emotional state and, subsequently, generating

appropriate feedback that affects the user, who, in turn, reacts to that feedback. In this way, an interactive loop between user and machine is established. The affect-sensing system is currently based on a visual subsystem and a vocal subsystem that detect smiles and laughter. The basic idea is that the AM detects what kind of state the user is in and then provides visual feedback by distorting the user's face in the mirror, just like a traditional carnival mirror would do. The more you laugh, the further you proceed through different levels of distortion. The distortions are driven by the amount and type of laughter, which can be a smile and/or vocalized laughter. The system has been realized as a Golden Demo for the Dutch BSIK project MultimediaN.

For the MultimediaN Golden Demo we chose to implement a single-user scenario, but we divided the users into two categories. The first category contains users who are cooperative and laugh easily, while users in the second category are reluctant to cooperate or do not laugh easily. In the final demo, both categories should be attracted to interact with the mirror, should get involved in ongoing interaction, and should be able to have an interesting user experience. We introduced multiple levels, as in a video game. Users can reach the next level if they show the AM a sufficient amount of laughter in the voice or the face. Reaching the next level means that the video effects change and the user can continue exploring.

This paper is organized as follows. In section 2, we describe possible interaction scenarios for the “affective mirror”. Section 3 presents the separate components that were used to create the “affective mirror”. Section 4 elaborates on how these components were combined and how these components communicate with each other. User experiences with the “affective mirror” are described in section 5. Finally, we conclude with some final remarks and future research in section 6.

2. SCENARIOS
The initial idea for the AM was to create an affective multimodal interface that could adapt itself to the user and that could change the affective state of the user. The AM would sense the user by interpreting observational user data. Affective recognition would capture laughter and affective verbal expressions in the voice, facial expressions from the frontal video stream, and affective gestures from the body-movement input channel. Fusion of the recognition subsystem results would allow us to monitor the user experience and adapt the user interface to the current user state. Further, we considered multi-user scenarios in which users would interact with each other via the mirror. We worked out several interaction scenarios to explore the possible ways in which the mirror could interact with the user(s).

In a single-user scenario (Figure 1), the user sees his image reflected in a funny way and discovers that he can influence the mirror by laughing (initially also by making gestures). Conversely, the computer tries to elicit certain behavior or expressions by applying selected layers of mirror distortion. The mirror senses changes in expression and adapts itself to obtain its goals (which can be to ‘make people laugh’).

Entering…
- User walks up to the mirror, curious, not knowing what to expect.
- Mirror notices the user's presence and tries to attract the user with intriguing sounds.
- User is intrigued by the mirror and comes closer to see what is going on.
- Mirror captures the user's face, highlights the face and hides the background.
Interaction…
- User is surprised by the adaptive mirror and waits for the mirror to change.
- Mirror classifies the facial expression ‘surprise’ and starts the visual effect ‘blow up eyes’.
- User sees his mirror image getting distorted and looks either happy or sad.
- Mirror classifies the facial expression ‘happy’ and starts the visual effect ‘raise mouth corners’.
- User notices that the mirror plays with his facial expression and starts to laugh.
- Mirror detects the laughter and starts the visual effect ‘swirl’.
- User is amazed by the interaction, does not like his face deformation and looks ‘disgusted’.
- Mirror classifies the ‘disgust’ facial expression and makes a funny ‘wobble’ sound.
Leaving…
- User loses interest and steps back from the mirror.
- Mirror loses track of the face and shows an after-image that fades away.

Figure 1: A use case for a single-user interaction scenario

An example of a multi-user scenario is that users look at each other and see the other person's image with distortion effects, which is supposed to be funny for both users. The explicit task for both users is to look serious for as long as possible, which eventually one of the two will fail to do. Additional visual feedback (assisted by aural indicators) could be used to inform the users about their status in the game.

After a session with the mirror, which generally takes a couple of minutes, the user receives a score card with laughter statistics and a verbal description, accompanied by their picture.

Figure 2: Example of a score card with statistics

3. COMPONENTS
In this section, we describe the separate subcomponents that were available for the Affective Mirror to extract information on the emotional state of the user, and the subcomponents that were used to evoke an affective response. The complete schema of subcomponents and the user/system interaction loop is visualized in Figure 3: the user's face, voice and gestures are captured by camera, microphone and accelerometer and analyzed by the FaceReader, LARS, the WordSpotter and MUSH; multimodal fusion, supported by the user-profile and mirror-profile databases, determines the user state; and audiovisual effects are rendered back to the user's eyes and ears by the Realizer.

Figure 3: Schematic view of the Affective Mirror interaction loop and subcomponents


3.1 Laughter Recognition in Speech
The vocal laughter recognizer (LARS) was based on previous research in [34]. We used meetings from the ICSI Meeting Recorder Corpus [14] to train acoustic models for laughter, speech and silence. More specifically, the audio from the close-talk recordings, in which participants wore a headset, was used for training. A set of 2680 homogeneous spontaneous laughter events (segments containing solely laughter), comprising approximately 83 minutes, was extracted from the audio. As acoustic features, RASTA-filtered Perceptual Linear Prediction (RPLP) features were extracted [32], sampled every 16 ms with an analysis window of 32 ms. Gaussian Mixture Models (GMMs) were trained with 26 RPLP features, including delta features (first-order derivatives) and a log-energy component. The TORCH3 library [7] was used for the calculation of the GMM likelihoods. For each frame, a log-likelihood ratio (LLR) is derived from the laughter, speech and silence GMMs:

LLR = LL_laughter − max(LL_speech, LL_silence).

This log-likelihood ratio is used to determine to what degree the user is laughing and is used to manipulate the affective feedback. In previous offline laughter-speech discrimination experiments, we achieved Equal Error Rates (EERs) of approximately 6% [34]. These results probably do not transfer well to real-life environments. Switching from offline, artificial laughter-speech discrimination to online, real-time laughter detection brought along several issues. We dealt with a) different acoustic environments (a mismatch between the acoustic environment in which the training data was recorded and the environment in which the detector is applied), b) the occurrence of sounds other than laughter and speech, and c) normalization of acoustic features, which is usually performed afterwards over a whole utterance. Issue (b) was tackled by adding a silence model, anticipating that laughter, speech and silence would cover the large majority of sounds encountered by the Affective Mirror. Another option would have been to perform voice activity detection to filter out non-speech and non-laughter sounds, and subsequently perform laughter detection. Issue (a) is a more general problem: there is no guarantee that acoustic models trained with data recorded in a meeting environment behave the same way when applied to data recorded under different acoustic conditions. Due to these differences in acoustic environments, calibration of, e.g., detector thresholds is needed in order to achieve optimal results. We tackled issue (c) with an online “running” normalization algorithm. Normalization was performed for each feature by x_t^norm = (x_t − μ_t)/σ_t, where μ_t and σ_t were updated online for each t according to Equation 1 (with a similar formula for σ_t):

μ_t = (1 − α)·μ_{t−1} + α·x_t     (1)

Here, we set α = 0.001 so that the running mean μ_t is updated with a relatively large contribution (1 − α) of the previous mean μ_{t−1} and a relatively small contribution (α) of the current feature value x_t. The disadvantage of such an online normalization procedure is that if a certain sound were sustained for a long time, the mean and the standard deviation would be biased towards that sound. As a consequence, the features would be normalized with distorted means and standard deviations. After a few trial sessions with users, it appeared that this was likely to occur in our case: relatively long durations of silence were observed with some users. Therefore, the updating of μ and σ was disabled each time after some period of silence. This was achieved by detecting silence frames with a relatively simple energy-based voice activity detection algorithm.
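To make the running normalization of Equation 1 and the frame-level laughter decision concrete, the sketch below shows one way they could be implemented. The class and parameter names are illustrative, and the `frame_energy`/`gmm_loglikelihoods` calls stand in for the energy-based voice activity detection and Torch3 GMM scoring described above; this is a minimal sketch, not the actual LARS code.

```python
import numpy as np

class RunningNormalizer:
    """Online per-feature normalization with exponentially decaying mean/variance (Eq. 1)."""
    def __init__(self, dim, alpha=0.001):
        self.alpha = alpha
        self.mu = np.zeros(dim)   # running mean per feature
        self.var = np.ones(dim)   # running variance per feature

    def update(self, x, is_silence):
        # Freeze the statistics during silence so long pauses do not bias mu and sigma.
        if not is_silence:
            self.mu = (1.0 - self.alpha) * self.mu + self.alpha * x
            self.var = (1.0 - self.alpha) * self.var + self.alpha * (x - self.mu) ** 2
        return (x - self.mu) / np.sqrt(self.var + 1e-8)

def laughter_llr(ll_laughter, ll_speech, ll_silence):
    """Frame-level log-likelihood ratio: LLR = LL_laughter - max(LL_speech, LL_silence)."""
    return ll_laughter - max(ll_speech, ll_silence)

# Hypothetical usage with 26-dimensional RPLP feature frames at a 16 ms hop:
# norm = RunningNormalizer(dim=26)
# for frame in rplp_frames:
#     silent = frame_energy(frame) < energy_threshold      # assumed energy-based VAD
#     z = norm.update(frame, silent)
#     llr = laughter_llr(*gmm_loglikelihoods(z))            # assumed GMM scoring (e.g., Torch3)
```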

3.2 Facial Expression Recognition
We used the FaceReader (FR) developed by VicarVision [9] to analyze facial expressions and smiles. FR classifies facial expressions in three steps. First, an accurate position of the face is found using the Active Template Method [33]. Next, using an Active Appearance Model (AAM) trained on a set of annotated images, an artificial face model is synthesized that describes both the locations of key points in the face and the texture of the face. An “appearance vector” can then describe new face models as deviations from the “mean face”. In this way, the AAM can model individual facial variations as well as variations related to pose/orientation, lighting and facial expressions. Finally, in order to classify emotional expressions, a neural network is trained with the appearance vector as input and the seven basic emotions (happy, angry, sad, surprised, scared, disgusted, neutral) as output [37]. In the most recent version (version 1.0), FR can also output a value on the valence scale (positive-negative). The likelihoods on the valence scale and of the emotion categories are used in the AM to determine whether a person is smiling. As training data for the neural network, the “Karolinska Directed Emotional Faces” set [20] was used, which contains 980 high-quality facial images with posed emotions. The total classification accuracy on this set of images is around 89%. It is questionable whether this high performance can also be achieved with spontaneous emotional expressions in real-life situations such as ours. Image quality may also influence classification performance.

3.3 Word Spotting
The TNO speech recognition system SpeechMill was modified to function as a keyword spotter. The core recognition engine is based on the Sonic speech recognizer of the University of Colorado [28]. Keyword spotting is a task in which the audio channel is monitored for the utterance of target words from a word list. If a target word is spoken in the audio input channel, it can be spotted or missed; if no keyword is uttered, there cannot be a miss, but a false alarm can occur. The goal is to maximize the number of keywords spotted correctly and to minimize the number of false alarms. Word spotting is realized by configuring a finite state network with a top-level grammar rule defining: all phonemes recursively, a list of words, and again all phonemes recursively. In combination with an affective word list, TNO is constructing an Affective Word Spotter (AWS). The word list contains items selected from the Affective Norms for English Words [4], using the affective ratings as the selection criterion. The AWS listens for affective verbal expressions in the user's vocal utterances. The valence, arousal and dominance listed for each word can be used to determine the user state in multimodal affect analysis.
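As an illustration of how spotted keywords could feed into the user-state assessment, the fragment below maps hypothesized keywords to valence/arousal/dominance ratings from an ANEW-style word list. The ratings shown are placeholders rather than the published ANEW values, and the spotter interface is assumed.

```python
# Minimal sketch: turning spotted affective keywords into valence/arousal/dominance evidence.
# The ratings below are illustrative placeholders, not the published ANEW norms.
AFFECTIVE_LEXICON = {
    "great":  {"valence": 0.9, "arousal": 0.6, "dominance": 0.6},
    "stupid": {"valence": 0.2, "arousal": 0.6, "dominance": 0.4},
    "funny":  {"valence": 0.8, "arousal": 0.7, "dominance": 0.5},
}

def affect_from_keywords(spotted_words):
    """Average the ratings of all spotted keywords into one affect estimate."""
    hits = [AFFECTIVE_LEXICON[w] for w in spotted_words if w in AFFECTIVE_LEXICON]
    if not hits:
        return None  # no affective evidence from the word spotter
    return {dim: sum(h[dim] for h in hits) / len(hits)
            for dim in ("valence", "arousal", "dominance")}

# e.g. affect_from_keywords(["funny", "great"]) -> high-valence evidence for the fusion module
```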

3.4 Gesture Recognition
One of the project partners in MultimediaN, V2_, developed a gesture recognition interface called MUSH (prototype version 3) [36]. The hand-held gesture capture device contains two accelerometers that measure acceleration in three directions. A continuous stream of accelerometer readings is sent to a USB receiver that communicates with a USB host supporting the USB HID (Human Interface Device) protocol. The MUSH gesture interface software processes the accelerometer data from the hand-held device in three layers. First, the Vector Pre-Processing layer receives the accelerometer data and converts it into a 3D vector; further processing of this vector yields four different vectors for the next layers of the application. Second, the Gesture Recognition layer employs a neural network to analyze and classify the development of the motion vector over time. Finally, both the output parameters of the neural network and the vectors from the vector layer (orientation, velocity and position) can be used for gesture control of an application.

A gesture can be considered as the development of motion over time. In an attempt to use MUSH in multimodal expression analysis, we reasoned that if MUSH can distinguish gestures, it can also distinguish the same gesture expressed in different ways. The assumption we make is that when people are happy or excited they are likely to make more expressive movements (increased speed and bigger size); on the contrary, when they are sad they are inhibited, not lively, and do not express much body movement. The pairing of body motion and affective states led us to set up a collection of movements, each expressed in several ways. Each affective gesture in the database represents a particular gesture joined with an affective user state, based on the way the gesture is expressed. After training the MUSH neural network with the examples from the affective gesture database, not only are gestures recognized, but the way a gesture is expressed is captured as well.
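The expressiveness assumption above (happier users move faster and larger) could be approximated from the raw accelerometer stream roughly as sketched below; the windowing and the two expressiveness measures are illustrative choices, not the actual MUSH feature set.

```python
import numpy as np

def expressiveness(accel_window):
    """Crude expressiveness measures over a window of 3D accelerometer samples.

    accel_window: array of shape (n_samples, 3) with acceleration in x, y, z.
    Returns mean motion energy (a speed proxy) and motion range (a size proxy).
    """
    magnitudes = np.linalg.norm(accel_window, axis=1)    # per-sample acceleration magnitude
    energy = float(np.mean(magnitudes ** 2))             # proxy for the vigour/speed of the gesture
    spread = float(np.ptp(accel_window, axis=0).mean())  # proxy for the spatial size of the gesture
    return energy, spread

# A happy or excited gesture is expected to score higher on both measures than a subdued one;
# these values could accompany the recognized gesture label as extra input to the fusion module.
```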

3.5 Multimodal Fusion
In multimodal expression analysis, several classification systems are each trained to work with one modality. When a hypothesis on the user state is needed, the results of the single-modality subsystems have to be combined [35]. In order to derive the user-state hypothesis we therefore need a module that performs multimodal fusion (MMF).

In fusing the results of the subsystems, three problems came up. First, subsystems deliver results at different frequencies. Typically, the FR facial expression likelihoods are generated at a frequency limited by CPU performance (about 7 times per second), while the LARS likelihood ratios are calculated for every 16-millisecond frame (about 60 times per second). For word spotting, an update frequency cannot even be determined, because it depends on how often a word from the list is spoken, if a word from the list is uttered at all. Second, multiple recognition systems can deliver results asynchronously: the MMF is notified whenever a subsystem has finished processing the user data and a result is available. Either the MMF updates its hypothesis at a predefined frequency, using the data available at the time of the update, or the MMF updates its hypothesis whenever new data arrives. Third, due to environmental conditions a subsystem might be less reliable or fail to produce results.

In [26] a distinction between feature-level fusion and decision-level fusion is described. In feature-level fusion, the likelihoods resulting from the subsystems are combined into a joint feature space, resulting in a (higher-level) feature vector. To determine proper parameters for a classifier in the joint feature domain, a rich corpus of meaningful, representative, multimodally annotated affective observations should be collected. Using corpus-based training methods, a classifier can then be trained that performs the fusion implicitly. One disadvantage of this approach is that omission of sensor data or subsystem results would degrade fusion performance. On the other hand, if the corpus provided a large set of equally distributed incomplete feature vectors, the classifier would again be robust against a missing modality or corrupted user data.

In decision-level fusion, the user state is determined after the subsystems have taken a decision for their single modality. Depending on the context, task, individual differences and environmental conditions, the subsystems are likely to be attributed different weights. After attributing weights to the sub-level decisions, they are combined to produce a user-state hypothesis. To determine the parameters at the decision level, a more explicit validation of subsystem results and threshold values is needed, which can only be done properly using a rich corpus of example decisions. For the AM we created the laughter score to implement decision-level fusion.

As in a video game, we introduced a score that expresses the laughter on a scale from zero to 500. Five levels were defined in the AM, so 100 points are needed to get to the next level. The occurrence of laughter adds a portion of a laughter block (defined as 10 points) to the score, while points are continuously subtracted from the score as time passes. We used the following functions to add vocal and facial laughter to the score:

V(t) = 1 if LARS_LLR > τ_LLR, and 0 if LARS_LLR ≤ τ_LLR,

where LARS_LLR is the LARS log-likelihood ratio and τ_LLR is the threshold value for laughter, set to 0.5 here;

F(t) = 1 if FR_valence > τ_valence, and 0 if FR_valence ≤ τ_valence,

where FR_valence is the FR valence value and the threshold value for laughter, τ_valence, is set to 0.25 in this study.

The score is updated five times per second, corresponding to the update frequencies of the subsystems. Audio samples were captured at a steady pace of one frame per 16 milliseconds, and therefore we used the vocal updates as the trigger for updating the laughter scale. A random binomial quantity D(t) ∈ {0, 1} was introduced to create some interesting, unpredictable mirror behavior; we set the probability of success (D(t) = 1) to p = 1/15. The decay factor can be adjusted, which offers the possibility to make the AM more susceptible to laughter (lower values for the decay factor). In so far as an asynchronous, event-driven update process can be described by a linear function, the following describes the score function:

S(t) = S(t−1) + (α·V(t) + β·F(t) − γ·D(t))·B,



where α is a weight factor for the vocal results, β is a weight factor for the facial analysis results, γ is the weight factor for score decay, and B denotes the size of a block of points for the score. We used α = 0.7, β = 0.5, γ = 0.1 and B = 10. The score card that users received after a session showed the percentage of laughter in the voice versus facial laughter. This ratio was determined using the total number of points that someone scored with the voice versus the total number scored with the face.
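Putting the decision-level fusion together, a minimal sketch of the score update described above might look as follows, using the thresholds and weights reported in this section; the update loop and the clamping of the score to the 0-500 range are assumptions about details the text leaves open.

```python
import random

# Thresholds and weights as reported in Section 3.5 (clamping of the score is assumed).
TAU_LLR, TAU_VALENCE = 0.5, 0.25
ALPHA, BETA, GAMMA, BLOCK = 0.7, 0.5, 0.1, 10
P_DECAY = 1.0 / 15.0
MAX_SCORE = 500

def update_score(score, lars_llr, fr_valence):
    """One decision-level fusion step of the laughter score S(t)."""
    v = 1 if lars_llr > TAU_LLR else 0          # vocal laughter decision V(t)
    f = 1 if fr_valence > TAU_VALENCE else 0    # facial laughter decision F(t)
    d = 1 if random.random() < P_DECAY else 0   # random decay quantity D(t)
    score += (ALPHA * v + BETA * f - GAMMA * d) * BLOCK
    return max(0, min(MAX_SCORE, score))        # keep the score within 0..500

def level(score):
    """Five game levels of 100 points each."""
    return min(4, int(score // 100))

# Hypothetical use, triggered by each vocal update (about five score updates per second):
# score = update_score(score, lars_llr=current_llr, fr_valence=current_valence)
```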

3.6 Audiovisual Affective Feedback
Video and audio effects are added to the user's mirror image to get the user's attention and to make the person laugh. The real-time graphical filters, i.e., the geometric image deformations of the user's face, are created with the Realizer, an audio/video rendering engine that is part of the KeyWorx platform. Waag Society recently released KeyWorx, a platform to invent, develop, integrate and deploy applications with multi-user/multimedia features, under an open source license [40].

The Realizer consists of a number of modules that can each perform a specific task. It is up to the user to make a patch (a set of modules and connections between the modules). For example, in order to see the mirror image of a person, we have to take the camera output, connect it to a mirror module to flip left/right, and then connect the mirror output to a module that displays the image on the screen. Realizer modules accept three types of signals: “video”, “audio” and “abstract”. An abstract signal is a stream of numbers that can be interpreted in various ways, for example as values to control a parameter or as a string of text. Using output parameters of one type of media to control input parameters of another type of media is called cross-media synthesis.

The Realizer functions as a server that hosts the client application, a patcher, which controls it. Controlling the Realizer basically means telling it which modules to enable/disable and connect, and what the configuration for these modules must be. These commands are sent via XML over TCP. Since the AM controls the Realizer, it acts as a patcher. The XML protocol used for the communication between the patcher and the Realizer consists of messages that invoke functions of the Realizer. A full listing of the XML protocol can be found in the Realizer protocol section of the Realizer white paper [40].

In an attempt to realize interesting mirror behavior, we created video effect adaptation that depends on a) time (time elapsed in the current user state, time elapsed in the current level and time elapsed without laughter are effect modifiers), b) valence (the value from the facial subsystem), c) the laughter score, and d) the random quantity D.
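As a rough illustration of such effect adaptation, the sketch below picks distortion parameters from the four inputs listed above; the concrete mapping (which effect, which scaling) is invented for illustration and is not the actual Realizer patch used in the AM.

```python
import random

def choose_effect(time_in_state, time_without_laughter, valence, score, p_random=1/15):
    """Illustrative mapping from the four adaptation inputs to a video-effect setting."""
    d = random.random() < p_random                      # random quantity D for unpredictability
    effect = "twirl" if valence > 0.25 else "bump"      # smiling users get a different distortion
    strength = min(1.0, score / 500.0)                  # stronger distortion at higher laughter scores
    if time_without_laughter > 10.0 or d:               # shake things up if nothing happens
        effect = random.choice(["twirl", "bump", "ripple"])
    speed = 0.5 + 0.5 * min(time_in_state, 5.0) / 5.0   # slowly increase movement within a state
    return {"effect": effect, "strength": strength, "speed": speed}
```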

4. SYSTEM ARCHITECTURE
As in any human-centered software development project, a collection of user requirements prioritized as ‘must haves’, ‘nice to haves’ and ‘future issues’ formed the basis of the AM. The user requirements showed that we needed an interface with a half-silvered mirror, hiding the camera from the user's view. This way, the user sees the reflection of his face and does not see the camera hidden behind his own (distorted) mirror image.

EyeCatcher
We acquired an EyeCatcher (EC) from the Dutch company Ex’ovision [11]. The EC is a device that provides participants in a video conference call with a truly eye-to-eye experience. Just as humans need proper image resolution and a good viewing angle on the other party, so does FR. FR is trained on a database of frontal faces photographed under good lighting conditions, and the required conditions for FR are even stricter than those for humans. Directional light from a window, lack of illumination of the face, and faces turned away by more than 15 degrees result in image quality that is too poor for FR; if image quality is not excellent, face finding, fitting and classification are not possible. Creating optimal conditions is somewhat artificial when implementing a real-world application, but other options were not available to us. Thus, we needed a device like the EC because it provides the AM with a frontal video stream, without the bias of a camera standing next to or on top of the screen. Not only does FR perform better, but user behavior is also more spontaneous, because the user has a more realistic ‘mirror’ experience by looking himself straight in the eyes, without the sense of being watched that a visible camera causes.

Hardware/software required
Both the Realizer and MUSH require Apple Mac OS X. The Realizer includes audio/video effects written by Waag Society and additionally uses Apple Core Image video effects. We used an iMac with a 1.83 GHz Intel Core Duo processor, 2 GB of RAM and an ATI X1600 graphics adapter. The subsystem FR required Windows XP with Service Pack 2, including the .NET framework. The modules LARS and MMF were implemented on a Win32 architecture, using the multimedia audio drivers. Initially we tried to run XP within a VMware Fusion Beta (VMFB) virtual machine. Unfortunately, the performance of this VMFB version was too low for real-time operation and it did not support USB 2.0, which is necessary to guarantee the data throughput needed for an uncompressed video stream (at a resolution of 640x480 and 30 fps). The attempt to run every component on one (physical) machine failed, and we moved on to a PC running XP with a 2 GHz Intel Core Duo processor and 1 GB of RAM.

Video input
The video signal for both the analysis and the mirror image came from an analogue surveillance camera that is part of the EC and that (with some effort) can be reached behind the half-silvered mirror. The camera outputs a y/c (luminance and chrominance) or s-video signal, which we split with an Extron Electronics s-video splitter (MDA3). One s-video output was available for the facial expression recognition (SVA) and one was used as the mirror image that the user actually sees (SVB). The SVA output was connected to a Digitus USB video grabber (DA-VC211) that provided digital video streaming to the laptop. The second output, SVB, was connected to an Imaging Source video-to-FireWire converter (DFG/1394-1e) that provided the iMac with a digital video stream.

Audio input
The audio signal came from a commercially available desktop microphone that was connected to the microphone input of a PCMCIA Creative Labs SoundBlaster Audigy 2 ZS. The microphone needs a voltage to charge its membrane. When this was provided by the sound card, the channel noise seriously degraded the signal-to-noise ratio and the audio was not fit for energy-based speech activity detection. This resulted in poor online feature normalization, which subsequently disrupts the laughter recognition. To fix this problem we used an external battery to power the microphone. In the future we would like to use a more directional acoustic capturing device, such as a microphone array, to provide the human affect recognizer with an enhanced speech signal.
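For reference, a very simple energy-based voice activity decision of the kind relied on here could look like the following; the frame length would match the 16 ms hop used by LARS, while the threshold adaptation is an illustrative choice rather than the scheme actually used.

```python
import numpy as np

def is_speech_frame(frame, noise_floor, margin_db=9.0):
    """Energy-based VAD decision for one audio frame (e.g., 16 ms of samples).

    noise_floor: running estimate of the background energy in dB.
    Returns (is_active, updated_noise_floor).
    """
    energy_db = 10.0 * np.log10(np.mean(frame.astype(float) ** 2) + 1e-12)
    active = energy_db > noise_floor + margin_db
    if not active:
        # Track the noise floor slowly during inactive frames (illustrative smoothing).
        noise_floor = 0.99 * noise_floor + 0.01 * energy_db
    return active, noise_floor
```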


Video effects
To create the distortions we used bump distortion, twirl distortion and a rippler. We used oscillator modules to smoothly modify effect parameters such as scale, angle or radius. Two phase-shifted sinusoid inputs are used to change the x- and y-coordinates of the video effect centers. The center's motion on the display is a Lissajous curve, the graph of the system of parametric equations

x = A·sin(a·t), y = B·sin(b·t),

where t is the elapsed time in the current state. The appearance of the figure depends strongly on the ratio a/b (we used a ratio of 4/5). A and B limit the area of the screen where the center can be positioned. In one level we used the FR valence to modify the angle of the twirl distortion directly: a smile resulted in a twist to the left, whereas an angry face twisted the face to the right.
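The effect-center motion can be reproduced directly from the parametric equations above; the sketch below assumes the 4/5 frequency ratio mentioned in the text and treats the amplitudes and base frequency as free parameters.

```python
import math

def effect_center(t, width=640, height=480, a=4.0, b=5.0, speed=0.5):
    """Lissajous position for the video-effect center at elapsed time t (seconds).

    a/b = 4/5 as in the text; the amplitudes and the base angular frequency `speed`
    are assumed choices that simply keep the center on screen.
    """
    A, B = 0.4 * width, 0.4 * height              # keep the center away from the borders
    x = width / 2 + A * math.sin(a * speed * t)   # x = A sin(a t)
    y = height / 2 + B * math.sin(b * speed * t)  # y = B sin(b t)
    return x, y
```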

User State Model
To provide the mirror with an appropriate model of its user, we created a simple User State Model (USM) that represents the current affective state and contains a history of previous states. Similarly, we defined a Mirror State Model (MSM) to model the mirror ‘character’. The state models allow a state change only when the appropriate affective user data has been present for a certain period of time. We introduced these time delays for state transitions to avoid overly quick user-state transitions. For the AM we used a three-state model with the state transition time delays shown in Table 1.

Table 1: State transition time delays (in sec)

from ↓ / to →   Absent   Neutral   Laughing
Absent          -        1.0       1.0
Neutral         2.0      -         0.25
Laughing        2.0      0.5       -
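A minimal sketch of how these transition delays could gate state changes is given below; only the delay values come from Table 1, while the observation labels and the polling interface are assumptions.

```python
import time

# Transition delays (seconds) from Table 1: (current_state, observed_state) -> required duration.
DELAYS = {
    ("absent", "neutral"): 1.0, ("absent", "laughing"): 1.0,
    ("neutral", "absent"): 2.0, ("neutral", "laughing"): 0.25,
    ("laughing", "absent"): 2.0, ("laughing", "neutral"): 0.5,
}

class UserStateModel:
    """Three-state USM that only switches after the observed state has persisted long enough."""
    def __init__(self):
        self.state = "absent"
        self._candidate = None
        self._since = 0.0

    def update(self, observed, now=None):
        now = time.monotonic() if now is None else now
        if observed == self.state:
            self._candidate = None                            # nothing to change
        elif observed != self._candidate:
            self._candidate, self._since = observed, now      # start timing the candidate state
        elif now - self._since >= DELAYS[(self.state, observed)]:
            self.state = observed                             # candidate persisted: commit transition
            self._candidate = None
        return self.state
```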

Figure 4: User state model with three states (Absent, Neutral, Laughing)

State history has not been used yet, but it is considered a valuable source of information for further extension of the USM. More research needs to be done on the variability of state transitions, so that prior behavior can be used to predict and classify subsequent behavior.

5. USER EXPERIENCES
A so-called Situated Cognitive Engineering+ (SCE+) method was applied to realize and refine the intended user experiences [23]. According to this method, both a user and a technology perspective are needed for the development of adaptive user interfaces. First, the definition of a technological design space sets a focus in the process of specifying and generating user interface concepts. Second, the reciprocal effects of the adaptive behavior of human and technology are made explicit and are integrated into the development process. In early design phases, these reciprocal effects should be evaluated with prototypes that show realistic adaptive behavior. The first evaluations test whether the general concept is valid and can react appropriately to the dynamic behavior of the user, and serve to set the parameters of the adaptation mechanisms of the components (e.g., to establish stable reactions). Incremental development allows for adding more and more intelligence to the user interface, taking into account the evaluation results of the first, simpler versions. Eventually, the process of iteration stops when the evaluation shows that the adaptive user interface evokes the intended user experiences for the target group(s) and usage contexts.

During the development, users interacted with test versions as soon as an interactive prototype was available. These tests were used to refine the overall scenario (i.e., the multi-level game play), to set the parameters of the components, to learn the constraints of the technology in “real-world environments” (e.g., lighting conditions, background noise), and to check whether the mirror evoked the intended reactions in such environments. The first users tended to be receptive to changes in the interface: when shown a static video effect they were surprised at first, but they quickly habituated to the image deformation. More variability was therefore added to enlarge the surprising effects of the interaction. After these tests, we released the first version of the Affective Mirror. This version is a stand-alone attraction that stood at two events in the Netherlands: the open day at TNO's 75th anniversary celebration in Soesterberg, and the MultimediaN Golden Pavilion in the Nemo Science Center in Amsterdam. Taken together, a large diversity of visitors experienced the Affective Mirror, such as playful children, curious parents, interested exhibitors, and serious scientists.

Figure 5: Some video effects result in hilarious deformations


Some visitors started to laugh very quickly and went through the game levels fluently. Other visitors were more sensitive to the way their behavior influenced the mirror's behavior. This resulted in cooperative user-mirror behavior to produce funny distorted faces: reciprocal user-mirror action-reaction cycles, in which the user expresses weird facial and vocal behavior and the mirror provides different dynamic image-morphing effects. For most visitors, the session was successful: they started to laugh, passed several laughter levels in the game, and received a final assessment as shown on the score card in Figure 2.

6. CONCLUSIONS AND DISCUSSION
The Affective Mirror tracks the user's emotional state, determines the desirability of this state, and supports the occurrence of desirable emotions. In this way, an affective human-machine interaction was realized, comprising reciprocal human-machine perceive-act cycles via intuitive visual and auditory dialogues. As mentioned in the introduction, we do not know of other applications that encompass this complete cycle in the real world. Although the performance of each individual component is not yet optimal, their combination provides a stable, adaptive user interface that evokes the intended user behavior and experiences. It is interesting to note that assessing the type of laugh can provide insight into its likely effect on human responses; for example, voiced laughs tend to elicit more positive responses than unvoiced ones [3].

With respect to the specific problems encountered during the “real world” tests, it proved hard to tune the acoustic models to the acoustic environment of the specific real-world events. The models were trained on close-talk microphone recordings in a relatively noise-free environment, while the events had a completely different acoustic environment, in which background noise is likely to occur and the microphone is further away from the user's mouth (it is attached to the mirror itself rather than to the user). The consequence is that a lot of tuning is required and it is difficult to set the detection thresholds that determine whether an incoming sound is laughter or not. Second, optimal lighting conditions were hard to meet for the FaceReader.

Future research will center on two objectives. First, we will improve the laughter-sensing and -evoking mirror by reducing some of the technical shortcomings mentioned above, adding other modalities, incorporating user models, improving the fusion, and conducting user tests. Second, we will apply the same approach to a different set of emotions, including negative emotions [35, 22]. Detecting ‘simple’, striking negative emotions (e.g., ‘panic’), even if some nuances are missed, can be of high practical value in high-demand task domains [12].

7. ACKNOWLEDGMENTS
This study is supported by the Dutch BSIK project MultimediaN (http://www.multimedian.nl).

8. REFERENCES
[1] Aarts, E. Ambient intelligence: a multimedia perspective. IEEE Multimedia 11 (1), 2004, 12-19.
[2] Andrés del Valle, A.C. and Opalach, A. The persuasive mirror. In Proceedings of Persuasive, 2006.
[3] Bachorowski, J.A. and Owren, M.J. Not all laughs are alike: voiced but not unvoiced laughter readily elicits positive affect. Psychological Science 12 (3), 2001, 252-257.
[4] Bradley, M.M. and Lang, P.J. Affective norms for English words (ANEW). The NIMH Center for the Study of Emotion and Attention, University of Florida, Gainesville, FL.
[5] Brøndsted, T., Nielsen, T.D., and Ortega, S. Affective multimodal interaction with a 3D agent. In Proceedings of the 8th International Workshop on the Cognitive Science of Natural Language Processing, 1999, 102-109.
[6] Clavel, C., Ehrette, T., and Richard, G. Events detection for an audio-based surveillance system. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME 2005), 2005, 1306-1309.
[7] Collobert, R., Bengio, S., and Mariéthoz, J. Torch: a modular machine learning software library. Technical Report IDIAP-RR 02-46, IDIAP, 2002.
[8] Darrell, T., Gordon, G., Woodfill, J., and Harville, M. A virtual mirror interface using real-time robust face tracking. In Proceedings of the 3rd International Conference on Face and Gesture Recognition, 1998.
[9] Den Uyl, M.J. and Van Kuilenburg, H. The FaceReader: online facial expression recognition. In Proceedings of Measuring Behavior, 2005.
[10] Duric, Z., Gray, W.D., Heishman, R., Li, F., Rosenfeld, A., Schoelles, M.J., Schunn, C., and Wechsler, H. Integrating perceptual and cognitive modeling for adaptive and intelligent human-computer interaction. Proceedings of the IEEE 90 (7), 2002, 1272-1289.
[11] Ex’ovision EyeCatcher: http://www.exovision.nl/download/EyeCatcher-leaflet.pdf
[12] Grootjen, M., Neerincx, M.A., Weert, J.C.M., and Truong, K.P. Measuring cognitive task load on a naval ship: implications of a real world environment. In Proceedings of the International Conference on Human-Computer Interaction (HCII'07), 2007.
[13] Hauptmann, A.G. Speech and gestures for graphic image manipulation. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI'89), 1989, 241-245.
[14] Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Peskin, B., Pfau, T., Shriberg, E., Stolcke, A., and Wooters, C. The ICSI meeting corpus. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'03), 2003, 364-367.
[15] Kapoor, A. and Picard, R.W. Multimodal affect recognition in learning environments. In Proceedings of the 13th Annual ACM International Conference on Multimedia, 2005, 677-682.
[16] Kim, J., André, E., Rehm, M., Vogt, T., and Wagner, J. Integrating information from speech and physiological signals to achieve emotional sensitivity. In Proceedings of Interspeech, 2005.
[17] Leatherby, J.H. and Pausch, R. Voice input as a replacement for keyboard accelerators in a mouse-based graphical editor: an empirical study. Journal of the American Voice Input/Output Society 11 (2), 1992.


[18] Lisetti, C.L. and Nasoz, F. MAUI: a multimodal affective user interface. In Proceedings of ACM Multimedia, 2002, 161-170.
[19] Litman, D.J. and Forbes-Riley, K. Recognizing student emotions and attitudes on the basis of utterances in spoken tutoring dialogues with both human and computer tutors. Speech Communication 48 (5), 2006, 559-590.
[20] Lundqvist, D., Flykt, A., and Öhman, A. The Karolinska Directed Emotional Faces – KDEF. Department of Clinical Neuroscience, Psychology Section, Karolinska Institute.
[21] Maat, L. and Pantic, M. Gaze-X: adaptive affective multimodal interface for single-user office scenarios. In Artificial Intelligence for Human Computing, Vol. 4451, 2007, 251-271.
[22] Merkx, P.A.B., Truong, K.P., and Neerincx, M.A. Inducing and measuring emotion through a multiplayer first-person shooter computer game. In: H.J. van den Herik, J.W.H.M. Uiterwijk, M.H.M. Winands, and M.P.D. Schadd (Eds.), Proceedings of the Computer Games Workshop 2007, Amsterdam, The Netherlands, 2007.
[23] Neerincx, M.A. and Lindenberg, J. Situated cognitive engineering for complex task environments. In: Schraagen, J.M. (Ed.), Naturalistic Decision Making and Macrocognition. Ashgate.
[24] Olivier, M., Benoit, M., Irène, K., Arman, S., and Jordi, A. Multimodal caricatural mirror. In Proceedings of eNTERFACE'05, 1st Summer Workshop on Multimodal Interfaces, 2005, 13-20.
[25] Oviatt, S. User-centered modeling and evaluation of multimodal interfaces. Proceedings of the IEEE 91 (9), 2003, 1457-1468.
[26] Pantic, M. and Rothkrantz, L.J.M. Towards an affect-sensitive multimodal human-computer interaction. Proceedings of the IEEE 91 (9), 2003.
[27] Pantic, M., Sebe, N., Cohn, J.F., and Huang, T. Affective multimodal human-computer interaction. In Proceedings of the 13th Annual ACM International Conference on Multimedia, 2005.
[28] Pellom, B. SONIC: the University of Colorado continuous speech recognizer. Technical Report TR-CSLR-2001-01, University of Colorado, Boulder, CO, March 2001.
[29] Picard, R.W. Affective Computing. MIT Press, Cambridge, MA, 1997.
[30] Prendinger, H., Mayer, S., Mori, J., and Ishizuka, M. Persona effect revisited: using bio-signals to measure and reflect the impact of character-based interfaces. In Proceedings of the 4th International Workshop on Intelligent Virtual Agents (IVA'03), 2003.
[31] Sebe, N., Cohen, I., Gevers, T., and Huang, T. Emotion recognition based on joint visual and audio cues. In Proceedings of the International Conference on Pattern Recognition (ICPR 2006), 2006, 1136-1139.
[32] SPRACHcore. http://www.icsi.berkeley.edu/~dpwe/projects/sprach/
[33] Sung, K.K. and Poggio, T. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1), 1998, 39-51.
[34] Truong, K.P. and Van Leeuwen, D.A. Automatic discrimination between laughter and speech. Speech Communication 49 (2), 2007, 144-158.
[35] Truong, K.P., Van Leeuwen, D.A., and Neerincx, M.A. Unobtrusive multimodal emotion detection in adaptive interfaces: speech and facial expressions. In: D.D. Schmorrow and L.M. Reeves (Eds.), Foundations of Augmented Cognition, 3rd Edition, LNAI 4565, ISBN 978-3-540-73215-0, in press.
[36] V2_ Institute for the Unstable Media. MUSH device v3: http://multimedian.v2.nl/mush/mushv3_documentation.pdf
[37] Van Kuilenburg, H., Wiering, M., and Den Uyl, M. A model based method for automatic facial expression recognition. In Proceedings of the European Conference on Machine Learning (ECML'05), 2005, 194-205.
[38] Vidrascu, L. and Devillers, L. Detection of real-life emotions in dialogs recorded in a call center. In Proceedings of Interspeech, 2005, 1841-1844.
[39] Vogt, T. and André, E. Comparing feature sets for acted and spontaneous speech in view of automatic emotion recognition. In Proceedings of the IEEE International Conference on Multimedia & Expo (ICME'05), 2005.
[40] Waag Society. KeyWorx: http://www.keyworx.org; Realizer: http://kwlive.dev.waag.org/realizer/doc/doxygen/html/
[41] Wilting, J., Krahmer, E., and Swerts, M. Real vs. acted emotional speech. In Proceedings of Interspeech, 2006.
[42] Yacoub, S., Simske, S., Lin, X., and Burns, J. Recognition of emotions in interactive voice response systems. In Proceedings of Eurospeech, 2003, 729-732.
[43] Yang, G., Lin, Y., and Bhattacharya, P. A driver fatigue recognition model using fusion of multiple features. In Proceedings of IEEE SMC, Vol. 2, 2005, 1777-1784.
[44] Zeng, Z., Hu, Y., Fu, Y., Huang, T.S., Roisman, G.I., and Wen, Z. Audio-visual emotion recognition in adult attachment interview. In Proceedings of the 8th International Conference on Multimodal Interfaces (ICMI'06), 2006, 139-145.
[45] Zhai, J. and Barreto, A. Stress recognition using non-invasive technology. In Proceedings of the FLAIRS Conference, 2006, 395-401.
