
Multimodal Human Computer Interaction

New Interaction Techniques 22.1.2001

Roope Raisamo ([email protected])

Department of Computer and Information Sciences University of Tampere, Finland

Multimodal human-computer interaction

A definition [Raisamo, 1999e, p. 2]: ”Multimodal interfaces combine many simultaneous input modalities and may present the information using synergistic representation of many different output modalities”

Multimodal interaction techniques

Our definition of an interaction technique [Raisamo, 2000]:
• An interaction technique is a way to carry out an interactive task. It is defined at the binding, sequencing, and functional levels, and is based on using a set of input and output devices or technologies.

– In a multimodal interaction technique, more than one input or output is used for the same task.
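The definition can be made concrete with a minimal sketch. This is only an illustration of the definition above, not code from [Raisamo, 2000], and every name in it is hypothetical:

    from dataclasses import dataclass

    @dataclass
    class InteractionTechnique:
        task: str                     # the interactive task being carried out
        input_devices: list[str]      # input devices or technologies bound to the task
        output_devices: list[str]     # output devices or technologies bound to the task

        def is_multimodal(self) -> bool:
            # more than one input or output is used for the same task
            return len(self.input_devices) > 1 or len(self.output_devices) > 1

    # Example: selecting a map area with touch gestures plus a spoken command
    select_area = InteractionTechnique(
        task="select map area",
        input_devices=["touchscreen", "speech recognizer"],
        output_devices=["graphical display"],
    )
    print(select_area.is_multimodal())   # True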

Two views

• A Human-Centered View
– common in psychology
– often considers human input channels, i.e., computer output modalities, most often vision and hearing
– applications: a talking head, audio-visual speech recognition, ...

• A System-Centered View
– common in computer science
– a way to make computer systems more adaptable

Multimodal human computer interaction

[Figure: the interaction information flow between human and computer. Human cognition drives the human output channels, which feed the computer input modalities; the computer ”cognition” produces output through the computer output media, which reach the human input channels and close the loop back to human cognition. The figure also marks the intrinsic perception/action loop.]

Senses and modalities

Sensory perception    Sense organ             Modality
Sense of sight        Eyes                    Visual
Sense of hearing      Ears                    Auditive
Sense of touch        Skin                    Tactile
Sense of smell        Nose                    Olfactory
Sense of taste        Tongue                  Gustatory
Sense of balance      Organ of equilibrium    Vestibular

[Silbernagel, 1979]

Design space for multimodal user interfaces

                   Use of modalities
                   Sequential       Parallel
Combined           ALTERNATE        SYNERGISTIC
Independent        EXCLUSIVE        CONCURRENT

(Each cell is further divided along the levels of abstraction: meaning / no meaning.)

[Nigay and Coutaz, 1993]
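A small sketch (my own, not from [Nigay and Coutaz, 1993]) that names the four classes of the design space from the two dimensions shown above; the classify function is a hypothetical illustration:

    def classify(use_of_modalities, combination):
        """use_of_modalities: 'sequential' or 'parallel';
        combination: 'combined' or 'independent'."""
        classes = {
            ("sequential", "combined"):    "ALTERNATE",
            ("parallel",   "combined"):    "SYNERGISTIC",
            ("sequential", "independent"): "EXCLUSIVE",
            ("parallel",   "independent"): "CONCURRENT",
        }
        return classes[(use_of_modalities, combination)]

    # Speaking while pointing, fused into one command, falls in the synergistic class:
    print(classify("parallel", "combined"))   # SYNERGISTIC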

An architecture for multimodal user interfaces

[Architecture diagram; main components and their sub-elements:
– Input processing: motor, speech, vision, …
– Media analysis: language, recognition, gesture, …
– Interaction management: media fusion, discourse modeling, plan recognition and generation, user modeling, presentation design
– Media design: language, modality, gesture, …
– Output generation: graphics, animation, speech, sound, …
– Modeling]

Adapted from [Maybury and Wahlster, 1998]
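To show how the components above relate to each other, here is a rough pipeline sketch. The stage order follows the figure, but every function name and body below is a hypothetical placeholder, not an API from [Maybury and Wahlster, 1998]:

    def input_processing(raw):            # motor, speech, vision, ...
        return [{"mode": m, "data": d} for m, d in raw.items()]

    def media_analysis(signal):           # per-mode analysis: language, gesture, ...
        return {"mode": signal["mode"], "tokens": str(signal["data"]).split()}

    def interaction_management(events):   # media fusion, discourse modeling, user modeling
        return {"intent": "put_object", "evidence": events}

    def media_design(intent):             # choose output language, modality, gesture
        return [("speech", "OK"), ("graphics", intent["intent"])]

    def output_generation(plan):          # graphics, animation, speech, sound, ...
        return [f"{medium}: {content}" for medium, content in plan]

    raw = {"speech": "put that there", "gesture": "point 120 45"}
    events = [media_analysis(s) for s in input_processing(raw)]
    print(output_generation(media_design(interaction_management(events))))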

Put – That – There

[Bolt, 1980]
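In Bolt's system a spoken command such as "put that there" is completed by pointing gestures that resolve the deictic words. A hypothetical sketch of that kind of fusion, binding each "that"/"there" to the pointing event closest to it in time (an illustration only, not Bolt's implementation):

    def fuse(words, points):
        """words: list of (time, word); points: list of (time, (x, y))."""
        resolved = []
        for t, w in words:
            if w in ("that", "there"):
                _, (x, y) = min(points, key=lambda p: abs(p[0] - t))
                resolved.append(f"{w}@({x},{y})")
            else:
                resolved.append(w)
        return " ".join(resolved)

    words  = [(0.0, "put"), (0.4, "that"), (1.1, "there")]
    points = [(0.5, (120, 45)), (1.2, (300, 200))]
    print(fuse(words, points))   # put that@(120,45) there@(300,200)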

Potential benefits

A list by Maybury and Wahlster [1998, p. 15]:
– Efficiency
– Redundancy
– Perceptibility
– Naturalness
– Accuracy
– Synergy
– Mutual disambiguation of recognition errors [Oviatt, 1999a]
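The last item, mutual disambiguation, means that two individually error-prone recognizers can correct each other when their n-best lists are fused. A toy sketch of the idea (my own illustration, not Oviatt's method), in which the speech recognizer's top hypothesis loses once the gesture evidence is taken into account:

    # Speech alone prefers "delete shop", but the pointing gesture says "ship".
    speech_nbest  = [("delete shop", 0.55), ("delete ship", 0.45)]
    gesture_nbest = [("point_at_ship", 0.70), ("point_at_shop", 0.30)]

    # Only semantically compatible speech/gesture pairs are allowed to fuse.
    compatible = {("delete ship", "point_at_ship"), ("delete shop", "point_at_shop")}

    best = max(((s, g, ps * pg)
                for s, ps in speech_nbest
                for g, pg in gesture_nbest
                if (s, g) in compatible),
               key=lambda joint: joint[2])
    print(best)   # ('delete ship', 'point_at_ship', ~0.315): the gesture corrects the speech error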

Common misconceptions

A list by Oviatt [1999b]:
1. If you build a multimodal system, users will interact multimodally.

2. Speech and pointing is the dominant multimodal integration pattern.

3. Multimodal input involves simultaneous signals.

4. Speech is the primary input mode in any multimodal system that includes it.

5. Multimodal language does not differ linguistically from unimodal language.

Common misconceptions

6. Multimodal integration involves redundancy of content between modes.

7. Individual error-prone recognition technologies combine multimodally to produce even greater unreliability.

8. All users’ multimodal commands are integrated in a uniform way.

9. Different input modes are capable of transmitting comparable content.

10. Enhanced efficiency is the main advantage of multimodal systems.

Two paradigms for multimodal user interfaces

1. Computer as a tool
– multiple input modalities are used to enhance the direct manipulation behavior of the system
– the machine is a passive tool and tries to understand the user through all the different input modalities that the system recognizes
– the user is always responsible for initiating the operations
– follows the principles of direct manipulation [Shneiderman, 1982; 1983]

Two paradigms for multimodal user interfaces

2. Computer as a dialogue partner
– the multiple modalities are used to increase the anthropomorphism in the user interface
– multimodal output is important: talking heads and other human-like modalities
– speech recognition is a common input modality in these systems
– can often be described as an agent-based conversational user interface

Two hypotheses on combining modalities

1. The combination of human output channels effectively increases the bandwidth of the human → machine channel.
– This has been observed in many empirical studies of multimodal human-computer interaction [Oviatt, 1999b].

Two hypotheses on combining modalities

2. Adding an extra output modality requires more neurocomputational resources and will lead to deteriorated output quality, resulting in reduced effective bandwidth.
– Two types of effects are usually observed: a slow-down of all output processes, and interference errors because selective attention cannot be divided among the increased number of output channels.

– Two examples of this: writing while speaking, and speaking while driving a car.

Call for research

A summary in [Raisamo, 1999e] pointed out that more research is needed to answer the following questions:
– How does the brain work, and which modalities can best be used to gain the synergy advantages that are possible with multimodal interaction?

– When is a multimodal system preferred to a unimodal system?

– Which modalities make up the best combination for a given interaction task?

– Which interaction devices should be assigned to these modalities in a given computing system?

– How should these interaction devices be used, that is, which interaction techniques should be selected or developed for a given task?

Touch’n’Speak

[Raisamo, 1998]

• Touch’n’Speak is a multimodal user interface framework that makes use of combined touch and speech input and different output modalities
– Input: touch buttons, touch lists, touch gestures in area selection (time, location, pressure), speech commands
– Output: graphical, textual, and auditory (non-speech) output, speech feedback

• The framework was used to implement a restaurant information system that provides information on restaurants in Cambridge, MA, USA.
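As an illustration of the kind of technique the framework supports, here is a hypothetical sketch (not the actual Touch’n’Speak code; all names and data are invented) in which a touch gesture first selects an area of the map and a speech command then acts on the restaurants inside it:

    restaurants = [
        {"name": "Blue Note", "cuisine": "seafood", "pos": (12, 40)},
        {"name": "Casa Roma", "cuisine": "italian", "pos": (55, 62)},
        {"name": "Taj Mahal", "cuisine": "indian",  "pos": (80, 15)},
    ]

    def touch_area(x, y, width, height):
        """Rectangle produced by a touch gesture (pressure and timing omitted)."""
        return lambda px, py: x <= px <= x + width and y <= py <= y + height

    def speech_command(command, inside):
        """Apply a recognized speech command, e.g. 'show italian', to the touched area."""
        wanted = command.split()[-1]
        return [r["name"] for r in restaurants
                if r["cuisine"] == wanted and inside(*r["pos"])]

    area = touch_area(0, 0, 60, 70)               # the user circles part of the map
    print(speech_command("show italian", area))   # ['Casa Roma']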

A snapshot of Touch’n’Speak

Examples

• CHI 2000 Video Proceedings: The Efficiency of Multimodal Interaction for a Map-Based Task (8:18)
• SIGGRAPH Video Review 76, CHI’92 Technical Video Program: Multi-Modal Natural Dialogue (10:25)
• SIGGRAPH Video Review 77, CHI’92 Technical Video Program: Combining Gestures and Direct Manipulation (9:56)
• CHI’99 Video Proceedings: Embodiment in Conversational Interfaces: Rea (2:08)

Homework

• Read Chapter 2 (Multimodal interaction) in [Raisamo, 1999e].

– [Raisamo, 1999e] is available online at http://granum.uta.fi/pdf/951-44-4702-6.pdf

– A printable version is available online at http://www.cs.uta.fi/~rr/interact/dissertation.pdf