Are Gestures in Virtual Environments Predictable: Initial



Non-verbal behaviors and coding schemes

Andrea Corradini

University of Southern Denmark, IFKI, Kolding, Denmark [email protected]

Roadmap

• Overview of non-verbal behaviors
  – Definition
  – Relation to speech
  – Classifications
• Coding Schemes
  – Multimodal Corpus
  – HamNoSys, FORM, others

What do you see when somebody’s ...

• Waving
• Expressing disappointment
• Hitchhiking
• Conducting an orchestra
• Negotiating at the stock exchange
• Playing cards
• Playing an instrument
• Controlling air or car traffic
• Praying
• Dancing
• Sitting on a chair
• Typing on a keyboard
• Mimicking
• Using sign language
• ...

What is a gesture?

• Human communication includes many modalities: speech, gestures, bodily and facial expressions
• You think you know what is meant by the word ‘gesture’
• Can you make explicit what your knowledge is based on?

Look up even more dictionaries!

Kinesics

• Non-verbal behaviors related to movement, either of any part of the body or of the body as a whole
• All communicative body movements are generally classified as kinesics
• The systematic study of the relationship between communication and non-linguistic body motions (such as blushes, shrugs, or eye movement)

Let’s get to a compromise

• Gestures are intimately connected with human movements and actions
  – (dynamic) gestures versus static postures
• Gestures are specialized actions and movements
  – They work in communicative ways that not all other motions do
  – They operate under psychological & cognitive constraints, not just anatomical ones
  – Manual skill is essential for sign language production, but the opposite does not hold; thus gestural communication is higher in the motor system hierarchy (Kimura 1993)

Lots of open questions

• Is a gesture a voluntary action? Is there any level of control or gestural awareness?
• Does it convey information? Any communicative intent?
• Is there a set of ’gestemes’ from which one can build gestures? Is there a ’gestobulary’?
• Are there any benefits for the gesturer?
• Are gestures an integrated and complementary form of spoken utterances, or a spill-over of speech production?
• Sign languages can fully represent spoken languages; is the opposite true?
• Did sign or spoken language appear first, or did they evolve together?

Gesture classification

• The lack of an algorithmic definition for gesture is not a good start
• Lack of definition ⇒ categorizations lack clear-cut, exhaustive categories
• Several classifications/dichotomies according to many criteria: Cadoz (1994), Wundt (1973), Efron (1941), Kendon (1988), McNeill (1992), Nespoulous, Perron & Lecours (1984, 1986), Ekman & Friesen (1969), ...

Cadoz’s classification

• The functionality of hand movements can roughly be split into communication purposes and world manipulation
• Cadoz (1994) classifies hand movements according to their function, adding one class in between:
  – semiotic: motions to communicate meaningful information; relates to a meaning
  – ergotic: motions to manipulate the physical world; relates to a task
  – epistemic: motions to explore the world through tactile experience; relates to haptic exploration

Kendon’s classification

• Kendon (1988) classifies gestures along a (linguistic) continuum:
  – Gesticulation: idiosyncratic spontaneous hand-arm movements accompanying speech
  – Language-like gestures: like gesticulation, but grammatically integrated in the utterance
  – Pantomime: gestures without speech, used e.g. to communicate a story
  – Emblems
  – Sign language: a set of gestures and postures forming a full-fledged linguistic communication system
• Emphasis on the speech/gesture connection: along the continuum the presence of speech declines while the presence of language properties increases

Kendon’s gesture definition

• Gestures are all actions that are intentionally produced and overtly communicative
• There is no context-independent definition, since what may be a gesture in one situation may be an incidental movement in another
• Gesture, hence its definition, depends on how speakers and listeners are treating each other’s flow of actions in any dialogue situation

Obama’s finger

• Speaking to a crowd about controversies created out of petty slips to pit one candidate against the others
• In the news and on blogs: Obama gave Clinton the finger in a subtle way
• "They like stirring up controversy, they like playing gotcha games, getting us to attack each other, and I have to say Senator Clinton, you know, looked in her element. [scratches face with middle finger] You know, she was taking every opportunity to get a dig in there. You know?"

”Fare le corna”

• ”Make the horns” emblem: amulet/superstition, joke, infidelity, satanism in metal music, etc.
• Whom was this gesture for? Context-dependent too.

McNeill’s classification

• McNeill (1992) interprets Kendon’s ’gesticulations’ in combination with the speech they accompany
• Imagistic
  – Iconic: the imagery underlying the motion depicts form or manner of aspects of the accompanying speech
  – Metaphoric: the same, but for abstract aspects
  – Cohesive: tie together temporally separated parts of discourse
• Non-imagistic
  – Beat: motions to stress segments of discourse or speech rhythm; biphasic, i.e. the hand is not transported in space to perform it
  – Deictic: to point in space/time
  – Deictic cohesive: as imagistic cohesive, but deictic

Classification schemes: summary

• Put forward by cognitive scientists, psychologists, psycholinguists, biologists and linguists who focused on either characteristics of the movement, or insights into gesticulative dysfunctions, or reasons why specific movements occur, etc.
• None provides any rule base for making gestures understandable for, or synthesizable by, computers
• Each is useful but none is universal; some categories are shared
• Categories are not always discrete, mutually exclusive groups into which gestures can be placed, i.e. a gesture might involve more than one category
• Each taxonomy serves as a tool that suits a particular researcher’s aims rather than a basic truth about the nature of gesture
  – This reflects the context-dependent nature of multimodal speech-gesture dialogue

Emblems, Iconic & Pointing

• A pointing hand configuration relates to how the object being referred to is being used
• Iconics add details to the mental image that the speaker is trying to convey
• Emblems are likely among the most conscious and voluntary gestures
  – socially acquired through culture contact
  – may accompany words or occur instead of words
  – if simultaneous with speech, they are paralinguistic features that add emphasis to words

From: Andrea de Jorio (1832) La mimica degli antichi investigata nel gestire napoletano

Gesture elements (Kendon 1980)

• The ’stroke’ is the movement’s most meaningful part
  – the phase leading up to it is the ’preparation’
  – the phase that follows it to a rest position is the ’recovery’
  – it may be followed by a ’post-stroke hold’ (Kita 1993)
• Preparation, stroke, and any possible post-stroke hold together form a ’gesture phrase’
  – gesture phrase → (preparation) stroke (post-stroke hold)
• Gesture movements are organized as excursions, each called a ’gesture unit’ (see the sketch below)
  – gesture unit → (gesture phrase)* recovery
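These two rules behave like a small grammar over phase labels. A minimal Python sketch, assuming single-letter phase codes and an at-least-one-phrase reading of the unit rule (both my assumptions, not Kendon's notation):

```python
import re

# Phase codes: P = preparation, S = stroke, H = post-stroke hold, R = recovery.
# gesture phrase -> (preparation) stroke (post-stroke hold)   =>  P?SH?
# gesture unit   -> (gesture phrase)+ recovery                =>  (P?SH?)+R
PHRASE = r"(?:P?SH?)"
UNIT = re.compile(rf"{PHRASE}+R")

def is_gesture_unit(phases):
    """Return True if the labeled phases form a well-formed gesture unit."""
    return UNIT.fullmatch("".join(phases)) is not None

# A unit with two phrases: prepare, stroke, hold, then a bare stroke, then recovery.
print(is_gesture_unit(["P", "S", "H", "S", "R"]))  # True
print(is_gesture_unit(["P", "R"]))                 # False: a phrase needs a stroke
```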

Example I

• Before uttering ’throw’, the speaker has already started the movement that leads to the stroke
• They are planned together and are integral parts of one utterance
• At the stroke there is a convergence of semantics from the two modes

From: Adam Kendon (2004) Gesture – Visible action as utterance

Example II

From: Adam Kendon (2004) Gesture – Visible action as utterance

Gesture surface representation

• Certain characteristics distinguish gesture from other kinds of activity such as practical actions, postural adjustments, orientation changes, self-manipulations:
  – Gesture excursions start and end at the same position
  – Tri-phasic (beats: bi-phasic)
  – Well bounded: they tend to have clear onsets and offsets
  – Phases of action, but they cannot be composed to create new gestures
  – Precede speech (unless for emphasis)
  – Are symmetric

Kendon’s view

• Gesture is partner to speech
• Their combination is a function of the resources available to a speaker to convey additional (redundant/complementary) information
• Speakers accommodate gesture and speech components in relation to the communication needs
• Speakers build utterances as a gesture-speech ensemble which is treated as a single unit: when an utterance is repeated or corrected, so are both components; hence they are equally important

McNeill’s view

• During narrations of animated cartoons, some 90% of utterances are accompanied by at least one gesture; iconic and metaphoric gestures do not typically function as fillers in spoken utterances but rather bring in some new information
  – referred to as the ‘essential doubleness’ of language (Saussure 2002)
• Gestures do not simply embellish speech; both are part of the language generation process and arise from the same computational stage
• Gestures are an alternate manifestation of the process by which ideas are encoded into patterns of behavior, and as such convey information about the internal mental processes of the speaker

McNeill’s experiments

• Cartoon episode: a cat tries to catch a bird by climbing up inside a drainpipe that terminates where the bird is perched
• Speaker: ‘and he tries going up through it this time’
• Gestures occur with the italicized words: when the speaker says ‘up’ (hand rises) and ‘through’ (fingers spread into a shape resembling an interior space)
• Speech and gesture are co-expressive; neither is redundant or supplemental

Gesture features

• Gestures have the following non-linguistic properties (McNeill 1994):
  – their meaning is determined by their global appearance (i.e. the gesture unit as a whole), and gesture segments only convey meaning when they appear together as one single gesture
  – they are non-combinatoric, i.e. they do not form larger, hierarchically structured gestures
  – they are context-sensitive, i.e. different gestures may refer to the same entity
  – they have no standard form, i.e. different speakers display the same meaning in idiosyncratic ways
  – they anticipate speech in their preparation phase and synchronize with it in the stroke phase

Well-known linguistic corpora

Corpus name      # Tokens       Notes
Brown            1,000,000      POS tagged, balanced
Penn Treebank    2,000,000      Parsed
MapTask          150,000        Spoken dialogue, parsed, dialogue acts
BNC              100,000,000    POS tagged

Measuring agreement

• In order for the statistics extracted from an annotation to be reproducible, coding distinctions must be understandable also to persons who did not create the scheme (Carletta, 1996)
• Measuring the percentage of agreement does not take chance agreement into account
• The K statistic (Siegel & Castellan, 1988), illustrated by the sketch below:
  – K = 0: no agreement
  – 0.6 <= K < 0.8: tentative agreement
  – 0.8 <= K <= 1: OK agreement
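To make the chance correction concrete, here is a minimal two-coder sketch; Siegel & Castellan's K generalizes to several coders, so treat this Cohen-style version as an illustration rather than their exact estimator:

```python
from collections import Counter

def kappa(labels_a, labels_b):
    """Chance-corrected agreement K = (P_o - P_e) / (1 - P_e) for two coders.

    P_o is the observed proportion of agreement; P_e is the agreement
    expected by chance given each coder's label distribution.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Two coders labeling the same 8 gestures (made-up data):
a = ["beat", "iconic", "deictic", "iconic", "beat", "iconic", "beat", "deictic"]
b = ["beat", "iconic", "deictic", "beat", "beat", "iconic", "iconic", "deictic"]
print(round(kappa(a, b), 2))  # 0.62: tentative agreement by the thresholds above
```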

Corpus analysis

• Simple transcription
• Annotation to analyze specific tokens such as citations, names, etc.
• Part-of-speech tagging (e.g., Brown Corpus, BNC)
• Syntactic structures (e.g., Penn Treebank, WSJ Corpus)
• Word sense (e.g., SEMCOR)
• Dialogue acts (e.g., MAPTASK, TRAINS)

Language annotation

• A review is provided by Steven Bird & Mark Liberman (2000) in “A Formal Framework for Linguistic Annotation”:
  – TIMIT (Garofolo et al. 1986)
  – CHILDES (MacWhinney 1995)
  – LDC Telephone Speech
  – NIST UTF (NIST 1998)
  – Switchboard (Godfrey et al. 1992)
  – A few more

Annotation tools I

• Multi-modal corpora require new annotation and analysis tools
• Focus shifted to areas that received less attention in textual corpora and that involve multiple levels of temporal phenomena

From: Bigbee, T., Loehr, D., and Harper, L. (2001) Emerging Requirements for Multi-Modal Annotation and Analysis Tools

Annotation tools II

From: Garg, S. et al. (2004) Evaluation of Transcription and Annotation Tools for a Multi-modal, Multi-party Dialogue Corpus

Multimodal analysis

• Lack of, or limited access to, multimodal resources that capture visual info and its interaction with other modes
• Systematically collected and annotated multimodal corpora would facilitate:
  – a principled understanding of modality integration
  – generalized guidance on media allocation
  – cross-modal reference resolution/generation
  – anticipation of phenomena and/or patterns in a certain mode
  – a gold standard against which the output of multimodal systems could be evaluated
  – interface design in intelligent multimodal systems development
• Researchers use their own coding schemes for annotating behaviors and for computing the metrics they are interested in (like temporal relationships between modes)
• No common standard for encoding multimodal behavior implies that corpora cannot be shared

Annotation scheme

• The term annotation scheme refers to the overall structure of annotations: it is a blueprint of their basic structure
• How faithfully does the encoding reflect the original movement?
• In multimodal corpus analysis, there has not been any real standardization initiative
• Most commonly used: FACS for facial expressions (Ekman and Friesen, 1978) and structural/functional classifications for hand gestures (Efron, 1941; McNeill, 1992)
• It is necessary to reach a generic representation of the encoded information, independent of the theoretical approach chosen

Which modalities?

• Multimodal corpora & annotation schemes formally represent one or more of these modalities:
  – Speech
  – Non-speech sounds
  – Facial expressions
  – Lip movement
  – Head movements
  – Gaze
  – Hand and arm movements
  – Leg and foot movements
  – Limb movements
  – Body posture
  – Clothing
  – Tattooing, piercing, make-up, ...
  – Touch
  – Smell
  – Taste
  – ...

Annotation requirements

• Fast and reliable annotation
• View, select, and search tools
• Extensibility of annotation
• Support for multimodality
• Annotation import/export/merging
• Consistent coding schemes
  – “..Hands-on: It is highly desirable that the describer of a certain coding scheme has actually tried to use the coding scheme..” (International Standards for Language Engineering report D9.1 on Natural Interactivity and Multimodality)
• Many others...

McNeill’s scheme


• Assess where the gesture occurs wrt the speech and bracket the speech accompanied by the gesture: mark the onset of the preparation and the end of the retraction phases with square brackets [ ], and mark the stroke in boldface (see the worked example below)
• The structural transcription is followed by a functional description
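As a worked illustration only (the bracket placement is my guess, and capitals stand in for boldface since this transcript carries no bold), the cartoon utterance from the McNeill’s experiments slide could be transcribed as:

  and he [tries going UP THROUGH it] this time

The brackets span the excursion from preparation onset to the end of retraction; the capitalized words mark the strokes co-occurring with ‘up’ and ‘through’.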

McNeill’s scheme II

• Step 1: gesture type
  – Representational: iconic, metaphoric
  – Deictic
  – Beats
  – Confidence: 1 = low confidence, 3 = confident, 5 = totally sure
• Step 2: form and meaning
  – If representational or deictic: go to Step 3
  – If beat: stop here unless ...
• Step 3: motion in 3D (see the sketch below)
  – hand shape, hand meaning, motion shape, motion meaning
  – handedness: which hand (RH/LH), or both hands; if both, are they symmetrical: same shape (SH) or different shape (DH)
  – shape of the hand (use ASL hand forms)
  – palm-finger orientation: P/FTU: palm/finger toward up; TD/C: toward down/center; AB/TB: away from body (outward) / toward body (inward); AC: away from center
  – motion direction: T/AB: toward/away from body; PF/S: parallel to front/side of body
  – motion space
  – motion features: Uni-1: unidirectional, one movement, effort exerted in one direction, the other movement returning the hand to rest; BiDir: bidirectional, two movements with effort exerted in both directions; 2SM: both hands move in the same way; 2DM: each hand moves in its own way
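A record produced by this three-step pass could be modeled as below; the Python field names are illustrative, not McNeill's own labels:

```python
from dataclasses import dataclass

@dataclass
class GestureAnnotation:
    # Step 1: gesture type plus coder confidence (1 = low, 3 = confident, 5 = totally sure)
    gesture_type: str           # "iconic" | "metaphoric" | "deictic" | "beat"
    confidence: int
    # Steps 2-3: filled in only for representational and deictic gestures
    handedness: str = ""        # "RH", "LH", "2SH" (both, same shape), "2DH"
    hand_shape: str = ""        # an ASL hand form
    palm_orientation: str = ""  # e.g. "PTU" (palm toward up) -- codes as on the slide
    motion_direction: str = ""  # e.g. "TB" (toward body), "PF" (parallel to front)
    motion_features: str = ""   # "Uni-1", "BiDir", "2SM", "2DM"

# A representational gesture goes through all three steps...
iconic = GestureAnnotation("iconic", 5, handedness="RH", hand_shape="G",
                           palm_orientation="PTU", motion_direction="TB",
                           motion_features="Uni-1")
# ...while a beat stops after Step 1, so its form fields stay empty.
beat = GestureAnnotation("beat", 3)
```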

McNeill’s motion space

HamNoSys

• HamNoSys was created to describe different sign languages, not any specific one
• Initial configurations:
  – Hand shape: basic form, position of the thumb, inclination
  – Hand position: pointing direction, orientation of the hand, description of the hands with respect to the person gesturing, wrt the body part at whose level the hand is positioned, wrt the distance from the body of the person gesturing
  – Actions, i.e. combinations of hand movements performed at the same time or immediately after each other: line shape, bow shape, zigzag shape, change of hand position, etc.
• Single-handed vs. both-handed
• About 200 symbols (http://www.sign-lang.uni-hamburg.de/Projekte/HamNoSys/HamNoSysErklaerungen/englisch/Contents.html); a rough sketch of the encoded information follows below
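HamNoSys writes all of this with its ~200 dedicated symbols; as a loose illustration only (the field names and string values below are mine, not HamNoSys notation), the information carried by an initial configuration plus actions could be modeled as:

```python
from dataclasses import dataclass

@dataclass
class HamNoSysSign:
    hand_shape: str          # basic form + thumb position + inclination
    pointing_direction: str
    hand_orientation: str
    location: str            # body part the hand is at, plus distance from the body
    actions: list            # hand movements: line, bow, zigzag, position change, ...
    two_handed: bool = False

sign = HamNoSysSign(hand_shape="flat hand, thumb out",
                    pointing_direction="up",
                    hand_orientation="palm toward body",
                    location="chest, close",
                    actions=["straight line outward"])
```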

FORM

• Martell, C. (2005): a non-semantic, geometrically based annotation scheme to describe the kinematic information in a gesture
• Multi-track, with each track representing a different aspect of gestural form, whereby each independently moving body part has 2 tracks:
  – location/shape/orientation
  – movement
• A track spans a time period; you place objects at specific positions to mark the state of the gesture at that time
• Hierarchical approach to mirror anatomical features: groups at the top contain subgroups; tracks sit within (sub)groups
• Tracks contain lists of attributes; the lowest level holds attribute name - attribute value pairs (see the sketch below)
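A minimal sketch of this layout, with hypothetical group, track, and attribute names (FORM defines its own inventories):

```python
from dataclasses import dataclass, field

@dataclass
class TrackObject:
    start: float            # seconds into the recording
    end: float
    attributes: dict        # lowest level: attribute name -> attribute value

@dataclass
class Track:
    name: str               # "location/shape/orientation" or "movement"
    objects: list = field(default_factory=list)

@dataclass
class BodyPartGroup:
    name: str               # hierarchy mirrors anatomy: groups contain
    subgroups: list = field(default_factory=list)  # subgroups, and tracks
    tracks: list = field(default_factory=list)     # live inside (sub)groups

right_arm = BodyPartGroup("right arm", tracks=[
    Track("location/shape/orientation",
          [TrackObject(0.0, 0.4, {"upper arm lift": "45 degrees"})]),
    Track("movement",
          [TrackObject(0.4, 0.9, {"direction": "upward"})]),
])
```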

• That’s it!

Thank you!