
Detecting Group Interest-level
in Meetings
Daniel Gatica-Perez, Iain McCowan,
Dong Zhang, and Samy Bengio
IDIAP Research Institute, Martigny, Switzerland

Outline

- The Goal
- Our Approach
- Meeting Corpus
- Audio-Visual Features
- Experiments
  - Performance Measures
  - Feature Selection
  - Results
- Conclusions

The Goal

- Extract relevant segments in meetings
- Relevant segments are defined based on group interest-level (the degree of engagement in participants' interactions)

Our Approach

[Architecture diagram: microphones and cameras provide low-level AV features for each participant (Person 1, Person 2, ..., Person N AV features); these feed statistical models through either early integration or a multi-stream HMM]

Our Approach

- Early integration HMM: concatenate the audio and visual features to form a single observation vector
- Multi-stream HMM: audio and visual streams are trained independently; outputs are merged at the state level during decoding (see the sketch below)
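
To make the two fusion schemes concrete, here is a minimal NumPy sketch. The function names, array shapes, and the stream weight w_audio are illustrative assumptions; the slides do not specify how the streams are weighted.

```python
import numpy as np

def early_integration(audio_feats, video_feats):
    """Early integration: concatenate per-frame audio and visual features
    into a single observation vector for one HMM."""
    # audio_feats: (T, d_a), video_feats: (T, d_v) -> (T, d_a + d_v)
    return np.concatenate([audio_feats, video_feats], axis=1)

def multistream_log_emission(audio_loglik, video_loglik, w_audio=0.5):
    """Multi-stream HMM: streams are trained independently; per-state
    emission log-likelihoods are combined at decoding time, e.g. before
    running Viterbi."""
    # audio_loglik, video_loglik: (T, n_states) arrays of log p(o_t | state)
    return w_audio * audio_loglik + (1.0 - w_audio) * video_loglik
```

The multi-stream scheme keeps the two modalities synchronized at the state level while letting each keep its own emission model, rather than forcing a single model over the concatenated features.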

Meeting Corpus (mmm.idiap.ch)

- 50 meetings: 30 for training, 20 for testing
- Each meeting: 5 minutes, 4 participants
- Recorded based on topic and action scripts
- Behavior and emotion of participants are natural

Annotating Group-Interest Level

- Interval coding scheme:
  (a) discrete scale: 1-5
  (b) 15-second interval unit
  (c) 2 independent annotators
- Post-processing (sketched below):
  (a) normalization (for annotator bias)
  (b) analysis of inter-annotator agreement
  (c) average of the two annotators
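
A minimal sketch of the post-processing, assuming z-score standardization as the bias normalization and two per-interval score sequences as input; the slides do not name the exact normalization, and the function names are hypothetical.

```python
import numpy as np

def normalize(scores):
    # Standardize one annotator's 1-5 interval scores to remove offset
    # and scale bias (assumed normalization; the slides do not specify
    # the method).
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std()

def merged_annotation(annotator_a, annotator_b):
    # Ground truth: average of the two normalized annotation curves.
    return (normalize(annotator_a) + normalize(annotator_b)) / 2.0
```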

Annotating Group-Interest Level

[Example annotation timeline: interest scores over time, e.g. 5 (HIGH), 3 (NEUTRAL), 1 (NEUTRAL), 4 (HIGH)]

Audio-Visual Features

Modality   Description
Visual     head orientation from skin color blobs
           right hand orientation from skin color blobs
           right hand eccentricity from skin color blobs
           head and hand motion from skin color blobs
Audio      SRP-PHAT from microphone array
           speech relative pitch from lapels
           speech energy from lapels
           speech rate from lapels

Performance Measures

- Nc: high interest-level frames correctly detected
- Nf: high interest-level frames falsely accepted
- Nd: high interest-level frames falsely rejected
- precision = Nc / (Nc + Nf)
- recall = Nc / (Nc + Nd)
- Expected Performance Curve (EPC): ep(alpha) = alpha * precision + (1 - alpha) * recall, for alpha in [0, 1]
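
These measures translate directly into code. A minimal sketch, assuming binary per-frame labels with 1 marking a high interest-level frame (the function name and label encoding are illustrative):

```python
import numpy as np

def performance(pred, truth, alpha=0.5):
    # pred, truth: binary per-frame arrays, 1 = high interest-level
    pred, truth = np.asarray(pred), np.asarray(truth)
    n_c = np.sum((pred == 1) & (truth == 1))  # Nc: correctly detected
    n_f = np.sum((pred == 1) & (truth == 0))  # Nf: falsely accepted
    n_d = np.sum((pred == 0) & (truth == 1))  # Nd: falsely rejected
    precision = n_c / (n_c + n_f)
    recall = n_c / (n_c + n_d)
    ep = alpha * precision + (1 - alpha) * recall  # one EPC point
    return precision, recall, ep
```

Sweeping alpha over [0, 1] traces the EPC; the alpha = 0, 0.5, and 1 columns in the Overall Results table below correspond to three such operating points.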

Feature Selection

- Selected AV features (3 audio + 2 visual):
  Audio: speech energy, speaking rate, speech pitch
  Visual: person motion, head angle

Results (Single-modal vs. Multimodal)

Results (Single-stream vs. Multi-stream)

Overall Results

Method                        alpha = 0      alpha = 0.5    alpha = 1
                              pr     rc      pr     rc      pr     rc
Audio-only                    0.54   0.85    0.58   0.80    0.70   0.34
Audio-only (Feature fusion)   0.54   0.85    0.60   0.77    0.73   0.42
MS-HMM                        0.63   0.85    0.63   0.84    0.67   0.54
MS-HMM (Feature fusion)       0.59   0.84    0.77   0.60    0.75   0.55

(pr = precision, rc = recall; alpha weights precision vs. recall in the EPC)

Conclusions

- Audio modality is dominant
- Modality combination improves performance in some operating regions
- Multi-stream HMM is better than early integration
- Feature fusion at the group level is beneficial