The Games Corpus
Design, implementation and annotation
Agustín Gravano
[email protected]
Spoken Language Processing Group
Columbia University
The Games Corpus
1. Design and Implementation
2. Annotation
"The Games Corpus" - Agustín Gravano - Columbia University
Experiment Design
• Goal: Study the relation between the down-stepped contour and:
    • Information status
    • Syntactic position
    • Discourse position
• Spontaneous speech
• Both monologue and dialogue
Experiment Design
• Three computer games.
• Two players, each on a different computer.
• They collaborate to perform a common task.
• Totally unrestricted speech.
Cards Game #1
Player 1 (Describer) 
Player 2 (Searcher) 
• Short monologues
• Vary frequency and order of occurrence of objects on the cards.
Cards Game #2
Player 1 (Describer) 
Player 2 (Searcher) 
• Dialogue
• Vary frequency and order of occurrence of objects on the cards.
Objects Game
Player 1 (Describer) 
Player 2 (Searcher) 
• Dialogue
• Vary target and surrounding objects (subject and object position).
Games Session
• Repeat 3 times:
    • Cards Game #1
    • Cards Game #2
• Short break (optional)
• Repeat 3 times:
    • Objects Game
• Each subject participated in 2 sessions.
• 12 sessions
Subjects
• Postings:
    • Columbia’s webpage for temporary job ads.
    • Craigslist
        • http://www.craigslist.org
        • Category: Gigs → Event gigs
• Problem:
    • People are unreliable
    • ~50% did not show up, or cancelled on short notice.
Subjects
• Possible solutions:
    • Give precise instructions to e-mail ALL required info:
        • Name, native speaker?, hearing impairments?, etc.
    • Ask for a phone number.
        • Call them and explain why it is so important for us that they show up (or cancel with adequate notice).
    • Increase the pay after each session.
        • Example: $5, $10, $15 instead of $10, $10, $10.
Recording
• Sound-proof booth
    • 2 subjects + 1 or 2 confederates.
    • Head-mounted mics.
    • Digital Audio Tape (DAT): one channel per speaker.
• Wav files
    • One mono file per speaker.
    • Sample rate: 48000 Hz
    • Downsampled to 16000 Hz (but kept original files!)
    • ~20 hours of speech → 2.8 GB (16k)
Logs
• Log everything the subjects do to a text file.
• Example:
    17:03:55:234 BEGIN_EXECUTION
    17:04:04:868 NEXT_TURN
    17:04:31:837 RESULTS 97 points awarded.
    17:04:38:426 NEXT_TURN
    17:05:03:873 RESULTS 92 points awarded.
    ...
• Later, this may be used (e.g.) to divide each session into smaller tasks or conversations.
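One way such a log could be split into smaller tasks is sketched below. It assumes each log line holds a timestamp (HH:MM:SS:mmm) followed by an event name and optional free text, and that a NEXT_TURN event followed by a RESULTS event delimits one task. The function names and the sample lines are illustrative, not the original tools.

```python
import re
from datetime import timedelta

# Assumed line layout: "HH:MM:SS:mmm EVENT optional details"
LOG_LINE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}):(\d{3})\s+(\S+)\s*(.*)$")

def parse_log(lines):
    """Return (time, event, details) tuples; lines that do not match are skipped."""
    entries = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m is None:
            continue
        h, mi, s, ms = (int(g) for g in m.groups()[:4])
        t = timedelta(hours=h, minutes=mi, seconds=s, milliseconds=ms)
        entries.append((t, m.group(5), m.group(6)))
    return entries

def split_turns(entries):
    """Return (start, end) pairs, one per NEXT_TURN ... RESULTS span (assumed task unit)."""
    turns, start = [], None
    for t, event, _details in entries:
        if event == "NEXT_TURN":
            start = t
        elif event == "RESULTS" and start is not None:
            turns.append((start, t))
            start = None
    return turns

sample = [
    "17:03:55:234 BEGIN_EXECUTION",
    "17:04:04:868 NEXT_TURN",
    "17:04:31:837 RESULTS 97 points awarded.",
    "17:04:38:426 NEXT_TURN",
    "17:05:03:873 RESULTS 92 points awarded.",
]
turns = split_turns(parse_log(sample))
# Duration of each task in seconds
durations = [round((end - start).total_seconds(), 3) for start, end in turns]
```

The timestamps here are wall-clock times, so mapping a task onto the audio would additionally need the clock time at which the recording started.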
The Games Corpus
1. Design and Implementation
2. Annotation
Speech Processing Tools
• Praat
    • http://www.praat.org
• WaveSurfer
    • http://www.speech.kth.se/wavesurfer
• Transcriber
    • http://trans.sourceforge.net
Orthographic Tier - Method 1
• Problems
    • Very stressful
    • Time consuming
• Separate transcription from alignment.
Orthographic Tier - Method 2
1. Transcribe chunks using a web interface.
2. Align each chunk automatically.
3. Concatenate all chunks.
4. Correct the alignment by hand using Praat, WaveSurfer or similar.
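The concatenation step could look roughly like the sketch below. It assumes each chunk's alignment is a list of (start, end, word) tuples timed relative to the start of the chunk, and that the offset of each chunk in the full recording is known. The data layout and names are hypothetical.

```python
def concatenate_chunks(chunks):
    """Merge per-chunk alignments into one alignment on the session timeline.

    `chunks` is a list of (chunk_start, intervals) pairs, where `intervals`
    is a list of (start, end, word) tuples timed relative to the chunk start.
    """
    merged = []
    # Process chunks in temporal order, shifting every boundary by the chunk offset
    for chunk_start, intervals in sorted(chunks, key=lambda c: c[0]):
        for start, end, word in intervals:
            merged.append((round(chunk_start + start, 3),
                           round(chunk_start + end, 3),
                           word))
    return merged

chunks = [
    (10.0, [(0.0, 0.4, "okay"), (0.4, 0.9, "next")]),
    (0.0, [(0.2, 0.7, "the"), (0.7, 1.3, "lion")]),
]
session = concatenate_chunks(chunks)
```

Rounding to milliseconds at this point keeps the boundaries of adjacent words identical after the shift.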
Orthographic Tier - Method 2
• Advantages
    • Transcription task is very comfortable.
    • Most of the alignment task is done automatically.
    • Only fine-grained hand corrections are needed.
• Problems
    • Overhead: chunking, automatic alignment, concatenation.
    • Error-prone! It is easy for humans to overlook errors in the automatic alignment.
Orthographic Tier - Method 3
1. Transcribe the whole file, using:
    • a regular audio player (e.g., Windows Media Player), and
    • a regular plain-text editor (e.g., Notepad).
2. Use WaveSurfer to align the words.
    • “Load text labels” function
    • Check out:
        • Spectrogram settings
        • Customizable shortcuts
Orthographic Tier
• Transcription guidelines
    • capital letters
    • abbreviations
    • disfluencies
    • mmhm, uhhuh, gotcha, etc.
• Alignment guidelines
    • boundaries
• http://www.cs.columbia.edu/~agus/games
    • username/password = speech/lions
Too many cooks…
• Concurrency problem
• File locking webpage
    • Annotators lock a file before working on it, and release it when done.
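The lock-and-release idea can be illustrated with lock files, as in the minimal sketch below. This is not the webpage's actual implementation, which the slides do not describe; the `.lock` suffix and the annotator names are hypothetical.

```python
import os
import tempfile

def acquire_lock(path, annotator):
    """Try to lock `path` for `annotator`; return True on success.

    os.O_CREAT | os.O_EXCL makes creation fail if the lock file already
    exists, so two annotators cannot both obtain the lock.
    """
    lock_path = path + ".lock"
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(annotator)  # record who holds the lock
    return True

def release_lock(path, annotator):
    """Remove the lock only if `annotator` is the one holding it."""
    lock_path = path + ".lock"
    try:
        with open(lock_path) as f:
            owner = f.read()
    except FileNotFoundError:
        return False
    if owner != annotator:
        return False
    os.remove(lock_path)
    return True

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "s08.cards.3.B.words")
    first = acquire_lock(target, "alice")        # lock is free
    second = acquire_lock(target, "bob")         # already held by alice
    wrong_release = release_lock(target, "bob")  # bob does not hold it
    released = release_lock(target, "alice")
    again = acquire_lock(target, "bob")          # free again
```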
Annotation: Cue Words
• okay, mmhm, uhhuh, right, etc.
• Acknowledgment, Backchannel, Segment Beginning, Segment End, etc.
• Developed an ad-hoc application in Java.
    • Bad idea!!! Development took too long.
• Instead, use Praat (or another general-purpose tool).
    • For simple, specific tasks, Praat is not difficult to learn.
    • Create a file with empty points at the midpoint of the words that need to be labeled.
    • Annotators only label those words, safely ignoring the rest.
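Generating the empty points might be done along these lines. The sketch assumes the word alignment is available as (start, end, word) tuples and writes the points as "Time Value" lines. The cue-word set only contains the examples named on the slide, and the function name is hypothetical.

```python
CUE_WORDS = {"okay", "mmhm", "uhhuh", "right"}  # examples from the slide only

def empty_points(word_intervals, targets=CUE_WORDS):
    """Return (time, label) points with empty labels, one per target word,
    placed at the midpoint of the word's interval."""
    points = []
    for start, end, word in word_intervals:
        if word.lower() in targets:
            points.append(((start + end) / 2, ""))
    return points

words = [
    (0.00, 0.25, "okay"),
    (0.25, 0.55, "the"),
    (0.55, 1.05, "lion"),
    (1.50, 1.75, "right"),
]
points = empty_points(words)
# One "Time Value" line per point; the value is left empty for the annotator
lines = [f"{t:.3f} {label}".rstrip() for t, label in points]
```

Converting these points into a Praat file would be a separate step, not shown here.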
Other Annotations
• Turn switches
    • Smooth switches, interruptions, backchannels, etc.
    • The labeler received a Praat file with empty turns.
• Prosody
    • ToBI Labeling Conventions: Tones and Break Indices.
• Questions
    • Identification, form and function.
Guidelines for Guidelines
• Web-based (password protected)
• Highlight recent changes
• Avoid long lists: categorize, use trees.
Files
• games/data/session_NN/sNN.GAME.P.Y.ext
    • NN = 01..12
    • GAME = {cards, objects}
    • P = 0..3 if GAME=cards, 0..1 if GAME=objects
    • Y = {A, B}
    • ext = {wav, words, tones, breaks, misc, turns, …}
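A naming scheme like this is easy to validate mechanically. The sketch below checks a file name against the pattern sNN.GAME.P.Y.ext and the ranges listed on the slide. The function name is hypothetical, and the extension is accepted as any word because the slide's list of extensions is open-ended.

```python
import re

# sNN.GAME.P.Y.ext, with the field names used on the slide
FILENAME = re.compile(
    r"^s(?P<nn>\d{2})\.(?P<game>cards|objects)\.(?P<p>\d)\.(?P<y>[AB])\.(?P<ext>\w+)$"
)

def parse_name(name):
    """Split a corpus file name into its fields, or return None if it is invalid."""
    m = FILENAME.match(name)
    if m is None:
        return None
    nn, p = int(m["nn"]), int(m["p"])
    max_p = 3 if m["game"] == "cards" else 1  # P = 0..3 for cards, 0..1 for objects
    if not (1 <= nn <= 12 and 0 <= p <= max_p):
        return None
    return {"nn": nn, "game": m["game"], "p": p, "y": m["y"], "ext": m["ext"]}

ok = parse_name("s08.cards.3.B.wav")
bad_part = parse_name("s08.objects.3.A.words")  # P out of range for objects
bad_session = parse_name("s13.cards.0.A.wav")   # NN out of range
```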
Files
• Examples:
    games/data/session_08/s08.cards.3.B.wav
                          s08.cards.3.B.words
                          s08.cards.3.B.misc
                          …
                          s08.objects.1.A.wav
                          s08.objects.1.A.words
                          s08.objects.1.A.misc
                          …
    games/data/session_11/…
Files Format
• All files (except *.wav) are saved as plain text, in the WaveSurfer format:
    • Start End Value   (for interval tiers)
    • Time Value        (for point tiers)
• Advantages
    • Human-readable.
    • Very easy to process.
• Problems
    • Consistency
    • Rounding
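Reading these files takes only a few lines, as the sketch below suggests. It also includes one possible reading of the consistency and rounding problems: intervals whose end precedes their start, or that overlap the previous interval by more than a small tolerance. The slides do not define these checks, so the tolerance value and function names are assumptions.

```python
def parse_interval_tier(text):
    """Parse 'Start End Value' lines into (start, end, value) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.split(None, 2)
        value = parts[2].strip() if len(parts) > 2 else ""
        rows.append((float(parts[0]), float(parts[1]), value))
    return rows

def parse_point_tier(text):
    """Parse 'Time Value' lines into (time, value) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.split(None, 1)
        value = parts[1].strip() if len(parts) > 1 else ""
        rows.append((float(parts[0]), value))
    return rows

def find_problems(intervals, tolerance=0.0005):
    """Return (line_number, message) pairs for suspicious intervals."""
    problems = []
    prev_end = None
    for n, (start, end, _value) in enumerate(intervals, start=1):
        if prev_end is not None and prev_end - start > tolerance:
            problems.append((n, "overlaps previous interval"))
        if end < start:
            problems.append((n, "end before start"))
        prev_end = end
    return problems

words = parse_interval_tier(
    "0.000 0.310 okay\n0.310 0.550 the\n0.549 1.050 lion\n1.200 1.100 right\n"
)
points = parse_point_tier("0.155 Backchannel\n1.150\n")
problems = find_problems(words)
```

The tolerance is what absorbs rounding: two boundaries that differ by less than half a millisecond are treated as the same point.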
Files Format
• Other formats:
    • XML
        • General-purpose mark-up language.
        • <TAG attribute="value"> … </TAG>
        • Solves problems like consistency and rounding.
        • Not human-readable, harder to process.
    • Praat
        • Not human-readable, hard to process.
        • Also has the consistency problem.
Scripts
• So far, we have needed dozens of Perl scripts.
• Examples:
    • Convert between Praat and WaveSurfer formats.
    • Create a Praat file with empty CW labels, turns, etc.
    • Find typos, missing labels, and other errors.
    • Unify notation (e.g., “mm-hmm” → “mmhm”).
    • Check consistency of files.
    • …
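Two of these tasks, unifying notation and finding typos or missing labels, are sketched below in Python rather than Perl. This is an illustration and not one of the original scripts: only the “mm-hmm” to “mmhm” mapping comes from the slide, and the other mapping, the vocabulary and the function names are hypothetical.

```python
# Only "mm-hmm" -> "mmhm" is from the slide; "uh-huh" -> "uhhuh" is an assumed example.
CANONICAL = {"mm-hmm": "mmhm", "uh-huh": "uhhuh"}

def unify_notation(intervals, canonical=CANONICAL):
    """Replace known spelling variants by their canonical form."""
    return [(s, e, canonical.get(w, w)) for s, e, w in intervals]

def find_errors(intervals, vocabulary):
    """Return (line_number, message) pairs for empty labels and unknown words."""
    errors = []
    for n, (_s, _e, w) in enumerate(intervals, start=1):
        if not w:
            errors.append((n, "missing label"))
        elif w not in vocabulary:
            errors.append((n, f"unknown word: {w}"))
    return errors

raw = [
    (0.0, 0.3, "mm-hmm"),
    (0.3, 0.6, ""),
    (0.6, 1.0, "lino"),
    (1.0, 1.4, "lion"),
]
unified = unify_notation(raw)
errors = find_errors(unified, vocabulary={"mmhm", "lion"})
```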
Back-up!
• Back up wav files only once (too heavy), in different places (DVD, 3+ computers).
• Back up everything else (plain text: light) periodically and automatically.
    • Configure “cron” to make a backup copy every 8 hours.
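A periodic backup of the light files could be as simple as the sketch below, which archives everything except the wav files into a timestamped tar.gz. The archive name, the directory layout and the script path in the crontab comment are hypothetical, and the demo runs on temporary directories.

```python
import os
import tarfile
import tempfile
import time

# A crontab entry such as the following would run a script every 8 hours
# (the script path is hypothetical):
#   0 */8 * * * python3 /path/to/backup_annotations.py

def backup_annotations(data_dir, backup_dir, skip_ext=".wav"):
    """Archive every file under data_dir except audio; return the archive path.

    backup_dir should live outside data_dir so the archive never includes itself.
    """
    os.makedirs(backup_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = os.path.join(backup_dir, f"games-{stamp}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for root, _dirs, files in os.walk(data_dir):
            for name in files:
                if name.endswith(skip_ext):
                    continue  # wav files are heavy and backed up only once
                full = os.path.join(root, name)
                tar.add(full, arcname=os.path.relpath(full, data_dir))
    return archive

with tempfile.TemporaryDirectory() as data, tempfile.TemporaryDirectory() as dest:
    for name in ("s08.cards.3.B.wav", "s08.cards.3.B.words"):
        with open(os.path.join(data, name), "w") as f:
            f.write("demo\n")
    archive = backup_annotations(data, dest)
    with tarfile.open(archive) as tar:
        archived = tar.getnames()
```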
Timeline
[Figure: tasks laid out along a time axis: design+implem., orthographic tier, prosody (ToBI), cue words, turn switches]
• Orthographic tier first!