Bridging the Gap between Linguists & Technology Developers:
Large-Scale, Sociolinguistic Annotation for Dialect and
Speaker Recognition*
Christopher Cieri1, Stephanie Strassel1, Meghan Glenn1,
Reva Schwartz2, Wade Shen3, Joseph Campbell3
1. Linguistic Data Consortium
3600 Market Street, Suite 810
Philadelphia, PA 19104
{ccieri, strassel, mlglenn}@ldc.upenn.edu
2. United States Secret Service
Washington, DC
[email protected]
3. MIT Lincoln Laboratory
244 Wood Street
Lexington, MA 02421
{swade, jpc}@ll.mit.edu
* This work is sponsored by the Department of Homeland Security under Air Force Contract FA8721-05-C-0002. Opinions, interpretations,
conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government
LREC 2008, May 26 – June 1, Marrakesh
Introduction to Phanotics
Increased interest in speaker recognition community in high-level features that
abstract from the acoustic signal.
lexical choice, presence of idiomatic expressions, syntactic structures
Forensic applications require robustness to channel differences
channel adaptation and the
identification of features inherently robust to channel difference
Language recognition community increasingly targets mutually intelligible dialects,
not just languages
Decades of research in dialectology suggest that high-level features can
enable systems to cluster speakers according to the dialects they speak.
Phanotics (Phonetic Annotation of Typicality in Conversational Speech) seeks
to
identify high-level features characteristic of American dialects
annotate a corpus for these features
use the data to develop dialect recognition systems
use the categorization to create better models for speaker recognition
Sponsored by United States Secret Service
MIT Lincoln Laboratory coordinates effort and develops the systems
Linguists from Arizona State and Old Dominion universities consult on dialectal
phenomena
LDC and Appen Pty Ltd of Australia annotate data provided by LDC and others
Annotation Approach
Annotating large corpora for many high-level features is impractical without
existing data
annotations
technologies that simplify the annotator’s task
Phanotics uses data orthographically transcribed to serve as a guide
to potential loci for the features sought
orthographic transcripts, a pronouncing lexicon and a forced aligner generate a
putative, time-aligned phonetic transcription that
imagines that the speaker’s utterances were standard.
high-level features of interest described as deviations from standard
pronunciation
loci in which actual pronunciation differs from putative standard are
potential high-level features
Since
complete phonetic transcription is cost-prohibitive
automatic phonetic transcription is not adequately accurate
we lack dialect studies for every difference one might encounter
we do not count deviations directly but allow the technologies to guide
human annotators to expected features.
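The first step of this approach, producing a putative standard pronunciation from an orthographic transcript, can be sketched as follows. This is a minimal illustration only: the lexicon entries, the phone symbols and the function name `putative_transcription` are assumptions made for the example, and the forced-alignment step that adds timing is omitted.

```python
# Toy pronouncing lexicon (hypothetical entries, ARPAbet-like symbols).
LEXICON = {
    "he": ["HH", "IY"],
    "left": ["L", "EH", "F", "T"],
    "the": ["DH", "AH"],
    "car": ["K", "AA", "R"],
}


def putative_transcription(words, lexicon):
    """Pair each word with the pronunciation the lexicon predicts.

    The result "imagines" that every word was pronounced in the standard
    way; real pronunciations are checked later by human annotators.
    """
    result = []
    for word in words:
        # Words missing from the lexicon get an empty phone list here.
        phones = lexicon.get(word.lower(), [])
        result.append((word, list(phones)))
    return result


transcript = "He left the car".split()
putative = putative_transcription(transcript, LEXICON)
for word, phones in putative:
    print(word, " ".join(phones))
```

Each (word, phones) pair is a place where the actual audio can later be compared against the putative standard form.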
Requirements
Requires natural speech from speakers of target dialects
Initial focus on distinguishing African American Vernacular English (AAVE)
from all other dialects of American English (non-AAVE)
plan to investigate other American dialects later
Selected data collected to minimize the effect of observation
recordings of subjects engaged in conversations
Project requires subjects categorized according to the dialect spoken.
Since goal is to establish typicality of features by dialect,
categorization based on something other than features themselves
relied on self-reported metadata
AAVE
native speakers of American English
born and raised in the United States
ethnically African American
Non-AAVE
American English speakers of other ethnicities
Subjects in either pool who later appear mis-categorized are removed.
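The pool definitions above amount to a filter over self-reported metadata. The sketch below shows one way such a filter might look; the field names (`speaks_american_english`, `native_speaker`, `born_and_raised_in_us`, `ethnicity`) are hypothetical and do not come from the Mixer or Fisher metadata schemas.

```python
def assign_pool(subject):
    """Assign a subject to 'AAVE' or 'non-AAVE' from self-reported metadata.

    Returns None when the subject fits neither description, so the
    categorization never depends on the linguistic features under study.
    """
    if not subject.get("speaks_american_english"):
        return None
    ethnicity = subject.get("ethnicity")
    if ethnicity is None:
        return None  # no self-report, so the subject cannot be categorized
    if ethnicity == "African American":
        if subject.get("native_speaker") and subject.get("born_and_raised_in_us"):
            return "AAVE"
        return None
    return "non-AAVE"


# Hypothetical subject records, for illustration only.
subjects = [
    {"speaks_american_english": True, "native_speaker": True,
     "born_and_raised_in_us": True, "ethnicity": "African American"},
    {"speaks_american_english": True, "native_speaker": True,
     "born_and_raised_in_us": True, "ethnicity": "Other"},
    {"speaks_american_english": True, "native_speaker": True,
     "born_and_raised_in_us": True},
]
pools = [assign_pool(s) for s in subjects]
```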
Data Selection
Mixer Corpora
conversational telephone speech (CTS) from LDC; supports robust speaker recognition (SR) development
subjects provided age, sex, occupation, cities born/raised, ethnicity
subjects completed
>=10 six-minute calls
speaking to other subjects whom they typically did not know
about assigned topics
Bilinguals in Arabic, Mandarin, Russian, and Spanish used those languages &
English
7% of calls made in cross-channel recording room (8+ microphones on one side of call)
calls audited for topic and audio quality but not generally transcribed
Although not designed for the current effort, includes self-reported ethnicity.
Pool contains speakers of multiple American English dialects who
categorized themselves as African American and other ethnicities
126 Mixer calls transcribed by Phanotics project
35 include conversations between two AAVE speakers
91 include conversations between one AAVE speaker and one non-AAVE speaker
Data Selection
Fisher Corpus
collected at LDC to support STT development within DARPA EARS
subjects provided age, sex, native language, and the cities where they were born
and raised
subjects completed 1-25 10-minute calls, speaking to other participants, whom they
typically did not know, about assigned topics
calls audited for topic and quality
verbatim, time-aligned orthographic transcripts were produced
lacks crucial information on the ethnicity of the speaker
but some subjects were LDC employees, their family, friends, and colleagues
small number (171) could be assigned to an ethnic category after the fact
StoryCorps® Griot Initiative
funded by Corporation for Public Broadcasting in US
one-year effort to record one-hour interviews of African Americans.
nine recording locations open for up to six weeks each
subjects interview friends and family on topics of their choice
potential users receive instructions on conducting good interviews; trained facilitator
present
participants receive a free copy of their interview; other copies are archived and
distributed
StoryCorps provides Phanotics with selected interviews in exchange for transcripts
Sociolinguistic Interviews
recorded and contributed by researchers working in the United States
variable quality; being reviewed for potential use
Transcription
Most audio lacked transcripts; LDC designed spec for this project.
similar to Fisher Quick Transcription specification
emphasizes speed and accuracy.
annotators segment speech at sentence level
sentences further segmented if >8 seconds; >0.5 seconds internal silence
segments overlap; audio containing no speech left un-segmented
standard orthography, case, punctuation (period, question mark, comma)
-- marks incomplete sentences and restarts; - marks incomplete words
proper names, acronyms, letter strings capitalized
uttered numbers written as words, not as strings of digits
limited set of standard contractions is used
non-standard contractions (‘cause for because) written as the full word
obviously mispronounced, idiosyncratic words tagged with ‘+’
no other attempt made to mark dialectal pronunciation
this is accomplished in the annotation phase
limited set of non-lexemes (um, uh) used for filled pauses
speech errors transcribed as produced
limited time spent transcribing disfluencies since these will be rejected
background noises not marked; limited set of markers for speaker noises
transcribers indicate low confidence with double parentheses (()).
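The markup conventions above are simple enough to check mechanically. The sketch below collects the markers found in one transcript line. It assumes details the slide does not state: that markers are separated from neighboring tokens by whitespace, that `-` trails a partial word, and that `+` precedes a mispronounced word. The function name `summarize_markup` and the example line are invented for illustration.

```python
import re

# Low-confidence regions are wrapped in double parentheses.
UNCERTAIN = re.compile(r"\(\((.*?)\)\)")


def summarize_markup(line):
    """Collect the transcription-convention markers found in one line."""
    uncertain = [span.strip() for span in UNCERTAIN.findall(line)]
    # Drop the double parentheses so their contents tokenize normally.
    plain = UNCERTAIN.sub(lambda m: m.group(1), line)
    tokens = plain.split()
    return {
        "uncertain": uncertain,
        "restarts": sum(1 for t in tokens if t == "--"),
        # A single trailing hyphen marks a partial word; "--" is a restart.
        "partial_words": [t for t in tokens
                          if t.endswith("-") and not t.endswith("--")],
        "mispronounced": [t[1:] for t in tokens
                          if t.startswith("+") and len(t) > 1],
    }


example = "I was go- -- we went to the +liberry ((last week))"
summary = summarize_markup(example)
```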
Feature Annotation
Goal: identify features that distinguish dialect from standard
features described as rules that change standard into non-standard
rules apply variably according to internal and external constraints
lexical identity, morphology of affected word, position within sentence,
phonological environment, functional effect of change (for example
whether it neutralizes a distinction between two words), the age, sex,
socioeconomic class of speakers, dialects they speak
Examples
reduction of consonant clusters in final position
left => lef’, missed => miss
deletion of r, l, w
car => ca’, palm => pa’m, young ones => young ‘uns
change of the voiced and voiceless interdental fricatives into stops
bother => boda’
Data preparation and customized tools simplify the annotation process
Rules specified as a => b/x_y
a becomes b when preceded by x and followed by y
input+environment, “xay”, constitute search term
input+output a=>b constitute a question to be answered by human
Did the subject say xay or xby?
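The rule notation a => b/x_y maps directly onto a search over the putative phone sequence. The following sketch finds every locus where "xay" occurs and builds the question posed to the annotator. The `Rule` class, the `find_loci` function, the phone symbols and the use of `#` for an utterance boundary are all assumptions made for this example, not the project's actual tooling.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Rule:
    a: str                    # input segment
    b: str                    # output segment ("" models deletion)
    x: Optional[str] = None   # left context; None matches anything
    y: Optional[str] = None   # right context; None matches anything

    def question(self) -> str:
        x, y = self.x or "", self.y or ""
        return f"Did the subject say {x}{self.a}{y} or {x}{self.b}{y}?"


def find_loci(phones: Sequence[str], rule: Rule) -> List[int]:
    """Return indices i where phones[i] is the rule input in context x_y."""
    loci = []
    for i, phone in enumerate(phones):
        if phone != rule.a:
            continue
        # "#" stands for the edge of the sequence (assumed convention).
        left = phones[i - 1] if i > 0 else "#"
        right = phones[i + 1] if i + 1 < len(phones) else "#"
        if rule.x is not None and left != rule.x:
            continue
        if rule.y is not None and right != rule.y:
            continue
        loci.append(i)
    return loci


# Final-cluster reduction as in "left => lef'": T deleted after F at the edge.
final_t = Rule(a="T", b="", x="F", y="#")
loci = find_loci(["L", "EH", "F", "T"], final_t)
print(loci, final_t.question())
```

Only the loci returned by the search need human attention, which is what keeps the annotator's task small relative to full phonetic transcription.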
Feature Annotation
SPAAT (Super Phonetic Annotation and Analysis Tool) designed for
rapid annotation and analysis
for each feature, presents list of regions of interest (ROI) where rule may
have applied
since transcript & audio previously forced-aligned, annotator can listen to
the audio with small amount of preceding and following context
Annotator’s job is to decide whether or not the rule has applied.
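Because each region of interest carries start and end times from forced alignment, the clip played to the annotator can be computed by padding those times. The sketch below is illustrative only: the function name `playback_window` and the default of 0.5 seconds of context are assumptions, not SPAAT's actual behavior.

```python
def playback_window(roi_start, roi_end, context=0.5, audio_duration=None):
    """Pad an ROI's aligned time span (seconds) with context on both sides.

    The window is clipped to the start of the recording and, when the
    total duration is known, to its end.
    """
    if roi_end < roi_start:
        raise ValueError("roi_end must not precede roi_start")
    start = max(0.0, roi_start - context)
    end = roi_end + context
    if audio_duration is not None:
        end = min(end, audio_duration)
    return start, end


window = playback_window(1.0, 1.25)
clipped = playback_window(0.25, 9.75, context=0.5, audio_duration=10.0)
```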
Initial Results
average time to annotate an ROI ranges 15-25
Approach to measuring inter-annotator agreement
distinguishes initial agreement, measured at the beginning of the effort
to assess the difficulty of a task,
from measures repeated after thorough documentation has been created and
annotators have undergone rigorous training, testing and selection
Initial inter-annotator agreement varies by rule, rule type, annotator
and annotator training
absolute average initial agreement across five annotators, all rules was
74.49% on three-way decision where a feature is annotated as present,
intermediate or absent
converted to two-way decision (feature is present versus intermediate +
absent) initial agreement climbs to 85.54%
Pairwise agreement by chance in three-way and two-way decisions is,
respectively, 11.1% and 25%
initial two-way agreement rates were 83.81% for rules involving
substitutions and 91.95% for rules involving reductions and insertions.
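The agreement figures above can be illustrated with a small calculation. The sketch below assumes that agreement is the mean, over all annotator pairs, of the fraction of ROIs given the same label; the slide does not say exactly how the average was computed. The toy labels are invented and are not project data. The chance figures quoted above equal 1/9 and 1/4, which is what 1/k^2 gives for k = 3 and k = 2: the probability that two annotators labeling uniformly at random both choose one particular category. Under the same uniform assumption, the probability of agreeing on any category would be 1/k.

```python
from itertools import combinations


def pairwise_agreement(labels_by_annotator):
    """Mean fraction of items on which annotator pairs give the same label."""
    annotators = list(labels_by_annotator)
    if len(annotators) < 2:
        raise ValueError("need at least two annotators")
    n_items = len(labels_by_annotator[annotators[0]])
    rates = []
    for a, b in combinations(annotators, 2):
        la, lb = labels_by_annotator[a], labels_by_annotator[b]
        if len(la) != n_items or len(lb) != n_items:
            raise ValueError("all annotators must label the same items")
        same = sum(1 for u, v in zip(la, lb) if u == v)
        rates.append(same / n_items)
    return sum(rates) / len(rates)


def collapse(labels):
    """Three-way to two-way: present versus (intermediate + absent)."""
    return ["present" if lab == "present" else "not present" for lab in labels]


def chance_both_pick_given_label(k):
    """Two uniform random annotators both pick one particular label: 1/k^2."""
    return 1 / k ** 2


def chance_agree_on_any_label(k):
    """Two uniform random annotators pick the same label, whichever: 1/k."""
    return 1 / k


# Hypothetical toy labels for three annotators and four ROIs.
toy = {
    "A": ["present", "intermediate", "absent", "present"],
    "B": ["present", "absent", "absent", "present"],
    "C": ["present", "intermediate", "absent", "absent"],
}
three_way = pairwise_agreement(toy)
two_way = pairwise_agreement({k: collapse(v) for k, v in toy.items()})
```

On the toy data, merging intermediate with absent raises agreement, the same direction as the change from 74.49% to 85.54% reported above.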
Team now working to increase inter-annotator agreement (IAA)
expanding training program, documentation to include audio examples
decision: form is standard, non-standard, intermediate, unrelated to rule,
indeterminate, ROI is mistaken
creating a small gold standard
Summary
Project connects sociolinguistics and HLT
Seeks to determine typicality of high level features in distinguishing
dialect for forensic purposes
Focuses initially on AAVE; later on other dialects of American English
Uses existing audio from CTS and interviews
Creates transcripts, audio-transcript time-alignments
Combination of these with SPAAT speeds annotation
Initial inter-annotator agreement encouraging
Modifications of spec, training, tool expected to increase IAA
Fisher audio and transcripts already available in LDC’s Catalog
LDC2005S13 Fisher English Training Part 2, Speech
LDC2005T19 Fisher English Training Part 2, Transcripts
LDC2004S13 Fisher English Training Speech Part 1 Speech
LDC2004T19 Fisher English Training Speech Part 1 Transcripts
Mixer audio in queue
StoryCorps Griot and Sociolinguistic Interviews under negotiation
To be distributed after use in the program
Mixer Transcripts
Annotations
possibly SPAAT