Introduction to BLaRKs
Helmer Strik
Dept. of Linguistics
Centre for Language and Speech Technology (CLST)
Radboud University Nijmegen, the Netherlands
Introduction: Background
BLaRK: Basic Language Resources Kit
NTU (Nederlandse Taalunie, the Dutch Language Union): define the BLaRK for Dutch
(more details in next presentation)
How to define the Basic Language Resources for a language, for a given context?
Basic Language Resources for Dutch, in general
Basic Language Resources for Dutch, handicapped
Basic Language Resources for South Africa (SA)
Also for many other languages
Cape Town, 24-11-2008
BLaRK: Basic Language Resources Kit
Components:
Data: sets of language data and descriptions in machine-readable form
Modules (or semi-products): the basic software components of HLT applications
Applications: classes of applications rather than specific applications or products
2 matrices:
1. Modules x Data
2. Applications x Modules
→ BLaRK
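The two matrices can be sketched as nested dictionaries. The block below is only an illustration: the scores, the function names, and the rule for chaining the matrices (take the weaker link along a module path, then the strongest path) are assumptions, not values or methods from the BLaRK report.

```python
# Hypothetical importance scores (0, 1 or 2 pluses), for illustration only;
# these are NOT the values from the BLaRK matrices.
MODULES_X_DATA = {
    "Token detection": {"Unannotated corpora": 1, "Monolingual lexicons": 1},
    "Lemmatising": {"Annotated corpora": 2, "Monolingual lexicons": 2},
    "Acoustic models": {"Speech corpora": 2},
}
APPLICATIONS_X_MODULES = {
    "Dialogue systems": {"Acoustic models": 2, "Lemmatising": 1},
    "Information access": {"Token detection": 2, "Lemmatising": 2},
}


def module_relevance(app_x_mod):
    """Sum each module's scores over all applications."""
    totals = {}
    for modules in app_x_mod.values():
        for module, score in modules.items():
            totals[module] = totals.get(module, 0) + score
    return totals


def data_for_application(app, app_x_mod, mod_x_data):
    """Chain both matrices: which data does an application depend on?

    Assumed combination rule: a data type matters to an application as much
    as the weaker link (min) along its strongest module path (max).
    """
    needed = {}
    for module, app_score in app_x_mod[app].items():
        for data, data_score in mod_x_data.get(module, {}).items():
            link = min(app_score, data_score)
            needed[data] = max(needed.get(data, 0), link)
    return needed


print(module_relevance(APPLICATIONS_X_MODULES))
print(data_for_application("Dialogue systems", APPLICATIONS_X_MODULES, MODULES_X_DATA))
```

Keeping the matrices sparse (only non-zero cells are stored) mirrors the 0, 1, or 2 quantification: a missing entry simply means 0.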
[Matrix slide: Modules scored against Data and against Applications. The individual cell values (+ / ++) could not be recovered from the slide.]
Modules, Language Technology: Grapheme-phoneme conversion; Token detection; Sentence boundary detection; Name recognition; Spelling correction; Lemmatising; Morphological analysis; Morphological synthesis; Word sort disambiguation; Parsers and grammars; Shallow parsing; Constituent recognition; Semantic analysis; Referent resolution; Word meaning disambiguation; Pragmatic analysis; Text generation; Language dependent translation
Modules, Speech Technology: Complete speech recognition; Acoustic models; Language models; Pronunciation lexicon; Robust speech recognition; Non-native speech recognition; Speaker adaptation; Lexicon adaptation; Prosody recognition; Complete speech synthesis; Allophone synthesis; Di-phone synthesis; Unit selection; Prosody prediction for Text-to-Speech; Automatic phonetic transcription; Automatic phonetic segmentation; Phoneme alignment; Distance calculation of phonemes; Speaker identification; Speaker verification; Speaker tracking; Language identification; Dialect identification; Confidence measures; Utterance verification
Data: monolingual lexicons; multilingual lexicons; thesauri; annotated corpora; unannotated corpora; speech corpora; multilingual corpora; multimodal corpora; multimedia corpora
Applications: CALL; access control; speech input; speech output; dialogue systems; document production; information access; translation
Quantify: 0, 1, or 2 (+'s)
Field survey & Expert opinions
BLaRK
Language technology
• Modules
Robust modular text preprocessing
Morphological analysis and morphosyntactic disambiguation
Robust syntactic analysis
Aspects of semantic analysis (word meaning and reference)
• Data
Monolingual lexicon
Annotated corpus of written Dutch
Benchmarks for evaluation
BLaRK
Speech technology
• Modules
Automatic speech recognition
Speech synthesis system
Tools for annotation of speech corpora
Confidence measures and utterance verification
Identification (speaker, language, dialect)
• Data
Monolingual speech corpora for specific applications
Multilingual speech corpora
Multimodal/medial speech corpora
Benchmarks for evaluation
From BLaRK to priority lists
1. BLaRK: Basic Language Resources Kit
2. Inventory & Evaluation
3. Priority lists
(Diagram: BLaRK → inventory → priority)
2. Inventory & Evaluation
Inventory:
Which components in the BLaRK are available?
Can be bought
Freely obtainable
Reusable
Of sufficient quality
Evaluation:
Are the available components of sufficient quality?
Checklist approach or formal evaluation
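A checklist-style inventory entry could be sketched as below. The class, its field names, the way the criteria are combined, and the sample entries are all hypothetical; the slide only names the questions asked about each component.

```python
from dataclasses import dataclass


@dataclass
class InventoryEntry:
    """One BLaRK component checked against the inventory questions."""
    component: str
    can_be_bought: bool
    freely_obtainable: bool
    reusable: bool
    sufficient_quality: bool

    def available(self) -> bool:
        # Assumption: a component counts as available if it can be
        # obtained at all, whether commercially or for free.
        return self.can_be_bought or self.freely_obtainable

    def usable_as_is(self) -> bool:
        # Assumption: ready for reuse only if every remaining check passes.
        return self.available() and self.reusable and self.sufficient_quality


# Hypothetical entries, for illustration only.
entries = [
    InventoryEntry("Component A", can_be_bought=True, freely_obtainable=False,
                   reusable=True, sufficient_quality=True),
    InventoryEntry("Component B", can_be_bought=False, freely_obtainable=True,
                   reusable=False, sufficient_quality=True),
    InventoryEntry("Component C", can_be_bought=False, freely_obtainable=False,
                   reusable=False, sufficient_quality=False),
]

gaps = [e.component for e in entries if not e.usable_as_is()]
print(gaps)
```

A formal evaluation would replace the single `sufficient_quality` flag with measured results on a benchmark.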
Modules
Grapheme-phoneme conversion
Token detection
Sentence boundary detection
Name recognition
Spelling correction
Lemmatising
Morphological analysis
Morphological synthesis
Word sort disambiguation
Parsers and grammars
Shallow parsing
Constituent recognition
Semantic analysis
Referent resolution
Word meaning disambiguation
Pragmatic analysis
Text generation
Language dependent translation
Complete speech recognition
Acoustic models
Language models
Pronunciation lexicon
Robust speech recognition
Non-native speech recognition
Speaker adaptation
Lexicon adaptation
Prosody recognition
Complete speech synthesis
Allophone synthesis
Di-phone synthesis
Unit selection
Prosody prediction for Text-to-Speech
Automatic phonetic transcription
Automatic phonetic segmentation
Phoneme alignment
Distance calculation of phonemes
Speaker identification
Speaker verification
Speaker tracking
Language identification
Dialect identification
Confidence measures
Utterance verification
Data
Unannotated corpora
Annotated corpora
Speech corpora
Multilingual corpora
Multimodal corpora
Multimedia corpora
Test corpora
Monolingual lexicons
Multilingual lexicons
Thesaurus
Availability: quantified on a scale of 1-10, based on a field survey & expert opinions
[Availability scores as extracted from the slide; their alignment with the Modules and Data rows was lost: 8 9 3 4 3 9 / 7 9 7 3 2 5 3 2 2 1 3 3 4 8 3 5 2 2 2 2 2 6 7 6 1 3 3 5 8 8 2 2 2 2 2 2 2 9 9 5 4 3 1 1 1 8 6 4]
3. Priority lists
The prioritisation was based on the following requirements:
The components should currently be unavailable, inaccessible, or of insufficient quality.
The components should be relevant for a large number of applications.
Developing the components should be possible in the short term.
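The three requirements above can be sketched as a filter followed by a ranking. Everything specific here is an assumption made for illustration: the `Component` fields, the availability threshold of 5 on the 1-10 scale, the use of relevance as the sort key, and the sample components.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    availability: int   # 1-10 scale, higher means more readily available
    relevance: int      # hypothetical: summed 0-2 scores over all applications
    short_term: bool    # can it be developed in the short term?


def prioritise(components, max_availability=5):
    """Keep components that are lacking and feasible, most relevant first."""
    candidates = [
        c for c in components
        if c.availability <= max_availability and c.short_term
    ]
    # Rank by relevance (descending); break ties by lower availability.
    return sorted(candidates, key=lambda c: (-c.relevance, c.availability))


# Hypothetical components, for illustration only.
COMPONENTS = [
    Component("Module A", availability=2, relevance=12, short_term=True),
    Component("Module B", availability=9, relevance=14, short_term=True),
    Component("Module C", availability=3, relevance=6, short_term=True),
    Component("Module D", availability=1, relevance=15, short_term=False),
]

for c in prioritise(COMPONENTS):
    print(c.name, c.relevance)
```

In this sketch Module B drops out because it is already available and Module D because it is not feasible in the short term, which leaves Module A ahead of Module C.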
Consensus, broad support
Report version 1
Feedback
Academia & industry
Sent to the Dutch-Flemish HLT field (1000 sites)
Workshop 15/11/2001
→ Report version 2, final version
From BLaRK to priority lists
1. BLaRK
2. Inventory & Eval.
3. Priority lists
Report 1
Feedback:
• HLT Field
• Workshop
1. BLaRK
2. Inventory & Eval.
3. Priority lists
Report 2
Introduction: Background
BLaRK: Basic Language Resources Kit
How to define the Basic Language Resources for a language, for a given context?
Basic Language Resources for Dutch, in general
Basic Language Resources for Dutch, handicapped
Basic Language Resources for South Africa (SA)
Also for many other languages
http://lands.let.kun.nl/~strik/BLaRK.html
Questions?