Radboud University Nijmegen

TQE: Transcription Quality Evaluation
A CLARIN-NL project
Radboud University Nijmegen
Institute for Dutch Lexicology
Max Planck Institute for Psycholinguistics
TQE: practical information
• Duration: 01/04/2010 – 01/07/2011
• Type: Demonstrator Project
• Project team:
  o CLST: Centre for Language and Speech Technology
    Helmer Strik (coord.), Joost van Doremalen, Eric Sanders,
    Catia Cucchiarini, Robin Oostrum, Ferdy Hubers
  o INL: Instituut voor Nederlandse Lexicologie
    Remco van Veenendaal, Laura van Eerten
  o MPI: Max Planck Institute for Psycholinguistics
    Daan Broeder, Tobias van Valkenhoef, Peter Withers
• CLARIN centre:
  o MPI: Max Planck Institute for Psycholinguistics
    Daan Broeder
Automatic Transcription Quality Evaluation
• Input:
  o Audio signals
  o Phone(tic) transcriptions
• Output:
  o For each phone: a TQE measure
• How:
  o Audio and phonetic transcriptions are aligned
  o Phone boundaries are derived
  o For each phone a TQE measure is determined: a confidence
    measure (e.g. ranging from 0 to 100%) indicating how well
    the phone and its audio segment 'fit together', i.e. how
    good the transcription is
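As an illustration of the last step, the sketch below computes a per-phone score on a 0-100 scale from an existing alignment. It is not the TQE project's actual measure, which these slides do not specify: the frame-level posterior probabilities, the function name `phone_confidences`, and the choice of averaging the posterior of the transcribed phone over its segment are all assumptions made for the example.

```python
def phone_confidences(frame_posteriors, segments):
    """Return a (label, score) pair per aligned phone, score in 0-100.

    frame_posteriors: one dict per audio frame, mapping a phone label to
        its posterior probability (hypothetical acoustic-model output).
    segments: (label, start_frame, end_frame) tuples from the alignment,
        with end_frame exclusive.
    """
    scores = []
    for label, start, end in segments:
        frames = frame_posteriors[start:end]
        if not frames:
            raise ValueError(f"empty segment for phone {label!r}")
        # Assumed measure: mean posterior of the transcribed phone
        # over all frames inside its boundaries.
        mean_posterior = sum(f.get(label, 0.0) for f in frames) / len(frames)
        scores.append((label, round(100.0 * mean_posterior, 1)))
    return scores


if __name__ == "__main__":
    # Toy data: four frames, two phones of two frames each.
    posteriors = [
        {"a": 0.9, "b": 0.1},
        {"a": 0.9, "b": 0.1},
        {"a": 0.2, "b": 0.8},
        {"a": 0.2, "b": 0.8},
    ]
    alignment = [("a", 0, 2), ("b", 2, 4)]
    for label, score in phone_confidences(posteriors, alignment):
        print(label, score)
```

A low score for a phone suggests that the transcribed label and the audio in that segment do not match well, which is the kind of signal a transcription quality check would flag for manual review.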
MPI version
CLST development version
Survey: 2a) File formats
Answer    Count    %
WAV       30       34.88
OGG        6        6.98
AIFF      13       15.12
MP3       16       18.60
MP4        5        5.81
FLAC       5        5.81
ALAW       4        4.65
ULAW       3        3.49
other      4        4.65
[Bar chart: count per file format]
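The percentages on the survey slides are consistent with each count being divided by the total number of answers given, not by the number of respondents. For question 2a that total is 86. The short sketch below reproduces the 2a figures under that reading; the function name `percentages` and the English label "other" are chosen for the example.

```python
def percentages(counts):
    """Map each answer to its share of all answers, in percent."""
    total = sum(counts.values())
    return {answer: round(100.0 * n / total, 2) for answer, n in counts.items()}


# Counts for survey question 2a (file formats), taken from the slide.
counts_2a = {
    "WAV": 30, "OGG": 6, "AIFF": 13, "MP3": 16, "MP4": 5,
    "FLAC": 5, "ALAW": 4, "ULAW": 3, "other": 4,
}

if __name__ == "__main__":
    for answer, pct in percentages(counts_2a).items():
        print(f"{answer:<6} {pct:6.2f}")
```

Because respondents could apparently tick several answers, the percentages describe the distribution of answers and need not equal the share of respondents using a given format.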
Survey: 2c) Recording precision (bit depth)
Answer    Count    %
8 bit      3        7.89
12 bit     3        7.89
16 bit    24       63.16
24 bit     8       21.05
[Bar chart: count per bit depth]
Survey: 3) Formats and standards for phonetic transcriptions
Answer             Count    %
SAMPA              23       28.40
X-SAMPA             6        7.41
IPA                25       30.86
CGN phoneme set     9       11.11
YAPA                3        3.70
Celex               7        8.64
LH+                 3        3.70
other               5        6.17
[Bar chart: count per transcription format]
Survey: 4) Software
Answer      Count    %
Praat       32       53.33
Audacity    10       16.67
CoolEdit     7       11.67
Audition     5        8.33
other        6       10.00
[Bar chart: count per software package]
Survey: 8) Interest in inclusion in the CLARIN infrastructure
Answer        Count    %
Yes           22       64.71
No            11       32.35
Don't know     1        2.94
[Bar chart: count per answer]
Survey: 9) Willing to supply metadata
Answer    Count    %
Yes       27       90
No         3       10
[Bar chart: count per answer]
Survey: 10) Current use of metadata format(s)
Answer         Count    %
OLAC            1        2.38
IMDI            4        9.52
CMDI            5       11.90
Dublin Core     4        9.52
TEI             4        9.52
None           21       50.00
Other           3        7.14
[Bar chart: count per metadata format]
Epilogue
• More information:
http://lands.let.ru.nl/~strik/research/TQE/
• Questions?