Transcript Slide 1
Recent Activities in
Chinese Corpus Consortium
Thomas Fang Zheng
Center for Speech and Language Technologies
Tsinghua National Laboratory for Information Science and Technology
Tsinghua University
http://www.CCCForum.org
2
Outline
Brief Introduction to CSLT
CCC History
CCC Co-founders and users
CCC Activities
Newly Created CCC Corpora
3
Center for Speech and Language Technologies (CSLT)
Founded in 1979 as the Speech Laboratory (under the Dept. of CS), the
second earliest in China
Joined the State Key Laboratory of Intelligent Technology and Systems (SKLits)
in 1999 and was renamed the Center for Speech Technology (CST); SKLits was
ranked A in all 3 nationwide competitions
Joined forces to form CSLT in Jan. 2007, with faculty members from the School of
Information Science and Technology (SIST):
CST, SKLits
Speech Group, Dept. EE
Speech-on-Chip Group, Dept. EE
NLP Group, Dept. CS
Future Information Technology Center, RIIT
Division of CSAI, Tsinghua National Lab for Info Sci. & Tech.
http://cslt.riit.tsinghua.edu.cn
4
Organization Chart
[Figure: organization chart]
Management: Advisory Board; Director; Assistants; Executive; Deputy - Students; Deputy - R&D; Assistant Director
Units: Financial Engineering Joint-Institute; Resources & Standard Lab; NLU Lab; intelligent-Search Lab; Speech-on-Chip Lab; VPR Lab; ASR Lab; VPR Joint-Lab
5
Administrative
Advisory Board:
Victor Zue (MIT, IEEE Fellow, NAE member)
B.-H. (Fred) Juang (GeorgiaTech, IEEE Fellow, NAE member)
William Byrne (Cambridge)
Dan Jurafsky (Stanford)
Richard Stern (CMU)
FANG Ditang (Tsinghua)
WU Wenhu (Tsinghua)
LIU Runsheng (Tsinghua)
Directors:
Director: Prof. Thomas Fang Zheng
Deputy Director: Assoc. Prof. XU Mingxing (Students)
Deputy Director: Assoc. Prof. XIA Yunqing (R&D)
Assistant Director: Dr. WU Xiaojun (Administration)
6
Faculty Members
Speech Processing:
Associate Professor: LIU Yi
Associate Professor: XIAO Xi
Associate Professor: XU Mingxing
Assistant Professor: LIANG Weiqian
Assistant Professor: OU Zhijian
Natural Language Processing:
Associate Professor: SUN Jiasong
Associate Professor: ZHOU Qiang
Assistant Professor: WU Xiaojun
Assistant Professor: XIA Yunqing
Research Focus
CCC History
9
d-Ear Technologies
CST, THU
CASS
HCI&MM Lab, THU
CLSP, JHU
Spoken Language Translation Lab, ATR
CUHK
COLIPS
10
The aim of the CCC is to provide corpora for Chinese ASR,
TTS, NLP, perception analysis, phonetics analysis,
linguistic analysis, and other related tasks.
The purposes of the CCC include:
Collecting and integrating existing Chinese speech and linguistic
corpus resources, and continuing creation of new such resources.
Integrating existing tools for the creation, transcription, and
analysis of Chinese speech and linguistic corpus resources,
improving their usability, and creating new tools.
Collecting, organizing and introducing the specifications and
standards for Chinese speech and language research and
development.
Promoting the exchange of Chinese speech and linguistic corpus
resources.
11
Currently, the board of CCC includes:
Council Members: William Byrne, Lianhong Cai, P. C. Ching,
Aijun Li, K. T. Lua, Satoshi Nakamura, Zhanjiang Song, Thomas
Fang Zheng
Council Chair: Dr. Thomas Fang Zheng, Center for Speech and
Language Technologies, Tsinghua University, China
Vice Council Chair: Dr. Satoshi Nakamura, NiCT/ATR, Japan
Standing Secretary: LIU Yi, CSLT, Tsinghua University
13
CCC Co-founder(s) & User(s)
14
Current resources
24 corpora
For ASR (14):
CHRD (Chinese Hotel Reservation Dialogue);
CADCC (Chinese Annotated Dialogue and Conversation Corpus);
CACSC (Cantonese Accent Chinese Speech Corpus);
CSTSC-Flight Corpus (Chinese Spontaneous Telephone Speech Corpus in the
flight enquiry and reservation domain);
CUCorpora (Cantonese spoken language corpora);
TRSC (500-People Telephone Read Speech Corpus);
TNDC (Telephone Name Dialing Corpus);
CASS Corpus (Chinese Annotated Spontaneous Speech);
WDCS Corpus (Wu-Dialectal Chinese Speech);
BIT-MobileSpeech (Mobile Phone Speech Corpus for Traffic Information Query);
BIT-MobileTalk (Mobile Phone Conversational Speech Corpus for Travel);
BIT-TonalName (Tonally Confusing Name Speech Corpus);
BIT-MonoSyllable (Mandarin Mono-Syllable Corpus);
BIT-TeleSpeech (Telephone Read Speech Corpus)
15
For VPR (5):
CCC-VPR3C2005 (CCC 3-Channel Corpus for Voiceprint Recognition);
CCC-VPR2C2005-1000X (CCC 2-Channel Corpus for Voiceprint Recognition 11 kHz);
CCC-VPR2C2005-1000/3000/6000 (CCC 2-Channel Corpus for Voiceprint
Recognition – 1,000/3,000/6,000 speakers);
CCC-VPR27C2006-50 (CCC 27-Channel Corpus for Voiceprint Recognition - 50
speakers);
CCC-VPR36C2006-100 (CCC 36-Channel Corpus for Voiceprint Recognition - 100
speakers);
For TTS (2):
ASCCD (Annotated Speech Corpus of Chinese Discourse);
CoSS-0 (TH Corpus of Speech Synthesis No. 0);
For acoustic analysis (2):
SCSC (Syllable Corpus of Standard Chinese);
WCSC (Word Corpus of Standard Chinese)
Text corpus (1):
PSCLT (Singapore Primary School Chinese Language Text)
More under development
16
Free resources (4):
Pinyin Syllable List (Complete list of 416 Chinese syllables);
Pinyin XIF List (An extended list of pinyin initials and finals);
Sampa-C Reference (Reference for the Chinese segmental labeling
convention Sampa-C);
O-COCOSDA 98-02 (O-COCOSDA conference proceedings for 1998
through 2002) -- Provided by Prof. Itahashi on behalf of O-COCOSDA;
Member resources (3):
Word List (An automatically generated wordlist of 50,000 Chinese
words with pinyin and count information);
CCC-VPR3C2005 (CCC 3-Channel Corpus for Voiceprint
Recognition);
CCC-VPR2C2005-1000 (CCC 2-Channel Corpus for Voiceprint
Recognition - 1000 speakers);
CCC Activities
The technical specification for automatic
voiceprint recognition (speaker
recognition), SJ/T 11380-2008, was
issued by the former Ministry of Information
Industry (MII) on Mar. 10, 2008.
The standard includes three parts:
Terms and definitions
Data exchange format
Application programming interfaces
Drafters:
CSLT, Tsinghua University (CCC co-founder)
d-Ear Technologies (CCC co-founder)
China Electronics Standardization Institute
(CESI)
Coming soon ...
SC2 on Human Biometric Applications
of TC100 on Security Protection Alarm
Systems, Standardization
Administration of China
(SAC/TC100/SC2)
17
18
Chinese minority language processing:
A special session on Chinese minority language
processing at National Conference on Man-Machine
Speech Communication of China (NCMMSC’2009) to be
held in Xinjiang, Aug. 2009 (http://www.ncmmsc.org)
Chairs: Prof. WANG Kunlun, Prof. Abdukirim
Turghunjan
Invited talks will cover the following languages:
Uigur (Uyghur);
Mongolian;
Tibetan;
……
Database collections from CASS, XJU, XJNU, ...
Newly Created Corpora
CCC-VPR2C2006-10000 (CCC 2-Channel Corpus for
Voiceprint Recognition 2006 - 10,000 speakers), now
finished
For channel-robust voiceprint recognition
Language – standard Chinese
Read speech
Telephone and mobile phone channels
8 kHz, 16-bit, PCM
10,000 speakers aged 18-23, each reads
1) a brief introduction (personal information), and
2) 40 Chinese sentences
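As a rough illustration of the recording format above, the following Python sketch checks whether an audio file is 8 kHz, 16-bit PCM. It rests on assumptions not stated on the slide: that recordings are stored as mono WAV files, and the helper names `matches_spec` and `bytes_for` are made up for this example.

```python
import io
import wave

SAMPLE_RATE = 8000   # Hz, from the slide
SAMPLE_WIDTH = 2     # bytes per sample (16-bit), from the slide
CHANNELS = 1         # assumption: mono recordings

def matches_spec(wav_file):
    """Return True if a WAV file (path or file object) is 8 kHz, 16-bit PCM, mono."""
    # The wave module only reads PCM-encoded WAV data, so a successful open
    # already implies PCM.
    with wave.open(wav_file, "rb") as w:
        return (w.getframerate() == SAMPLE_RATE
                and w.getsampwidth() == SAMPLE_WIDTH
                and w.getnchannels() == CHANNELS)

def bytes_for(seconds):
    """Raw audio payload size in bytes for a given duration at this format."""
    return int(seconds * SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)

# Build one second of silence in memory to exercise the check.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(CHANNELS)
    w.setsampwidth(SAMPLE_WIDTH)
    w.setframerate(SAMPLE_RATE)
    w.writeframes(b"\x00" * bytes_for(1.0))
buf.seek(0)
ok = matches_spec(buf)
```

At this format, one second of speech takes 16,000 bytes of raw audio, which gives a quick way to estimate storage per speaker once utterance lengths are known.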
19
Song’s Melody Corpus
For research on the Query by Humming (QBH) system
5,000 songs
Popular Chinese songs
English songs (e.g., Disney theme songs)
Stored in XML format
Transcribed by hand into “numbered musical notation” and
segmented into sentences
Transcriptions: pitch, duration, tag of sentence start
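The slide does not show the XML schema, so the following Python sketch only illustrates how records carrying pitch, duration, and a sentence-start tag could be read. The element name `note` and the attributes `pitch`, `duration`, and `sentence_start` are hypothetical, as is the sample record.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element and attribute names are assumptions, not the
# corpus's real schema. Pitches follow numbered musical notation (1-7).
SAMPLE = """
<song title="example">
  <note pitch="1" duration="0.5" sentence_start="1"/>
  <note pitch="2" duration="0.5"/>
  <note pitch="3" duration="1.0"/>
  <note pitch="5" duration="0.5" sentence_start="1"/>
  <note pitch="3" duration="1.5"/>
</song>
"""

def split_sentences(xml_text):
    """Group (pitch, duration) pairs into sentences using the start tag."""
    sentences = []
    for note in ET.fromstring(xml_text.strip()).iter("note"):
        # Open a new sentence at each start tag (or at the very first note).
        if note.get("sentence_start") == "1" or not sentences:
            sentences.append([])
        sentences[-1].append((int(note.get("pitch")), float(note.get("duration"))))
    return sentences

sentences = split_sentences(SAMPLE)
```

A Query by Humming system would typically match a hummed pitch contour against such per-sentence note sequences, which is why the sentence-start tag is worth keeping in the parsed form.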
20
CCC-TCT (Tsinghua Chinese Treebank)
For research on Chinese text parsing
Texts:
A balanced collection of journalistic, literary, academic, and
other documents published in the 1990s
Annotation:
Basic information: sentence-splitting, word segmentation, POS
tagging
Syntactic parse trees: each non-terminal is assigned two tags:
a constituent tag and a grammatical relation tag
Size:
1M Chinese words, about 45,000 sentences
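The two-tag annotation described above can be pictured with a small Python data structure. This is a minimal sketch, not the treebank's file format: the class `Node`, the tag names (`S`, `VP`, `SUBJ-PRED`, `VERB-OBJ`, `n`, `v`), and the bracketed output are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    label: str                       # POS tag (leaf) or constituent tag (non-terminal)
    relation: Optional[str] = None   # grammatical relation tag, non-terminals only
    word: Optional[str] = None       # surface word, leaves only
    children: List["Node"] = field(default_factory=list)

    def words(self):
        """Return the segmented words covered by this node, left to right."""
        if self.word is not None:
            return [self.word]
        return [w for child in self.children for w in child.words()]

    def bracketed(self):
        """Render the subtree; non-terminals show both of their tags."""
        if self.word is not None:
            return f"{self.word}/{self.label}"
        inner = " ".join(child.bracketed() for child in self.children)
        return f"[{self.label}-{self.relation} {inner}]"

# Hypothetical tags and sentence ("we study speech"), for illustration only.
tree = Node("S", "SUBJ-PRED", children=[
    Node("n", word="我们"),
    Node("VP", "VERB-OBJ", children=[
        Node("v", word="研究"),
        Node("n", word="语音"),
    ]),
])
rendered = tree.bracketed()
```

Keeping the relation tag separate from the constituent tag lets a parser be evaluated on phrase boundaries alone or on boundaries plus grammatical relations.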
CCC-CEB1 (Chinese Event Bank)
For research on Chinese event analysis
Texts:
Sentences extracted from TCT
Focusing on the basic event situations of possession
transfer, existence, and space and time transfer
Annotation:
Event verb senses: suitable sense codes defined in four
semantic lexicons, including a situation network, a
computer-oriented common-sense KB (HowNet), a Chinese thesaurus
dictionary (CiLin, Word Forest) and the Contemporary Chinese
dictionary (Xianhan, Modern Chinese).
Event role tags: 6 core or peripheral arguments of the specific
events evoked by the event verbs
Size:
about 1M Chinese characters
Thanks!
http://www.CCCForum.org