Transcript Slide 1

Recent Activities in
Chinese Corpus Consortium
Thomas Fang Zheng
Center for Speech and Language Technologies
Tsinghua National Laboratory for Information Science and Technology
Tsinghua University
http://www.CCCForum.org
2
Outline
Brief Introduction to CSLT
CCC History
CCC Co-founders and users
CCC Activities
Newly Created CCC Corpora
3
Center for Speech and Language Technologies (CSLT)
• Founded in 1979 as the Speech Laboratory (under Dept. CS), the second earliest in China
• Joined the State Key Laboratory of Intelligent Technology and Systems (SKLits) in 1999 and was renamed the Center for Speech Technology (CST); SKLits was ranked A in all three nationwide evaluations
• Joined forces to form CSLT in Jan. 2007, with faculty members from the School of Information Science and Technology (SIST):
  • CST, SKLits
  • Speech Group, Dept. EE
  • Speech-on-Chip Group, Dept. EE
  • NLP Group, Dept. CS
  • Future Information Technology Center, RIIT
  • Division of CSAI, Tsinghua National Lab for Info Sci. & Tech.
• http://cslt.riit.tsinghua.edu.cn
4
Organization Chart
[Organization chart. Recoverable labels: Advisory Board; Director; Executive Deputy; Deputy (Students); Deputy (R&D); Assistant Director; Assistants. Units: Financial Engineering Joint-Institute; Resources & Standard Lab; NLU Lab; intelligent-Search Lab; Speech-on-Chip Lab; VPR Lab; ASR Lab; VPR Joint-Lab]
5
Administrative
• Advisory Board:
  • Victor Zue (MIT, IEEE Fellow, NAE member)
  • B.-H. (Fred) Juang (Georgia Tech, IEEE Fellow, NAE member)
  • William Byrne (Cambridge)
  • Dan Jurafsky (Stanford)
  • Richard Stern (CMU)
  • FANG Ditang (Tsinghua)
  • WU Wenhu (Tsinghua)
  • LIU Runsheng (Tsinghua)
• Directors:
  • Director: Prof. Thomas Fang Zheng
  • Deputy Director: Assoc. Prof. XU Mingxing (Students)
  • Deputy Director: Assoc. Prof. XIA Yunqing (R&D)
  • Assistant Director: Dr. WU Xiaojun (Administration)
6
Faculty Members
• Speech Processing:
  • Associate Professor: LIU Yi
  • Associate Professor: XIAO Xi
  • Associate Professor: XU Mingxing
  • Assistant Professor: LIANG Weiqian
  • Assistant Professor: OU Zhijian
• Natural Language Processing:
  • Associate Professor: SUN Jiasong
  • Associate Professor: ZHOU Qiang
  • Assistant Professor: WU Xiaojun
  • Assistant Professor: XIA Yunqing
Research Focus
CCC History
9
• d-Ear Technologies
• CST, THU
• CASS
• HCI&MM Lab, THU
• CLSP, JHU
• Spoken Language Translation Lab, ATR
• CUHK
• COLIPS
10
• The aim of the CCC is to provide corpora for Chinese ASR, TTS, NLP, perception analysis, phonetics analysis, linguistic analysis, and other related tasks.
• The purposes of the CCC include:
  • Collecting and integrating existing Chinese speech and linguistic corpus resources, and continuing to create new ones.
  • Integrating existing tools for the creation, transcription, and analysis of Chinese speech and linguistic corpus resources, improving their usability, and creating new tools.
  • Collecting, organizing, and introducing specifications and standards for Chinese speech and language research and development.
  • Promoting the exchange of Chinese speech and linguistic corpus resources.
11
• Currently, the board of the CCC includes:
  • Council Members: William Byrne, Lianhong Cai, P. C. Ching, Aijun Li, K. T. Lua, Satoshi Nakamura, Zhanjiang Song, Thomas Fang Zheng
  • Council Chair: Dr. Thomas Fang Zheng, Center for Speech and Language Technologies, Tsinghua University, China
  • Vice Council Chair: Dr. Satoshi Nakamura, NiCT/ATR, Japan
  • Standing Secretary: LIU Yi, CSLT, Tsinghua University
13
CCC Co-founders and Users
14
Current resources
• 24 corpora
• For ASR (14):
  • CHRD (Chinese Hotel Reservation Dialogue);
  • CADCC (Chinese Annotated Dialogue and Conversation Corpus);
  • CACSC (Cantonese Accent Chinese Speech Corpus);
  • CSTSC-Flight Corpus (Chinese Spontaneous Telephone Speech Corpus in the flight enquiry and reservation domain);
  • CUCorpora (Cantonese spoken language corpora);
  • TRSC (500-People Telephone Read Speech Corpus);
  • TNDC (Telephone Name Dialing Corpus);
  • CASS Corpus (Chinese Annotated Spontaneous Speech);
  • WDCS Corpus (Wu-Dialectal Chinese Speech);
  • BIT-MobileSpeech (Mobile Phone Speech Corpus for Traffic Information Query);
  • BIT-MobileTalk (Mobile Phone Conversational Speech Corpus for Travel);
  • BIT-TonalName (Tonally Confusing Name Speech Corpus);
  • BIT-MonoSyllable (Mandarin Mono-Syllable Corpus);
  • BIT-TeleSpeech (Telephone Read Speech Corpus)
15
• For VPR (5):
  • CCC-VPR3C2005 (CCC 3-Channel Corpus for Voiceprint Recognition);
  • CCC-VPR2C2005-1000X (CCC 2-Channel Corpus for Voiceprint Recognition 11 kHz);
  • CCC-VPR2C2005-1000/3000/6000 (CCC 2-Channel Corpus for Voiceprint Recognition - 1,000/3,000/6,000 speakers);
  • CCC-VPR27C2006-50 (CCC 27-Channel Corpus for Voiceprint Recognition - 50 speakers);
  • CCC-VPR36C2006-100 (CCC 36-Channel Corpus for Voiceprint Recognition - 100 speakers);
• For TTS (2):
  • ASCCD (Annotated Speech Corpus of Chinese Discourse);
  • CoSS-0 (TH Corpus of Speech Synthesis No. 0);
• For acoustic analysis (2):
  • SCSC (Syllable Corpus of Standard Chinese);
  • WCSC (Word Corpus of Standard Chinese)
• Text corpus (1):
  • PSCLT (Singapore Primary School Chinese Language Text)
• More under development
16
• Free resources (4):
  • Pinyin Syllable List (Complete list of 416 Chinese syllables);
  • Pinyin XIF List (An extended list of pinyin initials and finals);
  • Sampa-C Reference (Reference for the Chinese segmental labeling convention Sampa-C);
  • O-COCOSDA 98-02 (O-COCOSDA conference proceedings for 1998 through 2002), provided by Prof. Itahashi on behalf of O-COCOSDA;
• Member resources (3):
  • Word List (An automatically generated wordlist of 50,000 Chinese words with pinyin and count information);
  • CCC-VPR3C2005 (CCC 3-Channel Corpus for Voiceprint Recognition);
  • CCC-VPR2C2005-1000 (CCC 2-Channel Corpus for Voiceprint Recognition - 1,000 speakers);
CCC Activities
• The Technology Specification for Automatic Voiceprint Recognition (Speaker Recognition) (SJ/T 11380-2008) was issued by the former Ministry of Information Industry (MII) on Mar. 10, 2008.
• The standard includes three parts:
  • Terminology and definitions
  • Data exchange format
  • Application programming interfaces
• Drafters:
  • CSLT, Tsinghua University (CCC co-founder)
  • d-Ear Technologies (CCC co-founder)
  • China Electronics Standardization Institute (CESI)
• Coming soon ...
  • SC2 on Human Biometric Applications, under TC100 on Security Protection Alarm Systems of the Standardization Administration of China (SAC/TC100/SC2)
17
18
Chinese minority language processing:
• A special session on Chinese minority language processing at the National Conference on Man-Machine Speech Communication of China (NCMMSC'2009), to be held in Xinjiang, Aug. 2009 (http://www.ncmmsc.org)
• Chairs: Prof. WANG Kunlun, Prof. Abdukirim Turghunjan
• Invited speeches will cover the following languages:
  • Uigur (Uyghur);
  • Mongolian;
  • Tibetan;
  • ……
• Database collections from CASS, XJU, XJNU, ...
Newly Created Corpora
• CCC-VPR2C2006-10000 (CCC 2-Channel Corpus for Voiceprint Recognition 2006 - 10,000 speakers), now completed
  • For channel-robust voiceprint recognition
  • Language: standard Chinese
  • Read speech
  • Telephone and mobile phone channels
  • 8 kHz, 16-bit, PCM
  • 10,000 speakers aged 18-23; each reads
    1) a brief introduction (personal information), and
    2) 40 Chinese sentences
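To make the recording format above concrete, here is a minimal Python sketch of what "8 kHz, 16-bit, PCM" implies for storage and for reading a file header. It uses only the standard-library `wave` module. The single-channel-per-file layout, the WAV container, and the function names are assumptions for illustration; the slides do not say how the corpus files are actually packaged.

```python
import io
import wave

SAMPLE_RATE_HZ = 8000    # from the slide: 8 kHz
SAMPLE_WIDTH_BYTES = 2   # from the slide: 16-bit samples
CHANNELS = 1             # assumption: one channel per file


def pcm_bytes(duration_s: float) -> int:
    """Raw PCM payload size in bytes for a recording of the given length."""
    return int(duration_s * SAMPLE_RATE_HZ) * SAMPLE_WIDTH_BYTES * CHANNELS


def write_silence_wav(duration_s: float) -> bytes:
    """Build an in-memory WAV file of silence with the format above."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(CHANNELS)
        w.setsampwidth(SAMPLE_WIDTH_BYTES)
        w.setframerate(SAMPLE_RATE_HZ)
        w.writeframes(b"\x00" * pcm_bytes(duration_s))
    return buf.getvalue()


def read_params(wav_bytes: bytes) -> tuple:
    """Return (sample rate in Hz, bits per sample, number of frames)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as r:
        return r.getframerate(), r.getsampwidth() * 8, r.getnframes()
```

Under these assumptions, one minute of speech is 60 x 8000 x 2 = 960,000 bytes of raw PCM before any container header.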
19
• Song Melody Corpus
  • For research on Query by Humming (QBH) systems
  • 5,000 songs
    • Popular Chinese songs
    • English songs (e.g., Disney theme songs)
  • Stored in XML format
  • Transcribed by hand into numbered musical notation and segmented into sentences
  • Transcriptions: pitch, duration, and a sentence-start tag
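As a rough illustration of the transcription fields listed above (pitch, duration, sentence-start tag), the Python sketch below parses a small XML fragment and groups notes into sentences. The element name `note`, the attribute names `pitch`, `duration`, and `start`, and the sample melody are all hypothetical; the slides do not give the corpus's real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: pitch as a numbered-notation scale degree,
# duration in beats, start="1" marking the first note of a sentence.
SAMPLE = """<song title="example">
  <note pitch="1" duration="1" start="1"/>
  <note pitch="2" duration="1"/>
  <note pitch="3" duration="2"/>
  <note pitch="5" duration="1" start="1"/>
  <note pitch="3" duration="3"/>
</song>"""


def split_sentences(xml_text: str) -> list:
    """Group (pitch, duration) pairs into sentences using the start tag."""
    sentences = []
    for note in ET.fromstring(xml_text).iter("note"):
        item = (int(note.get("pitch")), float(note.get("duration")))
        # Open a new sentence at a tagged note, or at the very first note.
        if note.get("start") == "1" or not sentences:
            sentences.append([])
        sentences[-1].append(item)
    return sentences
```

Calling `split_sentences(SAMPLE)` on this made-up fragment yields two sentences, of three and two notes respectively.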
20
CCC-TCT (Tsinghua Chinese Treebank)
For research on Chinese text parsing
Texts:
• A balanced collection of journalistic, literary, academic, and other documents published in the 1990s
Annotation:
• Basic information: sentence splitting, word segmentation, POS tagging
• Syntactic parse trees: each non-terminal is assigned two tags, a constituent tag and a grammatical relation tag
Size:
• 1M Chinese words, about 45,000 sentences
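The two-tag annotation described above can be pictured with a small Python data structure in which every non-terminal carries both a constituent tag and a grammatical relation tag. This is only a sketch: the class, the bracket notation, and the tag values (`S`, `VP`, `SV`, `VO`, `r`, `v`, `ns`) are placeholders chosen for illustration, not the actual TCT tag set or file format.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    constituent: str                 # constituent tag (POS tag on a leaf)
    relation: str = ""               # grammatical relation tag (non-terminals)
    children: list = field(default_factory=list)
    word: str = ""                   # set on leaves only


def words(node: Node) -> list:
    """Recover the segmented word sequence from the tree."""
    if node.word:
        return [node.word]
    return [w for child in node.children for w in words(child)]


def render(node: Node) -> str:
    """Bracketed view: leaves as (POS word), non-terminals as [constituent-relation ...]."""
    if node.word:
        return f"({node.constituent} {node.word})"
    inner = " ".join(render(child) for child in node.children)
    return f"[{node.constituent}-{node.relation} {inner}]"


# Hypothetical three-word sentence with placeholder tags.
EXAMPLE = Node("S", "SV", [
    Node("r", word="我"),
    Node("VP", "VO", [Node("v", word="爱"), Node("ns", word="北京")]),
])
```

Here `render(EXAMPLE)` gives `[S-SV (r 我) [VP-VO (v 爱) (ns 北京)]]`, which shows the constituent and relation tags side by side on each non-terminal.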
CCC-CEB1 (Chinese Event Bank)
For research on Chinese event analysis
Texts:
• Sentences extracted from TCT
• Focusing on the basic event situations of possession transfer, existence, and space and time transfer
Annotation:
• Event verb senses: suitable sense codes defined in four semantic lexicons, including a situation network, a computer-oriented common-sense KB (Hownet), a Chinese thesaurus dictionary (CiLin, Word Forest), and the Contemporary Chinese dictionary (Xianhan, Modern Chinese).
• Event role tags: 6 core or peripheral arguments of the specific events evoked by the event verbs
Size:
• About 1M Chinese characters
Thanks!
http://www.CCCForum.org