Transcript Slide 1
Recent Activities in Chinese Corpus Consortium
Thomas Fang Zheng
Center for Speech and Language Technologies
Tsinghua National Laboratory for Information Science and Technology
Tsinghua University
http://www.CCCForum.org

Slide 2
Outline
• Brief Introduction to CSLT
• CCC History
• CCC Co-founders and users
• CCC Activities
• Newly Created CCC Corpora

Slide 3
Center for Speech and Language Technologies (CSLT)
• Founded in 1979 as the Speech Laboratory (under Dept. CS), the second earliest in China
• Joined the State Key Laboratory of Intelligent Technology and Systems (SKLits) in 1999 and was renamed the Center for Speech Technology (CST); SKLits was ranked A in all three nationwide competitions
• Joined forces to form CSLT in Jan. 2007, with faculty members from the School of Information Science and Technology (SIST):
  • CST, SKLits
  • Speech Group, Dept. EE
  • Speech-on-Chip Group, Dept. EE
  • NLP Group, Dept. CS
  • Future Information Technology Center, RIIT
Division of CSAI, Tsinghua National Lab for Info Sci. & Tech.
http://cslt.riit.tsinghua.edu.cn

Slide 4
Organization Chart
[Chart: Advisory Board; Director; Deputy (Executive); Deputy (Students); Deputy (R&D); Assistant Director; Assistants; Financial; Engineering; Joint-Institute; Resources & Standard Lab; NLU Lab; Intelligent-Search Lab; Speech-on-Chip Lab; VPR Lab; ASR Lab; VPR Joint-Lab]

Slide 5
Administrative
Advisory Board:
• Victor Zue (MIT, IEEE Fellow, NAE member)
• B.-H. (Fred) Juang (GeorgiaTech, IEEE Fellow, NAE member)
• William Byrne (Cambridge)
• Dan Jurafsky (Stanford)
• Richard Stern (CMU)
• FANG Ditang (Tsinghua)
• WU Wenhu (Tsinghua)
• LIU Runsheng (Tsinghua)
Directors:
• Director: Prof. Thomas Fang Zheng
• Deputy Director: Assoc. Prof. XU Mingxing (Students)
• Deputy Director: Assoc. Prof. XIA Yunqing (R&D)
• Assistant Director: Dr. WU Xiaojun (Administration)

Slide 6
Faculty Members
Speech Processing:
• Associate Professor: LIU Yi
• Associate Professor: XIAO Xi
• Associate Professor: XU Mingxing
• Assistant Professor: LIANG Weiqian
• Assistant Professor: OU Zhijian
Natural Language Processing:
• Associate Professor: SUN Jiasong
• Associate Professor: ZHOU Qiang
• Assistant Professor: WU Xiaojun
• Assistant Professor: XIA Yunqing

Research Focus

Slide 9
CCC History
• d-Ear Technologies
• CST, THU
• CASS
• HCI&MM Lab, THU
• CLSP, JHU
• Spoken Language Translation Lab, ATR
• CUHK
• COLIPS

Slide 10
The aim of the CCC is to provide corpora for Chinese ASR, TTS, NLP, perception analysis, phonetics analysis, linguistic analysis, and other related tasks.
The purposes of the CCC include:
• Collecting and integrating existing Chinese speech and linguistic corpus resources, and continuing to create new ones.
• Integrating existing tools for the creation, transcription, and analysis of Chinese speech and linguistic corpus resources, improving their usability, and creating new tools.
• Collecting, organizing, and introducing the specifications and standards for Chinese speech and language research and development.
• Promoting the exchange of Chinese speech and linguistic corpus resources.

Slide 11
Currently, the board of the CCC includes:
• Council Members: William Byrne, Lianhong Cai, P. C. Ching, Aijun Li, K. T. Lua, Satoshi Nakamura, Zhanjiang Song, Thomas Fang Zheng
• Council Chair: Dr. Thomas Fang Zheng, Center for Speech and Language Technologies, Tsinghua University, China
• Vice Council Chair: Dr. Satoshi Nakamura, NiCT/ATR, Japan
• Standing Secretary: LIU Yi, CSLT, Tsinghua University

Slide 13
CCC Co-founder(s) & User(s)

Slide 14
Current resources: 24 corpora
For ASR (14):
• CHRD (Chinese Hotel Reservation Dialogue)
• CADCC (Chinese Annotated Dialogue and Conversation Corpus)
• CACSC (Cantonese Accent Chinese Speech Corpus)
• CSTSC-Flight Corpus (Chinese Spontaneous Telephone Speech Corpus in the flight enquiry and reservation domain)
• CUCorpora (Cantonese spoken language corpora)
• TRSC (500-People Telephone Read Speech Corpus)
• TNDC (Telephone Name Dialing Corpus)
• CASS Corpus (Chinese Annotated Spontaneous Speech)
• WDCS Corpus (Wu-Dialectal Chinese Speech)
• BIT-MobileSpeech (Mobile Phone Speech Corpus for Traffic Information Query)
• BIT-MobileTalk (Mobile Phone Conversational Speech Corpus for Travel)
• BIT-TonalName (Tonally Confusing Name Speech Corpus)
• BIT-MonoSyllable (Mandarin Mono-Syllable Corpus)
• BIT-TeleSpeech (Telephone Read Speech Corpus)

Slide 15
For VPR (5):
• CCC-VPR3C2005 (CCC 3-Channel Corpus for Voiceprint Recognition)
• CCC-VPR2C2005-1000X (CCC 2-Channel Corpus for Voiceprint Recognition, 11 kHz)
• CCC-VPR2C2005-1000/3000/6000 (CCC 2-Channel Corpus for Voiceprint Recognition - 1,000/3,000/6,000 speakers)
• CCC-VPR27C2006-50 (CCC 27-Channel Corpus for Voiceprint Recognition - 50 speakers)
• CCC-VPR36C2006-100 (CCC 36-Channel Corpus for Voiceprint Recognition - 100 speakers)
For TTS (2):
• ASCCD (Annotated Speech Corpus of Chinese Discourse)
• CoSS-0 (TH Corpus of Speech Synthesis No. 0)
For acoustic analysis (2):
• SCSC (Syllable Corpus of Standard Chinese)
• WCSC (Word Corpus of Standard Chinese)
Text corpus (1):
• PSCLT (Singapore Primary School Chinese Language Text)
More under development

Slide 16
Free resources (4):
• Pinyin Syllable List (Complete list of 416 Chinese syllables)
• Pinyin XIF List (An extended list of pinyin initials and finals)
• Sampa-C Reference (Reference for the Chinese segmental labeling convention Sampa-C)
• O-COCOSDA 98-02 (O-COCOSDA conference proceedings for 1998 through 2002), provided by Prof. Itahashi on behalf of O-COCOSDA
Member resources (3):
• Word List (An automatically generated wordlist of 50,000 Chinese words with pinyin and count information)
• CCC-VPR3C2005 (CCC 3-Channel Corpus for Voiceprint Recognition)
• CCC-VPR2C2005-1000 (CCC 2-Channel Corpus for Voiceprint Recognition - 1,000 speakers)

Slide 17
CCC Activities
The technology specification for automatic voiceprint recognition (speaker recognition), SJ/T 11380-2008, was issued by the former Ministry of Information Industry (MII) on Mar 10, 2008.
The Standard includes three parts:
• Terminologies and definitions
• Data exchange format
• Application programming interfaces
Drafters:
• CSLT, Tsinghua University (CCC co-founder)
• d-Ear Technologies (CCC co-founder)
• China Electronic Standardization Institute (CESI)
Coming soon ...
• SC2 on Human Biometric Applications of TC100 on Security Protection Alarm Systems of the Standardization Administration of China (SAC/TC100/SC2)

Slide 18
Chinese minority language processing:
• A special session on Chinese minority language processing at the National Conference on Man-Machine Speech Communication of China (NCMMSC’2009), to be held in Xinjiang, Aug. 2009 (http://www.ncmmsc.org)
• Chairs: Prof. WANG Kunlun, Prof. Abdukirim Turghunjan
• Invited speeches will cover the following languages: Uigur (Uyghur); Mongolian; Tibetan; ……
• Database collections from CASS, XJU, XJNU, ...
Newly Created Corpora
CCC-VPR2C2006-10000 (CCC 2-Channel Corpus for Voiceprint Recognition 2006 - 10,000 speakers), now finished
• For channel-robust voiceprint recognition
• Language: standard Chinese
• Read speech
• Telephone and mobile phone channels
• 8 kHz, 16-bit, PCM
• 10,000 speakers aged 18-23; each reads 1) a brief introduction (personal information), and 2) 40 Chinese sentences

Slide 19
Song’s Melody Corpus
• For research on Query by Humming (QBH) systems
• 5,000 songs:
  • Popular Chinese songs
  • English songs (such as Disney-themed songs)
• Stored in XML format
• Transcribed by hand into numbered musical notation and segmented into sentences
• Transcriptions: pitch, duration, tag of sentence start

Slide 20
CCC-TCT (Tsinghua Chinese Treebank)
• For research on Chinese text parsing
• Texts: a balanced collection of journalistic, literary, academic, and other documents published in the 1990s
• Annotation:
  • Basic information: sentence splitting, word segmentation, POS tagging
  • Syntactic parse trees: each non-terminal is assigned two tags, a constituent tag and a grammatical relation tag
• Size: 1M Chinese words, about 45,000 sentences

CCC-CEB1 (Chinese Event Bank)
• For research on Chinese event analysis
• Texts: sentences extracted from TCT, focusing on the basic event situations of possession transfer, existence, and space and time transfer
• Annotation:
  • Event verb senses: suitable sense codes defined in four semantic lexicons, including a situation network, a computer-oriented common-sense KB (HowNet), a Chinese thesaurus dictionary (CiLin, Word Forest), and the Contemporary Chinese dictionary (Xianhan, Modern Chinese)
  • Event role tags: 6 core or peripheral arguments of the special events evoked by the event verbs
• Size: about 1M Chinese characters

Thanks!
http://www.CCCForum.org
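
Supplementary sketch (not part of the original slides): the CCC-TCT annotation scheme described above gives every non-terminal two tags, a constituent tag and a grammatical relation tag, on top of word segmentation and POS tagging. One way such a tree might be held in memory is sketched below. The class name `Node`, the tag values (`S`, `VP`, `subj_pred`, `verb_obj`, `PRON`, `V`, `N`), the bracketed output format, and the example sentence are all assumptions made for illustration; they are not TCT's actual tag set or file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A parse-tree node. All tag values used here are hypothetical placeholders."""
    constituent: str                    # constituent tag (the POS tag on a leaf)
    relation: Optional[str] = None      # grammatical relation tag; None on leaves
    word: Optional[str] = None          # surface word; set only on leaves
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return self.word is not None

    def words(self) -> List[str]:
        """Return the segmented words under this node, left to right."""
        if self.is_leaf():
            return [self.word]
        out: List[str] = []
        for child in self.children:
            out.extend(child.words())
        return out

    def bracketed(self) -> str:
        """Render the tree as a bracketed string showing both tags on non-terminals."""
        if self.is_leaf():
            return f"{self.word}/{self.constituent}"
        inner = " ".join(child.bracketed() for child in self.children)
        return f"[{self.constituent}-{self.relation} {inner}]"


# Hypothetical three-word sentence: "I like music".
tree = Node("S", "subj_pred", children=[
    Node("PRON", word="我"),
    Node("VP", "verb_obj", children=[
        Node("V", word="喜欢"),
        Node("N", word="音乐"),
    ]),
])

print(tree.words())
print(tree.bracketed())
```

Keeping the two tags as separate fields, rather than fusing them into one label, would let a parser be scored on constituent structure and on grammatical relations independently.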