Utilization of Speech Corpora - NII Speech Resources Consortium…


GSK: Development and Distribution of Resources
Licensing and Distribution of Resources and Applications
Hitoshi ISAHARA
GSK: Gengo Shigen Kyokai (Language Resource Association)
National Institute of Information and Communications Technology (NICT)
Organizing Creation & Utilization of Language Corpora
Creating language corpora is costly.
Utilizing them requires a system for distributing corpora.
Such activities started in the early 1990s.
1992 LDC in the U.S.A.
1995 ELRA in Europe
Regional Conference on Localized ICT
Development and Dissemination across Asia
Jan. 15, Vientiane, Laos
Japanese Activities
GSK: Gengo Shigen Kyokai (Language Resource Association)
Launched in 1999.
Reorganized as an NPO in 2003.
Three-year project accepted in 2005.
Text corpora are its main concern at present.
NII-SRC distributes speech corpora.
GSK and NII-SRC
Language Resource Association (GSK)
A nonprofit organization collecting and distributing text and speech corpora.
http://www.gsk.or.jp/
NII-Speech Resources Consortium (NII-SRC)
Collects and distributes most major speech corpora.
http://research.nii.ac.jp/src/eng/
These two organizations aim to play central roles in collecting and distributing speech and language corpora in Japan.
JEITA (Japan Electronics and Information Technology Industries Association)
Knowledge Information Processing Technologies Committee
Language Resource Sub-committee
GSK
NII-SRC
NII: National Institute of Informatics
NICT: National Institute of Information and Communications Technology
TCL
Natural Language Processing Portal Site
SHACHI: Language Resource Metadata DB
Purpose of GSK
Collection, distribution, investigation, research, and standardization of electronic data and software tools necessary for the promotion of science, technology, education and industry concerning natural language.
GSK Organization
President
Two vice presidents
11 board members
25 steering committee members
All are volunteers.
No-fee Distribution
[Diagram: Provider, GSK, and User, with flows labeled Corpus, Distribution permission, Payment, and Agreement]
As a rule, the cost of handling corpora falls on the user,
though the corpus itself is free of charge.
Agency
[Diagram: User, Provider, and GSK, with labels Agency, Request, Form, Commission, Payment, and Agreement]
Corpus providers entrust GSK with handling the requests received from users. GSK mediates between users and providers.
Advertising
[Diagram: User, Provider, and GSK, with labels Ad request, Publicity, Ad rate, Payment, and Agreement]
Corpus providers entrust GSK with advertising useful information about their data or corpora.
Some Examples of GSK Corpora
JEITA Multimodal Corpus
Japanese Web N-gram Version 1
CICC Multilingual Dictionary
IPAL Lexicon of Basic Japanese
JEITA Multimodal Corpus
A corpus of person-to-person task-oriented dialogues. It includes 80 minutes of video covering 9 conversations on the topics of “faces” and “travel”. The speech data are transcribed and annotated with morphemes, dialogue structure, and prosody.
Contained in 1 DVD-R (800 MB).
Japanese Web N-gram Version 1
N-grams extracted from publicly available Japanese web pages crawled by Google. Pages that require special permission to browse or that are marked noarchive/noindex are not included. N-grams (1- to 7-grams) with frequency greater than 20 were extracted from approximately 20 billion sentences.
Contained in 6 DVD-Rs (26 GB after gzip compression).
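As a rough illustration of the kind of extraction described above, the following Python sketch counts 1- to 7-grams over sentences that are already split into tokens and keeps only those whose frequency is greater than a threshold. The function name `count_ngrams`, its parameters, and the toy input are hypothetical; this is not the procedure actually used to build the corpus. Japanese text would first need word segmentation, which is assumed here to have been done beforehand.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def count_ngrams(
    sentences: Iterable[Sequence[str]],
    max_n: int = 7,
    threshold: int = 20,
) -> Dict[Tuple[str, ...], int]:
    """Count 1..max_n-grams and keep those with frequency > threshold.

    Each sentence is assumed to be a sequence of tokens already
    (tokenization is outside the scope of this sketch).
    """
    counts: Counter = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of length n over the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Keep only n-grams that occur more often than the threshold.
    return {gram: c for gram, c in counts.items() if c > threshold}


# Toy usage with a small threshold so that something survives the filter.
toy_sentences = [["a", "b", "c"], ["a", "b"], ["a"]]
frequent = count_ngrams(toy_sentences, max_n=2, threshold=1)
```

On real web-scale data, counts of this size would not fit in a single in-memory counter, so a distributed or disk-based counting scheme would be needed; the sketch only shows the counting and thresholding logic.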
CICC Multilingual Dictionary
A collection of Malay, Indonesian, Chinese, and Thai dictionaries containing 50,000 basic words and POS tags; some contain English translations. A technical term dictionary for each language is also available.
Contained in 1 CD-ROM for each language.
CICC: Center of the International Cooperation for Computerization
IPAL Lexicon of Basic Japanese
Contains 861 verbs, 136 adjectives, and 1,081 nouns, plus a glossary. English translations are also provided for the nouns contained in the glossary.
Contained in 1 CD-ROM.
Summary
1. There are several distributors of language resources in Japan.
2. GSK is the only language resource consortium qualified as an NPO in Japan.
3. GSK plans to collaborate with the Language Grid Project.