Chinese Languages - What we are facing in speech synthesis
Chinese Languages
• Dialects
• Minority languages (55 big families)
Official language: Putonghua or Standard Chinese (SC), common writing system based on SC
Chinese Dialects
9 (or 10) dialectal areas: Guan 官话, Jin 晋语, Wu 吴语, Hui 徽语, Xiang 湘语, Gan 赣语, Hakka 客家话, Yue 粤语 and Min 闽语 (plus 平话 PingHua)
Mandarin is referred to as Standard Chinese or the common language.
– The language that was used by the government
– The language that was normally spoken by the native speakers.
The difference among Chinese languages is like the difference among European languages such as Portuguese, Spanish and French.
• Guan (Mandarin) dialect is different from Mandarin
– Mandarin is based on the Guan (Mandarin) dialects.
– Guan (Mandarin) dialects cover many large regional areas. Each differs tremendously from the others.
– Guan (Mandarin) dialects exist objectively, without the limitations or conventions that Mandarin has.
Chinese Guan distribution
[Map. Labelled areas: LanYin Guan, Northeast Guan, Beijing Guan, JiLu Guan, ZhongYuan Guan, JiaoLiao Guan, Jianghuai Guan, Southwest Guan]
After the Chinese language map from the Institute of Linguistics, CASS
A dialect is usually spoken by people from different provinces
Example of Modern Wu dialects
Place                        Population
JiangSu                      16,400,000
Shanghai                     11,850,000
ZheJiang                     36,650,000
Northeast part of Jiangxi     1,850,000
North Pu of Fujian              270,000
South Anhui                   3,100,000
Total area: 137,500 km²
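As a quick arithmetic check, the regional figures in the table can be summed to get the total number of Wu speakers. The sketch below only re-uses the numbers listed above; the variable names are illustrative.

```python
# Population figures copied from the table above (place -> speakers).
wu_population = {
    "JiangSu": 16_400_000,
    "Shanghai": 11_850_000,
    "ZheJiang": 36_650_000,
    "Northeast part of Jiangxi": 1_850_000,
    "North Pu of Fujian": 270_000,
    "South Anhui": 3_100_000,
}

total = sum(wu_population.values())
print(f"Total Wu speakers: {total:,}")

# Share of each region in the total, largest first.
for place, speakers in sorted(wu_population.items(), key=lambda kv: -kv[1]):
    print(f"{place:28s}{speakers / total:6.1%}")
```

The total comes to roughly 70 million, with ZheJiang accounting for a little over half of it.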
Sub-dialects of Wu
[Map. Labelled areas: 苏沪嘉小片 (SuHu), 杭州小片 (Hang Zhou), 临绍小片 (Lin Shao), 太湖片 (Tai Hu), 台州片 (Tai zhou), 瓯江片 (OuJiang), 处衢片 (ChuQu), 宣州片 (Xuan Zhou); neighbouring areas: 江淮官话 (Jiang Huai), 徽语 (Hui Yu)]
After the Chinese language map from the Institute of Linguistics, CASS
• Wu dialect is a group of dialects spoken in ShangHai, ZheJiang, southern JiangSu, and parts of FuJian and AnHui.
• Wu dialect has about 70 million speakers, which makes it the second largest dialect group after Mandarin. The dialect of interest in this paper is Shanghainese, the native dialect spoken in Shanghai, which has a population of more than 11,850,000.
• Although it is rather young within the Wu dialect family, Shanghainese is becoming more and more interesting to researchers because of its economic and political importance.
Can we synthesize these dialectal words?
Do we need to synthesize Chinese dialects?
A special phenomenon for most dialects except Cantonese: a sound without a corresponding written character.
• “口” is used for spoken syllables that have no corresponding Chinese character
• Or a homophonous syllable is found to substitute for the sound (?%)
Sounds without characters (有音無字现象): among all present-day Chinese dialects, only Cantonese has successfully developed a Chinese-character writing system, and it is deeply rooted in people's daily life.
After Phonology of FuJian ShiPo dialect - 福建石陂 (North Min)
Examples from MinNan dialect
• 赶紧去口9+50+103淡薄水来互伊啉。
kua~ ki_n k_hi tsa~ ta_m po? tsui lai hO i li_m (in SAMPA-C)
• 口4+27+103使甲侬口12+33+103中指,无礼貌。
bue sai ka? lO_N kiau tiO_N tsai bo le bau (in SAMPA-C)
• 字写甲歪歪口9+86+1039+86+106真否看。
li sia ka? uai uai tsuai? tsuai? tsi_n p_ha~i k_hua~ (in SAMPA-C)
• 阿瑛敢会困口5+37+102灶骹?
a i_N ka_m e k_hu_n tia_m tsau k_ha (in SAMPA-C)
(numbers are the initial, final and tone coding for the syllable without a written character)
A kind of confusion when processing dialectal text.
65% of the sounds in MinNan dialect have correct written characters (correct meaning with correct sound)
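A text front end has to find these coded placeholders before it can look up a pronunciation. The following is a minimal sketch, assuming the placeholder always has the form 口 followed by three numbers joined by "+" (initial, final, tone) as in the examples above. The names `CodedSyllable` and `find_coded_syllables` are hypothetical and not part of any existing tool.

```python
import re
from typing import List, NamedTuple


class CodedSyllable(NamedTuple):
    """Numeric codes of a syllable that has no written character."""
    initial: int
    final: int
    tone: int


# Assumed surface form: 口<initial>+<final>+<tone>
PLACEHOLDER = re.compile(r"口(\d+)\+(\d+)\+(\d+)")


def find_coded_syllables(text: str) -> List[CodedSyllable]:
    """Return every coded placeholder syllable found in the text, in order."""
    return [
        CodedSyllable(*(int(group) for group in match.groups()))
        for match in PLACEHOLDER.finditer(text)
    ]


print(find_coded_syllables("赶紧去口9+50+103淡薄水来互伊啉。"))
print(find_coded_syllables("阿瑛敢会困口5+37+102灶骹?"))
```

A pattern like this cannot split two codes that are written back to back without a separator, so such runs would need an explicit delimiter in the transcription.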
Different pronunciations in written (literary) and spoken (colloquial) words (文白异读)
Example in Minnan dialect: “命” (life)
• in written words “命令、命名” (command, name) uttered as /mi/,
• in spoken words “性命、好命、命运” (life, good fortune, fortune) uttered as /mia/.
Xun du (训读)
Among polyphonic characters there is a general phenomenon called XunDu: the character has the correct meaning but the wrong pronunciation.
For instance, when Xiamen speakers see the monosyllabic word “书” /su/ (book), they pronounce it as /tse/, but this sound corresponds to another word, “册” (a book). So in the XiaMen dialect “书” (book) has two sounds: one is /su/, as in the words “书法、书写、楷书” (calligraphy, writing, regular script); the other is /tse/, as in the words “书包、书皮、买书、书呆、书虫” (school bag, book cover, buy books, bookish person, bookworm).
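For synthesis, the practical consequence is that the reading of a character has to be selected per word rather than per character. The sketch below illustrates this with a tiny word-level lexicon built only from the “命” and “书” examples above. The table `WORD_READINGS` and the function `reading_in_word` are hypothetical names, and a real lexicon would be far larger.

```python
from typing import Dict, Optional

# character -> {word containing it -> reading of that character in the word}
# Entries are taken from the examples on the slides above.
WORD_READINGS: Dict[str, Dict[str, str]] = {
    "命": {
        "命令": "mi", "命名": "mi",                    # written (literary) words
        "性命": "mia", "好命": "mia", "命运": "mia",   # spoken (colloquial) words
    },
    "书": {
        "书法": "su", "书写": "su", "楷书": "su",
        "书包": "tse", "书皮": "tse", "买书": "tse",
        "书呆": "tse", "书虫": "tse",                  # XunDu reading, borrowed from 册
    },
}


def reading_in_word(char: str, word: str) -> Optional[str]:
    """Look up how `char` is read inside `word`; None if the word is unknown."""
    return WORD_READINGS.get(char, {}).get(word)


print(reading_in_word("书", "书法"))
print(reading_in_word("书", "书包"))
print(reading_in_word("命", "好命"))
```

Returning `None` for unknown words makes the gap explicit, so a caller can fall back to another strategy instead of silently picking one reading.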
Examples from MinNan Dialect
From CRI news reports (China Radio report, read in colloquial pronunciation 白话音)
胡錦濤指出,近年來,中朝各領域交流合作取得了豐碩成果,給兩國人民帶來了實實在在的利益。中方願繼續本著互惠互利、共同發展的原則,鼓勵和支援中國企業同北韓企業開展不同形式的投資合作,推動兩國經貿合作關係不斷取得新進展。
金永南說,胡錦濤總書記的訪問必將在傳統的朝中友好合作關係史上寫下新的篇章。朝方將同中方攜手努力,加強朝中傳統友誼,按照互利原則,採取有力措施推進兩國合作。
29日當天,胡錦濤在北韓勞動黨總書記、國防委員會委員長金正日的陪同下,參觀了象徵朝中友誼的朝中大安友誼玻璃廠。
If the written characters follow the common official system, MinNan dialect can be spoken from them as well, except for some lexical words.
Examples from MinNan Dialect
• 聽眾朋友,說起江西廬山相信許多人都不會陌生,它是我國著名的旅遊勝地,是一座集風景、文化、宗教、教育、政治為一體的千古名山。這裡是中國山水詩的搖籃,古往今來,無數文人墨客慕名登臨廬山,為其留下4000餘首詩詞歌賦。
(A Minnan speaker reading a paragraph selected from a CRI article introducing LUSHAN Mountain)
• 听众朋友,讲起江西庐山,相信真多人拢(勿会)生分,伊是咱中国有名的旅游圣地,是一座集風景、文化、宗教、教育、政治為一體的千古名山。遮是中国山水诗的弧(同音字)篮,古往今来,無數文人墨客慕名登臨廬山,為伊留落来4000外首詩詞歌賦。
(The same content as actually spoken in MinNan dialect; the parts marked in red on the original slide are extremely different from the SC-based text above. The annotation 同音字 marks a homophone character used as a substitute.)
Although they can read an SC text aloud, MinNan speakers do not really speak like that in real life.
Spontaneous speech with a script from CRI
• Title: Finding a real LUSHAN Mountain 看看廬山真面目
After a CRI script:
聽眾朋友,說起江西廬山相信許多人都不會陌生,它是我國著名的旅遊勝地,是一座集風景、文化、
宗教、教育、政治為一體的千古名山。這裡是中國山水詩的搖籃,古往今來,無數文人墨客慕名登
臨廬山,為其留下4000餘首詩詞歌賦。晉代高僧慧遠(西元334~416年)在山中建立東林寺,開創了
佛教中的“凈土宗”,使廬山成為中國封建時代重要的宗教勝地。遺存至今的白鹿洞書院,是中國
古代教育和理學的中心學府。廬山上還薈萃了各種風格迥異的建築傑作,包括羅馬式與哥特式的教
堂、融合東西方藝術形式的拜佔庭式建築,以及日本式建築和伊斯蘭教清真寺等,堪稱廬山風景名
勝區的精華部分。廬山不但擁有“秀甲天下”的自然風光,更有著豐厚燦爛的文化內涵。在今天的
《中國百姓生活遊》節目中,今天我們就帶各位到廬山趴趴走。《中國百姓生活遊》節目和國家旅
遊局共同主辦。
• What the two announcers are really saying, in a more spontaneous speech style; the lexical words and grammar differ seriously from the SC-based text shown above (transcription of this spontaneous dialogue):
嗯,神州抛抛走。今囝日咱要去走的
即位所在呢,是足介赞的。即就是江西的庐山。啊讲起许个庐山哦,我是勿八去过,
但是呃自小汉有讲读甲真多即的诗啊词啊。而且真多课文当中嘛有写遘介绍即的庐山。
是啊庐山呢,是即的,呃,已经互联合国科教文组织号做世界自然遗产甲世界文化遗
产。在咱中国安尼三十统个的即个世界遗产当中哦,象伊即个号做,咱叫做双宜哦,
双宜的即的并无真多。是啊,我知影讲伊阁是汇集真多,呃,风景啊文化、宗教、教
育、政治為一體,着是讲,伊也是真多年的一的古山啊。是啊。所以无伊要叫讲,阁
要自然遗产,阁要文化遗产。你看哦,在咧即的为古代到现主时是诚多文人墨客拢来
遘即的庐山。所以庐山伊着留落偌多即的诗词甲歌賦你知唔?4000外首啊。哇,看势
即的所在一定是有伊真水的所在,无敢会有遮多人想要留落来。阿阁而且哦,你4000
外条无可能过过共款啊……
What is the problem or contradiction for Speech Synthesis (ML)?
Common language
– written: Common writing system
– spoken: Standard Chinese
Dialects
– written: Regional writing systems
– spoken: Regional Dialects
Most dialects except Cantonese lack a writing system, a dialectal grammar, and a lexicon for dialectal sounds.
What’s our present task in Speech synthesis?
To synthesize the speech according to the common written words (easier)
OR
to synthesize the speech really spoken by local people (more complicated)
Coping with Tones in Chinese in SSML
• The number of phonological tones can reach 9 (Cantonese), whereas SC has only 4 lexical tones.
• Many dialects have complicated tone sandhi rules, which apply not only within words but also between words, depending on the syntactic or semantic relations.
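The slides give no concrete sandhi rule, so as an illustration only, here is the widely known third-tone sandhi of Standard Chinese: a third tone immediately followed by another third tone is pronounced as a second tone. This minimal sketch applies that single rule inside one word; it is not a rule set for any dialect, and the function name `apply_third_tone_sandhi` is hypothetical.

```python
from typing import List


def apply_third_tone_sandhi(tones: List[int]) -> List[int]:
    """Apply SC third-tone sandhi to the lexical tones of one word.

    A tone 3 directly followed by a lexical tone 3 surfaces as tone 2.
    The check uses the lexical (input) tones, so the rule is applied
    to every such pair in the word.
    """
    surface = list(tones)
    for i in range(len(tones) - 1):
        if tones[i] == 3 and tones[i + 1] == 3:
            surface[i] = 2
    return surface


print(apply_third_tone_sandhi([3, 3]))  # e.g. ni3 hao3
print(apply_third_tone_sandhi([3, 1]))  # no change: the next tone is not 3
```

For longer runs of third tones the surface pattern depends on how the syllables group into words and phrases, which this sketch ignores. That dependence on structure is exactly the difficulty noted above for sandhi between words.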
New specifications relating to Chinese dialects proposed by our Institute
• ISO/IEC 10646 accepted 24 tonal symbols (2004-06, proposed by our institute)
• Phonetic Alphabet used in China, including minority languages and Mandarin Chinese (“中国通用音标符号集”, General Phonetic Symbol Set of China).
This specification has been submitted to the Education Ministry of China and will become a standard specification for Chinese language survey, teaching and study, and even for use in speech information processing.
Tones
25 tones are distinguished when the traditional five-level tone-letter scale is used in this specification:
• 5 level tones
• 10 rising tones
• 10 falling tones
If tone sandhi (tonal icons are presented on the right side) and short tones (shorter line) are considered, (25+25)*2 = 100 tones are obtained.
• 20 long-short tones
• 20 short-long tones
• 30 concave tones
• 30 vaulted tones
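The counts above can be reproduced by enumerating contours on a five-point pitch scale. The sketch below rests on assumptions that are not stated in the specification: level, rising and falling tones are taken as two-point contours (start, end), and the other four categories as three-point contours. "Long-short" is read as a contour whose first two points are equal (such as 443), "short-long" as one whose last two points are equal (such as 433), "concave" as a dip, and "vaulted" as a peak.

```python
from itertools import product

LEVELS = range(1, 6)  # five-point pitch scale: 1 = lowest, 5 = highest

two_point = list(product(LEVELS, repeat=2))    # (start, end)
three_point = list(product(LEVELS, repeat=3))  # (start, middle, end)

counts = {
    "level": sum(1 for a, b in two_point if a == b),
    "rising": sum(1 for a, b in two_point if a < b),
    "falling": sum(1 for a, b in two_point if a > b),
    # Assumed readings of the three-point categories:
    "long-short": sum(1 for a, b, c in three_point if a == b != c),
    "short-long": sum(1 for a, b, c in three_point if a != b == c),
    "concave": sum(1 for a, b, c in three_point if a > b < c),
    "vaulted": sum(1 for a, b, c in three_point if a < b > c),
}

for name, n in counts.items():
    print(f"{name:11s}{n:3d}")

basic = counts["level"] + counts["rising"] + counts["falling"]
print("basic tones:", basic)
print("with sandhi and short variants:", (basic + basic) * 2)
```

Under these readings the enumeration gives 5, 10 and 10 two-point tones (25 in total) and 20, 20, 30 and 30 three-point tones, matching the figures on the slide.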
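Sub-dialects coding
• Handbook for the survey of Chinese dialects, edited by the leading office of the Chinese languages investigation, YUWEN Publishing House.
《中国语言文字使用情况调查 调查员手册》 (Survey of Language and Script Use in China: Investigator's Handbook), edited by the Office of the Leading Group for the Survey of Language and Script Use in China (中国语言文字使用情况调查领导小组办公室), YUWEN Publishing House (语文出版社).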