Proposals for Extending SSML 1.0 from the Point-of-View of Hungarian TTS Developers Géza Németh, Géza Kiss, Bálint Tóth Laboratory of Speech Technology, Department of Telecommunications.


Proposals for Extending
SSML 1.0
from the Point-of-View of
Hungarian TTS Developers
Géza Németh, Géza Kiss, Bálint Tóth
Laboratory of Speech Technology, Department of Telecommunications and Media
Informatics
Budapest University of Technology and Economics, Budapest, Hungary
{nemeth,kgeza,toth.b}@tmit.bme.hu
Budapest University of Technology & Economics (BME)
Dept. of Telecommunications & Media Informatics (TMIT)


Speech activities:
Coordinator: Gordos Géza, D.Sc.
Speech Technology Lab (STL): Németh Géza, PhD, and Olaszy Gábor, D.Sc.
Telecommunications & Signal Processing Lab (TSP): Tatai Péter, MSc
Laboratory of Speech Acoustics (LSA): Vicsi Klára, D.Sc.
In each lab
• 4-6 PhD students
• Graduate students: 306 in the Speech Information Systems subject (2005)
Basic research
Multi-lingual artificial speech generation (synthesis, STL)
• limited vocabulary (e.g., numbers, date, address)
• multi-lingual TTS (Hungarian, German, Polish, Spanish)
• speech profiles (variability, individual features)
• expression/emotion presentation (user’s manual <-> news)
Speech recognition (TSP, LSA)
• noise handling (telephone, in-car, ..., TSP)
• dictation (good quality, continuous, LSA)
• audio indexing (e.g. radio archives, broadcast news, TSP)
• speech segmentation (TSP, LSA)
• emotion detection (TSP)
Speech understanding (TSP)
Speech databases (LSA, TSP)
Applied Research
Fully proprietary components and solutions:
 All parameters controlled, systems are tailor-made for the end-user,
Integration of original research results, unique products
 T-Mobile Hungary services: E-mail reader 1999-, name- and address
reader in reverse directory, 2003 (Motto: Why is the human operator
speaking, not the machine?!), Symbian SMS-reader 2002- (STL)
 Others: SMS reader 2001-, bookreader 2002-, (STL)
 Voice portals (Generali Hungary name dial-in 2004, Hungarian
VoiceXML browser, 2003, TSP+STL)
 Industrial information systems (STL, TSP)
 Unified Messaging (STL)
 Call Center (STL, TSP)
 Audio user interfaces (especially portable/mobile devices, car
information systems, wearable devices, STL, TSP)
 Disability (1986-, speech, vision, Hungarian version of Jaws for
Windows, notetaker for blind people, STL, TSP, LSA)

Contact information
Tel: (+36 1) 463-38-83
Fax: (+36 1) 463-31-07
http://speechlab.tmit.bme.hu
Overview
Text structure
Text-to-phoneme conversion
Text normalization
Prosody prediction
Prosody prescription
Summary
Text structure
Text structure elements already contained
in SSML 1.0:
 paragraph
 sentence
Suggested further structuring:
 word
 syllables
This can be used
• to help
  - text-to-phoneme conversion
  - prosody prediction and prescription
  - …
  by giving higher-level information, namely
  - syllable structure
  - part-of-speech information
  (examples given later)
• to indicate words in languages that do not use spaces to separate words
Reasons to use text structure elements
instead of e.g. phoneme, prosody, break, emphasis
• Easier for a human editor to add
• Replacing the synthesis processor may necessitate rewriting
  - phoneme specification
  - prosody prescription
Suggested word element
<w [syllables="…-…"]
[POS="…" [number="…" …]]> … </w>
E.g.
<w syllables="hosz-szú"> hosszú </w>
<w POS="noun" number="plural"
case="accusative"> halászsasokat </w>
Suggestion extended from other proposals
<w [syllables="…-…"]
[POS="…" [number="…" gender="…"
case="…" …]
[morph="…+…"]
[tone="h+l+…"]]> … </w>
When not a word but an expression is labeled:
<e [POS="…" [number="…" …]]> … </e>
E.g. "three kilos":
<e POS="cardinal" number="plural"
gender="neuter" case="genitive">
3 k. </e>
Text-to-phoneme conversion
When the pronunciation cannot be determined automatically, you can
1. Add a lexicon element
   BUT it is hard to add all such words
2. Specify it using phoneme
   BUT this is hard for a human to write and read
3. Add a textual replacement using sub
4. Provide higher-level information
   Currently this is possible only with say-as
Other types of higher level information
(easier, more natural)
 Syllable structure
 Part-of-speech information
 Language of included foreign text
We are going to give you some examples.
Syllable structure
Hungarian:
 highly agglutinative
 pronunciation inference rules are used
 rules can be tricked by some words
E.g. “egészség” (“health”)
The letter sequence “szs” might be read as “s+zs” [S]+[Z]→[Z],
but it is in fact “sz+s” [s]+[S]→[S]
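The ambiguity above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names and the grapheme inventory below are assumptions, and the strings are assumed to use precomposed accented letters. The first function lists every way a letter sequence can be split into single letters and multi-letter graphemes; the second uses a hyphenated syllable form, like the one proposed for the syllables attribute, so that no multi-letter grapheme is read across a syllable boundary.

```python
# Hungarian multi-letter graphemes (digraphs and the trigraph "dzs").
# The inventory is hard-coded here purely for illustration.
MULTIGRAPHS = {"cs", "dz", "dzs", "gy", "ly", "ny", "sz", "ty", "zs"}


def segmentations(text):
    """Return every split of `text` into single letters and multigraphs."""
    if not text:
        return [[]]
    results = []
    for size in (1, 2, 3):
        head = text[:size]
        if len(head) == size and (size == 1 or head in MULTIGRAPHS):
            for tail in segmentations(text[size:]):
                results.append([head] + tail)
    return results


def graphemes_from_syllables(syllables):
    """Segment a hyphenated form such as "e-gész-ség" syllable by syllable."""
    graphemes = []
    for syllable in syllables.split("-"):
        i = 0
        while i < len(syllable):
            # Longest match first: try 3 letters, then 2, then 1.
            for size in (3, 2, 1):
                chunk = syllable[i:i + size]
                if len(chunk) == size and (size == 1 or chunk in MULTIGRAPHS):
                    graphemes.append(chunk)
                    i += size
                    break
    return graphemes


# "szs" alone is ambiguous: s+z+s, s+zs and sz+s are all possible splits.
print(segmentations("szs"))
# With syllable boundaries the reading sz+s is the only one left.
print(graphemes_from_syllables("e-gész-ség"))
```

The point of the sketch is that the syllable boundary carries exactly the missing information: once "gész" and "ség" are processed separately, the "zs" reading can no longer arise.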
Syllable structure
Enough to know syllable structure.
Instead of
<phoneme alphabet="ipa"
ph="&#x25B;ge&#x2D0;&#x283;&#x283;e&#x2D0;g"> egészség </phoneme>
you can write
<w syllables="e-gész-ség"> egészség </w>
(Note: here you could also write
<sub alias="e-gész-ség"> egészség </sub>)
Part-of-speech


• Word forms may have several meanings/pronunciations
• Specifying part-of-speech may help
E.g.
I will <w POS="verb" tense="present"> read </w> the book
I have <w POS="participle"> read </w> the book
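A minimal Python sketch of how a synthesis front end might use such a POS attribute to choose between the two pronunciations of "read". The lexicon table and the function name are hypothetical, and the fragment is wrapped in a dummy root element only so that a standard XML parser accepts it.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-lexicon keyed by (word form, POS); values are IPA strings.
PRONUNCIATIONS = {
    ("read", "verb"): "riːd",
    ("read", "participle"): "rɛd",
}


def pronounce_tagged_words(fragment):
    """Return the lexicon pronunciation of every <w> element in the fragment.

    Words without a matching lexicon entry yield None.
    """
    root = ET.fromstring(f"<s>{fragment}</s>")
    result = []
    for w in root.iter("w"):
        word = (w.text or "").strip()
        result.append(PRONUNCIATIONS.get((word, w.get("POS"))))
    return result


print(pronounce_tagged_words('I will <w POS="verb" tense="present"> read </w> the book'))
print(pronounce_tagged_words('I have <w POS="participle"> read </w> the book'))
```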
Language
• Foreign parts often occur in texts
• Using the same voice, currently you can
  - Do nothing
  - Specify using phoneme
• Another desirable approach
  - Specify lexicon for language and specify language of text
Language
Instead of
…<speak … xml:lang="en-US">
The title of the movie is:
<phoneme alphabet="ipa"
ph="&#x2C8;l&#x251;
&#x2C8;vi&#x2D0;&#x27E;&#x259;
&#x2C8;&#x294;e&#x26A;
&#x2C8;b&#x25B;l&#x259;">
La vita è bella </phoneme>
(Life is beautiful).
Language
you could write
…<speak … xml:lang="en-US">
The title of the movie is:
<phoneme lang="it">
La vita è bella </phoneme>
(Life is beautiful).
Language
Suggested language attribute
<phoneme [lang="…" | "x-unknown"]
[ph="…" [alphabet="…"]]> …
</phoneme>
If both lang and ph are given, lang has priority.
If the language is "x-unknown",
LID (language identification) is used.
We suggest that "x-unknown" can also be used with
xml:lang.
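The priority rules stated above can be sketched in Python as follows. This is an illustration under assumptions: the function name, the return format and the lid callback are hypothetical, and the proposal does not specify how the language identification component is provided.

```python
import xml.etree.ElementTree as ET


def resolve_phoneme(markup, lid=None):
    """Decide how a <phoneme> element with the proposed lang attribute is read.

    Returns (kind, value, text), where kind is "lang", "ph" or "text".
    `lid` is a caller-supplied language identification function (text -> code).
    """
    element = ET.fromstring(markup)
    text = (element.text or "").strip()
    lang = element.get("lang")
    if lang is not None:  # lang has priority over ph
        if lang == "x-unknown":
            if lid is None:
                raise ValueError("lang is x-unknown but no LID function was given")
            lang = lid(text)
        return ("lang", lang, text)
    ph = element.get("ph")
    if ph is not None:
        return ("ph", ph, text)
    return ("text", None, text)


print(resolve_phoneme('<phoneme lang="it"> La vita è bella </phoneme>'))
# The lambda stands in for a real LID component.
print(resolve_phoneme('<phoneme lang="x-unknown"> La vita è bella </phoneme>',
                      lid=lambda text: "it"))
```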
Text normalization
Text normalization is effectively assisted by the
say-as element.
The constructs we found appropriate
in our practice include:
date, time (including time intervals like
opening hours), number, currency, name,
address.
We additionally suggest as standard values:
acronym/abbreviation, web, e-mail,
phone, program-code, table, equation.
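As a rough sketch, a content author's tool could wrap text in say-as with one of the values listed above. The helper below is hypothetical; the exact spelling of the values (for example treating "acronym" and "abbreviation" as two separate values) is an assumption, since the list above only names the categories.

```python
from xml.sax.saxutils import escape, quoteattr

# Values found appropriate in practice, as listed above.
PRACTICE_VALUES = {"date", "time", "number", "currency", "name", "address"}
# Additionally suggested values; the spellings are assumed for illustration.
SUGGESTED_VALUES = {"acronym", "abbreviation", "web", "e-mail", "phone",
                    "program-code", "table", "equation"}


def say_as(text, interpret_as):
    """Wrap `text` in a say-as element with the given interpret-as value."""
    if interpret_as not in PRACTICE_VALUES | SUGGESTED_VALUES:
        raise ValueError(f"unsupported interpret-as value: {interpret_as!r}")
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


print(say_as("nemeth@tmit.bme.hu", "e-mail"))
print(say_as("AT&T", "acronym"))
```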
Prosody prediction
• We speak differently in different situations (e.g. speaking with friends, giving a talk at a conference, reading news, reading stories to children) – speaking style
• Differences in prosody can be quantified
• Emotional speech is also in the focus of research
• Modern TTS systems are likely to be able to imitate these to some extent
Speaking style
Suggested speaking-style attribute
• Can be used wherever xml:lang can be used, i.e. on voice, speak, p, s, w
• Synthesis processors can define their own set of supported speaking styles
• They should support: "spelling" – can be viewed as a special reading style
• They may support e.g. "syllabification", "casual", "news reading", "story telling"
Emotion
Suggested emotion attribute
• Mentioned here, although prosody is only one of its aspects
• Complementary to speaking-style, therefore a separate attribute is suggested
• Can be used wherever xml:lang can be used, i.e. on voice, speak, p, s, w
• Possible values: "happiness", "sadness", "anger", "surprise", "disgust", "fear".
Part-of-speech
• Part-of-speech (POS) of a word may affect emphasis and other aspects of prosody
• Not always possible to determine automatically
• More desirable to specify POS than to prescribe prosody (higher level, speaking style can override it)
Example in Hungarian, the word “hogy”:
• “Mondd, hogy vagy?” (“Tell me, how are you?”) – interrogative adverb, strong (focus) emphasis
• “Igaz, hogy jól vagy?” (“Is it true that you are alright?”) – conjunction, reduced emphasis
Prosody prescription
Analytic languages (e.g. English, Chinese)
• Words are usually short
• They convey only one portion of the meaning
• Individual words can be stressed
Synthetic languages (e.g. Hungarian, Korean)
• Words are often long
• Made up of several morphemes and have very complex meanings
• Stress, pitch changes, etc. may need to be realized on certain morphemes (~syllables)
Example 1: contrastive sentences
 English:
“The book is not in the box, but on the box.”
 Speaker can emphasize one word.

Hungarian:
“Nem a dobozon, hanem a dobozban van a
könyv.”
 Speaker sometimes has to emphasize one syllable.
 Stress expressed mainly by pitch; may be aided
by short pause, slower rate, higher volume.
Example 2: pitch change on syllable
1. “Elmentek.” – “They are gone.”
Pitch is continuously falling
2. “Elmentek?” – “Are they gone?”
Pitch rises at the beginning of the second
syllable and falls down on the third syllable
Suggestion for extensions to prosody:
• Stress and prosody can be described on a per-syllable basis
• Extension to prosody: the time position can be a syllable position
  - decimal fractions can also be used
  - negative values indicate the nth position from the end
  - the special symbol syl_end indicates the end of the expression
E.g.:
<prosody contour="(syl1,…) (syl1.5,…)
(syl2,…) … (syl-1,…) (syl_end,…)">
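One possible reading of these syllable positions, sketched in Python. The mapping is an assumption made for illustration and is not fixed by the proposal: syl1 is taken as the start of the first syllable, syl1.5 as the middle of the first syllable, syl-1 as the start of the last syllable, and syl_end as the end of the expression. Fractional negative positions are rejected because their meaning is not stated. The function names and the pitch targets in the usage lines are hypothetical.

```python
import re

_POSITION = re.compile(r"syl(-?\d+(?:\.\d+)?)")
_POINT = re.compile(r"\(\s*([^,()\s]+)\s*,\s*([^()]*?)\s*\)")


def resolve_position(token, n_syllables):
    """Map a position token to an offset in syllable units from the start."""
    if token == "syl_end":
        return float(n_syllables)
    match = _POSITION.fullmatch(token)
    if match is None:
        raise ValueError(f"not a syllable position: {token!r}")
    value = float(match.group(1))
    if value >= 1:
        offset = value - 1              # syl1 -> 0.0, syl1.5 -> 0.5
    elif value <= -1 and value == int(value):
        offset = n_syllables + value    # syl-1 -> start of the last syllable
    else:
        raise ValueError(f"unsupported syllable position: {token!r}")
    if not 0 <= offset <= n_syllables:
        raise ValueError(f"position outside the expression: {token!r}")
    return offset


def parse_contour(contour, n_syllables):
    """Turn "(syl1,+0%) (syl2,+20%)" into a list of (offset, target) pairs."""
    return [(resolve_position(position, n_syllables), target)
            for position, target in _POINT.findall(contour)]


# A three-syllable expression; the targets are placeholder values.
print(parse_contour("(syl1,+0%) (syl2,+20%) (syl-1,-10%) (syl_end,-20%)", 3))
print(resolve_position("syl1.5", 3))
```

Resolving every token to a single offset scale makes it straightforward to interpolate between contour points, regardless of whether a point was given from the start or from the end of the expression.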
Suggestion for optional extensions
(some synthesis processors may process these):
• pitch-contour (=contour), rate-contour, volume-contour
  - time positions: the same as in contour
  - rate / volume: described as in rate / volume
• emphasis and break extended with a position attribute; the value can be a syllable position.
  In this case break will not be an empty element.
Summary
Suggested extensions
1. <w [syllables="…-…"]
   [POS="…" [number="…" …]]> …
   </w>
2. <phoneme [lang="…" | "x-unknown"]
   [ph="…" [alphabet="…"]]> …
   </phoneme>
3. <voice | speak | p | s | w
   [speaking-style="spelling" |
   "syllabification" | "casual" |
   "news reading" | "story telling" | …]
   [emotion="happiness" | "sadness" | "anger" |
   "surprise" | "disgust" | "fear"]
   [xml:lang="…" | "x-unknown"]> …
   </voice>
4. <prosody contour="(syl1,…) (syl2,…) (syl2.5,…)
   … (syl-2,…) (syl-1,…) (syl_end,…)">
   optionally: pitch-contour (=contour),
   rate-contour, volume-contour; break, emphasis
Thank you
for your attention!