VoiceXML:
SSML (Speech Synthesis Markup Language)
Recorded speech and audio
Acknowledgements
Prof. McTear, Natural Language Processing,
http://www.infj.ulst.ac.uk/nlp/index.html, University of
Ulster.
Overview
Speech Synthesis Markup Language (SSML)
Phases of Text to Speech Synthesis
Structure analysis
Text normalisation
Text to phoneme conversion
Prosody analysis
Waveform production
Recorded speech
SSML
Speech Synthesis Markup Language
enables developers to override default specifications
Stages:
Structure analysis
Text normalisation
Text to phoneme conversion
Prosody analysis
Waveform production
Structure Analysis
Division of text into basic elements e.g. sentence,
paragraph to support more natural phrasing
<s> - sentence
<p> - paragraph
Structure inferred from punctuation and formatting, but …
Dr. Lewis works at the clinic on Sunset Dr. in western
Portland.
Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays
bass guitar. He also likes to fish; last week he caught a 20
lb. bass.
<p>
<s>Dr. Smith lives at 214 Elm Dr. </s>
<s>He weighs 214 lb.</s>
<s>He plays bass guitar. </s>
<s>He also likes to fish; last week he caught a 20 lb.
bass.</s>
</p>
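Why explicit markup helps can be sketched in a few lines of Python. This is only an illustration, not how any particular TTS engine works: the splitting rule (a full stop followed by whitespace ends a sentence) is an assumed, deliberately naive heuristic, and the variable names are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

TEXT = "Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays bass guitar."

# Assumed naive rule: a sentence ends at a full stop followed by whitespace.
naive = [s for s in re.split(r"(?<=\.)\s+", TEXT) if s]

# Explicit markup: the author states where each sentence ends.
sentences = [
    "Dr. Smith lives at 214 Elm Dr.",
    "He weighs 214 lb.",
    "He plays bass guitar.",
]
p = ET.Element("p")
for sentence in sentences:
    ET.SubElement(p, "s").text = sentence

print(len(naive), "naive segments:", naive)
print(ET.tostring(p, encoding="unicode"))
```

The naive rule breaks after the first "Dr.", so it reports four segments for three sentences, while the <p>/<s> version leaves nothing to guess.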
Text Normalisation
Annotation of text so that it is spoken correctly
Ambiguous examples:
1/2 - may be spoken as “half,” “January second,” “February
first,” or “one of two.”
Dr. – may be “doctor” or “drive”, e.g. “Dr. John Dr.” is rewritten
as “Doctor John Drive”
St. – may be “saint” or “street”, e.g. “St. John St.” is rewritten as
“Saint John Street.”
Acronyms e.g. ACM or IEEE should be spelled out, others are
pronounced as words e.g. RAM, ROM
Email addresses: e.g. [email protected]
First part: “Cat Azman,” “C.A.Tazman,” or “C. Atazman?”
Last part: “Bee dot com” or “B.E.E. dot com?”
<sub>
New in VoiceXML 2.0. Speech Synthesis Markup.
Syntax
<sub alias="substituteText" > OriginalText </sub>
Description
Language element whose alias attribute provides
substitute text to be spoken instead of the contained
text. This allows the document to contain both a written
and a spoken form for a string
<sub>
<sub alias = "doctor">Dr.</sub> Smith lives at 214 Elm <sub alias = "drive">Dr.</sub>
He weighs 214 <sub alias = "pounds">lb.</sub>
He plays bass guitar.
He also likes to fish; last week he caught a 20 <sub alias = "pound">lb.</sub> bass.

<sub alias = "doctor">Dr.</sub> Smith lives at <sub alias = "two fourteen">214</sub> Elm <sub alias = "drive">Dr.</sub>
He weighs <sub alias = "two hundred and fourteen">214</sub> <sub alias = "pounds">lb.</sub>
He plays bass guitar.
He also likes to fish; last week he caught a <sub alias = "twenty">20</sub> <sub alias = "pound">lb.</sub> bass.
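The effect of alias can be sketched in Python: walk the markup and, wherever a <sub> element appears, take its alias instead of its content. This is a minimal illustration, not a real synthesizer front end. The function name spoken_text and the enclosing <s> wrapper are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# Fragment adapted from the slide example; <s> is added only as a wrapper.
SSML = (
    '<s><sub alias="doctor">Dr.</sub> Smith lives at 214 Elm '
    '<sub alias="drive">Dr.</sub></s>'
)

def spoken_text(element):
    """Collect the text to be spoken, preferring alias over <sub> content."""
    if element.tag == "sub":
        return element.get("alias", "")
    parts = [element.text or ""]
    for child in element:
        parts.append(spoken_text(child))
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)

# Collapse runs of whitespace so the result reads as one spoken string.
spoken = " ".join(spoken_text(ET.fromstring(SSML)).split())
print(spoken)
```

The written form ("Dr.") stays in the document, while the spoken form comes entirely from the alias attribute.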
<say-as>
Speak enclosed text in the given style
Implemented (with limitations) in some platforms
Example: numbers
Contained text can be interpreted as a number. The
allowed number formats are ordinal, cardinal, and digits.
<say-as type="number:ordinal">12</say-as> is spoken as
"twelfth".
<say-as type="number:digits">12</say-as> is spoken as
"one two".
Other types: acronyms, currency, time, date, duration,
measures, telephone, spell-out, names, and net.
BeVocal provides a set of extended tags for items such
as: airline, equity, street, city, state, citystate, address
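What the digits and ordinal formats do to the same input can be sketched as follows. This is a toy rendering, not a platform implementation: the say_as function and its word tables are hypothetical, and the ordinal table only covers 1 to 12, which is enough for the slide's example.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# Assumed small table: ordinals for 1-12 only.
ORDINAL_WORDS = {
    1: "first", 2: "second", 3: "third", 4: "fourth",
    5: "fifth", 6: "sixth", 7: "seventh", 8: "eighth",
    9: "ninth", 10: "tenth", 11: "eleventh", 12: "twelfth",
}

def say_as(text, number_format):
    """Render a numeric string the way the given number format suggests."""
    if number_format == "digits":
        # Speak each digit separately, ignoring any other characters.
        return " ".join(DIGIT_WORDS[int(ch)] for ch in text if ch in "0123456789")
    if number_format == "ordinal":
        return ORDINAL_WORDS[int(text)]
    raise ValueError(f"unsupported format: {number_format}")

print(say_as("12", "digits"))
print(say_as("12", "ordinal"))
```

The same written text "12" yields two different spoken forms, which is exactly the ambiguity <say-as> lets the developer resolve.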
Text to phoneme conversion
Specify pronunciation of words that are difficult to
pronounce, e.g.
read = ‘reed’ / ‘red’
wind: Wind the watch when you face into the wind
<phoneme> - uses the standard phonetic alphabet, the
International Phonetic Alphabet (IPA).
Unicode numbers
He plays
<phoneme alphabet = "ipa" ph="U0062 U0258 U0073"> bass </phoneme> guitar.
He also likes to fish; last week he caught a
<sub alias = "twenty">20</sub> <sub alias = "pound">lb.</sub>
<phoneme alphabet = "ipa" ph="U0062 U00E6 U0073"> bass </phoneme>.
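To see which IPA symbols a list of Unicode numbers stands for, the numbers can be decoded with a short Python sketch. It assumes the "U" plus four hexadecimal digits notation used on this slide; the helper names are made up for the example. The second helper shows the same values as XML character references, which is one way to write such characters in an XML document.

```python
def code_points_to_ipa(ph):
    """Turn 'U0062 U00E6 U0073' into the characters those numbers denote."""
    return "".join(chr(int(token[1:], 16)) for token in ph.split())

def code_points_to_char_refs(ph):
    """Write the same code points as XML numeric character references."""
    return "".join(f"&#x{int(token[1:], 16):X};" for token in ph.split())

print(code_points_to_ipa("U0062 U00E6 U0073"))
print(code_points_to_char_refs("U0062 U00E6 U0073"))
```

Here U0062 is "b", U00E6 is "æ" and U0073 is "s".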
Attributes of <phoneme>
alphabet - The phonetic alphabet used to specify the
pronunciation of the word contained in the <phoneme>
element. The only valid values for this attribute are
alphabet = "ipa" and vendor-defined strings of the form
alphabet = "x-organization" or alphabet = "x-organization-alphabet".
ph - The phonetic spelling of this word expressed using the
alphabet.
Using the IPA requires some linguistic training. For an
excellent tutorial on the IPA symbols and sounds, see
http://www.unil.ch/ling/english/phonetique/table-eng.html.
For an overview of the IPA and a full chart of symbols, see
http://www.arts.gla.ac.uk/IPA/ipa.html.
The sounds used in English and their IPA symbols are
illustrated in http://www.antimoon.com/how/pronuncsoundsipa.htm. You can hear each sound by clicking the
word that contains the sound.
To identify the corresponding Unicode number, go to
http://web.uvic.ca/ling/resources/ipa/charts/unicode_intro.htm,
move the cursor above the IPA symbol, and the Unicode
value will appear.
Prosody analysis
Pitch (intonation or melody), timing (rhythm), pauses,
speech rate, emphasis on words, and the relative timing
of segments and pauses.
Most TTS engines have a prosody analysis algorithm
responsible for producing the prosody of synthesized
speech, which is often based on the parts of
speech. For example, nouns, verbs, and adjectives
may be accented, whereas auxiliary verbs and
prepositions may be de-stressed.
Synthesized speech pauses at commas and
inflects according to whether the
sentence is declarative, interrogative, or exclamatory.
Prosody rules and algorithms are not perfect and are a
topic of ongoing research. Prosody rules for different
spoken national languages may be quite different. For
example, the prosody of American, British, Indian, and
Jamaican pronunciations of English differs.
<prosody> : pitch
refers to the “highness or lowness” of speech
(currently not implemented in BeVocal Cafe)
measured by the frequency (Hz, vibrations per second)
of the sound
can be specified with:
A number followed by “Hz”
A relative change expressed as a percentage: for
example, "+18.2%" or "-10.3%"
A relative change as a relative number: for example,
"+10" or "-8.7"
One of the following words: "x-high", "high",
"medium", "low", "x-low", or "default"
<prosody> : range
Range - specifies the variability of the pitch
(currently not implemented in BeVocal Cafe).
Specified using the same options as pitch, e.g.
<prosody pitch = "medium" range = "x-low">
<prosody>: contour
describes the actual pitch contour for the text
(currently not implemented in BeVocal Cafe).
Given as a set of time segments with a target pitch specified for each
time segment.
Each time segment is defined as a percentage of the total
time for speaking the contained text e.g.
(25%, 25%, 25%, 25%) would speak the contained text in
four equal segments.
An interpolation algorithm smooths the transitions
between the time segments. For example, a contour can
be used to describe the increase in pitch at the end of a
question as follows:
<prosody contour = "(90%, medium) (10%, high)"> You
said what? </prosody>
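How such a contour value breaks down into segments can be sketched in Python. This follows the reading given on this slide, where each pair is a share of the total speaking time plus a target pitch. The parse_contour helper and its regular expression are assumptions for illustration, not part of any platform.

```python
import re

def parse_contour(contour):
    """Split '(90%, medium) (10%, high)' into (time share, target pitch) pairs."""
    pairs = re.findall(r"\(\s*([\d.]+)%\s*,\s*([^)]+?)\s*\)", contour)
    return [(float(share), target) for share, target in pairs]

segments = parse_contour("(90%, medium) (10%, high)")
total = sum(share for share, _ in segments)

print(segments)
print("segments cover the whole utterance:", abs(total - 100.0) < 1e-9)
```

For the question example, the first 90% of the time is spoken at medium pitch and the final 10% rises to high.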
<prosody> : rate, duration
Rate. The speaking rate expressed in words per minute (currently
not implemented in BeVocal Cafe), specified using any of the
following:
A number
A relative change expressed as a percentage; for
example, "+18.2%" or "-10.3%"
A relative change as a relative number; for example, "+10"
or "-8.7"
One of the following words: "x-fast", "fast", "medium",
"slow", "x-slow", or "default"
The student’s name is <prosody rate="-10%">
John Scott </prosody>
Duration. A value in seconds or milliseconds for the desired time to
read the element contents e.g.
<prosody duration = "10s">
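The three numeric ways of giving a rate can be compared with a small Python sketch. The base rate of 180 words per minute is an assumed figure for illustration only, the apply_rate helper is hypothetical, and the keyword values ("fast", "slow", and so on) are left out because their meaning is up to the platform.

```python
def apply_rate(base_wpm, value):
    """Return the resulting words-per-minute for a numeric rate value."""
    if value.endswith("%"):
        # Relative change as a percentage of the current rate, e.g. "-10%".
        return base_wpm * (1 + float(value[:-1]) / 100)
    if value[0] in "+-":
        # Relative change as a number added to the current rate, e.g. "+10".
        return base_wpm + float(value)
    # Plain number: an absolute rate.
    return float(value)

BASE_WPM = 180.0  # assumed default rate, not taken from any specification

for value in ("-10%", "+10", "150"):
    print(value, "->", round(apply_rate(BASE_WPM, value), 1), "words per minute")
```

Under that assumed default, rate="-10%" in the John Scott example would slow the name down to roughly 162 words per minute.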
<prosody> : volume
Volume. Specifies how loudly or quietly the words are
spoken, specified by:
A number in the range from 0.0 to 100.0
A relative change expressed as a percentage; for
example, "+18.2%" or "-10.3%"
A relative change as a relative number; for example, "+10"
or "-8.7"
One of the following words: "silent", "x-soft", "soft", "medium",
"loud", "x-loud", or "default"
<prosody volume = "loud"> text to be
spoken </prosody>
<emphasis>
formerly <emph>
level: values “strong”, “moderate”, “none” and
“reduced”.
“none” used to prevent the speech synthesis processor
from emphasizing words that it might typically
emphasize
<emphasis level = "strong">help</emphasis>
<break>
specifies when to insert silence (or pause) in text
strength - the strength of the prosodic break. Values
are "none", "x-small", "small", "medium" (the default
value), "large", or "x-large"
time – e.g. "250ms", "3s".
Welcome to the Student System
<break time = "250ms"/>
Please say one of the following: …
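Time values such as "250ms" and "3s" use two units, and a platform has to reduce them to one. A minimal Python sketch of that conversion, assuming only the two suffixes shown on this slide (the helper name is made up):

```python
def to_milliseconds(value):
    """Convert a time value such as '250ms' or '3s' to milliseconds."""
    if value.endswith("ms"):      # check "ms" first, since it also ends in "s"
        return float(value[:-2])
    if value.endswith("s"):
        return float(value[:-1]) * 1000
    raise ValueError(f"unrecognised time value: {value}")

for value in ("250ms", "3s"):
    print(value, "=", to_milliseconds(value), "ms")
```

So the break in the welcome prompt is a quarter of a second, while "3s" would be twelve times as long.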
Waveform Production
Process of converting a textual representation to
acoustical sounds which humans hear and interpret as
human-like speech.
<voice> - uses a different voice from the default specified
for TTS
<voice age="3" gender="female"> text to
speak </voice>
<audio> - specifies what audio to present to user
<desc> - specifies text-only output describing the audio
output (e.g. dog barking)
<audio>: playing prerecorded audio files
Output can consist of a combination of prerecorded
files, audio streams, or synthesised speech e.g.
<prompt>
Welcome to the Student System
<audio src = "AudioSample.wav" />
How can I help you?
</prompt>
<audio> can have alternative content in case the audio
sample is not available e.g.
<audio src = "welcome.wav" >
Welcome to the Student System
</audio>
Recording speech input using <record>
<record> is a form element similar to <field>
It is used to collect a recording from the user that
can be played back or submitted to a server
It has a <prompt> element and can have a
<filled> element
It can have a grammar for a spoken command to
terminate the recording
Attributes of <record>
name - The name of a variable that holds the value of the recorded
item.
expr - The value of the recorded item variable.
beep - There are two possible values: beep = "true" and beep =
"false". If true, a beep tone is presented to the user just before the
recording begins. The default is false.
maxtime - The maximum duration of the recording, beginning
when the recording starts. For example, maxtime = "10s" where
"10s" means 10 seconds.
finalsilence - The interval of silence indicating the end of
speech. For example, finalsilence = "3s" (not implemented in IBM
Voice Server SDK)
dtmfterm - There are two possible values: dtmfterm = "true" and
dtmfterm = "false". If true, then any DTMF key press not matched
by an active grammar will terminate the input. The default is true.
type - Media format of the resulting recording. A media type is a
file format written in the form type/subtype. For audio files, the
type is always audio.
Example using <record>
<form>
  <record name = "msg" beep = "true" maxtime = "5s"
          finalsilence = "5000ms" dtmfterm = "true" type = "audio/x-wav">
    <prompt timeout = "5s">
      Record your message after the beep.
    </prompt>
  </record>
  <filled>
    <!-- when recording is completed, replay recorded message -->
    <prompt> You said <audio expr="msg"/> </prompt>
  </filled>
</form>
Submitting recording to the server
In this example, a recording has been stored in the variable
‘msg’ and the system confirms if the user wishes to keep it:
<field name="confirm" type="boolean">
  <prompt> Your message is <audio expr="msg"/>. </prompt>
  <prompt> To keep it, say yes. To discard it, say no. </prompt>
  <filled>
    <if cond="confirm">
      <submit next="save_message.jsp" enctype="multipart/form-data" method="post" namelist="msg"/>
    </if>
    <clear/>
  </filled>
</field>
Dealing with user hang up during recording
When a user hangs up during recording, the
recording terminates and a
connection.disconnect.hangup event is thrown.
Audio recorded up until the hangup is available
through the <record> variable e.g.
<catch event="connection.disconnect.hangup">
… action such as submit recording to server…
</catch>
Exercise: SSML markup
Create a file using some SSML markup for TTS.
Examples:
He drove his new car, <prosody pitch="-10%" range="20%" volume="-20%">not his ugly old car</prosody>,
because he wanted to seem more <emphasis
level="strong"> impressive </emphasis>
My user number is <say-as interpret-as="digits"> 145678
</say-as>
Sample file: tts.vxml
Exercise: recording and using audio files
Create a simple application that includes a field in which
you ask the user to speak some information, such as
name and address, that is recorded by the system for
later playback.
Play back a pre-recorded file (music to be played as
introduction)