SSML - www.gu.se utslagen / not accessible

VoiceXML:
SSML (Speech Synthesis Markup Language)
Recorded speech and audio
Acknowledgements
 Prof. McTear, Natural Language Processing, University of Ulster,
http://www.infj.ulst.ac.uk/nlp/index.html
Overview
 Speech Synthesis Markup Language (SSML)
 Phases of Text to Speech Synthesis
 Structure analysis
 Text normalisation
 Text to phoneme conversion
 Prosody analysis
 Waveform production
 Recorded speech
SSML
 Speech Synthesis Markup Language
 enables developers to override default specifications
 Stages:
 Structure analysis
 Text normalisation
 Text to phoneme conversion
 Prosody analysis
 Waveform production
Structure Analysis
 Division of text into basic elements e.g. sentence,
paragraph to support more natural phrasing
 <s> - sentence
 <p> - paragraph
 Structure inferred from punctuation and formatting, but …
 Dr. Lewis works at the clinic on Sunset Dr. in western
Portland.
 Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays
bass guitar. He also likes to fish; last week he caught a 20
lb. bass.

<s>Dr. Smith lives at 214 Elm Dr. </s>
<s>He weighs 214 lb.</s>
<s>He plays bass guitar. </s>
<s>He also likes to fish; last week he caught a 20 lb.
bass.</s>
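For context, the sentence markup above would normally sit inside a complete SSML document (or inside a VoiceXML <prompt>). The following is a minimal sketch, assuming SSML 1.0 and US English; the xml:lang value and the single enclosing <p> are illustrative choices and do not come from the original slide.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- One paragraph with explicitly marked sentences, so the periods in
       "Dr." and "lb." are not mistaken for sentence boundaries -->
  <p>
    <s>Dr. Smith lives at 214 Elm Dr.</s>
    <s>He weighs 214 lb.</s>
    <s>He plays bass guitar.</s>
    <s>He also likes to fish; last week he caught a 20 lb. bass.</s>
  </p>
</speak>
```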

Text Normalisation
Annotation of text so that it is spoken correctly
Ambiguous examples:
1/2 - may be spoken as “half,” “January second,” “February
first,” or “one of two.”
Dr. – may be ‘doctor’ or ‘drive’ e.g. “Dr. John Dr.” is rewritten
as “Doctor John Drive”
St. – may be ‘saint’ or ‘street’ e.g. “St. John St.” is rewritten as
“Saint John Street”
Acronyms: some, e.g. ACM or IEEE, should be spelled out; others are
pronounced as words, e.g. RAM, ROM
Email addresses: e.g. [email protected]
First part: “Cat Azman,” “C.A.Tazman,” or “C. Atazman?”
Last part: “Bee dot com” or “B.E.E. dot com?”

<sub>
 New in VoiceXML 2.0.
 Syntax
<sub alias="SpokenText"> OriginalText </sub>
 Description
Speech Synthesis Markup Language element whose alias attribute provides
substitute text to be spoken instead of the contained
text. This allows the document to contain both a written
and a spoken form for a string.

Example text:
Dr. Smith lives at 214 Elm Dr. He weighs 214 lb. He plays bass guitar.
He also likes to fish; last week he caught a 20 lb. bass.
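The markup for this example did not survive in the transcript. As a rough sketch of how the alias attribute could be applied to the abbreviations in the Dr. Smith text, the fragment below may help; the alias values ("Doctor", "Drive", "pounds") are assumptions, not the original slide's markup.

```xml
<prompt>
  <!-- alias holds the spoken form; the element content keeps the written form -->
  <sub alias="Doctor">Dr.</sub> Smith lives at 214 Elm <sub alias="Drive">Dr.</sub>
  He weighs 214 <sub alias="pounds">lb.</sub>
</prompt>
```

The written text stays readable in the document, while the synthesizer speaks the alias, which removes the doctor/drive ambiguity by hand.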
<say-as>
 Speak enclosed text in the given style
 Implemented (with limitations) in some platforms
 Example: numbers
 Contained text can be interpreted as a number. The
allowed number formats are ordinal, cardinal, and digits.
 <say-as type="number:ordinal">12</say-as> is spoken as
"twelfth"
 <say-as type="number:digits">12</say-as> is spoken as
"one two"
 Other types: acronyms, currency, time, date, duration,
measures, telephone, spell-out, names, and net.
 BeVocal provides a set of extended tags for items such
as: airline, equity, street, city, state, citystate, address
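The type="number:ordinal" form above follows an early draft syntax. The SSML 1.0 Recommendation names the attribute interpret-as and leaves the permitted values to the platform, so the fragment below is only a sketch: the values "ordinal" and "digits" are assumptions that mirror the two types shown above, and the carrier sentences are invented for illustration.

```xml
<prompt>
  <!-- Expected to be spoken as "twelfth" on platforms that support this value -->
  You are <say-as interpret-as="ordinal">12</say-as> in the queue.
  <!-- Expected to be spoken as "one two" -->
  Your code is <say-as interpret-as="digits">12</say-as>.
</prompt>
```

Which attribute form a given voice browser accepts should be checked against its own documentation.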
Text to phoneme conversion
Specify the pronunciation of words that are ambiguous or difficult to
pronounce, e.g.
read = ‘reed’ / ‘red’
wind: Wind the watch when you face into the wind
<phoneme> - uses the standard phonetic alphabet, the
International Phonetic Alphabet (IPA), with each symbol
given as its Unicode number
He plays
<phoneme alphabet = "ipa" ph="U0062 U0258 U0073">
bass </phoneme> guitar.
He also likes to fish; last week he caught a 20 lb.
<phoneme alphabet = "ipa" ph="U0062 U00E6 U0073"> bass </phoneme>.
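In a conforming SSML document the ph value is normally written with the IPA characters themselves or with XML character references, rather than "U0062"-style tokens. A sketch of the same two words in that form, assuming the usual transcriptions /beɪs/ for the instrument and /bæs/ for the fish:

```xml
<prompt>
  He plays
  <!-- b (U+0062), e (U+0065), ɪ (U+026A), s (U+0073) -->
  <phoneme alphabet="ipa" ph="&#x62;&#x65;&#x26A;&#x73;">bass</phoneme> guitar.
  Last week he caught a 20 lb.
  <!-- b (U+0062), æ (U+00E6), s (U+0073) -->
  <phoneme alphabet="ipa" ph="&#x62;&#xE6;&#x73;">bass</phoneme>.
</prompt>
```

The element content ("bass") is still what a text display would show; only the pronunciation is overridden.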
Attributes of <phoneme>
 alphabet - The phonetic alphabet used to specify the
pronunciation of the word contained in the <phoneme>
element. The only valid values for this attribute are "ipa"
and vendor-defined strings of the form "x-organization"
or "x-organization-alphabet".
 ph - The phonetic spelling of this word, expressed using the
alphabet named in the alphabet attribute.
 Using the IPA requires some linguistic training. For an
excellent tutorial on the IPA symbols and sounds, see
http://www.unil.ch/ling/english/phonetique/table-eng.html.
 For an overview of the IPA and a full chart of symbols, see
http://www.arts.gla.ac.uk/IPA/ipa.html.
 The sounds used in English and their IPA symbols are
illustrated in http://www.antimoon.com/how/pronuncsoundsipa.htm. You can hear each sound by clicking the
word that contains the sound.
 To identify the corresponding Unicode number, go to
http://web.uvic.ca/ling/resources/ipa/charts/unicode_intro.htm,
move the cursor above the IPA symbol, and the Unicode
value will appear.
Prosody analysis
 Pitch (intonation or melody), timing (rhythm), pauses,
speech rate, emphasis on words, and the relative timing
of segments and pauses.
 Most TTS engines have a prosody analysis algorithm
responsible for producing the prosody of synthesized
speech, which is often based on the parts of
speech. For example, nouns, verbs, and adjectives
may be accented, whereas auxiliary verbs and
prepositions may be de-stressed.
 Synthesized speech pauses at commas and is
inflected according to whether the
sentence is declarative, interrogative, or exclamatory.
 Prosody rules and algorithms are not perfect and are a
topic of ongoing research. Prosody rules for different
spoken languages may be quite different. For
example, the prosody of American, British, Indian, and
Jamaican varieties of English differs.
<prosody> : pitch
 refers to the “highness or lowness” of speech
 (currently not implemented in BeVocal Café)
 measured by the frequency (Hz, vibrations per second)
of the sound
 can be specified with:
 A number followed by “Hz”
 A relative change expressed as a percentage: for
example, "+18.2%" or "-10.3%"
 A relative change as a relative number: for example,
"+10" or "-8.7"
 One of the following words: "x-high", "high",
"medium", "low", "x-low", or "default"
<prosody> : range
 Range - specifies the variability of the pitch
 (currently not implemented in BeVocal Café)
 specified using the same options as pitch e.g.
<prosody pitch = "medium" range = "x-low">
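To make the pitch and range options concrete, here is a small sketch that uses one value of each kind listed above (a label, a relative percentage, and an absolute value in Hz). The sentences are invented for illustration, and platforms that do not implement these attributes would be expected to speak the text with default prosody.

```xml
<prompt>
  <!-- Label value -->
  <prosody pitch="high">This sentence is spoken at a higher pitch.</prosody>
  <!-- Relative change as a percentage -->
  <prosody pitch="-10%">This one is lowered by ten percent.</prosody>
  <!-- Absolute value in hertz, with a narrow pitch range for a flatter delivery -->
  <prosody pitch="120Hz" range="x-low">This one is close to a monotone.</prosody>
</prompt>
```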
<prosody>: contour
 describes the actual pitch contour for the text
 (currently not implemented in BeVocal Café)
 a set of time segments with a target pitch specified for each
time segment.
 Each time segment is defined as a percentage of the total
time for speaking the contained text e.g.
(25%, 25%, 25%, 25%) would speak the contained text in
four equal segments.
 An interpolation algorithm smooths the transitions
between the time segments. For example, a contour can
be used to describe the increase in pitch at the end of a
question as follows:
<prosody contour = "(90%, medium) (10%, high)"> You
said what? </prosody>
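The description above treats each percentage as the length of a segment. The SSML 1.0 Recommendation instead reads the first value of each pair as a time position within the contained text, with the pitch interpolated between positions. A sketch of the same rising question under that reading, assuming the platform implements contour at all:

```xml
<!-- Pitch stays at medium until 90% of the way through, then rises to high -->
<prosody contour="(0%,medium) (90%,medium) (100%,high)">
  You said what?
</prosody>
```

Because the two readings give different results for the same string, it is worth checking which one a given synthesizer follows.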
<prosody> : rate, duration
Rate. The speaking rate, expressed in words per minute (currently
not implemented in BeVocal Café), specified using any of the
following:
 A number
 A relative change expressed as a percentage; for
example, "+18.2%" or "-10.3%"
 A relative change as a relative number; for example, "+10"
or "-8.7"
 One of the following words: "x-fast", "fast", "medium",
"slow", "x-slow", or "default"
The student’s name is <prosody rate="-10%">
John Scott </prosody>
Duration. A value in seconds or milliseconds for the desired time to
read the element contents e.g.
<prosody duration = "10s">
<prosody> : volume
Volume. Specifies how loudly or quietly the words are
spoken, specified by:
 A number in the range from 0.0 to 100.0
 A relative change expressed as a percentage; for
example, "+18.2%" or "-10.3%"
 A relative change as a relative number; for example, "+10"
or "-8.7"
 One of the following words: "silent", "x-soft", "soft", "medium",
"loud", "x-loud", or "default"
<prosody volume = "loud"> text to be
spoken </prosody>
<emphasis>
 formerly <emph>
 level: values "strong", "moderate", "none", and
"reduced".
 “none” used to prevent the speech synthesis processor
from emphasizing words that it might typically
emphasize
<emphasis level = "strong">help</emphasis>
<break>
 specifies when to insert silence (or pause) in text
 strength - the strength of the prosodic break. Values
are "none", "x-small", "small", "medium" (the default
value), "large", or "x-large"
 time - e.g. "250ms", "3s".
Welcome to the Student System
<break time = "250ms"/>
Please say one of the following: …
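The strength attribute can be used in place of an explicit time. The sketch below assumes the value names of the SSML 1.0 Recommendation ("none", "x-weak", "weak", "medium", "strong", "x-strong"), which differ from the names listed above; platforms built on earlier drafts may accept only the older names. The menu options are hypothetical.

```xml
<prompt>
  Welcome to the Student System
  <!-- Prosodic break chosen by the synthesizer, stronger than its default -->
  <break strength="strong"/>
  Please say one of the following:
  <!-- Fixed pause of a quarter of a second -->
  <break time="250ms"/>
  grades, timetable, or fees.
</prompt>
```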
Waveform Production
Process of converting a textual representation to
acoustical sounds which humans hear and interpret as
human-like speech.
<voice> - uses a different voice from the default specified
for TTS
<voice age="3" gender="female"> text to
speak </voice>
<audio> - specifies what audio to present to user
<desc> - specifies text-only output describing the audio
output (e.g. dog barking)
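The three elements above can be combined in one prompt. This is a sketch only: the file name dog.wav, the age value, and the spoken sentences are hypothetical, and voice selection depends on which voices the platform actually provides.

```xml
<prompt>
  <!-- Request a non-default voice for this sentence -->
  <voice gender="female" age="30">Here is the sound you asked for.</voice>
  <audio src="dog.wav">
    <!-- Text-only description for output channels that cannot play audio -->
    <desc>dog barking</desc>
    <!-- Spoken only if the audio file cannot be played -->
    The recording of the dog is not available.
  </audio>
</prompt>
```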
<audio>: playing prerecorded audio files
 Output can consist of a combination of prerecorded
files, audio streams, or synthesised speech e.g.
<prompt>
Welcome to the Student System
<audio src = "AudioSample.wav" />
How can I help you?
</prompt>
 <audio> can have alternative content in case the audio
sample is not available e.g.
<audio src = "welcome.wav" >
Welcome to the Student System
</audio>
Recording speech input using <record>
 <record> is a form element similar to <field>
 It is used to collect a recording from the user that
can be played back or submitted to a server
 It has a <prompt> element and can have a
<filled> element
 It can have a grammar for a spoken command to
terminate the recording
Attributes of <record>
 name - The name of the variable that holds the recording.
 expr - The initial value of the form item variable.
 beep - Either "true" or "false". If true, a beep tone is presented
to the user just before the recording begins. The default is false.
 maxtime - The maximum duration of the recording, beginning
when the recording starts. For example, maxtime = "10s" means
10 seconds.
 finalsilence - The interval of silence that indicates the end of
speech. For example, finalsilence = "3s" (not implemented in IBM
Voice Server SDK).
 dtmfterm - Either "true" or "false". If true, any DTMF key press
not matched by an active grammar terminates the recording.
The default is true.
 type - The media format of the resulting recording, written in the
form type/subtype. For audio files, the type is always audio,
e.g. "audio/x-wav".
Example using <record>
<form>
  <record name = "msg" beep = "true" maxtime = "5s"
      finalsilence = "5000ms" dtmfterm = "true" type = "audio/x-wav">
    <prompt timeout = "5s">
      Record your message after the beep.
    </prompt>
  </record>
  <filled>
    <!-- when recording is completed, replay recorded message -->
    <prompt> You said <audio expr="msg"/> </prompt>
  </filled>
</form>
Submitting recording to the server
 In this example, a recording has been stored in the variable
‘msg’ and the system asks whether the user wishes to keep it:
<field name="confirm" type="boolean">
  <prompt> Your message is <audio expr="msg"/>. </prompt>
  <prompt> To keep it, say yes. To discard it, say no. </prompt>
  <filled>
    <if cond="confirm">
      <submit next="save_message.jsp" enctype="multipart/form-data"
          method="post" namelist="msg"/>
    </if>
    <clear/>
  </filled>
</field>
Dealing with user hang up during
recording
 When a user hangs up during recording, the
recording terminates and a
connection.disconnect.hangup event is thrown.
Audio recorded up until the hangup is available
through the <record> variable e.g.
<catch event="connection.disconnect.hangup">
… action such as submit recording to server…
</catch>
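One way to fill in the elided action is to submit the partial recording from inside the handler. The sketch below reuses save_message.jsp and the msg variable from the earlier slides; the form id and the maxtime value are hypothetical, and whether the server receives usable audio after a hangup depends on the platform.

```xml
<form id="leave_message">
  <record name="msg" beep="true" maxtime="30s">
    <prompt>Record your message after the beep.</prompt>
    <!-- If the caller hangs up mid-recording, send what was captured so far -->
    <catch event="connection.disconnect.hangup">
      <submit next="save_message.jsp" enctype="multipart/form-data"
              method="post" namelist="msg"/>
    </catch>
  </record>
</form>
```

Placing the <catch> inside <record> keeps the handler local to the recording step, so other parts of the application can treat a hangup differently.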
Exercise: SSML markup
Create a file using some SSML markup for TTS.
Examples:
He drove his new car, <prosody pitch="-10%" range="20%" volume="-20%">not his ugly old car</prosody>,
because he wanted to seem more <emphasis
level="strong"> impressive </emphasis>.
My user number is <say-as interpret-as="digits"> 145678
</say-as>.
Sample file: tts.vxml
Exercise: recording and using audio files
Create a simple application that includes a field in which
you ask the user to speak some information, such as
name and address, that is recorded by the system for
later playback.
Play back a pre-recorded file (music to be played as
introduction)
