Voice Browsers
GeneralMagic Demo
Making the Web accessible to
more of us, more of the time.
SDBI November 2001,
Shani Shalgi
What is a Voice Browser?
 Expanding access to the Web
 Will allow any telephone to be used to access appropriately designed Web-based services
 Server-based
 Voice portals
2
What is a Voice Browser?
 Interaction
via key pads, spoken
commands, listening to prerecorded
speech, synthetic speech and music.
 An advantage to people with visual
impairment
 Web access while keeping hands &
eyes free for other things (eg. Driving).
3
What is a Voice Browser?
 Mobile
Web
 Naturalistic dialogs with Web-based
services.
4
Motivation
 Far
more people today have access to a
telephone than have access to a
computer with an Internet connection.
 Many of us have already or soon will
have a mobile phone within reach
wherever we go.
5
Motivation
 Easy
to use - for people with no
knowledge or fear of computers.
 Voice interaction can escape the physical
limitations on keypads and displays as
mobile devices become ever smaller.
6
Motivation
 Many companies offer services over
the phone via menus traversed using
the phone's keypad. Voice Browsers are
the next generation of call centers,
which will become Voice Web portals to
the company's services and related
websites, whether accessed via the
telephone network or via the Internet.
7
Motivation
 Disadvantages
to existing methods:
• WAP (Cellular phones, Palm Pilots)
–Small screens
–Access Speed
–Limited or fragmented availability
–Awkward input
–Price
–Lack of user habit
8
Differences Between Graphical &
Voice Browsing
 Graphical browsing is more passive due to the persistence of the visual information.
 Voice browsing is more active since the user has to issue commands.
The leading role is turned over to the USER.
 Graphical Browsers are client-based, whereas Voice Browsers are server-based.
9
Possible Applications
 Accessing
business information:
• The corporate "front desk" which asks
callers who or what they want
• Automated telephone ordering services
• Support desks
• Order tracking
• Airline arrival and departure information
• Cinema and theater booking services
• Home banking services
10
Possible Applications (2)
 Accessing
public information:
• Community information such as weather,
traffic conditions, school closures,
directions and events
• Local, national and international news
• National and international stock market
information
• Business and e-commerce transactions
11
Possible Applications (3)
 Accessing personal information:
• Voice mail
• Calendars, address and telephone lists
• Personal horoscope
• Personal newsletter
• To-do lists, shopping lists, and calorie counters
12
Advancing Towards Voice
Until now, speech recognition and synthesis
technologies had to be handcrafted into
applications.
 Voice Browsers intend the voice technologies
to be handcrafted directly into web servers.
 This demands transformation of Web content
into formats better suited to the needs of voice
browsing or authoring content directly for voice
browsers.

13
 The
World Wide Web Consortium
(W3C) develops interoperable
technologies (specifications, guidelines,
software, and tools) to lead the Web to
its full potential as a forum for
information, commerce, communication,
and collective understanding.
14
W3C Speech Interface
Framework
 Speech Synthesis
 Speech Recognition
• DTMF Grammars
• Speech Grammars
• Stochastic (N-Gram) Language Models
• Semantic Interpretation
 VoiceXML
 Pronunciation Lexicon
 Call Control
 Voice Browser Interoperation
15
VoiceXML
 VoiceXML
is a dialog markup language
designed for telephony applications,
where users are restricted to voice and
DTMF (touch tone) input.
[Figure: Browser, Web Server and Internet, with the documents text.html and text.vxml]
Speech Synthesis
 The
specification defines a markup
language for prompting users via a
combination of prerecorded speech,
synthetic speech and music. You can
select voice characteristics (name, gender
and age) and the speed, volume, pitch,
and emphasis. There is also provision for
overriding the synthesis engine's default
pronunciation.
17
Speech Recognition
[Figure: the USER gives Speech and Touch Tone input; components shown: Speech Grammars, Stochastic Language Models, DTMF Grammars, Semantic Interpretation]
DTMF Grammars
 Touch
tone input is often used as an
alternative to speech recognition.
 Especially useful in noisy conditions or
when the social context makes it
awkward to speak.
 The W3C DTMF grammar format allows
authors to specify the expected
sequence of digits, and to bind them to
the appropriate results.
19
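As a rough illustration of binding digit sequences to results, the sketch below uses a plain Python dictionary. It is not the W3C DTMF grammar format itself; the menu entries and the name `interpret_dtmf` are made up for this example.

```python
# Hypothetical menu: each expected digit sequence is bound to a result.
DTMF_BINDINGS = {
    "1": "account balance",
    "2": "recent transactions",
    "0": "operator",
}

def interpret_dtmf(digits):
    """Return the result bound to a digit sequence, or None if it is not expected."""
    return DTMF_BINDINGS.get(digits)

print(interpret_dtmf("2"))  # recent transactions
print(interpret_dtmf("9"))  # None
```

An unexpected sequence returns None, which a real platform would presumably treat as a "no match" event.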
Speech Grammars
In most cases, user prompts are very carefully
designed to encourage the user to answer in a
form that matches context free grammar rules.
 Speech Grammars allow authors to specify
rules covering the sequences of words that
users are expected to say in particular
contexts. These contextual clues allow the
recognition engine to focus on likely
utterances, improving the chances of a correct
match.
20

Stochastic (N-Gram) Language
Models
In some applications it is appropriate to use
open-ended prompts ("How can I help?"). In
these cases, context free grammars are
not useful.
 The solution is to use a stochastic language
model. Such models specify the probability
that one word occurs following certain others.
The probabilities are computed from a
collection of utterances collected from many
users.
21

Semantic Interpretation
 The
recognition process matches an
utterance to a speech grammar, building
a parse tree as a byproduct.
 There are two approaches to harvesting
semantic results from the parse tree:
1. Annotating grammar rules with
semantic interpretation tags (ECMAScript).
2. Representing the result in XML.
22
Semantic Interpretation - Example
For example (1st approach), the user utterance:
"I would like a medium coca cola and a large
pizza with pepperoni and mushrooms."
could be converted to the following semantic result:
{
  drink: {
    beverage: "coke",
    drinksize: "medium"
  },
  pizza: {
    pizzasize: "large",
    topping: [ "pepperoni", "mushrooms" ]
  }
}
23
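A minimal sketch of the first approach, assuming a toy grammar written as a Python regular expression in which each named group plays the role of a semantic tag. The pattern, the `SYNONYMS` table and the function `interpret` are illustrative assumptions, not the W3C tag syntax.

```python
import re

# Toy grammar for one kind of order; named groups act as semantic tags.
ORDER = re.compile(
    r"i would like a (?P<drinksize>small|medium|large) (?P<beverage>coca cola|tea)"
    r" and a (?P<pizzasize>small|medium|large) pizza"
    r" with (?P<toppings>\w+(?: and \w+)*)"
)
SYNONYMS = {"coca cola": "coke"}  # hypothetical normalisation table

def interpret(utterance):
    """Return the semantic result for an utterance, or None if it does not parse."""
    match = ORDER.fullmatch(utterance.lower().rstrip("."))
    if match is None:
        return None
    beverage = match.group("beverage")
    return {
        "drink": {
            "beverage": SYNONYMS.get(beverage, beverage),
            "drinksize": match.group("drinksize"),
        },
        "pizza": {
            "pizzasize": match.group("pizzasize"),
            "topping": match.group("toppings").split(" and "),
        },
    }

print(interpret("I would like a medium coca cola and a large pizza "
                "with pepperoni and mushrooms."))
```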
Pronunciation Lexicon
 Application developers sometimes need the
ability to tune speech engines, whether for
synthesis or recognition.
 W3C is developing a markup language for
an open portable specification of
pronunciation information using a standard
phonetic alphabet.
 The most commonly needed pronunciations
are for proper nouns such as surnames or
business names.
24
Call Control
Fine-grained control of speech (signal
processing) resources and telephony
resources in a VoiceXML telephony platform.
 Will enable application developers to use
markup to perform call screening, whisper
call waiting, call transfer, and more.
 Can be used to transfer a user from one
voice browser to another on a completely
different machine.

25
Voice Browser Interoperation

Mechanisms to transfer application state, such as
a session identifier, along with the user's audio
connections.
The user could start with a visual interaction on
a cell phone and follow a link to switch to a
VoiceXML application.
 The ability to transfer a session identifier
makes it possible for the Voice Browser
application to pick up user preferences and
other data entered into the visual application.
26
Voice Browser Interoperation (2)
 Finally,

the user could transfer from a
VoiceXML application to a customer
service agent.
The agent needs the ability to use their
console to view information about the
customer, as collected during the
preceding VoiceXML application. The
ability to transfer a session identifier can
be used to retrieve this information from
the customer database.
27
Voice Style Sheets?
 Some
extensions are proposed to
HTML 4.0 and CSS2 to support voice
browsing
 Prerecorded
content is likely to include
music and different speakers. These
effects can be reproduced to some
extent via the aural style sheets
features in CSS2.
28
Voice Style Sheets!
Authors want control over how the
document is rendered. Aural style
sheets (part of CSS2) provide a basis
for controlling a range of features:
 Volume
 Rate
 Pitch
 Direction
 Spelling out text letter by letter
 Speech fonts (male/female, adult/child etc.)
 Inserted text before and after element content
 Sound effects and music
29
How Does It Work?
 How
do I connect?
 Do I speak to the browser or does
the browser speak to me?
 What is seen on the screen?
 How do I enter input?
30
Problems
 How
does the browser understand what
I say?
 How can I tell it what I want?
…what
if it doesn’t understand?
31
Overview on Speech Technologies
 Speech
Synthesis
• Text to Speech
 Speech
Recognition
• Speech Grammars
• Stochastic n-gram models
 Semantic
Interpretation
32
What is Speech Synthesis?
 Generating
machine voice by arranging
phonemes (k, ch, sh, etc.) into words.
 There are several algorithms for
performing Speech Synthesis. The
choice depends on the task they're
used for.
33
How is Speech Synthesis
Performed?
 The
easiest way is to just record the
voice of a person speaking the
desired phrases.
• This is useful if only a restricted volume of
phrases and sentences is used, e.g.
schedule information of incoming flights.
The quality depends on the way recording
is done.
34
How is Speech Synthesis
Performed?
 Another
option is to record a large
database of words.
• Requires large memory storage
• Limited vocabulary
• No prosodic information
 More
sophisticated but worse in quality
are Text-To-Speech algorithms.
35
How is Speech Synthesis Performed?
Text To Speech
Text-To-Speech algorithms split the speech
into smaller units. The smaller the units, the
fewer of them are needed, but the quality also
decreases.
 An often used unit is the phoneme,
the smallest linguistic unit. Depending on the
language used, there are about 35-50
phonemes in western European languages,
i.e. we need only 35-50 single recordings.

february twenty fifth: f eh b r ax r iy t w eh n t iy f ih f th
36
Text To Speech
 The
problem is, combining them as
fluent speech requires fluent transitions
between the elements. The intelligibility
is therefore lower, but the memory
required is small.
 A solution is using diphones. Instead of
splitting at the transitions, the cut is done
at the center of the phonemes, leaving
the transitions themselves intact.
37
Text To Speech
 This
means there are now
approximately 1600 recordings needed
(40*40).
 The longer the units become, the more
elements there are, but the quality
increases along with the memory
required.
38
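The figure of about 1600 recordings can be checked with a few lines of arithmetic. The sketch assumes every ordered pair of phonemes counts as a diphone, which is an upper bound; the function name is made up here.

```python
def unit_inventory(num_phonemes):
    """Return (phoneme recordings, diphone recordings) for a phoneme count."""
    diphones = num_phonemes * num_phonemes  # every ordered pair of phonemes
    return num_phonemes, diphones

for n in (35, 40, 50):
    print(n, unit_inventory(n))
# 35 (35, 1225)
# 40 (40, 1600)
# 50 (50, 2500)
```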
Text To Speech
 Other
units which are widely used
are half-syllables, syllables, words, or
combinations of them, e.g. word
stems and inflectional endings.
 TTS is dictionary-driven. The larger the
dictionary resident in the browser is, the
better the quality.
 For unknown words, TTS falls back on rules
for regular pronunciation.
39
Text To Speech
 Vocabulary
is unlimited!!!
 But
what about the prosodic
information?
Pronunciation
depends on the context
in which a word occurs. Limited
linguistic analysis is needed.

How can I help?

Help is on the way!
40
Text To Speech

Another example:

I have read the first chapter.

I will read some more after lunch.
 For
these cases, and in the cases of
irregular words and name pronunciation,
authors need a way to provide
supplementary TTS information and to
indicate when it applies.
41
Text To Speech
 But
specialized representations for
phonemic and prosodic information can
be off putting for non-specialist users.
 For
this reason it is common to see
simplified ways to write down
pronunciation, for instance, the word
"station" can be defined as:
station: stay-shun
42
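A small sketch of how such simplified entries might be used: look the word up in an author-supplied lexicon first, and fall back to the default behaviour otherwise. The lexicon holds only the "station" entry from the slide; returning the spelling unchanged is a stand-in for the engine's own pronunciation rules, which are not modelled here.

```python
# Author-supplied pronunciations in the simplified notation.
LEXICON = {"station": "stay-shun"}

def pronounce(word):
    """Return the lexicon entry for a word, else the word itself
    (a placeholder for the engine's regular pronunciation rules)."""
    key = word.lower()
    return LEXICON.get(key, key)

print(pronounce("Station"))  # stay-shun
print(pronounce("train"))    # train
```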
Text To Speech
This approach encourages users to add
pronunciation information, leading to an
increase in the quality of spoken documents,
compared to more complex and harder to learn
approaches.
 This is where W3C comes in:
Providing a specification to enable consistent
control (generating, authoring, processing) of
voice output by speech synthesizers for
varying speech content, for use in voice
browsing and in other contexts.
43

Overview on Speech Technologies
Speech Synthesis
Text to Speech
 Speech
Recognition
• Speech Grammars
• Stochastic n-gram models
 Semantic
Interpretation
44
Speech Recognition
45
Speech Recognition
 Automatic
speech recognition is the
process by which a computer maps an
acoustic speech signal to text.
 Speech is first digitized and then
matched against a dictionary of coded
waveforms. The matches are
converted into text.
49
Speech Recognition
Types of voice recognition applications:
 Command systems recognize a few hundred
words and eliminate using the mouse or
keyboard for repetitive commands.
 Discrete voice recognition systems are used
for dictation, but require a pause between
each word.
 Continuous voice recognition understands
natural speech without pauses and is the
most process intensive.
50
Speech Recognition
 A speaker
dependent system is
developed to operate for a single
speaker.
 These systems are usually easier to
develop, cheaper to buy and more
accurate, but not as flexible as speaker
adaptive or speaker independent
systems.
51
Speech Recognition
 A speaker
independent system is
developed to operate for any speaker of
a particular type (e.g. American
English).
 These systems are the most difficult to
develop and the most expensive, and their
accuracy is lower than that of speaker dependent
systems. However, they are more
flexible.
52
Speech Recognition
 A speaker
adaptive system is developed
to adapt its operation to the
characteristics of new speakers. Its
difficulty lies somewhere between
speaker independent and speaker
dependent systems.
53
Speech Recognition
 Speech
recognition technologies today
are highly advanced.
 There is a huge gap between the ability
to recognize speech and the ability to
interpret speech.
54
How is Speech Recognition
Performed?
Speech recognition technology involves
complex statistical models that characterize
the properties of sounds, taking into account
factors such as male vs. female voices,
accents, speaking rate, background noise, etc.
 The process of speech recognition includes 5 stages:
1. Capture and digital sampling
2. Spectral representation and analysis
3. Segmentation
4. Phonetic Modeling
5. Search and Match
55
How is Speech Recognition
Performed?
Speech Grammars
 HMM (Hidden Markov Modelling)
 DTW (Dynamic Time Warping)
 NNs (Neural Networks)
 Expert systems
 Combinations of techniques.

HMM-based systems are currently the
most commonly used and most
successful approach.
56
Speech Grammars
 The
grammar allows a speech
application to indicate to a recognizer
what it should listen for, specifically:
 Words that may be spoken,
 Patterns in which those words may
occur,
 Language of the spoken words.
57
Speech Grammars
 In
simple speech recognition/speech
understanding systems, the expected
input sentences are often modeled by a
strict grammar (such as a CFG).
 In
this case, the user is only allowed to
utter those sentences that are explicitly
covered by the grammar.
• Good for menus, form filling, ordering
services, etc.
58
Speech Grammars
 Experience shows that a context free
grammar of reasonable complexity
can never foresee all the different
sentence patterns users come up with
in spontaneous speech input.
 This approach is therefore not sufficient
for robust speech recognition/
understanding tasks or free text input
applications such as dictation.
59
For Example
 Possible
answers to a question may be
"Yes" or "No”, but it could also be any
other word used for negative or positive
response. It could be "Ya," "you
betch'ya," "sure," "of course" and many
other expressions. It is necessary to
feed the speech recognition engine with
likely utterances representing the
desired response.
60
Speech Grammars
 What
is done?
• Beta and Pilot versions
• Upgrade versions
61
Speech Grammars - Example

<item repeat="0-1">very</item>

<item repeat="0-"> <ruleref uri="#digit"/> </item>

<item repeat="1-"> <ruleref uri="#digit"/> </item>

<item repeat="4-6"> <ruleref uri="#digit"/> </item>

<item repeat="10-"> <ruleref uri="#digit"/> </item>
62
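The repeat values above behave much like regular-expression quantifiers. The sketch below translates a repeat value into a quantifier and applies it to a string of digits. The helper names are assumptions for illustration, and the user's utterance is simplified to a digit string.

```python
import re

def repeat_to_quantifier(repeat):
    """Translate a repeat value such as "0-1", "4-6", "1-" or "3"
    into the equivalent regular-expression quantifier."""
    low, dash, high = repeat.partition("-")
    if not dash:                      # a single number: exactly that many times
        return "{%s}" % low
    return "{%s,%s}" % (low, high)    # an empty upper bound means "or more"

def matches_digits(repeat, spoken_digits):
    """True if the digit string satisfies <item repeat=...> around a digit rule."""
    pattern = r"\d" + repeat_to_quantifier(repeat)
    return re.fullmatch(pattern, spoken_digits) is not None

print(repeat_to_quantifier("4-6"))      # {4,6}
print(matches_digits("4-6", "12345"))   # True
print(matches_digits("4-6", "123"))     # False
print(matches_digits("0-", ""))         # True
```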
Speech Grammars - Example




<item repeat="0-1">
  <item repeat="0-1"> very </item>
  big
</item>
pizza
<item repeat="0-">
  <item repeat="0-1">
    <one-of>
      <item>with</item>
      <item>and</item>
    </one-of>
  </item>
  <ruleref uri="#topping"/>
</item>
63
Hidden Markov Model
Notations:
 $T$ = observation sequence length
 $O = \{o_1, o_2, \dots, o_T\}$ = observation sequence
 $N$ = number of states (we either know or guess)
 $Q = \{q_1, \dots, q_N\}$ = finite set of possible states
 $M$ = number of possible observations
 $V = \{v_1, v_2, \dots, v_M\}$ = finite set of possible observations
 $X_t$ = state at time $t$ (state variable)
64
Hidden Markov Model
Distributional parameters
 $A = \{a_{ij}\}$ where $a_{ij} = P(X_{t+1} = q_j \mid X_t = q_i)$ (transition probabilities)
 $B = \{b_i(k)\}$ where $b_i(k) = P(O_t = v_k \mid X_t = q_i)$ (observation probabilities)
 $\pi = \{\pi_i\}$ where $\pi_i = P(X_0 = q_i)$ (initial state distribution)
65
Hidden Markov Model
Definitions
 A Hidden Markov Model (HMM) is a five-tuple $(Q, V, A, B, \pi)$.
 Let $\lambda = \{A, B, \pi\}$ denote the parameters for a given HMM with fixed $Q$ and $V$.
66
Hidden Markov Model
Problems
1. Find $P(O \mid \lambda)$, the probability of the observations given the model.
2. Find the most likely state trajectory $X = \{x_1, x_2, \dots, x_T\}$ given the model and observations. (Find $X$ so that $P(O, X \mid \lambda)$ is maximized.)
3. Adjust the $\lambda$ parameters to maximize $P(O \mid \lambda)$.
67
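The first two problems are usually solved with the forward algorithm and the Viterbi algorithm; the slides do not name them, so that pairing is an assumption of this sketch. States and observations are represented by integer indices, the first observation is taken to be emitted from the initial state, and the numbers in `A`, `B` and `pi` are made up.

```python
def forward(obs, A, B, pi):
    """Problem 1: P(O | lambda), computed with the forward algorithm."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

def viterbi(obs, A, B, pi):
    """Problem 2: the most likely state sequence, as a list of state indices."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: delta[i] * A[i][j])
            step.append(best)
            new_delta.append(delta[best] * A[best][j] * B[j][o])
        backpointers.append(step)
        delta = new_delta
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for step in reversed(backpointers):   # trace the best path backwards
        state = step[state]
        path.append(state)
    return path[::-1]

# Made-up two-state model with two possible observations.
A = [[0.7, 0.3], [0.4, 0.6]]    # transition probabilities a_ij
B = [[0.9, 0.1], [0.2, 0.8]]    # observation probabilities b_i(k)
pi = [0.6, 0.4]                 # initial state distribution

print(forward([0, 1, 1], A, B, pi))
print(viterbi([0, 1, 1], A, B, pi))
```

Problem 3 (re-estimating the parameters) is left out of the sketch to keep it short.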
Language Models
 A Language
model is a probability
distribution over word sequences
• P(“And nothing but the truth”) ≈ 0.001
• P(“And nuts sing on the roof”) ≈ 0
68
The Equation
Notation:
$W' = \arg\max_W P(O \mid W)\, P(W)$
where $O$ is the observation sequence and $W$ ranges over candidate word sequences.
69
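A toy reading of the equation: each candidate word sequence W gets an acoustic score P(O|W) and a language-model score P(W), and the recognizer returns the candidate with the largest product. All probabilities below are invented for illustration, as is the function name `decode`.

```python
def decode(candidates):
    """candidates maps each word sequence W to (P(O|W), P(W)).
    Return the W that maximizes P(O|W) * P(W)."""
    return max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])

# The two phrases sound alike, so the acoustic scores are close;
# the language model decides between them.
candidates = {
    "and nothing but the truth": (0.30, 0.001),
    "and nuts sing on the roof": (0.35, 0.000001),
}
print(decode(candidates))  # and nothing but the truth
```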
The N-Gram (Markovian)
Language Model
 Hard to compute P(W)
• P(“And nothing but the truth”)
 Step 1: Decompose probability
P(“And nothing but the truth”) =
P(“And”) × P(“nothing” | “and”) ×
P(“but” | “and nothing”) × P(“the” | “and
nothing but”) × P(“truth” | “and nothing
but the”)
70
The Trigram Approximation
 Assume
each word depends only on the
previous two words (three words total –
tri means three, gram means writing)
P(“the” | “… whole truth and nothing but”) ≈
P(“the” | “nothing but”)
P(“truth” | “… whole truth and nothing but the”) ≈
P(“truth” | “but the”)
71
N-Gram - The Markovian Model
The Markovian state machine is an
automaton with statistical weights
 A state represents a phoneme, diphone or
word.
 We do not include all options, but only those
which are related to the context or subject.
 We calculate all probable paths from beginning
to end of phrase/word and return the one with
the maximum probability.

72
Back to Trigrams
 How
do we find the probabilities?
 Get real text, and start counting!
• P(“the” | “nothing but”) ≈
Count(“nothing but the”) / Count(“nothing but”)
73
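The counting recipe can be sketched in a few lines. The tiny "corpus" below is made up and far too small for real use; a practical model would also need smoothing for unseen word sequences, which is not shown.

```python
from collections import Counter

def trigram_probability(tokens, w1, w2, w3):
    """Estimate P(w3 | w1 w2) as Count(w1 w2 w3) / Count(w1 w2)."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    if bigrams[(w1, w2)] == 0:
        return 0.0  # context never seen in the text
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

text = ("the truth the whole truth and nothing but the truth "
        "and nothing but lies").split()

# "nothing but" occurs twice, once followed by "the".
print(trigram_probability(text, "nothing", "but", "the"))  # 0.5
```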
N-grams
 Why
stop at 3-grams?
 If P(z|…rstuvwxy) ≈ P(z|xy) is good,
then P(z|…rstuvwxy) ≈ P(z|vwxy) is
better!
 4-gram, 5-gram start to become
expensive...
74
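One way to see why longer N-grams become expensive: the number of distinct N-grams that could occur grows as the vocabulary size raised to the power N. The vocabulary size of 10,000 words below is an assumption chosen only for illustration.

```python
def possible_ngrams(vocabulary_size, n):
    """Number of distinct word sequences of length n over the vocabulary."""
    return vocabulary_size ** n

for n in (3, 4, 5):
    print(n, possible_ngrams(10_000, n))
```

Only a small fraction of these sequences ever appears in real text, but each extra word of context still needs far more training data to estimate reliably.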
The N-Gram (Markovian)
Language Model - Summary
 N-Gram
language models are used in
large vocabulary speech recognition
systems to provide the recognizer with
an a-priori likelihood P(W) of a given
word sequence W.
 The N-Gram language model is usually
derived from large training texts that
share the same language
characteristics as expected input.
75
Combining Speech Grammars
and N-Gram Models
 Using an N-Gram model in the recognizer
and a CFG in a (separate) understanding
component
 Integrating special N-Gram rules at various
levels in a CFG to allow for flexible input in
specific contexts
 Using a CFG to model the structure of
phrases (e.g. numeric expressions) that are
incorporated in a higher-level N-Gram model
(class N-Grams)
76
Overview on Speech Technologies
Speech Synthesis
Text to Speech
Speech Recognition
Speech Grammars
Stochastic n-gram models
 Semantic
Interpretation
77
Semantic Interpretation

We have recognized the phrases
and words, what now?
Problems
 What
does the user mean?
 We have the right keywords, but the
phrase is meaningless or unclear.
78
Semantic Interpretation
 As
stated before, the technologies of
speech recognition exceed those of
interpretation.
 Most interpreters are based on key
words.
• Sometimes this is not good enough!
79
Back To Voice Browsers
Making the Web accessible to more
of us, more of the time.
Personal Browser Demo
 Now we’ll talk about VoiceXML,
navigation and various problems
80
VoiceXML - Example 1
<?xml version="1.0"?>
<vxml version="2.0">
<form>
<block>Hello World!</block>
</form>
</vxml>

The top-level element is <vxml>, which is
mainly a container for dialogs. There are two
types of dialogs: forms and menus. Forms
present information and gather input; menus
offer choices of what to do next.
81
VoiceXML - Example 1
<?xml version="1.0"?>
<vxml version="2.0">
<form>
<block>Hello World!</block>
</form>
</vxml>

This example has a single form, which
contains a block that synthesizes and
presents "Hello World!" to the user. Since the
form does not specify a successor dialog, the
conversation ends.
82
VoiceXML - Example 2
Our second example asks the user for a choice
of drink and then submits it to a server script:
<?xml version="1.0"?>
<vxml version="2.0">
<form>
<field name="drink">
<prompt>Would you like coffee, tea, milk, or
nothing?</prompt>
<grammar src="drink.grxml"
type="application/grammar+xml"/>
</field>
<block>
<submit
next="http://www.drink.example.com/drink2.asp"/>
</block>
</form>
</vxml>
A field is an input field. The user must provide a value
for the field before proceeding to the
next element in the form.
83
VoiceXML - Example 2

A sample interaction is:
C (computer): Would you like coffee, tea, milk, or
nothing?
H (human): Orange juice.
C: I did not understand what you said. (a platform
specific default message.)
C: Would you like coffee, tea, milk, or nothing?
H: Tea
C: (continues in document drink2.asp)
84
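The behaviour of the field in this interaction can be imitated in a few lines: prompt, compare the answer with the allowed values, and re-prompt after a no-match. This is only a simulation of the dialog, not a VoiceXML interpreter; the set `DRINKS` stands in for the contents of drink.grxml, which the slides do not show.

```python
# Assumed contents of the drink grammar.
DRINKS = {"coffee", "tea", "milk", "nothing"}

def run_field(answers):
    """Replay the dialog over a list of user answers.
    Return (transcript, value); value is None if the field was never filled."""
    transcript = []
    for answer in answers:
        transcript.append("C: Would you like coffee, tea, milk, or nothing?")
        transcript.append("H: " + answer)
        value = answer.strip().lower().rstrip(".")
        if value in DRINKS:
            return transcript, value  # field filled; the form would now submit
        transcript.append("C: I did not understand what you said.")
    return transcript, None

transcript, drink = run_field(["Orange juice.", "Tea"])
print("\n".join(transcript))
print("drink =", drink)  # drink = tea
```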
VoiceXML - Architectural Model
[Figure: VoiceXML architectural model; the only surviving label is "Web Server"]
The VoiceXML interpreter context
may listen for a special escape
phrase that takes the user to a
high-level personal assistant,
or for escape phrases that alter
user preferences like volume
or text-to-speech characteristics.
The implementation platform generates
events in response to user actions
(e.g. spoken or character input received,
disconnect) and system events (e.g. timer expiration).
Scope of VoiceXML
 Output of synthesized speech (TTS).
 Output of audio files.
 Recognition of spoken input.
 Recognition of DTMF input.
 Recording of spoken input.
 Control of dialog flow.
 Telephony features such as call transfer and disconnect.
The language provides means for collecting
character and/or spoken input, assigning the
input to document-defined request variables,
and making decisions that affect the interpretation
of documents written in the language.
A document may be linked to other documents
through Universal Resource Identifiers (URIs).
86
VoiceXML
 VoiceXML is intended to be analogous
to graphical surfing.
 There are limitations.
 Excellent for menu applications.
 Awkward for open dialog applications
 There are other languages: VoXML,
omniviewXML
87
Navigation
 The
user might be able to speak the
word "follow" when she hears a
hypertext link she wishes to follow.
 The
user could also interrupt the browser
to request a short list of the relevant
links.
88
Navigation example
User: links?
Browser: The links are:
1 company info
2 latest news
3 placing an order
4 search for product details
Please say the number now
User: 2
Browser: Retrieving latest news...
89
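A small sketch of this exchange: the browser reads out a numbered list of links and follows the one whose number the user says. The wording of the out-of-range reply is an assumption added for the example; the link captions are those from the slide.

```python
LINKS = ["company info", "latest news", "placing an order",
         "search for product details"]

def list_links(links):
    """Build the browser's answer to the spoken command 'links?'."""
    lines = ["The links are:"]
    lines += ["%d %s" % (number, caption)
              for number, caption in enumerate(links, start=1)]
    lines.append("Please say the number now")
    return "\n".join(lines)

def follow(links, spoken):
    """Return the browser's reply to a spoken link number."""
    if spoken.isdigit() and 1 <= int(spoken) <= len(links):
        return "Retrieving %s..." % links[int(spoken) - 1]
    return "Sorry, please say a number between 1 and %d" % len(links)

print(list_links(LINKS))
print(follow(LINKS, "2"))  # Retrieving latest news...
```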
Navigation through Headings
 Another
command could be used to
request a list of the document's
headings. This would allow users to
browse an outline form of the document
as a means to get to the section that
interests them.
90
Navigation to Specific URLs
 Graphical
Browsers allow entering a
wanted URL in the browser window
 How
is this supported in Voice
Browsers?
 Think:
What problems do you anticipate?
• Will we be able to transfer from any voice
portal to any other?
• How do we know where to go?
91
How Slow / Fast ?
 If
voice browsers are meant to replace
human operator dialog, they must be
fast in response.
 Speech Recognition / Interpretation /
Synthesis depend on implementation
 When a user requests a certain
document, several related documents
can be downloaded for easier access.
92
Friendly vs. Annoying
 How
friendly do you want the service to
be?
 Friendly is sometimes time consuming.
 What percentage of the time does the
user talk and what percentage of the
time is he listening?
 What parameters can I control?
93
Voice and Graphics
 Can
I access the Voice Browser through
my computer?
• Some sites are authored only for voice.
• Some will be for both. This leads to more
difficulties which must be dealt with.
94
Inserted text

When a hypertext link is spoken by a speech
synthesizer, the author may wish to insert text
before and after the link's caption, to guide the
user's response.

For example:
<A href="driving.html">Driving instruction</A>
May be offered by the voice browser using the
following words:
For driving instructions press 1
95
Inserted text
 The words "For" and "press 1"
were added to the text embedded in the
anchor element.
 At first glance it looks as if this 'wrapper'
text should be left for the voice browser to
generate, but on further examination you
can easily find problems with this
approach.
96
Inserted text
 For
example, the text for the following
element cannot be “For”
<A href="LeaveMessage.html">Leave us a
message</A>
We need to say:
To leave us a message, press 5
97
Inserted text

The CSS2 draft specification includes the
means to provide "generated text" before and
after element content.
For example:
<A accesskey="5"
style='cue-before: "To"; cue-after: ", press 5"'
href="LeaveMessage.html">Leave us a message</A>
98
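How a voice browser might assemble the spoken prompt from the link caption and the author's inserted text can be sketched as below. The function and its parameter names are hypothetical; they mirror the property names used in the slide, not any real browser API.

```python
def spoken_link(caption, cue_before="", cue_after=""):
    """Wrap a link caption in the author's inserted text."""
    words = " ".join(part for part in (cue_before, caption) if part)
    return words + cue_after

print(spoken_link("Leave us a message", "To", ", press 5"))
# To Leave us a message, press 5
print(spoken_link("Driving instruction", "For", ", press 1"))
# For Driving instruction, press 1
```

Because the author supplies the wrapper text per link, "To" and "For" can be chosen to suit each caption.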
Handling Errors and Ambiguities

Users might easily enter unexpected or
ambiguous input, or just pause, providing no
input at all.

Some examples of errors which might generate
events:
 When presented with a numbered list of links, the
user enters a number that is outside the range
presented .
 The phrase uttered by the user matches more than
one template rule.
99
Handling Errors and Ambiguities
 The phrase/sound uttered doesn't match a known
command.
 The user loses track and the browser needs to
time-out and offer assistance
 “Um”s and “Err”s
Authors will have control over the browser
response to selection errors and timeouts.
 Other errors might be dealt with by the
browser or platform.

100
Some Nice Demos
 Email
assistant demo
 Bank service demo (cough, ambiguity)
 Financial Center Demo (“um”s)
 Telectronics Demo
101
Who has implemented VoiceXML
interpreters?
 BeVocal
Café
 General Magic
 HeyAnita's FreeSpeech Developer
Network
 IBM Voice Server SDK Beta Program
based on VoiceXML Version 1.0
 Motorola’s Mobile Application
Development Toolkit (MADK)
102
Who has implemented VoiceXML
interpreters?
 Nuance
Developer Network
 Open VXI VoiceXML interpreter
 PIPEBEACH’s speechWeb
 Telera’s DeVXchange
 Tellme Studio
 VoiceGenie
103
