Ubiquitous Computing
Max Mühlhäuser, Iryna Gurevych (Editors)
Part V : Ease-of-use
Chapter 18: Mouth-and-Ear Interaction
Dirk Schnelle
Introduction
• Mobile workers
  – Need information to perform their task
  – Have their hands busy with their task
  – Look at what they are doing
• Problem
  – Hands & eyes devices hinder them in performing their task
• Solution
  – Audio channel
• Advantages of audio
  – Can be used without changing focus
  – Can be used in parallel
  – Still functional under extreme conditions
    • Darkness
    • Limited freedom of movement
  – Can be used in addition to other modalities
• Disadvantages of audio
  – Unusable in noisy environments
Is voice a natural way of communication?
"Reports and theses in speech recognition often begin with a cliché, namely
that speech is the most natural way for human beings to communicate with
each other"
(Melvyn Hunt, 1992)
• Speech is one of the most important means of communication
• Speech has been an efficient means of communication throughout the
history of mankind
• Is this still valid with the computer as the counterpart?
• This vision requires the computer to have the active and passive
communication competence of a human
• Advances in Natural Language Processing are first steps towards this vision
• Today
  – Still a vision
  – E.g. banking companies observe a trend towards graphical over voice-based
    use of their interfaces
Dialog Strategies
• Main concepts of voice-based interaction (a sketch contrasting two of them
  follows below)
  – Command & Control
  – Menu Hierarchy
  – Form based
  – Mixed Initiative
    • User Initiative
    • System Initiative
    • Mixed Initiative
• Voice-based interaction must be
  – Easy to learn
  – Easy to remember
  – Natural
  – Acoustically different
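A minimal sketch, not from the chapter, contrasting system initiative and mixed initiative for a flight-booking form. The "recognizer" is simulated via console input, and the parse() slot filler and slot names are illustrative assumptions:

```python
def ask(slot):
    # Directed (system-initiative) prompt: exactly one slot per turn.
    return input(f"Please say the {slot}. ").strip()

def parse(utterance, slots):
    # Hypothetical slot filler: expects "slot: value" pairs separated by commas.
    found = {}
    for part in utterance.split(","):
        if ":" in part:
            key, value = part.split(":", 1)
            if key.strip() in slots:
                found[key.strip()] = value.strip()
    return found

def system_initiative(slots):
    # The system drives the dialog completely.
    return {slot: ask(slot) for slot in slots}

def mixed_initiative(slots):
    # Open prompt first: the user may fill several slots in one utterance,
    # then the system falls back to directed prompts for missing slots.
    form = parse(input("How can I help you? "), slots)
    for slot in slots:
        form.setdefault(slot, ask(slot))
    return form

if __name__ == "__main__":
    print(mixed_initiative(["origin", "destination", "date"]))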
Definitions
• Voice User Interfaces (VUIs) are user interfaces using speech
input through a speech recognizer and speech output through
speech synthesis.
• Audio User Interfaces (AUIs) are an extension of VUIs, also allowing
the use of sound as a means to communicate with the user.
• Today
  – The term VUI is used in the sense of AUI
Domain-Specific Development
• Applications have to be integrated into the domain
– Business context
– Infrastructure
• Manual adaptation is
  – Labor intensive
  – Requires heavy involvement of experts
    • Vocabulary
    • Grammar
    • Semantics
 High costs
Attempts to reduce costs
• Speech contexts
– Modular solutions for handling specific tasks in the dialog
– E.g. widgets for entering time and date for a recurring appointment
– Can be used in RAD (rapid application development) tools to develop
  applications (see the sketch below)
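A minimal sketch, not from the chapter, of such a speech context: a reusable date-entry sub-dialog that a RAD tool could drop into any application. The recognizer is simulated via console input, and the prompts are illustrative:

```python
from datetime import date

class DateContext:
    """Self-contained sub-dialog that yields a datetime.date."""

    PROMPTS = [("year", "Say the year."),
               ("month", "Say the month (1-12)."),
               ("day", "Say the day of the month.")]

    def run(self):
        values = {}
        for slot, prompt in self.PROMPTS:
            while True:
                try:
                    values[slot] = int(input(prompt + " "))
                    break
                except ValueError:
                    print("Sorry, I did not understand.")  # simple re-prompt
        return date(values["year"], values["month"], values["day"])

if __name__ == "__main__":
    print("Recognized date:", DateContext().run())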
Speech Graffiti
• Universal Speech Interface by Carnegie Mellon University
• Main idea
  – Transfer the look & feel of graphical interfaces to audio
• Compromise between
  – Unconstrained NLU
  – Hand-crafted menu hierarchies as they are used in today's telephony
    applications
• Example (cf. the sketch below)
  System: …movie is Titanic … theatre is the Manor …
  User: What are the show times?
  – "what are" is standardized
  – "show times" is still domain specific
 Speech Graffiti is not self-contained
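A minimal sketch, assuming a toy movie domain, of this split: the query operator ("what are") is universal, while the slot names ("show times", "theatre") come from a domain-specific lexicon. All names are illustrative, not CMU's actual implementation:

```python
MOVIE_DOMAIN = {"show times": ["7:15 pm", "9:45 pm"],
                "theatre": ["the Manor"]}

UNIVERSAL_QUERY = "what are"   # standardized across all applications

def answer(utterance, domain):
    utterance = utterance.lower().rstrip("?")
    if utterance.startswith(UNIVERSAL_QUERY):
        slot = utterance[len(UNIVERSAL_QUERY):].strip()
        if slot.startswith("the "):            # tolerate a leading article
            slot = slot[len("the "):]
        if slot in domain:                     # the domain-specific part
            return slot + ": " + ", ".join(domain[slot])
        return "Sorry, I know nothing about '" + slot + "'."
    return "Please use a 'what are ...' query."

print(answer("What are the show times?", MOVIE_DOMAIN))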
Size of vocabulary
• The application domain has a strong impact on the size of the vocabulary

Domain                        | Vocabulary size (words) | Dialog strategy
Alarm system (alarm command)  | 1                       | command & control
Menu selection (Yes/No)       | 2                       | command & control, menu hierarchy
Digits                        | 10+x                    | menu hierarchy
Control                       | 20-200                  | command & control
Information dialog            | 500-2,000               | menu hierarchy, form filling
Daily talk                    | 8,000-20,000            | mixed initiative
Dictation                     | 20,000-50,000           | n/a
German (no foreign words)     | ca. 300,000             | n/a
STAIRS
• STAIRS
– Structured Audio Information Retrieval System
– Aims at delivering information to mobile workers
• Possible scenarios
– Pick by voice
– Museum Guide
– training@work
• Key features
– Audio output
– Use of contextual information
– Command set
STAIRS - Audio Output
• Structuring information
  – No problem in visual media
  – Helps the user to stay oriented
  – Allows for fast and easy access to the desired information
• Structural metadata gets lost in audio
 Structured audio (see the sketch below)
  • Background sounds
  • Auditory icons
  • Earcons
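A minimal sketch of structured audio: an earcon signalling the nesting level is played before each item is read out, restoring some of the lost structural metadata. play(), speak(), and the file names are hypothetical stand-ins for a real audio backend and synthesizer:

```python
EARCONS = {1: "chapter-chime.wav", 2: "step-tick.wav"}

def play(wav):
    print("[earcon:", wav + "]")        # stand-in for audio playback

def speak(text):
    print("[tts]", text)                # stand-in for speech synthesis

def read_structured(items):
    """items: (nesting_level, text) pairs in document order."""
    for level, text in items:
        play(EARCONS.get(level, "item-click.wav"))
        speak(text)

read_structured([(1, "Maintenance manual"),
                 (2, "Step 1: open the service hatch"),
                 (2, "Step 2: check the fuse")])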
STAIRS – Context information
• Location is the most important context
  – Absolute positioning: a world model is needed
  – Relative positioning: based on tags
• Artifacts of the world layer are starting points for browsing in the
  document layer
• The mobile worker moves in
  – The physical layer
  – The document layer
STAIRS – Command Set
• Problem
  – Embedded devices allow only a few commands
• Observation
  – Only a few commands are needed to control a browser
  – Still allows handling complex applications
 Transfer this concept to audio
STAIRS – Minimal Command Set
• Audio commands must be
  – Intuitive
  – Easy to learn
  – Acoustically different
• ETSI standardized several command sets for different scenarios
• Survey
  – Result close to ETSI
  – Some commands are missing
(A dispatch sketch follows the table.)

Function                               | Command
Provide information about current item | Details
Cancel current operation               | Stop
Read prompt again                      | Repeat
Go to next node or item                | Next
Go to previous node or item            | Go back
Go to top level of service             | Start
Go to last node or item                | Last
Confirm operation                      | Yes
Reject operation                       | No
Help                                   | Help
Stop temporarily                       | Pause
Resume interrupted playback            | Continue
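A minimal sketch of dispatching this command set over a flat list of audio items. The recognizer is simulated: handle() receives the already-recognized command word, and the items and behavior details are illustrative:

```python
class AudioBrowser:
    def __init__(self, items):
        self.items, self.pos = items, 0

    def handle(self, cmd):
        if cmd == "next":      self.pos = min(self.pos + 1, len(self.items) - 1)
        elif cmd == "go back": self.pos = max(self.pos - 1, 0)
        elif cmd == "start":   self.pos = 0
        elif cmd == "last":    self.pos = len(self.items) - 1
        elif cmd == "repeat":  pass                       # re-read current item
        elif cmd == "details": return "details of " + self.items[self.pos]
        elif cmd == "help":    return "say next, go back, start, last, ..."
        else:                  return "unknown command"   # out of vocabulary
        return self.items[self.pos]

browser = AudioBrowser(["intro", "safety", "assembly", "testing"])
for cmd in ["next", "next", "go back", "details"]:
    print(cmd, "->", browser.handle(cmd))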
Challenges with audio
• Technical problems
  – Speech recognition performance
  – Speech synthesis quality
  – Recognition: flexibility vs. accuracy
• Inherent to audio
  – Speech is transient
  – Invisibility
  – Asymmetry
  – One-dimensionality
• Additional challenges with UC
  – Conversation
  – Privacy
  – Service availability
Speech Recognition Performance
• Speech is not recognized with an accuracy of 100%
  (even humans cannot do that)
• People are used to graphical user interfaces with a recognition rate
  of 100%
• Sentences are NOT only sequences of words, and words are NOT only
  sequences of phonemes
• Difficulty increases from plain text over continuous speech to
  spontaneous speech without punctuation; further complications are
  variances, articulatory slurring, channel problems, and the cocktail
  party effect
Speech is…
• Continuous
• Flexible
– Inherent to the audio channel
– Acoustic problems
• Complex
• Ambiguous
Speech synthesis quality
• The quality of modern speech synthesizers is still low
• In general, people prefer to listen to prerecorded audio
 Conflict
• Texts have to be known in advance
• Texts must be prerecorded
• But: not all texts are known in advance
Current solutions
• Audio snippets are prerecorded and pasted together as needed
• Professional speakers are able to produce voice with a similar pitch
• Human listeners are very sensitive and able to hear small differences
Flexibility vs. accuracy
• Speech has many forms of expression for the same issue
• Natural language interfaces must serve them all
Example task: enter a date (see the sketch below)
• Flexible interface
  – Allow the user to enter the date in any format
    • "March 2nd", "Yesterday", "2nd of March 2004", …
   requires more work by the recognition software
   error prone
   preferred by users
• Restrictive interface
  – Prompt the user individually for each component of the date
    • "Say the year", "Say the month", …
   less work required from the recognition software
   less error prone
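A minimal sketch of the flexible side of this trade-off, assuming a hand-written parser for a few date phrasings; a real system would use a recognition grammar instead. The restrictive alternative is the directed per-slot dialog sketched earlier:

```python
import re
from datetime import date, timedelta

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

def flexible_date(utterance, today=date(2004, 3, 3)):
    """Accepts 'Yesterday', 'March 2nd', '2nd of March 2004', ... or fails."""
    u = utterance.lower()
    if u == "yesterday":
        return today - timedelta(days=1)
    # "2nd of March 2004" style: day first, optional year
    m = re.search(r"(\d{1,2})(?:st|nd|rd|th)?\s+(?:of\s+)?([a-z]+)\s*(\d{4})?", u)
    if m and m.group(2) in MONTHS:
        return date(int(m.group(3) or today.year),
                    MONTHS[m.group(2)], int(m.group(1)))
    # "March 2nd" style: month first
    m = re.search(r"([a-z]+)\s+(\d{1,2})(?:st|nd|rd|th)?", u)
    if m and m.group(1) in MONTHS:
        return date(today.year, MONTHS[m.group(1)], int(m.group(2)))
    return None  # recognition failed: the price of flexibility

for phrase in ["March 2nd", "Yesterday", "2nd of March 2004"]:
    print(phrase, "->", flexible_date(phrase))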
Medium inherent challenges
• One-dimensionality
  – The eye is an active organ, but the ear is passive
   The ear cannot browse a set of recordings as the eye can browse a set
    of documents
• Transience
  – Listening is controlled by the STM (short-term memory)
  – Users forget what was said in the beginning
   Audio is not ideal for delivering large amounts of data
   Users get lost in VUIs (lost-in-space problem)
• Invisibility
  – VUIs must indicate to the user what to say and what actions to perform
   Leaves the impression of not being in control
   Advantage: audio can be received without the need to switch context
• Asymmetry
  – People speak faster than they type but listen more slowly than they can read
   Can be exploited in multi-modal scenarios
Challenges for UC
• Conversation
  – Scenario: the user speaks to another person
  – The risk of triggering unwanted commands is higher than in noisy
    environments
• Privacy
  – Audio input and output should be hidden from others
  – Impossible to avoid
  – Only solution: do not deliver critical or confidential information
• Service availability
  – A speech-dependent service may become unavailable while the user is on
    the move
   may have an impact on the vocabulary that can be used
  – The user has to be notified about the change
Error Handling
• Possible causes for errors
  – Use of an out-of-vocabulary word
  – Use of a word that cannot be used in the current context
  – Recognition error
  – Background noise
  – Incorrect pronunciation (e.g. a non-native speaker)
• Applications may detect errors, but without knowing the source
• Ways to deal with errors (a sketch of one recovery strategy follows below)
  – Do nothing and let the user find and correct the error
  – Prevent the user from continuing
  – Complain and warn the user using a feedback mechanism
  – Correct it automatically without user interaction
  – Initiate a procedure to correct the error with the user's help
 Users will refrain from working with the application
 Better error prevention
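A minimal sketch of the last strategy above: initiating a correction sub-dialog when recognition confidence is low. The recognizer and its confidence scores are simulated, and the threshold is illustrative:

```python
import random

def recognize(prompt):
    """Stand-in recognizer: returns (hypothesis, confidence in [0, 1])."""
    return input(prompt + " "), random.random()

def robust_input(prompt, threshold=0.6, max_tries=3):
    for _ in range(max_tries):
        hypothesis, confidence = recognize(prompt)
        if confidence >= threshold:
            return hypothesis                  # accept silently
        # Low confidence: complain and let the user confirm or correct.
        if input("Did you say '" + hypothesis + "'? (yes/no) ").lower() == "yes":
            return hypothesis
    raise RuntimeError("too many recognition errors; giving up")

print(robust_input("Which item do you want?"))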
Error prevention
• Reduce the probability of misrecognized words
  – Limit background noise (noise-cancelling headsets)
  – Allow the user to turn off the device (push-to-talk; see the sketch below)
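A minimal sketch of push-to-talk gating: audio frames reach the recognizer only while the talk key is held, so conversation and background noise cannot trigger commands. The frame and key-state data is simulated; a real system would read both from device drivers:

```python
def gated_frames(frames, key_pressed):
    """Yield only frames recorded while the push-to-talk key was down."""
    for frame, pressed in zip(frames, key_pressed):
        if pressed:
            yield frame

audio = ["f0", "f1", "f2", "f3", "f4"]
key   = [False, True, True, False, False]
print(list(gated_frames(audio, key)))   # -> ['f1', 'f2']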
Lost in space
• A user gets lost in space if she does not know where in the dialog graph
  she is
Possible solutions to avoid getting lost in space (see the sketch below)
• Calling
  – Someone calls you from somewhere and you can head in that direction to
    find her
• Telling
  – Someone next to you gives you directions
• Landmarks
  – You hear a familiar sound and you can use it to orient yourself and work
    out where to head
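A minimal sketch of the landmarks idea: every region of the dialog graph loops a distinctive background sound, so the user can hear where she is without asking. Region names, sound files, and the playback backend are all illustrative:

```python
LANDMARKS = {"main menu": "fountain.wav",
             "settings": "workshop.wav",
             "mailbox": "birds.wav"}

def loop_sound(wav):
    print("[looping background sound:", wav + "]")   # stand-in for playback

def enter_node(node, region):
    loop_sound(LANDMARKS[region])   # the audible landmark for this region
    print("[tts] You are at '" + node + "'.")

enter_node("new messages", "mailbox")
enter_node("volume", "settings")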
General solutions
• Guidelines
  – Try to capture design knowledge in small rules
• Guidelines have disadvantages
  – Often too simplistic or too abstract
  – Often have authority issues concerning their validity
  – Can be conflicting
  – Can be difficult to select
  – Can be difficult to interpret
Top 10 VUI No-No’s
• Don't use open ended prompts
• Make sure that you don't have endless
loops
• Don't use text-to-speech unless you
have to
• Don't mix voice and text-to-speech
• Don't put into your prompt something
that your grammar can't handle
• Don't repeat the same prompt over and
over again
• Don't force callers to listen to long
prompts
• Don’t switch modes on the caller
• Don’t go quiet for more than 3 seconds
• Don’t listen for short words
http://www.angel.com/services/vui-design/vui-nonos.jsp
VUI pattern language
• Patterns
  – Capture design knowledge
• Form
  – Context
  – Solution
  – Structure
    • Preconditions
    • Postconditions
• Current trend
  – Transfer guidelines into patterns (see the sketch below)
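A minimal sketch of machine-readable VUI patterns: each pattern carries its context, solution, and pre-/postconditions so a tool can check which patterns apply in a given dialog state. The field names and the example pattern are assumptions, not a standardized schema:

```python
from dataclasses import dataclass, field

@dataclass
class VUIPattern:
    name: str
    context: str
    solution: str
    preconditions: list = field(default_factory=list)    # predicates on state
    postconditions: list = field(default_factory=list)

    def applicable(self, state):
        return all(pred(state) for pred in self.preconditions)

escalating_help = VUIPattern(
    name="Escalating Detail",
    context="The user repeatedly fails at the same prompt.",
    solution="Re-prompt with progressively more detailed instructions.",
    preconditions=[lambda s: s["errors_at_prompt"] >= 2],
)

print(escalating_help.applicable({"errors_at_prompt": 3}))   # True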
Future research directions
• The challenges of audio are still not completely solved
• Current solutions
– Guidelines
– Patterns
• Patterns seem to be more promising
• More pattern mining needed
• Tools to apply the patterns
• Use patterns in code generators for modality-independent UI
languages (e.g. XForms)