Speech Technology Systems Architecture

KTH speech platform
• Generic framework
– for building demonstrators
– for research
– built mostly on in-house components
• Two major components
– Atlas – speech-technology platform
– SesaME - generic dialogue manager
KTH multimodal dialogue systems
Gulan
Waxholm
Olga
AdApt
August
The Waxholm system
[Block diagram; labelled parts: speech in, ASR, lexicon, “Wizard of Oz”, NLP, dialogue management, databases, graphics, TTS and multimodal agent, out]
Common features
• built on in-house components
– under continuous development
• limited reuse of software resources
• during development:
– expert knowledge is required
– highly labor intensive
Atlas
Flat model
[Diagram: the application and dialog engine sit directly on the components: SQL database, ASR, ASV, TTS, audio (coder, desktop, device), animated agent]
Single-layer model
[Diagram: the application and dialog engine reach the components through the component APIs; components: SQL database, ASR, ASV, TTS, audio (coder, desktop, device), animated agent]
Multi-layer model (1)
[Diagram: application and dialog engine, then the speech-tech API, then the component APIs, then the components: SQL database, ASR, ASV, TTS, audio (coder, desktop, device), animated agent]
Multi-layer model (2)
[Diagram: layers between the application and the components, in slide order]
• application, dialog engine
• speech-tech API
– dialog components
– high-level primitives
– services
– component interaction
• component APIs
• components: SQL database, ASR, ASV, TTS, audio (coder, desktop, device), animated agent
Components
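[Diagram: below the component APIs, an ASR is either attached directly or represented by a “pseudo ASR” (bridge or stub) that reaches an external ASR through Broker (CORBA), Communicator, (J)SAPI, or another mechanism marked “?”]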
Middleware levels (1)
• Component interaction
– resource handling (create, monitor, allocate, ..)
– media streams (connect, disconnect, split)
– representing information (text-hypotheses,
syntactic and semantic info, speaker info, ...)
Middleware levels (2)
• Services
– resource access
– play
• load and send media data
• make media device(s) render it
• log the action
– say
• TTS
• send media data to media device(s)
• make media device(s) render it
• log the action
Middleware levels (3)
• Services
– listen
• engage media processors (ASR, ASV, parser, …)
• make media device record data
• detect utterance
• send data in right format to processor(s), file(s), and other objects
• make processors work
• wait for processors to finish
• fuse results and deliver the “answer”
• log actions and results
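The listen steps above can be sketched in a few lines of Python. This is only an illustration, not the ATLAS API: the names `Hypothesis`, `Processor` and `listen`, and the rule that the most confident result wins the fusion, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Hypothesis:
    source: str        # name of the processor that produced the result
    text: str
    confidence: float


@dataclass
class Processor:
    # Hypothetical stand-in for a media processor (ASR, ASV, parser, ...).
    name: str
    run: Callable[[bytes], Tuple[str, float]]


def listen(record: Callable[[], bytes],
           processors: List[Processor],
           log: List[str]) -> Hypothesis:
    audio = record()                          # make the media device record data
    log.append(f"recorded {len(audio)} bytes")
    results = []
    for p in processors:                      # send the data to each processor and wait
        text, conf = p.run(audio)
        results.append(Hypothesis(p.name, text, conf))
        log.append(f"{p.name}: {text!r} ({conf:.2f})")
    # Assumed fusion rule: keep the hypothesis with the highest confidence.
    best = max(results, key=lambda h: h.confidence)
    log.append(f"fused: {best.text!r}")
    return best


# Usage with two fake recognisers and a fake recording device.
log: List[str] = []
asr_a = Processor("asr-a", lambda audio: ("transfer", 0.6))
asr_b = Processor("asr-b", lambda audio: ("transfer money", 0.8))
answer = listen(lambda: b"\x00" * 160, [asr_a, asr_b], log)
```

A real service would run the processors concurrently and detect the end of the utterance; the sequential loop only keeps the sketch short.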
Middleware levels (4)
• High-level primitives
– ask
• ‘say’ prompt
• ‘listen’ to answer
• give caller full access to processors and their results
• log actions and results
– askSimple
• same as ask, but returns fused results only
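The relation between ask and askSimple can be illustrated as follows. This is a hedged Python sketch, not the `atlas.app.SpeechActs` interface: the function signatures, the result dictionary and its `"fused"` key are assumptions made for the example.

```python
from typing import Callable, Dict, List


def ask(prompt: str,
        say: Callable[[str], None],
        listen: Callable[[], Dict[str, str]],
        log: List[str]) -> Dict[str, str]:
    """'say' the prompt, 'listen' to the answer, return every processor's result."""
    say(prompt)
    results = listen()          # assumed shape: one entry per processor plus "fused"
    log.append(f"ask({prompt!r}) -> {results}")
    return results


def ask_simple(prompt: str,
               say: Callable[[str], None],
               listen: Callable[[], Dict[str, str]],
               log: List[str]) -> str:
    """Same as ask, but returns the fused result only."""
    return ask(prompt, say, listen, log)["fused"]


# Usage with fake 'say' and 'listen' services.
spoken: List[str] = []
log: List[str] = []
fake_listen = lambda: {"asr": "five hundred", "asv": "speaker-17", "fused": "500"}
full = ask("How much?", spoken.append, fake_listen, log)
fused = ask_simple("How much?", spoken.append, fake_listen, log)
```

The caller of `ask` can inspect each processor's output, while `ask_simple` hides that detail behind a single value.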
Middleware levels (5)
• Dialog components
– user interaction for a special purpose
– has domain knowledge
– error handling/recovery
• no answer
• invalid amount, account, etc.
• re-ask, formulation variation
– can provide help
– database lookup
– cf. Nuance “SpeechObjects”, Philips “Speech Blocks”, ...
Middleware levels (6)
• Dialog components (cont.)
– login procedure
• one or more operations (steps)
• each step produces or validates speaker hypotheses
• procedure returns a speaker hypothesis with status
• includes database lookup, etc.
– enrollment procedure
• special case of login procedure
• enrollment operation is iterative when asking for data
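A login procedure built from steps might look like the sketch below. It is a minimal Python illustration under stated assumptions: `SpeakerHypothesis`, the status labels, the `ACCOUNTS` table standing in for the database, and the fixed verification score are all hypothetical and are not taken from `atlas.login`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SpeakerHypothesis:
    speaker_id: Optional[str]
    status: str                 # assumed labels: "pending", "accepted", "rejected"


Step = Callable[[SpeakerHypothesis], SpeakerHypothesis]


def login(steps: List[Step]) -> SpeakerHypothesis:
    """Run the steps in order; each one produces or validates the hypothesis."""
    hyp = SpeakerHypothesis(None, "pending")
    for step in steps:
        hyp = step(hyp)
        if hyp.status == "rejected":
            break               # stop at the first failed step
    return hyp


ACCOUNTS = {"1234": "anna"}     # hypothetical stand-in for the database lookup


def identify_by_account(hyp: SpeakerHypothesis) -> SpeakerHypothesis:
    # A real step would ask for the account number; it is hard-coded here.
    name = ACCOUNTS.get("1234")
    return SpeakerHypothesis(name, "pending" if name else "rejected")


def verify_voice(hyp: SpeakerHypothesis) -> SpeakerHypothesis:
    score = 0.9                 # a real step would obtain this from the ASV component
    return SpeakerHypothesis(hyp.speaker_id, "accepted" if score >= 0.5 else "rejected")


result = login([identify_by_account, verify_voice])
```

An enrollment procedure could reuse the same loop with steps that repeat their question until enough data has been collected.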
Middleware levels (7)
• Dialog components (cont.)
– “complex question”:
– in CTT-bank
• money amount
• account name
• yes/no
ATLAS
application, dialog engine (atlas.app)
speech-tech API
dialog comp. [atlas.login,..]
high-level prim. [atlas.app.SpeechActs]
services [atlas.app.SpeechIO / rc.api.AppResources]
component interaction [atlas.rc / media / rc.audio / uinfo]
component APIs [atlas.rc.api]
[atlas.internal.rc]
[atlas.broker.rc]
[atlas.communicator.rc]
Core packages
atlas.basic
atlas.uinfo
atlas.media
atlas.terminal
atlas.rc
atlas.rc.audio
atlas.rc.api
atlas.app
ATLAS
System model
Application
Terminal 1
Terminal 2
Terminal N
Resources
Session
Project packages
atlas.*
atlas.internal.*
cttbank.*
broker.*
atlas.broker.*
per.*
Common platform
CTT-bank, PER
Generic dialogue management?
[Diagram: CTT-bank and PER on top of ATLAS, which provides the speech-tech API, the component APIs and the components: SQL database, ASR, ASV, TTS, audio (coder, desktop, device), animated agent]
SesaME
SesaME – the playground
• focus on simple task oriented dialogues
– accessing information (personal, public)
– controlling appliances & services
• hypothesis: task-oriented dialogues can be described in a formalised way
Common platform
[Diagram; labels in slide order: application / service platform; dialogue descriptions; generic dialogue manager (SesaME); speech-tech API; ATLAS; components: SQL database, ASR, ASV, TTS, audio (coder, desktop, device), animated agent]
SesaME - goals
• platform for research & demonstrators
• dialogue management
– task oriented
– generic, dynamic
– asynchronous
• support for
– multi-domain approach
– adaptations & personalisation
– user modeling
– situation awareness
SesaME
• features:
– dynamic plug & play dialogues
– modular, agent based architecture
– information state approach
– event based dialogue management
– domain descriptions are based on extended VoiceXML descriptions
Major components
• Interaction manager (IM)
– controls the information flow
– manages interaction with the system components and with the user
• Dialog engine (DE)
– dialogue interpretation
• Application interface (AI)
– application-specific component
– communication with the application/service
On start
• AI collects all available Dialogue Descriptions (DD)
• Dialogue Descriptions are represented in an extended VoiceXML formalism
– seminar.vxml, meeting.vxml, curs.vxml, visitor.vxml
• IM builds a register of the available DDs
– the Dialogue Description Collection (DDC)
– a vector is built from topics and associated keywords
– ”seminarium” (seminar), ”möte” (meeting), ”besök” (visit)...
• IM controls the activation of the DDs
New utterance
”Jag vill gå på Mats Blombergs seminarium.” (“I want to go to Mats Blomberg's seminar.”)
• Prediction of the most plausible DD
– through topic prediction: ”seminarium”
– other mechanisms are planned (context, user models)
• DE activates the chosen DD
– seminar.vxml
– internal data structures are created
• DE performs the dialogue interpretation
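Keyword-based topic prediction over the Dialogue Description Collection can be sketched as below. This is a minimal Python illustration, not SesaME's implementation: the pairing of each keyword with a .vxml file, the function name `predict_dd` and the simple keyword-count score are assumptions made for the example.

```python
import re
from typing import Dict, List, Optional

# Dialogue Description Collection: DD file -> topic keywords.
# File names and keywords appear on the slides; their pairing is assumed here.
DDC: Dict[str, List[str]] = {
    "seminar.vxml": ["seminarium"],
    "meeting.vxml": ["möte"],
    "visitor.vxml": ["besök"],
}


def predict_dd(utterance: str, ddc: Dict[str, List[str]]) -> Optional[str]:
    """Return the DD whose keywords match the most words, or None if none match."""
    words = re.findall(r"\w+", utterance.lower())
    best, best_score = None, 0
    for dd, keywords in ddc.items():
        score = sum(1 for w in words if w in keywords)
        if score > best_score:
            best, best_score = dd, score
    return best


chosen = predict_dd("Jag vill gå på Mats Blombergs seminarium.", DDC)
```

With the example utterance the word ”seminarium” selects seminar.vxml; an utterance without any known keyword yields no prediction, which is where context or user models could step in.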
Interaction Manager
• controls and synchronises the components
• priority structures
• topic prediction – predicts which DD to use
• supervises the DE
• may suggest plausible parameters based on the
context & user models
• supervises the interaction with the user
• error detection, management
• deadline management etc.
Interaction Manager – How?
• event based
• autonomous modules (software agents)
– carry out one atomic task each
– are triggered by a set of preconditions
– high level of parallelism
– concurrency
– cooperation
• centralised information management: blackboard
– all information is available to all modules
– information is not destroyed
– information handling through a subscribe – notify – fetch mechanism
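The blackboard's notify and fetch handling can be illustrated with a small Python sketch. The class and method names (`Blackboard`, `subscribe`, `post`, `fetch`) are hypothetical; the slides name the mechanism but not its interface. Posted values are appended rather than overwritten, reflecting the point that information is not destroyed.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List


class Blackboard:
    def __init__(self) -> None:
        # Every key keeps its full history; nothing is overwritten or deleted.
        self._entries: Dict[str, List[Any]] = defaultdict(list)
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, key: str, callback: Callable[[str], None]) -> None:
        """Register interest in a key (an agent's precondition)."""
        self._subscribers[key].append(callback)

    def post(self, key: str, value: Any) -> None:
        """Store a value and notify subscribers; they fetch the data themselves."""
        self._entries[key].append(value)
        for callback in list(self._subscribers[key]):
            callback(key)

    def fetch(self, key: str) -> List[Any]:
        """Return a copy of everything posted under the key so far."""
        return list(self._entries[key])


bb = Blackboard()
seen: List[str] = []
# A toy agent: triggered by a new utterance, it fetches the latest one.
bb.subscribe("utterance", lambda key: seen.append(bb.fetch(key)[-1]))
bb.post("utterance", "hej")
bb.post("utterance", "seminarium")
```

Notification carries only the key, so each agent decides how much of the stored history it needs when it fetches.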
Plug & play dialogues
[Diagram; labelled parts: Application Interface, Interaction Manager, A-Agents, Blackboard, Keyword handler, Dialogue description collection, Dialogue Engine, VoiceXML activator (JAXB translator), Dialog interpreter, Dialogue bridge, ATLAS Speech Technology API; further labels: “VoiceXML”, “notify”]
Dialogue Engine
• Internal parallel slot structures
– system prompt
– acceptable answers
– reprompts etc.
• Parallel system slots
– used for predictions
– available for UM, CM
• Parallel application specific slots
– related information
– available for DKM
Interpretation
• go to next empty slot
– ask the prompt
– interpret the answer
– fill the slot, or re-prompt
• if all slots are filled: successful transaction
– AI sends the required parameters and commands to the application
– the next DD, if any, is activated
• otherwise: unsuccessful transaction
– the DD with all parameters is saved
– a specific DD for error management is activated
– error management
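The slot-filling loop of the interpretation step can be sketched as follows. This is a hedged Python illustration rather than the Dialogue Engine's code: the `Slot` class, the `interpret` function and the `max_tries` limit on re-prompts are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Slot:
    name: str
    prompt: str
    accept: Callable[[str], bool]      # decides whether an answer is acceptable
    value: Optional[str] = None


def interpret(slots: List[Slot],
              ask: Callable[[str], str],
              max_tries: int = 2) -> Optional[Dict[str, str]]:
    """Fill each empty slot in turn; return None if a slot cannot be filled."""
    for slot in slots:                         # go to the next empty slot
        if slot.value is not None:
            continue
        for _ in range(max_tries):             # first prompt plus re-prompts
            answer = ask(slot.prompt)
            if slot.accept(answer):
                slot.value = answer            # fill the slot
                break
        else:
            return None                        # unsuccessful transaction
    return {s.name: s.value for s in slots}    # successful transaction


# Usage with scripted answers: the first answer is rejected and re-prompted.
answers = iter(["maybe", "yes", "500"])
slots = [
    Slot("confirm", "Transfer money?", lambda a: a in ("yes", "no")),
    Slot("amount", "How much?", str.isdigit),
]
result = interpret(slots, lambda prompt: next(answers))
```

On success the returned parameters are what the application interface would pass on; on failure the caller can save the partially filled slots and activate an error-management dialogue.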
What is left to be done?
• NLP analysis to be integrated in Atlas and SesaME
• NLP generation in SesaME
• a more elaborate dialogue management formalism in SesaME
• support for adaptation and personalisation
• enabling conversational dialogues
The End