Forschungszentrum Telekommunikation Wien [Telecommunications Research Center Vienna]
Interfaces between Speech and Non-Speech Audio Technology
Michael Pucher (FTW Vienna, ICSI Berkeley)

Contents
- Text-to-Speech Synthesis (TTS)
- Automatic Speech Recognition (ASR, STT)
- Dialog Systems
- Multimodal Mobile Applications
- Resources
© ftw. 2005

Auditory representations
Non-linguistic
- Sound signals
- Music
- Perspectival, spatial cues
Paralinguistic
- Speaker characteristics
- Affective states and attitudes
Linguistic
- Pragmatics and discourse
- Structural prosodic elements
- Lexical semantics and syntax
Technologies placed on the diagram: ASR, Dialog Systems, TTS

TTS Examples
- 16 kHz natural voice
- 16 kHz unit selection synthesis (server-based)
- 8 kHz diphone-based synthesis with lexicon (embedded or distributed)
- 8 kHz diphone-based synthesis without lexicon (embedded)
- Application-specific lexicon: Gerald R. Ford -> tSE-r6ld a:R fo:rd

TTS Evaluation
Categories: Pronunciation, Articulation, Overall Voice, Listening Comprehension
[Figure: comprehension of words (0-100, 95% CI) by source. Sources: Natural, SmartPPCNoLex, SmartPPCLex, MobileSPNoLex, MobileSPLex, CTTS, CorpPPC PCM, CorpPPC GSM, CorpPPC AMR]
[Figure: 95% CI on a 1-5 scale by source]

TTS and Non-Speech Audio
TTS | Feature | Status | Combination with non-speech audio
Comprehensible TTS | Low word error rate | Solved (diphone-based TTS) | TTS provides lexical information; add structural prosodic elements
Natural TTS | Single-style prosody | Solved (unit selection) | TTS provides structural prosodic elements; add affective states and attitudes
Expressive TTS | Various prosodic styles | Not solved? | Add pragmatic information, dialog turns

Limited Expressiveness of Speech 1
Limited expressiveness of expressive TTS = limited expressiveness of speech
Speech is limited in expressiveness because of its unlimited expressiveness
- Because everything is expressible in language, the messages are less useful for certain purposes (too complex)
- Simpler, less expressive codes (sounds, icons) may be used in context and lead to shorter messages
Disadvantages of speech
- Seriality
- Non-universality

Types of ASR and Applications
- Isolated word recognition: command & control
- Large vocabulary speech recognition: broadcast news transcription
- Conversational speech recognition: meeting transcription
- Speech recognition in noisy environments: car navigation
- Speaker dependent or speaker independent

Other Related Technologies
Speech
- Speaker verification
NLP
- Dialog act detection
- Topic detection

Music Information Retrieval (MIR)
Query By Humming (Fraunhofer)
- Non-speech sound as an input pattern to search for other non-speech sounds
- http://www.musicline.de/de/melodiesuche/input
Performer Style Identification
Melody and Rhythm Extraction
Music Similarity
Genre Classification

Dialog Systems - ASR
3 types of recognition in state-of-the-art dialog systems
- Isolated word
    <rule id="exit">
      <one-of>
        <item>exit</item>
        <item>quit</item>
      </one-of>
    </rule>
- Recognition grammar
    <rule id="commands">
      <item repeat="0-1"> move </item>
      <one-of>
        <item>forward</item>
        <item>backward</item>
      </one-of>
    </rule>
- Statistical Language Model (SLM) + grammar for more robustness
    "um ah to san francisco from new york"
    1. Apply SLM
    2. Apply grammar on results of SLM

Dialog Systems - TTS and Audio
Loquendo TTS Mixer
- Play and mix TTS and audio files
- Fade-in, fade-out
- Pause and resume
- Record
Paolo Massimino (Loquendo S.p.A.): From Marked Text to Mixed Speech and Sound

Dialog Management 1
Usages of non-speech audio
- Replace prompts
- Indicate dialog turns and dialog states
- Indicate menu structure (3D audio)
- Create listen & feel of the application
- System response time
Questions
- Barge-in, streaming, and standardization

Dialog Management 2
A good bad example
- Uses only speech
- Audio enhancement for transitions
- Audio enhancement for states
Bob Cooper (Avaya Corporation): A Case Study on the Planned and Actual Use of Auditory Feedback and Audio Cues in the Realization of a Personal Virtual Assistant

Dialog Management 3
VoiceXML Version 2.0
- W3C (World Wide Web Consortium) standard for voice dialog design
- Form-filling paradigm similar to web forms
Speech Synthesis Markup Language (SSML) Version 1.0
    <prosody contour="(0%,+20Hz) (10%,+30%) (40%,+10Hz)">
      good morning
    </prosody>
    <voice gender="female">
      Any female voice here.
      <voice age="6">
        A female child voice here.
      </voice>
    </voice>

Limited Expressiveness of Speech 2
Limited expressiveness of human-machine voice dialog compared to a natural dialog
- Natural dialog is probably multimodal
- Role of non-speech sound in human communication

The Importance of Multimodality for Mobile Applications
Multimodal communication is perceived as natural
Disadvantages of unimodal interfaces for mobile devices
- Small displays
- No comfortable alphanumeric keyboards
- Visual access to the display is not always possible
These disadvantages cannot be overcome by increasing processor and memory capabilities

Multimodal Dialog Management
Speech Application Language Tags (SALT) (http://www.saltforum.org)
- Possible combination with non-speech audio at all states and transitions
- Similar to (unimodal) dialog systems
Minhua Ma, Paul Mc Kevitt (University of Ulster): Lexical Semantics and Auditory Display in Virtual Storytelling

Asymmetric Multimodality
For multiparty applications
- Users select preferred modalities (e.g. speech, visual, music?)
- System is able to translate content from one modality to another
MONA - Mobile Multimodal Next Generation Applications
- Multiuser quiz application
Input | Output preference = Speech | Output preference = Visual
Speech | Speech | Speech-To-Text
Text | Text-To-Speech | Text

Resources
TTS
- Festival 2.0, to build unit selection voices
- Festival Lite, for embedded TTS
- FreeTTS, a Java speech synthesizer
- The Mbrola project, many synthetic voices available
ASR
- Sphinx
- HTK
Multimodal Systems
- SALT implementations

Thank you for your attention
Contact: [email protected]
http://userver.ftw.at/~pucher