Using Speech Recognition

Part 4 Commercial Systems used with Young Children

Commercial systems

• Key players (PC based)
  – ScanSoft (Dragon)
  – IBM (ViaVoice), distributed by ScanSoft
  – Microsoft (built into Windows XP)

Operating modes

• Dictation
  – Used to perform speech to text
  – Vocabulary in the order of 60,000 words
• Command and Control
  – Used to control application (menus etc.)
  – Need a context free grammar
  – = (((send mail to)|call)(Fred|Jim))|(Exit)
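To make the command and control grammar concrete, here is a minimal Python sketch that enumerates the phrases such a grammar accepts. It assumes the slide's grammar is read as (((send mail to) | call) (Fred | Jim)) | (Exit); the names `COMMAND_SLOTS`, `accepted_phrases` and `is_in_grammar` are illustrative only and are not part of any speech SDK.

```python
from itertools import product

# Assumed reading of the slide's grammar:
#   (((send mail to) | call) (Fred | Jim)) | (Exit)
# Each inner list holds the alternatives for one position in the command.
COMMAND_SLOTS = [["send mail to", "call"], ["Fred", "Jim"]]
STANDALONE_COMMANDS = ["Exit"]


def accepted_phrases():
    """Enumerate every phrase the grammar accepts."""
    phrases = [" ".join(parts) for parts in product(*COMMAND_SLOTS)]
    return phrases + STANDALONE_COMMANDS


def is_in_grammar(utterance):
    """Case-insensitive, whitespace-tolerant membership test."""
    normalised = " ".join(utterance.lower().split())
    return normalised in {phrase.lower() for phrase in accepted_phrases()}


if __name__ == "__main__":
    for phrase in accepted_phrases():
        print(phrase)
```

The grammar accepts only five phrases, which is why command and control is so much more constrained than dictation.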

System overview

• Write app and call ASR through COM using SDK

[Diagram: Sound card, 101001010, Speech recognition engine, Custom application, “HELLO”]

Some research questions

• Is commercial speech rec usable with young children?

• How accurate is it when used with children?

• Can inaccuracies be reduced?

• Is it possible to integrate speech recognition into an effective interface for young children?

• Current systems are designed for adults
  – Dragon English
  – Microsoft American
• Accuracy tests have shown recognition accuracy of 95% with adults but less than 5% with young children
• Training improves this and accuracies can be as high as 75%

Out of the box accuracy

After training

Training

• User training
  – Speak clearly, at a steady pace
  – Don’t fiddle with mic
• Engine training
  – Need to read provided text
  – Children who cannot yet read need the text whispered to them
  – Takes about 20 to 30 minutes

Example Application

• Vocabulary tutor block diagram

[Block diagram: Speech recognition engine, Context free grammar, Speech, Recognised text, Text file of reference words, Vocabulary, Speech files, User feedback]

Difficulties with children

• Access
• Fiddle and fidget
• Missing teeth
• Don’t sound words out properly
• Higher formant frequencies
• Attention span
• Background environmental noise
• Out of turn speech (talking to friends or self)
• Engine trained with adult speech using adult text for language model

Optimisation

• Train the engine
• Don’t use dictation; use command and control (C & C)
  – Children don’t know 60,000 words, so there is no point
  – Combinations like “Floppy said woof” have low probability
• Create limited CFG
• Avoid similar sounding words in same CFG
  – E.g. “read” and “red”: phonetically the same, no context

• Use close-talk noise cancelling mic
• Keep room quiet if possible (not practical)
• Don’t try to respond to all speech
  – Some apps train to teacher’s voice
  – I used dialogue window (simple & effective)
  – Provide visual cues when user expected to speak
• Optimise confidence threshold
  – Adjust level at which hypothesis is accepted
  – Can have three independent levels

Problems with homophones

• Problem with CFG as no context
  – “Read” as in “He read the story”, or the word “red”?

• Could solve with CFG but costly
  = |
  = red book
  = I read book
  = the | a
  – “Read” or “red” depends on whether the word is preceded by “I”, or by either of the words “the” or “a”
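The idea of using the preceding word to choose between the homophones can be sketched outside the grammar as a simple lookup. This is only an illustration of the rule stated on the slide; `CONTEXT_RULES` and `resolve_homophone` are hypothetical names, and a real engine would express this inside its grammar rather than in application code.

```python
# Hypothetical lookup: the spelling to choose for the ambiguous sound,
# keyed by the word that precedes it (as described on the slide).
CONTEXT_RULES = {
    "i": "read",    # "I read book"
    "the": "red",   # "the red book"
    "a": "red",     # "a red book"
}


def resolve_homophone(previous_word):
    """Return "read" or "red", or None when there is no usable context."""
    if previous_word is None:
        return None
    return CONTEXT_RULES.get(previous_word.lower())
```

With no preceding word the function returns None, which mirrors the original problem: a single isolated word gives the engine nothing to disambiguate with.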

User interface

• Provide feedback to eliminate misrecognitions (did you say?)
• Use avatar for visual feedback
• Use backchanneling to avoid deadlock
  – Very tricky to make natural
• Use high threshold setting for yes and no
• Use low threshold setting for other words
  – Found to be about 20 for Lancs
• Barge-in during instructions and feedback
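The split between a high threshold for yes and no and a low threshold for other words could look something like the sketch below. The value 20 comes from the slide; the high value of 80 and the assumption that confidence is a number on a 0 to 100 scale are illustrative guesses, as are all the names used.

```python
DEFAULT_THRESHOLD = 20        # low setting for ordinary words (from the slide)
HIGH_THRESHOLD = 80           # hypothetical high setting for yes/no
STRICT_WORDS = {"yes", "no"}  # words whose misrecognition is most costly


def threshold_for(word):
    """Pick the confidence threshold that applies to this word."""
    return HIGH_THRESHOLD if word.lower() in STRICT_WORDS else DEFAULT_THRESHOLD


def accept(word, confidence):
    """True if the hypothesis is accepted; False means it is rejected."""
    return confidence >= threshold_for(word)
```

A stricter setting for yes and no makes sense because those words confirm or cancel earlier input, so a false acceptance there compounds any earlier error.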

Further enhancements

• Groups of boys and girls sound similar
  – Train for groups to speed up training
• Confidence level very important
  – Investigate the feasibility of further recognition optimisation by dynamically adjusting the confidence threshold setting on a per-user basis
• Identify sets of words which the recognition system struggles to differentiate when present in the same dictionary

• Modify the recognition engine’s lexicon in an attempt to improve recognition effectiveness for children with regional accents
• Improve the acknowledgement handshaking mechanism to avoid the child having to confirm every word
  – Request the confidence level from the engine
  – Only confirm if it is below a preset confidence level
• Investigate the effectiveness of speech recognition in the classroom using a desk mounted microphone array

• Collect the n-best recognition matches to determine whether the expected response is in the list and if it is sufficiently near the top of the list, use it for confirmation rather than automatically choosing the recognition with the highest probability
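The n-best idea in the last bullet can be sketched as follows. This is a minimal illustration, assuming the engine returns its hypotheses as a list ordered from most to least probable; `pick_candidate` and the `max_rank` cut-off of 3 are hypothetical choices, not values from the slides.

```python
def pick_candidate(n_best, expected, max_rank=3):
    """Choose the word to put forward for confirmation.

    n_best   -- hypotheses ordered from most to least probable
    expected -- the response the application is expecting
    max_rank -- how near the top the expected word must be (assumed value)
    """
    if not n_best:
        return None
    near_top = [word.lower() for word in n_best[:max_rank]]
    if expected.lower() in near_top:
        # The expected response is near enough to the top: prefer it.
        return expected
    # Otherwise fall back to the most probable hypothesis.
    return n_best[0]
```

For example, with hypotheses `["bad", "bat", "bed"]` and an expected response of `"bed"`, the sketch offers "bed" for confirmation instead of "bad".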

Measuring effectiveness

• How is accuracy measured?

– Read paper: “Improving speech recognition in a listening interface for young children”

Speech guidelines

• Avoid unconstrained speech input if possible
  – For example, only provide the user with the opportunity to give single word answers or answers which are predictable.
• Utilise user confirmation if the recognition accuracy needs to be very high
  – For example, if the recognition confidence is below an acceptable level, repeat what the user said back to them and ask them to confirm it.
• Use multimodal input
  – Speech recognition is unreliable, so only use it where the more reliable input modes cannot be used.
• Use backchanneling
  – This technique not only naturalises the interface but can also be used strategically in situations such as dialogue deadlock avoidance. However, the implementation of sophisticated backchanneling is still an ongoing research area.
• Select appropriate confidence threshold levels or make them configurable
  – Research reported in this thesis has shown that the confidence threshold setting can have a significant impact on the effectiveness of the speech input.

• Train the engine using a child’s voice, preferably the user’s voice
  – Research reported in this thesis has shown that training the speech recognition system is essential when it is used with young children.
• Enable the user to recover from misrecognitions and rejections
  – If the user is not given the opportunity to correct misrecognitions they will become frustrated and lose faith in the system.
• Provide an effective turn-taking feedback mechanism for dialogue synchronisation
  – For example, press the spacebar to speak. This uses a reliable input mode and leaves the speech input disabled when spoken input is not expected; this reduces many out-of-turn speech and background noise problems.
• Use natural variation in spoken output
  – Concatenate appropriate speech phrases to avoid the system sounding monotonous.
• Output human speech, preferably a child’s voice
  – Research reported in this thesis has shown that children prefer the non-dominant voice of a child to that of an adult. Teachers prefer natural speech for young users as it is clearer and uses the correct pronunciation and intonation.
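The turn-taking guideline above (press the spacebar to speak) can be sketched as a small gate that discards recognition results unless the user's turn is open. This is an illustrative sketch only: `TurnGate` and its methods are hypothetical names, and wiring them to real key events and a real recognition engine is left out.

```python
class TurnGate:
    """Passes recognition results through only while the user's turn is open."""

    def __init__(self):
        self.listening = False

    def key_down(self):
        """Spacebar pressed: the user's turn begins."""
        self.listening = True

    def key_up(self):
        """Spacebar released: the user's turn ends."""
        self.listening = False

    def on_recognition(self, text):
        """Return the recognised text, or None if it arrived out of turn."""
        return text if self.listening else None


if __name__ == "__main__":
    gate = TurnGate()
    print(gate.on_recognition("hello"))  # out of turn, so ignored
    gate.key_down()
    print(gate.on_recognition("hello"))  # in turn, so passed through
```

Because the gate is driven by a key press rather than by speech, out-of-turn chatter and background noise never reach the application while the gate is closed.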