Technical Aspects of the CALO Recorder
By Satanjeev Banerjee, Thomas Quisel, Jason Cohen, Arthur Chan, Yitao Sun, David Huggins-Daines, Alex Rudnicky
Role of the CALO recorder
A centralized mechanism to collect all perceptual events.
  Speech, text
CMU provides technology on:
  Event recording
  Speech recognition
Role of the CALO Recorder
One of the four components of CAMPER:
  CALO recorder
  Speechalizer
    End-pointing information
    Prosodic information
    Speech recognition
  CAMSeg
    Speech segmentation
  Understanding
An Architecture Diagram (Client Side)
Diagram components: audio capturing, text capturing through keyboard, other events, end-pointer, ring buffers, VU meter, speech decoder, storage.
Persistence of Data
The Background Intelligent Transfer Service (BITS) is used to transfer data off-line.
Technical Challenges in the Recorder
Threading
Audio buffering
Time synchronization
Real-time processing
  End-pointing
  Speech processing
Portability
Maintenance/distribution
Threading
Several kinds of processing need to run concurrently:
  VU meter
  Speech processing and higher-level understanding
  Graphical user interface
Long development time was invested in making the communication between threads correct (by Thomas Quisel).
See the architecture diagram on the next slides.
Example issue: on some platforms, the WX implementation makes the GUI thread disallow other threads from calling its drawing functions.
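One common way to live with a GUI thread that refuses drawing calls from other threads is to have the workers post requests to a queue that only the GUI thread drains. The following is a minimal Python sketch of that pattern, not the recorder's actual code; the names `worker`, `drain`, and `demo` are hypothetical.

```python
import queue
import threading


def worker(name, levels, requests):
    """Background thread: never draws, only posts requests for the GUI thread."""
    for level in levels:
        requests.put((name, level))


def drain(requests, draw):
    """Run on the GUI thread: execute every pending drawing request."""
    handled = 0
    while True:
        try:
            name, level = requests.get_nowait()
        except queue.Empty:
            return handled
        draw(name, level)
        handled += 1


def demo():
    requests = queue.Queue()          # thread-safe hand-off between threads
    drawn = []
    gui_thread_id = threading.get_ident()

    def draw(name, level):
        # Stand-in for a real drawing call; must only run on the GUI thread.
        assert threading.get_ident() == gui_thread_id
        drawn.append((name, level))

    threads = [
        threading.Thread(target=worker, args=("vu_meter", [0.1, 0.5], requests)),
        threading.Thread(target=worker, args=("decoder", [1.0], requests)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    drain(requests, draw)
    return drawn
```

In a real GUI the `drain` call would be scheduled from the toolkit's idle or timer event rather than called once at the end.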
Audio Buffering
Sphinx 2 and 3.X libaudio require the application to:
  Capture audio
  Do processing on the audio buffer
If the processing thread is even slightly slower than 1xRT, audio will be lost.
A ring buffer structure is implemented (by Jason Cohen).
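A ring buffer decouples the capture thread from the processing thread, so short slowdowns in processing are absorbed instead of losing audio immediately. The sketch below is an illustrative Python version under assumed behavior (overwrite the oldest frame when full and count the loss); the class name `RingBuffer` and its policy are assumptions, not the recorder's implementation.

```python
import threading


class RingBuffer:
    """Fixed-capacity FIFO of audio frames shared by a writer and a reader."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._capacity = capacity
        self._head = 0        # index of the oldest unread frame
        self._count = 0       # number of unread frames
        self._dropped = 0     # frames overwritten before being read
        self._lock = threading.Lock()

    def write(self, frame):
        """Called by the capture thread; never blocks on the reader."""
        with self._lock:
            tail = (self._head + self._count) % self._capacity
            self._slots[tail] = frame
            if self._count == self._capacity:
                # Buffer was full: the oldest frame was just overwritten.
                self._head = (self._head + 1) % self._capacity
                self._dropped += 1
            else:
                self._count += 1

    def read(self):
        """Called by the processing thread; returns None when empty."""
        with self._lock:
            if self._count == 0:
                return None
            frame = self._slots[self._head]
            self._head = (self._head + 1) % self._capacity
            self._count -= 1
            return frame

    @property
    def dropped(self):
        with self._lock:
            return self._dropped
```

The `dropped` counter makes sustained slower-than-real-time processing visible: the buffer only hides temporary lag, so a steadily growing count means the reader cannot keep up.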
Time Synchronization
By David Huggins-Daines
Simple NTP (SNTP) is used to get Coordinated Universal Time (UTC) from an arbitrary NTP server.
  Clone of the standard NTP implementation
Internal synchronization
  Synchronization time between machines: 50-60 ms
The major challenge is the delay imposed by the OS and the audio capturing software.
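For reference, the standard SNTP arithmetic derives the clock offset and round-trip delay from four timestamps: client send (t1), server receive (t2), server send (t3), client receive (t4). The Python sketch below shows only that arithmetic; the function names are hypothetical and it does not reproduce the recorder's own SNTP clone.

```python
# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_EPOCH_DELTA = 2208988800


def ntp_to_unix(seconds, fraction):
    """Convert a 64-bit NTP timestamp (32-bit seconds, 32-bit fraction) to Unix time."""
    return seconds - NTP_UNIX_EPOCH_DELTA + fraction / 2 ** 32


def offset_and_delay(t1, t2, t3, t4):
    """Return (clock offset, round-trip delay) in seconds.

    t1: client clock when the request was sent
    t2: server clock when the request arrived
    t3: server clock when the reply was sent
    t4: client clock when the reply arrived
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay
```

The offset formula assumes the network delay is symmetric in both directions. Delays added by the OS or the audio capture path are outside what these four timestamps can measure.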
Real-time Processing
Role of end-pointing and recognition
  After a long debate, a two-stage end-pointing and recognition architecture was chosen.
By Ziad
  A high-performance end-pointing routine was created.
  Gaussian Mixture Model-based end-pointer, implemented as a frame voter within segments
  The parameters were further tuned manually.
  Speed optimized.
  Now in s3ep, a customized version of Sphinx
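To make "GMM-based end-pointer implemented as a frame voter within segments" concrete, here is a small Python sketch. It is an illustration only: the one-dimensional feature, the fixed-length segments, the majority threshold, and every model parameter are assumptions, and s3ep itself may work differently.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def gmm_loglik(x, components):
    """Log-likelihood of x under a mixture given as (weight, mean, variance) tuples."""
    terms = [math.log(w) + log_gauss(x, m, v) for w, m, v in components]
    peak = max(terms)
    # Log-sum-exp keeps the computation numerically stable.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def frame_votes(features, speech_gmm, silence_gmm):
    """One vote per frame: True when the speech model is more likely."""
    return [gmm_loglik(x, speech_gmm) > gmm_loglik(x, silence_gmm)
            for x in features]


def vote_segments(votes, segment_len, threshold=0.5):
    """Label each fixed-length segment as speech if enough frames voted speech."""
    decisions = []
    for start in range(0, len(votes), segment_len):
        segment = votes[start:start + segment_len]
        decisions.append(sum(segment) / len(segment) > threshold)
    return decisions


# Hypothetical models over a single log-energy-like feature.
SPEECH_GMM = [(0.5, 10.0, 4.0), (0.5, 14.0, 4.0)]
SILENCE_GMM = [(1.0, 2.0, 1.0)]
```

Voting over a segment smooths out isolated misclassified frames, which is one plausible reason to vote rather than switch state on every frame.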
Speech Recognizer
The resulting output is fed to the recognizer.
Speech recognition in meetings
  Regarded as one of the biggest challenges in the field
  Results vary widely with meeting style, number of attendees, topics, and speaker disfluencies.
Accuracy performance is still under heavy work. Currently, in the cleanest meeting (Bdb001):
  One very dominant male speaker
  One very dominant female speaker
  Speaker speaking-rate entropy is lowest
  Error rate: 29.4%
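Assuming the error rate quoted here is a word error rate (the slides do not say so explicitly), it would be computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal Python sketch, with the function name `word_error_rate` chosen for illustration:

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Because insertions count as errors, this measure can exceed 100% on very poor output.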
Phase IV of Accuracy Improvement (Core)
Boosting-based training
Confidence-based N-best re-ranking
Speaker adaptation based on transformation
Speaker normalization
Include BN and SWB material in LM training
Dictionary refinement
Phase IV of Accuracy Improvement (Optional)
STC
MLLT
DT
PLP, TRAP
LM with disfluencies and back-channeling
Speed
2.2 GHz machine

Task          System      Error rate  Speed
Communicator  S2          17.3%       0.34xRT
Communicator  S3.X BL     11.8%       4xRT
Communicator  S3.X Tuned  12.8%       0.87xRT
WSJ 5k        S3.X BL     7.4%        1.61xRT
WSJ 5k        S3.X BL     8.3%        0.5xRT

ICSI: with tuning of SVQ and CIGMMS, 0.7xRT is achieved.
We may be able to tune the results further. Benchmarking results need time to prepare.
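The xRT figures are real-time factors: processing time divided by the duration of the audio processed, so values below 1 mean the decoder runs faster than real time. A short Python sketch of that arithmetic follows; the function names are hypothetical.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return the xRT figure: time spent decoding per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def keeps_up_with_live_audio(rtf):
    """A live recorder can only keep pace when the decoder runs at or below 1xRT."""
    return rtf <= 1.0
```

For example, decoding 100 seconds of audio in 34 seconds corresponds to 0.34xRT, while a 4xRT configuration would fall steadily behind a live audio stream.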
Maintenance and Distribution
All code is in a local CVS repository.
  C, Java
  Will soon move to SRI
Regular releases are created; use of SRI’s CVS will blur this line.
Conclusion
Engineering work on the recorder is mostly done.
It is time to improve individual components.
Everyone is welcome to join the effort.