Sound and Speech Recognition
© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann, Carnegie Mellon

What is Sound?
• Acoustics is the study of sound.
• Physical: sound as a disturbance in the air.
• Psychophysical: sound as perceived by the ear.
• Sound as a stimulus (physical event) and sound as a sensation.
• Pressure changes in the band from 20 Hz to 20 kHz.
• Physical terms: amplitude, frequency, spectrum.

Sound Waves
• In a free field, an ideal source of acoustical energy sends out sound of uniform intensity in all directions => sound propagates as a spherical wave.
• Sound intensity is inversely proportional to the square of the distance; sound pressure level drops by 6 dB for each doubling of the distance (the inverse distance law).

How We Hear
• The ear is connected to the brain (left brain: speech, right brain: music).
• The ear's sensitivity to frequency is logarithmic.
• Varying frequency response.
• Dynamic range is about 120 dB (at 3-4 kHz).
• Frequency discrimination of about 2 Hz (at 1 kHz).
• An intensity change of 1 dB can be detected.

Digitizing Sound (figure slides: digital sampling, undersampling, clipping, quantization)

Digital Sampling
• Sampling is governed by the Nyquist sampling theorem, which states how quickly samples must be taken to ensure an accurate representation of the analog signal: $f_s \ge 2f$, or equivalently $T_s \le T/2$.
• That is, the sampling frequency must be at least twice the highest frequency in the original analog signal.

Dithering a Sampled Signal
• A low-level analog noise signal is added to the signal to remove the artifacts of quantization error.
• Dither causes the audio signal to always move between quantization levels.
• Without it, a low-level signal would be encoded as a square wave => granulation noise.
• With dither, the A/D converter output is signal + noise, which is perceptually preferred, since noise is better tolerated than distortion.
• Amplitude of the dither signal: higher dither amplitudes remove quantization artifacts more easily, but too much dither decreases the signal-to-noise ratio.

Common Sound Sampling Parameters
• Common sampling rates: 8 kHz (phone), 8.012820513 kHz (phone, NeXT), 11.025 kHz (1/4 CD standard), 16 kHz (G.722 standard), 22.05 kHz (1/2 CD standard), 44.1 kHz (CD, DAT), 48 kHz (DAT).
• Bits per sample: 8 or 16.
• Number of channels: mono, stereo, quad, etc.
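The slides above describe sampling, quantization, and dithering in words only. As a minimal illustration (not part of the original deck), the NumPy sketch below samples a sine tone, optionally adds a small amount of dither, and quantizes it to a given bit depth; the function name, signal frequency, sample rate, and dither amplitude are arbitrary example choices.

```python
import numpy as np

def sample_and_quantize(freq_hz=440.0, fs=16000, bits=16, dither=True):
    """Sample a sine tone at fs Hz and quantize it to the given bit depth."""
    # Nyquist check: the sampling rate must be at least twice the signal frequency.
    assert fs >= 2 * freq_hz, "fs must be at least twice the signal frequency"
    t = np.arange(0, 0.01, 1.0 / fs)           # 10 ms of signal
    x = np.sin(2 * np.pi * freq_hz * t)        # "analog" signal in [-1, 1]
    levels = 2 ** (bits - 1)                   # quantization levels per polarity
    if dither:
        # A fraction of one LSB of noise keeps low-level signals moving
        # between quantization steps instead of freezing into a square wave.
        x = x + np.random.uniform(-0.5, 0.5, size=x.shape) / levels
    q = np.round(np.clip(x, -1.0, 1.0) * (levels - 1)).astype(np.int32)
    return q

samples = sample_and_quantize()
print(samples[:8])
```

Lowering `bits` to 4 or 6 and toggling `dither` makes the granulation effect described above easy to hear when the samples are written out and played back.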
Audio Data Rates

Quality                 | Format                      | Transfer rate | Disk space, 1 hour | Disk space, 100,000 hours
Netcasting              | RealAudio                   | 20 kbit/s     | 8.8 MByte          | 0.9 TByte
Preview                 | RealAudio                   | 80 kbit/s     | 35.2 MByte         | 3.5 TByte
Preview                 | MPEG Layer 3 (MP3)          | 192 kbit/s    | 84.4 MByte         | 8.4 TByte
Broadcasting or Editing | MPEG Layer 2                | 384 kbit/s    | 168.8 MByte        | 16.9 TByte
Archive                 | Waveform (uncompressed PCM) | 1538 kbit/s   | 675.9 MByte        | 67.6 TByte

Space/Storage Requirements (1 Minute of Sound)

Sampling rate | Mono, 8 bit | Mono, 16 bit | Stereo, 8 bit | Stereo, 16 bit
44.1 kHz      | 2646 kB     | 5292 kB      | 5292 kB       | 10584 kB
22.05 kHz     | 1323 kB     | 2646 kB      | 2646 kB       | 5292 kB
11.025 kHz    | 661.5 kB    | 1323 kB      | 1323 kB       | 2646 kB
8 kHz         | 480 kB      | 960 kB       | 960 kB        | 1920 kB

Many (!) Sound File Formats
• Mulaw (Sun, NeXT): .au
• RIFF (Resource Interchange File Format): MS .WAV and .AVI
• MPEG Audio Layer (MPEG): .mpa, .mp3
• AIFC (Apple, SGI): .aiff, .aif
• HCOM (Mac): .hcom
• SND (Sun, NeXT): .snd
• VOC (SoundBlaster card proprietary standard): .voc
• AND MANY OTHERS!

What's in a Sound File Format
• Header information: magic cookie, sampling rate, bits/sample, channels, byte order (endianness), compression type.
• Data.

Example File Format (NIST SPHERE)
NIST_1A
1024
sample_rate -i 16000
channel_count -i 1
sample_n_bytes -i 2
sample_byte_format -s2 10
sample_sig_bits -i 16
sample_count -i 594400
sample_coding -s3 pcm
sample_checksum -i 20129
end_head

WAV file format (Microsoft)
• RIFF: a collection of data chunks. Each chunk has a 32-bit ID, followed by a 32-bit chunk length, followed by the chunk data.
• 0x00  chunk id 'RIFF'
• 0x04  chunk size (32 bits)
• 0x08  wave chunk id 'WAVE'
• 0x0C  format chunk id 'fmt '
• 0x10  format chunk size (32 bits)
• 0x14  format tag (currently PCM)
• 0x16  number of channels (1 = mono, 2 = stereo)
• 0x18  sample rate in Hz
• 0x1C  average bytes per second
• 0x20  number of bytes per sample (1 = 8-bit mono, 2 = 8-bit stereo or 16-bit mono, 4 = 16-bit stereo)
• 0x22  number of bits in a sample
• 0x24  data chunk id 'data'
• 0x28  length of data chunk (32 bits)
• 0x2C  sample data
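As a quick sketch of the chunk layout above (not from the original slides), the following Python snippet unpacks the canonical 44-byte PCM WAV header with the standard-library `struct` module. It assumes the simple layout shown on the slide, a single 'fmt ' chunk followed immediately by the 'data' chunk; real-world files can carry extra chunks, and the file name in the usage comment is hypothetical.

```python
import struct

def read_wav_header(path):
    """Parse the canonical 44-byte PCM WAV header laid out on the slide above."""
    with open(path, "rb") as f:
        header = f.read(44)
    riff, riff_size, wave = struct.unpack("<4sI4s", header[0:12])
    fmt_id, fmt_size, audio_fmt, channels, sample_rate, byte_rate, block_align, bits = \
        struct.unpack("<4sIHHIIHH", header[12:36])
    data_id, data_size = struct.unpack("<4sI", header[36:44])
    assert riff == b"RIFF" and wave == b"WAVE" and fmt_id == b"fmt " and data_id == b"data"
    assert audio_fmt == 1  # 1 = uncompressed PCM
    return {
        "channels": channels,          # 1 = mono, 2 = stereo
        "sample_rate": sample_rate,    # in Hz
        "byte_rate": byte_rate,        # average bytes per second
        "block_align": block_align,    # bytes per sample frame
        "bits_per_sample": bits,
        "data_bytes": data_size,
    }

# Example usage (hypothetical file name):
# print(read_wav_header("example.wav"))
```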
Digital Audio Today
• Analog elements in the audio chain are being replaced with digital elements.
• 16-bit word length; 32/44.1/48 kHz sampling rates.
• Mostly linear signal processing.
• Wide range of digital formats and storage media.
• Rapid development of technology => better SNR, phase response and linearity.
• Rapid increase in signal processing power => possibility to implement new, complex features.
• Soon: digital radio (satellite), HDTV.

Digital (CD) vs Analog (LP or cassette tape)
• Information is stored digitally: the lengths of the data pits represent a series of 1s and 0s.
• Both audio channels are stored along the same pit track.
• Data is read using a laser beam.
• Information density is about 100 times greater than on an LP.
• A CD player can correct disc errors.

Benefits of Digital Representation (CD)
• Robust: no degradation from repeated playings, because data is read by the laser beam.
• Error correction.
• The transport's performance does not affect the quality of audio reproduction.
• Digital circuitry is more immune to aging and temperature problems.
• Data conversion is independent of variations in disc rotational speed, hence wow and flutter are negligible.
• SNR over 90 dB.
• Subcode for display, control and user information.

CD Format
• Sampling at 44.1 kHz => 10% margin with respect to the Nyquist frequency (audible frequencies are below 20 kHz).
• 16-bit linear quantization => theoretical SNR of about 98 dB (for a sinusoidal signal at maximum amplitude).
• Audio bit rate 1.41 Mbit/s (44.1 kHz x 16 bits x 2 channels).
• Cross-Interleaved Reed-Solomon Code (CIRC) for error correction.
• Subcode.
• Original specifications: playing time max. 74.7 min; disc diameter 120 mm; disc thickness 1.2 mm; one-sided medium, rotates clockwise; signal is recorded from the inside to the outside; a pit is about 0.5 µm wide; a pit edge is read as a 1, and all other areas, whether inside or outside a pit, are 0s.

Speech Recognition in Brief

Acoustic Origins
• A waveform for the words "speech lab", segmented as: s p ee ch l a b, with a close-up of the "l" to "a" transition (figures).
• Graphs from Simon Arnfield's web tutorial on speech, Sheffield: http://lethe.leeds.ac.uk/research/cogn/speech/tutorial/

Speech Recognition Knowledge Sources
• Acoustic modeling: describes the sounds that make up speech.
• Lexicon: describes which sequences of speech sounds make up valid words.
• Language model: describes the likelihood of various sequences of words being spoken.

Speech Recognition: The Fundamental Equation
• O is an acoustic 'observation'; W is the word sequence we are trying to recognize.
• Maximize: $\hat{W} = \arg\max_W P(W \mid O)$
• $P(W \mid O)$ is unknown, so by Bayes' rule: $P(W \mid O) = \dfrac{P(O \mid W)\,P(W)}{P(O)}$

Mechanism of State-of-the-Art Speech Recognizers
• Speech in → acoustic analysis → feature vectors $x_1 \ldots x_T$ → recognition → recognized sentence.
• Recognition: maximize $P(x_1 \ldots x_T \mid w_1 \ldots w_k) \cdot P(w_1 \ldots w_k)$.
• $P(x_1 \ldots x_T \mid w_1 \ldots w_k)$ comes from the acoustic models and the pronunciation lexicon; $P(w_1 \ldots w_k)$ comes from the language model.

Acoustic Sampling
• 10 ms frame step (ms = millisecond = 1/1000 second).
• ~25 ms window around each frame to smooth signal processing.
• Result: acoustic feature vectors $a_1, a_2, a_3, \ldots$

Spectral Analysis
• Frequency gives pitch; amplitude gives volume.
• Sampling at ~8 kHz for phone, ~16 kHz for microphone (kHz = 1000 cycles/sec).
• A Fourier transform of the wave yields a spectrogram (shown for "s p ee ch l a b"): darkness indicates energy at each frequency, with hundreds to thousands of frequency samples.

Features for Speech Recognition
• Typical coding scheme: 10 millisecond step size; 25 millisecond window.
• ~39 coefficients per step: mel-scale cepstra derived from the frequency representation, their delta (and delta-delta) coefficients, and power.

The Markov Assumption
• Only the immediately preceding history matters.
• Chain rule: $P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$
• Markov assumption: $P(X_i \mid X_1, \ldots, X_{i-1}) = P(X_i \mid X_{i-1})$
• Therefore: $P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_{i-1})$
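A tiny Python sketch of the Markov assumption in action: the joint probability of a word sequence collapses into a product of bigram terms. The transition table and the probabilities in it are invented toy values for illustration only, not data from the slides.

```python
# Toy first-order Markov chain over a tiny vocabulary.
# "<s>" marks the start of a sequence; all probabilities are made up.
transitions = {
    ("<s>", "speech"): 0.6, ("<s>", "sound"): 0.4,
    ("speech", "lab"): 0.7, ("speech", "recognition"): 0.3,
    ("sound", "lab"): 0.2,  ("sound", "recognition"): 0.8,
}

def sequence_probability(words):
    """P(w1..wn) = product over i of P(wi | wi-1), per the Markov assumption."""
    p = 1.0
    prev = "<s>"
    for w in words:
        p *= transitions.get((prev, w), 0.0)  # unseen bigrams get probability 0
        prev = w
    return p

print(sequence_probability(["speech", "lab"]))         # 0.6 * 0.7 = 0.42
print(sequence_probability(["sound", "recognition"]))  # 0.4 * 0.8 = 0.32
```

The zero assigned to unseen bigrams is exactly the problem that the smoothing mentioned in the language-model slides below is meant to fix.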
Hidden Markov Models
• In speech recognition the number of states is very large; we can simplify the problem by factoring it into two components: a state transition probability $p(s_2 \mid s_1)$ and an observation probability $q(y_1 \mid s_2, s_1)$ (illustrated with a chain of states $S_1 \to S_2 \to S_3$).

Searching the Speech Signal Trellis (figure)

Lexicon: links words to phones in the acoustic model
Aaron        EH R AX N
Aaron(2)     AE R AX N
abandon      AX B AE N D AX N
abandoned    AX B AE N D AX N DD
abandoning   AX B AE N D AX N IX NG
abandonment  AX B AE N D AX N M AX N TD
abated       AX B EY DX IX DD
abatement    AX B EY TD M AX N TD
abbey        AE B IY
Abbott       AE B AX TD
Abboud       AA B UW DD
abby         AE B IY
abducted     AE BD D AH KD T IX DD
Abdul        AE BD D UW L

When Language Modeling Goes Wrong (example figures)
When P(W) Is Incorrect (example figures)

Language Modeling

Language Models
• A language model is a probability distribution over word sequences: $p(W) = p(w_1, \ldots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_0, \ldots, w_{i-1})$
• In practice the context is truncated to n-grams with n = 3, 4, 5 (the rest of the context is lost).
• Large contexts are hard to estimate: consider 64,000^3 possible trigrams.
• Need large collections of text.
• Smoothing of $P(w_i \mid w_{i-2}, w_{i-1})$ is necessary.

Creating Models for Recognition
• Speech data → transcribe → train acoustic models.
• Text data → train language models.

Continual Progress in Speech Recognition
• Increasingly difficult tasks, steadily declining error rates.
• (Chart: word error rate (%) from 1988 to 1998 for read speech with 1,000-, 5,000-, and 20,000-word vocabularies, broadcast news with unlimited vocabulary, and conversational speech, English and non-English, across standard and varied microphones and noisy environments; all results are speaker-independent. Source: NSA/Wayne/Doddington.)

References
• Speech recognition resource links: http://svr-www.eng.cam.ac.uk/comp.speech/Section2/speechlinks.html
• An excellent tutorial on speech recognition by Wayne Ward: http://www-2.cs.cmu.edu/~roni/11761-s01/Presentations/whw%20hmm's%20in%20speech%20recognition%203.0.pdf

Sound + Speech Recognition: that's all for today.