STUDY OF SPEAKER RECOGNITION

CEERI PILANI
NIKITA KRISHNIA
(08EBKEC035)
OUTLINE
• Introduction
• Principle of Speaker Recognition
• Speech Feature Extraction
• Mel-frequency Cepstrum coefficients processor
• Feature Matching
• LBG Algorithm
• Applications
• Conclusion
INTRODUCTION
• Speaker recognition is the process of automatically
recognizing who is speaking on the basis of individual
information included in the input speech waves.
• This technique makes it possible to use the speaker’s
voice to verify their identity and control access to
services such as voice dialing, voice mail, telephone
shopping, security control for confidential information
areas and many more.
OBJECTIVE
• To extract, characterize and recognize the information
about speaker identity.
• It consists of comparing a speech signal from an
unknown speaker to a set of stored data of known
speakers. A system that has been trained with a number
of speakers can recognize which of them is speaking. This
process determines who has spoken by matching the input
signal with pre-stored samples.
Principles of Speaker Recognition
Human speech contains numerous discriminative features
that can be used to identify speakers. Speech contains
significant energy from zero frequency up to around 5 kHz. The
speech signal is a slowly time-varying signal, but when
examined over a sufficiently short period of time, its
characteristics are fairly stationary. Therefore, short-time
spectral analysis is the most common way to characterize the
speech signal.
[Figures: speech signal; a short section of the speech signal]
Speaker recognition methods can be divided into
• text-independent
• text-dependent
In a text-independent system, the task is to identify the person
who speaks irrespective of what is being said, whereas in a
text-dependent system, the recognition of the speaker’s
identity is based on his or her speaking one or more
specific phrases, such as passwords or PIN codes.
Here we describe a text-independent speaker
identification system.
• Speaker recognition covers two tasks: identification and verification.
Speaker identification is the process of determining which
registered speaker provides a given utterance. Verification, on the
other hand, is the process of accepting or rejecting the identity
claim of a speaker.
• Speaker recognition systems contain two main modules:
1. Feature extraction
2. Feature matching
• Feature extraction is the process that extracts a small amount
of data from the voice signal that can later be used to
represent each speaker.
• Feature matching involves the actual procedure to identify the
unknown speaker by comparing extracted features from
his/her voice input with the ones from a set of known
speakers.
All speaker recognition systems have two
distinct phases:
• Enrolment or training phase
It is the process of familiarizing the system with the voice characteristics
of the speakers being registered, so that the system can build reference
models for those speakers.
Input speech → feature extraction → generate reference model
• Operational or testing phase
Testing is the actual recognition task. In this phase, the input speech is
matched with the stored reference models and a recognition decision is
made.
Test speech → feature extraction → comparison (against stored reference models) → decision
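The decision step at the end of the testing phase differs between identification and verification. The snippet below is only an illustrative sketch: the speaker names, distance values and threshold are made-up assumptions, and it assumes that a smaller distance to a reference model means a better match.

```python
def identify(distances):
    """Identification: pick the registered speaker whose model is closest."""
    return min(distances, key=distances.get)


def verify(distances, claimed_speaker, threshold):
    """Verification: accept the claim only if the distance to the
    claimed speaker's model does not exceed the threshold."""
    return distances[claimed_speaker] <= threshold


# Hypothetical distances between one test utterance and each reference model.
distances = {"alice": 4.2, "bob": 1.3, "carol": 2.9}

print(identify(distances))              # name of the closest registered speaker
print(verify(distances, "carol", 2.0))  # is the claim "I am carol" accepted?
```

Identification always returns one of the registered speakers, while verification answers yes or no for a single claimed identity.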
Speech Feature extraction
It is the signal-processing front end:
Here the sampled speech signal is converted into a set of
feature vectors which characterize the properties of speech
that can separate different speakers. This is performed in both
the training and testing phases.
Here, the parametric representation of the speech signal is done
using Mel-frequency Cepstrum Coefficients (MFCC).
MFCC is based on the human peripheral auditory system.
This technique uses two types of filters, linearly spaced
filters and logarithmically spaced filters, to capture the
important characteristics of speech. This is expressed in
the mel-frequency scale (linear frequency spacing below
1000 Hz and logarithmic spacing above 1000 Hz).
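The shape of the mel scale can be illustrated with a short sketch. The conversion formula used here, mel = 2595 · log10(1 + f / 700), is a commonly used approximation and is an assumption on our part; the slides do not state which formula was used.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (common approximation, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal steps in Hz become smaller and smaller steps in mels above about 1000 Hz.
for f in (250, 500, 1000, 2000, 4000):
    print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")
```

With this approximation, 1000 Hz maps to roughly 1000 mels, and higher frequencies are increasingly compressed.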
Mel-frequency cepstrum coefficients processor
The main purpose of the MFCC processor is to mimic the behaviour of
the human ear. The input speech signal is sampled, and the sampling
frequency is chosen to minimize the effects of aliasing in the
analog-to-digital conversion.
Block diagram of MFCC processor
• Framing
In this step the continuous speech signal is blocked into frames of N samples,
with adjacent frames being separated by M samples (M < N), so that
consecutive frames overlap by N − M samples.
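Frame blocking can be sketched as follows. The function name and the toy values (N = 4, M = 2) are assumptions chosen for illustration only; dropping the trailing samples that do not fill a whole frame is also a simplifying assumption.

```python
def frame_signal(signal, frame_len, hop):
    """Split `signal` into frames of `frame_len` samples (N),
    with a new frame starting every `hop` samples (M, with M < N).
    Trailing samples that do not fill a whole frame are dropped."""
    if not 0 < hop < frame_len:
        raise ValueError("expected 0 < hop < frame_len")
    last_start = len(signal) - frame_len
    return [signal[start:start + frame_len]
            for start in range(0, last_start + 1, hop)]


samples = list(range(10))          # stand-in for sampled speech
frames = frame_signal(samples, frame_len=4, hop=2)
for frame in frames:
    print(frame)
```

Each frame shares its last N − M samples with the start of the next frame.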
• Windowing
We window each individual frame so as to minimize the signal discontinuities at
the beginning and end of each frame. The concept here is to minimize the
spectral distortion by using the window to taper the signal towards zero at the
beginning and end of each frame.
A Hamming window is used, which has the form:
w(n) = 0.54 − 0.46 cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1
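A minimal sketch of the windowing step, assuming the standard Hamming definition w(n) = 0.54 − 0.46 cos(2πn / (N − 1)) for 0 ≤ n ≤ N − 1. The function names are illustrative.

```python
import math


def hamming(n_samples):
    """Return the n_samples-point Hamming window as a list of floats."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def apply_window(frame):
    """Multiply a frame sample by sample with a Hamming window of equal length."""
    return [x * w for x, w in zip(frame, hamming(len(frame)))]


window = hamming(5)
print([round(w, 3) for w in window])
print(apply_window([1.0, 1.0, 1.0, 1.0, 1.0]))
```

Note that the Hamming window falls to 0.08 at the frame edges rather than exactly to zero, which is still enough to suppress most of the discontinuity.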
• Fast Fourier Transform (FFT)
The next processing step is the Fast Fourier Transform, which converts
each frame of N samples from the time domain into the frequency
domain. The FFT is a fast algorithm to implement the Discrete Fourier
Transform (DFT).
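The relationship between the DFT and the FFT can be sketched as below: a direct evaluation of the DFT sum next to a recursive radix-2 FFT that gives the same result with fewer operations. This is a teaching sketch that assumes the frame length is a power of two; a real system would normally use an optimized library routine.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT: X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)."""
    n_samples = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
                for n in range(n_samples))
            for k in range(n_samples)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n_samples = len(x)
    if n_samples < 1 or n_samples & (n_samples - 1):
        raise ValueError("length must be a power of two")
    if n_samples == 1:
        return [complex(x[0])]
    even = fft(x[0::2])            # FFT of even-indexed samples
    odd = fft(x[1::2])             # FFT of odd-indexed samples
    half = n_samples // 2
    out = [0j] * n_samples
    for k in range(half):
        twiddle = cmath.exp(-2j * cmath.pi * k / n_samples) * odd[k]
        out[k] = even[k] + twiddle
        out[k + half] = even[k] - twiddle
    return out


frame = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, -2.0, 1.0]
magnitudes = [abs(v) for v in fft(frame)]   # magnitude spectrum of the frame
print([round(m, 3) for m in magnitudes])
```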
• Mel-frequency Wrapping
Human perception of the frequency content of sounds for speech
signals does not follow a linear scale. Thus, for each tone with an
actual frequency, f, a subjective pitch is measured on a scale called the
‘mel’ scale. To simulate this scale, a filter bank is used in which each
filter has a triangular bandpass frequency response, and the spacing
between filters is determined by a constant mel-frequency interval.
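A triangular filter bank with constant mel spacing might be built as in the sketch below. The mel conversion formula, the number of filters (20), the FFT size (256) and the sampling rate (8000 Hz) are all assumptions made for illustration; the slides do not give these values.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)   # assumed approximation


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters, fft_size, sample_rate):
    """Return num_filters triangular filters, each a list of
    fft_size // 2 + 1 weights (one per non-negative FFT bin)."""
    num_bins = fft_size // 2 + 1
    max_mel = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edge frequencies, equally spaced on the mel scale.
    edges_hz = [mel_to_hz(max_mel * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bin_hz = [k * sample_rate / fft_size for k in range(num_bins)]
    bank = []
    for i in range(1, num_filters + 1):
        low, centre, high = edges_hz[i - 1], edges_hz[i], edges_hz[i + 1]
        weights = []
        for f in bin_hz:
            if low < f <= centre:
                weights.append((f - low) / (centre - low))    # rising edge
            elif centre < f < high:
                weights.append((high - f) / (high - centre))  # falling edge
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mel_spectrum(power_spectrum, bank):
    """Weight the power spectrum by each triangular filter and sum."""
    return [sum(w * p for w, p in zip(weights, power_spectrum))
            for weights in bank]


bank = mel_filterbank(num_filters=20, fft_size=256, sample_rate=8000)
flat = [1.0] * (256 // 2 + 1)          # stand-in for a frame's power spectrum
print([round(v, 2) for v in mel_spectrum(flat, bank)])
```

Because the filter edges are equally spaced in mels, the filters are narrow at low frequencies and become wider towards higher frequencies.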
• Cepstrum
In this final step, we convert the log mel spectrum back to the time
domain. The result is called the mel-frequency cepstrum coefficients
(MFCC). Because the mel spectrum coefficients (and so their logarithm)
are real numbers, we can convert them to the time domain using the
Discrete Cosine Transform (DCT).
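The cepstrum step can be sketched as a DCT applied to the logarithm of the mel spectrum. The sketch assumes an unnormalized DCT-II, c(n) = Σ log(S_k) · cos(n (k − 1/2) π / K) for k = 1..K, and skips the zeroth coefficient; both choices are common conventions assumed here, not details given in the slides.

```python
import math


def mfcc_from_mel_spectrum(mel_energies, num_coeffs):
    """Return coefficients c(1)..c(num_coeffs) of the unnormalized DCT-II
    of the log mel spectrum. c(0) is skipped (an assumed convention)."""
    num_bands = len(mel_energies)
    # Floor the energies so that log() never sees zero.
    log_s = [math.log(max(s, 1e-10)) for s in mel_energies]
    return [sum(log_s[k] * math.cos(n * (k + 0.5) * math.pi / num_bands)
                for k in range(num_bands))
            for n in range(1, num_coeffs + 1)]


# Hypothetical mel spectrum whose energy decays with band index.
mel_energies = [math.exp(-k) for k in range(8)]
coeffs = mfcc_from_mel_spectrum(mel_energies, num_coeffs=5)
print([round(c, 3) for c in coeffs])
```

A perfectly flat mel spectrum yields coefficients that are all (numerically) zero, since only the skipped zeroth coefficient captures the overall level.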
Feature matching
• The vector quantization (VQ) approach is used for its ease of
implementation and high accuracy.
• It is a process of mapping vectors from a large vector
space to a finite number of regions in that space.
• Each region is called a cluster and can be represented
by its centre, called a codeword.
• The collection of all codewords is called a codebook.
• The codebook effectively reduces the amount of data while
preserving the essential information of the original
distribution.
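How a codebook is used for matching can be sketched as follows: each feature vector is mapped to its nearest codeword, and the average distance (distortion) measures how well a speaker's codebook fits the test utterance. The two-dimensional vectors, the codebooks and the speaker names are made-up assumptions, and Euclidean distance is assumed as the distortion measure.

```python
import math


def nearest_codeword(vector, codebook):
    """Return (index, distance) of the codeword closest to `vector`."""
    best_index, best_dist = -1, float("inf")
    for index, codeword in enumerate(codebook):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vector, codeword)))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist


def average_distortion(vectors, codebook):
    """Mean distance from each vector to its nearest codeword."""
    return sum(nearest_codeword(v, codebook)[1] for v in vectors) / len(vectors)


# Hypothetical codebooks, one per enrolled speaker (real ones hold MFCC vectors).
codebooks = {
    "speaker_a": [(0.0, 0.0), (1.0, 1.0)],
    "speaker_b": [(5.0, 5.0), (6.0, 4.0)],
}
test_vectors = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.1)]

best = min(codebooks, key=lambda name: average_distortion(test_vectors, codebooks[name]))
print(best)
```

The speaker whose codebook gives the smallest average distortion is taken as the match.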
Thanks