
TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING PULCHOWK CAMPUS

TEXT-PROMPTED REMOTE SPEAKER AUTHENTICATION By:

GANESH TIWARI (063/BCT/510) MADHAV PANDEY (063/BCT/514) MANOJ SHRESTHA (063/BCT/518)

A PROJECT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE BACHELOR’S DEGREE IN COMPUTER ENGINEERING

DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING LALITPUR, NEPAL

January, 2011

TRIBHUVAN UNIVERSITY INSTITUTE OF ENGINEERING PULCHOWK CAMPUS DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING The undersigned certify that they have read, and recommended to the Institute of Engineering for acceptance, a project report entitled “Text-Prompted Remote Speaker Authentication” submitted by Ganesh Tiwari, Madhav Pandey and Manoj Shrestha in partial fulfillment of the requirements for the Bachelor’s degree in Computer Engineering.

__________________________________
Supervisor, Dr. Subarna Shakya
Associate Professor
Department of Electronics and Computer Engineering

__________________________________ Internal Examiner,

_________________________________ External Examiner,

DATE OF APPROVAL:

COPYRIGHT The author has agreed that the Library, Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering may make this report freely available for inspection. Moreover, the author has agreed that permission for extensive copying of this project report for scholarly purposes may be granted by the supervisors who supervised the project work recorded herein or, in their absence, by the Head of the Department wherein the project report was done. It is understood that due recognition will be given to the author of this report and to the Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering in any use of the material of this project report. Copying, publication, or any other use of this report for financial gain without the approval of the Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering and the author’s written permission is prohibited. Requests for permission to copy or to make any other use of the material in this report in whole or in part should be addressed to:

Head
Department of Electronics and Computer Engineering
Pulchowk Campus, Institute of Engineering
Lalitpur, Kathmandu, Nepal

ACKNOWLEDGEMENT We are very thankful to the Institute of Engineering (IOE), Pulchowk Campus for offering the course on the major project. We also thank all the teachers and staff of the Department of Electronics and Computer Engineering who assisted during the project period by giving suitable suggestions and lectures on different subject matters relating to the conduct and achievement of the project goals. We are very much obliged to Dr. Subarna Shakya, Department of Electronics and Computer Engineering, IOE Pulchowk Campus, for the inspiration and valuable suggestions we received throughout the working period. We would like to thank the forum members of askmeflash.com, stackoverflow.com, and dsprelated.com for their quick responses and valuable opinions on our queries. We also express our gratitude to all the friends and juniors who helped a lot with training data collection. Members of the Project: Ganesh Tiwari (063BCT510), Madhav Pandey (063BCT514), Manoj Shrestha (063BCT518)

IOE, PULCHOWK CAMPUS

ABSTRACT A biometric is a physical characteristic unique to each individual. It has very useful applications in authentication and access control.

The designed system is a text-prompted version of voice biometrics which incorporates a text-independent speaker verification system and a speaker-independent speech verification system, implemented independently. The foundation for this joint system is that the speech signal conveys both the speech content and the speaker identity. Such systems are more secure against playback attacks, since the word to be spoken during authentication is not set in advance.

During the course of the project various digital signal processing and pattern classification algorithms were studied. Short-time spectral analysis was performed to obtain MFCCs, energy, and their deltas as features. The feature extraction module is the same for both systems. Speaker modeling was done with a GMM, and a left-to-right discrete HMM with VQ was used for isolated word modeling. The results of both systems were combined to authenticate the user. The speech model for each word was pre-trained using utterances of 45 English words. The speaker model was trained with about 2 minutes of utterances from each of 15 speakers. When uttering individual words, the recognition rate of the speech recognition system is 92% and that of the speaker recognition system is 66%. For longer utterances (>5 sec) the recognition rate of the speaker recognition system improves to 78%.

TABLE OF CONTENTS

PAGE OF APPROVAL
COPYRIGHT
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF SYMBOLS AND ABBREVIATIONS
1. INTRODUCTION
1.2 Objectives
2. LITERATURE REVIEW
2.1 Pattern Recognition
2.2 Generation of Voice
2.3 Voice as Biometric
2.4 Speech Recognition
2.5 Speaker Recognition
2.5.1. Types of Speaker Recognition
2.5.2. Modes of Speaker Recognition
2.6 Feature Extraction for Speech/Speaker Recognition System
2.6.1. Short Time Analysis
2.6.2. MFCC Feature
2.7 Speaker/Speech Modeling
2.7.1. Gaussian Mixture Model
2.7.2. Hidden Markov Model
2.7.3. K-Means Clustering
3. IMPLEMENTATION DETAILS
3.1 Pre-Processing and Feature Extraction
3.1.1. Capture
3.1.2. End Point Detection and Silence Removal
3.1.3. PCM Normalization
3.1.4. Pre-emphasis
3.1.5. Framing and Windowing
3.1.6. Discrete Fourier Transform
3.1.7. Mel Filter
3.1.8. Cepstrum by Inverse Discrete Fourier Transform
3.2 GMM Implementation
3.2.1. Block Diagram of GMM Based Speaker Recognition System
3.2.2. GMM Training
3.2.3. Verification
3.2.4. Performance Measure of Speaker Verification System
3.3 Implementation of HMM for Speech Recognition
3.3.1. Isolated Word Recognition
3.3.2. Application of HMM
3.3.3. Scaling
4. UML CLASS DIAGRAMS OF THE SYSTEMS
5. DATA COLLECTION AND TRAINING
6. RESULTS
7. APPLICATION AREA
8. CONCLUSION
REFERENCES
APPENDIX A: BlazeDS Configuration for Remoting Service
APPENDIX B: Words Used for HMM Training
APPENDIX C: Development Tools and Environment
APPENDIX D: Snapshots of Output GUI

LIST OF FIGURES

Figure 1.1: System Architecture
Figure 1.2: Block Diagram of Text Prompted Speaker Verification System
Figure 2.1: General block diagram of pattern recognition system
Figure 2.2: Vocal Schematic
Figure 2.3: Audio Sample for /i:/ phoneme
Figure 2.4: Audio Magnitude Spectrum for /i:/ phoneme
Figure 2.5: GMM with four Gaussian components and their equivalent model
Figure 2.6: Ergodic Model of HMM
Figure 2.7: Left to Right HMM
Figure 3.1: Pre-Processing and Feature Extraction
Figure 3.2: Input signal to End-point detection system
Figure 3.3: Output signal from End point Detection System
Figure 3.4: Signal before Pre-Emphasis
Figure 3.5: Signal after Pre-Emphasis
Figure 3.6: Frame Blocking of the Signal
Figure 3.7: Hamming window
Figure 3.8: A single frame before and after windowing
Figure 3.9: Equally spaced Mel values
Figure 3.10: Mel Scale Filter Bank
Figure 3.11: Block diagram of GMM based Speaker Recognition System
Figure 3.12: Equal Error Rate (EER)
Figure 3.13: Speech Recognition algorithm flow
Figure 3.14: Pronunciation model of word TOMATO
Figure 3.15: Vector Quantization
Figure 3.16: Average word error rate (for a digits vocabulary) versus the number of states N in the HMM
Figure 3.17: Curve showing tradeoff of VQ average distortion as a function of the size of the VQ, M (shown on a log scale)
Figure 3.18: Forward Procedure - Induction Step
Figure 3.19: Backward Procedure - Induction Step
Figure 3.20: Viterbi Search
Figure 3.21: Computation of ξt(i, j)
Figure 4.1: UML diagram of Client System
Figure 4.2: UML Diagram of Server System

LIST OF SYMBOLS AND ABBREVIATIONS

λ      GMM/HMM Model
T      Threshold
σ²     Variance
Λ( )   Likelihood Ratio
µ      Mean
π      Initial State Distribution
A      State Transition Probability Distribution
B      Observation Symbol Probability Distribution
Cm     Covariance Matrix for mth Component
qt     State at Time t
Wm     Weighting Factor for mth Gaussian Component
x⃗      Feature Vector
AIR    Adobe Integrated Runtime
DC     Direct Current
DCT    Discrete Cosine Transform
DFT    Discrete Fourier Transform
DHMM   Discrete Hidden Markov Model
DTW    Dynamic Time Warping
EM     Expectation-Maximization
FAR    False Acceptance Rate
FRR    False Rejection Rate
GMM    Gaussian Mixture Model
HMM    Hidden Markov Model
LPC    Linear Prediction Coding
MFCC   Mel Frequency Cepstral Coefficient
ML     Maximum Likelihood
PDF    Probability Density Function
PLP    Perceptual Linear Prediction
RIA    Rich Internet Application
RPC    Remote Procedure Call
SID    Speaker IDentification
TER    Total Error Rate
UBM    Universal Background Model
UML    Unified Modeling Language
VQ     Vector Quantization
WTP    Web Tool Platform

1. INTRODUCTION Biometrics is, in the simplest definition, something you are. It is a physical characteristic unique to each individual, such as a fingerprint, retina, iris, or speech. Biometrics has a very useful application in security: it can be used to authenticate a person’s identity and control access to a restricted area, based on the premise that the set of these physical characteristics can be used to uniquely identify individuals. The speech signal conveys two important types of information: primarily the speech content and, on a secondary level, the speaker identity. Speech recognizers aim to extract the lexical information from the speech signal independently of the speaker by reducing the inter-speaker variability. Speaker recognition, on the other hand, is concerned with extracting the identity of the person speaking the utterance. So both speech recognition and speaker recognition are possible from the same voice input. Text-Prompted Remote Speaker Authentication is a voice biometric system that authenticates a user before permitting the user to log into a system on the basis of the user's input voice. It is a web application. Voice signal acquisition and feature extraction are done on the client. Training and authentication based on the voice features obtained from the client side are done on the server. The authentication task is based on the text-prompted version of speaker recognition, which incorporates both speaker recognition and speech recognition. This joint implementation of speech and speaker recognition includes text-independent speaker recognition and speaker-independent speech recognition. Speaker recognition verifies whether the speaker is the claimed one or not, while speech recognition verifies whether or not the spoken word matches the prompted word. The client side is realized in Adobe Flex whereas the server side is realized in Java. The communication between these two cross-platform components is made possible with the help of BlazeDS's RPC remote object.

[User → Browser Application in Client (Flex) → BlazeDS RPC → Server (Java)]

Figure 1.1: System Architecture

Mel Frequency Cepstral Coefficients (MFCC) are used as features for both the speech and speaker recognition tasks. We also combined energy features and the delta and delta-delta features of energy and MFCC. After calculating the features, a Gaussian Mixture Model (GMM) is used for speaker modeling and a left-to-right Discrete Hidden Markov Model with Vector Quantization (DHMM/VQ) for speech modeling. Based on the speech model, the system decides whether or not the uttered speech matches what the user was prompted to utter. Similarly, based on the speaker model, the system decides whether or not the speaker is the claimed one. The speaker is then authenticated using the combined result of these two tests. Referring to Figure 1.2, the feature extraction module is the same for both speech and speaker recognition, and these recognition systems are implemented independently of each other.

Figure 1.2: Block Diagram of Text Prompted Speaker Verification System

1.2 Objectives The objectives of this project are:

- To design and build a speaker verification system
- To design and build a speech verification system
- To implement these systems jointly to control remote access to a restricted area

2. LITERATURE REVIEW 2.1 Pattern Recognition Pattern recognition, one of the branches of artificial intelligence and a sub-field of machine learning, is the study of how machines can observe the environment, learn to distinguish patterns of interest from their background, and make sound and reasonable decisions about the categories of the patterns. A pattern can be a fingerprint image, a handwritten cursive word, a human face, a speech signal, a sales pattern, etc. The applications of pattern recognition include data mining, document classification, financial forecasting, organization and retrieval of multimedia databases, and biometrics (personal identification based on various physical attributes such as face, retina, speech, ear and fingerprints). The essential steps of pattern recognition are: Data Acquisition, Preprocessing, Feature Extraction, Training and Classification.

Figure 2.1: General block diagram of pattern recognition system

Features are used to denote the descriptor. Features must be selected so that they are discriminative and invariant. They can be represented as a vector, matrix, tree, graph, or string. They are ideally similar for objects in the same class and very different for objects in different classes. A pattern class is a family of patterns that share some common properties. Pattern recognition by machine involves techniques for assigning patterns to their respective classes automatically and with as little human intervention as possible. Learning and classification usually use one of the following approaches: Statistical Pattern Recognition is based on statistical characterizations of patterns, assuming that the patterns are generated by a probabilistic system. Syntactical (or Structural) Pattern Recognition is based on the structural interrelationships of features. Given a pattern, its recognition/classification may consist of one of the following two tasks according to the type of learning procedure: 1) Supervised Classification (e.g., Discriminant Analysis), in which the input pattern is identified as a member of a predefined class. 2) Unsupervised Classification (e.g., clustering), in which the pattern is assigned to a previously unknown class. 2.2 Generation of Voice Speech begins with the generation of an airstream, usually by the lungs and diaphragm, a process called initiation. This air then passes through the larynx, where it is modulated by the glottis (vocal cords). This step is called phonation or voicing, and is responsible for the generation of pitch and tone. Finally, the modulated air is filtered by the mouth, nose, and throat - a process called articulation - and the resultant pressure wave excites the air.

Figure 2.2: Vocal Schematic

Depending upon the positions of the various articulators, different sounds are produced. The positions of the articulators can be modeled by a linear time-invariant system whose frequency response is characterized by several peaks called formants. The change in the frequencies of the formants characterizes the phoneme being articulated.

As a consequence of this physiology, we can notice several characteristics of the frequency-domain spectrum of speech. First of all, the oscillation of the glottis results in an underlying fundamental frequency and a series of harmonics at multiples of this fundamental. This is shown in the figure below, where we have plotted a brief audio waveform for the phoneme /i:/ and its magnitude spectrum. The fundamental frequency (180 Hz) and its harmonics appear as spikes in the spectrum. The location of the fundamental frequency is speaker dependent, and is a function of the dimensions and tension of the vocal cords. For adults it usually falls between 100 Hz and 250 Hz, and the female average is significantly higher than that of males.

[Waveform plot: amplitude versus sample index]

Figure 2.3: Audio Sample for /i:/ phoneme showing stationary property of phonemes for a short period

The sound comes out in phonemes, which are the building blocks of speech. Each phoneme resonates at a fundamental frequency and its harmonics and thus has high energy at those frequencies; in other words, different phonemes have different formants. This is the feature that enables the identification of each phoneme at the recognition stage. The inter-speaker variations in the features of the speech signal during the utterance of a word are modeled in word training for speech recognition. For speaker recognition, the intra-speaker variations in features over long speech content are modeled.


[Magnitude spectrum plot: |Y(f)| versus frequency (Hz)]

Figure 2.4: Audio Magnitude Spectrum for /i:/ phoneme showing fundamental frequency and its harmonics

Besides the configuration of articulators, the acoustic manifestation of a phoneme is affected by:

- Physiology and emotional state of speaker
- Phonetic context
- Accent

2.3 Voice as Biometric The underlying premise for voice authentication is that each person's voice differs in pitch, tone, and volume enough to make it uniquely distinguishable. Several factors contribute to this uniqueness: the size and shape of the mouth, throat, nose, and teeth (articulators) and the size, shape, and tension of the vocal cords. The chance that all of these are exactly the same in any two people is very low. Voice biometrics has the following advantages over other forms of biometrics:

- Natural signal to produce
- Low implementation cost, since it does not require a specialized input device
- Acceptable to users
- Easily combined with other forms of authentication for multifactor authentication
- The only biometric that allows users to authenticate remotely

2.4 Speech Recognition Speech is the dominant means of communication between humans, and promises to be important for communication between humans and machines, if it can just be made a little more reliable. Speech recognition is the process of converting an acoustic signal to a set of words. The applications include voice commands and control, data entry, voice user interfaces, automating the telephone operator's job in telephony, etc. It can also serve as the input to natural language processing. There are two variants of speech recognition based on the duration of the speech signal: isolated word recognition, in which each word is surrounded by some sort of pause, is much easier than recognizing continuous speech, in which words run into each other and have to be segmented. Speech recognition is a difficult task because of the many sources of variability associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context. Second, acoustic variability can result from changes in the environment as well as in the position and characteristics of the transducer. Third, within-speaker variability can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to cross-speaker variability. Such variability is modeled in various ways. At the level of signal representation, a representation that emphasizes the speaker-independent features is developed. 2.5 Speaker Recognition Speaker recognition is the process of automatically recognizing who is speaking on the basis of the individual's information included in the speech waves. Speaker recognition can be classified into identification and verification. Speaker recognition has been applied most often as a means of biometric authentication.


2.5.1. Types of Speaker Recognition 2.5.1.1 Speaker Identification Speaker identification is the process of determining which registered speaker provides a given utterance. In a Speaker IDentification (SID) system, no identity claim is provided; the test utterance is scored against a set of known (registered) references for each potential speaker, and the one whose model best matches the test utterance is selected. There are two types of speaker identification tasks: closed-set and open-set speaker identification. In closed-set identification, the test utterance belongs to one of the registered speakers. During testing, a matching score is estimated for each registered speaker. The speaker corresponding to the model with the best matching score is selected. This requires N comparisons for a population of N speakers. In open-set identification, any speaker can access the system; those who are not registered should be rejected. This requires another model, referred to as a garbage model, imposter model, or background model, which is trained with data provided by speakers other than the registered speakers. During testing, the matching score corresponding to the best speaker model is compared with the matching score estimated using the garbage model in order to accept or reject the speaker, making the total number of comparisons equal to N + 1. Speaker identification performance tends to decrease as the population size increases. 2.5.1.2 Speaker Verification Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. That is, the goal is to automatically accept or reject an identity that is claimed by the speaker. During testing, a verification score is estimated using the claimed speaker model and the anti-speaker model. This verification score is then compared to a threshold. If the score is higher than the threshold, the speaker is accepted; otherwise, the speaker is rejected. Thus, speaker verification involves a hypothesis test requiring a simple binary decision: accept or reject the claimed identity, regardless of the population size. Hence, the performance is quite independent of the population size, but it depends on the number of test utterances used to evaluate the performance of the system.


2.5.2. Modes of Speaker Recognition There are three modes in which speaker verification/identification can be done. 2.5.2.1 Text Independent In the text-independent mode, the system relies only on the voice characteristics of the speaker; the lexical content of the utterance is not used. The system models the characteristics of the speech which show up irrespective of what one is saying. This mode is used in surveillance or forensic applications where there is no control over the speakers accessing the system. The test utterances can be different from those used for enrollment; hence, text-independent speaker verification needs a large and rich training data set to model the characteristics of the speaker's voice and to cover the phonetic space. A larger training set and longer test segments are required than for text-dependent verification, to appropriately model the feature variations of the current user in uttering different phonemes. 2.5.2.2 Text Dependent In the text-dependent mode of verification, the user is expected to say a pre-determined text, a voice password. Since recognition is based on the speaker characteristics as well as the lexical content of the password, text-dependent speaker recognition systems are generally more robust and achieve good performance. However, such systems are not yet used on a large scale due to the fear of playback attacks, since the system has a priori knowledge about the password, i.e., the training and the test texts are the same. The speaker model encodes the speaker's voice characteristics associated with the phonemic or syllabic content of the password. 2.5.2.3 Text-prompted Both text-dependent and text-independent systems are susceptible to fraud, since for typical applications the voice of a speaker could be captured, recorded, and reproduced. To limit this risk, a particular kind of text-dependent speaker verification system based on prompted text has been developed. The password, i.e., the text to speak, is not pre-determined; rather, the user is asked to speak a prompted text (digits, a word, or a phrase). If the number of distinct random passwords is large, a playback attack is not feasible. Hence the text-prompted system is more secure.


As in the case of text-independent systems, text-prompted systems also need a large and rich training data set for each registered speaker to create robust speaker-dependent models. For that reason, we have chosen the text-prompted system. 2.6 Feature Extraction for Speech/Speaker Recognition System Signal representation, or coding of the short-term spectrum into feature vectors, is one of the most important steps in automatic speaker recognition and continues to be a subject of research. Many different techniques have been proposed in the literature and generally they are based on speech production models or speech perception models. The goal of feature extraction is to transform the input waveform into a sequence of acoustic feature vectors, each vector representing the information in a small time window of the signal. Feature extraction transforms the high-dimensional input signal into lower-dimensional vectors. For speaker recognition purposes, an optimal feature has the following properties:

1. High inter-speaker variation
2. Low intra-speaker variation
3. Easy to measure
4. Robust against disguise and mimicry
5. Robust against distortion and noise
6. Maximally independent of the other features

2.6.1. Short Time Analysis The analysis of the speech signal at the spectral level is based on applying classic Fourier analysis to the whole speech signal. However, an exact definition of the Fourier transform cannot be directly applied because the speech signal cannot be considered stationary due to constant changes in the articulatory system within each speech utterance. To solve this problem, the speech signal is split into a sequence of short segments in such a way that each one is short enough to be considered pseudo-stationary. The length of each segment, also called a window or frame, ranges between 10 and 40 ms (in such a short time period our articulatory system is not able to change significantly). Finally, a feature vector is extracted from the short-time spectrum in each window. The whole process is known as short-term spectral analysis.


2.6.2. MFCC Feature The commonly used feature extraction methods for speech/speaker recognition are LPC (Linear Prediction Coding), MFCC (Mel Frequency Cepstral Coefficients) and PLP (Perceptual Linear Prediction). LPC is based on the assumption that a speech sample can be approximated by a linearly weighted summation of a determined number of preceding samples. PLP is calculated in a similar way to LPC coefficients, but prior transformations are carried out on the spectrum of each window aiming at introducing knowledge about human hearing behavior. The most popular feature extraction method, MFCC, mimics human hearing behavior by emphasizing lower frequencies and penalizing higher frequencies. The Mel scale, proposed by Stevens, Volkman and Newman in 1937, is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The Mel scale is based on an empirical study of human perceived pitch or frequency. Human hearing is not equally sensitive at all frequency bands; it is less sensitive at higher frequencies, roughly above 1000 Hz. The Mel is a unit of pitch defined so that pairs of sounds which are perceptually equidistant in pitch are separated by an equal number of Mels. The mapping between frequency in Hertz and the Mel scale is linear below 1000 Hz and logarithmic above 1000 Hz:

mel(f) = 2595 log10(1 + f / 700)

Modeling this property of human hearing during feature extraction improves speech recognition performance. The form of the model used in MFCCs is to warp the frequencies output by the DFT onto the Mel scale. During MFCC computation, this insight is implemented by creating a bank of filters which collect energy from each frequency band.
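To make the warping concrete, the short Java sketch below converts between Hertz and the Mel scale using the formula given above. It is only an illustration under that formula; the class and method names are ours, not taken from the project's code.

public final class MelScale {
    // mel(f) = 2595 * log10(1 + f / 700): roughly linear below 1 kHz, logarithmic above.
    public static double hzToMel(double hz) {
        return 2595.0 * Math.log10(1.0 + hz / 700.0);
    }

    // Inverse mapping: f = 700 * (10^(mel / 2595) - 1).
    public static double melToHz(double mel) {
        return 700.0 * (Math.pow(10.0, mel / 2595.0) - 1.0);
    }

    public static void main(String[] args) {
        // 1000 Hz maps to roughly 1000 mel; higher frequencies are increasingly compressed.
        System.out.printf("1000 Hz -> %.1f mel%n", hzToMel(1000.0));
        System.out.printf("4000 Hz -> %.1f mel%n", hzToMel(4000.0));
    }
}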


2.7 Speaker/Speech Modeling There are various pattern modeling/matching techniques. They include Dynamic Time Warping (DTW), the Gaussian Mixture Model (GMM), Hidden Markov Modeling (HMM), Artificial Neural Networks (ANN), and Vector Quantization (VQ). These are interchangeably used for speech and speaker modeling. The best approach is statistical learning methods: GMM for speaker recognition, which models the variations in the features of a speaker over a long sequence of utterances, and HMM, another statistical method widely used for speech recognition. The HMM models the Markovian nature of the speech signal, where each phoneme represents a state and a sequence of such phonemes represents a word. The sequence of features of such phonemes from different speakers is modeled by the HMM. 2.7.1. Gaussian Mixture Model 2.7.1.1 Univariate Gaussian The Gaussian distribution, also known as the normal distribution, is the bell-curve function. A Gaussian distribution is a function parameterized by a mean µ and a variance σ². The formula for a Gaussian function is:

N(x | µ, σ²) = (1 / sqrt(2πσ²)) exp( −(x − µ)² / (2σ²) )

2.7.1.2 Mixture Model In statistics, a mixture model is a probabilistic model which assumes the underlying data to belong to a mixture distribution. In a mixture distribution, its density function is just a convex combination (a linear combination in which all coefficients or weights sum to one) of other probability density functions:

p(x) = w1 p1(x) + w2 p2(x) + … + wn pn(x)

The individual pi(x) density functions that are combined to make the mixture density p(x) are called the mixture components, and the weights w1, w2, …, wn associated with each component are called the mixture weights or mixture coefficients. 2.7.1.3 Gaussian Mixture Model A Gaussian Mixture Model (GMM) is a parametric probability density function commonly used as a model of continuous data, most notably of biometric features in speaker recognition systems, due to its capability of representing a large class of sample distributions. Like K-Means, Gaussian Mixture Models can be regarded as a type of unsupervised learning or clustering method. A GMM is based on a clustering technique, where the entire set of experimental data is modeled by a mixture of Gaussians. But unlike K-Means, GMMs are able to build soft clustering boundaries, i.e., points in space can belong to any class with a given probability. In a Gaussian mixture distribution, the density function is just a convex combination (a linear combination in which all coefficients or weights sum to one) of Gaussian probability density functions:

Figure 2.5: GMM with four Gaussian components and their equivalent model

Mathematically, a GMM is the weighted sum of M Gaussian component densities, given by the equation

p(x⃗ | λ) = Σ (m = 1 to M) wm · pm(x⃗ | µ⃗m, Cm)

where x⃗ is a k-dimensional random vector and the wm are the mixture weights, which express the relative importance of each component and satisfy the constraint Σ (m = 1 to M) wm = 1. The pm(x⃗ | µ⃗m, Cm), m = 1, 2, …, M, are the component densities, where each component density is a k-dimensional Gaussian function (pdf) of the form

pm(x⃗ | µ⃗m, Cm) = 1 / ( (2π)^(k/2) |Cm|^(1/2) ) · exp{ −(1/2) (x⃗ − µ⃗m)ᵀ Cm⁻¹ (x⃗ − µ⃗m) }

where µ⃗m is the mean vector of length k of the mth Gaussian pdf and Cm is the k×k covariance matrix of the mth Gaussian pdf. Thus the complete Gaussian Mixture Model is parameterized by the mixture weights, mean vectors and covariance matrices of all component densities. These parameters are collectively represented by the notation

λ = { wm, µ⃗m, Cm }, m = 1, 2, …, M

These parameters are estimated during training. For a speaker recognition system, each speaker is represented by a GMM and is referred to by his/her model λ. The GMM is widely used for speaker modeling and classification due to two important benefits. First, the individual Gaussian components in a speaker-dependent GMM can be interpreted as representing some broad acoustic classes, such as speaker-dependent vocal tract configurations, that are useful for modeling speaker identity. A speaker's voice can be characterized by a set of acoustic classes representing some broad phonetic events such as vowels, nasals, and fricatives. These acoustic classes reflect some general speaker-dependent vocal tract configurations that are useful for characterizing speaker identity. The spectral shape of the ith acoustic class can in turn be represented by the mean of the ith component density, and variations of the average spectral shape can be represented by the covariance matrix. These acoustic classes are hidden before training. Second, a Gaussian mixture density provides a smooth approximation to the long-term sample distribution of the training utterances of a given speaker. The unimodal Gaussian speaker model represents a speaker's feature distribution by a mean vector and covariance matrix, and the VQ model represents a speaker's distribution by a discrete set of characteristic templates. The GMM acts as a hybrid between these two models, using a discrete set of Gaussian functions, each with its own mean and covariance matrix, to allow better modeling capability.


2.7.2. Hidden Markov Model In general, a Markov model is a way of describing a process that goes through a series of states. The model describes all the possible paths through the state space and assigns a probability to each one. The probability of transitioning from the current state to another one depends only on the current state, not on any prior part of the path. HMMs can be applied in many fields where the goal is to recover a data sequence that is not immediately observable. Common applications include cryptanalysis, speech recognition, part-of-speech tagging, machine translation, partial discharge, gene prediction, alignment of bio-sequences, and activity recognition. 2.7.2.1 Discrete Markov Processes For a first-order Markov chain with N distinct states S1, S2, …, SN, the transition probability is given by:

aij = P(qt = Sj | qt−1 = Si), 1 ≤ i, j ≤ N

where qt is the state at time t. The state transition coefficients have the following properties (due to standard stochastic constraints):

aij ≥ 0 for all i, j
Σ (j = 1 to N) aij = 1 for all i

The transition probabilities for all states in a model can be described by an N×N transition probability matrix:

A = | a11  a12  …  a1N |
    | a21  a22  …  a2N |
    |  ⋮    ⋮   ⋱   ⋮  |
    | aN1  aN2  …  aNN |

The initial state distribution vector is given by:

π = [ P(q1 = S1), P(q1 = S2), …, P(q1 = SN) ]

The stochastic property of the initial state distribution vector is

Σ (i = 1 to N) πi = 1

where πi is defined as πi = P(q1 = Si), 1 ≤ i ≤ N. The Markov model can thus be described by λ = (A, π). This stochastic process could be called an observable Markov model, since the output of the process is the set of states at each instant of time, where each state corresponds to a physical (observable) event. 2.7.2.3 Hidden Markov Model The Markov model is too restrictive to be applicable to many problems of interest. So the concept of the Markov model is extended to the Hidden Markov Model to include the case where the observation is a probabilistic function of the state. The resulting model is a doubly embedded stochastic process with an underlying stochastic process that is not observable (i.e. hidden), but can only be observed through another set of stochastic processes that produce the sequence of observations. The difference is that in a Markov chain the output state is completely determined at each time t, whereas in the Hidden Markov Model the state at each time t must be inferred from observations; an observation is a probabilistic function of a state. Elements of HMM The HMM is characterized by the following: 1) Set of hidden states S = {S1, S2, …, SN} and state at time t, qt ∈ S 2) Set of observation symbols per state V = {v1, v2, …, vM} and observation at time t, Ot ∈ V

3) The initial state distribution π = {πi}, where πi = P[q1 = Si], 1 ≤ i ≤ N

4) State transition probability distribution A = {aij}, where aij = P[qt+1 = Sj | qt = Si], 1 ≤ i, j ≤ N

5) Observation symbol probability distribution in state j, B = {bj(k)}, where bj(k) = P[vk at t | qt = Sj], 1 ≤ j ≤ N, 1 ≤ k ≤ M

Compactly, an HMM is typically written as λ = (A, B, π). 2.7.2.4 Types of HMMs An ergodic or fully connected HMM has the property that every state can be reached from every other state in a finite number of steps. This type of model has the property that every aij coefficient is positive. For some applications, other types of HMMs have been found to account for the observed properties of the signal being modeled better than the standard ergodic model.

Figure 2.6: Ergodic Model of HMM

One such model is the left-right model, or Bakis model, because the underlying state sequence associated with the model has the property that as time increases the state index increases (or stays the same), i.e. the states proceed from left to right. Clearly, the left-right type of HMM

has the desirable property that it can readily model signals whose properties change over time, e.g., speech.

[Four-state left-to-right HMM with self-loop transitions a11, a22, a33, a44, forward transitions a12, a23, a34, and skip transitions a13, a24]

Figure 2.7: Left to Right HMM

The properties of left-right HMMs are: 1) The state transition coefficients have the property

aij = 0, j < i

i.e., no transition is allowed to states whose indices are lower than the current state. 2) The initial state probabilities have the property

πi = 1 for i = 1, πi = 0 for i ≠ 1

since the state sequence must begin in state 1 and end in state N. 3) The state transition coefficients for the last state in a left-right model are specified as

aNN = 1

With left-right models, additional constraints are placed on the state transition coefficients to make sure that large changes in state indices do not occur; hence a constraint of the form

aij = 0, j > i + Δ

is often used. The value of Δ is 2 in this speech recognition system, i.e., no jumps of more than 2 states are allowed. The form of the state transition matrix for Δ = 2 and N = 4 is as follows:

A = | a11  a12  a13   0  |
    |  0   a22  a23  a24 |
    |  0    0   a33  a34 |
    |  0    0    0   a44 |
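The constraints above translate directly into how a left-right transition matrix can be initialized before training. The following Java fragment is a minimal sketch of ours (not the project's actual classes), assuming uniform initialization of the allowed transitions; Baum-Welch re-estimation would later refine the values.

public final class LeftRightHmm {
    // Builds an N x N left-to-right transition matrix in which a state may only stay,
    // move forward, or skip at most `delta` states ahead (a_ij = 0 for j < i or j > i + delta).
    public static double[][] initTransitions(int n, int delta) {
        double[][] a = new double[n][n];
        for (int i = 0; i < n; i++) {
            int last = Math.min(n - 1, i + delta);   // furthest reachable state
            int allowed = last - i + 1;              // self-loop plus forward transitions
            for (int j = i; j <= last; j++) {
                a[i][j] = 1.0 / allowed;             // uniform start value for allowed transitions
            }
        }
        a[n - 1][n - 1] = 1.0;                       // last state only loops on itself (a_NN = 1)
        return a;
    }

    public static void main(String[] args) {
        // N = 4 states and delta = 2, matching the example in the text.
        double[][] a = initTransitions(4, 2);
        for (double[] row : a) System.out.println(java.util.Arrays.toString(row));
    }
}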


2.7.3. K-Means Clustering Clustering can be considered the most important unsupervised learning problem; so, as with every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects which are “similar” to each other and “dissimilar” to the objects belonging to other clusters. In statistics and machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. The algorithm is composed of the following steps: 1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids. 2. Assign each object to the group that has the closest centroid. 3. When all objects have been assigned, recalculate the positions of the K centroids. 4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated. Both the clustering process and the decoding process require a distance metric or distortion metric that specifies how similar two acoustic feature vectors are. The distance metric is used to build clusters, to find a prototype vector for each cluster, and to compare incoming vectors to the prototypes. The simplest distance metric for acoustic feature vectors is Euclidean distance: the distance in N-dimensional space between the two points defined by the two vectors. A sketch of the algorithm is given below.
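The following Java sketch implements the four steps above with Euclidean distance. It is a simplified illustration (our own naming, random initial centroids, no handling of empty clusters), not the clustering code used in the project.

import java.util.Random;

public final class KMeans {
    // Runs k-means on `data` (each row is a feature vector) and returns the k centroids.
    public static double[][] cluster(double[][] data, int k, int maxIter, long seed) {
        Random rnd = new Random(seed);
        int dim = data[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {                        // Step 1: pick k initial centroids
            centroids[c] = data[rnd.nextInt(data.length)].clone();
        }
        int[] assign = new int[data.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            for (int i = 0; i < data.length; i++) {          // Step 2: assign to nearest centroid
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = euclidean(data[i], centroids[c]);
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            if (!changed && iter > 0) break;                 // Step 4: stop when assignments settle
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < data.length; i++) {          // Step 3: recompute centroid positions
                counts[assign[i]]++;
                for (int d = 0; d < dim; d++) sums[assign[i]][d] += data[i][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return centroids;
    }

    // Euclidean distance between two vectors of equal length.
    static double euclidean(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(sum);
    }
}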


3. IMPLEMENTATION DETAILS The implementation of the joint speaker/speech recognition system includes a common pre-processing and feature extraction module, text-independent speaker modeling and classification by GMM, and speaker-independent speech modeling and classification by HMM/VQ. 3.1 Pre-Processing and Feature Extraction Starting from the capture of the audio signal, feature extraction consists of the following steps, as shown in the block diagram below:

[Speech Signal → Silence Removal → Pre-emphasis → Framing → Windowing → DFT → Mel Filter Bank → Log → IDFT → 12 MFCC Coefficients → CMS; Energy → 1 Energy Feature; Deltas → 12 MFCC, 12 Δ MFCC, 12 ΔΔ MFCC, 1 energy, 1 Δ energy, 1 ΔΔ energy]

Figure 3.1: Pre-Processing and Feature Extraction

3.1.1. Capture The first step in processing speech is to convert the analog representation (first air pressure, and then analog electric signals in a microphone) into a digital signal x[n], where n is an index over time. Analysis of the audio spectrum shows that nearly all energy resides in the band between DC and 4 kHz, and beyond 10 kHz there is virtually no energy whatsoever. The sound format used is:

- 22050 Hz
- 16-bit, signed
- Little endian
- Mono channel
- Uncompressed PCM

3.1.2. End Point Detection and Silence Removal The captured audio signal may contain silence at different positions, such as the beginning of the signal, in between the words of a sentence, and the end of the signal. If silent frames are included, modeling resources are spent on parts of the signal which do not contribute to the identification. The silence present must be removed before further processing. There are several ways of doing this; the most popular are Short Time Energy and Zero Crossing Rate, but they have their own limitations regarding thresholds being set on an ad hoc basis. The algorithm we used [Ref. 4] uses statistical properties of the background noise as well as physiological aspects of speech production, and does not assume any ad hoc threshold. It assumes that the background noise present in the utterances is Gaussian in nature. Usually the first 200 ms or more (we used 4410 samples at the sampling rate of 22050 samples/sec) of a speech recording corresponds to silence (or background noise), because the speaker takes some time to start reading when recording starts. Endpoint Detection Algorithm Step 1: Calculate the mean (µ) and standard deviation (σ) of the first 200 ms of samples of the given utterance. The background noise is characterized by this µ and σ. Step 2: Go from the 1st sample to the last sample of the speech recording. For each sample, check whether the one-dimensional Mahalanobis distance, i.e. |x−µ|/σ, is greater than 3 or not. If the Mahalanobis distance is greater than 3, the sample is treated as a voiced sample; otherwise it is unvoiced/silence. This threshold rejects up to 99.7% of the noise samples, as given by P[|x−μ|≤3σ] = 0.997 for a Gaussian distribution, thus accepting only the voiced samples. Step 3: Mark the voiced samples as 1 and unvoiced samples as 0. Divide the whole speech signal into 10 ms non-overlapping windows. Represent the complete speech by only zeros and ones. Step 4: Consider that there are M zeros and N ones in a window. If M ≥ N, then convert each of the ones to zeros, and vice versa. This method is adopted keeping in mind that a speech production system consisting of vocal cords, tongue, vocal tract etc. cannot change abruptly in a short time window, taken here as 10 ms.

Step 5: Collect the voiced part only, according to the samples labeled '1', from the windowed array and dump it into a new array. Retrieve the voiced part of the original speech signal from the samples labeled 1.

[Waveform plot: amplitude versus sample index]

Figure 3.2: Input signal to End-point detection system

[Waveform plot: amplitude versus sample index]

Figure 3.3: Output signal from End point Detection System
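A compact Java sketch of the endpoint detection algorithm described above is given below. Names are ours, and Step 4 is implemented as a per-window majority vote, which is one reading of the rule given in the text; the real implementation may differ in these details.

public final class EndpointDetector {
    // Removes silence from a normalized PCM signal. Assumes the first `noiseSamples`
    // samples contain only background noise (4410 samples = 200 ms at 22050 Hz).
    public static double[] removeSilence(double[] x, int noiseSamples, int sampleRate) {
        // Step 1: characterize the background noise by its mean and standard deviation.
        double mean = 0;
        for (int i = 0; i < noiseSamples; i++) mean += x[i];
        mean /= noiseSamples;
        double var = 0;
        for (int i = 0; i < noiseSamples; i++) var += (x[i] - mean) * (x[i] - mean);
        double sd = Math.max(Math.sqrt(var / noiseSamples), 1e-10);

        // Steps 2 and 3: mark a sample as voiced if its one-dimensional Mahalanobis
        // distance |x - mean| / sd exceeds 3 (the 99.7% Gaussian threshold).
        boolean[] voiced = new boolean[x.length];
        for (int i = 0; i < x.length; i++) voiced[i] = Math.abs(x[i] - mean) / sd > 3.0;

        // Step 4: smooth the labels over non-overlapping 10 ms windows by majority vote,
        // since the articulators cannot change state abruptly within 10 ms.
        int win = sampleRate / 100;
        for (int start = 0; start < x.length; start += win) {
            int end = Math.min(start + win, x.length), ones = 0;
            for (int i = start; i < end; i++) if (voiced[i]) ones++;
            boolean label = ones > (end - start) - ones;
            for (int i = start; i < end; i++) voiced[i] = label;
        }

        // Step 5: keep only the samples labeled as voiced.
        int count = 0;
        for (boolean v : voiced) if (v) count++;
        double[] out = new double[count];
        int k = 0;
        for (int i = 0; i < x.length; i++) if (voiced[i]) out[k++] = x[i];
        return out;
    }
}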

3.1.3. PCM Normalization The extracted pulse-code-modulated amplitude values are normalized, to avoid amplitude variation during capture. 3.1.4. Pre-emphasis Usually the speech signal is pre-emphasized before any further processing: if we look at the spectrum for voiced segments like vowels, there is more energy at lower frequencies than at higher frequencies. This drop in energy across frequencies is caused by the nature of the glottal pulse. Boosting the high-frequency energy makes information from these higher formants more available to the acoustic model and improves phone detection accuracy. The pre-emphasis filter is a first-order high-pass filter. In the time domain, with input x[n] and 0.9 ≤ α ≤ 1.0, the filter equation is:

y[n] = x[n] − α·x[n−1]

We used α = 0.95.

[Magnitude spectrum plot: |Y(f)| versus frequency (Hz)]

Figure 3.4: Signal before Pre-Emphasis

[Magnitude spectrum plot: |Y(f)| versus frequency (Hz)]

Figure 3.5: Signal after Pre-Emphasis
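A short Java sketch of the normalization and pre-emphasis steps described above follows. Peak normalization is our assumption for the PCM normalization step (the report does not state the exact scheme); the pre-emphasis filter uses α = 0.95 as stated.

public final class PreProcess {
    // Scales the PCM samples so that the largest magnitude becomes 1.0 (peak normalization).
    public static double[] normalize(double[] x) {
        double max = 1e-12;
        for (double v : x) max = Math.max(max, Math.abs(v));
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) y[i] = x[i] / max;
        return y;
    }

    // First-order high-pass pre-emphasis filter: y[n] = x[n] - alpha * x[n-1].
    public static double[] preEmphasize(double[] x, double alpha) {
        double[] y = new double[x.length];
        y[0] = x[0];
        for (int n = 1; n < x.length; n++) y[n] = x[n] - alpha * x[n - 1];
        return y;
    }
}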

3.1.5. Framing and Windowing Speech is a non-stationary signal, meaning that its statistical properties are not constant across time. Instead, we want to extract spectral features from a small window of speech that characterizes a particular sub-phone and for which we can make the (rough) assumption that the signal is stationary (i.e. its statistical properties are constant within this region). We used frame blocks of 23.22 ms with 50% overlap, i.e., 512 samples per frame.

Figure 3.6: Frame Blocking of the Signal

The rectangular window (i.e., no window) can cause problems when we do Fourier analysis; it abruptly cuts off the signal at its boundaries. A good window function has a narrow main lobe and low side-lobe levels in its transfer function, which shrinks the values of the signal toward zero at the window boundaries, avoiding discontinuities. The most commonly used window function in speech processing is the Hamming window, defined as follows:

w(n) = 0.54 − 0.46 cos( 2π(n−1) / (N−1) ), 1 ≤ n ≤ N

Figure 3.7: Hamming window

The extraction of the signal takes place by multiplying the value of the signal at time n, s_frame[n], by the value of the window at time n, s_w[n]:

y[n] = s_w[n] × s_frame[n]

[Two waveform plots: a single frame before and after windowing, amplitude versus sample index]

Figure 3.8: A single frame before and after windowing
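The framing and windowing just described can be sketched in Java as follows. The 512-sample frame with 50% overlap matches the report; the zero-based index in the Hamming formula is an equivalent form, and all names are ours.

public final class Framer {
    // Splits the signal into frames of `frameSize` samples with 50% overlap
    // and applies a Hamming window to each frame.
    public static double[][] frameAndWindow(double[] signal, int frameSize) {
        int shift = frameSize / 2;                               // 50% overlap
        int frames = signal.length < frameSize ? 0 : 1 + (signal.length - frameSize) / shift;
        double[] window = hamming(frameSize);
        double[][] out = new double[frames][frameSize];
        for (int f = 0; f < frames; f++) {
            int start = f * shift;
            for (int n = 0; n < frameSize; n++) {
                out[f][n] = signal[start + n] * window[n];       // y[n] = w[n] * s_frame[n]
            }
        }
        return out;
    }

    // Hamming window, zero-based: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)), 0 <= n < N.
    static double[] hamming(int size) {
        double[] w = new double[size];
        for (int n = 0; n < size; n++) {
            w[n] = 0.54 - 0.46 * Math.cos(2.0 * Math.PI * n / (size - 1));
        }
        return w;
    }
}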

3.1.6. Discrete Fourier Transform A Discrete Fourier Transform (DFT) of the windowed signal is used to extract the frequency content (the spectrum) of the current frame. The tool for extracting spectral information i.e., how much energy the signal contains at discrete frequency bands for a discrete-time (sampled) signal is the Discrete Fourier Transform or DFT. The input to the DFT is a windowed signal x[n]...x[m], and the output, for each of N discrete frequency bands, is a complex number X[k] representing the magnitude and phase of that frequency component in the original signal.

X[k] = Σ (n = 0 to N−1) x[n] · e^(−j2πkn/N), k = 0, 1, 2, …, N−1

The commonly used algorithm for computing the DFT is the Fast Fourier Transform or in short FFT. 3.1.7. Mel Filter For calculating the MFCC, first, a transformation is applied according to the following formula:

mel(x) = 2595 log10(1 + x / 700)

Where, x is the linear frequency. Then, a filter bank is applied to the amplitude of the mel-scaled spectrum. The Mel frequency warping is most conveniently done by utilizing a filter bank with filters centered according to Mel frequencies. The width of the triangular filters varies according to the Mel scale, so that the log total energy in a critical band around the center frequency is included. The centers of the filters are uniformly spaced in the mel scale.


Figure 3.9: Equally spaced Mel values

The result of the Mel filter bank is information about the distribution of energy in each Mel-scale band: one output energy value is obtained from each filter in the bank.

Figure 3.10: Triangular filter bank in frequency scale

We have used 30 filters in the filter bank.


The Mel frequency m can be computed from the raw acoustic frequency as follows:

mel(f) = 2595 log10(1 + f / 700),    mel⁻¹(m) = 700 · (10^(m/2595) − 1)

The Mel filter bank consists of M triangular filters Hm(k), m = 1, 2, …, M, defined over the DFT bins k as:

Hm(k) = 0                               for k < f(m−1)
Hm(k) = (k − f(m−1)) / (f(m) − f(m−1))  for f(m−1) ≤ k ≤ f(m)
Hm(k) = (f(m+1) − k) / (f(m+1) − f(m))  for f(m) ≤ k ≤ f(m+1)
Hm(k) = 0                               for k > f(m+1)

where the boundary points f(m) are uniformly spaced on the Mel scale between the lowest and highest frequencies of the filter bank:

f(m) = (N / fs) · mel⁻¹( mel(fmin) + m · (mel(fmax) − mel(fmin)) / (M + 1) ), m = 1, 2, …, M

with fs the sampling frequency, N the DFT size, fmin the lowest frequency and fmax = fs/2 the highest frequency of the filter bank.

3.1.8. Cepstrum by Inverse Discrete Fourier Transform A cepstral transform is applied to the filter outputs in order to obtain the MFCC features of each frame. The triangular filter outputs Y(i), i = 1, 2, …, M are compressed using the logarithm, and the discrete cosine transform (DCT) is applied. Here, M is equal to the number of filters in the filter bank, i.e., 30.

C[n] = Σ (i = 1 to M) log( Y(i) ) · cos( πn(i − 1/2) / M )

where C[n] is the MFCC vector for each frame.
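Given the Mel filter-bank outputs of one frame, the log compression and DCT above can be sketched in Java as follows; the indexing follows the formula just given (filters i = 1..M, with M = 30 here), while the naming and the small floor that guards against log(0) are ours.

public final class Cepstrum {
    // Computes `numCoeffs` MFCCs from the Mel filter-bank outputs of one frame:
    //   C[n] = sum_{i=1..M} log(Y(i)) * cos(pi * n * (i - 0.5) / M)
    public static double[] mfcc(double[] filterBankEnergies, int numCoeffs) {
        int m = filterBankEnergies.length;                       // M, the number of filters
        double[] c = new double[numCoeffs];
        for (int n = 1; n <= numCoeffs; n++) {
            double sum = 0.0;
            for (int i = 1; i <= m; i++) {
                // Small floor avoids log(0) for silent bands.
                double logEnergy = Math.log(Math.max(filterBankEnergies[i - 1], 1e-10));
                sum += logEnergy * Math.cos(Math.PI * n * (i - 0.5) / m);
            }
            c[n - 1] = sum;
        }
        return c;
    }
}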


The resulting vector is called the Mel-frequency cepstrum (MFC), and the individual components are the Mel-frequency cepstral coefficients (MFCCs). We extracted 12 such features from each speech frame. 3.1.9. Post Processing 3.1.9.1 Cepstral Mean Subtraction (CMS) A speech signal may be subjected to some channel noise when recorded, also referred to as the channel effect. A problem arises if the channel effect when recording training data for a given person is different from the channel effect in later recordings when the person uses the system. The problem is that a false distance between the training data and newly recorded data is introduced due to the different channel effects. The channel effect is eliminated by subtracting the mean Mel-cepstrum coefficients from the Mel-cepstrum coefficients:

Ĉt(n) = Ct(n) − (1/T) Σ (τ = 1 to T) Cτ(n), n = 1, 2, …, 12

where T is the number of frames in the utterance.

3.1.9.2 The Energy Feature The energy in a frame is the sum over time of the power of the samples in the frame; thus for a signal x in a window from time sample t1 to time sample t2, the energy is:

E = Σ (t = t1 to t2) x[t]²

3.1.9.3 Delta Feature Another interesting fact about the speech signal is that it is not constant from frame to frame. Co-articulation (the influence of a speech sound on another adjacent or nearby speech sound) can provide a useful cue for phone identity. It can be preserved by using delta features. Velocity (delta) and acceleration (delta-delta) coefficients are usually obtained from the static window-based information. These delta and delta-delta coefficients model the speed and acceleration of the variation of the cepstral feature vectors across adjacent windows. A simple way to compute deltas would be just to compute the difference between frames; thus the delta value d(t) for a particular cepstral value c(t) at time t can be estimated as:

d(t) = Δc_t[n] = c_{t+1}[n] − c_t[n]

The differentiating method is simple, but since it acts as a high-pass filtering operation on the parameter domain, it tends to amplify noise. The solution to this is linear regression, i.e. fitting a first-order polynomial; the least-squares solution is easily shown to be of the following form:

Δc_t[n] = Σ (m = −M to M) m · c_{t+m}[n] / Σ (m = −M to M) m²

where M is the regression window size. We used M = 4.
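A sketch of the regression-based delta computation above in Java; clamping at the first and last frames is one common boundary convention and an assumption of ours. Applying the same method to the delta matrix yields the delta-delta (acceleration) features.

public final class DeltaFeatures {
    // Delta features by linear regression over a window of +/- m frames:
    //   delta_c_t[j] = sum_{k=1..m} k * (c_{t+k}[j] - c_{t-k}[j]) / (2 * sum_{k=1..m} k^2)
    // `features` is indexed [frame][coefficient]; the report uses m = 4.
    public static double[][] delta(double[][] features, int m) {
        int frames = features.length, dim = features[0].length;
        double norm = 0.0;
        for (int k = 1; k <= m; k++) norm += k * k;
        norm *= 2.0;
        double[][] d = new double[frames][dim];
        for (int t = 0; t < frames; t++) {
            for (int j = 0; j < dim; j++) {
                double sum = 0.0;
                for (int k = 1; k <= m; k++) {
                    int hi = Math.min(frames - 1, t + k);        // clamp at the edges
                    int lo = Math.max(0, t - k);
                    sum += k * (features[hi][j] - features[lo][j]);
                }
                d[t][j] = sum / norm;
            }
        }
        return d;
    }
}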

3.1.9.4 Composition of Feature Vector We calculated 39 features from each frame:

- 12 MFCC Features
- 12 Delta MFCC
- 12 Delta-Delta MFCC
- 1 Energy Feature
- 1 Delta Energy Feature
- 1 Delta-Delta Energy Feature

3.2 GMM Implementation It is important to note that because the component Gaussians act together to model the overall pdf, full covariance matrices are not necessary even if the features are not statistically independent: a linear combination of diagonal-covariance basis Gaussians is capable of modeling the correlations between feature vector elements. In addition, the use of diagonal covariance matrices greatly reduces the computational complexity. Hence in our project, the mth covariance matrix is

Cm = diag(am1, am2, …, amK)

where amj, j = 1, 2, …, K are the diagonal elements or variances, and K is the number of features in each feature vector. The effect of using a set of M full-covariance Gaussians can be compensated for by using a larger set of diagonal-covariance Gaussians (M = 16 in our case); M = 16 is best for speaker modeling, according to research papers. The component pdfs can now be expressed as

b_m(\vec{x}) = \frac{1}{(2\pi)^{K/2}\, |C_m|^{1/2}} \exp\left\{ -\frac{1}{2} (\vec{x} - \vec{\mu}_m)^{T} C_m^{-1} (\vec{x} - \vec{\mu}_m) \right\}

where \mu_{m,j}, j = 1, 2, \ldots, K are the elements of the m-th mean vector \vec{\mu}_m.
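For the diagonal-covariance case used in this project, the component density is conveniently evaluated in the log domain, as in the following minimal Java sketch; the class and method names are hypothetical.

    // Minimal sketch: log of the diagonal-covariance Gaussian component density above.
    public class DiagonalGaussian {
        /** x: feature vector; mean: mu_m; var: diagonal variances a_mj. Returns log b_m(x). */
        public static double logDensity(double[] x, double[] mean, double[] var) {
            int k = x.length;
            double logDet = 0.0, quad = 0.0;
            for (int j = 0; j < k; j++) {
                logDet += Math.log(var[j]);            // log |C_m| for a diagonal matrix
                double d = x[j] - mean[j];
                quad += d * d / var[j];                // (x - mu)^T C_m^{-1} (x - mu)
            }
            return -0.5 * (k * Math.log(2.0 * Math.PI) + logDet + quad);
        }
    }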

3.2.1. Block diagram of GMM based Speaker Recognition System

[Figure: during enrollment, the speech is passed through feature extraction and model training, and the resulting speaker model is stored in the model database; during verification, the features of the test speech are matched against the claimed speaker's model and a decision (accepted / rejected) is made.]

Figure 3.11: Block diagram of GMM based Speaker Recognition System


3.2.2. GMM Training
Given training speech from a speaker, the goal of speaker model training is to estimate the parameters of the GMM that best match the distribution of the training feature vectors, and hence to develop a robust model for the speaker. Out of several techniques available for estimating the parameters of a GMM, the most popular is Maximum Likelihood (ML) estimation using the Expectation-Maximization (EM) algorithm, a well-established method for fitting a mixture model to a set of training data. EM requires an a priori selection of the model order, i.e., the number of components M to be incorporated into the model, and an initial estimate of the parameters before iterating through the training. The aim of ML estimation is to maximize the likelihood of the GMM given the training data. Under the assumption of independent feature vectors, the likelihood of the GMM for the sequence of T training vectors X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T\} can be written as

P(X \mid \lambda) = \prod_{t=1}^{T} p(\vec{x}_t \mid \lambda)

In practice, the above computation is done in the log domain to avoid underflow: instead of multiplying many very small probabilities, we add their logarithms. Thus, the log-likelihood of a model \lambda for a sequence of feature vectors X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T\} is computed as

\log P(X \mid \lambda) = \frac{1}{T} \sum_{t=1}^{T} \log p(\vec{x}_t \mid \lambda)

Note that in the above equation the average log-likelihood is used so as to normalize out duration effects. Also, since the incorrect independence assumption underestimates the actual likelihood in the presence of dependencies, scaling by T can be considered a rough compensation factor. Direct maximization of this likelihood function is not possible, as it is a non-linear function of the parameters \lambda, so the likelihood is maximized using the Expectation-Maximization algorithm. The basic idea of the EM algorithm is to begin with an initial model \lambda and estimate a new model \bar{\lambda} such that P(X \mid \bar{\lambda}) \ge P(X \mid \lambda). The new model \bar{\lambda} then becomes the initial model for the next iteration, and the process is repeated until some convergence threshold is reached, i.e., until P(X \mid \bar{\lambda}) - P(X \mid \lambda) < \varepsilon.

3.2.2.1 The Expectation Maximization Algorithm
On each iteration, the following re-estimation formulas are used to compute the parameters of the new model \bar{\lambda}; they guarantee a monotonic increase in the likelihood of the model.

Mixture weights:

\bar{w}_m = \frac{1}{T} \sum_{t=1}^{T} \gamma(m \mid \vec{x}_t, \lambda)

Means:

\bar{\vec{\mu}}_m = \frac{\sum_{t=1}^{T} \gamma(m \mid \vec{x}_t, \lambda)\, \vec{x}_t}{\sum_{t=1}^{T} \gamma(m \mid \vec{x}_t, \lambda)}

Variances:

\bar{a}_{mj} = \frac{\sum_{t=1}^{T} \gamma(m \mid \vec{x}_t, \lambda)\, x_{t,j}^{2}}{\sum_{t=1}^{T} \gamma(m \mid \vec{x}_t, \lambda)} - \bar{\mu}_{m,j}^{2}

where \gamma(m \mid \vec{x}_t, \lambda) is the probability that the observation \vec{x}_t was drawn from the m-th component, given by

\gamma(m \mid \vec{x}_t, \lambda) = \frac{w_m\, b_m(\vec{x}_t)}{\sum_{k=1}^{M} w_k\, b_k(\vec{x}_t)}
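As an illustration, the following minimal Java sketch performs one EM iteration according to the formulas above. It reuses the hypothetical DiagonalGaussian.logDensity helper sketched in section 3.2, applies a log-sum-exp step for the log-domain computation mentioned earlier, and floors the variances in the spirit of the covariance limiting described in the next subsection; it is a simplified sketch, not the project's actual implementation.

    // Minimal sketch of one EM iteration for a diagonal-covariance GMM.
    public class GmmEmStep {
        public static void iterate(double[][] x, double[] weights, double[][] means,
                                   double[][] vars, double minVar) {
            int t = x.length, m = weights.length, k = x[0].length;
            double[][] gamma = new double[t][m];

            // E-step: responsibilities gamma(m | x_t, lambda), computed stably in the log domain
            for (int i = 0; i < t; i++) {
                double[] logP = new double[m];
                double max = Double.NEGATIVE_INFINITY;
                for (int c = 0; c < m; c++) {
                    logP[c] = Math.log(weights[c]) + DiagonalGaussian.logDensity(x[i], means[c], vars[c]);
                    max = Math.max(max, logP[c]);
                }
                double sum = 0.0;
                for (int c = 0; c < m; c++) sum += Math.exp(logP[c] - max);
                for (int c = 0; c < m; c++) gamma[i][c] = Math.exp(logP[c] - max) / sum;
            }

            // M-step: new weights, means and variances (variance = E[x^2] - mean^2, floored)
            for (int c = 0; c < m; c++) {
                double occ = 0.0;
                double[] sumX = new double[k], sumX2 = new double[k];
                for (int i = 0; i < t; i++) {
                    occ += gamma[i][c];
                    for (int j = 0; j < k; j++) {
                        sumX[j] += gamma[i][c] * x[i][j];
                        sumX2[j] += gamma[i][c] * x[i][j] * x[i][j];
                    }
                }
                weights[c] = occ / t;
                for (int j = 0; j < k; j++) {
                    means[c][j] = sumX[j] / occ;
                    vars[c][j] = Math.max(sumX2[j] / occ - means[c][j] * means[c][j], minVar);
                }
            }
        }
    }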

3.2.2.2 Estimation of initial parameters for training
Commonly, the GMM parameters are initialized as follows:
Mixture weights: 1/M, where M is the number of mixture components
Means: random feature vectors taken from the training data
Covariances: generally initialized to 1. However, it is important to initialize the covariance matrices with rather large variances to reduce the risk of the EM training getting stuck in a local maximum, so larger values are required. K-means clustering can be used to obtain a good initial estimate of the covariance matrices; we instead used a method from [Ref 1] because it is easy to compute. To set reasonable values for the covariance matrices, we need an estimate of the covariance of the whole training set, Cdata. We start by estimating the mean of the training data:

\vec{\mu}_{data} = \frac{1}{T} \sum_{t=1}^{T} \vec{x}_t

The j-th diagonal element of Cdata can be estimated as

C_{data,jj} = \frac{1}{T} \sum_{t=1}^{T} \left( x_{t,j} - \mu_{data,j} \right)^{2}

A measure of the "volume" that the training data occupies can be given by

v = \prod_{j=1}^{K} \sqrt{C_{data,jj}}

Finally, the initial covariance values can be calculated as

C_{m,jj} = \left( v / M \right)^{2/K}

and the minimum covariance (threshold) value, used to avoid NaN (Not a Number) errors during the EM iterations, is

C_{min,jj} = \left( v / M^{2} \right)^{2/K}

Covariance limiting was done as calculated above for each mixture. For simplicity, we initialized the covariance values to be the same for all Gaussian components.

For training the GMM parameters we used the following constants:
Number of iterations: MINIMUM_ITERATION = 100, MAXIMUM_ITERATION = 500
Minimum log-likelihood change for convergence: LOGLIKELIHOOD_CHANGE = 0.000001


3.2.3. Verification
After the training stage we have a complete model (GMM) for each speaker. The speaker verification task is a hypothesis-testing problem: based on the input speech observations, it must be decided whether the claimed identity of the speaker is correct or not. The hypothesis test can be set up as:
H0: the speaker is the claimed speaker
H1: the speaker is an imposter
The decision between these two hypotheses is based on the likelihood ratio

\frac{P(X \mid \lambda_{claimed})}{P(X \mid \lambda_{imp})}

where P(X \mid \lambda_{claimed}) is the likelihood that the utterance was produced by the claimed speaker and P(X \mid \lambda_{imp}) is the likelihood that the utterance was produced by an imposter. Here the imposter model \lambda_{imp}, also called the Universal Background Model (UBM), is obtained by training on a collection of speech samples from a large number of speakers, representative of the population of speakers. The likelihood ratio is often expressed in the logarithmic domain as

\Delta(X) = \log\left( \frac{P(X \mid \lambda_{claimed})}{P(X \mid \lambda_{imp})} \right) = \log P(X \mid \lambda_{claimed}) - \log P(X \mid \lambda_{imp})

The decision is made as follows: if \Delta(X) < T, reject the null hypothesis, i.e. the speaker is an imposter; if \Delta(X) > T, accept the null hypothesis, i.e. the speaker is the claimed one. The threshold value T is set in such a way that the error of the system is minimized, so that true claimants are accepted and false claimants are rejected as far as possible.
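A minimal Java sketch of this decision is shown below. The GmmModel interface is only a stand-in for the trained claimed-speaker and UBM models, and avgLogLikelihood is assumed to return the average per-frame log-likelihood as defined in section 3.2.2; this is an illustrative sketch, not the project's actual code.

    // Minimal sketch of the log-likelihood-ratio decision described above.
    public class SpeakerVerifier {
        /** Stand-in for a trained GMM: average per-frame log-likelihood of the given frames. */
        public interface GmmModel {
            double avgLogLikelihood(double[][] frames);
        }

        public static boolean verify(double[][] frames, GmmModel claimedSpeaker, GmmModel ubm,
                                     double threshold) {
            double llr = claimedSpeaker.avgLogLikelihood(frames) - ubm.avgLogLikelihood(frames);
            return llr > threshold;   // accept the claim only if Delta(X) exceeds the threshold T
        }
    }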

3.2.4. Performance measure of Speaker Verification System
In general, the performance of a speaker verification system is characterized by the False Rejection Rate (FRR) and the False Acceptance Rate (FAR).
1) False Rejection Rate (FRR): FRR is a measure of the likelihood that the system will incorrectly reject an access attempt by an authorized user. A system's FRR is typically stated as the ratio of the number of false rejections to the number of verification tests.
2) False Acceptance Rate (FAR): FAR is a measure of the likelihood that the system will incorrectly accept an access attempt by an unauthorized user. A system's FAR is usually stated as the ratio of the number of false acceptances to the number of verification tests.
The Total Error Rate (TER) is the combination of the false rejection and false acceptance rates, and the requirement of the system is to minimize it. Both errors depend on the threshold value used during verification: at lower threshold values FAR is predominant, while at higher threshold values FRR is predominant. This dependency can be seen in the figure below. At a certain threshold value the two errors are equal and TER is minimum.

Figure 3.12: Equal Error Rate (EER)


3.3 Implementation of HMM for Speech Recognition
The basic block diagram for isolated word recognition is given below:

[Figure: the speech signal is preprocessed and MFCC features are extracted; K-means clustering of the training features produces the codebook; Vector Quantization (VQ) maps the feature vectors to a discrete observation sequence; the Baum-Welch algorithm trains the HMM model \lambda = (A, B, \pi); the Viterbi algorithm scores the observation sequence against the models to give the recognition result.]

Figure 3.13: Speech Recognition algorithm flow

In order to do isolated word speech recognition, we must perform the following:
1) The codebook is generated using the feature vectors of the training data, and vector quantization uses the codebook to map each feature vector to a discrete observation symbol.
2) For each word v in the vocabulary, an HMM \lambda^v is built, i.e., we must estimate the model parameters (A, B, \pi) that optimize the likelihood of the training-set observation sequences for the v-th word. In order to make reliable estimates of all model parameters, multiple observation sequences must be used. The Baum-Welch algorithm is used for estimation of the HMM parameters.
3) For each unknown word to be recognized, we measure the observation sequence O = {O1, O2, ..., OT} via feature analysis of the speech corresponding to the word, compute the model likelihoods P(O | \lambda^v) for all models, 1 \le v \le V, and select the word whose model likelihood is highest:

v^{*} = \arg\max_{1 \le v \le V} P(O \mid \lambda^{v})

The probability computation step is performed using the Viterbi algorithm and requires on the order of V \cdot N^2 \cdot T computations.
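The selection step can be sketched in Java as follows; WordModel is only a stand-in for a trained word HMM whose logLikelihood method is assumed to return a Viterbi or forward score, and the class and method names are illustrative.

    import java.util.Map;

    // Minimal sketch of step 3: score the observation sequence against every word model
    // and pick the word with the highest likelihood.
    public class IsolatedWordRecognizer {
        public interface WordModel {
            double logLikelihood(int[] observationSequence);   // e.g., Viterbi or forward score
        }

        public static String recognize(int[] observations, Map<String, WordModel> models) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Map.Entry<String, WordModel> e : models.entrySet()) {
                double score = e.getValue().logLikelihood(observations);
                if (score > bestScore) {
                    bestScore = score;
                    best = e.getKey();
                }
            }
            return best;
        }
    }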


Figure 3.14: Pronunciation model of word TOMATO

The above figure shows the pronunciation model of the word "tomato". The circles represent the states and the numbers above the arrows represent transition probabilities. The pronunciation of the same word may differ from person to person; the figure reflects two pronunciation styles for the same word "tomato". So, in order to best model each word, we need to train each word on as large a set of speakers as possible, so that the model covers all the variations in pronunciation of that word.

Vector Quantization:
HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary, or short-time stationary, signal: over a short time span, speech can be approximated as a stationary process. Each acoustic feature vector represents information such as the amount of energy in different frequency bands at a particular point in time. The observation sequence for speech recognition is a sequence of acoustic feature vectors (MFCC vectors), and the phonemes are the hidden states. One way to make MFCC vectors look like symbols that we can count is to build a mapping function that maps each input vector onto one of a small number of symbols. This idea of mapping input vectors to discrete quantized symbols is called vector quantization, or VQ. The type of HMM that models speech signals using the VQ technique to produce the observations is called a Discrete Hidden Markov Model (DHMM). However, VQ loses some information from the speech signal even when we increase the number of codewords. This loss is due to the quantization error (distortion); it can be reduced by increasing the number of codewords in the codebook, but it cannot be eliminated. The long sequence of speech samples is represented by a stream of indices representing the frames of the utterance. Hence, VQ can be regarded as a process of redundancy removal that minimizes the number of bits required to identify each frame of the speech signal.

In vector quantization, we create the small symbol set by mapping each training feature vector into a small number of classes, and then we represent each class by a discrete symbol. More formally, a vector quantization system is characterized by a codebook, a clustering algorithm, and a distance metric. A codebook is a list of possible classes, a set of symbols constituting the features F = {f1, f2, ..., fn}. All feature vectors from the training speech data are clustered into 256 classes, thereby generating a codebook with 256 centroids, with the help of the K-means clustering technique. Vector quantization is then used to obtain the discrete observation sequence from the input feature vectors by applying the distance metric to the codebook.

Figure 3.15: Vector Quantization

As shown in the above figure, to make the feature vectors discrete, each incoming feature vector is compared with each of the 256 prototype vectors in the codebook. The closest one (in Euclidean distance) is selected, and the input vector is replaced by the index of the corresponding centroid in the codebook. In this way all continuous input feature vectors are quantized to a discrete set of symbols.
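The quantization step can be sketched in Java as below; the class and method names are hypothetical, and squared Euclidean distance is used since it gives the same nearest centroid as the Euclidean distance itself.

    // Minimal sketch: map each feature vector to the index of the nearest codebook centroid.
    public class VectorQuantizer {
        public static int[] quantize(double[][] features, double[][] codebook) {
            int[] symbols = new int[features.length];
            for (int t = 0; t < features.length; t++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < codebook.length; c++) {
                    double dist = 0.0;
                    for (int j = 0; j < features[t].length; j++) {
                        double d = features[t][j] - codebook[c][j];
                        dist += d * d;          // squared Euclidean distance suffices for the argmin
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                symbols[t] = best;              // index of the closest centroid (256 in this report)
            }
            return symbols;
        }
    }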


3.3.1. Isolated Word Recognition
For isolated word recognition with a distinct HMM designed for each word in the vocabulary, a left-right model is more appropriate than an ergodic model, since we can then associate time with the model states in a fairly straightforward manner. Furthermore, we can envision the physical meaning of the model states as distinct sounds (e.g., phonemes, syllables) of the word being modeled. The question of how many states to use in each word model leads to two schools of thought. One idea is to let the number of states correspond roughly to the number of sounds (phonemes) within the word; hence models with from 2 to 10 states would be appropriate. The other idea is to let the number of states correspond roughly to the average number of observations in a spoken version of the word; in this manner each state corresponds to an observation interval, i.e., about 15 ms for the analysis we use. The former approach is used in our speech recognition system. Furthermore, we restrict each word model to have the same number of states; this implies that the models will work best when they represent words with the same number of sounds.

Figure 3.16: Average word error rate (for a digits vocabulary) versus the number of states N in the HMM

The above figure shows a plot of average word error rate versus N for the recognition of isolated digits (i.e., a 10-word vocabulary). It can be seen that the error is somewhat insensitive to N, achieving a local minimum at N = 6; however, differences in error rate for values of N close to 6 are small. The next issue is the choice of observation vector and the way it is represented. Since we are representing an entire region of the vector space by a single vector, a distortion penalty is associated with VQ, and it is advantageous to keep this penalty as small as possible. However, this implies a large codebook, which leads to problems in implementing HMMs with a large number of parameters. Although the distortion steadily decreases as M increases, only small decreases in distortion accrue beyond a value of M = 32. Hence, HMMs with codebook sizes from M = 32 to 256 vectors have been used in speech recognition experiments. For the discrete symbol models we used a codebook of M = 256 codewords to generate the discrete symbols.

Figure 3.17: Curve showing the tradeoff of VQ average distortion as a function of the size of the VQ codebook, M (shown on a log scale)

Another main issue is the initialization of the HMM parameters. The parameters that constitute any model \lambda are \pi, A, and B. The values of \pi are given by \pi = [1 0 0 0 0 ... 0], because the left-right HMM used in our speech recognition system always starts in the first state and ends in the last state. Random values between 0 and 1 are assigned as initial values to the elements of the A and B parameters.
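The following minimal Java sketch illustrates this initialization; the class name is hypothetical, a simple left-right topology with self-loops and single-step transitions is assumed, and the random rows are normalized to sum to one (an added assumption needed for valid probability distributions, not stated explicitly in this report).

    import java.util.Random;

    // Minimal sketch: initialize pi, A (left-right) and B for a discrete HMM.
    public class HmmInitializer {
        public static double[] initPi(int numStates) {
            double[] pi = new double[numStates];
            pi[0] = 1.0;                               // left-right model always starts in state 1
            return pi;
        }

        public static double[][] initA(int numStates, Random rnd) {
            double[][] a = new double[numStates][numStates];
            for (int i = 0; i < numStates; i++) {
                double sum = 0.0;
                for (int j = i; j < numStates && j <= i + 1; j++) {  // self-loop and next state only
                    a[i][j] = rnd.nextDouble();
                    sum += a[i][j];
                }
                for (int j = 0; j < numStates; j++) a[i][j] /= (sum > 0 ? sum : 1);
            }
            return a;
        }

        public static double[][] initB(int numStates, int numSymbols, Random rnd) {
            double[][] b = new double[numStates][numSymbols];
            for (int i = 0; i < numStates; i++) {
                double sum = 0.0;
                for (int k = 0; k < numSymbols; k++) { b[i][k] = rnd.nextDouble(); sum += b[i][k]; }
                for (int k = 0; k < numSymbols; k++) b[i][k] /= sum;
            }
            return b;
        }
    }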

3.3.2. Application of HMM
Given the form of the HMM, there are three basic problems of interest that must be solved for the model to be useful in real-world applications. These problems are the following:

3.3.2.1 Evaluation Problem: Calculating P(O|\lambda)
Given the observation sequence O = O1 O2 ... OT and the model \lambda = (A, B, \pi), how do we efficiently compute P(O \mid \lambda), the probability of the observation sequence given the model?
Solution: The aim of this problem is to find the probability of the observation sequence O = (O1, O2, ..., OT) given the model \lambda, i.e. P(O \mid \lambda). Because the observations produced by the states are assumed to be independent of each other and of the time t, the probability of the observation sequence O being generated by a certain state sequence q can be calculated as a product:

P(O \mid q, \lambda) = b_{q_1}(O_1)\, b_{q_2}(O_2) \cdots b_{q_T}(O_T)

And the probability of the state sequence q can be found as:

P(q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T}

The aim was to find P(O \mid \lambda), and this probability of O given the model \lambda is obtained by summing the joint probability over all possible state sequences q, giving:

P(O \mid \lambda) = \sum_{q} P(O \mid q, \lambda)\, P(q \mid \lambda)

This direct computation has one major drawback: it is infeasible due to the exponential growth of computations as a function of the sequence length T. To be precise, it needs (2T-1) N^{T} multiplications and N^{T} - 1 additions. An excellent tool which cuts the computational requirements to linear in T is the well-known forward algorithm, which needs N(N+1)(T-1) + N multiplications and N(N-1)(T-1) additions.

Forward Algorithm
Consider a forward probability variable \alpha_t(i), defined at instant t and state i as:

\alpha_t(i) = P(O_1, O_2, O_3, \ldots, O_t,\; q_t = S_i \mid \lambda)

This probability function can be computed for N states and T observations iteratively:
Step 1: Initialization

\alpha_1(i) = \pi_i\, b_i(O_1), \quad 1 \le i \le N

Figure 3.18: Forward Procedure - Induction Step

Step 2: Induction

\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(O_{t+1}), \quad 1 \le t \le T-1,\; 1 \le j \le N

Step 3: Termination

P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)

This stage is just a sum of the values of the probability function \alpha_T(i) over all states at instant T. This sum represents the probability that the given observations were generated by the given model, i.e., how likely the given model is to produce the given observations.
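A minimal Java sketch of the forward procedure is shown below, without the scaling discussed in section 3.3.3; the class and parameter names are illustrative only.

    // Minimal sketch of the forward procedure for a discrete-observation HMM.
    public class ForwardAlgorithm {
        /** Returns P(O | lambda) for the observation symbol sequence obs. */
        public static double evaluate(int[] obs, double[] pi, double[][] a, double[][] b) {
            int n = pi.length, t = obs.length;
            double[] alpha = new double[n];

            // Step 1: initialization, alpha_1(i) = pi_i * b_i(O_1)
            for (int i = 0; i < n; i++) alpha[i] = pi[i] * b[i][obs[0]];

            // Step 2: induction, alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
            for (int time = 1; time < t; time++) {
                double[] next = new double[n];
                for (int j = 0; j < n; j++) {
                    double sum = 0.0;
                    for (int i = 0; i < n; i++) sum += alpha[i] * a[i][j];
                    next[j] = sum * b[j][obs[time]];
                }
                alpha = next;
            }

            // Step 3: termination, P(O | lambda) = sum_i alpha_T(i)
            double p = 0.0;
            for (int i = 0; i < n; i++) p += alpha[i];
            return p;
        }
    }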

Backward Algorithm
This procedure is similar to the forward procedure, but it considers the state flow in the backward direction, from the last observation at instant T to the first one at instant 1. This means that access to any state is from the states that come just after it in time. To formulate this approach, consider the backward probability function \beta_t(i), defined as:

\beta_t(i) = P(O_{t+1}, O_{t+2}, \ldots, O_T \mid q_t = S_i, \lambda)

Figure 3.19: Backward Procedure - Induction Step

In analogy to the forward procedure, we can solve for \beta_t(i) in the following two steps:
1 - Initialization:

\beta_T(i) = 1, \quad 1 \le i \le N

These initial values for the \beta's of all states at instant T are arbitrarily selected.
2 - Induction:

\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j), \quad t = T-1, T-2, \ldots, 1,\; 1 \le i \le N

3.3.2.2 Decoding Problem: Finding the Best Path
Given the observation sequence O = O1 O2 ... OT and the model \lambda = (A, B, \pi), find the optimal state sequence q = q1 q2 ... qT.
Solution: The problem is to find the optimal sequence of states, given the observation sequence and the model. This means that we have to find the optimal state sequence Q = (q1, q2, q3, ..., q_{T-1}, q_T) associated with the given observation sequence O = (O1, O2, O3, ..., O_{T-1}, O_T) presented to the model \lambda = (A, B, \pi). The criterion of optimality here is to search for the single best state sequence using a dynamic programming technique called the Viterbi algorithm. To explain the Viterbi algorithm, the probability quantity \delta_t(i) is defined, which represents the maximum probability along the best state-sequence path that accounts for the first t observations and ends in state i. This quantity can be defined mathematically by:

\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P\left(q_1, q_2, \ldots, q_{t-1},\; q_t = i,\; O_1, O_2, \ldots, O_t \mid \lambda \right)

The best state sequence is backtracked by another function \psi_t(j). The complete algorithm can be described by the following steps:
Step 1: Initialization:

\delta_1(i) = \pi_i\, b_i(O_1), \quad 1 \le i \le N
\psi_1(i) = 0

Step 2: Recursion:

\delta_t(j) = \max_{1 \le i \le N}\left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(O_t), \quad 2 \le t \le T,\; 1 \le j \le N
\psi_t(j) = \arg\max_{1 \le i \le N}\left[ \delta_{t-1}(i)\, a_{ij} \right], \quad 2 \le t \le T,\; 1 \le j \le N

Step 3: Termination:

P^{*} = \max_{1 \le i \le N}\left[ \delta_T(i) \right]
q_T^{*} = \arg\max_{1 \le i \le N}\left[ \delta_T(i) \right]

Step 4: Path (state sequence) backtracking:

q_t^{*} = \psi_{t+1}\left(q_{t+1}^{*}\right), \quad t = T-1, T-2, \ldots, 1

The Viterbi algorithm can also be used to calculate P(O \mid \lambda) approximately by using P^{*} instead.
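The recursion above can be sketched in Java as follows. The computation is done in the log domain so that no scaling is needed (see section 3.3.3); the class and method names are illustrative, and only the best state sequence is returned, although the best log score P* is available as well.

    // Minimal sketch of the Viterbi recursion in the log domain.
    public class ViterbiDecoder {
        public static int[] decode(int[] obs, double[] pi, double[][] a, double[][] b) {
            int n = pi.length, t = obs.length;
            double[][] delta = new double[t][n];
            int[][] psi = new int[t][n];

            // Initialization: delta_1(i) = log pi_i + log b_i(O_1)
            for (int i = 0; i < n; i++) delta[0][i] = Math.log(pi[i]) + Math.log(b[i][obs[0]]);

            // Recursion: keep the best predecessor for each state at each time
            for (int time = 1; time < t; time++) {
                for (int j = 0; j < n; j++) {
                    double best = Double.NEGATIVE_INFINITY;
                    int arg = 0;
                    for (int i = 0; i < n; i++) {
                        double score = delta[time - 1][i] + Math.log(a[i][j]);
                        if (score > best) { best = score; arg = i; }
                    }
                    delta[time][j] = best + Math.log(b[j][obs[time]]);
                    psi[time][j] = arg;
                }
            }

            // Termination and backtracking
            int[] path = new int[t];
            double best = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < n; i++)
                if (delta[t - 1][i] > best) { best = delta[t - 1][i]; path[t - 1] = i; }
            for (int time = t - 2; time >= 0; time--)
                path[time] = psi[time + 1][path[time + 1]];
            return path;
        }
    }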

Figure 3.20: Viterbi Search

3.3.2.3 Training Problem: Estimating the Model Parameters
Given the observation sequence O = O1 O2 ... OT, estimate the parameters of the model \lambda = (A, B, \pi) that maximize P(O \mid \lambda).
Solution: This problem deals with the training issue, which is the most difficult of the three. The task is to adjust the model parameters (A, B, \pi) according to a certain optimality criterion. The Baum-Welch algorithm (forward-backward algorithm) is one of the well-known techniques to solve this problem. It is an iterative method to estimate new values for the model parameters. To explain the training procedure, first an a posteriori probability function \gamma_t(i) is defined as the probability of being in state i at instant t, given the observation sequence O and the model \lambda:

\gamma_t(i) = P(q_t = S_i \mid O, \lambda) = \frac{P(q_t = S_i, O \mid \lambda)}{P(O \mid \lambda)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}

Then another probability function \xi_t(i, j) is defined as the probability of being in state i at instant t and going to state j at instant t+1, given the model \lambda and the observation sequence O. \xi_t(i, j) can be mathematically defined as:

\xi_t(i, j) = P(q_t = S_i,\; q_{t+1} = S_j \mid O, \lambda)

Figure 3.21: Computation of ξt (i, j)

From the definition of the forward and backward variables, we can write \xi_t(i, j) in the form

\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)} = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}

The relation between \gamma_t(i) and \xi_t(i, j) can easily be deduced from their definitions:

\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)

Now, if \gamma_t(i) is summed over all instants (excluding instant T), we get the expected number of transitions made from state S_i, or equivalently the expected number of times this state has been visited.

On the other hand, if we sum \xi_t(i, j) over all instants (excluding T), we get the expected number of transitions made from state i to state j. From \gamma_t(i) and \xi_t(i, j), the following re-estimations of the model parameters can be deduced:

Initial state distribution:

\hat{\pi}_i = \text{expected number of times in state } S_i \text{ at time } t = 1 = \gamma_1(i)

Transition probabilities:

\hat{a}_{ij} = \frac{\text{expected number of transitions from state } S_i \text{ to state } S_j}{\text{expected number of transitions from state } S_i} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}

Emission probabilities:

\hat{b}_j(k) = \frac{\text{expected number of times in state } S_j \text{ observing symbol } v_k}{\text{expected number of times in state } S_j} = \frac{\sum_{t=1,\, O_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}

3.3.3. Scaling
\alpha_t(i) consists of the sum of a large number of terms. Since the transition matrix elements a_{ij} and emission matrix elements b_j(O_t) are less than 1, as t grows each term of \alpha_t(i) heads exponentially to zero, and for large t the dynamic range of the \alpha_t(i) computation will exceed the precision range of the computer (even in double precision). The remedy is to multiply \alpha_t(i) and \beta_t(i) by a scaling factor that is independent of i (i.e., it depends only on t), with the goal of keeping the scaled \alpha_t(i) within the dynamic range of the computer for 1 \le t \le T. At the end of the computation, the scaling coefficients cancel out exactly. When the Viterbi algorithm is run with logarithms to obtain the maximum-likelihood state sequence, no scaling is required.

4. UML CLASS DIAGRAMS OF THE SYSTEMS

[Figure: client-side class diagram. The Client class uses SoundRecorder (bit rate, sampling rate, audio data; capture and play), PreProcessing (PCM normalization, silence removal, pre-emphasis, framing, windowing) and FeatureExtraction (FFT, DCT, MFCC, delta, energy, cepstral mean subtraction), all built from processing classes that implement a common Algorithm interface (execute) and operate on WaveData; the GUI consists of RegistrationPanel, LoginPanel and TrainingPanel.]

Figure 4.1: UML diagram of Client System

[Figure: server-side class diagram; it includes, among others, the FeatureVector class with the feature vector and MFCC fields.]

Figure 4.2: UML diagram of Server System

5. DATA COLLECTION AND TRAINING
For speech modeling, the HMM models were trained using 45 English words (multi-phoneme, more than 3 phonemes), each spoken three times by 15 speakers. The trained words are listed in Appendix B. For speaker modeling, about 2 minutes of spoken data was collected from each of the 15 speakers and the speaker models were trained accordingly. The speech model for each word is pre-trained, while the speaker model is trained for a new speaker after registration, or when a user wants to re-train.


6. RESULTS
The performance of the speech recognition system improves when the words are trained by many speakers, with variations in tone, pronunciation, speed, etc. Similarly, for speaker recognition the training speech should be long enough (at least 1 minute) to model the distinctive characteristics of the speaker across the pronunciation of different phonemes. Reynolds and Rose [3] report GMM identification performance for different amounts of training data and model orders, showing that performance improves as both increase.

We could not collect the full amount of data required to train the systems, so the system has been trained with the limited available data. Still, the performance of the system with the available training and testing data is very good. The isolated word recognition rate is 92%. The speaker recognition rate is 78% with long test data (greater than 5 seconds). In the combined system, the same prompted word is used as the test utterance for both the HMM and the GMM. This word is easily recognized by the HMM, but the test utterance is too short (1-2 seconds) for reliable recognition by the GMM, so the overall performance of the joint system degrades to 66%. Using longer words would improve the performance of the joint system.


7. APPLICATION AREA
Voice-based remote authentication has many application domains. Using this system for multifactor authentication provides an extra layer of security.
• Web page login, as an alternative to a password
• In security-critical web applications, for multifactor authentication or re-verification purposes, such as:
  o password change
  o transactions in e-banking
  o toll fraud prevention
• The standalone version of the proposed system (with some modifications) can be used for:
  o access control
  o forensics

8. CONCLUSION
The proposed system is an academic project. Various signal processing algorithms for MFCC feature extraction were studied, and various machine learning algorithms (GMM, HMM, VQ, K-means clustering) for speech and speaker modeling and classification were studied and implemented successfully. The designed system is trained with limited data. The performance of the system can be improved by utilizing various noise reduction and removal algorithms and by training with a larger dataset. The performance of the individual systems is very good, and the overall performance of the joint system is good for long utterances.


REFERENCES
1. Assignment 3: GMM Based Speaker Identification, EN2300 Speech Signal Processing. [www.kth.se/polopoly_fs/1.41342!assignment_03.pdf]
2. Conrad Sanderson, Automatic Person Verification Using Speech and Face Information, PhD dissertation, School of Microelectronic Engineering, Faculty of Engineering and Information Technology, Griffith University, August 2002 (revised February 2003).
3. Douglas A. Reynolds and Richard C. Rose, Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models, IEEE Transactions on Speech and Audio Processing, 3(1):72-83, 1995.
4. G. Saha, Sandipan Chakroborty and Suman Senapati, A New Silence Removal and Endpoint Detection Algorithm for Speech and Speaker Recognition Applications, Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India.
5. J. P. Campbell, Jr., Speaker Recognition: A Tutorial, Proceedings of the IEEE, 85(9):1437-1462, 1997.
6. K. R. Aida-Zade, C. Ardil and S. S. Rustamov, Investigation of Combined Use of MFCC and LPC Features in Speech Recognition Systems, World Academy of Science, Engineering and Technology, 2006.
7. L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, 77(2):257-286, 1989.
8. Lasse L. Mølgaard and Kasper W. Jørgensen, Speaker Recognition: Special Course, IMM-DTU, 2005.
9. Mohamed Faouzi BenZeghiba, Joint Speech and Speaker Recognition, IDIAP Research Report, 2005.
10. Robin Teo Choon Guan @ Myo Thant, Majority Rule-Based Non-Intrusive User Authentication by Speech: Part 2 (Speaker Verification), Thesis, School of Science and Technology, SIM University, 2009.
11. Shi-Huang Chen and Yu-Ren Luo, Speaker Verification Using MFCC and Support Vector Machine, Proceedings of the International MultiConference of Engineers and Computer Scientists 2009, Vol. I, IMECS 2009.
12. Tomi Kinnunen, Spectral Features for Automatic Text-Independent Speaker Recognition, Licentiate's Thesis, Department of Computer Science, University of Joensuu, Finland, 2003.
13. Waleed H. Abdulla and Nikola K. Kasabov, The Concepts of Hidden Markov Model in Speech Recognition, Knowledge Engineering Lab, Department of Information Science, University of Otago, New Zealand, 1999.


APPENDIX A: BLAZEDS CONFIGURATION FOR REMOTING SERVICE
The Remoting Service is one of the RPC services included in BlazeDS. We need to configure remoting-config.xml with a destination for the Java endpoint class [myPackage].MainServer, using session scope.

In services-config.xml, we have to change the channel-definition endpoints to use the server's port, host and context.

The Flex client application uses the RemoteObject component to access the Java methods; to do so, the client application uses the destination property to specify the destination, which is defined in remoting-config.xml.