This paper presents a new score normalization method for speaker identification using the Gaussian Mixture Model (GMM). The new GMM normalization method has two main advantages: (1) the thresholds are independent of the dataset and are mapped to the range $$\left [ 0\% \div 100\% \right ]$$, corresponding to the expected accuracy of the system, and (2) it performs better than common methods. The experimental results suggest the viability of the proposed approach in terms of shortening development time and providing regular updates of the model's parameters.

Introduction

Consider a set $$\beta =\left \{ \beta _{i} \right \},i=1,\ldots,N$$, in which $$\beta_i$$ is a statistical model representing a known speaker $$S_i \in S$$, and a group of unknown speakers $$U$$ represented by $$\beta_0$$, where $$U \cap \beta = \varnothing$$. Given an acoustic utterance $$X$$, a statistical speaker identification system decodes $$X$$ as speaker $$\hat S$$ as follows [1].

$$\hat{S}=\arg\max_{\beta}P(X|\beta_i,\beta_0)$$    (1)

where $$P(X|\beta_i)$$ and $$P(X|\beta_0)$$ are the a posteriori probabilities measured on $$\beta_i$$ and $$\beta_0$$ for the given $$X$$.

If $$\hat S=\beta_i$$, the system accepts $$X$$ and identifies it as spoken by $$S_i$$; otherwise $$X$$ is rejected as coming from an unknown speaker in $$U$$. Let AA (Accurate Accepts) denote the identification accuracy for known speakers, and AR (Accurate Rejects) the accuracy for unknown speakers. Research is still ongoing to develop a method that achieves an optimal balance between AA and AR. At present, the common ways to parameterize the statistical models (i.e. $$\beta_i$$ and $$\beta_0$$) include Neural Networks (NN) and Gaussian Mixture Models (GMM).
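To make the AA and AR measures defined above concrete, here is a minimal sketch of how they might be computed from a list of decoding trials. The function name `accept_reject_accuracy`, the trial format, and the use of `None` to stand for an unknown speaker (or a rejection) are assumptions made for illustration; they do not come from the paper.

```python
def accept_reject_accuracy(trials):
    """Return (AA, AR) for a list of (true_speaker, decision) pairs.

    Assumption for this sketch: None as true_speaker means the utterance
    comes from an unknown speaker in U, and None as decision means the
    system rejected the utterance.
    """
    known = [(truth, decision) for truth, decision in trials if truth is not None]
    unknown = [(truth, decision) for truth, decision in trials if truth is None]
    if not known or not unknown:
        raise ValueError("need at least one known and one unknown trial")

    # AA: share of known-speaker trials identified as the correct speaker.
    aa = sum(decision == truth for truth, decision in known) / len(known)
    # AR: share of unknown-speaker trials that were rejected.
    ar = sum(decision is None for _, decision in unknown) / len(unknown)
    return aa, ar


# Hypothetical trials: four from known speakers, two from unknown speakers.
example_trials = [
    ("s1", "s1"),   # known, correctly identified
    ("s2", "s1"),   # known, accepted but identified as the wrong speaker
    ("s1", None),   # known, wrongly rejected
    ("s2", "s2"),   # known, correctly identified
    (None, None),   # unknown, correctly rejected
    (None, "s1"),   # unknown, wrongly accepted
]
aa, ar = accept_reject_accuracy(example_trials)
```

Note that in this sketch a known speaker who is accepted but identified as someone else counts against AA, in the same way as a wrongly rejected one.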

Examples of applying NN are (1) a model used to detect any speaker in group $$S$$, and (2) a larger and balanced training dataset. In both cases, it is always possible to obtain a good-quality training database for group $$S$$, as the number of known speakers is finite. Group $$U$$ is more challenging, since it generally consists of all speakers who have never been in $$S$$.

By comparison, the GMM is more suitable for handling the unknown speaker set. When a GMM is used to parameterize the speaker model, for a given input acoustic vector $$X$$ the decoder identifies the speaker $$\hat S$$ represented by model $$\beta_i$$ if the impostor log-likelihood score meets the condition in (2):

$$\Lambda (X)=(\log P(X|\beta_i)-\log P(X|\beta_0)) > \delta$$    (2)

where $$\beta_i$$ is parameterized by the set $$\beta_i=\left \{ w_k,\mu_k,\Sigma_k \right \}$$, $$k=1,\ldots,M$$; $$M$$ is the number of mixtures; $$w_k$$, $$\mu_k$$ and $$\Sigma_k$$ are the mixture weight, mean vector and covariance matrix of the $$k$$-th mixture; $$\delta$$ is a threshold; and $$P(X|\beta_i)$$ is the probability density function given in (3):

$$P(X|\beta_i)=\sum_{k=1}^{M}w_kN(\mu_k,\Sigma _k)$$    (3)

where $$N(\mu_k,\Sigma _k)=\frac{1}{(2\pi)^{D/2}\left | \Sigma _k \right |^{1/2}}\exp \left [ -\frac{1}{2}(X-\mu_k)^T\Sigma_k^{-1}(X-\mu_k) \right ]$$ and $$D$$ is the dimension of $$X$$.
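The decision rule in (2) and the mixture density in (3) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it treats $$X$$ as a single acoustic vector, restricts each $$\Sigma_k$$ to a diagonal matrix (stored as a list of variances) to stay within the standard library, and uses toy model parameters. All function and variable names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log of N(mu_k, Sigma_k) at x, assuming a diagonal covariance matrix."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)  # log |Sigma_k| for a diagonal matrix
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + maha)


def log_gmm(x, weights, means, variances):
    """log P(x | beta) from equation (3), computed with log-sum-exp for stability."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def llr_score(x, speaker, background):
    """Lambda(x) = log P(x | beta_i) - log P(x | beta_0), the left side of (2)."""
    return log_gmm(x, *speaker) - log_gmm(x, *background)


def accept(x, speaker, background, delta):
    """Accept x as spoken by the modelled speaker if Lambda(x) > delta."""
    return llr_score(x, speaker, background) > delta


# Toy models (assumed values): each is a (weights, means, variances) tuple.
speaker_model = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
background_model = ([1.0], [[5.0, 5.0]], [[1.0, 1.0]])

near_speaker = accept([0.0, 0.0], speaker_model, background_model, delta=0.0)
near_background = accept([5.0, 5.0], speaker_model, background_model, delta=0.0)
```

With these toy parameters, a vector close to the speaker model's means is accepted and a vector at the background model's mean is rejected. The value of `delta` is still chosen by hand here, which is the difficulty score normalization is meant to address.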

The term $$\delta$$ in equation (2) is an experimental, manually tuned parameter. Its value range differs from system to system, since the parameters of model $$\beta_i$$ depend on the training data and on the variability of each speaker's voice. This is confusing and difficult for developers. The question is what value the threshold $$\delta$$ should take to keep decoding accuracy stable for any speaker's voice, even when the acoustic features at decoding time were never seen in the training data. To answer that question, score normalization methods have been proposed, such as Z-norm [2], T-Norm [3], h-Norm [4], u-Norm [5], and r-Norm [6]. However, all of these methods can be regarded as standard deviation normalization over a particular training, testing or selected dataset; the basic idea is shown in equation (4). Therefore $$\delta$$ remains a magic, dataset-dependent number.

$$\tilde{\Lambda}(X)=\frac{\Lambda(X)-Mean(\Lambda(X^*))}{Dev(\Lambda(X^*))}$$    (4)

where $$X^*$$ is a specific training, testing or selected acoustic dataset.
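As a rough illustration of equation (4), the sketch below normalizes a raw score $$\Lambda(X)$$ using the mean and deviation of scores collected on a reference set $$X^*$$. The function name `normalize_score` and the sample scores are hypothetical, and the use of the population standard deviation for $$Dev$$ is an assumption, since the text does not specify which estimator is meant.

```python
import statistics


def normalize_score(raw_score, reference_scores):
    """Apply equation (4): subtract the reference mean, divide by the deviation.

    reference_scores holds the values Lambda(X*) measured on the chosen
    training, testing or selected dataset.
    """
    mean = statistics.fmean(reference_scores)
    # Assumption: Dev is taken as the population standard deviation.
    dev = statistics.pstdev(reference_scores)
    if dev == 0:
        raise ValueError("reference scores have zero deviation")
    return (raw_score - mean) / dev


# Hypothetical reference scores with mean 1.0 and deviation 1.0.
reference = [0.0, 2.0]
normalized = normalize_score(3.0, reference)
```

The normalized score depends entirely on which reference set is supplied, so a threshold chosen for one reference set does not carry over to another.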

In this paper, we propose a new normalization method named 2S-Norm (2 Scores Normalization), consisting of two scores: the identification score (IS) and the confidence score (CS). IS is used to accept an utterance if it was spoken by a known speaker in $$S$$, and to reject it otherwise. IS is estimated over a training dataset, similar to the Z-Norm method. CS represents the confidence of a decoding decision, avoiding mistakes where a known speaker is correctly accepted but incorrectly identified as a different individual. IS and CS are first normalized and then mapped to the range $$\left [ 0\% \div 100\% \right ]$$. Thresholds then become simple to choose: they are set according to the expected accuracy, in the range from 0% to 100%.

This paper is structured as follows. Section 2 introduces the basic idea of speaker identification based on the Gaussian Mixture Model, followed by a description of score normalization and common methods in Section 3. The details of the proposed method, including definitions and the decision algorithm, are presented in Section 4. Section 5 provides the evaluation results for the new method compared to T-Norm and Z-Norm. Finally, Section 6 describes the salient points derived from this study.

See the full paper HERE.

Nguyen Van Huy
Thai Nguyen University of Technology, Thai Nguyen, Vietnam
