Next: Model of the environment Up: Model Composition by Lagrange Previous: Model Composition by Lagrange

Introduction

The performance of speech recognizers trained on clean speech degrades in noisy environments, owing to the mismatch between training and testing conditions. This mismatch may result from background noise, channel distortion, speaker stress and other factors. A number of techniques have been developed to deal with this robustness issue; they can be broadly categorized into robust front-ends, multiple-microphone approaches, speech-enhancement techniques and model-based compensation schemes. This paper deals with the model-based, or HMM composition, approach considered in the works of Varga et al. [2], Martin et al. [3] and Gales and Young [4]. Model composition estimates a model for the noisy acoustical environment by combining a clean-speech HMM with a noise HMM, and thus reduces the mismatch between training and testing conditions.

Parallel Model Combination (PMC) [4,5] has proved an effective method for coping with the robustness issue and has been studied extensively. Many variants of PMC exist, all attempting to estimate a noise-adapted model from a clean-speech HMM and a noise HMM. However, accurate estimation of the model parameters by PMC involves numerical integration, which is computationally very expensive. Data-driven PMC [6], which generates samples of corrupted speech vectors by Monte Carlo simulation, is nearly as accurate as numerical integration but still slow. Other approximations, such as log-normal, log-add and log-max, are computationally efficient but less accurate [7]. Furthermore, the log-normal approximation to PMC, which is the most commonly used, assumes that the sum of two log-normally distributed random variables is itself log-normally distributed [5].
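As a concrete illustration of the log-normal approximation, the sketch below combines a clean-speech Gaussian and a noise Gaussian in the log-spectral domain, assuming diagonal covariances and a single mixture component; the function name and the gain parameter `g` are illustrative choices, not taken from the cited works:

```python
import numpy as np

def lognormal_pmc(mu_s, var_s, mu_n, var_n, g=1.0):
    """Log-normal PMC approximation sketch (log-spectral domain,
    diagonal covariances).  mu_*/var_* are per-dimension log-domain
    Gaussian parameters; g is a gain factor setting the SNR."""
    # Map the log-domain Gaussians to linear-domain mean and variance.
    m_s = np.exp(mu_s + var_s / 2.0)
    v_s = m_s**2 * (np.exp(var_s) - 1.0)
    m_n = g * np.exp(mu_n + var_n / 2.0)
    v_n = m_n**2 * (np.exp(var_n) - 1.0)
    # Independent speech and noise add in the linear domain.
    m_y, v_y = m_s + m_n, v_s + v_n
    # Assume the sum is again log-normal and map back to the log domain.
    var_y = np.log(v_y / m_y**2 + 1.0)
    mu_y = np.log(m_y) - var_y / 2.0
    return mu_y, var_y
```

In the near-deterministic limit (variances close to zero) this reduces to the familiar log-add rule, i.e. the combined mean approaches log(exp(mu_s) + exp(mu_n)).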

Figure 1: Model of the acoustical environment.

The Jacobian approach to model adaptation, proposed by Sagayama et al. [8], compensates the model using Jacobian matrices applied to the difference between the assumed and observed noise cepstra. Although efficient and effective, the method requires some training data and assumes that the cepstral difference and the mixture variances stay within the range where the linear approximation holds.
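The linear compensation step itself can be sketched as follows; this is a minimal illustration in which the matrix `J` and the cepstral vectors are placeholders, since computing `J` is the substance of [8]:

```python
import numpy as np

def jacobian_compensate(c_model, J, c_noise_obs, c_noise_assumed):
    """Jacobian compensation sketch: shift the noisy-speech model
    cepstrum by J times the change in the noise cepstrum.  Valid only
    while (c_noise_obs - c_noise_assumed) stays in the linear range."""
    return c_model + J @ (c_noise_obs - c_noise_assumed)
```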

The use of neural networks (NNs) for combining a clean-speech HMM and a noise HMM has been investigated in [9]. The networks learn the non-linearity involved in the combination and thereby produce a noise-adapted HMM, taking the clean-speech HMM, the noise HMM and the SNR as inputs. The networks must first be trained on a set of input and output HMMs; each output HMM for training is obtained by a combination of the MLLR, MAP and VFS adaptation techniques for a particular combination of inputs, viz. clean-speech HMM, noise HMM and SNR [9]. The method has been found to be effective; however, it involves building a large number of sample output noisy HMMs and training the NNs, which is slow, computationally inefficient and tedious.

Vector Taylor Series (VTS) [10,11] is yet another approach, which combines the models by approximating the non-linear relationship between speech and noise with a truncated vector Taylor series. However, other polynomials, optimized to approximate the parameters of the distribution, can give better results than the Taylor series [11].
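For reference, a first-order VTS expansion of the commonly used log-spectral mismatch function y = x + log(1 + exp(n - x)) can be sketched as below; diagonal covariances are assumed, and this is a generic illustration rather than the exact formulation of [10,11]:

```python
import numpy as np

def vts_first_order(mu_x, var_x, mu_n, var_n):
    """First-order VTS around the expansion point (mu_x, mu_n) for the
    log-spectral mismatch y = x + log(1 + exp(n - x))."""
    G = 1.0 / (1.0 + np.exp(mu_n - mu_x))        # dy/dx at the expansion point
    mu_y = mu_x + np.log1p(np.exp(mu_n - mu_x))  # zeroth-order term
    var_y = G**2 * var_x + (1.0 - G)**2 * var_n  # first-order variance
    return mu_y, var_y
```

At equal speech and noise levels (mu_x = mu_n) the gradient G is 1/2, so the approximated variance is the average of the speech and noise variances.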

In this paper, we approximate the non-linear function governing the relationship between speech and noise by a Lagrange polynomial, and then estimate the model parameters for noisy speech. The accuracy of the approximation is compared with that of other methods, and the performance of the Lagrange polynomial approximation is also evaluated on a speaker-dependent isolated-word recognition task.
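Since the paper's derivation follows in later sections, the sketch below only illustrates the basic building block: evaluating a Lagrange interpolating polynomial through sample points of the speech/noise non-linearity. The node placement and target function here are illustrative choices, not the paper's:

```python
import numpy as np

def lagrange_eval(nodes, values, z):
    """Evaluate the Lagrange interpolating polynomial through the
    points (nodes[i], values[i]) at the point(s) z."""
    z = np.asarray(z, dtype=float)
    result = np.zeros_like(z)
    for i, (xi, yi) in enumerate(zip(nodes, values)):
        # Basis polynomial l_i(z) = prod_{j != i} (z - x_j) / (x_i - x_j)
        li = np.ones_like(z)
        for j, xj in enumerate(nodes):
            if j != i:
                li *= (z - xj) / (xi - xj)
        result += yi * li
    return result

# Example: approximate the mismatch non-linearity f(z) = log(1 + exp(z))
f = lambda z: np.log1p(np.exp(z))
nodes = np.linspace(-4.0, 4.0, 5)  # interpolation nodes (illustrative)
approx = lagrange_eval(nodes, f(nodes), 0.5)
```

By construction the polynomial reproduces f exactly at each node; between nodes the accuracy depends on the node placement, which is one of the design choices such a method must make.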


September 23, 2004