The goal is to find the distribution (mean and variance) of the noisy speech
parameter , given the distributions of the noise parameters and ,
and of the clean speech parameter . First, the mean, i.e. the expectation value,
of is given as:
Therefore, to find the value of , we first expand the function into a polynomial that closely approximates it within a given range, keeping its order as low as possible. The polynomial approximation can be simplified by reducing the function to a univariate one, by taking , where .
In this work, a 2nd-order Lagrange interpolating polynomial is
used to approximate the function , given as:
The points , and can be specified manually (one point at and the other two chosen to minimize the error in the required range); alternatively, a Chebyshev-Lagrange polynomial can be used, which determines the points itself.
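The two ways of choosing the interpolation points can be sketched as follows. This is only an illustration under stated assumptions: the target function is taken to be log(1 + exp(z)), which is not given in the text here, and the names `f`, `chebyshev_nodes`, `lagrange_quadratic`, the interval [-3, 3] and the manual nodes are hypothetical choices, not the values used in this work.

```python
import numpy as np

def f(z):
    # ASSUMPTION: the function being approximated is log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def chebyshev_nodes(lo, hi, n=3):
    # Chebyshev nodes of the first kind, mapped from [-1, 1] to [lo, hi].
    k = np.arange(n)
    t = np.cos((2 * k + 1) * np.pi / (2 * n))
    return np.sort(0.5 * (lo + hi) + 0.5 * (hi - lo) * t)

def lagrange_quadratic(nodes, func=f):
    # Build the 2nd-order Lagrange interpolant as a sum of basis polynomials.
    # Returns coefficients (a, b, c) of a*z**2 + b*z + c, highest power first.
    z = np.asarray(nodes, dtype=float)
    coeffs = np.zeros(3)
    for i in range(3):
        others = np.delete(z, i)
        # Basis polynomial: prod_{j != i} (z - z_j) / (z_i - z_j)
        basis = np.poly(others) / np.prod(z[i] - others)
        coeffs += func(z[i]) * basis
    return coeffs

# Manually chosen points (hypothetical) versus Chebyshev points on [-3, 3].
manual = lagrange_quadratic([-3.0, 0.0, 3.0])
cheb = lagrange_quadratic(chebyshev_nodes(-3.0, 3.0))
```

Either coefficient vector can be evaluated with `np.polyval(manual, z)`; the interpolant reproduces the function exactly at its three nodes and approximates it in between.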
Figure 2 shows different polynomials used to approximate the function when . In Figure 2a, the points selected for the Lagrange polynomial expansion are , and . As seen in the figure, the Lagrange polynomial approximates the function over a larger range, and more accurately, than even the 2nd-order Taylor series. When the variance of is low, the points and can be kept closer to ; when has a large variance, the points should be moved farther from . However, moving them too far introduces inaccuracies in the approximation in the region close to , where most of the data occurs. Therefore, the points and should be placed at optimum values that depend on the variance of .
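The trade-off in point placement can be explored numerically with a small sketch. It rests on assumptions that are not taken from the text: the function is log(1 + exp(z)), the middle point is at 0, the outer points are symmetric at -d and +d, and the variable is zero-mean Gaussian with standard deviation `sigma`. All names are hypothetical.

```python
import numpy as np

def f(z):
    # ASSUMPTION: the function being approximated is log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def quadratic_through(d):
    # Quadratic through (-d, f(-d)), (0, f(0)), (d, f(d)). A degree-2 fit
    # to three points is the exact interpolant.
    nodes = np.array([-d, 0.0, d])
    return np.polyfit(nodes, f(nodes), 2)

def max_abs_error(coeffs, lo=-3.0, hi=3.0, n=2001):
    # Largest absolute approximation error on a dense grid over [lo, hi].
    z = np.linspace(lo, hi, n)
    return float(np.max(np.abs(f(z) - np.polyval(coeffs, z))))

def weighted_mse(coeffs, sigma, half_width=8.0, n=4001):
    # Mean squared error weighted by a zero-mean Gaussian density
    # (ASSUMPTION) with standard deviation sigma, on a dense grid.
    z = np.linspace(-half_width * sigma, half_width * sigma, n)
    w = np.exp(-0.5 * (z / sigma) ** 2)
    w /= w.sum()
    err = f(z) - np.polyval(coeffs, z)
    return float(np.sum(w * err ** 2))

# 2nd-order Taylor expansion of log(1 + exp(z)) about z = 0.
taylor = np.array([0.125, 0.5, np.log(2.0)])

narrow, wide = quadratic_through(1.0), quadratic_through(4.0)
for sigma in (0.5, 2.0):
    print(sigma, weighted_mse(narrow, sigma), weighted_mse(wide, sigma))
print(max_abs_error(quadratic_through(3.0)), max_abs_error(taylor))
```

Under these assumptions, the narrow spacing gives the smaller weighted error for the small standard deviation and the wide spacing for the large one, which mirrors the argument above; a scan over `d` for a given `sigma` would be one way to pick the spacing.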
Finally, Eq.(6) reduces to
form, where ,
and are constants. Therefore:
Using this estimated value of in Eq.(5), the mean of the corrupted speech vector can be computed. Because an accurate mean is more important than an accurate variance, the covariance matrix can be retained as it is for the clean speech. However, an expression for adapting the diagonal variances can also be derived from the above approximation, in terms of the higher-order moments (up to the 4th moment) of , so that the diagonal variances can be adapted as well.
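As an illustration of how the mean and variance follow from a quadratic approximation, the sketch below computes both from raw moments up to the 4th. It assumes, without confirmation from the text, that the approximation has the form a*z**2 + b*z + c and that z is Gaussian with mean `mu` and variance `var`; the function names are hypothetical.

```python
import numpy as np

def gaussian_raw_moments(mu, var):
    # Raw moments E[z], E[z^2], E[z^3], E[z^4] of z ~ N(mu, var).
    m1 = mu
    m2 = mu ** 2 + var
    m3 = mu ** 3 + 3.0 * mu * var
    m4 = mu ** 4 + 6.0 * mu ** 2 * var + 3.0 * var ** 2
    return m1, m2, m3, m4

def quadratic_mean_var(a, b, c, mu, var):
    # Mean and variance of g(z) = a*z^2 + b*z + c for Gaussian z (ASSUMPTION).
    m1, m2, m3, m4 = gaussian_raw_moments(mu, var)
    mean_g = a * m2 + b * m1 + c
    # E[g^2] from expanding (a*z^2 + b*z + c)^2 term by term.
    second = (a ** 2 * m4 + 2.0 * a * b * m3
              + (b ** 2 + 2.0 * a * c) * m2 + 2.0 * b * c * m1 + c ** 2)
    return mean_g, second - mean_g ** 2

mean_g, var_g = quadratic_mean_var(0.1, 0.5, 0.7, mu=1.0, var=2.0)
```

For a Gaussian variable the variance also has the closed form 2*a**2*var**2 + (2*a*mu + b)**2*var, which serves as a check on the moment-based computation.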
The method for estimating the model parameters of corrupted speech is shown in Figure 4. Because the approximation is done in the log-spectral domain, the cepstral-domain HMM parameters of clean speech and noise are converted into the log-spectral domain by taking the inverse DCT. This conversion of a parameter vector from the cepstral to the log-spectral domain requires knowledge of . If the given model parameters do not include , it can be computed, as worked out in , by noting that the sum of the energies of the Mel bands in the linear spectral domain equals the total frame energy.
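The domain conversion can be sketched as below. The sketch makes several assumptions that the text does not confirm: an orthonormal DCT-II relates log-spectral and cepstral vectors, the cepstral vector starts with the zeroth coefficient, truncated higher-order coefficients are treated as zero, and only diagonal variances are propagated. The names `dct_matrix` and `cepstral_to_log_spectral` are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II matrix C (n x n), so that cepstrum = C @ log_spectrum
    # and log_spectrum = C.T @ cepstrum. ASSUMPTION: this DCT convention.
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    C[0, :] /= np.sqrt(2.0)
    return C

def cepstral_to_log_spectral(mean_cep, var_cep, n_bands):
    # Zero-pad truncated cepstral statistics up to the number of Mel bands.
    mean_pad = np.zeros(n_bands)
    var_pad = np.zeros(n_bands)
    mean_pad[:len(mean_cep)] = mean_cep
    var_pad[:len(var_cep)] = var_cep
    C = dct_matrix(n_bands)
    # Mean: inverse DCT. Diagonal variance: diagonal of C.T @ diag(var) @ C.
    mean_log = C.T @ mean_pad
    var_log = (C.T ** 2) @ var_pad
    return mean_log, var_log

# Example with 8 hypothetical Mel bands and full-length cepstral statistics.
rng = np.random.default_rng(0)
log_mean_true = rng.normal(size=8)
cep_mean = dct_matrix(8) @ log_mean_true
log_mean, log_var = cepstral_to_log_spectral(cep_mean, np.ones(8), 8)
```

With full-length cepstra the inverse transform recovers the log-spectral mean exactly; with truncated cepstra the result is a smoothed version of it.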
The statistics that account for channel distortions can be obtained with an EM-based approach that maximizes the likelihood score, as described in . Some adaptation data are required to estimate the statistics for the channel distortion.
In all cases, only the diagonal elements of the covariance matrices of the speech and noise HMMs are considered, to reduce the complexity and computational expense of the algorithm.