
Polynomial-approximation

The goal is to find the distribution (mean and variance) of the noisy speech parameter $y$, given the distributions of the noise parameters $n$ and $h$ and of the clean speech parameter $x$. First, the mean (expectation) of $y$ is given by:

$\displaystyle E(y) = E(x)+E(h)+E[\ln(1+e^{n-x-h})] = E(x)+E(h)+E[g(x,n,h)]$     (5)

Even when $x$, $n$ and $h$ have Gaussian distributions, $E[g(x,n,h)]$ has no closed-form expression.

Therefore, to find the value of $E[g(x,n,h)]$, we first expand the function $g(x,n,h)$ into a polynomial that closely approximates it within a given range while keeping the order as low as possible. The polynomial approximation can be simplified by reducing $g(x,n,h)$ to a univariate function of $z=n-x-h$. Assuming $x$, $n$ and $h$ are mutually independent, $z\sim\mathcal{N}(\mu_{n}-\mu_{x}-\mu_{h}, \sigma_{n}^{2}+\sigma_{x}^2+\sigma_{h}^2)$.
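The reduction to $z$ can be checked numerically. The following is a minimal sketch, not the authors' code: the function names, the sample count, and the parameter values are illustrative assumptions, and the variance formula assumes that $x$, $n$ and $h$ are independent. It computes the statistics of $z$ and a Monte-Carlo reference value for $E[g(z)]$ against which any polynomial approximation can later be compared.

```python
import math
import random

def softplus(z):
    """g(z) = ln(1 + e^z), written so that large |z| does not overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def z_stats(mu_x, var_x, mu_n, var_n, mu_h, var_h):
    """Mean and variance of z = n - x - h for independent Gaussians."""
    return mu_n - mu_x - mu_h, var_n + var_x + var_h

def mc_expected_g(mu_z, var_z, num_samples=100_000, seed=0):
    """Monte-Carlo estimate of E[g(z)] for z ~ N(mu_z, var_z)."""
    rng = random.Random(seed)
    sigma_z = math.sqrt(var_z)
    total = sum(softplus(rng.gauss(mu_z, sigma_z)) for _ in range(num_samples))
    return total / num_samples

# Illustrative values only (assumption): no channel term, i.e. h is fixed at 0.
mu_z, var_z = z_stats(mu_x=10.0, var_x=6.0, mu_n=10.0, var_n=0.1,
                      mu_h=0.0, var_h=0.0)
estimate = mc_expected_g(mu_z, var_z)
```

Since $g$ is convex, the estimate should lie above $g(\mu_z)$, which is one way to see that simply evaluating $g$ at the mean underestimates $E[g(z)]$.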

In this work, a 2nd-order Lagrange interpolating polynomial is used to approximate the function $g(z)$:

$\displaystyle g(z)\approx P_2(z)=\sum_{k=0}^2 g(z_k) \prod_{i=0, i\neq k}^2 \frac{(z-z_i)}{(z_k-z_i)}$     (6)
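Eq. (6) can be evaluated directly from its definition. The sketch below is illustrative: the function names are assumptions, and the node values $0$ and $\pm 5$ are those quoted in the text for the case $\mu_z=0$.

```python
import math

def g(z):
    """g(z) = ln(1 + e^z), evaluated in an overflow-safe form."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def lagrange_p2(z, nodes):
    """Evaluate the interpolating polynomial of Eq. (6) at z.

    nodes is a sequence of three distinct points (z_0, z_1, z_2).
    """
    total = 0.0
    for k, z_k in enumerate(nodes):
        basis = 1.0
        for i, z_i in enumerate(nodes):
            if i != k:
                basis *= (z - z_i) / (z_k - z_i)
        total += g(z_k) * basis
    return total

nodes = (0.0, -5.0, 5.0)
value_at_2_5 = lagrange_p2(2.5, nodes)
```

By construction the polynomial reproduces $g$ exactly at the three nodes; between them it deviates from $g$ by an amount that depends on where the nodes are placed.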

Figure 2: Comparison of Polynomial Approximations

The points $z_0$, $z_1$ and $z_2$ can be specified manually (one point at $z=\mu_z$ and the other two chosen to minimize the error in the required range). Alternatively, a Chebyshev-Lagrange polynomial can be used, which determines the points itself.
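For the Chebyshev option, the interpolation points are the Chebyshev nodes mapped onto the range of interest. The sketch below assumes that this range is an interval $[lo, hi]$ chosen by the user; the text does not say how the range is selected, so both the interval and the function name are hypothetical.

```python
import math

def chebyshev_nodes(lo, hi, count=3):
    """Chebyshev nodes (roots of the degree-`count` Chebyshev polynomial)
    mapped from [-1, 1] onto the interval [lo, hi]."""
    mid = 0.5 * (lo + hi)
    half = 0.5 * (hi - lo)
    return [mid + half * math.cos((2 * k + 1) * math.pi / (2 * count))
            for k in range(count)]

# Interval centred on mu_z = 0 (illustrative choice).
nodes = chebyshev_nodes(-5.0, 5.0)
```

With three nodes, the middle one falls at the midpoint of the interval, so centring the interval on $\mu_z$ keeps one interpolation point at $z=\mu_z$, as in the manual choice.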

Figure 2 shows different polynomials used to approximate the function $g(z)=\ln(1+e^z)$ when $\mu_z=0$. In Figure 2a, the points selected for the Lagrange polynomial expansion are $z_0=\mu_z$, $z_1=z_0-5$ and $z_2=z_0+5$. As the figure shows, the Lagrange polynomial approximates the function over a wider range, and more accurately over that range, than even the 2nd-order Taylor series. When the variance of $z$ is low, the points $z_1$ and $z_2$ can be kept close to $z_0$; when $z$ has a large variance, they should be moved farther from $z_0$. Moving them too far, however, introduces inaccuracies in the region close to $z=z_0=\mu_z$, where most of the data lies. The points $z_1$ and $z_2$ should therefore be placed at optimum values that depend on the variance of $z$.
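The trade-off described above can be reproduced with a short numerical comparison. This is a sketch under stated assumptions: the evaluation grid on $[-5, 5]$ and the function names are illustrative, and the Taylor coefficients are those of $g$ expanded at $z=0$, where $g(0)=\ln 2$, $g'(0)=1/2$ and $g''(0)=1/4$.

```python
import math

def g(z):
    """g(z) = ln(1 + e^z), evaluated in an overflow-safe form."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def taylor2(z):
    """2nd-order Taylor expansion of g around z = 0."""
    return math.log(2.0) + 0.5 * z + 0.125 * z * z

def lagrange2(z, nodes=(0.0, -5.0, 5.0)):
    """2nd-order Lagrange interpolant of g through the given nodes."""
    total = 0.0
    for k, z_k in enumerate(nodes):
        basis = 1.0
        for i, z_i in enumerate(nodes):
            if i != k:
                basis *= (z - z_i) / (z_k - z_i)
        total += g(z_k) * basis
    return total

# Worst-case absolute error of each approximation on a grid over [-5, 5].
grid = [-5.0 + 0.1 * i for i in range(101)]
max_err_taylor = max(abs(taylor2(z) - g(z)) for z in grid)
max_err_lagrange = max(abs(lagrange2(z) - g(z)) for z in grid)

# Error very close to the expansion point z = 0.
near_err_taylor = abs(taylor2(0.1) - g(0.1))
near_err_lagrange = abs(lagrange2(0.1) - g(0.1))
```

Over the whole interval the interpolant has the smaller worst-case error, while very close to $z=0$ the Taylor expansion is the more accurate of the two. This is the same tension that governs how far $z_1$ and $z_2$ should be placed from $z_0$.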

Finally, Eq. (6) reduces to the form $g(z)\approx az^2+bz+c$, where $a$, $b$ and $c$ are constants determined by the interpolation points. Since $E[z^2]=\sigma_z^2+\mu_z^2$, it follows that:

$\displaystyle E[g(z)]\approx a(\sigma_{z}^2+\mu_{z}^2)+b\mu_z+c$     (7)
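The constants $a$, $b$ and $c$ follow from the three interpolation points. The sketch below obtains them through divided differences, which is one of several equivalent ways to expand Eq. (6), and then applies Eq. (7). Function names and the example statistics are illustrative assumptions.

```python
import math

def g(z):
    """g(z) = ln(1 + e^z), evaluated in an overflow-safe form."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def quadratic_coeffs(nodes):
    """Coefficients (a, b, c) of the quadratic a*z^2 + b*z + c that
    interpolates g at the three distinct nodes (z0, z1, z2)."""
    z0, z1, z2 = nodes
    f0, f1, f2 = g(z0), g(z1), g(z2)
    d01 = (f1 - f0) / (z1 - z0)        # first divided difference
    d12 = (f2 - f1) / (z2 - z1)
    d012 = (d12 - d01) / (z2 - z0)     # second divided difference
    # Expand f0 + d01*(z - z0) + d012*(z - z0)*(z - z1) in powers of z.
    a = d012
    b = d01 - d012 * (z0 + z1)
    c = f0 - d01 * z0 + d012 * z0 * z1
    return a, b, c

def expected_g(mu_z, var_z, nodes):
    """Eq. (7): approximate E[g(z)] for z ~ N(mu_z, var_z)."""
    a, b, c = quadratic_coeffs(nodes)
    return a * (var_z + mu_z ** 2) + b * mu_z + c

nodes = (0.0, -5.0, 5.0)
a, b, c = quadratic_coeffs(nodes)
approx = expected_g(0.0, 6.1, nodes)
```

For nodes placed symmetrically about $0$, $b$ comes out as $1/2$ because $g(z)-g(-z)=z$, and $c=g(0)=\ln 2$.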

Substituting this estimate of $E[g(z)]$ into Eq. (5) gives the mean of the corrupted speech vector. Because an accurate mean is more important than an accurate variance, the covariance matrix of the clean speech can be retained unchanged. Alternatively, an expression for adapting the diagonal variances can be derived from the above approximation in terms of the higher-order moments of $z$ (up to the 4th moment), so that the diagonal variances are adapted as well.
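As a hedged illustration of where the 4th moment enters, the variance of the quadratic term alone can be worked out for Gaussian $z$. This is only one ingredient and not the full variance of $y$, which also involves the variances of $x$ and $h$ and their covariances with $g(z)$; the source does not give that expression, so it is not reproduced here. Using the Gaussian moments $E[z^3]=\mu_z^3+3\mu_z\sigma_z^2$ and $E[z^4]=\mu_z^4+6\mu_z^2\sigma_z^2+3\sigma_z^4$:

$\displaystyle \mathrm{Var}[az^2+bz+c] = E[(az^2+bz+c)^2]-\left(E[az^2+bz+c]\right)^2 = 2a^2\sigma_z^4+(2a\mu_z+b)^2\sigma_z^2$

The same result follows by writing $z=\mu_z+\sigma_z\epsilon$ with $\epsilon\sim\mathcal{N}(0,1)$, so that $az^2+bz+c = a\sigma_z^2\epsilon^2+(2a\mu_z+b)\sigma_z\epsilon+\mathrm{const}$, and using $\mathrm{Var}[\epsilon^2]=2$ and $\mathrm{Cov}[\epsilon,\epsilon^2]=0$.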

The method for estimating the model parameters of corrupted speech is shown in Figure 4. Because the approximation is carried out in the log-spectral domain, the cepstral-domain HMM parameters of clean speech and noise are converted into the log-spectral domain by an inverse DCT. Converting a parameter vector from the cepstral to the log-spectral domain requires knowledge of $C_0$. If the given model parameters do not include $C_0$, it can be computed, as worked out in [12], by noting that the sum of the Mel-band energies in the linear spectral domain equals the total frame energy.
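The domain conversion can be sketched as follows. Several assumptions are made that the text does not confirm: an orthonormal DCT-II, no truncation of the cepstral vector (its length equals the number of Mel bands, with $C_0$ as the first element), no liftering, and a diagonal cepstral covariance of which only the diagonal is kept after the transform. Real front ends may use a different scaling or a truncated cepstrum, and the function names here are hypothetical.

```python
import math

def dct_matrix(size):
    """Orthonormal DCT-II matrix C, so that cepstrum = C * log_spectrum."""
    rows = []
    for k in range(size):
        scale = math.sqrt(1.0 / size) if k == 0 else math.sqrt(2.0 / size)
        rows.append([scale * math.cos(math.pi * k * (2 * j + 1) / (2 * size))
                     for j in range(size)])
    return rows

def cepstral_to_log_spectral(mean_cep, var_cep):
    """Map a Gaussian with diagonal covariance from the cepstral to the
    log-spectral domain, keeping only the diagonal of the result.

    For an orthonormal C the inverse transform is the transpose of C.
    """
    size = len(mean_cep)
    c = dct_matrix(size)
    # mean_log = C^T * mean_cep
    mean_log = [sum(c[k][j] * mean_cep[k] for k in range(size))
                for j in range(size)]
    # diag(C^T * diag(var_cep) * C)
    var_log = [sum(c[k][j] ** 2 * var_cep[k] for k in range(size))
               for j in range(size)]
    return mean_log, var_log

# Illustrative 4-band example (values are made up).
mean_log, var_log = cepstral_to_log_spectral([5.0, -1.0, 0.5, 0.0],
                                             [1.0, 0.5, 0.25, 0.1])
```

After the compensated means are computed in the log-spectral domain, the forward DCT (multiplication by the same matrix) maps them back to cepstra.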

The statistics that account for channel distortion can be obtained with an EM-based approach that maximizes the likelihood score, as described in [13]. Some adaptation data are required to estimate these statistics.

In all cases, only the diagonal elements of the covariance matrices of the speech and noise HMMs are considered, in order to reduce the complexity and computational expense of the algorithm.

Figure 3: Estimated mean of corrupted speech by Monte-Carlo simulation, Log-max approximation, Vector Taylor Series-1, and Lagrange Polynomial Approximation (LPA), for $\mu_n=10$, $\sigma_n^2=0.1$, $\sigma_x^2=6$ and $\mu_x$ varying from 0 to 20.
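The kind of comparison shown in Figure 3 can be approximated for a single value of $\mu_x$ with the sketch below. It rests on several assumptions and does not reproduce the authors' experiment: the channel term is omitted ($h=0$), "log-max" is taken as $\max(\mu_x,\mu_n)$, the first-order estimate is $\mu_x+g(\mu_z)$, and the LPA nodes are placed at $\mu_z$ and $\mu_z\pm 5$. The exact variants and node placement used for the figure are not stated in the text.

```python
import math
import random

def g(z):
    """g(z) = ln(1 + e^z), evaluated in an overflow-safe form."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def lpa_expected_g(mu_z, var_z, spread=5.0):
    """Eq. (7) with nodes mu_z and mu_z +/- spread (spread is an assumption)."""
    f0, fm, fp = g(mu_z), g(mu_z - spread), g(mu_z + spread)
    # Quadratic through the nodes, written in powers of d = z - mu_z.
    a = (fp + fm - 2.0 * f0) / (2.0 * spread ** 2)
    # E[d] = 0 and E[d^2] = var_z, so the linear term drops out.
    return f0 + a * var_z

def estimate_means(mu_x, var_x, mu_n, var_n, num_samples=100_000, seed=0):
    """Four estimates of E[y] for y = x + g(n - x), with no channel term."""
    mu_z, var_z = mu_n - mu_x, var_n + var_x
    rng = random.Random(seed)
    sigma_x, sigma_n = math.sqrt(var_x), math.sqrt(var_n)
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(mu_x, sigma_x)
        n = rng.gauss(mu_n, sigma_n)
        total += x + g(n - x)
    return {
        "monte_carlo": total / num_samples,
        "log_max": max(mu_x, mu_n),
        "first_order": mu_x + g(mu_z),
        "lpa": mu_x + lpa_expected_g(mu_z, var_z),
    }

# One point of the sweep: mu_x = 10 with the statistics from the caption.
results = estimate_means(mu_x=10.0, var_x=6.0, mu_n=10.0, var_n=0.1)
```

Calling `estimate_means` for a range of `mu_x` values between 0 and 20 gives curves that can be compared with the Monte-Carlo reference in the same way as in the figure.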

Figure 4: Model Composition by Lagrange Polynomial Approximation


September 23, 2004