Sagayama/Ono Laboratory (Lab. #1), Department of Information Physics and Computing,
Graduate School of Information Science and Technology, The University of Tokyo.
(Last updated: 2007.04.07)

Harmonic-Temporal Clustering of Speech
for Single and Multiple F0 Contour Estimation
in Noisy Environments

Jonathan Le Roux, Hirokazu Kameoka, Nobutaka Ono, Alain de Cheveigné, Shigeki Sagayama

Harmonic-Temporal Clustering of Speech - Precise parametric description of the voiced parts of speech from the power spectrum

[Motivation] Robust estimation of the F0 contour of harmonic signals such as speech is a challenging problem that has been widely investigated but not yet solved satisfactorily. An algorithm that performs with high accuracy in a wide range of background noises (white noise, pink noise, noise bursts, music, other speech, etc.) and that simultaneously extracts the F0 contours of several concurrent voices would have a very broad range of applications in computational auditory scene analysis (CASA), speech recognition, prosody analysis, speech enhancement, and speaker identification. Several existing algorithms track multiple F0s, often relying on an initial frame-by-frame analysis followed by post-processing, for example with hidden Markov models (HMM), to reduce errors and obtain a smooth F0 contour.

[Focus] In contrast with the previous methods, we propose to perform estimation and model-based interpolation simultaneously, through a parametric model of the time and frequency shape of the spectral envelope of speech, based on a multi-pitch analysis method initially developed for feature extraction of music signals, the Harmonic-Temporal structured Clustering (HTC) method.

[Method] The speech spectrum is modeled as a sequence of spectral clusters governed by a common F0 contour expressed as a spline curve. These clusters are obtained by an unsupervised 2D time-frequency clustering of the power density using a new formulation of the EM algorithm, and their common F0 contour is estimated at the same time. A smooth F0 contour is extracted for the whole utterance, linking together its voiced parts. A noise model is used to cope with non-harmonic background noise, which would otherwise interfere with the clustering of the harmonic portions of speech.
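As a rough illustration of the model described above, the following Python sketch builds harmonic clusters that share one F0 contour, adds a flat noise model, and performs a single membership split of the observed power, in the spirit of an E-step. It is a simplified sketch, not the authors' formulation: the log-frequency axis, the Gaussian cluster shapes, the linear interpolation between knots in place of the spline, the flat noise floor, and all function and parameter names are assumptions made for illustration. The temporal shape of the clusters and the parameter updates (M-step) are omitted.

```python
import numpy as np

def f0_contour(knot_times, knot_log_f0, t):
    """Log-F0 at times t, interpolated between knots.

    Assumption: linear interpolation stands in for the spline curve.
    """
    return np.interp(t, knot_times, knot_log_f0)

def harmonic_model(log_freqs, log_f0, weights, sigma):
    """Power density of each harmonic cluster, shape (n_harm, n_freq, n_time).

    Assumption: harmonic n is a Gaussian in log-frequency centred at
    log_f0(t) + log(n), so all clusters follow the same F0 contour.
    """
    n = np.arange(1, len(weights) + 1)
    centers = log_f0[None, :] + np.log(n)[:, None]         # (n_harm, n_time)
    diff = log_freqs[None, :, None] - centers[:, None, :]  # (n_harm, n_freq, n_time)
    gauss = np.exp(-0.5 * (diff / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return weights[:, None, None] * gauss

def e_step(observed, harmonics, noise_floor):
    """Split the observed power between harmonic clusters and the noise model.

    noise_floor must be positive so that the normalisation never divides by zero.
    """
    noise = np.full(observed.shape, noise_floor)
    total = harmonics.sum(axis=0) + noise
    harmonic_parts = harmonics / total * observed  # power assigned to each cluster
    noise_part = noise / total * observed          # power assigned to the noise model
    return harmonic_parts, noise_part

# Synthetic example: 1 s of "speech" with a rising then falling F0 contour.
t = np.linspace(0.0, 1.0, 50)
log_freqs = np.linspace(np.log(80.0), np.log(4000.0), 200)
log_f0 = f0_contour(np.array([0.0, 0.5, 1.0]), np.log([200.0, 240.0, 180.0]), t)
weights = np.array([1.0, 0.6, 0.3, 0.1])
harm = harmonic_model(log_freqs, log_f0, weights, sigma=0.03)
observed = harm.sum(axis=0) + 0.05  # harmonic power plus a flat noise floor
parts, noise_part = e_step(observed, harm, noise_floor=0.05)
```

Because the membership weights sum to one at every time-frequency point, the harmonic parts and the noise part add up exactly to the observed power. In a full EM loop these partitioned spectra would then be used to re-estimate the contour and cluster parameters.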

[Characteristics] The method performs the analysis simultaneously in the time and frequency directions and is thus expected to be more robust than previous methods. The introduction of a noise model makes F0 estimation possible in noisy environments, and the algorithm can be used not only for single F0 estimation but also for co-channel concurrent speech.

Fig. 1. Comparison of observed and modeled spectra ("Tsuuyaku denwa kokusai kaigi jimukyoku desu", female speaker)
Fig. 2. Estimation of the clean part of a noisy spectrogram
Fig. 3. Parametric representation of separated spectrograms

Keywords: acoustic scene analysis, EM algorithm, harmonic-temporal structured clustering (HTC), multi-pitch estimation, noisy speech, spline F0 contour

Bibliography

The idea and preliminary results were published in [LeRoux2006ASJ03]. The results of thorough experiments were presented in [LeRoux2007ICASSP04], and a full presentation of both the theoretical aspects and the experimental results was published in [LeRoux2007IEEETrans05]. The publications regarding HTC can be found on the HTC presentation page.

Experimental evaluation

Table 1. Gross error rates for several F0 estimation algorithms on clean single-speaker speech.
Table 2. Accuracy (%) of the F0 estimation of single-speaker speech mixed with white and pink noise.
Tables 3 and 4. Categorization of interference signals, and accuracy (%) of the F0 estimation of voiced speech with several kinds of interference.
Table 5. F0 estimation of concurrent speech by multiple speakers, with a gross error counted when the estimate differs from the reference by more than 20% and by more than 10%.
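The gross error criterion mentioned for Table 5 can be sketched as follows. This is only an illustration under stated assumptions: the deviation is taken relative to the reference F0, frames whose reference is 0 are treated as unvoiced and skipped, and the function name and the sample values are hypothetical (they are not results from the tables).

```python
import numpy as np

def gross_error_rate(estimated_f0, reference_f0, tolerance=0.2):
    """Percentage of voiced frames whose estimate deviates from the
    reference by more than `tolerance` (relative deviation)."""
    est = np.asarray(estimated_f0, dtype=float)
    ref = np.asarray(reference_f0, dtype=float)
    voiced = ref > 0  # assumption: unvoiced reference frames are marked with 0
    relative_error = np.abs(est[voiced] - ref[voiced]) / ref[voiced]
    return 100.0 * np.mean(relative_error > tolerance)

# Made-up values in Hz, for illustration only.
ref = [200.0, 210.0, 0.0, 220.0, 230.0]
est = [202.0, 105.0, 150.0, 250.0, 229.0]
rate_20 = gross_error_rate(est, ref, tolerance=0.2)  # 25.0: only the octave error counts
rate_10 = gross_error_rate(est, ref, tolerance=0.1)  # 50.0: the 250 Hz frame counts as well
```

Tightening the threshold from 20% to 10% can only keep or increase the error rate, which is why reporting both gives a sense of how fine the estimation errors are.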

For further information, please read the explanation page for [LeRoux2007ICASSP04] and other PDF papers.

