Prosodic information is known to offer many useful cues for speech recognition, such as the location of important words and phrases, topic segment boundaries, the location of disfluencies, and the identity of the language being spoken. Prosodic information is generally extracted on the assumption that the pitch pattern has already been (roughly) extracted. In spontaneous dialogue speech, however, two or more speakers often talk simultaneously, so pitch patterns cannot always be extracted easily. To incorporate proper prosodic information into spontaneous dialogue speech recognition, the number of simultaneous speakers and their respective pitch patterns therefore need to be extracted precisely. The multi-pitch detection problem, however, is far from simple and is difficult to solve analytically.
Numerous multi-pitch detection methods have been reported to date, not only in speech signal processing [1,2] but also in musical signal processing [3,4,5] and auditory scene analysis [6,7]. Chazan et al. proposed a speech separation method that introduces a time-warped signal model, which allows continuous pitch variation within a long analysis frame. Wu et al. described a multi-pitch tracking method for noisy environments that combines filter-bank processing with HMM-based pitch tracking. Although these methods achieve accurate pitch detection, neither includes a specific procedure for determining the number of speakers.
Our objective is to develop a multi-pitch detection algorithm that, working in the spectral domain, estimates the number of simultaneous speakers, accurate pitch values on a continuous scale, and the respective spectral envelopes. The basic approach is presented in Section 2, the detection algorithm is described in Section 3, and experimental results are reported in Section 4.