It is known that prosodic information offers many useful cues for
speech recognition, such as the locations of important words and phrases,
topic segment boundaries, the locations of disfluencies, language
identity, and others.
Prosodic information is generally extracted on
the assumption that the F0 pattern has already been (roughly) extracted.
Yet F0 patterns cannot always be extracted easily from spontaneous
dialogue speech, in which simultaneous utterances by two or more speakers
often occur.
Thus, in order to incorporate proper prosodic information
into spontaneous dialogue speech recognition,
the number of simultaneous speakers and their respective
F0 patterns need to be extracted precisely.
However, the multi-pitch detection problem is far from simple and is
difficult to solve analytically.
Until now, numerous multi-pitch detection methods have been reported, not only in
speech signal processing [1,2] but
also in musical signal processing [3,4,5] and auditory scene
analysis [6,7].
Chazan et al. proposed a speech separation method that introduces a
time-warped signal model, which allows
continuous pitch variation within a long analysis frame [1].
Wu et al. described a multi-pitch tracking method for noisy environments
based on filter-bank processing and HMM-based pitch tracking [2].
Although these methods achieve accurate detection of F0s, neither
of them includes a specific process for determining the
number of speakers.
Our objective is to develop a multi-pitch detection algorithm that
can detect the number of simultaneous speakers, accurate F0s as
continuous values, and, moreover, the respective spectral
envelopes, using a spectral-domain procedure.
The basic approach is stated in Section 2,
the detection algorithm is described in Section 3,
and the results of operation experiments are
reported in Section 4.