Major Contributions
Selected technical contributions are listed below, each with its first appearance at a Japanese national conference. Best viewed with Japanese font settings. If you find anything that needs correction, please let me know at .

Speech Analysis and Features

  • Lag Window (1975)
    Proposed lag windowing of autocorrelation to reduce the pitch effect in PARCOR speech analysis/synthesis. Jointly patented with Tohkura and Hashimoto. Commonly used in LPC-based speech coding standards.
    Bibliography: 特許01358638 ``音声分析装置'' (Japan Patent No. 01358638).
  • Cepstrum Distance for Speech Recognition (1978)
    Used LPC-cepstrum distance for DP-based speech recognition. Collaborated with H. Nagashima. The cepstrum is now widely used in speech recognition. (Perhaps the first use of cepstral features in speech recognition, although they had already been applied to speaker recognition.)
    Bibliography: 好田 正紀, 長島 広海, 嵯峨山 茂樹, ``音韻単位の標準パターンを用いた単語音声認識装置,'' 電子通信学会全国大会予稿集, S9-4, Vol. 5, pp. 335-336, 1978.
  • Delta Cepstrum (1979)
    Proposed delta cepstrum for capturing the dynamic characteristics of speech, primarily for speaker recognition. This feature was later successfully applied to DP-based speaker-independent speech recognition by S. Furui. Under the name ``delta-cepstrum'', it is used extensively in modern HMM-based speech recognition.
    Bibliography: 嵯峨山 茂樹, 板倉 文忠, ``音声の動的尺度に含まれる個人性情報,'' 日本音響学会昭和54年度春季研究発表会講演論文集, 3-2-7, pp. 589-590 (1979-06). (Shigeki Sagayama and Fumitada Itakura, ``On Individuality in a Dynamic Measure of Speech,'' Proc. ASJ Spring Conf. 1979, 3-2-7, pp. 589-590, June 1979.)
  • LSP Analysis Theory (1979-)
    In collaboration with Itakura, produced a number of results on the theoretical properties of LSP, e.g., the duality theory of LPC and LSP and the theoretical properties of LSP frequency distributions. Subject of the Ph.D. thesis.
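
As an illustration of the lag-window and delta-cepstrum items above, here is a minimal Python sketch. The Gaussian window shape, the bandwidth parameter, and the regression-slope form of the delta feature are assumptions chosen for illustration; the text above does not state the formulations used in the original 1975 and 1979 work. All function names are hypothetical.

```python
import math


def autocorrelation(signal, max_lag):
    """Return the (unnormalized) autocorrelation r[0..max_lag] of a sample list."""
    n = len(signal)
    return [sum(signal[t] * signal[t + k] for t in range(n - k))
            for k in range(max_lag + 1)]


def gaussian_lag_window(max_lag, bandwidth_hz, sample_rate_hz):
    """Gaussian lag window (assumed shape): w[0] is 1 and w[k] decays with lag k."""
    a = 2.0 * math.pi * bandwidth_hz / sample_rate_hz
    return [math.exp(-0.5 * (a * k) ** 2) for k in range(max_lag + 1)]


def apply_lag_window(r, w):
    """Multiply each autocorrelation value by the window value at the same lag."""
    return [rk * wk for rk, wk in zip(r, w)]


def delta_features(frames, half_width=2):
    """Regression slope of each feature over +/- half_width frames.

    frames is a list of equal-length feature vectors (e.g. cepstra).
    Frames beyond either end are replaced by the nearest edge frame.
    """
    denom = 2 * sum(k * k for k in range(1, half_width + 1))
    last = len(frames) - 1
    result = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[0])):
            num = sum(k * (frames[min(t + k, last)][d] - frames[max(t - k, 0)][d])
                      for k in range(1, half_width + 1))
            row.append(num / denom)
        result.append(row)
    return result
```

A windowed autocorrelation such as `apply_lag_window(autocorrelation(x, 10), gaussian_lag_window(10, 60.0, 8000.0))` would then be passed to the usual LPC or PARCOR recursion in place of the raw autocorrelation; the 60 Hz and 8 kHz values are only example settings.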

Speech Analysis and Synthesis

  • Composite Sinusoidal Modeling (CSM) (1979)
    Proposed modeling speech as a sum of $n$ sinusoids, equating the autocorrelations of the speech signal and the model at the lowest $2n-1$ points. The model frequencies were proven equivalent to Line Spectrum Pair (LSP) frequencies. Applied in Yamaha's best-selling sound IC, the CSM speech synthesis patent earned the most among all NTT patents for several years.
    Bibliography: 嵯峨山 茂樹, 板倉 文忠, ``複合正弦波モデルによる音声分析,'' 電子情報通信学会 情報・システム部門別 全国大会予稿集, 63, p. 63, 1979; 嵯峨山 茂樹, 板倉 文忠, ``複合正弦波による簡易な音声合成法,'' 日本音響学会昭和54年度秋季研究発表会講演論文集, 3-2-3, pp. 557-558, Oct. 1979.
  • LSP Speech Synthesizer LSI (1980)
    Designed an LSP speech synthesizer LSI in collaboration with Fujitsu's telephone switching hardware team. It became the first LSI for LSP speech synthesis and also the first CMOS LSI for speech synthesis.
    Bibliography: 嵯峨山 茂樹, 管村 昇, 板倉 文忠, 小池 恒彦, ``線スペクトル対パラメータによる音声合成器LSIとその応用,'' 電子通信学会 通信部門 全国大会予稿集, S5-5, pp. 2-393-394, 1980.
  • Japanese Text-to-Speech System (1982)
    Collaborated with H. Sato, Y. Sagisaka, and K. Kogure to construct the first Japanese text-to-speech system.
    Bibliography: 佐藤 大和, 匂坂 芳典, 小暮 潔, 嵯峨山 茂樹, ``日本語テキストからの音声合成,'' 電子情報通信学会全国大会予稿集, S6-3, Vol. 5, pp. 399-400, 1982.
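
The autocorrelation-matching condition in the CSM item above can be written out as follows. This is a sketch under stated assumptions: the symbols $m_i$ (power of the $i$-th sinusoid), $\omega_i$ (its angular frequency), and $r_k$ (autocorrelation of the speech signal at lag $k$) are introduced here for illustration, and the phrase ``lowest $2n-1$ points'' is read as lags $0$ through $2n-1$.

```latex
\[
  r_k \;=\; \sum_{i=1}^{n} m_i \cos(k\,\omega_i),
  \qquad k = 0, 1, \dots, 2n-1 .
\]
```

Under this reading there are $2n$ equations for the $2n$ unknowns $m_1,\dots,m_n$ and $\omega_1,\dots,\omega_n$, so the model is determined by the autocorrelation values alone.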

Phone Modeling

  • Tree-based Allophone Clustering (1987)
    Proposed a tree-based clustering technique for context-dependent phones. After it was presented in English in 1989 (ICASSP89, Glasgow), the idea was adopted and modified by Kai-Fu Lee et al. in 1990 and used extensively by IBM and Cambridge University. It is now commonly used in modern phoneme-based high-performance speech recognition.
    Bibliography: 嵯峨山 茂樹, ``音素環境のクラスタリング,'' 音学講論, 1-5-15, pp. 29-30, Oct. 1987. (Shigeki Sagayama, ``Phoneme Environment Clustering,'' Proc. ASJ Fall Conf., 1-5-15, pp. 29-30, Oct. 1987.)
  • Hidden Markov Network (1991)
    Combined the tree-based clustering described above with state tying to represent context-dependent phones. The Successive State Splitting (SSS) algorithm was also proposed for automatically obtaining the network structure of allophones. HMnet and the SSS algorithm have been applied extensively to speech recognition and other areas (e.g., protein analysis and automatic grammar acquisition).
    Bibliography: 鷹見 淳一, 嵯峨山 茂樹, ``逐次状態分割法(SSS)による隠れマルコフネットワークの自動生成,'' 日本音響学会平成3年度秋季研究発表会講演論文集, 2-5-13, pp. 73-74, Oct. 1991. (Jun-ichi Takami and Shigeki Sagayama, ``Automatic Generation of the Hidden Markov Network by the Successive State Splitting Algorithm,'' Proc. ASJ Fall Conf., 2-5-13, pp. 73-74, Oct. 1991.)
  • Context-Dependent HMM-LR Continuous Speech Recognition (1991)
    A context-dependent HMM (HMnet) was combined with a generalized LR parser for continuous speech recognition under a given context-free grammar (CFG).
    Bibliography: 永井 明人, 鷹見 淳一, 嵯峨山 茂樹, ``環境依存連続HMMを用いたHMM-LR連続音声認識,'' 日本音響学会平成3年度秋季研究発表会講演論文集, 1-5-20, pp. 39-40, Oct. 1991.
  • Four-layer Tied-structure HMM (1994)
    Tying at four levels: allophones, states, Gaussians, and scalar parameters. Advantageous for training with a small amount of data and for fast likelihood calculation. Presented in English at ICASSP95.
    Bibliography: 高橋 敏, 嵯峨山 茂樹, ``4 階層の共有構造を持つ音素環境依存 HMM の検討,'' 日本音響学会平成6年度秋期研究発表会講演論文集, 3-8-3, pp. 113-114, Oct. 1994. (Satoshi Takahashi, Shigeki Sagayama, ``A Study of Context-Dependent HMMs with Four-Level Tied Structure,'' Proc. ASJ Fall Conf., 3-8-3, pp. 113-114, Oct. 1994.)
  • Discrete Mixture HMM (1996)
    Mixture components are replaced by discrete (scalar quantized) distributions to represent non-Gaussian, complex distributions. Presented in English at ICASSP97.
    Bibliography: 高橋 敏, 嵯峨山 茂樹, ``離散混合出力分布型HMM,'' 日本音響学会平成8年度秋期研究発表会講演論文集, -, pp. -, Sep. 1996. (Satoshi Takahashi, Shigeki Sagayama, ``Discrete Mixture Output Distribution HMM,'' Proc. ASJ Fall Conf., -, pp. -, Sep. 1996.)
  • Asynchronous-Transition HMM (1999)
    Proposed a new HMM structure where state transitions are not synchronous between features.
    Bibliography: 松田繁樹, 中井満, 下平博, 嵯峨山茂樹, ``非同期遷移型HMM,'' 平成11年日本音響学会秋季研究発表会講演論文集, 1-1-12, pp. 23-24, Oct. 1999.
  • Multiple Linear-Regression HMM (2000)
    Proposed a new HMM structure where mean vectors are linearly dependent on observable factors, such as pitch frequency and power. English paper at ICASSP2001.
    Bibliography: 藤永 勝久, 中井 満, 下平 博, 嵯峨山 茂樹, ``重回帰HMMによる音声のモデル化,'' 平成12年電気関係学会北陸支部大会講演論文集, G-15, p. 458, Sep 2000.
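
The idea behind the Multiple Linear-Regression HMM item above, a state mean that depends linearly on observable factors, can be sketched in a few lines of Python. This is an illustration only: the diagonal-covariance Gaussian, the function names, and the parameter layout are assumptions, not the formulation of the cited papers.

```python
import math


def regression_mean(base_mean, regression, factors):
    """Mean vector shifted linearly by observable factors (e.g. pitch, power).

    base_mean:  list of D floats
    regression: D rows, each a list of K regression coefficients
    factors:    list of K observed factor values for the current frame
    """
    return [m + sum(b * x for b, x in zip(row, factors))
            for m, row in zip(base_mean, regression)]


def diag_gaussian_logpdf(obs, mean, var):
    """Log density of a diagonal-covariance Gaussian (assumed output distribution)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (o - m) ** 2 / v)
               for o, m, v in zip(obs, mean, var))


def state_log_likelihood(obs, factors, base_mean, regression, var):
    """Output log-likelihood of one state given the frame's observable factors."""
    mean = regression_mean(base_mean, regression, factors)
    return diag_gaussian_logpdf(obs, mean, var)
```

With all regression coefficients set to zero, `state_log_likelihood` reduces to an ordinary Gaussian output probability, which makes the extra parameters easy to compare against a conventional HMM state.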

Speaker & Noise Modeling & Adaptation

  • Vector Field Smoothing (1992)
    Proposed a speaker adaptation method that spatially smooths, in the feature space, the difference vectors between the original and the trained Gaussian mean vectors. This was the first method that enabled adaptation of Gaussian-mixture HMMs to a speaker. Extensively used in Japan.
    Bibliography: 大倉 計美, 杉山 雅英, 嵯峨山 茂樹, ``混合連続分布HMMを用いた移動ベクトル場平滑化話者適応方式,'' 日本音響学会平成4年度春季研究発表会講演論文集, 2-Q-17, pp. 191-192, Mar. 1992. (Kazumi Ohkura, Masahide Sugiyama and Shigeki Sagayama, ``Speaker Adaptation Based on Transfer Vector Field Smoothing Model with Continuous Mixture Density HMMs,'' ASJ, 2-Q-17, pp. 191-192, Mar. 1992.)
  • Speaker-Tied Mixture (1992)
    A Gaussian mixture is derived from speaker-dependent single-Gaussian phone (allophone) models. Later, this model was used for rapid speaker adaptation, in which the speaker mixture weights are adapted using an extremely small amount of training data (one word, for example).
    Bibliography: 小坂 哲夫, 鷹見 淳一, 嵯峨山 茂樹, ``話者混合SSSによる不特定話者音声認識,'' 日本音響学会平成4年度秋季研究発表会講演論文集, 2-5-9, pp. 135-136, Oct. 1992. (T. Kosaka, J. Takami and S. Sagayama, ``Speaker-Independent Speech Recognition Using Speaker-Mixture SSS algorithm,'' ASJ Fall Conf., 2-5-9, pp. 135-136, Oct. 1992.)
  • Speaker Tree (1993)
    Applied tree-based clustering to speakers to obtain a speaker tree that spans from a speaker-independent model to speaker-dependent models.
    Bibliography: 小坂 哲夫, 松永 昭一, 嵯峨山 茂樹, ``木構造クラスタリングを用いた話者適応,'' 日本音響学会平成5年度秋期研究発表会講演論文集, 2-7-14, pp. 97-98, Oct. 1993.
  • MAP/VFS for Speaker Adaptation (1994)
    MAP (maximum a posteriori) training and VFS (vector field smoothing) are combined to accelerate speaker adaptation.
    Bibliography: 高橋 淳一, 嵯峨山 茂樹, ``最大事後確率推定と移動ベクトル場平滑化を組み合わせによる高速話者適応,'' 日本音響学会平成6年度秋期研究発表会講演論文集, 2-8-19, pp. 75-76, Oct. 1994. (Jun-ichi Takahashi, Shigeki Sagayama, ``Vector-Field-Smoothed Bayesian Learning for Fast Speaker Adaptation,'' Proc. ASJ Fall Conf., 2-8-19, pp. 75-76, Oct. 1994.)
  • Jacobian Adaptation (1996)
    Proposed a fast method for adapting models to environmental noise. When a model trained beforehand for noise A is adapted to a target noise B, and A and B are relatively close, the noise adaptation procedure can be linearized and made very fast. The idea was first formulated at ATR in 1992, inspired by PMC (parallel model combination) by M. Gales. The first experimental results were obtained in 1996 and presented in English at ICASSP97 (Munich). Extensive follow-up studies have often appeared at recent ICASSPs and ICSLPs.
    Bibliography: 山口義和, 高橋淳一, 高橋敏, 嵯峨山茂樹, ``Taylor展開に基づく高速な音響モデル適応法,'' 日本音響学会平成8年度秋季研究発表会講演論文集, -, pp. -, Sep. 1996.
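
The linearization described in the Jacobian Adaptation item above can be illustrated with a small Python sketch. To keep it short, the sketch works per channel in the log-spectral domain with additive noise, where the Jacobian is diagonal; this simplification, the function names, and the variable names are assumptions for illustration, and a cepstral-domain version would additionally involve the cosine transform.

```python
import math


def noisy_log_spectrum(speech_log, noise_log):
    """Exact combination per channel: log(exp(speech) + exp(noise))."""
    return [math.log(math.exp(s) + math.exp(n))
            for s, n in zip(speech_log, noise_log)]


def jacobian_diagonal(speech_log, noise_log):
    """Derivative of the noisy log spectrum with respect to the noise log spectrum.

    In this simplified setting it equals noise / (speech + noise) per channel.
    """
    return [math.exp(n) / (math.exp(s) + math.exp(n))
            for s, n in zip(speech_log, noise_log)]


def jacobian_adapt(model_a, jacobian, noise_a, noise_b):
    """First-order update of a model trained under noise A toward noise B."""
    return [y + j * (nb - na)
            for y, j, na, nb in zip(model_a, jacobian, noise_a, noise_b)]


# Example: the Jacobian is computed once for noise A and reused for any nearby noise B.
speech = [2.0, 0.0]
noise_a = [0.0, 0.0]
noise_b = [0.1, -0.1]
model_a = noisy_log_spectrum(speech, noise_a)
jac = jacobian_diagonal(speech, noise_a)
approx_b = jacobian_adapt(model_a, jac, noise_a, noise_b)
exact_b = noisy_log_spectrum(speech, noise_b)
```

When noise B is close to noise A, `approx_b` stays close to `exact_b`, while the update itself needs only one multiply-add per parameter instead of recomputing the nonlinear combination.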

Hand-Writing Recognition

  • HMM-based Online Hand-Written Kanji-Character Recognition with Structured Lexicon (2000)
    Online handwritten Kanji characters are recognized within a continuous speech recognition framework, in which a 6500-Kanji lexicon is hierarchically structured and represented by sequences of substrokes. This was the first application of a continuous speech recognition algorithm to handwriting. English papers at ICDAR2001, IWFHR2002, and ICPR2002.
    Bibliography: 秋良 直人, 中井 満, 下平 博, 嵯峨山 茂樹, ``ストロークHMMによるオンライン文字認識の特徴量の検討,'' 平成12年電気関係学会北陸支部大会講演論文集, F-92, p. 393, Sep 2000.

Music Information Processing

  • HMM-based Music Transcription from MIDI Signals (1999)
    The sequence of observed note durations (inter-onset time) was transcribed by an HMM-based note recognizer with a grammar probabilistically modeling the sequence of musical notes. Presented in English at IEEE MMSP2002.
    Bibliography: 齋藤直樹, 中井満, 下平博, 嵯峨山茂樹, ``隠れマルコフモデルによる音楽演奏情報からの音符列推定,'' 平成11年電気関係学会北陸支部大会講演論文集, Oct. 1999.
  • HMM-based Harmonization of Given Melodies (1999)
    Optimal chord finding was formulated as finding the Viterbi sequence of hidden states that generates the observed melody. With a bigram grammar of chord sequences, the decoding process estimates the maximum-likelihood chord sequence.
    Bibliography: 川上隆, 中井満, 下平博, 嵯峨山茂樹, ``隠れマルコフモデルを用いた旋律への和声付け,'' 平成11年電気関係学会北陸支部大会講演論文集, Oct. 1999.
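
The harmonization item above, Viterbi decoding of chords as hidden states under a chord bigram, can be sketched as a toy Python example. Everything specific here is an assumption made for illustration: the three-chord vocabulary, the probability values, the one-chord-per-note simplification, and the function names are not taken from the cited work.

```python
import math

# Hypothetical toy model: three chords in C major and their chord tones.
CHORD_TONES = {
    "C": {"C", "E", "G"},
    "F": {"F", "A", "C"},
    "G": {"G", "B", "D"},
}

# Hypothetical chord bigram P(next chord | previous chord).
BIGRAM = {
    "C": {"C": 0.4, "F": 0.3, "G": 0.3},
    "F": {"C": 0.3, "F": 0.3, "G": 0.4},
    "G": {"C": 0.6, "F": 0.2, "G": 0.2},
}


def emission(note, chord):
    """P(note | chord): chord tones are likely, other scale notes unlikely."""
    return 0.3 if note in CHORD_TONES[chord] else 0.025


def harmonize(melody):
    """Return the most likely chord sequence for a non-empty list of note names."""
    chords = list(CHORD_TONES)
    # Uniform initial chord probability (assumed).
    score = {c: math.log(1.0 / len(chords)) + math.log(emission(melody[0], c))
             for c in chords}
    backpointers = []
    for note in melody[1:]:
        new_score, pointers = {}, {}
        for c in chords:
            best_prev = max(chords, key=lambda p: score[p] + math.log(BIGRAM[p][c]))
            new_score[c] = (score[best_prev] + math.log(BIGRAM[best_prev][c])
                            + math.log(emission(note, c)))
            pointers[c] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final chord.
    path = [max(chords, key=lambda c: score[c])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

Calling `harmonize(["E", "A", "B", "C"])` on this toy model returns one chord name per melody note; a realistic system would use a larger chord vocabulary and probabilities estimated from data.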


  • Anthropomorphic Dialogue Agent Toolkit (2000-2002)
    A project involving 17 researchers from 10 organizations is in progress to provide an open-source, free-of-charge toolkit for promoting spoken dialogue research. Funded by IPA (Information Processing Promotion Agency) for 3 years.
  • Written Communication for the Blind (2000-2002)
    A project is in progress to support written communication (e-mail, etc.) for the blind by combining handwritten character recognition, speech synthesis, and new transducers for written input (replacing the stylus). Conducted in collaboration with 7 organizations. Funded by Ishikawa Prefecture for 3 years.
