Major Research Achievements
The main research results are listed below. If anything needs correcting, please let me know.

Speech Analysis and Features

  • Lag Window (1975)
    Proposed lag windowing of the autocorrelation to reduce the effect of pitch in PARCOR speech analysis/synthesis; jointly patented with Tohkura and Hashimoto. Although conceived for PARCOR analysis, the technique was later included in speech coding standards, including LSP-based ones, and is commonly used in LPC-based speech coding. It is fair to say that it is used in virtually every mobile phone in the world.
    First published: Japan Patent No. 01358638, ``音声分析装置'' (Speech Analysis Apparatus), Tohkura, Hashimoto, and Sagayama.
  • Cepstrum Distance for Speech Recognition (1978)
    Used the LPC-cepstrum distance for DP-based speech recognition, in collaboration with H. Nagashima. Cepstrum (or mel-cepstrum) is now the most widely used feature in speech recognition, and I have long wondered who first used it for that purpose. Colleagues abroad have also suggested that it was probably us (Nagashima, Kohda, and Sagayama): cepstra had already been applied to speaker recognition, but this may be the first use of cepstral features in speech recognition. If you have more accurate information, please let me know.
    First published: 好田 正紀, 長島 広海, 嵯峨山 茂樹, ``音韻単位の標準パターンを用いた単語音声認識装置,'' 電子通信学会全国大会予稿集, S9-4, Vol. 5, pp. 335-336, 1978.
  • Delta Cepstrum (1979)
    Proposed the delta cepstrum to capture the dynamic characteristics of speech; this was the first proposal of such dynamic features, and its original purpose was the analysis of speaker characteristics for speaker recognition. S. Furui (then Sagayama's supervisor) later applied the feature successfully to DP-based speaker-independent speech recognition and demonstrated a large improvement. Under the name ``delta cepstrum'', it is used extensively in modern HMM-based speech recognition, where delta features (such as the delta cepstrum alongside the cepstrum) are included almost without exception.
    First published: 嵯峨山 茂樹, 板倉 文忠, ``音声の動的尺度に含まれる個人性情報,'' 日本音響学会昭和54年度春季研究発表会講演論文集, 3-2-7, pp. 589-590 (1979-06). (Shigeki Sagayama and Fumitada Itakura, ``On Individuality in a Dynamic Measure of Speech,'' Proc. ASJ Spring Conf. 1979, 3-2-7, pp. 589-590, June 1979.)
  • LSP Analysis Theory (1979-)
    Line Spectrum Pair (LSP) analysis, conceived by Itakura, is a highly efficient speech compression method now used in almost all mobile phones. Collaborating with Itakura, produced a number of results clarifying its theoretical properties, e.g., the duality between LPC and LSP and the theoretical properties of LSP frequency distributions. This work became the subject of the Ph.D. thesis.
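Lag windowing multiplies the autocorrelation by a window before the PARCOR (Levinson-Durbin) recursion. Multiplying $r(l)$ by $w(l)$ convolves the power spectrum with the transform of $w$, which smears the pitch-harmonic fine structure so that the all-pole envelope is pulled less toward individual harmonics. A minimal sketch, assuming a Gaussian window with a 60 Hz bandwidth parameter (a form found in several later LPC-based coders); the exact window of the 1975 patent is not reproduced, and the function names and frame parameters are hypothetical.

```python
import numpy as np

def autocorrelation(x, p):
    """Short-time autocorrelation r[0..p] of one (already windowed) frame x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[: n - l], x[l:]) for l in range(p + 1)])

def gaussian_lag_window(p, fs, bw_hz=60.0):
    """Assumed Gaussian lag window w[l] = exp(-0.5 * (2*pi*bw_hz*l/fs)**2), l = 0..p."""
    l = np.arange(p + 1)
    return np.exp(-0.5 * (2.0 * np.pi * bw_hz * l / fs) ** 2)

def levinson_parcor(r):
    """Levinson-Durbin recursion on r[0..p].

    Returns the PARCOR (reflection) coefficients k[1..p] and the predictor
    coefficients alpha[1..p] of x[t] ~ sum_j alpha[j] * x[t - j].
    """
    p = len(r) - 1
    alpha = np.zeros(p + 1)            # alpha[0] unused (1-based indexing)
    k = np.zeros(p + 1)
    err = r[0]                         # prediction error power
    for i in range(1, p + 1):
        acc = r[i] - np.dot(alpha[1:i], r[i - 1:0:-1])
        k[i] = acc / err
        new = alpha.copy()
        new[i] = k[i]
        new[1:i] = alpha[1:i] - k[i] * alpha[i - 1:0:-1]
        alpha = new
        err *= 1.0 - k[i] ** 2
    return k[1:], alpha[1:]

# Usage on a synthetic 30 ms frame at 8 kHz (hypothetical parameters).
fs, p = 8000, 10
rng = np.random.default_rng(0)
t = np.arange(240)
frame = np.sin(2 * np.pi * 120 * t / fs) + 0.1 * rng.standard_normal(240)
r = autocorrelation(frame * np.hamming(240), p)
r_lw = r * gaussian_lag_window(p, fs)   # lag windowing before the recursion
parcor, alpha = levinson_parcor(r_lw)
```

Because the Gaussian window is a positive-definite sequence, the windowed autocorrelation stays positive definite, so the recursion still yields PARCOR coefficients of magnitude below one.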
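For an all-pole (LPC) model, the cepstrum can be computed from the predictor coefficients by a simple recursion, and the Euclidean distance between cepstral vectors can then serve as the local distance in DP matching. A minimal sketch, assuming the convention $1/(1 - \sum_k \alpha_k z^{-k})$, an unweighted Euclidean distance without $c_0$, and symmetric DTW steps; these are illustrative choices, not the settings of the 1978 system, and the function names are hypothetical.

```python
import numpy as np

def lpc_to_cepstrum(alpha, n_ceps):
    """Cepstrum c[1..n_ceps] of the all-pole model 1 / (1 - sum_k alpha[k] z^-k).

    alpha[0] holds alpha_1; the gain term c[0] is left out.
    """
    p = len(alpha)
    a = np.concatenate(([0.0], np.asarray(alpha, dtype=float)))  # 1-based copy
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]

def cepstral_distance(c1, c2):
    """Euclidean distance between two cepstral vectors."""
    return float(np.linalg.norm(np.asarray(c1) - np.asarray(c2)))

def dtw(ref, test):
    """Accumulated distance of the best DP alignment of two cepstral sequences,
    using the steps (i-1, j), (i, j-1) and (i-1, j-1)."""
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cepstral_distance(ref[i - 1], test[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

For a first-order model with $\alpha_1 = a$, the recursion gives $c_n = a^n / n$, which matches the series expansion of $-\log(1 - a z^{-1})$.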
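Dynamic (delta) features describe how the cepstrum changes over time. A common way to compute them is the least-squares slope over $\pm K$ frames, $\Delta c_t = \sum_{k=1}^{K} k\,(c_{t+k} - c_{t-k}) / (2\sum_{k=1}^{K} k^2)$. This sketch uses that regression form as found in later recognizers; the exact dynamic measure of the 1979 paper is not reproduced, and $K = 2$ and edge replication are assumptions.

```python
import numpy as np

def delta(features, K=2):
    """Regression (least-squares slope) deltas over +/-K frames.

    features: array of shape (T, D), one cepstral vector per frame.
    Edge frames are repeated so the output has the same shape as the input.
    """
    f = np.asarray(features, dtype=float)
    T = f.shape[0]
    padded = np.pad(f, ((K, K), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    out = np.zeros_like(f)
    for k in range(1, K + 1):
        # padded[K + t] is frame t, so these slices are frames t + k and t - k.
        out += k * (padded[K + k : K + k + T] - padded[K - k : K - k + T])
    return out / denom
```

On a cepstral trajectory that rises linearly, the interior deltas equal the slope exactly; a constant trajectory gives zero everywhere.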
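For a minimum-phase $A(z)$ of order $p$, the LSP frequencies are the angles of the roots of $P(z) = A(z) + z^{-(p+1)}A(z^{-1})$ and $Q(z) = A(z) - z^{-(p+1)}A(z^{-1})$; these roots lie on the unit circle and interlace. A minimal numerical sketch using polynomial root finding (production coders often use a grid search instead); `lpc_to_lsp` is a hypothetical name, and none of the theoretical results mentioned above are reproduced here.

```python
import numpy as np

def lpc_to_lsp(a):
    """LSP frequencies (radians, ascending) of A(z) = a[0] + a[1] z^-1 + ... + a[p] z^-p, a[0] = 1."""
    a = np.asarray(a, dtype=float)
    ext = np.concatenate((a, [0.0]))   # coefficients of A(z) padded to order p + 1
    P = ext + ext[::-1]                # A(z) + z^-(p+1) A(1/z)
    Q = ext - ext[::-1]                # A(z) - z^-(p+1) A(1/z)
    freqs = []
    for poly in (P, Q):
        ang = np.angle(np.roots(poly))
        # Keep one root of each conjugate pair; drop the trivial roots at z = +1 and z = -1.
        freqs.extend(ang[(ang > 1e-7) & (ang < np.pi - 1e-7)])
    return np.sort(np.array(freqs))
```

For $A(z) = 1 - 0.9z^{-1}$, $P(z) = 1 - 1.8z^{-1} + z^{-2}$ and $Q(z) = 1 - z^{-2}$, so the single LSP frequency is $\arccos 0.9$.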

Speech Analysis and Synthesis

  • Composite Sinusoidal Modeling (CSM) (1979)
    Proposed modeling speech as a sum of $n$ sinusoids whose amplitudes and frequencies are chosen so that the autocorrelations of the speech signal and of the model agree at the lowest lags, $0$ through $2n-1$. The model frequencies were proven to be equivalent to the Line Spectrum Pair frequencies. Applied in Yamaha's best-selling sound IC, the CSM speech synthesis patent earned the most revenue of all NTT patents for several years.
    First published: 嵯峨山 茂樹, 板倉 文忠, ``複合正弦波モデルによる音声分析,'' 電子情報通信学会 情報・システム部門別 全国大会予稿集, 63, p. 63, 1979.; 嵯峨山 茂樹, 板倉 文忠, ``複合正弦波による簡易な音声合成法,'' 日本音響学会昭和54年度秋季研究発表会講演論文集, 3-2-3, pp. 557-558, Oct. 1979.
  • LSP Speech Synthesizer LSI (1980)
    Designed an LSP speech synthesizer LSI in collaboration with Fujitsu's telephone switching hardware team. It was the first LSI for LSP speech synthesis and also the first CMOS LSI for speech synthesis.
    First published: 嵯峨山 茂樹, 管村 昇, 板倉 文忠, 小池 恒彦, ``線スペクトル対パラメータによる音声合成器LSIとその応用,'' 電子通信学会 通信部門 全国大会予稿集, S5-5, pp. 2-393-394, 1980.
  • Japanese Text-to-Speech System (1982)
    Collaborated with H. Sato, Y. Sagisaka, and K. Kogure to build the first Japanese text-to-speech system.
    First published: 佐藤 大和, 匂坂 芳典, 小暮 潔, 嵯峨山 茂樹, ``日本語テキストからの音声合成,'' 電子情報通信学会全国大会予稿集, S6-3, Vol. 5, pp. 399-400, 1982.
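With $n$ sinusoids of amplitudes $m_i$ and frequencies $\omega_i$, the model autocorrelation is $r_l = \sum_i m_i \cos(\omega_i l)$, and matching lags $0$ through $2n-1$ gives $2n$ equations for $2n$ unknowns. Writing $x_i = \cos\omega_i$ and using the Chebyshev identity $\cos(\omega l) = T_l(\cos\omega)$, the problem reduces to recovering $n$ nodes and weights from $2n$ power moments, much like Gauss quadrature or Prony's method. The sketch takes that route; it is one possible numerical procedure under these assumptions, not necessarily the one in the original papers, and `csm_analysis` is a hypothetical name.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def csm_analysis(r):
    """Fit r[l] = sum_i m[i] * cos(w[i] * l) for l = 0..2n-1.

    Returns frequencies w (radians/sample, ascending) and amplitudes m.
    """
    r = np.asarray(r, dtype=float)
    if r.size % 2:
        raise ValueError("need an even number 2n of autocorrelation lags")
    n = r.size // 2
    # Power moments mu[j] = sum_i m[i] * x[i]**j with x[i] = cos(w[i]):
    # expand x**j in Chebyshev polynomials and use T_l(cos w) = cos(l * w).
    mu = np.empty(2 * n)
    for j in range(2 * n):
        e = np.zeros(j + 1)
        e[j] = 1.0
        c = C.poly2cheb(e)                 # x**j = sum_l c[l] * T_l(x)
        mu[j] = c @ r[: c.size]
    # The x[i] are the roots of P(x) = x**n + sum_k p[k] x**k, where
    # sum_k p[k] * mu[j + k] = -mu[j + n] for j = 0..n-1 (Hankel system).
    H = np.array([[mu[j + k] for k in range(n)] for j in range(n)])
    p = np.linalg.solve(H, -mu[n:])
    x = np.roots(np.concatenate(([1.0], p[::-1]))).real
    w = np.arccos(np.clip(x, -1.0, 1.0))
    # Amplitudes from the first n moments: sum_i m[i] * x[i]**j = mu[j].
    V = np.vander(x, n, increasing=True).T
    m = np.linalg.solve(V, mu[:n])
    order = np.argsort(w)
    return w[order], m[order]
```

Feeding in the exact autocorrelation of two sinusoids recovers their frequencies and amplitudes; with real speech autocorrelations the same steps give the CSM parameters of the frame.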

Phone Modeling

  • Tree-based Allophone Clustering (1987)
    Proposed a tree-based clustering technique for context-dependent phones. After it was presented in English in 1989 (ICASSP89, Glasgow), the idea was adopted and modified by Kai-Fu Lee et al. in 1990 and used extensively by IBM and Cambridge University. It is now commonly used in modern high-performance phoneme-based speech recognition.
    First published: 嵯峨山 茂樹, ``音素環境のクラスタリング,'' 音学講論, 1-5-15, pp. 29-30, Oct. 1987. (Shigeki Sagayama, ``Phoneme Environment Clustering,'' Proc. ASJ Fall Conf., 1-5-15, pp. 29-30, Oct. 1987.)
  • Hidden Markov Network (HMnet) (1991)
    Combined the tree-based clustering above with state tying to represent context-dependent phones. The Successive State Splitting (SSS) algorithm was also proposed to obtain the network structure of allophones automatically. HMnet and the SSS algorithm have been applied extensively to speech recognition and to other areas (e.g., protein analysis and automatic grammar acquisition).
    First published: 鷹見 淳一, 嵯峨山 茂樹, ``逐次状態分割法(SSS)による隠れマルコフネットワークの自動生成,'' 日本音響学会平成3年度秋季研究発表会講演論文集, 2-5-13, pp. 73-74, Oct. 1991. (Jun-ichi Takami and Shigeki Sagayama, ``Automatic Generation of the Hidden Markov Network by the Successive State Splitting Algorithm,'' Proc. ASJ Fall Conf., 2-5-13, pp. 73-74, Oct. 1991.)
  • Context-Dependent HMM-LR Continuous Speech Recognition (1996)
    Context-dependent HMMs (HMnet) were combined with a generalized LR parser for continuous speech recognition using a given context-free grammar (CFG).
    First published: 永井 明人, 鷹見 淳一, 嵯峨山 茂樹, ``環境依存連続HMMを用いたHMM-LR連続音声認識,'' 日本音響学会平成3年度秋季研究発表会講演論文集, 1-5-20, pp. 39-40, Oct. 1991.
  • Four-Layer Tied-Structure HMM (1996)
    Parameters are tied at four levels: allophones, states, Gaussians, and scalar parameters. This is advantageous for training with a small amount of data and for fast likelihood calculation. Presented in English at ICASSP95.
    First published: 高橋 敏, 嵯峨山 茂樹, ``4 階層の共有構造を持つ音素環境依存 HMM の検討,'' 日本音響学会平成6年度秋期研究発表会講演論文集, 3-8-3, pp. 113-114, Oct. 1994. (Satoshi Takahashi, Shigeki Sagayama, ``A Study of Context-Dependent HMMs with Four-Level Tied Structure,'' Proc. ASJ Fall Conf., 3-8-3, pp. 113-114, Oct. 1994.)
  • Discrete Mixture HMM (1996)
    Mixture components are replaced by discrete (scalar-quantized) distributions to represent complex, non-Gaussian distributions. Presented in English at ICASSP97.
    First published: 高橋 敏, 嵯峨山 茂樹, ``離散混合出力分布型HMM,'' 日本音響学会平成8年度秋期研究発表会講演論文集, -, pp. -, Sep. 1996. (Satoshi Takahashi, Shigeki Sagayama, ``Discrete Mixture Output Distribution HMM,'' Proc. ASJ Fall Conf., -, pp. -, Sep. 1996.)
  • Asynchronous-Transition HMM (1999)
    Proposed a new HMM structure in which state transitions are not synchronized across features.
    First published: 松田繁樹, 中井満, 下平博, 嵯峨山茂樹, ``非同期遷移型HMM,'' 平成11年日本音響学会秋季研究発表会講演論文集, 1-1-12, pp. 23-24, Oct. 1999.
  • Multiple Linear-Regression HMM (2000)
    Proposed a new HMM structure in which mean vectors depend linearly on observable factors such as pitch frequency and power. Presented in English at ICASSP2001.
    First published: 藤永 勝久, 中井 満, 下平 博, 嵯峨山 茂樹, ``重回帰HMMによる音声のモデル化,'' 平成12年電気関係学会北陸支部大会講演論文集, G-15, p. 458, Sep. 2000.
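Top-down context clustering can be sketched as greedy binary splitting: at each node, choose the yes/no question about the phonetic context that most increases the likelihood of the data under one Gaussian per cluster, and stop when the gain falls below a threshold. A minimal sketch, assuming 1-D Gaussian statistics per context, a likelihood-gain criterion, and hand-made questions; the contexts, questions, and threshold are hypothetical, and the criterion of the 1987 method may differ.

```python
import math
from dataclasses import dataclass

@dataclass
class Stats:
    """Sufficient statistics of 1-D frames pooled over some contexts."""
    n: float = 0.0
    s: float = 0.0    # sum of x
    ss: float = 0.0   # sum of x**2

    def __add__(self, other):
        return Stats(self.n + other.n, self.s + other.s, self.ss + other.ss)

def stats_of(xs):
    return Stats(len(xs), float(sum(xs)), float(sum(x * x for x in xs)))

def loglik(st, var_floor=1e-3):
    """Log-likelihood of pooled data under one ML-fitted Gaussian."""
    if st.n == 0:
        return 0.0
    mean = st.s / st.n
    var = max(st.ss / st.n - mean * mean, var_floor)
    return -0.5 * st.n * (math.log(2.0 * math.pi * var) + 1.0)

def pooled(contexts, stats):
    total = Stats()
    for c in contexts:
        total = total + stats[c]
    return total

def grow_tree(contexts, stats, questions, min_gain=1.0):
    """Greedy top-down split of contexts by yes/no questions.

    Returns a leaf (sorted list of contexts) or a tuple (question, yes_subtree, no_subtree).
    """
    base = loglik(pooled(contexts, stats))
    best = None
    for name, members in questions.items():
        yes = [c for c in contexts if c in members]
        no = [c for c in contexts if c not in members]
        if not yes or not no:
            continue
        gain = loglik(pooled(yes, stats)) + loglik(pooled(no, stats)) - base
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    if best is None or best[0] < min_gain:
        return sorted(contexts)
    _, name, yes, no = best
    return (name,
            grow_tree(yes, stats, questions, min_gain),
            grow_tree(no, stats, questions, min_gain))
```

Because each node only needs pooled counts, sums, and sums of squares, candidate questions can be scored without revisiting the frames, which is what makes likelihood-based tree growing cheap.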
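In the multiple-regression HMM, a state's Gaussian mean is a linear function of auxiliary factors $\xi_t$ observed with each frame, $\mu_t = W[1, \xi_t^\top]^\top$. A minimal sketch, assuming diagonal covariances, so that the ML update of $W$ given state occupancies $\gamma_t$ reduces to weighted least squares per dimension; $W$, the factor vector, and the function names are illustrative, and the full HMM training loop is omitted.

```python
import numpy as np

def mr_mean(W, z):
    """State mean as a linear function of observable factors z: mu = W @ [1, z]."""
    return W @ np.concatenate(([1.0], np.atleast_1d(z)))

def mr_log_likelihood(o, W, z, var):
    """Diagonal-covariance Gaussian log-density of o with a factor-dependent mean."""
    d = np.asarray(o, dtype=float) - mr_mean(W, z)
    var = np.asarray(var, dtype=float)
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + d * d / var))

def mr_estimate_W(O, Z, gamma):
    """Weighted least-squares (ML) estimate of W from frames O (T, D),
    factors Z (T, K) and state occupancies gamma (T,), as in an EM M-step."""
    Z = np.asarray(Z, dtype=float).reshape(len(Z), -1)
    X = np.hstack([np.ones((len(Z), 1)), Z])             # (T, K + 1)
    G = np.asarray(gamma, dtype=float)[:, None]
    A = X.T @ (G * X)                                    # (K + 1, K + 1)
    B = X.T @ (G * np.asarray(O, dtype=float))           # (K + 1, D)
    return np.linalg.solve(A, B).T                       # (D, K + 1)
```

For example, with normalized log-F0 and power as the two factors, $W$ has one intercept column and two regression columns per feature dimension.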

Speaker & Noise Modeling & Adaptation

  • Vector Field Smoothing (1992)
    Proposed a speaker adaptation method that spatially smooths, in the feature space, the difference vectors between the original Gaussian mean vectors and those re-estimated from adaptation data. This was the first method that made speaker adaptation of Gaussian mixture HMMs possible. It has been used extensively in Japan.
    First published: 大倉 計美, 杉山 雅英, 嵯峨山 茂樹, ``混合連続分布HMMを用いた移動ベクトル場平滑化話者適応方式,'' 日本音響学会平成4年度春季研究発表会講演論文集, 2-Q-17, pp. 191-192, Mar. 1992. (Kazumi Ohkura, Masahide Sugiyama and Shigeki Sagayama, ``Speaker Adaptation Based on Transfer Vector Field Smoothing Model with Continuous Mixture Density HMMs,'' Proc. ASJ Spring Conf., 2-Q-17, pp. 191-192, Mar. 1992.)
  • Speaker-Tied Mixture (1992)
    A Gaussian mixture is derived from speaker-dependent single-Gaussian phone (allophone) models. This model was later used for rapid speaker adaptation, in which the speaker mixture weights are adapted using an extremely small amount of training data (e.g., a single word).
    First published: 小坂 哲夫, 鷹見 淳一, 嵯峨山 茂樹, ``話者混合SSSによる不特定話者音声認識,'' 日本音響学会平成4年度秋季研究発表会講演論文集, 2-5-9, pp. 135-136, Oct. 1992. (T. Kosaka, J. Takami and S. Sagayama, ``Speaker-Independent Speech Recognition Using Speaker-Mixture SSS Algorithm,'' Proc. ASJ Fall Conf., 2-5-9, pp. 135-136, Oct. 1992.)
  • Speaker Tree (1993)
    Applied tree-based clustering to speakers to obtain a speaker tree that spans from a speaker-independent model to speaker-dependent models.
    First published: 小坂 哲夫, 松永 昭一, 嵯峨山 茂樹, ``木構造クラスタリングを用いた話者適応,'' 日本音響学会平成5年度秋期研究発表会講演論文集, 2-7-14, pp. 97-98, Oct. 1993.
  • MAP/VFS for Speaker Adaptation (1994)
    MAP (maximum a posteriori) estimation and VFS (vector field smoothing) are combined to accelerate speaker adaptation.
    First published: 高橋 淳一, 嵯峨山 茂樹, ``最大事後確率推定と移動ベクトル場平滑化を組み合わせによる高速話者適応,'' 日本音響学会平成6年度秋期研究発表会講演論文集, 2-8-19, pp. 75-76, Oct. 1994. (Jun-ichi Takahashi, Shigeki Sagayama, ``Vector-Field-Smoothed Bayesian Learning for Fast Speaker Adaptation,'' Proc. ASJ Fall Conf., 2-8-19, pp. 75-76, Oct. 1994.)
  • Jacobian Adaptation (1996)
    Proposed a fast method for adapting models to environmental noise. When a model trained in advance for noise A is adapted to a target noise B that is relatively close to A, the adaptation can be linearized and performed very quickly. The idea was first formulated at ATR in 1992, inspired by M. Gales's PMC (parallel model combination). The first experimental results were obtained in 1996 and presented in English at ICASSP97 (Munich). Related studies have appeared frequently at recent ICASSP and ICSLP conferences.
    First published: 山口義和, 高橋淳一, 高橋敏, 嵯峨山茂樹, ``Taylor展開に基づく高速な音響モデル適応法,'' 日本音響学会平成8年度秋季研究発表会講演論文集, -, pp. -, Sep. 1996.
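The MAP/VFS combination can be sketched in two steps: MAP re-estimation of each Gaussian mean that received adaptation data, followed by VFS, in which the resulting transfer vectors are interpolated to Gaussians without data and smoothed among those with data, using the original means' positions in feature space. A minimal sketch, assuming per-Gaussian occupancy counts and frame sums are already available, a prior weight $\tau$, and exponential distance weights over the $k$ nearest observed Gaussians; $\tau$, $k$, $f$, the weighting, and the function names are illustrative assumptions, not the published settings.

```python
import numpy as np

def map_means(mu0, counts, sums, tau=10.0):
    """MAP mean update per Gaussian: (tau * mu0 + sum of frames) / (tau + count)."""
    counts = np.asarray(counts, dtype=float)[:, None]
    return (tau * np.asarray(mu0, dtype=float) + sums) / (tau + counts)

def vfs_adapt(mu0, mu_hat, observed, k=3, f=1.0):
    """Vector field smoothing of transfer vectors v = mu_hat - mu0.

    Each Gaussian (observed or not) receives a weighted average of the transfer
    vectors of its k nearest observed Gaussians, with weights exp(-d^2 / f).
    For an observed Gaussian this smooths its own vector with its neighbours';
    for an unobserved one it interpolates. Very small f can underflow the weights.
    """
    mu0 = np.asarray(mu0, dtype=float)
    v = np.asarray(mu_hat, dtype=float) - mu0   # meaningful only where observed
    obs = np.flatnonzero(observed)
    out = np.empty_like(mu0)
    for j in range(len(mu0)):
        d2 = np.sum((mu0[obs] - mu0[j]) ** 2, axis=1)
        nn = np.argsort(d2)[:k]                 # k nearest observed Gaussians
        w = np.exp(-d2[nn] / f)
        out[j] = mu0[j] + w @ v[obs[nn]] / w.sum()
    return out
```

When every observed Gaussian moves by the same vector, MAP shrinks that move by $n/(\tau + n)$, and VFS then propagates the shrunken move unchanged to the Gaussian that saw no data.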
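In the log-spectral domain, additive noise combines with speech as $y = \log(e^{x} + e^{n})$, so $\partial y / \partial n = e^{n - y}$, which can be computed from the noisy model and the noise alone. Jacobian adaptation keeps the model trained for noise A and adds $J\,(n_B - n_A)$, a first-order Taylor step. A minimal sketch, assuming per-channel log-spectral means and a purely additive noise model; in a cepstral front end the Jacobian would additionally be transformed by the DCT, which is omitted here, and the function names are hypothetical.

```python
import numpy as np

def noisy_logspec(x, n):
    """Log spectrum of speech plus additive noise: y = log(exp(x) + exp(n))."""
    return np.logaddexp(x, n)

def jacobian(y_ref, n_ref):
    """dy/dn at the reference condition, per channel: exp(n) / (exp(x) + exp(n)) = exp(n - y)."""
    return np.exp(np.asarray(n_ref) - np.asarray(y_ref))

def jacobian_adapt(y_ref, n_ref, n_target):
    """First-order Taylor update of a noisy-speech mean from noise A to noise B."""
    return y_ref + jacobian(y_ref, n_ref) * (n_target - n_ref)

x = np.array([0.0, 1.0, 2.0, 3.0])   # clean log-spectral mean (toy values)
n_a = np.full(4, 0.5)                # reference noise A
n_b = n_a + 0.05                     # nearby target noise B
y_a = noisy_logspec(x, n_a)          # model trained for noise A
y_jac = jacobian_adapt(y_a, n_a, n_b)
y_exact = noisy_logspec(x, n_b)      # what full re-computation would give
```

The linear step only needs an element-wise product per mean, which is why it is much faster than recombining clean-speech and noise models; its error grows with the distance between A and B.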

Online Handwriting Recognition

  • HMM-based Online Handwritten Kanji Character Recognition with a Structured Lexicon (2000)
    Online handwritten Kanji characters are recognized within a continuous speech recognition framework, in which a 6,500-Kanji lexicon is hierarchically structured and each character is represented by a sequence of substrokes. This was the first application of a continuous speech recognition algorithm to handwriting. Presented in English at ICDAR2001, IWFHR2002, and ICPR2002.
    First published: 秋良 直人, 中井 満, 下平 博, 嵯峨山 茂樹, ``ストロークHMMによるオンライン文字認識の特徴量の検討,'' 平成12年電気関係学会北陸支部大会講演論文集, F-92, p. 393, Sep. 2000.

Music Information Processing

  • HMM-based Music Transcription from MIDI Signals (1999)
    The observed sequence of note durations (inter-onset intervals) is transcribed into musical notes by an HMM-based note recognizer, using a grammar that probabilistically models sequences of musical notes. To be presented in English at MMSP2002.
    First published: 齋藤直樹, 中井満, 下平博, 嵯峨山茂樹, ``隠れマルコフモデルによる音楽演奏情報からの音符列推定,'' 平成11年電気関係学会北陸支部大会講演論文集, Oct. 1999.
  • HMM-based Harmonization of Given Melodies (1999)
    Finding the optimal chords was formulated as finding the Viterbi sequence of hidden states that generates the observed melody. With a bigram grammar of chord sequences, decoding yields the most likely chord sequence in the maximum-likelihood sense.
    First published: 川上隆, 中井満, 下平博, 嵯峨山茂樹, ``隠れマルコフモデルを用いた旋律への和声付け,'' 平成11年電気関係学会北陸支部大会講演論文集, Oct. 1999.
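The harmonization model can be sketched as a standard HMM decoded with the Viterbi algorithm: hidden states are chords, each state emits the melody note heard during its slot, and a chord bigram supplies the transition probabilities. A minimal sketch; the three-chord vocabulary, the one-note-per-chord alignment, and all probabilities are toy assumptions, not the model used in the original work.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, obs):
    """Most likely hidden state sequence.

    log_init: (S,), log_trans: (S, S) indexed [previous, next],
    log_emit: (S, V), obs: sequence of observation symbol indices.
    """
    S, T = len(log_init), len(obs)
    delta = log_init + log_emit[:, obs[0]]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans            # scores[prev, next]
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_emit[:, obs[t]]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy model (assumed): three chords, pitch-class emissions, chord bigram.
CHORDS = {"C": {0, 4, 7}, "F": {5, 9, 0}, "G": {7, 11, 2}}
names = list(CHORDS)
emit = np.array([[0.25 if pc in tones else 0.25 / 9 for pc in range(12)]
                 for tones in CHORDS.values()])        # chord tones 3 x 0.25, the other 9 share 0.25
trans = np.full((3, 3), 0.3) + 0.1 * np.eye(3)         # staying on a chord slightly favored
melody = [4, 4, 9, 5, 11, 2, 4]                        # E E A F B D E, one note per chord slot
path = viterbi(np.log(np.full(3, 1.0 / 3)), np.log(trans), np.log(emit), melody)
print([names[i] for i in path])                        # ['C', 'C', 'F', 'F', 'G', 'G', 'C']
```

Here each melody note is a chord tone of exactly one chord, and the emission preference for chord tones outweighs any saving from staying on one chord, so the decoded harmonization follows the chord tones.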

Projects

  • Anthropomorphic Dialogue Agent Toolkit (2000-2002)
    A project involving 17 researchers from 10 organizations is in progress to provide an open-source, free-of-charge toolkit that promotes spoken dialogue research. Funded by IPA (Information Processing Promotion Agency) for 3 years.
  • Written Communication for the Blind (2000-2002)
    A project combining handwritten character recognition, speech synthesis, and new transducers for written input (replacing the stylus) is in progress to support written communication (e-mail, etc.) by blind people, in collaboration with 7 organizations. Funded by Ishikawa Prefecture for 3 years.
