Research Achievements

世木 寛之

セギ ヒロユキ  (Hiroyuki SEGI)

Basic Information

Affiliation
Professor, Department of Science and Technology, Faculty of Science and Technology, Seikei University
Degree
Doctor of Engineering (Keio University)

J-GLOBAL ID
201501025877783683
researchmap Member ID
B000244685

Research Keywords

 2

Papers

 25
  • Ai Mizota, Hiroyuki Segi
    2021 IEEE International Conference on Consumer Electronics (ICCE) 2021年1月10日  査読有り
  • Hiroyuki Segi, Shoei Sato, Kazuo Onoe, Akio Kobayashi, Akio Ando
    Artificial Intelligence: Concepts, Methodologies, Tools, and Applications 3 2021-2037 2016年12月12日  査読有り
    Tied-mixture HMMs have been proposed as acoustic models for large-vocabulary continuous speech recognition and have yielded promising results. They share base distributions and provide more flexibility in choosing the degree of tying than state-clustered HMMs. However, it has been unclear which acoustic model is superior when trained on the same data. Moreover, the LBG algorithm and the EM algorithm, the usual training methods for these HMMs, have not been compared. In this paper, therefore, the recognition performance of the respective HMMs and the respective training methods is compared under the same conditions. It was found that the number of parameters and the word error rate of the two HMMs are equivalent when the number of codebooks is sufficiently large. It was also found that training with the LBG algorithm achieves a 90% reduction in training time compared with training with the EM algorithm, without degrading recognition accuracy. (A schematic sketch of LBG-style codebook training is given after this publication list.)
  • Segi Hiroyuki
    INTERNATIONAL JOURNAL OF MULTIMEDIA DATA ENGINEERING & MANAGEMENT 7(2) 53-67 2016年4月  査読有り
  • 世木寛之
    成蹊大学理工学研究報告 52(2) 5-10 2015年12月  
    The 'Kabushiki Shikyo' program broadcast on NHK Radio 2 reports on the daily closing prices and net changes of about 830 stocks listed on the Tokyo Stock Exchange. Reading out the numerical values without making mistakes within the allotted broadcast time can be extremely difficult for the announcers. We have therefore developed an automatic broadcast system for stock-price bulletins, which uses numerical speech synthesis and automatic speech-rate conversion. Our system has been used in experimental digital terrestrial radio broadcasts since October 2006 and on NHK Radio 2 since March 2010. This article describes the generation of texts to build the speech waveform database, the mechanism used to synthesize numerical speech via the database, and the evaluation of the naturalness of the synthesized speech samples.
  • Hiroyuki Segi, Kazuo Onoe, Shoei Sato, Akio Kobayashi, Akio Ando
    Journal of Information Technology Research 7(3) 15-31 2014年7月1日  査読有り
    Tied-mixture HMMs have been proposed as acoustic models for large-vocabulary continuous speech recognition and have yielded promising results. They share base distributions and provide more flexibility in choosing the degree of tying than state-clustered HMMs. However, it has been unclear which acoustic model is superior when trained on the same data. Moreover, the LBG algorithm and the EM algorithm, the usual training methods for these HMMs, have not been compared. In this paper, therefore, the recognition performance of the respective HMMs and the respective training methods is compared under the same conditions. It was found that the number of parameters and the word error rate of the two HMMs are equivalent when the number of codebooks is sufficiently large. It was also found that training with the LBG algorithm achieves a 90% reduction in training time compared with training with the EM algorithm, without degrading recognition accuracy.
  • Hiroyuki Segi, Reiko Takou, Nobumasa Seiyama, Tohru Takagi, Yuko Uematsu, Hideo Saito, Shinji Ozawa
    IEEE TRANSACTIONS ON BROADCASTING 59(3) 548-555 2013年9月  査読有り
    Here we describe a speech-synthesis method using templates that can generate recording-sentence sets for speech databases and produce natural sounding synthesized speech. Applying this method to the Japan Broadcasting Corporation (NHK) weather report radio program reduced the size of the recording-sentence set required to just a fraction of that needed by a comparable method. After integrating the recording voice of the generated recording-sentence set into the speech database, speech was produced by a voice synthesizer using templates. In a paired-comparison test, 66 % of the speech samples synthesized by our system using templates were preferred to those produced by a conventional voice synthesizer. In an evaluation test using a five-point mean opinion score (MOS) scale, the speech samples synthesized by our system scored 4.97, whereas the maximum score for commercially available voice synthesizers was 3.09. In addition, we developed an automatic broadcast system for the weather report program using the speech-synthesis method and speech-rate converter. The system was evaluated using real weather data for more than 1 year, and exhibited sufficient stability and synthesized speech quality for broadcast purposes.
  • Reiko Takou, Hiroyuki Segi, Tohru Takagi, Nobumasa Seiyama
    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E95A(4) 751-759 2012年4月  査読有り
    The frequency regions and spectral features that can be used to measure the perceived similarity and continuity of voice quality are reported here. A perceptual evaluation test was conducted to assess the naturalness of spoken sentences in which either a vowel or a long vowel of the original speaker was replaced by that of another. Correlation analysis between the evaluation score and the spectral feature distance was conducted to select the spectral features that were expected to be effective in measuring the voice quality and to identify the appropriate speech segment of another speaker. The mel-frequency cepstrum coefficient (MFCC) and the spectral center of gravity (COG) in the low-, middle-, and high-frequency regions were selected. A perceptual paired comparison test was carried out to confirm the effectiveness of the spectral features. The results showed that the MFCC was effective for spectra across a wide range of frequency regions, the COG was effective in the low- and high-frequency regions, and the effective spectral features differed among the original speakers.
  • 世木寛之
    慶應義塾大学 2012年3月  査読有り
  • Hiroyuki Segi, Reiko Takou, Nobumasa Seiyama, Tohru Takagi, Hideo Saito, Shinji Ozawa
    2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING 1757-1760 2011年  査読有り
    Here we propose a sentence-generation method using templates that can be applied to create a speech database. This method requires the recording of a relatively small sentence set, and the resultant speech database can generate comparatively natural sounding synthesized speech. Applying this method to the Japan Broadcasting Corporation (NHK) weather report radio program reduced the size of the required sentence set to just a fraction of that required by comparable methods. We also propose a speech-synthesis method using templates. In an evaluation test, 66% of the speech samples synthesized by the proposed method using templates were preferred to those produced by the conventional concatenative speech-synthesis method.
  • 世木 寛之, 田高 礼子, 清山 信正, 都木 徹, 斎藤 英雄, 小澤 愼治
    映像情報メディア学会誌 : 映像情報メディア = The journal of the Institute of Image Information and Television Engineers 65(1) 76-83 2011年  査読有り
    The design method of a sentence set for a speech-synthesis database strongly influences the quality of the synthesized speech. To minimize the costs associated with making the speech recordings and constructing the speech database, the size of the sentence set should be limited. However, if a sentence set does not include sufficient data, the quality of the synthesized speech can be inadequate. In this paper, we propose a method for generating a sentence set from templates. When applied to the templates in the "Weather Report" radio program, the proposed method reduced the size of the sentence set to a few percent of that required by a comparison method. In addition, the mean opinion score of speech samples synthesized using the proposed method was 4.32 on a five-point scale.
  • Hiroyuki Segi, Reiko Takou, Nobumasa Seiyama, Tohru Takagi
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS 56(1) 169-174 2010年2月  査読有り
    Here we propose a prototype data-broadcast receiver equipped with a voice synthesizer, which can read out stock prices and stock-price changes from a live data broadcast. Using this receiver, listeners can access their chosen stock information at any time and at an appropriate speech rate. We also propose a high-quality voice synthesizer for use with this receiver. A subjective evaluation confirmed the superiority of this voice synthesizer over commercially available ones.
  • Hiroyuki Segi, Reiko Tako, Nobumasa Seiyama, Tohru Takagi
    2010 DIGEST OF TECHNICAL PAPERS INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS ICCE 9.1-4 411-412 2010年  査読有り
    Here we propose a prototype data-broadcast receiver equipped with a voice synthesizer, which can use the stock prices and the stock-price changes from a live data broadcast. Using this receiver, listeners can access their chosen stock information at any time and at a speech rate appropriate to each individual. We also propose a high-quality voice synthesizer for use with this receiver. The results of a subjective evaluation confirmed the superiority of the proposed voice synthesizer compared with commercially available voice synthesizers.
  • 世木寛之, 田高礼子, 清山信正, 都木徹
    情報処理学会論文誌ジャーナル(CD-ROM) 50(2) 575-586 2009年2月  査読有り
  • 田高 礼子, 世木 寛之, 清山 信正, 都木 徹
    映像情報メディア学会冬季大会講演予稿集 2008 7-9-1 2008年  
    We are developing a tool that synthesizes speech by concatenating word-level speech segments and then corrects the resulting degradation so as to reach broadcast quality. In this study, several correction functions were introduced into the tool, making it possible to investigate better correction procedures for generating high-quality synthesized speech.
  • 世木 寛之, 清山 信正, 田高 礼子, 都木 徹, 大出 訓史, 今井 篤, 西脇 正通, 小山 隆二
    映像情報メディア学会誌 : 映像情報メディア 62(1) 69-76 2008年  査読有り
    The 'Kabushiki Shikyo' program, broadcast on NHK Radio 2, reports on the daily closing prices and net changes of about 830 stocks listed on the Tokyo Stock Exchange. Reading out the numerical values within the allotted broadcast time without making mistakes can be extremely difficult for the announcers. We have therefore developed a prototype voice synthesizer for stock-price bulletins, which uses numerical speech synthesis and automatic speech-rate conversion. Our prototype system has been used in experimental digital terrestrial radio broadcasts since October 2006. This article describes the generation of texts to build the speech waveform database, the mechanism used to synthesize numerical speech via the database, the evaluation of naturalness of synthesized speech samples, and the prototype system currently being used by experimental digital terrestrial radio.
  • Segi Hiroyuki, Seiyama Nobumasa, Tako Reiko, Takagi Tohru, Toda Hideo, Koyama Ryuji
    National Association of Broadcasters Proceedings (NAB) Broadcasting Engineering Conference 205-212 2007年4月  査読有り
  • S Sato, H Segi, K Onoe, E Miyasaka, H Isono, T Imai, A Ando
    ELECTRONICS AND COMMUNICATIONS IN JAPAN PART III-FUNDAMENTAL ELECTRONIC SCIENCE 88(2) 41-51 2005年  査読有り
    In speech recognition systems where the speaker and utterance environment cannot be designated, the drop in recognition precision due to the mismatch between the input speech and the acoustic model's training data is a problem. Although this problem is normally solved by speaker adaptation, sufficient precision cannot be achieved unless good-quality adaptation data can be obtained. In this paper, the authors propose a method of efficiently clustering large-scale data, using the likelihoods of cluster models created from small-scale data as the criterion, to obtain high-precision adapted acoustic models. They also propose a method of using the cluster models to automatically determine the adapted acoustic model during recognition from only the beginnings of the sentences of the input speech. The results of applying the proposed technique to news speech recognition experiments show that the adapted-acoustic-model selection precision can be ensured using only 0.5 seconds of data from the beginning of each sentence, and that the proposed technique achieves a 20% reduction in recognition errors and a 23% reduction in recognition time compared with the case where per-cluster adapted acoustic models are not used. (C) 2004 Wiley Periodicals, Inc. (A minimal sketch of this likelihood-based model selection is given after this publication list.)
  • Hiroyuki Segi, Tohru Takagi, Takayuki Ito
    Fifth ISCA ITRW on Speech Synthesis, Pittsburgh, PA, USA, June 14-16, 2004 115-120 2004年  査読有り
  • K Onoe, H Segi, T Kobayakawa, S Sato, S Homma, T Imai, A Ando
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS E86D(3) 483-488 2003年3月  査読有り
    In this paper, we propose a new technique of filter-bank subtraction for robust speech recognition under various acoustic conditions. Spectral subtraction is a simple and useful technique for reducing the influence of additive noise. Conventional spectral subtraction assumes accurate estimation of the noise spectrum and no correlation between speech and noise. Those assumptions, however, are rarely satisfied in reality, leading to degradation of speech recognition accuracy. Moreover, the recognition improvement attained by conventional methods is slight when the input SNR changes sharply. We propose a new method in which the output values of the filter banks are used for noise estimation and subtraction. By estimating noise at each filter bank, instead of at each frequency point, the method alleviates the need for precise estimation of noise. We also take into consideration the expected phase differences between the spectra of speech and noise in the subtraction and control a subtraction coefficient theoretically. Recognition experiments on test sets at several SNRs showed that the filter-bank subtraction technique significantly improved word accuracy and gave better results than conventional spectral subtraction on all test sets. In other experiments on recognizing speech from TV news field reports with environmental noise, the proposed subtraction method also yielded better results than the conventional method. (A simplified sketch of this per-channel subtraction is given after this publication list.)
  • A Ando, T Imai, A Kobayashi, S Homma, J Goto, N Seiyama, T Mishima, T Kobayakawa, S Sato, K Onoe, H Segi, A Imai, A Matsui, A Nakamura, H Tanaka, T Takagi, E Miyasaka, H Isono
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS E86D(1) 15-25 2003年1月  査読有り
    There is a strong demand to expand captioned broadcasting for TV news programs in Japan. However, keyboard entry of caption manuscripts for news programs cannot keep pace with the speed of speech because, in the case of Japanese, it takes time to select the correct characters from among homonyms. In order to implement simultaneous subtitled broadcasting for Japanese news programs, a simultaneous subtitling system based on speech recognition has been developed. This system consists of a real-time speech recognition system that handles broadcast news transcription and a recognition-error correction system that manually corrects mistakes in the recognition result with a short delay. NHK started simultaneous subtitled broadcasting for the news program "News 7" on the evening of March 27, 2000.
  • 佐藤 庄衛, 世木 寛之, 尾上 和穂, 宮坂 栄一, 磯野 春雄, 今井 亨, 安藤 彰男
    電子情報通信学会論文誌. D-II, 情報・システム, II-パターン処理 85(2) 174-183 2002年2月  査読有り
    In speech recognition systems where the speaker and the utterance environment cannot be specified, degradation of recognition accuracy caused by the mismatch between the acoustic model's training data and the input speech is a problem. This problem is usually addressed by speaker adaptation, but speaker adaptation cannot achieve sufficient accuracy unless good-quality adaptation data are available. In this paper, we propose a method for obtaining highly accurate adapted acoustic models by efficiently clustering large-scale data, using the likelihoods of cluster models built from small-scale data as the criterion. We also propose a method that uses the cluster models at recognition time to automatically select the adapted acoustic model from only the beginning of each sentence of the input speech. Applying the proposed methods to recognition experiments on news speech showed that the selection accuracy of the adapted acoustic model can be maintained with only the first 0.5 seconds of each input sentence, and that, compared with not using per-cluster adapted acoustic models, the proposed approach yields a 20% reduction in recognition errors and a 23% reduction in recognition time.
  • Kazuo Onoe, Hiroyuki Segi, Takeshi Kobayakawa, Shoei Sato, Toru Imai, Akio Ando
    7th International Conference on Spoken Language Processing, ICSLP2002 - INTERSPEECH 2002, Denver, Colorado, USA, September 16-20, 2002 2002年  査読有り
  • 安藤 彰男, 今井 亨, 小林 彰夫, 本間 真一, 後藤 淳, 清山 信正, 三島 剛, 小早川 健, 佐藤 庄衛, 尾上 和穂, 世木 寛之, 今井 篤, 松井 淳, 中村 章, 田中 英輝, 都木 徹, 宮坂 栄一, 磯野 春雄
    電子情報通信学会論文誌. D-II, 情報・システム, II-パターン処理 84(6) 877-887 2001年6月  査読有り
    Realizing closed-captioned broadcasting for TV news programs requires producing caption scripts in real time. In Europe and the United States, news caption scripts are produced with special keyboard input, but for Japanese it is difficult to type caption scripts fast enough to follow the announcer's voice because kana-kanji conversion takes time. We therefore developed a caption-production system for broadcast news programs that uses speech recognition. The system recognizes the announcer's speech in real time and produces caption scripts by having operators immediately correct recognition errors in the recognition results. Using this system, NHK started closed-captioned broadcasting of the news program 「ニュース7」 ("News 7") on March 27, 2000.
  • Atsushi Matsui, Hiroyuki Segi, Akio Kobayashi, Toru Imai, Akio Ando
    EUROSPEECH 2001 Scandinavia, 7th European Conference on Speech Communication and Technology, 2nd INTERSPEECH Event, Aalborg, Denmark, September 3-7, 2001 709-712 2001年  査読有り
  • 世木寛之
    慶應義塾大学 1996年3月  査読有り
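
A schematic illustration of the LBG codebook training compared in the two tied-mixture HMM entries above (Segi et al., 2016 and 2014): the sketch below grows a codebook of mean vectors by binary splitting followed by a few k-means refinements. The function name, parameters, and toy data are illustrative assumptions, not code from the papers, which also train mixture weights and variances.

```python
import numpy as np

def lbg_codebook(features, codebook_size=64, perturb=1e-3, lloyd_iters=10):
    """Grow a codebook of mean vectors from acoustic feature frames by binary splitting (LBG)."""
    codebook = features.mean(axis=0, keepdims=True)        # start from the global mean
    while codebook.shape[0] < codebook_size:
        # Split every centroid into two slightly perturbed copies.
        codebook = np.vstack([codebook * (1.0 + perturb), codebook * (1.0 - perturb)])
        # Refine the doubled codebook with a few Lloyd (k-means) iterations.
        for _ in range(lloyd_iters):
            dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)                   # nearest centroid per frame
            for k in range(codebook.shape[0]):
                members = features[labels == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook

# Toy usage: 5,000 random 13-dimensional "feature vectors", 64 codewords.
rng = np.random.default_rng(0)
print(lbg_codebook(rng.normal(size=(5000, 13))).shape)     # (64, 13)
```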
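
A minimal sketch of the likelihood-based model selection described in the Sato et al. entries above, where an adapted acoustic model is chosen from only the first 0.5 seconds of an input sentence. Each cluster model is reduced here to a single diagonal-covariance Gaussian for illustration; the papers use full acoustic models, and the names and the 50-frame assumption are mine.

```python
import numpy as np

def log_gaussian(frames, mean, var):
    """Frame-wise log likelihood under a diagonal-covariance Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var).sum(axis=1)

def select_adapted_model(frames, cluster_models, head_frames=50):
    """Pick the cluster whose model best explains the first ~0.5 s of speech.

    frames:         (T, D) feature frames of the incoming utterance.
    cluster_models: list of (mean, var) pairs, stand-ins for per-cluster acoustic models.
    head_frames:    leading frames to use (50 frames at a 10 ms shift is about 0.5 s).
    Returns the index of the selected cluster / adapted acoustic model.
    """
    head = frames[:head_frames]
    scores = [log_gaussian(head, mean, var).sum() for mean, var in cluster_models]
    return int(np.argmax(scores))
```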
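
A simplified sketch of the per-channel noise subtraction in the Onoe et al. (2003) entry above, which estimates and subtracts noise at each filter bank rather than at each frequency point. Two simplifying assumptions are mine: the noise is estimated from leading non-speech frames, and the subtraction coefficient is a fixed constant rather than the phase-derived coefficient controlled theoretically in the paper.

```python
import numpy as np

def filterbank_outputs(power_spec, filters):
    """power_spec: (T, F) frame power spectra; filters: (B, F) triangular filter bank."""
    return power_spec @ filters.T                           # (T, B) filter-bank energies

def filterbank_subtraction(fbank, noise_frames=10, alpha=1.0, floor=0.01):
    """Subtract a per-channel noise estimate from filter-bank energies.

    noise_frames: leading frames assumed to contain noise only.
    alpha:        subtraction coefficient (fixed here for simplicity).
    floor:        fraction of the noise estimate kept so energies never go negative.
    """
    noise = fbank[:noise_frames].mean(axis=0)               # (B,) per-channel noise estimate
    return np.maximum(fbank - alpha * noise, floor * noise)
```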

MISC

 29
  • 世木 寛之, 清山 信正, 田高 礼子
    NHK技研R&D (131) 40-47 2012年1月  
  • 世木 寛之, 田高 礼子, 清山 信正, 都木 徹
    情報処理学会論文誌 50(2) 575-586 2009年2月15日  
    Proposals have been made to implement a system that generates synthesized speech by concatenating segments of speech stored in large databases. While these databases are often created by recording sentences with a specific phonetic balance, read at a rate and in a style that are optimal for speech synthesis, this paper explores an alternative method of database creation, one that utilizes broadcast materials archived by broadcasters. In our study, we used samples of recorded speech from news programs to create a speech database. An assessment of speech generated by the speech-synthesis method using "context-dependent phoneme sequences" as search units yielded a mean opinion score (MOS) of 4.01 on a one-to-five scale. Overall, the samples were considered "somewhat unnatural but not bothersome." In particular, 39.8% of all samples were rated 5, demonstrating their highly natural-sounding quality. In addition, we compared the evaluation of synthesized speech with target scores and that of synthesized speech without target scores; the difference in MOS was 0.18. This result confirmed that prosody prediction and target scores are not necessarily required to create synthesized speech of natural-sounding quality when the content of the input sentences is similar to that of the sentences stored in the database. (A toy sketch of this kind of phoneme-sequence unit selection is given after the MISC list.)
  • 世木 寛之, 清山 信正, 田高 礼子
    放送技術 61(4) 91-96 2008年4月  
  • 田高 礼子, 世木 寛之, 清山 信正
    聴覚研究会資料 38(2) 159-164 2008年3月20日  
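
A toy illustration of phoneme-sequence unit selection, related to the 2009 情報処理学会論文誌 entry above, which synthesizes speech by searching a news-speech database for context-dependent phoneme sequences. The greedy longest-match criterion, the data structures, and all names below are assumptions for illustration, not the paper's actual search.

```python
from typing import Dict, List, Tuple

def select_units(target: List[str],
                 database: Dict[Tuple[str, ...], str]) -> List[str]:
    """Greedily cover a target phoneme sequence with the longest phoneme
    sequences available in a recorded-speech database.

    target:   phoneme sequence to synthesize, e.g. ["k", "a", "b", "u"].
    database: maps phoneme-sequence keys to waveform-segment identifiers.
    Returns the identifiers of the segments to concatenate.
    """
    units, i = [], 0
    while i < len(target):
        # Try the longest remaining span first and shrink until a match is found.
        for j in range(len(target), i, -1):
            key = tuple(target[i:j])
            if key in database:
                units.append(database[key])
                i = j
                break
        else:
            raise KeyError(f"no database unit covers phoneme {target[i]!r}")
    return units

# Toy usage with a hypothetical three-entry database.
db = {("k", "a"): "seg_001", ("b",): "seg_042", ("u",): "seg_007"}
print(select_units(["k", "a", "b", "u"], db))               # ['seg_001', 'seg_042', 'seg_007']
```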

Books and Other Publications

 1
  • Supervised by 八木伸行; authored by 世木寛之 et al. (Contribution: chapter author; Chapter 11, Speech Synthesis)
    オーム社 2008年7月

Presentations

 47

Research Projects (Joint Research and Competitive Funding)

 1

Industrial Property Rights

 72