Research Achievements

世木 寛之

セギ ヒロユキ  (Hiroyuki SEGI)

Basic Information

Affiliation
Professor, Department of Science and Technology, Faculty of Science and Technology, Seikei University
Degree
Doctor of Engineering (Keio University)

J-GLOBAL ID
201501025877783683
researchmap Member ID
B000244685

Research Keywords

 2

Papers

 25
  • Ai Mizota, Hiroyuki Segi
    2021 IEEE International Conference on Consumer Electronics (ICCE) 2021年1月10日  査読有り
  • Hiroyuki Segi, Shoei Sato, Kazuo Onoe, Akio Kobayashi, Akio Ando
    Artificial Intelligence: Concepts, Methodologies, Tools, and Applications 3 2021-2037 2016年12月12日  査読有り
    Tied-mixture HMMs have been proposed as acoustic models for large-vocabulary continuous speech recognition and have yielded promising results. They share base distributions and provide more flexibility in choosing the degree of tying than state-clustered HMMs. However, it has been unclear which acoustic model is superior when trained on the same data. Moreover, the LBG algorithm and the EM algorithm, the usual training methods for these HMMs, have not been compared. In this paper, therefore, the recognition performance of the respective HMMs and the respective training methods is compared under the same conditions. It was found that the number of parameters and the word error rate of the two HMMs are equivalent when the number of codebooks is sufficiently large. It was also found that training with the LBG algorithm achieves a 90% reduction in training time compared with training with the EM algorithm, without degrading recognition accuracy. (A schematic sketch of LBG-style codebook training is given after this publication list.)
  • Segi Hiroyuki
    INTERNATIONAL JOURNAL OF MULTIMEDIA DATA ENGINEERING & MANAGEMENT 7(2) 53-67 2016年4月  査読有り
  • 世木寛之
    成蹊大学理工学研究報告 52(2) 5-10 2015年12月  
    The 'Kabushiki Shikyo' program broadcast on NHK Radio 2 reports on the daily closing prices and net changes of about 830 stocks listed on the Tokyo Stock Exchange. Reading out the numerical values without making mistakes within the allotted broadcast time can be extremely difficult for the announcers. We have therefore developed an automatic broadcast system for stock-price bulletins, which uses numerical speech synthesis and automatic speech-rate conversion. Our system has been used in experimental digital terrestrial radio broadcasts since October 2006 and on NHK Radio 2 since March 2010. This article describes the generation of texts to build the speech waveform database, the mechanism used to synthesize numerical speech via the database, and the evaluation of the naturalness of the synthesized speech samples.
  • Hiroyuki Segi, Kazuo Onoe, Shoei Sato, Akio Kobayashi, Akio Ando
    Journal of Information Technology Research 7(3) 15-31 2014年7月1日  査読有り
    Tied-mixture HMMs have been proposed as acoustic models for large-vocabulary continuous speech recognition and have yielded promising results. They share base distributions and provide more flexibility in choosing the degree of tying than state-clustered HMMs. However, it has been unclear which acoustic model is superior when trained on the same data. Moreover, the LBG algorithm and the EM algorithm, the usual training methods for these HMMs, have not been compared. In this paper, therefore, the recognition performance of the respective HMMs and the respective training methods is compared under the same conditions. It was found that the number of parameters and the word error rate of the two HMMs are equivalent when the number of codebooks is sufficiently large. It was also found that training with the LBG algorithm achieves a 90% reduction in training time compared with training with the EM algorithm, without degrading recognition accuracy.
  • Hiroyuki Segi, Reiko Takou, Nobumasa Seiyama, Tohru Takagi, Yuko Uematsu, Hideo Saito, Shinji Ozawa
    IEEE TRANSACTIONS ON BROADCASTING 59(3) 548-555 2013年9月  査読有り
    Here we describe a speech-synthesis method using templates that can generate recording-sentence sets for speech databases and produce natural sounding synthesized speech. Applying this method to the Japan Broadcasting Corporation (NHK) weather report radio program reduced the size of the recording-sentence set required to just a fraction of that needed by a comparable method. After integrating the recording voice of the generated recording-sentence set into the speech database, speech was produced by a voice synthesizer using templates. In a paired-comparison test, 66 % of the speech samples synthesized by our system using templates were preferred to those produced by a conventional voice synthesizer. In an evaluation test using a five-point mean opinion score (MOS) scale, the speech samples synthesized by our system scored 4.97, whereas the maximum score for commercially available voice synthesizers was 3.09. In addition, we developed an automatic broadcast system for the weather report program using the speech-synthesis method and speech-rate converter. The system was evaluated using real weather data for more than 1 year, and exhibited sufficient stability and synthesized speech quality for broadcast purposes.
  • Reiko Takou, Hiroyuki Segi, Tohru Takagi, Nobumasa Seiyama
    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES E95A(4) 751-759 2012年4月  査読有り
    The frequency regions and spectral features that can be used to measure the perceived similarity and continuity of voice quality are reported here. A perceptual evaluation test was conducted to assess the naturalness of spoken sentences in which either a vowel or a long vowel of the original speaker was replaced by that of another. Correlation analysis between the evaluation score and the spectral feature distance was conducted to select the spectral features that were expected to be effective in measuring the voice quality and to identify the appropriate speech segment of another speaker. The mel-frequency cepstrum coefficient (MFCC) and the spectral center of gravity (COG) in the low-, middle-, and high-frequency regions were selected. A perceptual paired comparison test was carried out to confirm the effectiveness of the spectral features. The results showed that the MFCC was effective for spectra across a wide range of frequency regions, the COG was effective in the low- and high-frequency regions, and the effective spectral features differed among the original speakers.
  • 世木寛之
    慶應義塾大学 2012年3月  査読有り
  • Hiroyuki Segi, Reiko Takou, Nobumasa Seiyama, Tohru Takagi, Hideo Saito, Shinji Ozawa
    2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING 1757-1760 2011年  査読有り
    Here we propose a sentence-generation method using templates that can be applied to create a speech database. This method requires the recording of a relatively small sentence set, and the resultant speech database can generate comparatively natural sounding synthesized speech. Applying this method to the Japan Broadcasting Corporation (NHK) weather report radio program reduced the size of the required sentence set to just a fraction of that required by comparable methods. We also propose a speech-synthesis method using templates. In an evaluation test, 66% of the speech samples synthesized by the proposed method using templates were preferred to those produced by the conventional concatenative speech-synthesis method.
  • 世木 寛之, 田高 礼子, 清山 信正, 都木 徹, 斎藤 英雄, 小澤 愼治
    映像情報メディア学会誌 : 映像情報メディア = The journal of the Institute of Image Information and Television Engineers 65(1) 76-83 2011年  査読有り
    The design method of a sentence set for a speech-synthesis database strongly influences the quality of the synthesized speech. To minimize the costs associated with making the speech recordings and constructing the speech database, the size of the sentence set should be limited. However, if a sentence set does not include sufficient data, the quality of the synthesized speech can be inadequate. In this paper, we propose a method for generating a sentence set from templates. When applied to the templates in the "Weather Report" radio program, the proposed method reduced the size of the sentence set to a few percent of that required by a comparison method. In addition, the mean opinion score of speech samples synthesized using the proposed method was 4.32 on a five-point scale.
  • Hiroyuki Segi, Reiko Takou, Nobumasa Seiyama, Tohru Takagi
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS 56(1) 169-174 2010年2月  査読有り
    Here we propose a prototype data-broadcast receiver equipped with a voice synthesizer, which can read out stock prices and stock-price changes from a live data broadcast. Using this receiver, listeners can access their chosen stock information at any time and at an appropriate speech rate. We also propose a high-quality voice synthesizer for use with this receiver. A subjective evaluation confirmed the superiority of this voice synthesizer over commercially available ones.
  • Hiroyuki Segi, Reiko Tako, Nobumasa Seiyama, Tohru Takagi
    2010 DIGEST OF TECHNICAL PAPERS INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS ICCE 9.1-4 411-412 2010年  査読有り
    Here we propose a prototype data-broadcast receiver equipped with a voice synthesizer, which can use the stock prices and the stock-price changes from a live data broadcast. Using this receiver, listeners can access their chosen stock information at any time and at a speech rate appropriate to each individual. We also propose a high-quality voice synthesizer for use with this receiver. The results of a subjective evaluation confirmed the superiority of the proposed voice synthesizer compared with commercially available voice synthesizers.
  • 世木寛之, 田高礼子, 清山信正, 都木徹
    情報処理学会論文誌ジャーナル(CD-ROM) 50(2) 575-586 2009年2月  査読有り
  • 田高 礼子, 世木 寛之, 清山 信正, 都木 徹
    映像情報メディア学会冬季大会講演予稿集 2008 7-9-1 2008年  
    We are developing a tool that synthesizes speech by concatenating word-level speech segments and then corrects the resulting degradation so as to reach broadcast quality. In this study, several correction functions were introduced into the tool, making it possible to investigate better correction procedures for generating high-quality synthesized speech.
  • 世木 寛之, 清山 信正, 田高 礼子, 都木 徹, 大出 訓史, 今井 篤, 西脇 正通, 小山 隆二
    映像情報メディア学会誌 : 映像情報メディア 62(1) 69-76 2008年  査読有り
    The 'Kabushiki Shikyo' program, broadcast on NHK Radio 2, reports on the daily closing prices and net changes of about 830 stocks listed on the Tokyo Stock Exchange. Reading out the numerical values within the allotted broadcast time without making mistakes can be extremely difficult for the announcers. We have therefore developed a prototype voice synthesizer for stock-price bulletins, which uses numerical speech synthesis and automatic speech-rate conversion. Our prototype system has been used in experimental digital terrestrial radio broadcasts since October 2006. This article describes the generation of texts to build the speech waveform database, the mechanism used to synthesize numerical speech via the database, the evaluation of naturalness of synthesized speech samples, and the prototype system currently being used by experimental digital terrestrial radio.
  • Segi Hiroyuki, Seiyama Nobumasa, Tako Reiko, Takagi Tohru, Toda Hideo, Koyama Ryuji
    National Association of Broadcasters Proceedings (NAB) Broadcasting Engineering Conference 205-212 2007年4月  査読有り
  • S Sato, H Segi, K Onoe, E Miyasaka, H Isono, T Imai, A Ando
    ELECTRONICS AND COMMUNICATIONS IN JAPAN PART III-FUNDAMENTAL ELECTRONIC SCIENCE 88(2) 41-51 2005年  査読有り
    In speech recognition systems where the speaker and utterance environment cannot be designated, the drop in recognition precision due to the mismatch between the input speech and the acoustic model's training data is a problem. Although this problem is normally solved by speaker adaptation, sufficient precision cannot be achieved unless good-quality adaptation data can be obtained. In this paper, the authors propose a method of efficiently clustering large-scale data, using the likelihoods of cluster models created from small-scale data as the criterion, to obtain high-precision adapted acoustic models. They also propose a method of using the cluster models to automatically determine the adapted acoustic model during recognition from only the beginnings of the sentences of the input speech. The results of applying the proposed technique to news speech recognition experiments show that the adapted-acoustic-model selection precision can be ensured using only 0.5 seconds of data from the beginning of each sentence, and that the proposed technique achieves a 20% reduction in recognition errors and a 23% reduction in recognition time compared with the case where per-cluster adapted acoustic models are not used. (C) 2004 Wiley Periodicals, Inc. (A minimal sketch of this likelihood-based model selection is given after this publication list.)
  • Hiroyuki Segi, Tohru Takagi, Takayuki Ito
    Fifth ISCA ITRW on Speech Synthesis, Pittsburgh, PA, USA, June 14-16, 2004 115-120 2004年  査読有り
  • K Onoe, H Segi, T Kobayakawa, S Sato, S Homma, T Imai, A Ando
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS E86D(3) 483-488 2003年3月  査読有り
    In this paper, we propose a new technique of filter-bank subtraction for robust speech recognition under various acoustic conditions. Spectral subtraction is a simple and useful technique for reducing the influence of additive noise. Conventional spectral subtraction assumes accurate estimation of the noise spectrum and no correlation between speech and noise. Those assumptions, however, are rarely satisfied in reality, leading to degradation of speech recognition accuracy. Moreover, the recognition improvement attained by conventional methods is slight when the input SNR changes sharply. We propose a new method in which the output values of the filter banks are used for noise estimation and subtraction. By estimating noise at each filter bank, instead of at each frequency point, the method alleviates the need for precise estimation of noise. We also take into consideration the expected phase differences between the spectra of speech and noise in the subtraction and control a subtraction coefficient theoretically. Recognition experiments on test sets at several SNRs showed that the filter-bank subtraction technique significantly improved word accuracy and gave better results than conventional spectral subtraction on all test sets. In other experiments on recognizing speech from TV news field reports with environmental noise, the proposed subtraction method also yielded better results than the conventional method. (A simplified sketch of this per-channel subtraction is given after this publication list.)
  • A Ando, T Imai, A Kobayashi, S Homma, J Goto, N Seiyama, T Mishima, T Kobayakawa, S Sato, K Onoe, H Segi, A Imai, A Matsui, A Nakamura, H Tanaka, T Takagi, E Miyasaka, H Isono
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS E86D(1) 15-25 2003年1月  査読有り
    There is a strong demand to expand captioned broadcasting for TV news programs in Japan. However, keyboard entry of caption manuscripts for news programs cannot keep pace with the speed of speech because, in the case of Japanese, it takes time to select the correct characters from among homonyms. In order to implement simultaneous subtitled broadcasting for Japanese news programs, a simultaneous subtitling system based on speech recognition has been developed. This system consists of a real-time speech recognition system that handles broadcast news transcription and a recognition-error correction system that manually corrects mistakes in the recognition result with a short delay. NHK started simultaneous subtitled broadcasting for the news program "News 7" on the evening of March 27, 2000.
  • 佐藤 庄衛, 世木 寛之, 尾上 和穂, 宮坂 栄一, 磯野 春雄, 今井 亨, 安藤 彰男
    電子情報通信学会論文誌. D-II, 情報・システム, II-パターン処理 85(2) 174-183 2002年2月  査読有り
    In speech recognition systems where the speaker and the utterance environment cannot be specified, degradation of recognition accuracy caused by the mismatch between the acoustic model's training data and the input speech is a problem. This problem is usually addressed by speaker adaptation, but speaker adaptation cannot achieve sufficient accuracy unless good-quality adaptation data are available. In this paper, we propose a method for obtaining highly accurate adapted acoustic models by efficiently clustering large-scale data, using the likelihoods of cluster models built from small-scale data as the criterion. We also propose a method that uses the cluster models at recognition time to automatically select the adapted acoustic model from only the beginning of each sentence of the input speech. Applying the proposed methods to recognition experiments on news speech showed that the selection accuracy of the adapted acoustic model can be maintained with only the first 0.5 seconds of each input sentence, and that, compared with not using per-cluster adapted acoustic models, the proposed approach yields a 20% reduction in recognition errors and a 23% reduction in recognition time.
  • Kazuo Onoe, Hiroyuki Segi, Takeshi Kobayakawa, Shoei Sato, Toru Imai, Akio Ando
    7th International Conference on Spoken Language Processing, ICSLP2002 - INTERSPEECH 2002, Denver, Colorado, USA, September 16-20, 2002 2002年  査読有り
  • 安藤 彰男, 今井 亨, 小林 彰夫, 本間 真一, 後藤 淳, 清山 信正, 三島 剛, 小早川 健, 佐藤 庄衛, 尾上 和穂, 世木 寛之, 今井 篤, 松井 淳, 中村 章, 田中 英輝, 都木 徹, 宮坂 栄一, 磯野 春雄
    電子情報通信学会論文誌. D-II, 情報・システム, II-パターン処理 84(6) 877-887 2001年6月  査読有り
    Realizing closed-captioned broadcasting for TV news programs requires producing caption scripts in real time. In Europe and the United States, news caption scripts are produced with special keyboard input, but for Japanese it is difficult to type caption scripts fast enough to follow the announcer's voice because kana-kanji conversion takes time. We therefore developed a caption-production system for broadcast news programs that uses speech recognition. The system recognizes the announcer's speech in real time and produces caption scripts by having operators immediately correct recognition errors in the recognition results. Using this system, NHK started closed-captioned broadcasting of the news program 「ニュース7」 ("News 7") on March 27, 2000.
  • Atsushi Matsui, Hiroyuki Segi, Akio Kobayashi, Toru Imai, Akio Ando
    EUROSPEECH 2001 Scandinavia, 7th European Conference on Speech Communication and Technology, 2nd INTERSPEECH Event, Aalborg, Denmark, September 3-7, 2001 709-712 2001年  査読有り
  • 世木寛之
    慶應義塾大学 1996年3月  査読有り
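
A schematic illustration of the LBG codebook training compared in the two tied-mixture HMM entries above (Segi et al., 2016 and 2014): the sketch below grows a codebook of mean vectors by binary splitting followed by a few k-means refinements. The function name, parameters, and toy data are illustrative assumptions, not code from the papers, which also train mixture weights and variances.

```python
import numpy as np

def lbg_codebook(features, codebook_size=64, perturb=1e-3, lloyd_iters=10):
    """Grow a codebook of mean vectors from acoustic feature frames by binary splitting (LBG)."""
    codebook = features.mean(axis=0, keepdims=True)        # start from the global mean
    while codebook.shape[0] < codebook_size:
        # Split every centroid into two slightly perturbed copies.
        codebook = np.vstack([codebook * (1.0 + perturb), codebook * (1.0 - perturb)])
        # Refine the doubled codebook with a few Lloyd (k-means) iterations.
        for _ in range(lloyd_iters):
            dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)                   # nearest centroid per frame
            for k in range(codebook.shape[0]):
                members = features[labels == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook

# Toy usage: 5,000 random 13-dimensional "feature vectors", 64 codewords.
rng = np.random.default_rng(0)
print(lbg_codebook(rng.normal(size=(5000, 13))).shape)     # (64, 13)
```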
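
A minimal sketch of the likelihood-based model selection described in the Sato et al. entries above, where an adapted acoustic model is chosen from only the first 0.5 seconds of an input sentence. Each cluster model is reduced here to a single diagonal-covariance Gaussian for illustration; the papers use full acoustic models, and the names and the 50-frame assumption are mine.

```python
import numpy as np

def log_gaussian(frames, mean, var):
    """Frame-wise log likelihood under a diagonal-covariance Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var).sum(axis=1)

def select_adapted_model(frames, cluster_models, head_frames=50):
    """Pick the cluster whose model best explains the first ~0.5 s of speech.

    frames:         (T, D) feature frames of the incoming utterance.
    cluster_models: list of (mean, var) pairs, stand-ins for per-cluster acoustic models.
    head_frames:    leading frames to use (50 frames at a 10 ms shift is about 0.5 s).
    Returns the index of the selected cluster / adapted acoustic model.
    """
    head = frames[:head_frames]
    scores = [log_gaussian(head, mean, var).sum() for mean, var in cluster_models]
    return int(np.argmax(scores))
```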
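
A simplified sketch of the per-channel noise subtraction in the Onoe et al. (2003) entry above, which estimates and subtracts noise at each filter bank rather than at each frequency point. Two simplifying assumptions are mine: the noise is estimated from leading non-speech frames, and the subtraction coefficient is a fixed constant rather than the phase-derived coefficient controlled theoretically in the paper.

```python
import numpy as np

def filterbank_outputs(power_spec, filters):
    """power_spec: (T, F) frame power spectra; filters: (B, F) triangular filter bank."""
    return power_spec @ filters.T                           # (T, B) filter-bank energies

def filterbank_subtraction(fbank, noise_frames=10, alpha=1.0, floor=0.01):
    """Subtract a per-channel noise estimate from filter-bank energies.

    noise_frames: leading frames assumed to contain noise only.
    alpha:        subtraction coefficient (fixed here for simplicity).
    floor:        fraction of the noise estimate kept so energies never go negative.
    """
    noise = fbank[:noise_frames].mean(axis=0)               # (B,) per-channel noise estimate
    return np.maximum(fbank - alpha * noise, floor * noise)
```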

MISC

 29
  • 世木 寛之, 清山 信正, 田高 礼子
    NHK技研R&D (131) 40-47 2012年1月  
  • 世木 寛之, 田高 礼子, 清山 信正, 都木 徹
    情報処理学会論文誌 50(2) 575-586 2009年2月15日  
    Proposals have been made to implement a system that generates synthesized speech by concatenating segments of speech stored in large databases. While these databases are often created by recording sentences with a specific phonetic balance, read at a rate and in a style that are optimal for speech synthesis, this paper explores an alternative method of database creation, one that utilizes broadcast materials archived by broadcasters. In our study, we used samples of recorded speech from news programs to create a speech database. An assessment of speech generated by the speech-synthesis method using "context-dependent phoneme sequences" as search units yielded a mean opinion score (MOS) of 4.01 on a one-to-five scale. Overall, the samples were considered "somewhat unnatural but not bothersome." In particular, 39.8% of all samples were rated 5, demonstrating their highly natural-sounding quality. In addition, we compared the evaluation of synthesized speech with target scores and that of synthesized speech without target scores; the difference in MOS was 0.18. This result confirmed that prosody prediction and target scores are not necessarily required to create synthesized speech of natural-sounding quality when the content of the input sentences is similar to that of the sentences stored in the database. (A toy sketch of this kind of phoneme-sequence unit selection is given after the MISC list.)
  • 世木 寛之, 清山 信正, 田高 礼子
    放送技術 61(4) 91-96 2008年4月  
  • 田高 礼子, 世木 寛之, 清山 信正
    聴覚研究会資料 38(2) 159-164 2008年3月20日  
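
A toy illustration of phoneme-sequence unit selection, related to the 2009 情報処理学会論文誌 entry above, which synthesizes speech by searching a news-speech database for context-dependent phoneme sequences. The greedy longest-match criterion, the data structures, and all names below are assumptions for illustration, not the paper's actual search.

```python
from typing import Dict, List, Tuple

def select_units(target: List[str],
                 database: Dict[Tuple[str, ...], str]) -> List[str]:
    """Greedily cover a target phoneme sequence with the longest phoneme
    sequences available in a recorded-speech database.

    target:   phoneme sequence to synthesize, e.g. ["k", "a", "b", "u"].
    database: maps phoneme-sequence keys to waveform-segment identifiers.
    Returns the identifiers of the segments to concatenate.
    """
    units, i = [], 0
    while i < len(target):
        # Try the longest remaining span first and shrink until a match is found.
        for j in range(len(target), i, -1):
            key = tuple(target[i:j])
            if key in database:
                units.append(database[key])
                i = j
                break
        else:
            raise KeyError(f"no database unit covers phoneme {target[i]!r}")
    return units

# Toy usage with a hypothetical three-entry database.
db = {("k", "a"): "seg_001", ("b",): "seg_042", ("u",): "seg_007"}
print(select_units(["k", "a", "b", "u"], db))               # ['seg_001', 'seg_042', 'seg_007']
```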

Books and Other Publications

 1
  • Supervised by 八木伸行; authored by 世木寛之 et al. (Contribution: chapter author; Chapter 11, Speech Synthesis)
    オーム社 2008年7月

Presentations

 47

Research Projects (Joint Research and Competitive Funding)

 1

Industrial Property Rights

 72