Curriculum Vitae

Hiroyuki SEGI

  (世木 寛之)

Profile Information

Affiliation
Professor, Department of Science and Technology, Faculty of Science and Technology, Seikei University
Degree
Doctor of Engineering (Keio University)

J-GLOBAL ID
201501025877783683
researchmap Member ID
B000244685

Research Interests

 2

Papers

 25
  • Ai Mizota, Hiroyuki Segi
    2021 IEEE International Conference on Consumer Electronics (ICCE), Jan 10, 2021  Peer-reviewed
  • Hiroyuki Segi, Shoei Sato, Kazuo Onoe, Akio Kobayashi, Akio Ando
    Artificial Intelligence: Concepts, Methodologies, Tools, and Applications, 3 2021-2037, Dec 12, 2016  Peer-reviewed
Tied-mixture HMMs have been proposed as acoustic models for large-vocabulary continuous speech recognition and have yielded promising results. They share base distributions and provide more flexibility in choosing the degree of tying than state-clustered HMMs. However, it is unclear which acoustic model is superior to the other given the same training data. Moreover, the LBG algorithm and the EM algorithm, the usual training methods for HMMs, have not been compared. Therefore, in this paper, the recognition performance of the respective HMMs and the respective training methods is compared under the same conditions. It was found that the number of parameters and the word error rate of the two HMMs are equivalent when the number of codebooks is sufficiently large. It was also found that training with the LBG algorithm achieves a 90% reduction in training time compared with training with the EM algorithm, without degrading recognition accuracy.
  • Segi Hiroyuki
    INTERNATIONAL JOURNAL OF MULTIMEDIA DATA ENGINEERING & MANAGEMENT, 7(2) 53-67, Apr, 2016  Peer-reviewed
  • Hiroyuki SEGI
    The Journal of the Faculty of Science and Technology, Seikei University, 52(2) 5-10, Dec, 2015  
  • Hiroyuki Segi, Kazuo Onoe, Shoei Sato, Akio Kobayashi, Akio Ando
    Journal of Information Technology Research, 7(3) 15-31, Jul 1, 2014  Peer-reviewed
Tied-mixture HMMs have been proposed as acoustic models for large-vocabulary continuous speech recognition and have yielded promising results. They share base distributions and provide more flexibility in choosing the degree of tying than state-clustered HMMs. However, it is unclear which acoustic model is superior to the other given the same training data. Moreover, the LBG algorithm and the EM algorithm, the usual training methods for HMMs, have not been compared. Therefore, in this paper, the recognition performance of the respective HMMs and the respective training methods is compared under the same conditions. It was found that the number of parameters and the word error rate of the two HMMs are equivalent when the number of codebooks is sufficiently large. It was also found that training with the LBG algorithm achieves a 90% reduction in training time compared with training with the EM algorithm, without degrading recognition accuracy.
  • Hiroyuki Segi, Reiko Takou, Nobumasa Seiyama, Tohru Takagi, Yuko Uematsu, Hideo Saito, Shinji Ozawa
    IEEE TRANSACTIONS ON BROADCASTING, 59(3) 548-555, Sep, 2013  Peer-reviewed
Here we describe a speech-synthesis method using templates that can generate recording-sentence sets for speech databases and produce natural sounding synthesized speech. Applying this method to the Japan Broadcasting Corporation (NHK) weather report radio program reduced the size of the recording-sentence set required to just a fraction of that needed by a comparable method. After integrating the recording voice of the generated recording-sentence set into the speech database, speech was produced by a voice synthesizer using templates. In a paired-comparison test, 66% of the speech samples synthesized by our system using templates were preferred to those produced by a conventional voice synthesizer. In an evaluation test using a five-point mean opinion score (MOS) scale, the speech samples synthesized by our system scored 4.97, whereas the maximum score for commercially available voice synthesizers was 3.09. In addition, we developed an automatic broadcast system for the weather report program using the speech-synthesis method and speech-rate converter. The system was evaluated using real weather data for more than 1 year, and exhibited sufficient stability and synthesized speech quality for broadcast purposes.
  • Reiko Takou, Hiroyuki Segi, Tohru Takagi, Nobumasa Seiyama
    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES, E95A(4) 751-759, Apr, 2012  Peer-reviewed
    The frequency regions and spectral features that can be used to measure the perceived similarity and continuity of voice quality are reported here. A perceptual evaluation test was conducted to assess the naturalness of spoken sentences in which either a vowel or a long vowel of the original speaker was replaced by that of another. Correlation analysis between the evaluation score and the spectral feature distance was conducted to select the spectral features that were expected to be effective in measuring the voice quality and to identify the appropriate speech segment of another speaker. The mel-frequency cepstrum coefficient (MFCC) and the spectral center of gravity (COG) in the low-, middle-, and high-frequency regions were selected. A perceptual paired comparison test was carried out to confirm the effectiveness of the spectral features. The results showed that the MFCC was effective for spectra across a wide range of frequency regions, the COG was effective in the low- and high-frequency regions, and the effective spectral features differed among the original speakers.
  • Hiroyuki SEGI
    Keio University, Mar, 2012  Peer-reviewed
  • Hiroyuki Segi, Reiko Takou, Nobumasa Seiyama, Tohru Takagi, Hideo Saito, Shinji Ozawa
    2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 1757-1760, 2011  Peer-reviewed
    Here we propose a sentence-generation method using templates that can be applied to create a speech database. This method requires the recording of a relatively small sentence set, and the resultant speech database can generate comparatively natural sounding synthesized speech. Applying this method to the Japan Broadcasting Corporation (NHK) weather report radio program reduced the size of the required sentence set to just a fraction of that required by comparable methods. We also propose a speech-synthesis method using templates. In an evaluation test, 66% of the speech samples synthesized by the proposed method using templates were preferred to those produced by the conventional concatenative speech-synthesis method.
  • Hiroyuki Segi, Reiko Takou, Nobumasa Seiyama, Tohru Takagi, Hideo Saito, Shinji Ozawa
    Kyokai Joho Imeji Zasshi/Journal of the Institute of Image Information and Television Engineers, 65(1) 76-83, 2011  Peer-reviewed
    The design method of a sentence set for a speech-synthesis database strongly influences the quality of the synthesized speech. To minimize the costs associated with making the speech recordings and constructing the speech database, the number of sentences in the set should be limited. However, if a sentence set does not include sufficient data, the quality of the synthesized speech can be inadequate. In this paper, we propose a method for generating a sentence set from templates. When applied to the templates in the "Weather Report" radio program, the proposed method reduced the number of sentences in the set to less than a few percent of that required by a comparison method. In addition, the mean opinion score of speech samples synthesized using the proposed method was 4.32 on a five-point scale.
  • Hiroyuki Segi, Reiko Takou, Nobumasa Seiyama, Tohru Takagi
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 56(1) 169-174, Feb, 2010  Peer-reviewed
    Here we propose a prototype data-broadcast receiver equipped with a voice synthesizer, which can read out stock prices and stock-price changes from a live data broadcast. Using this receiver, listeners can access their chosen stock information at any time and at an appropriate speech rate. We also propose a high-quality voice synthesizer for use with this receiver. A subjective evaluation confirmed the superiority of this voice synthesizer over commercially available ones.
  • Hiroyuki Segi, Reiko Tako, Nobumasa Seiyama, Tohru Takagi
    2010 DIGEST OF TECHNICAL PAPERS INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS ICCE, 9.1-4 411-412, 2010  Peer-reviewed
    Here we propose a prototype data-broadcast receiver equipped with a voice synthesizer, which can use the stock prices and the stock-price changes from a live data broadcast. Using this receiver, listeners can access their chosen stock information at any time and at a speech rate appropriate to each individual. We also propose a high-quality voice synthesizer for use with this receiver. The results of a subjective evaluation confirmed the superiority of the proposed voice synthesizer compared with commercially available voice synthesizers.
  • SEGI HIROYUKI, TAKO REIKO, SEIYAMA NOBUMASA, TAKAGI TOORU
    IPSJ Journal (CD-ROM), 50(2) 575-586, Feb, 2009  Peer-reviewed
  • TAKOU Reiko, SEGI Hiroyuki, SEIYAMA Nobumasa, TAKAGI Tohru
    PROCEEDINGS OF THE ITE WINTER ANNUAL CONVENTION, 2008 7-9-1, 2008  
    We are developing a tool that synthesizes concatenated word speech and corrects its degradation to achieve broadcast quality. In this study, several correction functions were introduced into the tool, making it possible to investigate better correction procedures for generating high-quality synthesized speech.
  • Hiroyuki Segi, Nobumasa Seiyama, Reiko Tako, Tohru Takagi, Satoshi Oode, Atsushi Imai, Masamichi Nishiwaki, Ryuji Koyama
    Kyokai Joho Imeji Zasshi/Journal of the Institute of Image Information and Television Engineers, 62(1) 69-76, 2008  Peer-reviewed
    The 'Kabushiki Shikyo' program, broadcast on NHK Radio 2, reports on the daily closing prices and net changes of about 830 stocks listed on the Tokyo Stock Exchange. Reading out the numerical values within the allotted broadcast time without making mistakes can be extremely difficult for the announcers. We have therefore developed a prototype voice synthesizer for stock-price bulletins, which uses numerical speech synthesis and automatic speech-rate conversion. Our prototype system has been used in experimental digital terrestrial radio broadcasts since October 2006. This article describes the generation of texts to build the speech waveform database, the mechanism used to synthesize numerical speech via the database, the evaluation of naturalness of synthesized speech samples, and the prototype system currently being used by experimental digital terrestrial radio.
  • Segi Hiroyuki, Seiyama Nobumasa, Tako Reiko, Takagi Tohru, Toda Hideo, Koyama Ryuji
    National Association of Broadcasters Proceedings (NAB) Broadcasting Engineering Conference, 205-212, Apr, 2007  Peer-reviewed
  • S Sato, H Segi, K Onoe, E Miyasaka, H Isono, T Imai, A Ando
    ELECTRONICS AND COMMUNICATIONS IN JAPAN PART III-FUNDAMENTAL ELECTRONIC SCIENCE, 88(2) 41-51, 2005  Peer-reviewed
    In speech recognition systems where the speaker and utterance environment cannot be designated, the drop in recognition precision due to the incompatibility of the input speech and acoustic model's training data is a problem. Although this problem is normally solved by speaker adaptation, sufficient precision cannot be achieved for speaker adaptation unless good-quality adaptation data can be obtained. In this paper, the authors propose a method of efficiently clustering large-scale data using the likelihoods of a cluster model that was created from small-scale data as the criteria to obtain a high-precision adapted acoustic model. They also propose a method of using the cluster model to automatically determine the adapted acoustic model during recognition from only the beginning of the sentences of the input speech. The results of applying the proposed technique to news speech recognition experiments show that the adapted acoustic model selection precision can be ensured by using only 0.5 second of data of the beginnings of sentences of the input speech and that the proposed technique achieves a reduction rate for invalid recognitions of 20% and a reduction in the time required for recognition of 23% compared with when the adapted acoustic model for each cluster is not used. (C) 2004 Wiley Periodicals, Inc.
  • Hiroyuki Segi, Tohru Takagi, Takayuki Ito
    Fifth ISCA ITRW on Speech Synthesis, Pittsburgh, PA, USA, June 14-16, 2004, 115-120, 2004  Peer-reviewed
  • K Onoe, H Segi, T Kobayakawa, S Sato, S Homma, T Imai, A Ando
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, E86D(3) 483-488, Mar, 2003  Peer-reviewed
    In this paper, we propose a new technique of filter bank subtraction for robust speech recognition under various acoustic conditions. Spectral subtraction is a simple and useful technique for reducing the influence of additive noise. Conventional spectral subtraction assumes accurate estimation of the noise spectrum and no correlation between speech and noise. Those assumptions, however, are rarely satisfied in reality, leading to the degradation of speech recognition accuracy. Moreover, the recognition improvement attained by conventional methods is slight when the input SNR changes sharply. We propose a new method in which the output values of filter banks are used for noise estimation and subtraction. By estimating noise at each filter bank, instead of at each frequency point, the method alleviates the necessity for precise estimation of noise. We also take into consideration expected phase differences between the spectra of speech and noise in the subtraction and control a subtraction coefficient theoretically. Recognition experiments on test sets at several SNRs showed that the filter bank subtraction technique improved the word accuracy significantly and got better results than conventional spectral subtraction on all the test sets. In other experiments, on recognizing speech from TV news field reports with environmental noise, the proposed subtraction method yielded better results than the conventional method.
  • A Ando, T Imai, A Kobayashi, S Homma, J Goto, N Seiyama, T Mishima, T Kobayakawa, S Sato, K Onoe, H Segi, A Imai, A Matsui, A Nakamura, H Tanaka, T Takagi, E Miyasaka, H Isono
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, E86D(1) 15-25, Jan, 2003  Peer-reviewed
    There is a strong demand to expand captioned broadcasting for TV news programs in Japan. However, keyboard entry of captioned manuscripts for news programs cannot keep pace with the speed of speech, because in the case of Japanese it takes time to select the correct characters from among homonyms. In order to implement simultaneous subtitled broadcasting for Japanese news programs, a simultaneous subtitling system using speech recognition has been developed. This system consists of a real-time speech recognition system to handle broadcast news transcription and a recognition-error correction system that manually corrects mistakes in the recognition result with a short delay time. NHK started simultaneous subtitled broadcasting for the news program "News 7" on the evening of March 27, 2000.
  • SATO Shoei, SEGI Hiroyuki, ONOE Kazuo, MIYASAKA Eiichi, ISONO Haruo, IMAI Toru, ANDO Akio
    The transactions of the Institute of Electronics, Information and Communication Engineers. D-II, 85(2) 174-183, Feb, 2002  Peer-reviewed
  • Kazuo Onoe, Hiroyuki Segi, Takeshi Kobayakawa, Shoei Sato, Toru Imai, Akio Ando
    7th International Conference on Spoken Language Processing, ICSLP2002 - INTERSPEECH 2002, Denver, Colorado, USA, September 16-20, 2002, 2002  Peer-reviewed
  • ANDO Akio, IMAI Toru, KOBAYASHI Akio, HONMA Shinichi, GOTO Jun, SEIYAMA Nobumasa, MISHIMA Takeshi, KOBAYAKAWA Takeshi, SATO Shoei, ONOE Kazuo, SEGI Hiroyuki, IMAI Atsushi, MATSUI Atsushi, NAKAMURA Akira, TANAKA Hideki, TAKAGI Tohru, MIYASAKA Eiichi, ISONO Haruo
    The transactions of the Institute of Electronics, Information and Communication Engineers. D-II, 84(6) 877-887, Jun, 2001  Peer-reviewed
  • Atsushi Matsui, Hiroyuki Segi, Akio Kobayashi, Toru Imai, Akio Ando
    EUROSPEECH 2001 Scandinavia, 7th European Conference on Speech Communication and Technology, 2nd INTERSPEECH Event, Aalborg, Denmark, September 3-7, 2001, 709-712, 2001  Peer-reviewed
  • Hiroyuki SEGI
    Keio University, Mar, 1996  Peer-reviewed

Books and Other Publications

 1
  • Supervised by Nobuyuki Yagi; authored by Hiroyuki Segi et al. (Role: Contributor, Chapter 11: Speech Synthesis)
    Ohmsha, Jul, 2008

Presentations

 47

Research Projects

 1

Industrial Property Rights

 72