Curriculum Vitae

Hiroyuki SEGI

  (世木 寛之)

Profile Information

Affiliation
Professor, Department of Science and Technology, Faculty of Science and Technology, Seikei University
Degree
Doctor of Engineering (Keio University)

J-GLOBAL ID
201501025877783683
researchmap Member ID
B000244685

Research Interests

 2

Papers

 25
  • Ai Mizota, Hiroyuki Segi
    2021 IEEE International Conference on Consumer Electronics (ICCE), Jan 10, 2021  Peer-reviewed
  • Hiroyuki Segi, Shoei Sato, Kazuo Onoe, Akio Kobayashi, Akio Ando
    Artificial Intelligence: Concepts, Methodologies, Tools, and Applications, 3 2021-2037, Dec 12, 2016  Peer-reviewed
Tied-mixture HMMs have been proposed as acoustic models for large-vocabulary continuous speech recognition and have yielded promising results. They share base distributions and provide more flexibility in choosing the degree of tying than state-clustered HMMs. However, it is unclear which acoustic model is superior to the other given the same training data. Moreover, the LBG algorithm and the EM algorithm, the usual training methods for HMMs, have not been compared. Therefore, in this paper, the recognition performance of the respective HMMs and the respective training methods is compared under the same conditions. It was found that the number of parameters and the word error rate of the two HMMs are equivalent when the number of codebooks is sufficiently large. It was also found that training with the LBG algorithm achieves a 90% reduction in training time compared with training with the EM algorithm, without degrading recognition accuracy.
  • Segi Hiroyuki
    INTERNATIONAL JOURNAL OF MULTIMEDIA DATA ENGINEERING & MANAGEMENT, 7(2) 53-67, Apr, 2016  Peer-reviewed
  • Hiroyuki SEGI
    The Journal of the Faculty of Science and Technology, Seikei University, 52(2) 5-10, Dec, 2015  
  • Hiroyuki Segi, Kazuo Onoe, Shoei Sato, Akio Kobayashi, Akio Ando
    Journal of Information Technology Research, 7(3) 15-31, Jul 1, 2014  Peer-reviewed
Tied-mixture HMMs have been proposed as acoustic models for large-vocabulary continuous speech recognition and have yielded promising results. They share base distributions and provide more flexibility in choosing the degree of tying than state-clustered HMMs. However, it is unclear which acoustic model is superior to the other given the same training data. Moreover, the LBG algorithm and the EM algorithm, the usual training methods for HMMs, have not been compared. Therefore, in this paper, the recognition performance of the respective HMMs and the respective training methods is compared under the same conditions. It was found that the number of parameters and the word error rate of the two HMMs are equivalent when the number of codebooks is sufficiently large. It was also found that training with the LBG algorithm achieves a 90% reduction in training time compared with training with the EM algorithm, without degrading recognition accuracy.
  • Hiroyuki Segi, Reiko Takou, Nobumasa Seiyama, Tohru Takagi, Yuko Uematsu, Hideo Saito, Shinji Ozawa
    IEEE TRANSACTIONS ON BROADCASTING, 59(3) 548-555, Sep, 2013  Peer-reviewed
Here we describe a speech-synthesis method using templates that can generate recording-sentence sets for speech databases and produce natural sounding synthesized speech. Applying this method to the Japan Broadcasting Corporation (NHK) weather report radio program reduced the size of the recording-sentence set required to just a fraction of that needed by a comparable method. After integrating the recording voice of the generated recording-sentence set into the speech database, speech was produced by a voice synthesizer using templates. In a paired-comparison test, 66% of the speech samples synthesized by our system using templates were preferred to those produced by a conventional voice synthesizer. In an evaluation test using a five-point mean opinion score (MOS) scale, the speech samples synthesized by our system scored 4.97, whereas the maximum score for commercially available voice synthesizers was 3.09. In addition, we developed an automatic broadcast system for the weather report program using the speech-synthesis method and speech-rate converter. The system was evaluated using real weather data for more than 1 year, and exhibited sufficient stability and synthesized speech quality for broadcast purposes.
  • Reiko Takou, Hiroyuki Segi, Tohru Takagi, Nobumasa Seiyama
    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES, E95A(4) 751-759, Apr, 2012  Peer-reviewed
    The frequency regions and spectral features that can be used to measure the perceived similarity and continuity of voice quality are reported here. A perceptual evaluation test was conducted to assess the naturalness of spoken sentences in which either a vowel or a long vowel of the original speaker was replaced by that of another. Correlation analysis between the evaluation score and the spectral feature distance was conducted to select the spectral features that were expected to be effective in measuring the voice quality and to identify the appropriate speech segment of another speaker. The mel-frequency cepstrum coefficient (MFCC) and the spectral center of gravity (COG) in the low-, middle-, and high-frequency regions were selected. A perceptual paired comparison test was carried out to confirm the effectiveness of the spectral features. The results showed that the MFCC was effective for spectra across a wide range of frequency regions, the COG was effective in the low- and high-frequency regions, and the effective spectral features differed among the original speakers.
  • Hiroyuki SEGI
    Keio University, Mar, 2012  Peer-reviewed
  • Hiroyuki Segi, Reiko Takou, Nobumasa Seiyama, Tohru Takagi, Hideo Saito, Shinji Ozawa
    2011 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 1757-1760, 2011  Peer-reviewed
    Here we propose a sentence-generation method using templates that can be applied to create a speech database. This method requires the recording of a relatively small sentence set, and the resultant speech database can generate comparatively natural sounding synthesized speech. Applying this method to the Japan Broadcasting Corporation (NHK) weather report radio program reduced the size of the required sentence set to just a fraction of that required by comparable methods. We also propose a speech-synthesis method using templates. In an evaluation test, 66% of the speech samples synthesized by the proposed method using templates were preferred to those produced by the conventional concatenative speech-synthesis method.
  • Hiroyuki Segi, Reiko Takou, Nobumasa Seiyama, Tohru Takagi, Hideo Saito, Shinji Ozawa
    Kyokai Joho Imeji Zasshi/Journal of the Institute of Image Information and Television Engineers, 65(1) 76-83, 2011  Peer-reviewed
    The design method of a sentence set for a speech-synthesis database strongly influences the quality of the synthesized speech. To minimize the costs associated with making the speech recordings and constructing the speech database, the number of sentences in the set should be limited. However, if a sentence set does not include sufficient data, the quality of the synthesized speech can be inadequate. In this paper, we propose a method for generating a sentence set from templates. When applied to the templates in the "Weather Report" radio program, the proposed method reduced the number of sentences in the set to less than a few percent of that required by a comparison method. In addition, the mean opinion score of speech samples synthesized using the proposed method was 4.32 on a five-point scale.
  • Hiroyuki Segi, Reiko Takou, Nobumasa Seiyama, Tohru Takagi
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 56(1) 169-174, Feb, 2010  Peer-reviewed
    Here we propose a prototype data-broadcast receiver equipped with a voice synthesizer, which can read out stock prices and stock-price changes from a live data broadcast. Using this receiver, listeners can access their chosen stock information at any time and at an appropriate speech rate. We also propose a high-quality voice synthesizer for use with this receiver. A subjective evaluation confirmed the superiority of this voice synthesizer over commercially available ones.
  • Hiroyuki Segi, Reiko Tako, Nobumasa Seiyama, Tohru Takagi
    2010 DIGEST OF TECHNICAL PAPERS INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS ICCE, 9.1-4 411-412, 2010  Peer-reviewed
    Here we propose a prototype data-broadcast receiver equipped with a voice synthesizer, which can use the stock prices and the stock-price changes from a live data broadcast. Using this receiver, listeners can access their chosen stock information at any time and at a speech rate appropriate to each individual. We also propose a high-quality voice synthesizer for use with this receiver. The results of a subjective evaluation confirmed the superiority of the proposed voice synthesizer compared with commercially available voice synthesizers.
  • SEGI HIROYUKI, TAKO REIKO, SEIYAMA NOBUMASA, TAKAGI TOORU
    IPSJ Journal (CD-ROM), 50(2) 575-586, Feb, 2009  Peer-reviewed
  • TAKOU Reiko, SEGI Hiroyuki, SEIYAMA Nobumasa, TAKAGI Tohru
    PROCEEDINGS OF THE ITE WINTER ANNUAL CONVENTION, 2008 7-9-1, 2008  
    We are developing a tool that synthesizes concatenated word speech and corrects its degradation to achieve broadcast quality. In this study, several correction functions were introduced into the tool, making it possible to investigate better correction procedures for generating high-quality synthesized speech.
  • Hiroyuki Segi, Nobumasa Seiyama, Reiko Tako, Tohru Takagi, Satoshi Oode, Atsushi Imai, Masamichi Nishiwaki, Ryuji Koyama
    Kyokai Joho Imeji Zasshi/Journal of the Institute of Image Information and Television Engineers, 62(1) 69-76, 2008  Peer-reviewed
    The 'Kabushiki Shikyo' program, broadcast on NHK Radio 2, reports on the daily closing prices and net changes of about 830 stocks listed on the Tokyo Stock Exchange. Reading out the numerical values within the allotted broadcast time without making mistakes can be extremely difficult for the announcers. We have therefore developed a prototype voice synthesizer for stock-price bulletins, which uses numerical speech synthesis and automatic speech-rate conversion. Our prototype system has been used in experimental digital terrestrial radio broadcasts since October 2006. This article describes the generation of texts to build the speech waveform database, the mechanism used to synthesize numerical speech via the database, the evaluation of naturalness of synthesized speech samples, and the prototype system currently being used by experimental digital terrestrial radio.
  • Segi Hiroyuki, Seiyama Nobumasa, Tako Reiko, Takagi Tohru, Toda Hideo, Koyama Ryuji
    National Association of Broadcasters Proceedings (NAB) Broadcasting Engineering Conference, 205-212, Apr, 2007  Peer-reviewed
  • S Sato, H Segi, K Onoe, E Miyasaka, H Isono, T Imai, A Ando
    ELECTRONICS AND COMMUNICATIONS IN JAPAN PART III-FUNDAMENTAL ELECTRONIC SCIENCE, 88(2) 41-51, 2005  Peer-reviewed
    In speech recognition systems where the speaker and utterance environment cannot be designated, the drop in recognition precision due to the incompatibility of the input speech and acoustic model's training data is a problem. Although this problem is normally solved by speaker adaptation, sufficient precision cannot be achieved for speaker adaptation unless good-quality adaptation data can be obtained. In this paper, the authors propose a method of efficiently clustering large-scale data using the likelihoods of a cluster model that was created from small-scale data as the criteria to obtain a high-precision adapted acoustic model. They also propose a method of using the cluster model to automatically determine the adapted acoustic model during recognition from only the beginning of the sentences of the input speech. The results of applying the proposed technique to news speech recognition experiments show that the adapted acoustic model selection precision can be ensured by using only 0.5 second of data of the beginnings of sentences of the input speech and that the proposed technique achieves a reduction rate for invalid recognitions of 20% and a reduction in the time required for recognition of 23% compared with when the adapted acoustic model for each cluster is not used. (C) 2004 Wiley Periodicals, Inc.
  • Hiroyuki Segi, Tohru Takagi, Takayuki Ito
    Fifth ISCA ITRW on Speech Synthesis, Pittsburgh, PA, USA, June 14-16, 2004, 115-120, 2004  Peer-reviewed
  • K Onoe, H Segi, T Kobayakawa, S Sato, S Homma, T Imai, A Ando
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, E86D(3) 483-488, Mar, 2003  Peer-reviewed
    In this paper, we propose a new technique of filter bank subtraction for robust speech recognition under various acoustic conditions. Spectral subtraction is a simple and useful technique for reducing the influence of additive noise. Conventional spectral subtraction assumes accurate estimation of the noise spectrum and no correlation between speech and noise. Those assumptions, however, are rarely satisfied in reality, leading to the degradation of speech recognition accuracy. Moreover, the recognition improvement attained by conventional methods is slight when the input SNR changes sharply. We propose a new method in which the output values of filter banks are used for noise estimation and subtraction. By estimating noise at each filter bank, instead of at each frequency point, the method alleviates the necessity for precise estimation of noise. We also take into consideration expected phase differences between the spectra of speech and noise in the subtraction and control a subtraction coefficient theoretically. Recognition experiments on test sets at several SNRs showed that the filter bank subtraction technique improved the word accuracy significantly and got better results than conventional spectral subtraction on all the test sets. In other experiments, on recognizing speech from TV news field reports with environmental noise, the proposed subtraction method yielded better results than the conventional method.
  • A Ando, T Imai, A Kobayashi, S Homma, J Goto, N Seiyama, T Mishima, T Kobayakawa, S Sato, K Onoe, H Segi, A Imai, A Matsui, A Nakamura, H Tanaka, T Takagi, E Miyasaka, H Isono
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, E86D(1) 15-25, Jan, 2003  Peer-reviewed
    There is a strong demand to expand captioned broadcasting for TV news programs in Japan. However, keyboard entry of captioned manuscripts for news programs cannot keep pace with the speed of speech, because in the case of Japanese it takes time to select the correct characters from among homonyms. In order to implement simultaneous subtitled broadcasting for Japanese news programs, a simultaneous subtitling system using speech recognition has been developed. This system consists of a real-time speech recognition system to handle broadcast news transcription and a recognition-error correction system that manually corrects mistakes in the recognition result with a short delay time. NHK started simultaneous subtitled broadcasting for the news program "News 7" on the evening of March 27, 2000.
  • SATO Shoei, SEGI Hiroyuki, ONOE Kazuo, MIYASAKA Eiichi, ISONO Haruo, IMAI Toru, ANDO Akio
    The transactions of the Institute of Electronics, Information and Communication Engineers. D-II, 85(2) 174-183, Feb, 2002  Peer-reviewed
  • Kazuo Onoe, Hiroyuki Segi, Takeshi Kobayakawa, Shoei Sato, Toru Imai, Akio Ando
    7th International Conference on Spoken Language Processing, ICSLP2002 - INTERSPEECH 2002, Denver, Colorado, USA, September 16-20, 2002, 2002  Peer-reviewed
  • ANDO Akio, IMAI Toru, KOBAYASHI Akio, HONMA Shinichi, GOTO Jun, SEIYAMA Nobumasa, MISHIMA Takeshi, KOBAYAKAWA Takeshi, SATO Shoei, ONOE Kazuo, SEGI Hiroyuki, IMAI Atsushi, MATSUI Atsushi, NAKAMURA Akira, TANAKA Hideki, TAKAGI Tohru, MIYASAKA Eiichi, ISONO Haruo
    The transactions of the Institute of Electronics, Information and Communication Engineers. D-II, 84(6) 877-887, Jun, 2001  Peer-reviewed
  • Atsushi Matsui, Hiroyuki Segi, Akio Kobayashi, Toru Imai, Akio Ando
    EUROSPEECH 2001 Scandinavia, 7th European Conference on Speech Communication and Technology, 2nd INTERSPEECH Event, Aalborg, Denmark, September 3-7, 2001, 709-712, 2001  Peer-reviewed
  • Hiroyuki SEGI
    Keio University, Mar, 1996  Peer-reviewed

Books and Other Publications

 1
  • Supervised by Nobuyuki Yagi; authored by Hiroyuki Segi et al. (Role: Contributor, Chapter 11: Speech Synthesis)
    Ohmsha, Jul, 2008

Presentations

 47

Research Projects

 1

Industrial Property Rights

 72