11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, 442-445, 2010 Peer-reviewed
A variety of methods for audio-visual integration, which combine audio and visual information at the level of features, states, or classifier outputs, have been proposed for robust speech recognition. However, these methods do not always fully utilize auditory information when the signal-to-noise ratio becomes low. In this paper, we propose a novel approach to estimating speech signals in noisy environments. The key idea behind this approach is to exploit clean speech candidates generated by using the timing structure between mouth movements and sound signals. We first extract a pair of feature sequences from the media signals and segment each sequence into temporal intervals. Then, we construct a cross-media timing-structure model of human speech by learning the temporal relations of overlapping intervals. Based on the learned model, we generate clean speech candidates from the observed mouth movements.
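The candidate-generation idea in this abstract can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the lag statistics, the helper name `candidate_onsets`, and the idea of proposing onsets at mean-lag plus sigma offsets are all assumptions made for the sketch.

```python
import statistics

# "Learned" timing model (illustrative numbers): lags observed during
# training between a mouth-motion onset and the corresponding speech onset.
training_lags = [0.10, 0.12, 0.11, 0.13]
mu = statistics.mean(training_lags)     # mean lag, in seconds
sigma = statistics.stdev(training_lags) # spread of the lag

def candidate_onsets(mouth_onset: float, n: int = 3) -> list[float]:
    """Propose candidate speech onsets around the learned mean lag."""
    return [mouth_onset + mu + k * sigma for k in range(-(n // 2), n // 2 + 1)]

# From an observed mouth-motion onset, generate clean-speech timing candidates.
cands = candidate_onsets(mouth_onset=2.0)
```

A full system would score each candidate against the noisy audio; the sketch only shows how a learned cross-media timing model turns visual events into speech-timing hypotheses.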
Proceedings - 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, ACII 2009, 201-208, 2009 Peer-reviewed
Conference on Human Factors in Computing Systems - Proceedings, 3585-3590, 2008 Peer-reviewed
Turn-taking in a smooth conversation is supported by participants' anticipation of the floor-handover timing. However, it becomes difficult to maintain natural turn-taking in video conferencing with transmission delays, because the utterances and movements of each participant are presented to the others with a time lag, which often leads to collisions of utterances. In order to facilitate smooth communication over a video-conferencing system, we propose a novel method, "Visual Filler," that fills temporal gaps in turn-taking caused by delays. Visual Filler overlays an artificial visual stimulus, which has a function similar to that of filler sounds, on a screen with participant images. We have evaluated the effectiveness of Visual Filler in reducing the unnaturalness of turn-taking in a simulated dyadic dialog situation with a delay.
ARTICULATED MOTION AND DEFORMABLE OBJECTS, PROCEEDINGS, 4069 453-463, 2006 Peer-reviewed
Modeling and describing temporal structure in multimedia signals, which are captured simultaneously by multiple sensors, is important for realizing human-machine interaction and motion generation. This paper proposes a method for modeling temporal structure in multimedia signals based on temporal intervals of primitive signal patterns. Using the temporal difference between the beginning points and that between the ending points of the intervals, we can explicitly express timing structure, that is, synchronization and mutual dependency among media signals. We applied the model to video signal generation from an audio signal to verify its effectiveness.
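The interval-pair representation described above can be sketched in a few lines. This is a minimal illustration under assumed names (`Interval`, `timing_differences`) and made-up timings; it is not the paper's code.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    """A temporal interval of a primitive signal pattern, in seconds."""
    begin: float
    end: float

def timing_differences(a: Interval, b: Interval) -> tuple[float, float]:
    """Return (begin difference, end difference) between two intervals.

    The pair explicitly encodes timing structure: (0, 0) means the two
    media patterns are fully synchronized, while nonzero values quantify
    how much one pattern leads or lags the other.
    """
    return (b.begin - a.begin, b.end - a.end)

# Example: a mouth-motion interval and an overlapping audio interval.
mouth = Interval(begin=0.00, end=0.80)
audio = Interval(begin=0.12, end=0.85)
db, de = timing_differences(mouth, audio)  # audio starts and ends slightly late
```

Collecting such difference pairs over many overlapping intervals gives the distribution from which synchronization and mutual dependency can be modeled.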
IEICE Transactions on Fundamentals, E88-A(11) 3022-3035, Nov, 2005 Peer-reviewed
This paper addresses the parameter estimation problem of an interval-based hybrid dynamical system (interval system). The interval system has a two-layer architecture that comprises a finite state automaton and multiple linear dynamical systems. The automaton controls the activation timing of the dynamical systems based on a stochastic transition model between intervals. Thus, the interval system can generate and analyze complex multivariate sequences that consist of temporal regimes of dynamic primitives. Although the interval system is a powerful model for representing human behaviors such as gestures and facial expressions, the learning process has a paradoxical nature: temporal segmentation of primitives and identification of the constituent dynamical systems need to be solved simultaneously. To overcome this problem, we propose a multiphase parameter estimation method that consists of a bottom-up clustering phase of linear dynamical systems and a refinement phase of all the system parameters. Experimental results show that the method can organize the hidden dynamical systems behind the training data and refine the system parameters successfully.
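The two-layer architecture described above can be sketched as follows. This is a minimal generative illustration with assumed, scalar dynamics and a deterministic automaton standing in for the stochastic transition model; it is not the paper's estimation method.

```python
# Two scalar linear dynamical systems x_{t+1} = a*x + b (illustrative
# parameters), activated in alternation by a finite-state automaton.
dynamics = {
    "rise": (0.9, 0.5),   # x converges toward +5.0
    "fall": (0.9, -0.5),  # x converges toward -5.0
}
transitions = {"rise": "fall", "fall": "rise"}  # deterministic, for brevity

def generate(n_intervals: int, duration: int = 20) -> list[float]:
    """Generate a sequence as a concatenation of temporal regimes."""
    state, x, seq = "rise", 0.0, []
    for _ in range(n_intervals):
        for _ in range(duration):
            a, b = dynamics[state]
            x = a * x + b           # the dynamical system active in this interval
            seq.append(x)
        state = transitions[state]  # the automaton switches the active system
    return seq

seq = generate(n_intervals=4)
```

The learning problem the paper addresses is the inverse of this sketch: given only `seq`, recover both the interval boundaries and the `(a, b)` parameters of each regime, which is why segmentation and system identification must be solved together.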
3rd International Conference on Advances in Pattern Recognition (S. Singh et al. Eds.: ICAPR 2005 Springer LNCS 3686), 229-238, Aug, 2005 Peer-reviewed
ANALYSIS AND MODELLING OF FACES AND GESTURES, PROCEEDINGS, 3723 140-154, 2005 Peer-reviewed
This paper presents a method for interpreting facial expressions based on temporal structures among partial movements in facial image sequences. To extract the structures, we propose a novel facial expression representation, which we call a facial score, similar to a musical score. The facial score enables us to describe facial expressions as spatio-temporal combinations of temporal intervals; each interval represents a simple motion pattern with the beginning and ending times of the motion. Thus, we can classify fine-grained expressions from multivariate distributions of the temporal differences between the intervals in the score. In this paper, we provide a method to obtain the score automatically from input images using bottom-up clustering of dynamics. We evaluate the effectiveness of facial scores by comparing the temporal structure of intentional smiles with that of spontaneous smiles.
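The facial-score representation can be sketched as a mapping from facial parts to lists of motion intervals. The structure, part names, and timings below are illustrative assumptions, not data or code from the paper.

```python
# A toy "facial score": each facial part has (begin, end) intervals, in
# seconds, of its primitive motion patterns (illustrative values).
facial_score = {
    "mouth":  [(0.00, 0.60)],
    "eyelid": [(0.15, 0.55)],
}

def onset_lag(score: dict, part_a: str, part_b: str, k: int = 0) -> float:
    """Temporal difference between the beginning times of the k-th
    intervals of two facial parts."""
    return score[part_b][k][0] - score[part_a][k][0]

# One feature of the expression's temporal structure: how long after the
# mouth motion the eyelid motion begins.
lag = onset_lag(facial_score, "mouth", "eyelid")
```

Classifying expressions then amounts to comparing the multivariate distribution of such lags, for example between intentional and spontaneous smiles.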
Systems and Computers in Japan, 34(14) 1-12, Dec, 2003 Peer-reviewed
This paper proposes a system architecture for event recognition that dynamically integrates information from multiple sources (e.g., multimodal data from visual and auditory sensors). The proposed system consists of multiple event classifiers called Continuous State Machines (CSMs). Each CSM has a state transition rule in a continuous state space and classifies time-varying patterns from a different single source. Since the rule is defined as an extension of the Kalman filter (i.e., the next state is deduced from a trade-off between the input data and the model's prediction), CSMs support dynamic time warping and robustness against noise. We then introduce an interaction method among CSMs to classify events from multiple sources. A continuous state space (i.e., a vector space) allows us to design the interaction as minimization of an energy function. This interaction enables the system to dynamically suppress unreliable classifiers, and improves the system's reliability and the accuracy of classifying events in dynamically changing situations (e.g., when the object is temporarily occluded from one of multiple cameras in a gesture recognition task). Experimental results on gesture recognition with two cameras show the effectiveness of the proposed system.
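The Kalman-style trade-off rule underlying a CSM can be sketched in the scalar case. The function name, parameters, and values below are illustrative assumptions; the actual CSM operates in a multivariate state space.

```python
def csm_step(x: float, y: float, a: float = 1.0, gain: float = 0.3) -> float:
    """One state update: blend the model's prediction a*x with the
    observation y.

    gain = 0 trusts the model only; gain = 1 trusts the input only.
    Lowering the gain for one classifier is how interaction can suppress
    an unreliable input stream (e.g., an occluded camera).
    """
    pred = a * x
    return pred + gain * (y - pred)

# Track a constant observed feature: the state approaches the observation
# at a rate set by the gain.
x = 0.0
for y in [1.0, 1.0, 1.0]:
    x = csm_step(x, y)
```

The trade-off form makes the noise-robustness claim concrete: a noisy sample moves the state only by `gain` times the innovation, while repeated consistent evidence still pulls the state toward the observation.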
16TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL II, PROCEEDINGS, 2 785-789, 2002 Peer-reviewed
This paper proposes a system architecture for event recognition that integrates information from multiple sources (e.g., gesture and speech recognition from distributed sensors in the real world). The proposed system consists of multiple recognizers named Continuous State Machines (CSMs). Each CSM has a state transition rule in a continuous state space and classifies time-varying patterns from a single source. Since the rule is defined as a simplification of the Kalman filter (i.e., the next state is deduced from a trade-off between the input data and the model's prediction), CSMs support dynamic time warping and robustness against noise. We then introduce an interaction method among CSMs to classify events from multiple sources. A continuous state space (i.e., a vector space) allows us to design the interaction as recursive minimization of an energy function. This interaction enables the system to dynamically shift its focus over the multiple sources, and improves the reliability and accuracy of classifying events in dynamically changing situations (e.g., when the object is temporarily occluded from one of multiple cameras in a gesture recognition task). Experimental results on gesture recognition with two cameras show the effectiveness of the proposed system.
P. Benner, R. Findeisen, D. Flockerzi, U. Reichl, K. Sundmacher (Role: Contributor, Chap.3, Magnus Egerstedt, Jean-Pierre de la Croix, Hiroaki Kawashima, and Peter Kingston, "Interacting with Networks of Mobile Agents")
Grants-in-Aid for Scientific Research Grant-in-Aid for Transformative Research Areas (A), Japan Society for the Promotion of Science, Sep, 2021 - Mar, 2026