Abstract:
Prosody plays an important role in discriminating between languages and speakers. Due to the complexity of estimating relevant prosodic information, most systems rely on the notion that the statistics of the fundamental frequency (as a proxy for pitch) and speech energy (as a proxy for loudness) distributions can be used to capture prosodic differences between speakers and languages. However, this simplistic notion disregards the temporal aspects of prosodic features that determine certain phenomena, such as intonation and stress. We propose alternative approaches that exploit the dynamics between the fundamental frequency and speech energy. The aim is to characterize different intonation and stress patterns produced by the variation in fundamental frequency and speech energy. We show that the proposed approaches can capture speaker- and language-specific information and that they provide information complementary to that of conventional systems.
Abstract:
In spontaneous speech, speakers segment their speech into intonational phrases, and make repairs to what they are saying. However, techniques for understanding spontaneous speech tend to treat these events as noise, in the same manner as they handle out-of-grammar constructions and misrecognitions. In our approach, we advocate that these events should be explicitly modeled, and that they must be resolved early in the processing stream. We put forward a statistical language model, which can be used during speech recognition, that models these events. This not only improves speech recognition perplexity and POS tagging, but also results in much richer output from the recognizer, with speech repairs resolved and intonational phrase boundaries identified. Syntactic and semantic processing can thus focus on dealing with out-of-grammar constructions and misrecognitions.
About the speaker:
Dr. Peter Heeman is an assistant professor at the OGI School of Science and Engineering at the Oregon Health & Science University. He is a member of the Center for Spoken Language Understanding and the Center for Human Computer Communication. Dr. Heeman does research on the automatic recognition of spontaneous speech, which contains disfluencies and intonational phrases. He also conducts research on dialogue management and spoken dialogue systems. Dr. Heeman received his Ph.D. from the University of Rochester in 1997 and his Master of Science from the University of Toronto in 1991, and has worked at CNET France Telecom and at ATR, Japan.
Abstract:
The talk presents an attempt at using the syntactic structure in natural language for improved language models for large vocabulary speech recognition. The structured language model merges techniques in automatic parsing and language modeling using an original probabilistic parameterization of a shift-reduce parser. A maximum likelihood reestimation procedure belonging to the class of expectation-maximization algorithms is employed for training the model. Experiments on the Wall Street Journal, Switchboard and Broadcast News corpora show improvement in both perplexity and word error rate (measured by word lattice rescoring) over the standard 3-gram language model. Further experiments investigate the portability of syntactic structure across domains (Wall Street Journal to Air Travel Information Systems) as well as the use of the structured language model for information extraction from text.
Click here for the PDF slides of Chelba's talk.
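For readers unfamiliar with the two evaluation measures named in this abstract, the following is a minimal sketch of how perplexity and word error rate are commonly computed. It is not taken from the talk: the function names `perplexity` and `word_error_rate` are illustrative, and the inputs (per-word natural-log probabilities from some language model, and whitespace-tokenized transcripts) are assumptions.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities.

    Assumption: log_probs[i] is ln P(w_i | history) under some language model.
    """
    if not log_probs:
        raise ValueError("need at least one word probability")
    avg_neg_log = -sum(log_probs) / len(log_probs)
    return math.exp(avg_neg_log)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length.

    Assumption: both transcripts are plain strings tokenized on whitespace.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

A model that assigns probability 0.25 to every word has perplexity 4, and a hypothesis with one substitution and one deletion against a four-word reference has a word error rate of 0.5.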
Abstract:
This work investigates the use of frequency-localized temporal patterns of the speech signal for developing a robust front-end for Automatic Speech Recognition (ASR). TempoRAl PatternS (TRAPS) are investigated for estimating broad-phonetic features independently in each critical band. These features are combined and finally used for continuous word recognition. Our work shows that broad-phonetic TRAPS features generalize better than other conventional features and yield considerable complementary information with respect to short-term cepstral features in ASR. Two practical applications are proposed for the broad-phonetic TRAPS features:
- Distributed Speech Recognition (DSR) in cellular telephony
- Voice Activity Detection (VAD) tasks
These features yield a significant improvement in performance for these applications.
TempoRAl PatternS (TRAPS) are further investigated for speech-event-based feature estimation. New band-independent categories are proposed which represent distinct speech events in the frequency-localized temporal patterns of the speech signal. A Universal TempoRAl PatternS (UTRAPS) system is proposed for feature estimation. On combining speech-event UTRAPS features with cepstral features, a significant improvement in recognition performance under widely varying noisy conditions is achieved. We show that speech-event UTRAPS features generalize better than broad-phonetic TRAPS features and give similar gains in recognition performance. With the new UTRAPS system, we achieve a significant reduction in the number of parameters.
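As a rough illustration of what a frequency-localized temporal pattern is, the sketch below cuts the energy trajectory of a single band out of a frame-by-band matrix and mean-normalizes it. This is an assumption-laden sketch and not the system described in the abstract: the function name `temporal_pattern`, the `context` parameter, the input layout (`band_energies[t][b]` as the log energy of band `b` at frame `t`), and the mean normalization step are all illustrative choices.

```python
def temporal_pattern(band_energies, band, center, context):
    """Return the mean-normalized energy trajectory of one band.

    band_energies: list of frames, each a list of per-band log energies
                   (hypothetical layout, one row per frame).
    band:          index of the band to follow over time.
    center:        index of the frame the pattern is centered on.
    context:       number of frames taken on each side of the center.
    """
    lo, hi = center - context, center + context
    if lo < 0 or hi >= len(band_energies):
        raise IndexError("temporal window extends past the signal")
    # Follow a single band across 2 * context + 1 consecutive frames.
    trajectory = [band_energies[t][band] for t in range(lo, hi + 1)]
    # Remove the mean so the pattern reflects shape rather than overall level.
    mean = sum(trajectory) / len(trajectory)
    return [x - mean for x in trajectory]
```

In a setup of this kind, one such vector per band would be passed to a band-specific classifier, and the per-band outputs would then be combined; the vector length and the classifier are left open here.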