SRI Speech Technology and Research Laboratory

Seminar Series

1997 Seminar Talks

Edward L. Riegelsberger, Ohio State University

Time: Tuesday, July 22, 1997, 11am
Title: Acoustic-to-Articulatory Mapping of Voiced and Fricated Speech

Abstract:
Acoustic-to-articulatory mapping is the estimation of a time-varying vocal-tract shape from an acoustic waveform. While most research in acoustic-to-articulatory mapping considers only purely voiced speech, this work considers the problem for speech that includes fricatives. Aspects of fricative production and perception challenge many of the assumptions and techniques used in existing acoustic-to-articulatory mapping algorithms. This work investigates these issues and extends existing techniques for the acoustic-to-articulatory mapping of purely voiced speech to unvoiced and voiced fricatives, both in isolation and in continuous speech.

Yunxin Zhao, ECE Department and Beckman Institute, University of Illinois at Urbana-Champaign

Time: Wednesday, August 20, 1997, 10:30 am
Title: Speech Processing Methods for Environment-Robust ASR

Abstract:
Environmental noise poses a significant challenge to the use of speech recognition technology in real-world applications. This talk will first summarize our recent efforts on speech processing and modeling for dealing with additive noise, convolutive noise, and interfering speech. The talk will then focus on a recently developed technique for speech recognition in the presence of both additive and convolutive noise. The method is based on maximum likelihood estimation of the distortion channel and noise level using an EM algorithm, where a correlation-matching algorithm is derived for parameter initialization. The posterior estimates of the speech signal statistics are used to derive spectral features of speech that better match the trained speech models. Experimental results are presented to illustrate the effectiveness of the discussed techniques under various noise conditions.

Kyrill Fischer, Deutsche Telekom Berkom, Darmstadt, Germany, and International Computer Science Institute, Berkeley

Time: Wednesday, September 3, 1997, 11am
Title: Investigating Simultaneous Masking in Speech
Abstract:
During the last months of my research at ICSI, I investigated the effect of simultaneous masking in speech. The first step of this investigation was the definition of a reasonable "psychoacoustic model" that predicts the simultaneous masking threshold. A corresponding model of the MPEG/Audio Encoding Standard has been adapted and "generalized" in order to increase the spectral resolution and to enable the processing of a wider range of sampling frequencies. Using this "generalized model", a statistical analysis of masking has been carried out.

Silke Witt, Cambridge University

Time: Thursday, September 4, 1997, 11am
Title: Automatic Pronunciation Scoring for Foreign Accented Speech
Abstract:
This talk will present a method of assessing non-native speech to aid computer-assisted pronunciation teaching. The method is based on automatic speech recognition techniques using Hidden Markov Models. Confidence scores at the phoneme level are calculated to provide detailed information about the pronunciation quality of a foreign language student. Experimental results are given based on both artificial data and a database of non-native speech. The latter was recorded specifically for this purpose and will also be described. The presented results suggest that the metric is capable of locating and assessing mispronunciations at the phoneme level.

Bernhard Suhm, Interactive Systems Laboratories, Carnegie Mellon University and Karlsruhe University

Time: Wednesday, December 10, 1997, 11am
Location: Computer Dialog Lab (CDL), EK101
Title: Empirical Evaluation of Interactive Multimodal Error Correction
Abstract:
Recently, the first commercial dictation systems for continuous speech have become available. Although they generally received positive reviews, error correction is still limited to choosing from a list of alternatives, speaking again, or typing. The goal of my research was to develop more flexible and efficient methods to recover from recognition errors, and to empirically evaluate them in user studies. My approach is to involve the user in correcting ("interactive") and to offer the possibility to switch between different modalities ("multimodal"). The current bag of correction tools includes, in addition to the above-mentioned standard methods, spelling, handwriting, and pen gestures, at the level of both whole words and letters within a word. I integrated these correction methods with our large vocabulary speech recognition system to build a prototypical multimodal listening typewriter. We designed an experiment to empirically evaluate the efficiency of different error correction modalities. The experiment compares multimodal correction strategies with correction strategies available in current speech recognition applications. We confirm the hypothesis that switching modality can significantly expedite corrections. However, in applications where a keyboard is acceptable, typing currently remains the fastest method to correct errors for users with good typing skills. If the keyboard is not desired, either due to application constraints or user preferences, our multimodal error correction makes it possible to reproduce text at a speed which exceeds fast unskilled typing. This speed includes the time necessary to correct the errors that the large vocabulary speech recognizer makes in decoding the dictated sentences.

Michael Finke, Carnegie Mellon University

Time: Thursday, December 11, 1997, 10:30 am
Title: Mode Dependent Large Vocabulary Conversational Speech Recognition using the Janus Recognition Toolkit
Abstract:
In spontaneous conversational speech there is a large amount of variability due to accents, speaking styles and speaking rates (also known as the speaking mode). Because current recognition systems usually use only a relatively small number of pronunciation variants for the words in their dictionaries, the amount of variability that can be modeled is limited. Increasing the number of variants per dictionary entry is the obvious solution. Unfortunately, this also increases the confusability between the dictionary entries, and thus often leads to an actual performance decrease. In this talk I will present a framework for speaking-mode-dependent pronunciation modeling. The probability of encountering a pronunciation variant is defined to be a function of the speaking style. The probability function is learned through decision trees from rule-generated pronunciation variants as observed on the Switchboard corpus. The framework is successfully applied to significantly increase the performance of our Janus Recognition Toolkit Switchboard recognizer.

Roni Rosenfeld, Carnegie Mellon University

Time: Thursday, December 18, 1997, 2pm
Location: Computer Dialog Lab (CDL), EK101
Title: A Whole Sentence Maximum Entropy Language Model (and other language modeling projects) (paper available)
Abstract:
A new kind of language model will be described, which models whole sentences or utterances directly using the Maximum Entropy (ME) paradigm. The new model is conceptually simpler, and more naturally suited to modeling whole-sentence phenomena, than the conditional ME models proposed to date. By avoiding the chain rule, the model treats each sentence or utterance as a "bag of features", where features are arbitrary computable properties of the sentence. The model is unnormalizable, but this does not interfere with training (done via sampling) or with use. Using the model is computationally straightforward. The main computational cost of training the model is in generating sample sentences from a Gibbs distribution. Interestingly, this cost depends on different factors than in the comparable conditional ME model, and is potentially lower.

The talk will be preceded by a short summary of several other language modeling projects at our lab, and start with a general introduction to statistical language modeling.
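
As a rough illustration of the "bag of features" idea in the abstract above, the following minimal sketch scores a whole sentence as a baseline log-probability plus a weighted sum of sentence-level features, without computing a normalizing constant. It is not taken from the talk or the paper: the function names, the two features, the weights, and the flat default baseline are all hypothetical, chosen only to show why an unnormalized score is still enough to compare two candidate sentences.

```python
from typing import Callable, Sequence

Sentence = Sequence[str]
Feature = Callable[[Sentence], float]


def unnormalized_log_score(
    sentence: Sentence,
    features: Sequence[Feature],
    weights: Sequence[float],
    log_p0: Callable[[Sentence], float] = lambda s: 0.0,
) -> float:
    """Return log p0(s) + sum_i weight_i * f_i(s).

    The normalizing constant is deliberately left out, so the result is
    only meaningful relative to the scores of other sentences.
    log_p0 is an assumed baseline model (flat by default).
    """
    return log_p0(sentence) + sum(w * f(sentence) for f, w in zip(features, weights))


# Hypothetical whole-sentence features: any computable property will do.
def is_long(sentence: Sentence) -> float:
    return 1.0 if len(sentence) > 5 else 0.0


def starts_with_question_word(sentence: Sentence) -> float:
    question_words = {"what", "who", "where", "when", "why", "how"}
    return 1.0 if sentence and sentence[0].lower() in question_words else 0.0


FEATURES = [is_long, starts_with_question_word]
WEIGHTS = [-0.5, 1.2]  # illustrative values, not trained


def log_score_difference(a: Sentence, b: Sentence) -> float:
    """Compare two sentences; the unknown normalizer cancels in the difference."""
    return (unnormalized_log_score(a, FEATURES, WEIGHTS)
            - unnormalized_log_score(b, FEATURES, WEIGHTS))
```

Because only score differences are used, a recognizer could in principle rerank candidate transcriptions with such a model without ever normalizing it. Training the weights (by sampling, as the abstract describes) is not shown here.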


Last updated $Date: 2008/12/24 04:36:16 $ by stolcke@speech.sri.com