SRI Speech Technology and Research Laboratory

Seminar Series

1996 Seminar Talks

Ramana Rao, SRI STAR Lab

Time: Thursday, 12 December 1996, 3 PM
Title: Studies on GMM-based Vocal Tract Length Normalization
Abstract:
A major source of variability in speech is the variation in vocal tract length across speakers. A number of techniques have been proposed to compensate explicitly for this variation, and they have been reported to improve recognition. Recently a GMM-based technique was proposed that achieves similar performance at a reduced computational cost. We have conducted studies on this technique with the aim of identifying the parameters for optimal recognition. In this talk, I will describe the studies we conducted and the results we obtained.

Steven Greenberg, University of California, Berkeley, and International Computer Science Institute

Time: Thursday, 17 October 1996, 3 PM
Title: From Sound to Meaning via the Syllable
Abstract:

Models of speech recognition (by both human and machine) have traditionally assumed the phoneme to serve as the fundamental unit of phonetic and phonological analysis. However, phoneme-centric models have failed to provide a convincing theoretical account of the process by which the brain extracts meaning from the speech signal and have fared poorly in automatic recognition of natural, informal speech (e.g., the Switchboard corpus).

Perhaps the most challenging aspect of understanding the speech decoding process concerns the representational stability of linguistic information contained within the acoustic signal (the "phonogram") under a wide array of speaker and environmental variability. The spectro-temporal properties of speech are exceedingly variable when viewed from the perspective of the individual phonetic segment (the "phone"), yet the phonogram is remarkably tolerant of perturbations imposed by reverberation and acoustic interference or variation resulting from differences in vocal tract size, fundamental frequency or speaking rate. Given the variability of the acoustic signal at the phonetic level, how does the brain extract invariant information at the lexical and semantic levels?

The invariance issue diminishes in magnitude if the syllable, rather than the phoneme, is considered to serve as the basic unit of speech recognition, particularly as it can mediate the acoustic-phonetic and lexical levels of linguistic analysis. Lexical structure, which appears relatively complex and arbitrary when analyzed in terms of phonemes, appears more transparent when analyzed as syllabic units, thus providing a possible mechanism for direct access of the lexicon.

This presentation will focus on possible mechanisms by which syllabic units are extracted from the speech signal and their organization into meaningful lexical units. Specifically, it is proposed that syllabic units are derived from an auditory analysis of speech focused on modulation frequencies around 2-10 Hz, and that this low-frequency modulation spectrum is relatively stable over a range of speaker and acoustic conditions that have traditionally posed problems for models of speech recognition. We will also consider the implications of such a syllable-centric model of language for automatic recognition of speech.
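
As a rough illustration of what a low-frequency modulation spectrum measures, the following minimal Python sketch estimates how much of a signal's envelope modulation energy lies between 2 and 10 Hz. It is not the auditory analysis proposed in the talk: the function name `modulation_band_fraction`, the rectify-and-smooth envelope, the 20 ms smoothing window, and the synthetic test signals are all assumptions made for this example.

```python
import numpy as np


def modulation_band_fraction(signal, sample_rate, low_hz=2.0, high_hz=10.0):
    """Fraction of envelope modulation energy inside [low_hz, high_hz]."""
    # Crude amplitude envelope: full-wave rectification followed by a
    # 20 ms moving average (an assumption, not an auditory model).
    window = max(1, int(0.020 * sample_rate))
    envelope = np.convolve(np.abs(signal), np.ones(window) / window, mode="same")
    envelope = envelope - envelope.mean()  # remove the DC component

    # The power spectrum of the envelope is the modulation spectrum.
    power = np.abs(np.fft.rfft(envelope)) ** 2
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / sample_rate)

    in_band = (freqs >= low_hz) & (freqs <= high_hz)
    total = power.sum()
    return float(power[in_band].sum() / total) if total > 0 else 0.0


# Synthetic signals: a 500 Hz carrier amplitude-modulated at 4 Hz
# (a syllable-like rate) and at 40 Hz (well above the 2-10 Hz band).
sample_rate = 8000
t = np.arange(0, 2.0, 1.0 / sample_rate)
carrier = np.sin(2 * np.pi * 500 * t)
slow = (1.0 + 0.8 * np.sin(2 * np.pi * 4 * t)) * carrier
fast = (1.0 + 0.8 * np.sin(2 * np.pi * 40 * t)) * carrier

slow_fraction = modulation_band_fraction(slow, sample_rate)
fast_fraction = modulation_band_fraction(fast, sample_rate)
print(slow_fraction, fast_fraction)
```

With these synthetic inputs, the 4 Hz modulated signal should place most of its modulation energy inside the band and the 40 Hz modulated signal very little. Real speech would call for a filterbank-based envelope and a much more careful analysis.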

Herb Clark, Dept. of Psychology, Stanford University

Time: Wednesday, June 19, 1996, 11:00AM
Title: Strategies in managing disruptions in spontaneous speaking
Abstract:

In conversation, speakers periodically run into problems in choosing what to say and in formulating how to say it, and these lead to disruptions in the smooth flow of speaking. Speakers have good reasons for managing these disruptions, and have a battery of strategies for doing so. I will take up strategies that lead to three features of disrupted speech: repeated articles and pronouns, the fillers "uh" and "um," and "the" pronounced with a non-reduced vowel as "thee."

Johan Myhre Andersen, Intl. Computer Science Institute

Time: Wednesday, August 28, 10am
Title: BANANA REMAP for Speech Recognition
BAyesian Network interpretation of Artificial Neural networks as probability Approximators trained by Recursive Estimation and Maximization of A Posteriori probabilities

Abstract:

In this talk, I will describe a new Bayesian Network (BN) interpretation of a kind of Artificial Neural Network/Hidden Markov Model (ANN/HMM) hybrid. This hybrid is based on phoneme transitions and can be used to calculate the a posteriori probability of a sentence model.

The ANN/HMM hybrid is trained by REMAP, a new training method developed at ICSI. REMAP is an Expectation Maximization (EM)-like method that iteratively increases the a posteriori probability of the correct model of a sentence. The BN interpretation led to a small reformulation of REMAP, which will also be described.

A series of new REMAP experiments has been run, and I will present and discuss the most important results. Although the error rate is reduced only from 11.8% to 11.2% in the best case, a large increase in the a posteriori probability is seen for all runs.

Finally, some suggestions for future experiments are discussed.

Dr. Yoshinori Sagisaka, ATR, Japan

Time: Tuesday, 1 October 1996, 2:00PM-3:00PM
Title: Computational prosody modeling
Abstract:

To synthesize natural speech, a corpus-based approach was initiated at ATR about a decade ago. Since then, computational modeling of prosody control has been studied continuously. In this talk, segmental duration characteristics are introduced with statistical models, and some perceptual studies will also be introduced. Throughout this modeling and analysis, it is pointed out that both engineering modeling by optimizing an objective measure and scientific analysis followed by careful perceptual experiments are indispensable to the sound development of speech technology.

Dr. Qiang Huo, ATR, Japan

Time: Tuesday, 1 October 1996, 3:00PM-3:30PM
Title: Recent research activities on speech recognition at ATR
Abstract:

At ATR, speech technologies have been studied for more than ten years with the aim of building a speech-to-speech translation system. In this talk, our work on speech recognition will be introduced. Feature extraction, speaker adaptation, search, and language modeling will be described, focusing on recent advances such as on-line dynamic adaptation, spontaneous speech recognition using cross-word context constrained word graphs and variable-order N-grams.

Marti Hearst, Xerox PARC

Time: Thursday, 10 October 1996, 10:30 AM
Title: TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages
Abstract:

TextTiling is a technique for subdividing texts into multi-paragraph units that represent passages, or subtopics. The discourse cues used for identifying major subtopic shifts are patterns of lexical co-occurrence and distribution. The algorithm is fully implemented and is shown to produce segmentation that corresponds well to human judgments of the subtopic boundaries of twelve texts. Multi-paragraph subtopic segmentation should be useful for many text analysis tasks, including information retrieval and summarization. In the near future I hope to apply it to the recognition of story boundaries in the transcripts of news broadcasts.
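
The idea of locating subtopic shifts from patterns of lexical co-occurrence can be sketched in a few lines of Python. This is a heavily simplified illustration and not the implementation described in the talk: it compares blocks of whole sentences, keeps stopwords, and reports only the single gap with the lowest cosine similarity. The helper names (`cosine`, `gap_similarities`, `likely_boundary`) and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(sentences, block_size=2):
    """Similarity of the word counts on either side of each sentence gap."""
    bags = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - block_size):gap], Counter())
        right = sum(bags[gap:gap + block_size], Counter())
        scores.append(cosine(left, right))
    return scores


def likely_boundary(sentences, block_size=2):
    """Index of the sentence that follows the least similar gap."""
    scores = gap_similarities(sentences, block_size)
    return min(range(len(scores)), key=scores.__getitem__) + 1


sentences = [
    "The recognizer uses acoustic models of speech.",
    "Acoustic models of speech are trained on recorded speech.",
    "The recognizer decodes speech with these acoustic models.",
    "Parsing assigns a syntactic structure to each sentence.",
    "A grammar constrains the syntactic structure of a sentence.",
    "The parser searches for a structure licensed by the grammar.",
]
boundary = likely_boundary(sentences)
print(boundary)
```

In this toy input the vocabulary changes between the third and fourth sentences, so the lowest similarity falls at that gap. A fuller treatment would smooth the similarity curve and pick several boundaries from the depth of its valleys rather than a single minimum.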

Jason Hutchens, Centre for Intelligent Information Processing Systems, University of Western Australia

Time: Friday, 1 November 1996, 10:30 AM
Title: Natural Language Grammatical Inference
Abstract:

Grammatical Inference is the process of automatically inferring a grammar for a language from a set of strings in the language, and possibly also a set of strings not in the language. In this talk, I focus on the development of a generic Grammatical Inference Engine which is able to find structure within natural and artificial texts.

The techniques used to construct the grammar are similar to those used in my research group to recognise images. Heuristics and ad-hoc hacks have been avoided in favour of a completely automatic, generic system.

The results show that simple techniques can be used to find complicated structure in all sorts of data. Potential applications include speech recognition, image understanding, data compression, natural language database interfaces and machine translation.

Wolfgang Menzel, University of Hamburg

Time: Monday, November 25, 10:30 AM
Title: Robust and Time-Sensitive Processing of Natural Language
Abstract:

Robustness in spoken language processing is a shifting notion. Previously developed techniques (mostly based on a constraint relaxation schema) did not take its multi-faceted nature into account. This not only produced one-sided solutions strongly biased towards a particular kind of disruptive factor, but also resulted in system behaviour that is quite disappointing from a cognitive viewpoint. The talk will present a parsing approach that tries to combine robustness against ill-formed input, metaphorical use and temporal pressure while using just a single processing mechanism.

The approach is based on the application of Constraint Satisfaction techniques, and parsing is understood as a procedure of structural disambiguation. The basic property of robustness against unexpected input is introduced through redundancy between parallel processing components (namely a syntactic and a semantic one). The two representational levels are coupled by graded constraints to overcome partial deficiencies by mutual compensation.

Robustness against temporal pressure is provided by the eliminative nature of constraint satisfaction. It allows the current state of disambiguation to be evaluated at every time point during analysis and the necessary means for further advance to be decided upon. Constraint reinforcement can then be used to speed up the procedure if desired. This leads to a particular kind of anytime parsing where quality (robustness against unexpected input) is exchanged for analysis time.


Last updated $Date: 1998/06/06 18:00:08 $ by stolcke@speech.sri.com