SRI Speech Technology and Research Laboratory

Seminar Series

2005 Seminars


Speaker: Ruiqiang Zhang
Time: Thu 27 Oct 2005, 11:00am
Venue: EJ 124
Title: Speech translation: From 1-best to N-best to word lattice translation

Abstract:

An important issue in speech translation is minimizing the negative impact of speech recognition errors on machine translation. In our work we used two approaches to improve on single-best translation, a baseline system that uses only the single-best recognition output. The first is an N-best translation approach, which translates the top N recognition hypotheses separately and re-ranks all the translations with a log-linear model that integrates features from both ASR and SMT. The second is a word lattice translation approach, which translates the speech recognition word lattice directly, so that the multiple hypotheses in the lattice can be used to bypass a misrecognized single-best hypothesis. The decoding process converts the recognition word lattice to a translation word graph by a beam search, followed by a fine rescoring with an A* search. We also found that a speech recognition confidence measure based on posterior probability is effective in improving speech translation. The proposed techniques were tested on J/E and C/E speech translation tasks, in which we measured the translation results with a number of automatic evaluation metrics, including BLEU, NIST, and WER. The experimental results demonstrated a consistent and significant improvement in speech translation.
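
The N-best re-ranking step can be illustrated with a minimal sketch. It assumes each candidate translation carries log-domain feature values from the recognizer and the translation system, and that the re-ranker picks the candidate with the highest weighted sum. The feature names, weights, and scores below are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    """One translation of one recognition hypothesis (hypothetical structure)."""
    translation: str
    features: Dict[str, float]  # log-domain feature values from ASR and SMT


def loglinear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    # Log-linear model: the score is a weighted sum of log-domain features.
    return sum(weights[name] * value for name, value in features.items())


def rerank(candidates: List[Candidate], weights: Dict[str, float]) -> List[Candidate]:
    # Sort all translations of all N-best hypotheses, best score first.
    return sorted(
        candidates,
        key=lambda c: loglinear_score(c.features, weights),
        reverse=True,
    )


# Illustrative weights and feature values only (assumptions, not real data).
weights = {"asr_acoustic": 0.3, "asr_lm": 0.2, "tm": 1.0, "target_lm": 0.5}
candidates = [
    Candidate("i would like a loom",
              {"asr_acoustic": -11.0, "asr_lm": -9.5, "tm": -5.5, "target_lm": -9.0}),
    Candidate("i would like a room",
              {"asr_acoustic": -12.0, "asr_lm": -8.0, "tm": -4.0, "target_lm": -6.0}),
]
best = rerank(candidates, weights)[0]
print(best.translation)
```

In this toy example the hypothesis with the better acoustic score is not the one selected, because the translation and language model features outweigh it. In practice the weights would be tuned on held-out data against an evaluation metric.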

About the speaker:

Ruiqiang Zhang obtained his BSc and MSc in 1989 and 1994, respectively, from BUAA, and his PhD in 1998 from Tsinghua University, China. Dr. Zhang was a postdoctoral researcher at ATR (Advanced Telecommunications Research Institute), Japan, from 1998 to 2001. In 2001, he joined a start-up company called Verbaltek in the Bay Area of California. He is now a researcher at ATR. His research interests include language modelling, part-of-speech tagging, Chinese word segmentation, statistical machine translation and speech translation. He has published many papers in those areas.

Speaker: Hermann Ney
Time: Wed 5 Oct 2005, 11:15am
Venue: EJ 124
Title: Statistical Machine Translation at RWTH Aachen

Abstract:

During the last 10 years, the statistical approach has found widespread use in machine translation for both written and spoken language and has had a major impact on translation accuracy. This talk will give an overview of our work on statistical machine translation. We will discuss three methods that were found especially useful: phrase-based translation, log-linear modelling, and efficient search algorithms. Most of our recent work has been done in the framework of the EU-funded project TC-Star, whose goal is the translation of speeches given in the European Parliament.

Speaker: Peng Xu
Time: Wed 17 Aug 2005, 11:00am
Venue: EJ 124
Title: Random Forest based smoothing of LMs

Abstract:

Language modeling is the problem of predicting words based on histories containing words already seen. Two key aspects of language modeling are effective history equivalence classification and robust probability estimation. The data sparseness problem associated with language modeling arises from these two aspects. Although work has been done on each aspect separately, few solutions address both at the same time.

We explore the use of Random Forests (RFs) in language modeling to deal with the two key aspects jointly. The goal of this work is to develop a new language model smoothing technique based on randomly grown Decision Trees (DTs) and to apply the resulting RF language models to automatic speech recognition. This new technique is complementary to many of the existing techniques for dealing with the data sparseness problem.

After presenting our approach to efficient DT construction, we study our RF approach in the context of n-gram type language modeling, in which n-1 words are present in a history. Unlike regular n-gram language models, RF language models have the potential to generalize well to unseen data, even when histories have more than four words. We show that our RF language models are superior to the best known smoothing technique, interpolated Kneser-Ney smoothing, in reducing both the perplexity (PPL) and word error rate (WER) in large vocabulary speech recognition systems. In particular, we will show statistically significant improvements in a contemporary conversational telephony speech recognition system by applying the RF approach to only one of its many language models.
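
As a rough illustration of the idea, the sketch below treats each decision tree as a function that maps a history to an equivalence class, estimates word probabilities within each class, and averages the estimates over all trees before computing perplexity. Several parts are assumptions made for brevity and are not from the abstract: the hand-written `classifiers` stand in for randomly grown trees, add-alpha smoothing stands in for the per-tree estimator, and simple averaging is assumed as the combination rule.

```python
import math
from collections import Counter, defaultdict


def train_tree(corpus, classify):
    """Count next words per history class; `classify` stands in for a grown tree."""
    counts = defaultdict(Counter)
    for history, word in corpus:
        counts[classify(history)][word] += 1
    return counts


def tree_prob(counts, classify, history, word, vocab_size, alpha=1.0):
    """Add-alpha estimate of P(word | class of history) for a single tree."""
    class_counts = counts.get(classify(history), Counter())
    total = sum(class_counts.values())
    return (class_counts[word] + alpha) / (total + alpha * vocab_size)


def forest_prob(trees, history, word, vocab_size):
    """Average the per-tree estimates (assumed combination rule)."""
    probs = [tree_prob(cnt, cls, history, word, vocab_size) for cnt, cls in trees]
    return sum(probs) / len(probs)


def perplexity(trees, data, vocab_size):
    """PPL = exp(-(1/N) * sum of log P(word | history)) over the data."""
    log_sum = sum(math.log(forest_prob(trees, h, w, vocab_size)) for h, w in data)
    return math.exp(-log_sum / len(data))


# Toy trigram data: (two-word history, next word). Purely illustrative.
corpus = [
    (("the", "cat"), "sat"),
    (("a", "cat"), "sat"),
    (("the", "dog"), "ran"),
    (("a", "dog"), "ran"),
]
vocab = {"the", "a", "cat", "dog", "sat", "ran"}

# Three hypothetical "trees", each grouping histories in a different way.
classifiers = [lambda h: h[-1], lambda h: h[0], lambda h: h]
trees = [(train_tree(corpus, c), c) for c in classifiers]

p_sat = forest_prob(trees, ("the", "cat"), "sat", len(vocab))
p_ran = forest_prob(trees, ("the", "cat"), "ran", len(vocab))
ppl = perplexity(trees, corpus, len(vocab))
print(p_sat, p_ran, ppl)
```

Because each tree groups histories differently, a history that is rare under one grouping can still borrow statistics from another, which is the intuition behind combining history classification and probability estimation in one model.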

The new technique developed in this work is general. We will show that it works well when combined with other techniques, including word clustering and the structured language model (SLM).

About the speaker:

Peng Xu received his two BS degrees (in Engineering Mechanics and Electronics & Computer Technology) from Tsinghua University in 1995, and his MS degree (in Pattern Recognition & Artificial Intelligence) from the Institute of Automation, Chinese Academy of Sciences in 1998. After spending one year at Brown University, he transferred to the Johns Hopkins University as a Ph.D. candidate in the Dept. of Electrical and Computer Engineering, and started his language modeling work in the Center for Language and Speech Processing (CLSP) under the supervision of Prof. Frederick Jelinek. He joined Google as a Research Scientist shortly after receiving his Ph.D. in April 2005. While his research is focused on statistical language modeling, he is also interested in statistical machine learning, information retrieval, and statistical machine translation.

CANCELED
Speaker: Karen Livescu, CSAIL, MIT
Time: Canceled (originally Tue 21 Jun 2005, 11:00am)
Venue: EJ 124
Title: Feature-based Lexical Modeling for Speech Recognition: Theory and Applications

Abstract:

Spontaneous, conversational speech often contains word pronunciations that differ grossly from dictionary baseforms. This has been cited as a factor in the poor performance of automatic recognizers on this type of speech. Phone-based pronunciation models usually account for this variability by expanding dictionary pronunciations with phonetic substitution, insertion, and deletion rules whose probabilities can be learned from data. Such models have the drawbacks that (1) many pronunciation variations typically remain unaccounted for, and (2) word confusability is increased due to the high granularity of phone units. As an alternative, many types of variation can be explained by representing speech as multiple streams of linguistic features rather than a single stream of phones. By allowing for asynchrony between features and per-feature substitutions, many pronunciation changes that are difficult to account for with phone-based models become quite natural. Although it is well known that many phenomena can be attributed to this "semi-independent evolution" of features, previous models of lexical structure have typically not taken advantage of this.

In this talk, I will present a class of feature-based lexical models implemented using dynamic Bayesian networks (DBNs). The DBN approach allows us to represent the factorization of the state space into factors corresponding to different features, thereby reducing the number of parameters that must be learned. I will discuss the ways in which such models can be incorporated into a speech recognizer and present experiments done thus far, including tests of a lexical model in isolation, its use in a landmark-based speech recognition system developed at the 2004 Johns Hopkins Summer Workshop, and its application to visual and audio-visual speech recognition.

About the speaker:

Karen Livescu is in the process of finishing her PhD at MIT's CS and AI Lab (CSAIL). This talk pertains to joint work with her advisor Dr. Jim Glass, Jeff Bilmes, Kate Saenko, Trevor Darrell, Mark Hasegawa-Johnson, and the JHU Workshop '04 Landmark-based ASR team.

Last updated $Date: 2005/10/17 18:49:36 $ by anand@speech.sri.com