Speech Technology and Research (STAR) Laboratory Seminar Series
Past talks: 2008
-
Speaker: Kemal Oflazer, Sabanci University, Istanbul, Turkey
Time: Friday, May 9th, 2008, 11:00 am
Venue: STAR Lab, EJ 124
Title: Statistical Machine Translation into a Morphologically Complex Language
Abstract:
In this talk, we present some results of our on-going work on English
to Turkish statistical machine translation. Turkish is an
agglutinative language with very rich inflectional and derivational
morphology. Turkish is also a free constituent order language with
almost no formal ordering constraints at the sentence level. These
properties, and the fact that Turkish-English parallel corpora are a
scarce resource compared to those for other languages popular in SMT
research, bring about interesting issues for SMT involving Turkish.
After a discussion of
the highlights of relevant aspects of Turkish, we investigate
different representational granularities for sub-lexical
representation. We find that (i) representing both Turkish and
English at the morpheme-level but with some selective
morpheme-grouping on the Turkish side of the training data, (ii)
augmenting the training data with "sentences" comprising only the
"content words" of the original training data, and (iii) re-ranking
the n-best decoder outputs by combining translation model scores
with word-level language model scores,
provide a non-trivial improvement over a fully word-based baseline
model. Additional improvements are obtained by iterative model
training (which may very loosely be called "statistical
post-editing"), augmenting training data with phrase-pairs which are
high-probability translations of each other, and by "word-repair" --
automatically identifying and correcting morphologically malformed words.
Despite our relatively limited training data, we improve from 19.77 BLEU for the
baseline to 28.41 BLEU, a 42% relative improvement. We also
touch briefly on the suitability of BLEU for languages like Turkish
and present an overview of our BLEU+ tool, which considers root and
morphological proximity when comparing candidate sentence words to
reference sentence words and also provides various oracle BLEU scores.
Kemal Oflazer received his PhD in Computer Science from Carnegie Mellon University in 1987. He is currently a faculty member at Sabanci University, associated with the Computer Science program. He directs the Human Language and Speech Processing Laboratory. He is mainly interested in Natural Language Processing with specific applications to Turkish. Currently he is working on statistical machine translation (MT) between English and Turkish and developing NLP-based applications for language learning. He is especially well known for his work on applying finite-state methods to language processing and on error-tolerant finite-state recognition. Two recent studies of interest are an extension of BLEU, called BLEU+, for the evaluation of MT systems for morphologically rich languages, and the adaptation of the Turkish MT system to other Turkic languages, such as Uzbek or Turkmen. He has co-authored more than 100 international conference and peer-reviewed journal papers. Prof. Oflazer is on the editorial boards of Computational Linguistics, Machine Translation, and a number of other journals. He has served on the organizing committees of EACL'09, IJCNLP'08, EACL'06, ACL'05, ACL'04, EACL'03, and many others.
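
Illustration (not from the talk): step (iii) of the abstract above, re-ranking n-best decoder outputs by combining translation model scores with word-level language model scores, might be sketched roughly as follows. The function names, the interpolation weight, and the toy unigram table are all assumptions made for this example; the speaker's actual system and score combination are not described here.

```python
from typing import Callable, List, Tuple


def rerank_nbest(
    nbest: List[Tuple[str, float]],
    lm_logprob: Callable[[str], float],
    lm_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Re-rank (hypothesis, translation-model log-score) pairs.

    Combined score = tm_score + lm_weight * lm_logprob(hypothesis).
    The linear combination and the weight are assumptions of this sketch.
    """
    rescored = [(hyp, tm + lm_weight * lm_logprob(hyp)) for hyp, tm in nbest]
    # Highest combined score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical word-level "language model": a tiny unigram table with a
# floor log-probability for unseen (e.g. morphologically malformed) words.
# A real system would use an n-gram LM trained on Turkish text.
TOY_UNIGRAM_LOGPROB = {"evde": -1.0, "evlerde": -2.0}
UNSEEN_LOGPROB = -10.0


def toy_lm_logprob(sentence: str) -> float:
    """Sum unigram log-probabilities over whitespace-separated words."""
    return sum(TOY_UNIGRAM_LOGPROB.get(w, UNSEEN_LOGPROB) for w in sentence.split())


if __name__ == "__main__":
    # The decoder prefers the first hypothesis, but the word-level LM
    # penalizes it because the word is not in the toy vocabulary.
    nbest = [("evdeler", -3.0), ("evlerde", -3.5)]
    for hyp, score in rerank_nbest(nbest, toy_lm_logprob):
        print(f"{score:7.2f}  {hyp}")
```

The point of the word-level model in such a setup is that a morpheme-level decoder can emit morpheme sequences that do not form valid words; a word-level score gives the re-ranker a way to demote them.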
-
Speaker: Daniel Cer, University of Colorado at Boulder (visiting at Stanford University)
Time: Thursday, May 1st, 2008, 3:00 pm
Venue: STAR Lab, EJ 124
Title: Improvements in MERT
Abstract:
Minimum error rate training (MERT) is a widely used learning
procedure for statistical machine translation models. I will
contrast three search strategies for MERT: Powell's method,
the variant of coordinate descent found in the Moses MERT
utility, and a novel stochastic method that outperforms both
of these. I will also present a method for regularizing the
MERT objective that achieves statistically significant gains
when combined with both Powell's method and coordinate descent.
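
Illustration (not from the talk): the coordinate-descent style of search mentioned above can be sketched on toy n-best lists as follows. This is a deliberately simplified version: it tries a fixed grid of values per weight instead of the exact line search normally used in MERT, and it assumes the error decomposes into per-sentence counts, which is not true of BLEU. It is not the Moses implementation, nor the stochastic or regularized methods from the abstract; all names are hypothetical.

```python
from typing import List, Sequence, Tuple

# One n-best entry: (feature vector, error count of that hypothesis).
Hypothesis = Tuple[Sequence[float], float]


def corpus_error(weights: Sequence[float], nbest_lists: List[List[Hypothesis]]) -> float:
    """Total error of the top-scoring hypothesis in each n-best list."""
    total = 0.0
    for nbest in nbest_lists:
        best = max(nbest, key=lambda h: sum(w * f for w, f in zip(weights, h[0])))
        total += best[1]
    return total


def coordinate_descent(
    nbest_lists: List[List[Hypothesis]],
    weights: Sequence[float],
    grid: Sequence[float],
    max_rounds: int = 10,
) -> Tuple[List[float], float]:
    """Optimize one weight at a time, keeping any value that lowers the error."""
    weights = list(weights)
    best_err = corpus_error(weights, nbest_lists)
    for _ in range(max_rounds):
        improved = False
        for dim in range(len(weights)):
            for value in grid:
                trial = weights[:dim] + [value] + weights[dim + 1:]
                err = corpus_error(trial, nbest_lists)
                if err < best_err:
                    best_err, weights, improved = err, trial, True
        if not improved:
            break  # no single-coordinate change helps any more
    return weights, best_err


if __name__ == "__main__":
    # Two toy sentences, two features. In each list the first hypothesis
    # has one error and the second has none.
    toy_nbest = [
        [((1.0, 0.0), 1.0), ((0.0, 1.0), 0.0)],
        [((2.0, 0.0), 1.0), ((0.0, 1.0), 0.0)],
    ]
    start = [1.0, 0.0]
    grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
    print("initial error:", corpus_error(start, toy_nbest))
    print("tuned (weights, error):", coordinate_descent(toy_nbest, start, grid))
```

Because the error is piecewise constant in the weights, gradient-based optimizers do not apply directly, which is why the search strategy itself becomes the object of comparison.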
-
Speaker: John DeNero, UC Berkeley
Time: Thursday, April 17th, 2008, 11:00 am
Venue: STAR Lab, EJ 124
Title: Inference in Phrase Alignment Models
Abstract:
Models that align phrases instead of words offer an
appealing alternative to the standard relative frequency estimates of
phrase translation probabilities. But, while some effective word
alignment models (Model 1, Model 2 & HMM) can be estimated tractably
with EM, phrase alignment models cannot. I'll talk about how to show
that estimation and inference under these models are intractable.
Then, I'll present two useful approximation techniques.
First, I'll talk about how to cast phrase alignment search as an
integer linear programming (ILP) problem and find the optimal
alignment reliably and quickly with off-the-shelf ILP software. Some
applications of this technique include training phrase alignment
models and interpreting the output of word alignment models.
Second, we'll look at how to estimate translation probabilities under
a phrase alignment model using a Gibbs sampling procedure. The
sampler has some nice asymptotic convergence properties and also seems
to produce good results in practice. I'll walk through the different
models we've trained and how they performed.