SRILM Manual Pages
Papers and Tutorials
Novice users should consult the
following papers and tutorials first, where applicable.
- A. Stolcke, SRILM - An Extensible Language Modeling Toolkit, in Proc. Intl. Conf. Spoken Language Processing, Denver, Colorado, September 2002.
  Gives an overview of SRILM design and functionality.
- D. Jurafsky, Language Modeling, Lecture 11 of his course on "Speech Recognition and Synthesis" at Stanford.
  Excellent introduction to the basic concepts in LM.
- J. Goodman, The State of The Art in Language Modeling, presented at the 6th Conference of the Association for Machine Translation in the Americas (AMTA), Tiburon, CA, October 2002.
  Tutorial presentation and overview of current LM techniques (with emphasis on machine translation).
- K. Kirchhoff, J. Bilmes, and K. Duh, Factored Language Models Tutorial, Tech. Report UWEETR-2007-0003, Dept. of EE, U. Washington, June 2007.
  This report serves as both a tutorial and reference manual on FLMs.
- S. F. Chen and J. Goodman, An Empirical Study of Smoothing Techniques for Language Modeling, Tech. Report TR-10-98, Computer Science Group, Harvard U., Cambridge, MA, August 1998 (original postscript document).
  Excellent overview and comparative study of smoothing methods. Served as a reference for many of the methods implemented in SRILM.
FAQ
Answers to frequently asked questions and notes on N-gram smoothing implementations.
Programs
These are the top-level executables that are currently part of SRILM:
- ngram-count(1): count N-grams and estimate language models
- ngram-merge(1): merge N-gram counts
- ngram(1): apply N-gram language models
- ngram-class(1): induce word classes from N-gram statistics
- disambig(1): disambiguate text tokens using an N-gram model
- hidden-ngram(1): tag hidden events between words
- nbest-lattice(1): rescore N-best lists and lattices
- nbest-optimize(1): optimize score combination for N-best word error minimization
- nbest-mix(1): interpolate N-best posterior probabilities
- segment(1): segment text using N-gram language model
- segment-nbest(1): rescore and segment N-best lists using N-gram language models
- anti-ngram(1): count posterior-weighted N-grams in N-best lists
- multi-ngram(1): build multiword N-gram models
- lattice-tool(1): manipulate word lattices
- nbest-pron-score(1): score pronunciations and pauses in N-best hypotheses
Utility Scripts
Additional tools implemented as scripts:
- training-scripts(1): miscellaneous conveniences for language model training
- lm-scripts(1): manipulate N-gram language models
- ppl-scripts(1): manipulate perplexities
- pfsg-scripts(1): create and manipulate finite-state networks
- nbest-scripts(1): rescore and evaluate N-best lists
- select-vocab(1): select a maximum-likelihood vocabulary from a mixture of corpora
File Formats
Some of the data formats used by SRILM:
- ngram-format(5): ARPA backoff N-gram models
- classes-format(5): Word class definitions
- pfsg-format(5): Decipher(TM) probabilistic finite-state grammars
- nbest-format(5): N-best hypotheses lists
LM Library Classes
These are some of the basic classes of the SRILM library.
This list is incomplete; most of this part of the documentation has yet to be written.
- LM(3): Generic language model
- Vocab(3): Vocabulary indexing for SRILM
- Prob(3): Probabilities for SRILM
- File(3): Wrapper for stdio streams