ngram-class - induce word classes from N-gram statistics
ngram-class [ -help ] option ...
ngram-class induces word classes from distributional statistics,
so as to minimize perplexity of a class-based N-gram model
given the provided word N-gram counts.
Presently, only bigram statistics are used, i.e., the induced classes
are best suited for a class-bigram language model.
The program generates the class N-gram counts and class expansions
needed, respectively, to train and to apply the class N-gram model.
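The class-based factorization being optimized can be illustrated with a toy
Python sketch (the class map and counts below are invented for illustration
and are not SRILM internals):

```python
def class_bigram_prob(w1, w2, cls, class_bigrams, class_counts, member_counts):
    """P(w2 | w1) is approximated as P(c(w2) | c(w1)) * P(w2 | c(w2))."""
    c1, c2 = cls[w1], cls[w2]
    p_class = class_bigrams[(c1, c2)] / class_counts[c1]   # class transition
    p_member = member_counts[w2] / class_counts[c2]        # class membership
    return p_class * p_member

# Toy example: "the" is its own class; "cat" and "dog" share a class.
cls = {"the": "DET", "cat": "ANIMAL", "dog": "ANIMAL"}
class_bigrams = {("DET", "ANIMAL"): 4}
class_counts = {"DET": 4, "ANIMAL": 4}
member_counts = {"the": 4, "cat": 1, "dog": 3}
p = class_bigram_prob("the", "cat", cls, class_bigrams, class_counts, member_counts)
# p == (4/4) * (1/4) == 0.25
```

Replacing word-bigram parameters with the two class-level factors is what
makes the model much smaller than a word bigram model, at some cost in
modeling precision.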
Each filename argument can be an ASCII file, a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
- -help
Print option summary.
- -version
Print version information.
- -debug level
Set debugging output at level.
Level 0 means no debugging.
Debugging messages are written to stderr.
A useful level to trace the formation of classes is 2.
- -vocab file
Read a vocabulary from file.
Subsequently, out-of-vocabulary words in both counts and text are
replaced with the unknown-word token.
If this option is not specified, all words found are implicitly added
to the vocabulary.
- -tolower
Map the vocabulary to lowercase.
- -counts file
Read N-gram counts from a file.
Each line contains an N-gram of
words, followed by an integer count, all separated by whitespace.
Repeated counts for the same N-gram are added.
Counts collected by the -counts and -text options are additive as well.
Note that the input should contain consistent lower- and higher-order
counts (i.e., unigrams and bigrams), as would be generated by
ngram-count(1).
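The count file format described above can be parsed in a few lines of
Python (a sketch assuming plain uncompressed input; `read_counts` is an
invented helper, not part of SRILM):

```python
def read_counts(lines):
    """Parse count lines: N-gram words followed by an integer count,
    all whitespace-separated.  Repeated N-grams are summed."""
    counts = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        ngram, count = tuple(fields[:-1]), int(fields[-1])
        counts[ngram] = counts.get(ngram, 0) + count
    return counts

counts = read_counts(["the cat 3", "the cat 2", "cat 5"])
# counts == {("the", "cat"): 5, ("cat",): 5}
```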
- -text textfile
Generate N-gram counts from textfile.
textfile should contain one sentence unit per line.
Begin/end sentence tokens are added if not already present.
Empty lines are ignored.
- -numclasses C
The target number of classes to induce.
A zero argument suppresses automatic class merging altogether
(e.g., for use with -interact).
- -full
Perform full greedy merging over all classes, starting with one class per
word.
This is the O(V^3) algorithm described in Brown et al. (1992).
- -incremental
Perform incremental greedy merging, starting with
one class each for the C
most frequent words, and then adding one word at a time.
This is the O(V*C^2) algorithm described in Brown et al. (1992);
it is the default.
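Both merging variants greedily pick the class pair whose merge least hurts
the model, which for the class-bigram case amounts to maximizing average
mutual information between adjacent classes (Brown et al. 1992). The toy
Python sketch below re-scores every candidate pair from scratch; the real
program uses incremental bookkeeping to reach the complexities quoted above,
and all names here are illustrative:

```python
import math
from collections import defaultdict

def avg_mutual_info(bigrams, total):
    """Average mutual information of adjacent classes:
    sum over (c1,c2) of p(c1,c2) * log( p(c1,c2) / (p(c1) p(c2)) )."""
    left, right = defaultdict(int), defaultdict(int)
    for (c1, c2), n in bigrams.items():
        left[c1] += n
        right[c2] += n
    return sum(n / total * math.log(n * total / (left[c1] * right[c2]))
               for (c1, c2), n in bigrams.items())

def best_merge(classes, bigrams, total):
    """Try every pair of classes, merge b into a, and keep the pair
    whose merge retains the highest AMI (least increases perplexity)."""
    best = None
    for i, a in enumerate(classes):
        for b in classes[i + 1:]:
            merged = defaultdict(int)
            for (c1, c2), n in bigrams.items():
                merged[(a if c1 == b else c1, a if c2 == b else c2)] += n
            score = avg_mutual_info(merged, total)
            if best is None or score > best[0]:
                best = (score, a, b)
    return best

# "A" and "B" behave identically before and after "C",
# so merging them costs no mutual information.
bigrams = {("A", "C"): 2, ("B", "C"): 2, ("C", "A"): 2, ("C", "B"): 2}
score, a, b = best_merge(["A", "B", "C"], bigrams, total=8)
# (a, b) == ("A", "B"); score == log 2, unchanged from the unmerged AMI
```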
- -interact
Enter a primitive interactive interface when done with automatic class
induction, allowing manual specification of additional merging steps.
- -noclass-vocab file
Read a list of vocabulary items from file
that are to be excluded from classes.
These words or tags do not undergo class merging, but their
N-gram counts still affect the optimization of model perplexity.
The default is to exclude the sentence begin/end tags (<s> and </s>)
from class merging; this can be suppressed by specifying -noclass-vocab /dev/null.
- -class-counts file
Write class N-gram counts to file.
The format is the same as for word N-gram counts, and can be used with
ngram-count(1) to estimate a class-N-gram model.
- -classes file
Write class definitions (member words and their probabilities) to file.
The output format is the same as required by the ngram(1) -classes option.
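For illustration, a class definitions file lists one expansion per line:
the class name, an optional expansion probability, and the word string it
expands to (an invented fragment; see classes-format(5) for the
authoritative description):

```
ANIMAL 0.25 cat
ANIMAL 0.75 dog
DET 1.0 the
```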
- -save S
Save the class counts and/or class definitions every S
iterations during induction.
The filenames are obtained from the -class-counts and -classes
options, respectively, by appending the iteration number.
This is convenient for producing sets of classes at different granularities
during the same run.
A value of S = 0 (the default) suppresses the saving actions.
- -save-maxclasses K
Modifies the action of -save
so as to only start saving once the number of classes reaches K.
(The iteration numbers embedded in filenames will start at 0 from that point.)
P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer,
``Class-Based n-gram Models of Natural Language,''
Computational Linguistics 18(4), 467-479, 1992.
Classes are optimized only for bigram models at present.
Andreas Stolcke <firstname.lastname@example.org>.
Copyright 1999-2007 SRI International