incremental ngram-count
From: Alexy Khrabrov <deliverable at ADDRESS HIDDEN>
Date: Fri, 2 Nov 2007 00:42:55 +0300
A separate task I run on a corpus is computing a "running ngram
count": at each "tick" in the size of a corpus subset (e.g. 10%,
20%, etc., or every N files, or every file), show the *increase* in
the number of ngrams.
Obviously, building sublists of files with a single file added and
rerunning ngram-count on each such list is inefficient. Do I have to
get into the C++ library for this, and if so, which classes should
I use? Basically, I want to know which *new* ngrams are contributed
by a given file, in processing order. I may also want to output
them separately for inspection.
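
To make the request concrete, here is a minimal sketch in plain Python of the running count I have in mind. It does not call SRILM at all, and it will not match ngram-count output exactly (for instance, no sentence-boundary tokens are added). The names `ngrams` and `new_ngrams_per_chunk` are made up for illustration, and each "chunk" is assumed to be one file already split into whitespace-tokenized sentences.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Ngram = Tuple[str, ...]


def ngrams(tokens: List[str], max_order: int) -> Iterator[Ngram]:
    """Yield every ngram of order 1..max_order found in one sentence."""
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def new_ngrams_per_chunk(
    chunks: Iterable[Iterable[List[str]]], max_order: int = 3
) -> Iterator[Set[Ngram]]:
    """For each chunk (e.g. one file), yield the ngram types not seen before.

    A single set of already-seen ngrams is kept across chunks, so every
    chunk is read exactly once instead of recounting the growing prefix.
    """
    seen: Set[Ngram] = set()
    for chunk in chunks:
        fresh: Set[Ngram] = set()
        for sentence in chunk:
            for gram in ngrams(sentence, max_order):
                if gram not in seen:
                    fresh.add(gram)
        seen |= fresh  # later chunks no longer count these as new
        yield fresh
```

The size of each yielded set is the *increase* in distinct ngrams at that tick, and the set itself can be dumped for inspection. Memory grows with the number of distinct ngrams, so this is only a sketch for modest corpora.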
Cheers,
Alexy