Re: [SRILM] Some more FLM questions

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Mon, 23 Oct 2006 15:46:53 -0700

ilya oparin wrote:
>
> 2) Could you please specify how you work with large
> data?
> When I was training the model on 5M data, it was
> taking 1.2G of memory. Actually, I work with
> inflectional languages (Russian and Czech) so the
> factors are really "rich": features for each word
> include its stem, inflection, detailed morphological
> tag and lemma. Maybe that's why it takes so much
> space? Otherwise I cannot see how you managed to run
> it for 30G words in English: in my case, if I want to
> enlarge the data it seems like I'll have to switch to
> a 64-bit architecture. Do SRILM and FLM support 64-bit
> somehow?
> If it's only me that lucky with memory loads, what
> could you suggest to reduce it?
>  
Yes, SRILM supports 64-bit Linux (and other) platforms.  For Linux running
on AMD64-compatible machines, use
  
    make MACHINE_TYPE=i686-m64

To reduce memory consumption, use the strategies described in doc/FAQ.
I'm copying the relevant bits here; many of them apply to FLMs as well.

> Topic: Large data / too little memory issues
>
> 1) I'm getting a message saying (among other things)
>
>         Assertion `body != 0' failed.
>
> A: You are running out of memory.  See subsequent questions depending on
>    what you are trying to do.  Note: the above message means you are running
>    out of "virtual" memory on your computer, which could be because of
>    limits in swap space, administrative resource limits, or limitations of
>    the machine architecture (a 32-bit machine cannot address more than
>    4GB no matter how many resources your system has).
>    Another symptom of not enough memory is that your program runs, but
>    very, very slowly, i.e., it is "paging" or "swapping" as it tries to
>    use more memory than the machine has RAM installed.
>
> 2) I am trying to count N-grams in a text file and running out of memory.
>
> A: Don't use ngram-count directly to count N-grams.  Instead, use the
>    make-batch-counts and merge-batch-counts scripts described in
>    training-scripts(1).  That way you can create N-gram counts limited only
>    by the maximum file size on your system.
>
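
To illustrate why batching helps, here is a minimal Python sketch of the
count-in-batches-then-merge idea.  It is not how make-batch-counts and
merge-batch-counts are implemented; the function names, the tab-separated
file format, and the handling of <s> and </s> are assumptions made for the
example.  Each batch is counted in memory, written out sorted, and the
sorted files are then merged as streams, so peak memory depends on the
batch size rather than on the size of the corpus.

```python
import heapq
import os
import tempfile
from collections import Counter
from itertools import groupby


def count_ngrams(sentences, order):
    """Count all n-grams up to `order` in a small batch of sentences."""
    counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


def write_sorted_counts(counts, path):
    """Write one 'ngram<TAB>count' line per n-gram, sorted by n-gram."""
    with open(path, "w", encoding="utf-8") as out:
        for ngram in sorted(counts):
            out.write(f"{ngram}\t{counts[ngram]}\n")


def read_counts(path):
    """Lazily yield (ngram, count) pairs from a sorted count file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").split("\t")
            yield ngram, int(count)


def merge_sorted_counts(paths):
    """Stream-merge sorted count files, summing counts of equal n-grams."""
    merged = heapq.merge(*(read_counts(p) for p in paths))
    for ngram, group in groupby(merged, key=lambda pair: pair[0]):
        yield ngram, sum(count for _, count in group)


def batch_count(sentences, order, batch_size, workdir):
    """Count in batches of `batch_size` sentences, then merge on disk."""
    paths = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        path = os.path.join(workdir, f"batch{len(paths)}.counts")
        write_sorted_counts(count_ngrams(batch, order), path)
        paths.append(path)
    return merge_sorted_counts(paths)
```

The merged result is itself a sorted stream, so it can be written straight
to disk without ever holding the full count table in memory.
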
> 3) I am trying to build an N-gram LM and ngram-count runs out of memory.
>
> A: You are running out of memory either because of the size of the N-gram
>    counts or because of the size of the LM being built.  The following are
>    strategies for reducing the memory requirements for training LMs.
>
>      a) Assuming you are using Good-Turing or Kneser-Ney discounting, don't
>         use ngram-count in "raw" form.  Instead, use the make-big-lm wrapper
>         script described in the training-scripts(1) man page.
>
>      b) Switch to using the "_c" or "_s" versions of the SRI binaries.  For
>         instructions on how to build them, see the INSTALL file.
>         Once built, set your executable search path accordingly, and try
>         make-big-lm again.
>
>      c) Raise the minimum counts for N-grams included in the LM, i.e., the
>         values of the options -gt2min, -gt3min, -gt4min, etc.  The higher
>         order N-grams typically get higher minimum counts.
>
>      d) Get a machine with more memory.  If you are hitting the limitations
>         of a 32-bit machine architecture, get a 64-bit machine and recompile
>         SRILM to take advantage of the expanded address space.  (The
>         "i686-m64" MACHINE_TYPE setting is for systems based on 64-bit AMD
>         processors.)  Note that the 64-bit pointers carry a memory overhead
>         of their own, so you will need a machine with significantly, not
>         just a little, more memory than 4GB.
>
> 4) I am trying to apply a large LM to some data and am running out of memory.
>
> A: Again, there are several strategies to reduce memory requirements.
>
>      a) Use the "_c" or "_s" versions of the SRI binaries.  See 3b) above.
>
>      b) Precompute the vocabulary of your test data and use the
>         ngram -limit-vocab option to load only the N-gram parameters relevant
>         to your data.  This approach should allow you to use arbitrarily
>         large LMs provided the data is divided into small enough chunks.
>
>      c) If the LM can be built on a large machine, but then is to be used on
>         machines with limited memory, use ngram -prune to remove the less
>         important parameters of the model.  This usually gives huge size
>         reductions with relatively modest performance degradation.  The
>         tradeoff is adjustable by varying the pruning parameter.
>
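
The idea behind 4b) can be sketched in a few lines of Python.  This is only
an illustration of vocabulary-limited loading, not a description of what
ngram -limit-vocab does internally; the function names and the toy
dictionary representation of the LM are assumptions made for the example.
The vocabulary of the test chunk is collected first, and only those N-grams
whose words all occur in it are kept.

```python
def collect_vocab(sentences):
    """Collect the words in a chunk of test data, plus the special tokens."""
    vocab = {"<s>", "</s>", "<unk>"}
    for sentence in sentences:
        vocab.update(sentence.split())
    return vocab


def limit_to_vocab(ngram_logprobs, vocab):
    """Keep only the n-grams (tuples of words) made up entirely of vocab words."""
    return {
        ngram: logprob
        for ngram, logprob in ngram_logprobs.items()
        if all(word in vocab for word in ngram)
    }


# Toy LM: n-gram tuples mapped to made-up log probabilities.
toy_lm = {
    ("the",): -1.0,
    ("cat",): -2.0,
    ("dog",): -2.1,
    ("the", "cat"): -0.5,
    ("the", "dog"): -0.6,
}
small_lm = limit_to_vocab(toy_lm, collect_vocab(["the cat"]))
```

The smaller the chunk of test data, the smaller its vocabulary and the
fewer parameters survive the filter, which is why dividing the data into
chunks keeps memory use down.
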

Andreas

©2006 SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025-3493
SRI International is an independent, nonprofit corporation. Privacy policy

Last modified Dec 02, 2008