GT coeffs in -make-big-lm

From: ilya oparin <ioparin at ADDRESS HIDDEN>
Date: Thu, 11 May 2006 16:12:33 +0100 (BST)

Hi!

While training a very large model (corpus size approx. 600 million tokens), I ran into behaviour that looks a bit odd. Since the LM is going to be huge, I'm using the make-big-lm script to build 4 partial LMs in a distributed way and then merge them into the final one.
After I launched the 4 make-big-lm tasks, the GT coefficients for the first one were written to the home directory (it took me a while to realize that something might be wrong, since this output is not mentioned in the manual), and the other running tasks simply reused those files, assuming the GT pre-computation had been done in advance. This should not seriously damage a large model, but it's good to be as precise as possible. So I had to delete the GT files manually after each subsequent make-big-lm run (running them one after another rather than simultaneously), assuming that the n-gram merge would renormalize the probabilities correctly. Is this correct, or should I rather compute the GT coefficients from the whole .ngram file, save them in the home directory, and use them for each partial make-big-lm run?
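
To illustrate why reusing one partition's GT coefficients for the others is not exact, here is a minimal sketch of the textbook Katz-style Good-Turing discount computation from count-of-counts. It is not taken from the SRILM scripts, which may differ in details; the function name, the `gtmax` parameter, and the count-of-counts values for the two partitions are all hypothetical and chosen only for illustration.

```python
def katz_gt_discounts(count_of_counts, gtmax):
    """Return {r: d_r} for 1 <= r <= gtmax (textbook Katz/Good-Turing formula).

    count_of_counts maps r -> n_r, the number of distinct n-grams seen
    exactly r times. Entries for r = 1 .. gtmax + 1 must be present.
    """
    n1 = count_of_counts[1]
    # Common term (k + 1) * n_{k+1} / n_1, with k = gtmax
    a = (gtmax + 1) * count_of_counts[gtmax + 1] / n1
    discounts = {}
    for r in range(1, gtmax + 1):
        # Good-Turing adjusted count r* = (r + 1) * n_{r+1} / n_r
        r_star = (r + 1) * count_of_counts[r + 1] / count_of_counts[r]
        discounts[r] = (r_star / r - a) / (1 - a)
    return discounts


# Hypothetical count-of-counts for two partitions of a corpus (made-up numbers)
part_a = {1: 1000, 2: 400, 3: 240}
part_b = {1: 1200, 2: 420, 3: 250}

d_a = katz_gt_discounts(part_a, gtmax=2)
d_b = katz_gt_discounts(part_b, gtmax=2)

print("partition A discounts:", d_a)
print("partition B discounts:", d_b)
```

Because the discounts depend only on the count-of-counts of the data they were estimated from, coefficients computed on one partition generally differ from those of another partition or of the whole corpus.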

best regards,
Ilya



Last modified Dec 02, 2008