[SRILM]: -debug 2 info

From: ilya oparin <ioparin at ADDRESS HIDDEN>
Date: Wed, 31 May 2006 11:41:51 +0100 (BST)

Hi!

When I calculate the perplexity of my POS-based class model (a word can belong to many classes; I create the class-definition file myself from POS-tagged data) with "-debug 2", I get output that I cannot fully understand. For testing purposes I measure perplexity on the same data I trained the class model on (i.e. there should not be any OOVs). However, in the debug output, for every N-gram there is a string of the format
P(w| w...) = [OOV][n-gram][n-gram]...[OOV][n-gram][n-gram]...
As far as I understand, the [n-gram]s refer to the different combinations of assigning words to classes. But why do those [OOV]s appear (and they appear at equal intervals between the strings of [n-gram]s for each word)?

I have only one guess: since the [OOV]s are missing only for the last (</s>| ...) n-gram, they may correspond to a check of whether a word is present in some implicit stop-word vocabulary or something similar...

It would be great if anybody could comment on that.

best regards,
Ilya
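
For readers unfamiliar with class models in which a word can belong to several classes, the following is a minimal sketch of why many class assignments are scored for one sentence. It is not SRILM code and does not explain the [OOV] entries. The class names, words, and probabilities are hypothetical, and a class bigram stands in for whatever n-gram order the real model uses. The sentence probability is the sum over every way of assigning words to classes, which a forward recursion computes without listing them all.

```python
from itertools import product

# Hypothetical toy model: every name and number here is made up for illustration.
MEMBERSHIP = {  # P(word | class)
    "N": {"run": 0.3, "dog": 0.7},
    "V": {"run": 0.6, "barks": 0.4},
}
TRANSITION = {  # P(class | previous class); "<s>" marks the sentence start
    "<s>": {"N": 0.8, "V": 0.2},
    "N": {"N": 0.1, "V": 0.9},
    "V": {"N": 0.7, "V": 0.3},
}


def classes_of(word):
    """Return every class that lists this word as a member."""
    return [c for c, words in MEMBERSHIP.items() if word in words]


def brute_force_prob(words):
    """Sum the joint probability over all word-to-class assignments."""
    total = 0.0
    for assignment in product(*(classes_of(w) for w in words)):
        p, prev = 1.0, "<s>"
        for w, c in zip(words, assignment):
            p *= TRANSITION[prev][c] * MEMBERSHIP[c][w]
            prev = c
        total += p
    return total


def forward_prob(words):
    """Same quantity via a forward recursion over the previous class."""
    alpha = {"<s>": 1.0}
    for w in words:
        alpha = {
            c: sum(a * TRANSITION[prev][c] for prev, a in alpha.items())
            * MEMBERSHIP[c][w]
            for c in classes_of(w)
        }
    return sum(alpha.values())


if __name__ == "__main__":
    sentence = ["dog", "run", "barks"]
    print(brute_force_prob(sentence), forward_prob(sentence))
```

In this toy example "run" is ambiguous between N and V, so two class sequences contribute to the total. Both functions return the same value.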


Last modified Dec 02, 2008