Re: query regarding usage of SRILM toolkit

From: Andreas Stolcke <stolcke at ADDRESS HIDDEN>
Date: Fri, 29 Sep 2006 09:04:06 PDT

In message <Pine.LNX.4.60.0609291425390.5866 at ADDRESS HIDDEN> you wrote:
>
> Greetings!!!
>
> We are developing a syllable based isolated style continuous speech
> recognizer for Indian languages. Currently, our recognizer output is just
> a sequence of syllables. We want to extract the sequence of words from
> this syllable sequence using statistical language models and lexicon. I
> thought may be one of the programs in this toolkit must be doing something
> similar (sub-word sequence to word sequence conversion). But all the
> programs seems to use word lattices.
>
> Is there any program in this toolkit that extracts the word sequence from
> the sub-word sequence using LM and lexicon.

Lakshmi,

First, remember that when the documentation of a program says 'words',
it doesn't mean you have to use words in the conventional sense.
You can use any kind of token (phones, syllables, etc.) in your lattices,
etc.

The task you describe sounds like a boundary tagging problem, i.e., given
a sequence of tokens, you want to label each transition between tokens as
either a "boundary" or a "non-boundary".  There are two tools in SRILM
that can do this, using different kinds of models.  One is
"hidden-ngram", which performs boundary tagging explicitly.
The other is "disambig", which tags the tokens themselves, not the boundaries
between them.  But by assigning tags that denote "first token in a unit",
"token inside a unit", etc., you can perform boundary tagging implicitly.
(The tokens in your case are the syllables; the units would be the words.)
Both tools use ngram language models to disambiguate the input.
In your case, the model can be trained from syllabified training data.
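
To make the implicit tagging scheme concrete, here is a minimal Python
sketch of the data preparation around it. It is not part of SRILM and does
not call any SRILM tool. The tag suffixes "_B" (first syllable in a word)
and "_I" (syllable inside a word), the function names, and the example
syllables are all assumptions chosen for illustration. The first function
turns word-segmented, syllabified training text into tagged tokens; the
second regroups a tagged sequence, such as a tagger's output, into words.

```python
# Hypothetical tag suffixes; illustrative only, not an SRILM convention.
FIRST = "_B"   # first syllable in a word
INSIDE = "_I"  # any later syllable in the same word


def tag_syllables(words):
    """Turn a list of words (each a list of syllables) into tagged tokens."""
    tagged = []
    for syllables in words:
        for i, syl in enumerate(syllables):
            tagged.append(syl + (FIRST if i == 0 else INSIDE))
    return tagged


def group_into_words(tagged):
    """Recover the word grouping from a sequence of tagged syllables."""
    words = []
    for token in tagged:
        if token.endswith(FIRST):
            # A "first" tag implies a boundary before this syllable.
            words.append([token[:-len(FIRST)]])
        elif token.endswith(INSIDE):
            if not words:
                # Tolerate a sequence that starts mid-word.
                words.append([])
            words[-1].append(token[:-len(INSIDE)])
        else:
            raise ValueError("untagged token: " + token)
    return words


if __name__ == "__main__":
    training = [["na", "ma"], ["ste"]]
    tagged = tag_syllables(training)
    print(" ".join(tagged))          # na_B ma_I ste_B
    print(group_into_words(tagged))  # [['na', 'ma'], ['ste']]
```

An ngram model trained on such tagged sequences is the kind of model that
a tag-based approach would use; how to pass it to the actual tools is
described in their manual pages.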

I suggest you look up papers on "word segmentation", "sentence segmentation",
"Mandarin tokenization", "chunk parsing" and "shallow parsing" to
get a good idea of the existing models for this type of task,
then study the manual pages for the programs.

--Andreas

>
> Thanks in Advance.
> Regards,
> Lakshmi

©2006 SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025-3493
SRI International is an independent, nonprofit corporation.

Last modified Dec 02, 2008