SRILM and LC_ALL

From: David Gelbart <gelbart at ADDRESS HIDDEN>
Date: Mon, 8 Oct 2007 18:32:33 -0700 (PDT)

On July 19, 2007, Andreas Stolcke wrote:
> David Brodbeck wrote:
> > I'm trying to build SRILM 1.5.2 on Redhat Enterprise Linux Server 5.
> > The machine type is i686-m64.  Everything builds all right, but
> > the tests fail for make-ngram-pfsg, ngram-class, and
> > ngram-count-lm-limit-vocab.
> >
> > make-ngram-pfsg is the most obvious one, so I'll tackle that one
> > first.  I get the following in the stderr file:
> > gawk: /opt/srilm/bin/i686-m64/add-pauses-to-pfsg:22: fatal: Invalid
> > collation character: /[[:lower:]-ÿ]/
>
> > Has anyone else run into this?  I'm using GNU Awk 3.1.5, and the
> > locale is set to en_US.UTF-8.
>
> This is odd since we're also using gawk 3.1.5 and I cannot replicate
> the problem even when setting LANG to en_US.UTF-8. It seems that the
> interpretation of gawk regular expressions should not depend on the
> OS release version, but of course there may always be bugs.

Hi Andreas,

Are you sure you used gawk 3.1.5 when you tried to replicate this?
The reason I ask is that at ICSI, the SRILM tools seem to invoke gawk
3.1.3, not gawk 3.1.5:

$ head -1 `which add-pauses-to-pfsg`
#!/usr/bin/gawk -f
$ /usr/bin/gawk --version | head -1
GNU Awk 3.1.3
$ which gawk
/usr/local/bin/gawk
$ /usr/local/bin/gawk --version | head -1
GNU Awk 3.1.5
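
The check above can be scripted so that it is easy to repeat for other
SRILM scripts. What follows is only a minimal sketch, assuming a POSIX
shell; the function name shebang_interpreter is hypothetical and is not
part of SRILM or gawk. It prints the interpreter named on a script's
#! line, which may differ from the gawk found first in PATH.

```shell
# Hypothetical helper (not part of SRILM): print the interpreter path
# named on the #! line of the script given as $1.
shebang_interpreter() {
    # Look only at the first line. Strip the leading "#!" and any spaces
    # after it, then keep the first word, which is the interpreter path
    # without its arguments (such as "-f").
    # Nothing is printed if the first line is not a #! line.
    head -n 1 "$1" | sed -n 's/^#![[:space:]]*\([^[:space:]]*\).*/\1/p'
}
```

One possible use is to run
"$(shebang_interpreter "$(which add-pauses-to-pfsg)")" --version | head -1
and compare the result with gawk --version | head -1.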

My default locale is en_US.  With this locale, I do not see the error
David Brodbeck did, even if I use gawk 3.1.5.  If I set
LANG=en_US.UTF-8 and use gawk 3.1.5, then I see the error:

$ /usr/local/bin/gawk  -f `which add-pauses-to-pfsg`
gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22:
fatal: Invalid collation character: /[[:lower:]-?]/

Setting LC_ALL=C as suggested in the SRILM INSTALL file does not solve
the problem:

tmp$ export LC_ALL=C
tmp$ /usr/local/bin/gawk  -f `which add-pauses-to-pfsg`
gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22:
fatal: Invalid collation character: /[[:lower:]-ÿ]/

The compute-oov-rate script gives a similar error.
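
Because both LANG and LC_ALL are involved here, it may help to confirm
which setting actually governs collation in a given shell. The sketch
below assumes the usual POSIX precedence (LC_ALL overrides LC_COLLATE,
which overrides LANG). The function name effective_collate is
hypothetical, and the fallback to "C" when nothing is set is an
assumption, since that default is implementation-defined. It only reports
the variables; it does not explain why gawk 3.1.5 still rejects the range.

```shell
# Hypothetical helper: print the locale name that should govern collation,
# following the POSIX precedence LC_ALL > LC_COLLATE > LANG.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: treat "nothing set" as the C locale.
        printf 'C\n'
    fi
}
```

Calling effective_collate before and after "export LC_ALL=C" shows
whether the override took effect; the standard locale command reports
the same information for all categories.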

David Brodbeck, if you're reading this, did setting LC_ALL=C solve
your problem with add-pauses-to-pfsg?  This was not clear to me from
reading your July 23 email to Andreas.

Thanks,
David

Last modified Dec 05, 2008