Re: SRILM and LC_ALL

From: David Gelbart <gelbart at ADDRESS HIDDEN>
Date: Mon, 8 Oct 2007 22:24:49 -0700 (PDT)

> My default locale is en_US.  With this locale, I do not see the error David
> Brodbeck did, even if I use gawk 3.1.5.  If I set LANG=en_US.UTF-8 and use
> gawk 3.1.5, then I see the error:
>
> $ /usr/local/bin/gawk  -f `which add-pauses-to-pfsg`
> gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22: fatal:
> Invalid collation character: /[[:lower:]-?]/

A followup:

At home, I'm running gawk 3.1.5 under Fedora Core 3 and my default
locale is en_US.UTF-8:

  $ locale
  LANG=en_US.UTF-8
  LC_CTYPE="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_COLLATE="en_US.UTF-8"
  LC_MONETARY="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_PAPER="en_US.UTF-8"
  LC_NAME="en_US.UTF-8"
  LC_ADDRESS="en_US.UTF-8"
  LC_TELEPHONE="en_US.UTF-8"
  LC_MEASUREMENT="en_US.UTF-8"
  LC_IDENTIFICATION="en_US.UTF-8"
  LC_ALL=

If I use the default locale, I get the "Invalid collation character"
error.  If I set LANG=C, I get the same error.

If I set LC_ALL=en_US, that error goes away but the make-ngram-pfsg
test fails with the message "make-ngram-pfsg: stdout output DIFFERS".
I think this is because when LC_ALL is set it overrides the other LC_*
variables (http://opengroup.org/onlinepubs/007908799/xbd/envvar.html).
This means that the line in test/tests/make-ngram-pfsg/run-test which
sets LC_COLLATE=C has no effect when LC_ALL is set.
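
As a rough illustration of that precedence (a sketch only, not part of SRILM or its test suite), the helper below takes candidate values for LC_ALL, LC_COLLATE and LANG and prints whichever one would govern collation. The function name effective_collate is made up for this example, and the final fallback to "C" is an assumption: POSIX leaves the default locale implementation-defined when none of the variables is set.

```shell
#!/bin/sh
# effective_collate LC_ALL_VALUE LC_COLLATE_VALUE LANG_VALUE
# Hypothetical helper: prints the locale name that would apply to collation.
# Precedence follows the POSIX rule: LC_ALL, then LC_COLLATE, then LANG.
effective_collate() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # LC_ALL overrides every other LC_* variable
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"      # LC_COLLATE applies only when LC_ALL is empty
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"      # LANG is the fallback for unset categories
    else
        printf 'C\n'            # assumed default; implementation-defined in POSIX
    fi
}

# LC_ALL=en_US with LC_COLLATE=C: the LC_COLLATE setting is ignored.
effective_collate "en_US" "C" "en_US.UTF-8"

# LC_ALL unset, LANG=en_US: LC_COLLATE=C takes effect.
effective_collate "" "C" "en_US"
```

The first call corresponds to the failing case above, where the LC_COLLATE=C line in run-test is overridden; the second corresponds to leaving LC_ALL unset so that the run-test setting is honored.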

If I set LANG=en_US and leave LC_ALL unset, then the
"Invalid collation character" error goes away and the make-ngram-pfsg
test passes.

So it appears that the gawk locale tips in the SRILM INSTALL file may
need to be updated to reflect gawk 3.1.5's behavior.  Please let me
know if there's anything else I could do to help with this.

Regards,
David


Last modified Dec 05, 2008