From: Chip Salzenberg
Date: Mon, 25 Nov 1996 12:56:45 +0000 (+1200)
Subject: Update locale documentation.
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=f6bd30faf48db4b47d88b58c4e98bba648d605f1;p=p5sagit%2Fp5-mst-13.2.git

Update locale documentation.
---

diff --git a/pod/perli18n.pod b/pod/perli18n.pod
index 891f95e..aea6b4a 100644
--- a/pod/perli18n.pod
+++ b/pod/perli18n.pod
@@ -5,10 +5,10 @@ perl18n - Perl i18n (internalization)
 =head1 DESCRIPTION

 Perl supports the language-specific notions of data like
-"is this a letter" and "which letter comes first". These
+"is this a letter" and "which letter comes first". These
 are very important issues especially for languages other
 than English -- but also for English: it would be very
-naïve indeed to think that C defines all the letters.
+naïve indeed to think that C defines all the "letters".

 Perl understands the language-specific data via the standardized
 (ISO C, XPG4, POSIX 1.c) method called "the locale system".
@@ -33,26 +33,27 @@ In runtime you can switch locales using the POSIX::setlocale().
         $old_locale = setlocale(LC_CTYPE);

         setlocale(LC_CTYPE, "fr_CA.ISO8859-1");
-        # for LC_CTYPE now in locale "French, Canada, codeset ISO 8859-1"
+        # LC_CTYPE now in locale "French, Canada, codeset ISO 8859-1"

         setlocale(LC_CTYPE, "");
-        # for LC_CTYPE now in locale what the LC_ALL / LC_CTYPE / LANG define.
+        # LC_CTYPE now in locale what the LC_ALL / LC_CTYPE / LANG define.
         # see below for documentation about the LC_ALL / LC_CTYPE / LANG.

         # restore the old locale
         setlocale(LC_CTYPE, $old_locale);

 The first argument of C is called B and the
-second argument B. The category tells in what aspect of data
-processing we want to apply language-specific rules, the locale tells
-in what language-country/territory-codeset - but read on for the naming
-of the locales: not all systems name locales as in the example.
+second argument B. The category tells in what aspect of
+data processing we want to apply language-specific rules, the locale
+tells in what language-country/territory-codeset - but read on for the
+naming of the locales: not all systems name locales as in the example.

 For further information about the categories, please consult your
-L manual. For the locales available in your system, also
-consult the L manual and see whether it leads you to the
-list of the available locales (search for the C section). If
-that fails, try out in command line the following commands:
+L manual. For the locales available in your system,
+also consult the L manual and see whether it leads you
+to the list of the available locales (search for the C
+section). If that fails, try out in command line the following
+commands:

 =over 12

@@ -76,60 +77,101 @@ and see whether they list something resembling these
         english                 german                  russian
         english.iso88591        german.iso88591         russian.iso88595

-Sadly enough even if the calling interface has been standardized
-the names of the locales are not. The naming usually is
-language-country/territory-codeset but the latter parts may
-not be present. Two special locales are worth special mention:
-
-        "C"
-
-and
-        "POSIX"
+Sadly enough even if the calling interface has been standardized the
+names of the locales are not. The naming usually is
+language_country/territory.codeset but the latter parts may not be
+present.

+Two special locales are worth special mention: C<"C"> and C<"POSIX">.
 Currently and effectively these are the same locale: the difference is
 mainly that the first one is defined by the C standard and the second
-one is defined by the POSIX standard. What they mean and define is the
-B in which every program does start in. The language
-is (American) English and the character codeset C.
-B: not all systems have the C<"POSIX"> locale (not all systems
-are POSIX): use the C<"C"> locale when you need the default locale.
+one is defined by the POSIX standard. What they mean and define is
+the B in which every program does start in. The
+language is (American) English and the character codeset C.
+B: Not all systems have the C<"POSIX"> locale (not all systems
+are POSIX), so use the C<"C"> locale when you need the default locale.

-=head2 Category LC_CTYPE: CHARACTER TYPES
+=head2 The C Pragma

-Starting from Perl version 5.002 perl has obeyed the C
-environment variable which controls application's notions on
-which characters are alphabetic characters. This affects in
-Perl the regular expression metanotation
+By default, Perl ignores the current locale. The C pragma
+tells Perl to use the current locale for some operations: The
+comparison functions (lt, le, eq, cmp, ne, ge, gt, sort) use
+C; regular expressions and case-modification functions
+(uc, lc, ucfirst, lcfirst) use C; and formatting functions
+(printf and sprintf) use C. The default behavior returns
+with C or by reaching the end of the enclosing block.

-        \w
+Note that the result of any operation that uses locale information is
+tainted, since locales can be created by unprivileged users on some
+systems (see L).

-which stands for alphanumeric characters, that is, alphabetic and
-numeric characters (please consult L for more information
-about regular expressions). Thanks to the C, depending on
-your locale settings, characters like C<Æ>, C<É>, C<ß>, C<ø>, can be
-understood as C<\w> characters.
+=head2 Category LC_COLLATE: Collation

-=head2 Category LC_COLLATE: COLLATION
-
-Starting from Perl version 5.003_06 perl has obeyed the B
+When in the scope of C, Perl obeys the B
 environment variable which controls application's notions on the
-collation (ordering) of the characters. C does in most Latin
+collation (ordering) of the characters. C does in most Latin
 alphabets follow the C but where do the C<Á> and C<Ä> belong?

+B: Comparing and sorting by locale is usually slower than the
+default sorting; factors of 2 to 4 have been observed. It will also
+consume more memory: while a Perl scalar variable is participating in
+any string comparison or sorting operation and obeying the locale
+collation rules it will take about 3-15 (the exact value depends on
+the operating system) times more memory than normally. These downsides
+are dictated more by the operating system implementation of the locale
+system than by Perl.
+
 Here is a code snippet that will tell you what are the alphanumeric
 characters in the current locale, in the locale order:

-        perl -le 'print sort grep /\w/, map { chr() } 0..255'
+        use POSIX qw(setlocale LC_COLLATE);
+        use locale;

-As noted above, this will work only for Perl versions 5.003_06 and up.
+        setlocale(LC_COLLATE, "");
+        print +(sort grep /\w/, map { chr() } 0..255), "\n";

-B: in the pre-5.003_06 Perl releases the per-locale collation
-was possible using the C library module. This is now
-mildly obsolete and to be avoided. The C functionality is
+The default collation must be used for example for sorting raw binary
+data whereas the locale collation is useful for natural text.
+
+B: In some locales some characters may have no collation value
+at all -- this means for example if the C<'-'> is such a character the
+C and C may sort to the same place.
+
+B: For certain environments the locale support by the operating
+system is very simply broken and cannot be used or fixed by Perl. Such
+deficiencies can and will result in mysterious hangs and/or Perl core
+dumps. One such example is IRIX before the release 6.2, the
+C support simply does not work. When confronted with such
+systems, please report in excruciating detail to C,
+complain to your vendor, maybe some bug fixes exist for your operating
+system for these problems? Sometimes such bug fixes are called an
+operating system upgrade.
+
+B: In the pre-5.003_06 Perl releases the per-locale collation
+was possible using the C library module. This is now
+mildly obsolete and to be avoided. The C functionality is
 integrated into the Perl core language and one can use scalar data
 completely normally -- there is no need to juggle with the scalar
 references of C.

+=head2 Category LC_CTYPE: Character Types
+
+When in the scope of C, Perl obeys the C locale
+information which controls application's notions on which characters
+are alphabetic characters. This affects in Perl the regular expression
+metanotation C<\\w> which stands for alphanumeric characters, that is,
+alphabetic and numeric characters (please consult L for more
+information about regular expressions). Thanks to the C,
+depending on your locale settings, characters like C<Æ>, C<É>,
+C<ß>, C<ø>, may be understood as C<\w> characters.
+
+=head2 Category LC_NUMERIC: Numeric Formatting
+
+When in the scope of C, Perl obeys the C
+locale information which controls application's notions on how numbers
+should be formatted for input and output. This affects in Perl the
+printf and fprintf function, as well as POSIX::strtod.
+
 =head1 ENVIRONMENT

 =over 12
@@ -137,16 +179,17 @@ references of C.
 =item PERL_BADLANG

 A string that controls whether Perl warns in its startup about failed
-locale settings. This can happen if the locale support in the
-operating system is lacking (broken) is some way. If this string has
+locale settings. This can happen if the locale support in the
+operating system is lacking (broken) is some way. If this string has
 an integer value differing from zero, Perl will not complain.
-B: this is just hiding the warning message: the message tells
+
+B: This is just hiding the warning message. The message tells
 about some problem in your system's locale support and you should
 investigate what the problem is.

 =back

-The following environment variables are not specific to Perl: they are
+The following environment variables are not specific to Perl: They are
 part of the standardized (ISO C, XPG4, POSIX 1.c) setlocale method to
 control an application's opinion on data.

@@ -159,32 +202,33 @@ set, it overrides all the rest of the locale environment variables.

 =item LC_CTYPE

-C controls the classification of characters, see above.
-
-If this is unset and the C is set, the C is used as
-the C. If both this and the C are unset but the C
-is set, the C is used as the C.
-If none of these three is set, the default locale C<"C">
-is used as the C.
+In the absence of C, C chooses the character type
+locale. In the absence of both C and C, C
+chooses the character type locale.

 =item LC_COLLATE

-C controls the collation of characters, see above.
+In the absence of C, C chooses the collation
+locale. In the absence of both C and C, C
+chooses the collation locale.
+
+=item LC_NUMERIC

-If this is unset and the C is set, the C is used as
-the C. If both this and the C are unset but the
-C is set, the C is used as the C.
-If none of these three is set, the default locale C<"C">
-is used as the C.
+In the absence of C, C chooses the numeric format
+locale. In the absence of both C and C, C
+chooses the numeric format.

 =item LANG

-LC_ALL is the "catch-all" locale environment variable. If it is set,
-it is used as the last resort if neither of the C and the
-category-specific C are set.
+C is the "catch-all" locale environment variable. If it is set,
+it is used as the last resort after the overall C and the
+category-specific C.

 =back

 There are further locale-controlling environment variables
-(C) but Perl
-B currently obey them.
+(C) but Perl B currently
+use them, except possibly as they affect the behavior of library
+functions called by Perl extensions.
+
+=cut
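
The lookup order that this last hunk documents (LC_ALL first, then the category-specific variable, then LANG) can be sketched in Perl. This is an illustration only and not part of the patch. effective_locale() is a hypothetical helper that models the precedence on a plain hash and never calls setlocale(). The fallback to "C" follows the wording the patch removes, not anything the new text guarantees. Treating an empty value as unset is an assumption based on the usual POSIX behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Perl or the POSIX module: it models
# the documented lookup order only (LC_ALL, then the category variable,
# then LANG) and falls back to "C" when none of them is usable.
sub effective_locale {
    my ($category, $env) = @_;
    for my $name ('LC_ALL', $category, 'LANG') {
        my $value = $env->{$name};
        # Assumption: an empty string counts as "not set".
        return $value if defined $value && length $value;
    }
    return 'C';
}

# Illustrative settings; the locale names are taken from the examples
# in the patched document and need not exist on a given system.
my %env = (
    LANG       => 'fr_CA.ISO8859-1',
    LC_COLLATE => 'german.iso88591',
);

print effective_locale('LC_COLLATE', \%env), "\n";  # category variable beats LANG
print effective_locale('LC_CTYPE',   \%env), "\n";  # no LC_CTYPE, so LANG is used
print effective_locale('LC_CTYPE',   {}),    "\n";  # nothing set, so "C"
```

Passing \%ENV instead of the sample hash would report what the current process environment selects for a category, under the same assumptions.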