=head1 NAME
2
perllocale - Perl locale handling (internationalization)
4
5=head1 DESCRIPTION
6
7Perl supports language-specific notions of data such as "is this a
8letter", "what is the upper-case equivalent of this letter", and
9"which of these letters comes first". These are important issues,
10especially for languages other than English - but also for English: it
would be very naïve to think that C<A-Za-z> defines all the "letters".
12Perl is also aware that some character other than '.' may be preferred
13as a decimal point, and that output date representations may be
14language-specific.
15
16Perl can understand language-specific data via the standardized
17(ISO C, XPG4, POSIX 1.c) method called "the locale system".
18The locale system is controlled per application using a pragma, one
19function call, and several environment variables.
20
21B<NOTE>: This feature is new in Perl 5.004, and does not apply unless
22an application specifically requests it - see L<Backward
23compatibility>.
24
25=head1 PREPARING TO USE LOCALES
26
27If Perl applications are to be able to understand and present your
data correctly according to a locale of your choice, B<all> of the following
29must be true:
30
31=over 4
32
33=item *
34
35B<Your operating system must support the locale system>. If it does,
36you should find that the C<setlocale> function is a documented part of
37its C library.
38
39=item *
40
41B<Definitions for the locales which you use must be installed>. You,
42or your system administrator, must make sure that this is the case.
43The available locales, the location in which they are kept, and the
44manner in which they are installed, vary from system to system. Some
45systems provide only a few, hard-wired, locales, and do not allow more
46to be added; others allow you to add "canned" locales provided by the
47system supplier; still others allow you or the system administrator
48to define and add arbitrary locales. (You may have to ask your
supplier to provide canned locales which are not delivered with your
50operating system.) Read your system documentation for further
51illumination.
52
53=item *
54
55B<Perl must believe that the locale system is supported>. If it does,
56C<perl -V:d_setlocale> will say that the value for C<d_setlocale> is
57C<define>.
58
59=back
60
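The last of these checks can also be made from inside a program. The
following is a minimal sketch, assuming only the standard C<Config>
module; the variable name C<$has_setlocale> is illustrative.

    use Config;

    # $Config{d_setlocale} holds the value that
    # "perl -V:d_setlocale" reports.
    my $has_setlocale = defined $Config{d_setlocale}
        && $Config{d_setlocale} eq 'define';

    print "setlocale support compiled in: ",
        ($has_setlocale ? "yes" : "no"), "\n";

A "no" here means that Perl was built without locale support, so none
of what follows will have any effect.
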
61If you want a Perl application to process and present your data
62according to a particular locale, the application code should include
the S<C<use locale>> pragma (see L<The use locale pragma>) where
64appropriate, and B<at least one> of the following must be true:
65
66=over 4
67
68=item *
69
70B<The locale-determining environment variables (see L<ENVIRONMENT>) must
71be correctly set up>, either by yourself, or by the person who set up
72your system account, at the time the application is started.
73
74=item *
75
76B<The application must set its own locale> using the method described
in L<The setlocale function>.
78
79=back
80
81=head1 USING LOCALES
82
83=head2 The use locale pragma
84
85By default, Perl ignores the current locale. The S<C<use locale>> pragma
86tells Perl to use the current locale for some operations:
87
88=over 4
89
90=item *
91
92B<The comparison operators> (C<lt>, C<le>, C<cmp>, C<ge>, and C<gt>)
93use C<LC_COLLATE>. The C<sort> function is also affected if it is
94used without an explicit comparison function because it uses C<cmp> by
95default.
96
97B<Note:> The C<eq> and C<ne> operators are unaffected by the locale:
98they always perform a byte-by-byte comparison of their scalar
99arguments. If you really want to know if two strings - which C<eq>
100may consider different - are equal as far as collation is concerned,
101use something like
102
    !("space and case ignored" cmp "SpaceAndCaseIgnored")
104
105(which would be true if the collation locale specified a
106dictionary-like ordering).
107
108I<Editor's note:> I am right about C<eq> and C<ne>, aren't I?
109
110=item *
111
112B<Regular expressions and case-modification functions> (C<uc>,
C<lc>, C<ucfirst>, and C<lcfirst>) use C<LC_CTYPE>.
114
115=item *
116
117B<The formatting functions> (C<printf> and C<sprintf>) use
C<LC_NUMERIC>.
119
120=item *
121
122B<The POSIX date formatting function> (C<strftime>) uses C<LC_TIME>.
123
124=back
125
126C<LC_COLLATE>, C<LC_CTYPE>, and so on, are discussed further in
127L<LOCALE CATEGORIES>.
128
129The default behaviour returns with S<C<no locale>> or on reaching the end
130of the enclosing block.
131
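The scoping rule can be sketched as follows. This is an illustration
only: the word list is made up, and the order printed for the
locale-aware sort depends on the collation locale in force when the
program starts.

    my @words = qw(zebra Apple apple Zebra);

    # Outside any "use locale" scope: plain byte-order comparison.
    my @native = sort @words;

    {
        use locale;
        # Inside this block cmp (and so sort) consults LC_COLLATE.
        my @by_locale = sort @words;
        print "locale order: @by_locale\n";
    }

    # The block has ended, so byte-order comparison applies again.
    print "native order: @native\n";

On an ASCII-based system the native order puts both capitalized words
first; a dictionary-like collation locale would typically interleave
them.
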
132Note that the result of any operation that uses locale information is
tainted (see L<perlsec>), since locales can be created by
134unprivileged users on some systems.
135
136=head2 The setlocale function
137
138You can switch locales as often as you wish at runtime with the
139C<POSIX::setlocale> function:
140
    # This functionality is not usable prior to Perl 5.004
    require 5.004;

    # Import locale-handling tool set from POSIX module.
    # This example uses: setlocale -- the function call
    #                    LC_CTYPE -- explained below
    use POSIX qw(locale_h);

    # query and save the old locale.
    $old_locale = setlocale(LC_CTYPE);

    setlocale(LC_CTYPE, "fr_CA.ISO8859-1");
    # LC_CTYPE now in locale "French, Canada, codeset ISO 8859-1"

    setlocale(LC_CTYPE, "");
    # LC_CTYPE now reset to default defined by LC_ALL/LC_CTYPE/LANG
    # environment variables.  See below for documentation.

    # restore the old locale
    setlocale(LC_CTYPE, $old_locale);
161
162The first argument of C<setlocale> gives the B<category>, the second
163the B<locale>. The category tells in what aspect of data processing
164you want to apply locale-specific rules. Category names are discussed
165in L<LOCALE CATEGORIES> and L<ENVIRONMENT>. The locale is the name of
a collection of customization information corresponding to a particular
167combination of language, country or territory, and codeset. Read on
168for hints on the naming of locales: not all systems name locales as in
169the example.
170
171If no second argument is provided, the function returns a string
172naming the current locale for the category. You can use this value as
173the second argument in a subsequent call to C<setlocale>. If a second
174argument is given and it corresponds to a valid locale, the locale for
175the category is set to that value, and the function returns the
176now-current locale value. You can use this in a subsequent call to
177C<setlocale>. (In some implementations, the return value may sometimes
178differ from the value you gave as the second argument - think of it as
179an alias for the value that you gave.)
180
181As the example shows, if the second argument is an empty string, the
182category's locale is returned to the default specified by the
183corresponding environment variables. Generally, this results in a
184return to the default which was in force when Perl started up: changes
185to the environment made by the application after start-up may or may
186not be noticed, depending on the implementation of your system's C
187library.
188
189If the second argument does not correspond to a valid locale, the
190locale for the category is not changed, and the function returns
191C<undef>.
192
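Because failure is reported only through the return value, it is worth
testing it. A minimal sketch, in which the locale name
C<xx_XX.no-such-codeset> is made up and assumed not to be installed:

    use POSIX qw(locale_h);

    # A made-up name, assumed not to exist on this system.
    my $wanted = "xx_XX.no-such-codeset";

    # Remember the current setting so it can be reported.
    my $before = setlocale(LC_CTYPE);

    my $result = setlocale(LC_CTYPE, $wanted);
    if (defined $result) {
        print "LC_CTYPE is now $result\n";
    } else {
        # On failure the category keeps its previous locale.
        print "Cannot set LC_CTYPE to $wanted; still $before\n";
    }

The same test applies to any category and any locale name taken from
user input or configuration.
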
193For further information about the categories, consult
194L<setlocale(3)>. For the locales available in your system,
195also consult L<setlocale(3)> and see whether it leads you
196to the list of the available locales (search for the C<SEE ALSO>
197section). If that fails, try the following command lines:
198
    locale -a

    nlsinfo

    ls /usr/lib/nls/loc

    ls /usr/lib/locale

    ls /usr/lib/nls
208
209and see whether they list something resembling these
210
    en_US.ISO8859-1     de_DE.ISO8859-1     ru_RU.ISO8859-5
    en_US               de_DE               ru_RU
    en                  de                  ru
    english             german              russian
    english.iso88591    german.iso88591     russian.iso88595
216
217Sadly, even though the calling interface for C<setlocale> has been
218standardized, the names of the locales have not. The form of the name
219is usually I<language_country>B</>I<territory>B<.>I<codeset>, but the
220latter parts are not always present.
221
222Two special locales are worth particular mention: "C" and
223"POSIX". Currently these are effectively the same locale: the
224difference is mainly that the first one is defined by the C standard
225and the second by the POSIX standard. What they define is the
226B<default locale> in which every program starts in the absence of
227locale information in its environment. (The default default locale,
228if you will.) Its language is (American) English and its character
229codeset ASCII.
230
231B<NOTE>: Not all systems have the "POSIX" locale (not all systems
232are POSIX-conformant), so use "C" when you need explicitly to
233specify this default locale.
234
235=head2 The localeconv function
236
237The C<POSIX::localeconv> function allows you to get particulars of the
238locale-dependent numeric formatting information specified by the
239current C<LC_NUMERIC> and C<LC_MONETARY> locales. (If you just want
240the name of the current locale for a particular category, use
241C<POSIX::setlocale> with a single parameter - see L<The setlocale
242function>.)
243
    use POSIX qw(locale_h);
    use locale;

    # Get a reference to a hash of locale-dependent info
    $locale_values = localeconv();

    # Output sorted list of the values
    for (sort keys %$locale_values) {
        printf "%-20s = %s\n", $_, $locale_values->{$_}
    }
254
255C<localeconv> takes no arguments, and returns B<a reference to> a
256hash. The keys of this hash are formatting variable names such as
257C<decimal_point> and C<thousands_sep>; the values are the
258corresponding values. See L<POSIX (3)/localeconv> for a longer
259example, which lists all the categories an implementation might be
260expected to provide; some provide more and others fewer, however.
261
262I<Editor's note:> I can't work out whether C<POSIX::localeconv>
263correctly obeys C<use locale> and C<no locale>. In my opinion, it
264should, if only to be consistent with other locale stuff - although
265it's hardly a show-stopper if it doesn't. Could someone check,
266please?
267
268Here's a simple-minded example program which rewrites its command line
269parameters as integers formatted correctly in the current locale:
270
    # See comments in previous example
    require 5.004;
    use POSIX qw(locale_h);
    use locale;

    # Get some of locale's numeric formatting parameters
    my ($thousands_sep, $grouping) =
        @{localeconv()}{'thousands_sep', 'grouping'};

    # Apply defaults if values are missing
    $thousands_sep = ',' unless $thousands_sep;

    # grouping is a packed list of small integers; this
    # example uses only the first group size
    my $group_size = $grouping ? (unpack "C*", $grouping)[0] : 3;
    $group_size = 3 unless $group_size && $group_size < 255;

    # Format command line params for current locale
    for (@ARGV)
    {
        $_ = int;    # Chop non-integer part
        1 while
        s/(\d)(\d{$group_size}($|\Q$thousands_sep\E))/$1$thousands_sep$2/;
        print "$_ ";
    }
    print "\n";
293
294I<Editor's note:> Like all the examples, this needs testing on systems
295which, unlike mine, have non-toy implementations of locale handling.
296
297=head1 LOCALE CATEGORIES
298
The subsections which follow describe basic locale categories. As well
as these, there are some combination categories which allow the
manipulation of more than one basic category at a time. See
L<ENVIRONMENT> for a discussion of these.
303
304=head2 Category LC_COLLATE: Collation
305
306When in the scope of S<C<use locale>>, Perl looks to the B<LC_COLLATE>
307environment variable to determine the application's notions on the
308collation (ordering) of characters. ('B' follows 'A' in Latin
alphabets, but where do 'Á' and 'Ù' belong?)
310
311Here is a code snippet that will tell you what are the alphanumeric
312characters in the current locale, in the locale order:
313
    use locale;
    print +(sort grep /\w/, map { chr() } 0..255), "\n";
316
317I<Editor's note:> The original example had C<setlocale(LC_COLLATE, "")>
318prior to C<print ...>. I think this is wrong: as soon as you utter
319S<C<use locale>>, the default behaviour of C<sort> (well, C<cmp>, really)
320becomes locale-aware. The locale it's aware of is the current locale
321which, unless you've changed it yourself, is the default locale
322defined by your environment.
323
324Compare this with the characters that you see and their order if you state
325explicitly that the locale should be ignored:
326
    no locale;
    print +(sort grep /\w/, map { chr() } 0..255), "\n";
329
330This machine-native collation (which is what you get unless S<C<use
331locale>> has appeared earlier in the same block) must be used for
332sorting raw binary data, whereas the locale-dependent collation of the
333first example is useful for written text.
334
335B<NOTE>: In some locales some characters may have no collation value
336at all - for example, if '-' is such a character, 'relocate' and
337're-locate' may be considered to be equal to each other, and so sort
338to the same position.
339
340=head2 Category LC_CTYPE: Character Types
341
342When in the scope of S<C<use locale>>, Perl obeys the C<LC_CTYPE> locale
343setting. This controls the application's notion of which characters
344are alphabetic. This affects Perl's C<\w> regular expression
345metanotation, which stands for alphanumeric characters - that is,
346alphabetic and numeric characters. (Consult L<perlre> for more
347information about regular expressions.) Thanks to C<LC_CTYPE>,
depending on your locale setting, characters like 'æ', 'ð',
'ß', and 'ø' may be understood as C<\w> characters.
350
351C<LC_CTYPE> also affects the POSIX character-class test functions -
352C<isalpha>, C<islower> and so on. For example, if you move from the
353"C" locale to a 7-bit Scandinavian one, you may find - possibly to
your surprise - that "|" moves from the C<ispunct> class to C<isalpha>.
355
356I<Editor's note:> I can't work out whether the C<POSIX::is...> stuff
357correctly obeys C<use locale> and C<no locale>. In my opinion, they
358should. Could someone check, please?
359
360B<Note:> A broken or malicious C<LC_CTYPE> locale definition may
361result in clearly ineligible characters being considered to be
362alphanumeric by your application. For strict matching of (unaccented)
363letters and digits - for example, in command strings - locale-aware
364applications should use C<\w> inside a C<no locale> block.
365
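One way to apply this advice is to confine the strict test to a small
subroutine. The following is a sketch only; the helper name
C<is_plain_word> and the sample strings are hypothetical.

    use locale;

    # Hypothetical helper: accept only letters, digits and
    # underscore as understood without reference to LC_CTYPE.
    sub is_plain_word {
        my ($string) = @_;
        no locale;      # \w below ignores the locale
        return $string =~ /^\w+$/ ? 1 : 0;
    }

    print is_plain_word("list_files") ? "ok\n" : "rejected\n";
    print is_plain_word("rm -rf")     ? "ok\n" : "rejected\n";

The rest of the program stays locale-aware; only the match inside the
subroutine falls back to the default rules.
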
366=head2 Category LC_NUMERIC: Numeric Formatting
367
When in the scope of S<C<use locale>>, Perl obeys the C<LC_NUMERIC>
locale information which controls an application's idea of how numbers
should be formatted for human readability by the C<printf>, C<sprintf>,
and C<write> functions. String to numeric conversion by the
C<POSIX::strtod> function is also affected. In most implementations
the only effect is to change the character used for the decimal point
- perhaps from '.' to ',': these functions aren't aware of such
niceties as thousands separation and so on. (See L<The localeconv
function> if you care about these things.)
377
378I<Editor's note:> I can't work out whether C<POSIX::strtod> correctly
379obeys C<use locale> and C<no locale>. In my opinion, it should -
380although it's hardly a show-stopper if it doesn't. Could someone
381check, please?
382
383Note that output produced by C<print> is B<never> affected by the
384current locale: it is independent of whether C<use locale> or C<no
385locale> is in effect, and corresponds to what you'd get from C<printf>
386in the "C" locale. The same is true for Perl's internal conversions
387between numeric and string formats:
388
    use POSIX qw(strtod);
    use locale;
    $n = 5/2;   # Assign numeric 2.5 to $n

    $a = " $n"; # Locale-independent conversion to string

    print "half five is $n\n";       # Locale-independent output

    printf "half five is %g\n", $n;  # Locale-dependent output

    print "DECIMAL POINT IS COMMA\n" # Locale-dependent conversion
        if $n == (strtod("2,5"))[0];
401
402=head2 Category LC_MONETARY: Formatting of monetary amounts
403
404The C standard defines the C<LC_MONETARY> category, but no function
405that is affected by its contents. (Those with experience of standards
406committees will recognise that the working group decided to punt on
407the issue.) Consequently, Perl takes no notice of it. If you really
408want to use C<LC_MONETARY>, you can query its contents - see L<The
409localeconv function> - and use the information that it returns in your
410application's own formating of currency amounts. However, you may
411well find that the information, though voluminous and complex, does
412not quite meet your requirements: currency formatting is a hard nut to
413crack.
414
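As a rough illustration of the do-it-yourself approach, the sketch
below reads three C<LC_MONETARY> values from C<localeconv> and applies
them to a sample amount. The fallback values and the amount are
assumptions, and the sketch ignores grouping, sign handling, and the
locale's rules on where the currency symbol belongs.

    use POSIX qw(locale_h);

    my $lconv = localeconv();

    # Keys may be absent (for instance in the "C" locale), so
    # supply fallbacks; the fallback values are assumptions.
    my $symbol = $lconv->{currency_symbol}   || '';
    my $point  = $lconv->{mon_decimal_point} || '.';
    my $digits = $lconv->{frac_digits};
    $digits = 2 unless defined $digits;

    # No "use locale" here, so sprintf produces a '.' which is
    # then replaced by the monetary decimal point.
    my $text = sprintf("%.${digits}f", 1234.25);
    $text =~ s/\./$point/;

    print "$symbol$text\n";

A real application would also need C<mon_thousands_sep>,
C<mon_grouping>, and the sign and symbol placement fields.
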
=head2 Category LC_TIME: Date Formatting
416
417The output produced by C<POSIX::strftime>, which builds a formatted
418human-readable date/time string, is affected by the current C<LC_TIME>
419locale. Thus, in a French locale, the output produced by the C<%B>
420format element (full month name) for the first month of the year would
421be "janvier". Here's how to get a list of the long month names in the
422current locale:
423
    use POSIX qw(strftime);
    use locale;
    for (0..11)
    {
        $long_month_name[$_] = strftime("%B", 0, 0, 0, 1, $_, 96);
    }
430
431I<Editor's note:> Unchecked in "alien" locales: my system can't do
432French...
433
434=head2 Other categories
435
436The remaining locale category, C<LC_MESSAGES> (possibly supplemented by
437others in particular implementations) is not currently used by Perl -
438except possibly to affect the behaviour of library functions called
439by extensions which are not part of the standard Perl distribution.
440
441=head1 ENVIRONMENT
442
443=over 12
444
445=item PERL_BADLANG
446
447A string that controls whether Perl warns in its startup about failed
448locale settings. This can happen if the locale support in the
449operating system is lacking (broken) is some way. If this string has
450an integer value differing from zero, Perl will not complain.
451
452B<NOTE>: This is just hiding the warning message. The message tells
453about some problem in your system's locale support and you should
454investigate what the problem is.
455
456=back
457
458The following environment variables are not specific to Perl: They are
459part of the standardized (ISO C, XPG4, POSIX 1.c) setlocale method to
460control an application's opinion on data.
461
462=over 12
463
464=item LC_ALL
465
466C<LC_ALL> is the "override-all" locale environment variable. If it is
467set, it overrides all the rest of the locale environment variables.
468
469=item LC_CTYPE
470
471In the absence of C<LC_ALL>, C<LC_CTYPE> chooses the character type
472locale. In the absence of both C<LC_ALL> and C<LC_CTYPE>, C<LANG>
473chooses the character type locale.
474
475=item LC_COLLATE
476
477In the absence of C<LC_ALL>, C<LC_COLLATE> chooses the collation (sorting)
478locale. In the absence of both C<LC_ALL> and C<LC_COLLATE>, C<LANG>
479chooses the collation locale.
480
481=item LC_MONETARY
482
In the absence of C<LC_ALL>, C<LC_MONETARY> chooses the monetary formatting
484locale. In the absence of both C<LC_ALL> and C<LC_MONETARY>, C<LANG>
485chooses the monetary formatting locale.
486
487=item LC_NUMERIC
488
489In the absence of C<LC_ALL>, C<LC_NUMERIC> chooses the numeric format
490locale. In the absence of both C<LC_ALL> and C<LC_NUMERIC>, C<LANG>
chooses the numeric format locale.
492
493=item LC_TIME
494
495In the absence of C<LC_ALL>, C<LC_TIME> chooses the date and time formatting
496locale. In the absence of both C<LC_ALL> and C<LC_TIME>, C<LANG>
497chooses the date and time formatting locale.
498
499=item LANG
500
501C<LANG> is the "catch-all" locale environment variable. If it is set,
502it is used as the last resort after the overall C<LC_ALL> and the
503category-specific C<LC_...>.
504
505=back
506
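The order of precedence described above can be sketched in a few lines.
The helper name C<locale_from_env> is hypothetical, and two assumptions
are made: that a variable set to the empty string counts as unset, and
that the fallback when nothing is set is the "C" locale. The C library
makes the real decision, so treat this only as a diagnostic aid.

    # Hypothetical helper: report which environment variable
    # would decide the locale for one category, trying LC_ALL,
    # then the category's own variable, then LANG.
    sub locale_from_env {
        my ($category, %env) = @_;
        for my $var ('LC_ALL', $category, 'LANG') {
            return ($var, $env{$var})
                if defined $env{$var} && length $env{$var};
        }
        return ('(default)', 'C');
    }

    my ($var, $name) = locale_from_env('LC_COLLATE', %ENV);
    print "LC_COLLATE would come from $var: $name\n";

Passing the environment as an argument, rather than reading C<%ENV>
inside the subroutine, makes it easy to try out other combinations.
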
507=head1 NOTES
508
509=head2 Backward compatibility
510
511Versions of Perl prior to 5.004 ignored locale information, generally
512behaving as if something similar to the C<"C"> locale (see L<The
513setlocale function>) was always in force, even if the program
514environment suggested otherwise. By default, Perl still behaves this
515way so as to maintain backward compatibility. If you want a Perl
516application to pay attention to locale information, you B<must> use
the S<C<use locale>> pragma (see L<The use locale pragma>) to
518instruct it to do so.
519
520=head2 Sort speed
521
522Comparing and sorting by locale is usually slower than the default
523sorting; factors of 2 to 4 have been observed. It will also consume
524more memory: while a Perl scalar variable is participating in any
525string comparison or sorting operation and obeying the locale
526collation rules it will take about 3-15 (the exact value depends on
527the operating system and the locale) times more memory than normally.
528These downsides are dictated more by the operating system
529implementation of the locale system than by Perl.
530
=head2 I18N::Collate
532
533In Perl 5.003 (and later development releases prior to 5.003_06),
534per-locale collation was possible using the C<I18N::Collate> library
535module. This is now mildly obsolete and should be avoided in new
536applications. The C<LC_COLLATE> functionality is integrated into the
537Perl core language and one can use locale-specific scalar data
538completely normally - there is no need to juggle with the scalar
539references of C<I18N::Collate>.
540
541=head2 An imperfect standard
542
543Internationalization, as defined in the C and POSIX standards, can be
544criticized as incomplete, ungainly, and having too large a
545granularity. (Locales apply to a whole process, when it would
546arguably be more useful to have them apply to a single thread, window
547group, or whatever.) They also have a tendency, like standards
548groups, to divide the world into nations, when we all know that the
549world can equally well be divided into bankers, bikers, gamers, and so
550on. But, for now, it's the only standard we've got. This may be
551construed as a bug.
552
553=head2 Freely available locale definitions
554
555There is a large collection of locale definitions at
556C<ftp://dkuug.dk/i18n/WG15-collection>. You should be aware that they
557are unsupported, and are not claimed to be fit for any purpose. If
558your system allows the installation of arbitrary locales, you may find
559them useful as they are, or as a basis for the development of your own
560locales.
561
562=head2 i18n and l10n
563
564Internationalization is often abbreviated as B<i18n> because its first
565and last letters are separated by eighteen others. You can also talk of
566localization (B<l10n>), the process of tailoring an
internationalized application for use in a particular locale.
568
569=head1 BUGS
570
571=head2 Broken systems
572
573In certain system environments the operating system's locale support
574is broken and cannot be fixed or used by Perl. Such deficiencies can
575and will result in mysterious hangs and/or Perl core dumps. One
576example is IRIX before release 6.2, in which the C<LC_COLLATE> support
577simply does not work. When confronted with such a system, please
578report in excruciating detail to C<perlbug@perl.com>, and complain to
579your vendor: maybe some bug fixes exist for these problems in your
580operating system. Sometimes such bug fixes are called an operating
581system upgrade.
582
583=head2 Rendering of this documentation
584
585This manual page contains non-ASCII characters, which should all be
586rendered as accented letters, and which should make some sort of sense
587in context. If this is not the case, your system is probably not
588using the ISO 8859-1 character set which was used to write them,
589and/or your formatting, display, and printing software are not
590correctly mapping them to your host's character set. If this annoys
591you, and if you can convince yourself that it is due to a bug in one
592of Perl's various C<pod2>... utilities, by all means report it as a
593Perl bug. Otherwise, pausing only to curse anyone who ever invented
594yet another character set, see if you can make it handle ISO 8859-1
595sensibly.
596
597=head1 SEE ALSO
598
599L<POSIX (3)/isalnum>, L<POSIX (3)/isalpha>, L<POSIX (3)/isdigit>,
600L<POSIX (3)/isgraph>, L<POSIX (3)/islower>, L<POSIX (3)/isprint>,
601L<POSIX (3)/ispunct>, L<POSIX (3)/isspace>, L<POSIX (3)/isupper>,
602L<POSIX (3)/isxdigit>, L<POSIX (3)/localeconv>, L<POSIX (3)/setlocale>,
603L<POSIX (3)/strtod>
604
605I<Editor's note:> That looks horrible after going through C<pod2man>.
But I do want to call out all these sections by name. What should I
607have done?
608
609=head1 HISTORY
610
611Perl 5.003's F<perli18n.pod> heavily hacked by Dominic Dunlop.
612
613Last update:
614Mon Dec 16 14:13:10 WET 1996