think UTF-8, embrace your inner UTF-8, as suggested by Larry.
(And as suggested by Markus Kuhn.)
While we are at it, document also the case of
mixed hash keys as a known potential troublemaker.
(Since it's locale-related, sometimes.)
p4raw-id: //depot/perl@15350
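
For reviewers: the POD added by this change says the locale variables need only "contain" the strings 'UTF-8' or 'UTF8', while the ibcmp() calls in the locale.c hunk appear to compare only the leading 5 or 4 bytes of each value. The standalone sketch below shows the documented (substring) reading, assuming POSIX strncasecmp() is available. The helper names contains_nocase, names_utf8 and env_wants_utf8 are hypothetical and are not part of the patch. The sketch also leaves out the nl_langinfo(CODESET) probe and the glibc-only handling of LANGUAGE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical helper: return 1 if `needle` occurs anywhere in `hay`,
 * ignoring case; a NULL `hay` never matches. */
static int contains_nocase(const char *hay, const char *needle)
{
    size_t n;

    assert(needle != NULL && *needle != '\0');
    if (hay == NULL)
        return 0;
    n = strlen(needle);
    for (; *hay != '\0'; hay++) {
        /* Compare the next n bytes of hay against needle. */
        if (strncasecmp(hay, needle, n) == 0)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: does a locale value mention UTF-8 in either spelling? */
static int names_utf8(const char *value)
{
    return contains_nocase(value, "UTF-8") || contains_nocase(value, "UTF8");
}

/* Hypothetical helper: check the four variables in the order the
 * documentation lists them; unset variables are skipped. */
static int env_wants_utf8(void)
{
    static const char *const vars[] = {
        "LANGUAGE", "LC_ALL", "LC_CTYPE", "LANG"
    };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        if (names_utf8(getenv(vars[i])))
            return 1;
    }
    return 0;
}
```

With this reading a value such as "en_US.UTF-8" or "fi_FI.utf8" counts as a UTF-8 locale, whereas a prefix-only comparison would match only values that begin with the string.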
#define PL_utf8_upper (PERL_GET_INTERP->Iutf8_upper)
#define PL_utf8_xdigit (PERL_GET_INTERP->Iutf8_xdigit)
#define PL_uudmap (PERL_GET_INTERP->Iuudmap)
+#define PL_wantutf8 (PERL_GET_INTERP->Iwantutf8)
#define PL_warnhook (PERL_GET_INTERP->Iwarnhook)
#define PL_widesyscalls (PERL_GET_INTERP->Iwidesyscalls)
#define PL_xiv_arenaroot (PERL_GET_INTERP->Ixiv_arenaroot)
#define PL_utf8_upper (vTHX->Iutf8_upper)
#define PL_utf8_xdigit (vTHX->Iutf8_xdigit)
#define PL_uudmap (vTHX->Iuudmap)
+#define PL_wantutf8 (vTHX->Iwantutf8)
#define PL_warnhook (vTHX->Iwarnhook)
#define PL_widesyscalls (vTHX->Iwidesyscalls)
#define PL_xiv_arenaroot (vTHX->Ixiv_arenaroot)
#define PL_Iutf8_upper PL_utf8_upper
#define PL_Iutf8_xdigit PL_utf8_xdigit
#define PL_Iuudmap PL_uudmap
+#define PL_Iwantutf8 PL_wantutf8
#define PL_Iwarnhook PL_warnhook
#define PL_Iwidesyscalls PL_widesyscalls
#define PL_Ixiv_arenaroot PL_xiv_arenaroot
PERLVAR(IOpSlab,I32 *)
#endif
+PERLVAR(Iwantutf8, bool) /* want utf8 as the default discipline */
+
/* New variables must be added to the very end for binary compatibility.
* XSUB.h provides wrapper functions via perlapi.h that make this
* irrelevant, but not all code may be expected to #include XSUB.h. */
=back
+If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG)
+contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching),
+the default encoding of your STDIN, STDOUT, and STDERR, and of
+B<any subsequent file open>, is UTF-8.
+
Directory handles may also support disciplines in future.
=head1 NONPERLIO FUNCTIONALITY
# include <locale.h>
#endif
+#ifdef I_LANGINFO
+# include <langinfo.h>
+#endif
+
/*
* Standardize the locale name from a string returned by 'setlocale'.
*
#ifdef USE_LOCALE_NUMERIC
new_numeric(curnum);
#endif /* USE_LOCALE_NUMERIC */
+
}
#endif /* USE_LOCALE */
+ {
+ bool wantutf8 = FALSE;
+ char *codeset = NULL;
+#if defined(HAS_NL_LANGINFO) && defined(CODESET)
+ codeset = nl_langinfo(CODESET);
+#endif
+ if (codeset &&
+ (ibcmp(codeset, "UTF-8", 5) == 0 ||
+ ibcmp(codeset, "UTF8", 4) == 0))
+ wantutf8 = TRUE;
+#ifdef __GLIBC__
+ if (!wantutf8 && language &&
+ (ibcmp(language, "UTF-8", 5) == 0 ||
+ ibcmp(language, "UTF8", 4) == 0))
+ wantutf8 = TRUE;
+#endif
+ if (!wantutf8 && lc_all &&
+ (ibcmp(lc_all, "UTF-8", 5) == 0 ||
+ ibcmp(lc_all, "UTF8", 4) == 0))
+ wantutf8 = TRUE;
+#ifdef USE_LOCALE_CTYPE
+ if (!wantutf8 && curctype &&
+ (ibcmp(curctype, "UTF-8", 5) == 0 ||
+ ibcmp(curctype, "UTF8", 4) == 0))
+ wantutf8 = TRUE;
+#endif
+ if (!wantutf8 && lang &&
+ (ibcmp(lang, "UTF-8", 5) == 0 ||
+ ibcmp(lang, "UTF8", 4) == 0))
+ wantutf8 = TRUE;
+ if (wantutf8)
+ PL_wantutf8 = TRUE;
+ }
+
#ifdef USE_LOCALE_CTYPE
if (curctype != NULL)
Safefree(curctype);
if (!PL_do_undump)
init_postdump_symbols(argc,argv,env);
+ if (PL_wantutf8) { /* Requires init_predump_symbols(). */
+ IO* io;
+ PerlIO* fp;
+ SV* sv;
+ if (PL_stdingv && (io = GvIO(PL_stdingv)) && (fp = IoIFP(io)))
+ PerlIO_binmode(aTHX_ fp, IoTYPE(io), 0, ":utf8");
+ if (PL_defoutgv && (io = GvIO(PL_defoutgv)) && (fp = IoOFP(io)))
+ PerlIO_binmode(aTHX_ fp, IoTYPE(io), 0, ":utf8");
+ if (PL_stderrgv && (io = GvIO(PL_stderrgv)) && (fp = IoOFP(io)))
+ PerlIO_binmode(aTHX_ fp, IoTYPE(io), 0, ":utf8");
+ if ((sv = GvSV(gv_fetchpv("\017PEN", TRUE, SVt_PV)))) {
+ sv_setpvn(sv, ":utf8\0:utf8", 11);
+ SvSETMAGIC(sv);
+ }
+ }
+
init_lexer();
/* now parse the script */
#define PL_utf8_xdigit (*Perl_Iutf8_xdigit_ptr(aTHX))
#undef PL_uudmap
#define PL_uudmap (*Perl_Iuudmap_ptr(aTHX))
+#undef PL_wantutf8
+#define PL_wantutf8 (*Perl_Iwantutf8_ptr(aTHX))
#undef PL_warnhook
#define PL_warnhook (*Perl_Iwarnhook_ptr(aTHX))
#undef PL_widesyscalls
creates a pipe, and runs the equivalent of exec('cat', '/etc/motd') in
the child process.
+=item *
+
+If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG)
+contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching),
+the default encoding of your STDIN, STDOUT, and STDERR, and of
+B<any subsequent file open>, is UTF-8.
+
=back
=head2 Safe Signals
Unicode in general should be now much more usable than in Perl 5.6.0
(or even in 5.6.1). Unicode can be used in hash keys, Unicode in
regular expressions should work now, Unicode in tr/// should work now,
-Unicode in I/O should work now.
+Unicode in I/O should work now. See L<perluniintro> for an introduction
+and L<perlunicode> for details.
=over 4
into bankers, bikers, gamers, and so on. But, for now, it's the only
standard we've got. This may be construed as a bug.
+=head1 Unicode and UTF-8
+
+Unicode support is new as of Perl version 5.6, and is more fully
+implemented in version 5.8. See L<perluniintro> and
+L<perlunicode> for more details.
+
+Usually locale settings and Unicode do not affect each other, but
+there are exceptions; see L<perlunicode/Locales> for examples.
+
=head1 BUGS
=head2 Broken systems
=head1 SEE ALSO
-L<I18N::Langinfo>, L<POSIX/isalnum>, L<POSIX/isalpha>,
+L<I18N::Langinfo>, L<perluniintro>, L<perlunicode>, L<open>,
+L<POSIX/isalnum>, L<POSIX/isalpha>,
L<POSIX/isdigit>, L<POSIX/isgraph>, L<POSIX/islower>,
L<POSIX/isprint>, L<POSIX/ispunct>, L<POSIX/isspace>,
L<POSIX/isupper>, L<POSIX/isxdigit>, L<POSIX/localeconv>,
the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
for more discussion of the issues.
+=head2 Locales
+
+Usually locale settings and Unicode do not affect each other, but
+there are a couple of exceptions:
+
+=over 4
+
+=item *
+
+If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG)
+contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching),
+the default encoding of your STDIN, STDOUT, and STDERR, and of
+B<any subsequent file open>, is UTF-8.
+
+=item *
+
+Perl tries really hard to work both with Unicode and the old
+byte-oriented world: most often this is nice, but sometimes this causes
+problems. See L</BUGS> for an example of how using locales with
+Unicode can sometimes be a good thing.
+
+=back
+
=head2 Using Unicode in XS
If you want to handle Perl Unicode in XS extensions, you may find
there is some attempt to apply 8-bit locale info to characters in the
range 0..255, but this is demonstrably incorrect for locales that use
characters above that range when mapped into Unicode. It will also
-tend to run slower. Avoidance of locales is strongly encouraged.
+tend to run slower. Avoidance of locales is strongly encouraged,
+with one known exception; see the next paragraph.
+
+If the keys of a hash are "mixed", that is, some keys are Unicode,
+while some keys are "byte", the keys may behave differently in regular
+expressions since the definition of character classes like C</\w/>
+is different for byte strings and character strings. This problem can
+sometimes be helped by using an appropriate locale (see L<perllocale>).
+Another way is to force all the strings to be character encoded by
+using utf8::upgrade() (see L<utf8>).
Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings. All functions that need to hop over
to this sample program ensures the output is completely UTF-8, and
of course, removes the warning.
+If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG)
+contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching),
+the default encoding of your STDIN, STDOUT, and STDERR, and of
+B<any subsequent file open>, is UTF-8.
+
=head2 Unicode and EBCDIC
Perl 5.8.0 also supports Unicode on EBCDIC platforms. There,