From: Jarkko Hietaniemi
Date: Mon, 17 Dec 2001 19:26:34 +0000 (+0000)
Subject: perluniintro tweaks as suggested by Jeffrey Friedl.
X-Git-Url: http://git.shadowcat.co.uk/gitweb/gitweb.cgi?a=commitdiff_plain;h=4192de816d0ff2d984cdb4ba868ae8de3b925f3d;p=p5sagit%2Fp5-mst-13.2.git

perluniintro tweaks as suggested by Jeffrey Friedl.

p4raw-id: //depot/perl@13743
---

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 67ce214..775609c 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -139,9 +139,33 @@ that Perl tries to keep its data as eight-bit bytes for as long as
 possible, but as soon as Unicodeness cannot be avoided, the data is
 transparently upgraded to Unicode.
 
-The internal encoding of Unicode in Perl is UTF-8. The internal
-encoding is normally hidden, however, and one need not and should not
-worry about the internal encoding at all: it is all just characters.
+Internally, Perl currently uses either whatever the native eight-bit
+character set of the platform (for example Latin-1) or UTF-8 to encode
+Unicode strings. Specifically, if all code points in the string are
+0xFF or less, Perl uses Latin-1. Otherwise, it uses UTF-8.
+
+A user of Perl does not normally need to know nor care how Perl happens
+to encodes its internal strings, but it becomes relevant when outputting
+Unicode strings to a stream without a discipline (one with the "default
+default"). In such a case, the raw bytes used internally (the native
+character set or UTF-8, as appropriate for each string) will be used,
+and if warnings are turned on, a "Wide character" warning will be issued
+if those strings contain a character beyond 0x00FF.
+
+For example,
+
+    perl -w -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
+
+produces a fairly useless mixture of native bytes and UTF-8, as well
+as a warning.
+
+To output UTF-8 always, use the ":utf8" output discipline. Prepending
+
+    binmode(STDOUT, ":utf8");
+
+to this sample program ensures the output is completely UTF-8, and
+of course, removes the warning. Another way to achieve this is the
+L pragma, discussed later in L.
 
 Perl 5.8.0 will also support Unicode on EBCDIC platforms. There the
 support is somewhat harder to implement since additional conversions
@@ -149,12 +173,14 @@ are needed at every step. Because of these difficulties the Unicode
 support won't be quite as full as in other, mainly ASCII-based,
 platforms (the Unicode support will be better than in the 5.6 series,
 which didn't work much at all for EBCDIC platform). On EBCDIC
-platforms the internal encoding form used is UTF-EBCDIC.
+platforms the internal encoding form used is UTF-EBCDIC instead
+of UTF-8 (the difference is that as UTF-8 is "ASCII-safe" in that
+ASCII characters encode to UTF-8 as-is, UTF-EBCDIC is "EBCDIC-safe").
 
 =head2 Creating Unicode
 
-To create Unicode literals, use the C<\x{...}> notation in
-doublequoted strings:
+To create Unicode literals for code points above 0xFF, use the
+C<\x{...}> notation in doublequoted strings:
 
     my $smiley = "\x{263a}";
 