a base character and modifiers is called a I<combining character
sequence>.
-Whether to call these combining character sequences, as a whole,
-"characters" depends on your point of view. If you are a programmer, you
-probably would tend towards seeing each element in the sequences as one
-unit, one "character", but from the user viewpoint, the sequence as a
-whole is probably considered one "character", since that's probably what
-it looks like in the context of the user's language.
+Whether to call these combining character sequences, as a whole,
+"characters" depends on your point of view. If you are a programmer,
+you probably would tend towards seeing each element in the sequences
+as one unit, one "character", but from the user viewpoint, the
+sequence as a whole is probably considered one "character", since
+that's probably what it looks like in the context of the user's
+language.
With this "as a whole" view of characters, the number of characters is
-open-ended. But in the programmer's "one unit is one character" point of
-view, the concept of "characters" is more deterministic, and so we take
-that point of view in this document: one "character" is one Unicode
-code point, be it a base character or a combining character.
+open-ended. But in the programmer's "one unit is one character" point
+of view, the concept of "characters" is more deterministic, and so we
+take that point of view in this document: one "character" is one
+Unicode code point, be it a base character or a combining character.
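In code, this code-point view is what C<length()> reports; a small sketch:

```perl
# One user-perceived character, two Unicode code points:
# LATIN CAPITAL LETTER A followed by COMBINING ACUTE ACCENT.
my $seq = "\x{0041}\x{0301}";

print length($seq), "\n";        # 2: one per code point
printf "U+%04X\n", ord($seq);    # U+0041, the base character
```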
For some of the combinations there are I<precomposed> characters,
for example C<LATIN CAPITAL LETTER A WITH ACUTE> is defined as
Unicode defines several I<character encoding forms>, of which I<UTF-8>
is perhaps the most popular. UTF-8 is a variable length encoding that
encodes Unicode characters as 1 to 6 bytes (only 4 with the currently
-defined characters). Other encodings are UTF-16 and UTF-32 and their
+defined characters). Other encodings include UTF-16 and UTF-32 and their
big and little endian variants (UTF-8 is byteorder independent).
ISO/IEC 10646 defines the UCS-2 and UCS-4 encoding forms.
This model was found to be wrong, or at least clumsy: the Unicodeness
is now carried with the data, not attached to the operations. (There
is one remaining case where an explicit C<use utf8> is needed: if your
-Perl script is in UTF-8, you can use UTF-8 in your variable and
-subroutine names, and in your string and regular expression literals,
-by saying C<use utf8>. This is not the default because that would
-break existing scripts having legacy 8-bit data in them.)
+Perl script itself is encoded in UTF-8, you can use UTF-8 in your
+variable and subroutine names, and in your string and regular
+expression literals, by saying C<use utf8>. This is not the default
+because that would break existing scripts having legacy 8-bit data in
+them.)
=head2 Perl's Unicode Model
possible, but as soon as Unicodeness cannot be avoided, the data is
transparently upgraded to Unicode.
-The internal encoding of Unicode in Perl is UTF-8. The internal
-encoding is normally hidden, however, and one need not and should not
-worry about the internal encoding at all: it is all just characters.
+Internally, Perl currently uses either the native eight-bit character
+set of the platform (for example Latin-1) or UTF-8 to encode Unicode
+strings. Specifically, if all code points in the string are 0xFF or
+less, Perl uses the native eight-bit character set. Otherwise, it
+uses UTF-8.
-Perl 5.8.0 will also support Unicode on EBCDIC platforms. There the
+A user of Perl does not normally need to know or care how Perl happens
+to encode its internal strings, but it becomes relevant when outputting
+Unicode strings to a stream without a discipline (one with the "default
+default"). In such a case, the raw bytes used internally (the native
+character set or UTF-8, as appropriate for each string) will be used,
+and if warnings are turned on, a "Wide character" warning will be issued
+if those strings contain a character beyond 0x00FF.
+
+For example,
+
+ perl -w -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
+
+produces a fairly useless mixture of native bytes and UTF-8, as well
+as a warning.
+
+To output UTF-8 always, use the ":utf8" output discipline. Prepending
+
+ binmode(STDOUT, ":utf8");
+
+to this sample program ensures the output is completely UTF-8, and
+of course, removes the warning.
+
+Perl 5.8.0 also supports Unicode on EBCDIC platforms. There, the
support is somewhat harder to implement since additional conversions
-are needed at every step. Because of these difficulties the Unicode
-support won't be quite as full as in other, mainly ASCII-based,
-platforms (the Unicode support will be better than in the 5.6 series,
-which didn't work much at all for EBCDIC platform). On EBCDIC
-platforms the internal encoding form used is UTF-EBCDIC.
+are needed at every step. Because of these difficulties, the Unicode
+support isn't quite as full as in other, mainly ASCII-based, platforms
+(the Unicode support is better than in the 5.6 series, which didn't
+work much at all for EBCDIC platforms). On EBCDIC platforms, the
+internal Unicode encoding form is UTF-EBCDIC instead of UTF-8 (the
+difference being that where UTF-8 is "ASCII-safe", in that ASCII
+characters encode to UTF-8 as-is, UTF-EBCDIC is "EBCDIC-safe").
=head2 Creating Unicode
-To create Unicode literals, use the C<\x{...}> notation in
-doublequoted strings:
+To create Unicode characters in literals for code points above 0xFF,
+use the C<\x{...}> notation in doublequoted strings:
my $smiley = "\x{263a}";
-Similarly for regular expression literals
+Similarly in regular expression literals
$smiley =~ /\x{263a}/;
Naturally, C<ord()> will do the reverse: turn a character to a code point.
-Note that C<\x..>, C<\x{..}> and C<chr(...)> for arguments less than
-0x100 (decimal 256) will generate an eight-bit character for backward
-compatibility with older Perls. For arguments of 0x100 or more,
-Unicode will always be produced. If you want UTF-8 always, use
-C<pack("U", ...)> instead of C<\x..>, C<\x{..}>, or C<chr()>.
+Note that C<\x..> (no C<{}> and only two hexadecimal digits),
+C<\x{...}>, and C<chr(...)> for arguments less than 0x100 (decimal
+256) generate an eight-bit character for backward compatibility with
+older Perls. For arguments of 0x100 or more, Unicode characters are
+always produced. If you want to force the production of Unicode
+characters regardless of the numeric value, use C<pack("U", ...)>
+instead of C<\x..>, C<\x{...}>, or C<chr()>.
You can also use the C<charnames> pragma to invoke characters
by name in doublequoted strings:
my $georgian_an = pack("U", 0x10a0);
+Note that both C<\x{...}> and C<\N{...}> are compile-time string
+constants: you cannot use variables in them. If you want similar
+run-time functionality, use C<chr()> and C<charnames::vianame()>.
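For example, a run-time name lookup might look like this (a sketch):

```perl
use charnames ();   # load the module without importing \N{...} names

# Look up a code point by its Unicode name at run time,
# then turn it into a character with chr().
my $code = charnames::vianame("GEORGIAN LETTER AN");   # 0x10A0
my $an   = chr($code);                                 # same as "\x{10A0}"
```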
+
=head2 Handling Unicode
Handling Unicode is for the most part transparent: just use the
=head2 Unicode I/O
-Normally writing out Unicode data
+Normally, writing out Unicode data
- print FH chr(0x100), "\n";
+ print FH $some_string_with_unicode, "\n";
-will print out the raw UTF-8 bytes, but you will get a warning
-out of that if you use C<-w> or C<use warnings>. To avoid the
-warning open the stream explicitly in UTF-8:
+produces raw bytes that Perl happens to use to internally encode the
+Unicode string (which depends on the system, as well as what
+characters happen to be in the string at the time). If any of the
+characters are at code points 0x100 or above, you will get a warning
+if you use C<-w> or C<use warnings>. To ensure that the output is
+explicitly rendered in the encoding you desire (and to avoid the
+warning), open the stream with the desired encoding. Some examples:
- open FH, ">:utf8", "file";
+ open FH, ">:ucs2", "file";
+ open FH, ">:utf8", "file";
+ open FH, ">:Shift-JIS", "file";
and on already open streams use C<binmode()>:
+ binmode(STDOUT, ":ucs2");
binmode(STDOUT, ":utf8");
+ binmode(STDOUT, ":Shift-JIS");
-Reading in correctly formed UTF-8 data will not magically turn
-the data into Unicode in Perl's eyes.
+See the C<Encode> module's documentation for the many supported encodings.
-You can use either the C<':utf8'> I/O discipline when opening files
+Reading in a file that you know happens to be encoded in one of the
+Unicode encodings does not magically turn the data into Unicode in
+Perl's eyes. To do that, specify the appropriate discipline when
+opening files
open(my $fh,'<:utf8', 'anything');
- my $line_of_utf8 = <$fh>;
+ my $line_of_unicode = <$fh>;
+
+ open(my $fh,'<:Big5', 'anything');
+ my $line_of_unicode = <$fh>;
The I/O disciplines can also be specified more flexibly with
the C<open> pragma; see L<open>:
With the C<open> pragma you can use the C<:locale> discipline
- $ENV{LANG} = 'ru_RU.KOI8-R';
- # the :locale will probe the locale environment variables like LANG
+ $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R';
+ # the :locale will probe the locale environment variables like LC_ALL
use open OUT => ':locale'; # russki parusski
open(O, ">koi8");
print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
or you can also use the C<':encoding(...)'> discipline
open(my $epic,'<:encoding(iso-8859-7)','iliad.greek');
- my $line_of_iliad = <$epic>;
+ my $line_of_unicode = <$epic>;
-Both of these methods install a transparent filter on the I/O stream that
-will convert data from the specified encoding when it is read in from the
-stream. In the first example the F<anything> file is assumed to be UTF-8
-encoded Unicode, in the second example the F<iliad.greek> file is assumed
-to be ISO-8858-7 encoded Greek, but the lines read in will be in both
-cases Unicode.
+These methods install a transparent filter on the I/O stream that
+converts data from the specified encoding when it is read in from the
+stream. The result is always Unicode.
The L<open> pragma affects all the C<open()> calls after the pragma by
setting default disciplines. If you want to affect only certain
streams, use explicit disciplines directly in the C<open()> call.
You can switch encodings on an already opened stream by using
-C<binmode()>, see L<perlfunc/binmode>.
+C<binmode()>; see L<perlfunc/binmode>.
-The C<:locale> does not currently work with C<open()> and
-C<binmode()>, only with the C<open> pragma. The C<:utf8> and
-C<:encoding(...)> do work with all of C<open()>, C<binmode()>,
-and the C<open> pragma.
+The C<:locale> does not currently (as of Perl 5.8.0) work with
+C<open()> and C<binmode()>, only with the C<open> pragma. The
+C<:utf8> and C<:encoding(...)> methods do work with all of C<open()>,
+C<binmode()>, and the C<open> pragma.
-Similarly, you may use these I/O disciplines on input streams to
-automatically convert data from the specified encoding when it is
-written to the stream.
+Similarly, you may use these I/O disciplines on output streams to
+automatically convert Unicode to the specified encoding when it is
+written to the stream. For example, the following snippet copies the
+contents of the file "text.jis" (encoded as ISO-2022-JP, aka JIS) to
+the file "text.utf8", encoded as UTF-8:
- open(my $unicode, '<:utf8', 'japanese.uni');
- open(my $nihongo, '>:encoding(iso2022-jp)', 'japanese.jp');
- while (<$unicode>) { print $nihongo }
+ open(my $nihongo, '<:encoding(iso2022-jp)', 'text.jis');
+ open(my $unicode, '>:utf8', 'text.utf8');
+ while (<$nihongo>) { print $unicode }
The naming of encodings, both by C<open()> and by the C<open>
pragma, is as flexible as with the C<encoding> pragma:
C<koi8-r> and C<KOI8R> will both be understood.
Common encodings recognized by ISO, MIME, IANA, and various other
-standardisation organisations are recognised, for a more detailed
+standardisation organisations are recognised; for a more detailed
list see L<Encode>.
C<read()> reads characters and returns the number of characters.
C<seek()> and C<tell()> operate on byte counts, as do C<sysread()>
and C<sysseek()>.
-Notice that because of the default behaviour "input is not UTF-8"
+Notice that because of the default behaviour of not doing any
+conversion upon input if there is no default discipline,
it is easy to mistakenly write code that keeps on expanding a file
-by repeatedly encoding it in UTF-8:
+by repeatedly encoding the data:
# BAD CODE WARNING
open F, "file";
- local $/; # read in the whole file
+ local $/; ## read in the whole file of 8-bit characters
$t = <F>;
close F;
open F, ">:utf8", "file";
- print F $t;
+ print F $t; ## convert to UTF-8 on output
close F;
If you run this code twice, the contents of the F<file> will be
UTF-8 encoded twice. A C<use open ':utf8'> would have avoided the
bug, as would explicitly opening F<file> for input as UTF-8.
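A corrected version of the snippet, with the input side also opened as UTF-8, round-trips cleanly no matter how many times it is run (a sketch; the first three lines just create a sample F<file> to work on):

```perl
# Create a sample file containing a wide character.
open F, ">:utf8", "file" or die "create: $!";
print F "\x{100}\n";
close F;

# The corrected round trip: read *and* write as UTF-8.
open F, "<:utf8", "file" or die "open for reading: $!";
local $/;            # read in the whole file of characters
my $t = <F>;
close F;

open F, ">:utf8", "file" or die "open for writing: $!";
print F $t;          # the file's contents do not grow on re-runs
close F;
```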
+B<NOTE>: the C<:utf8> and C<:encoding> features work only if your
+Perl has been built with the new "perlio" feature. Almost all
+Perl 5.8 platforms do use "perlio", though: you can see whether
+yours does by running "perl -V" and looking for C<useperlio=define>.
+
+=head2 Displaying Unicode As Text
+
+Sometimes you might want to display Perl scalars containing Unicode as
+simple ASCII (or EBCDIC) text. The following subroutine converts
+its argument so that Unicode characters with code points greater than
+255 are displayed as "\x{...}", control characters (like "\n") are
+displayed as "\x..", and the rest of the characters as themselves:
+
+ sub nice_string {
+ join("",
+ map { $_ > 255 ? # if wide character...
+ sprintf("\\x{%04X}", $_) : # \x{...}
+ chr($_) =~ /[[:cntrl:]]/ ? # else if control character ...
+ sprintf("\\x%02X", $_) : # \x..
+ chr($_) # else as themselves
+ } unpack("U*", $_[0])); # unpack Unicode characters
+ }
+
+For example,
+
+ nice_string("foo\x{100}bar\n")
+
+returns:
+
+ "foo\x{0100}bar\x0A"
+
=head2 Special Cases
=over 4
Bit Complement Operator ~ And vec()
-The bit complement operator C<~> will produce surprising results if
-used on strings containing Unicode characters. The results are
-consistent with the internal UTF-8 encoding of the characters, but not
-with much else. So don't do that. Similarly for vec(): you will be
-operating on the UTF-8 bit patterns of the Unicode characters, not on
-the bytes, which is very probably not what you want.
+The bit complement operator C<~> may produce surprising results if used on
+strings containing characters with ordinal values above 255. In such a
+case, the results are consistent with the internal encoding of the
+characters, but not with much else. So don't do that. Similarly for vec():
+you will be operating on the internally encoded bit patterns of the Unicode
+characters, not on the code point values, which is very probably not what
+you want.
=item *
-Peeking At UTF-8
+Peeking At Perl's Internal Encoding
+
+Normal users of Perl should never care how Perl encodes any particular
+Unicode string (because the normal ways to get at the contents of a
+string with Unicode -- via input and output -- should always be via
+explicitly-defined I/O disciplines). But if you must, there are two
+ways of looking behind the scenes.
One way of peeking inside the internal encoding of Unicode characters
is to use C<unpack("C*", ...)> to get the bytes, or C<unpack("H*", ...)>
to display the bytes:
- # this will print c4 80 for the UTF-8 bytes 0xc4 0x80
+ # this prints c480 for the UTF-8 bytes 0xc4 0x80
print join(" ", unpack("H*", pack("U", 0x100))), "\n";
Yet another way would be to use the Devel::Peek module:
perl -MDevel::Peek -e 'Dump(chr(0x100))'
-That will show the UTF8 flag in FLAGS and both the UTF-8 bytes
+That shows the UTF8 flag in FLAGS and both the UTF-8 bytes
and Unicode characters in PV. See also later in this document
the discussion about the C<is_utf8> function of the C<Encode> module.
(Is C<LATIN CAPITAL LETTER A WITH ACUTE> equal to
C<LATIN CAPITAL LETTER A>?)
-The short answer is that by default Perl compares equivalence
-(C<eq>, C<ne>) based only on code points of the characters.
-In the above case, no (because 0x00C1 != 0x0041). But sometimes any
+The short answer is that by default Perl compares equivalence (C<eq>,
+C<ne>) based only on code points of the characters. In the above
+case, the answer is no (because 0x00C1 != 0x0041). But sometimes any
CAPITAL LETTER As being considered equal, or even any As of any case,
would be desirable.
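One way to get at "same abstract character" equivalence is to normalize both strings first, for example with the C<Unicode::Normalize> module bundled with Perl 5.8. This sketch handles precomposed versus decomposed forms, not case or accent folding:

```perl
use Unicode::Normalize;   # bundled with Perl 5.8

my $composed   = "\x{00C1}";           # LATIN CAPITAL LETTER A WITH ACUTE
my $decomposed = "\x{0041}\x{0301}";   # A + COMBINING ACUTE ACCENT

print "raw eq\n" if $composed eq $decomposed;             # not printed
print "NFD eq\n" if NFD($composed) eq NFD($decomposed);   # printed
```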
Mappings>, http://www.unicode.org/unicode/reports/tr15/
http://www.unicode.org/unicode/reports/tr21/
-As of Perl 5.8.0, the's regular expression case-ignoring matching
+As of Perl 5.8.0, regular expression case-ignoring matching
implements only 1:1 semantics: one character matches one character.
In I<Case Mappings> both 1:N and N:1 matches are defined.
(Does C<LATIN CAPITAL LETTER A WITH ACUTE> come before or after
C<LATIN CAPITAL LETTER A WITH GRAVE>?)
-The short answer is that by default Perl compares strings (C<lt>,
+The short answer is that by default, Perl compares strings (C<lt>,
C<le>, C<cmp>, C<ge>, C<gt>) based only on the code points of the
-characters. In the above case, after, since 0x00C1 > 0x00C0.
+characters. In the above case, the answer is "after", since 0x00C1 > 0x00C0.
The long answer is that "it depends", and a good answer cannot be
given without knowing (at the very least) the language context.
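For language-aware ordering, the C<Unicode::Collate> module (also bundled with Perl 5.8) implements the Unicode Collation Algorithm; a sketch:

```perl
use Unicode::Collate;

my $collator = Unicode::Collate->new();

# UCA ordering: accents are only a secondary difference, so
# "A" sorts before LATIN CAPITAL LETTER A WITH GRAVE.
my @sorted = $collator->sort("\x{00C1}", "\x{00C0}", "A");
print $collator->lt("A", "\x{00C0}") ? "before\n" : "after\n";
```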
Character ranges in regular expression character classes (C</[a-z]/>)
and in the C<tr///> (also known as C<y///>) operator are not magically
-Unicode-aware. What this means that C<[a-z]> will not magically start
+Unicode-aware. What this means is that C<[A-Za-z]> will not magically
start to mean "all alphabetic letters" (not that it does mean that even
for 8-bit characters; you should be using C</[[:alpha:]]/> for that).
-For specifying things like that in regular expressions you can use the
-various Unicode properties, C<\pL> in this particular case. You can
-use Unicode code points as the end points of character ranges, but
-that means that particular code point range, nothing more. For
-further information, see L<perlunicode>.
+For specifying things like that in regular expressions, you can use
+the various Unicode properties, C<\pL> or perhaps C<\p{Alphabetic}>,
+in this particular case. You can use Unicode code points as the end
+points of character ranges, but that means that particular code point
+range, nothing more. For further information, see L<perlunicode>.
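For example, the difference between a property match and a literal range might look like this (a sketch):

```perl
my $str = "x\x{3B1}7";   # "x", GREEK SMALL LETTER ALPHA, "7"

my @letters = $str =~ /(\pL)/g;     # "x" and the alpha: any letter
my @ascii   = $str =~ /([a-z])/g;   # only "x": just that range
```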
=item *
Unicode does define several other decimal (and numeric) characters
besides the familiar 0 to 9, such as the Arabic and Indic digits.
Perl does not support string-to-number conversion for digits other
-than the 0 to 9 (and a to f for hexadecimal).
+than ASCII 0 to 9 (and ASCII a to f for hexadecimal).
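So, for example, ARABIC-INDIC DIGIT ONE may look like a digit to Unicode, but not to Perl's string-to-number conversion (a sketch):

```perl
no warnings 'numeric';    # silence the "isn't numeric" warning

my $arabic_one = "\x{0661}";   # ARABIC-INDIC DIGIT ONE
my $ascii_one  = "1";

print 0 + $arabic_one, "\n";   # 0: not understood as a number
print 0 + $ascii_one,  "\n";   # 1
```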
=back
use bytes;
print length($unicode), "\n"; # will print 2 (the 0xC4 0x80 of the UTF-8)
-=item How Do I Detect Invalid UTF-8?
+=item How Do I Detect Data That's Not Valid In a Particular Encoding
-Either
+Use the C<Encode> package to try converting it.
+For example,
use Encode 'decode_utf8';
- if (encode_utf8($string)) {
+ if (decode_utf8($string_of_bytes_that_I_think_is_utf8)) {
# valid
} else {
# invalid
}
-or
+For UTF-8 only, you can use:
use warnings;
- @chars = unpack("U0U*", "\xFF"); # will warn
+ @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);
+
+If invalid, a C<Malformed UTF-8 character (byte 0x##) in
+unpack> warning is produced. The "U0" means "expect strictly UTF-8
+encoded Unicode". Without that, the C<unpack("U*", ...)>
+would also accept data like C<chr(0xFF)>.
-The warning will be C<Malformed UTF-8 character (byte 0xff) in
-unpack>. The "U0" means "expect strictly UTF-8 encoded Unicode".
-Without that the C<unpack("U*", ...)> would accept also data like
-C<chr(0xFF>).
+=item How Do I Convert Binary Data Into a Particular Encoding, Or Vice Versa?
-=item How Do I Convert Data Into UTF-8? Or Vice Versa?
+This probably isn't as useful as you might think.
+Normally, you shouldn't need to.
-This probably isn't as useful (or simple) as you might think.
-Also, normally you shouldn't need to.
+In one sense, what you are asking doesn't make much sense: encodings
+are for characters, and binary data is not "characters", so converting
+"data" into some encoding isn't meaningful unless you know in what
+character set and encoding the binary data is in, in which case it's
+not binary data, now is it?
-In one sense what you are asking doesn't make much sense: UTF-8 is
-(intended as an) Unicode encoding, so converting "data" into UTF-8
-isn't meaningful unless you know in what character set and encoding
-the binary data is in, and in this case you can use C<Encode>.
+If you have a raw sequence of bytes that you know should be interpreted via
+a particular encoding, you can use C<Encode>:
use Encode 'from_to';
from_to($data, "iso-8859-1", "utf-8"); # from latin-1 to utf-8
-If you have ASCII (really 7-bit US-ASCII), you already have valid
-UTF-8, the lowest 128 characters of UTF-8 encoded Unicode and US-ASCII
-are equivalent.
+The call to from_to() changes the bytes in $data, but nothing material
+about the nature of the string has changed as far as Perl is concerned.
+Both before and after the call, the string $data contains just a bunch
+of 8-bit bytes; the encoding of the string, as Perl sees it, remains
+"system-native 8-bit bytes".
+
+You might relate this to a fictional 'Translate' module:
-If you have Latin-1 (or want Latin-1), you can just use pack/unpack:
+ use Translate;
+ my $phrase = "Yes";
+ Translate::from_to($phrase, 'english', 'deutsch');
+ ## phrase now contains "Ja"
- $latin1 = pack("C*", unpack("U*", $utf8));
- $utf8 = pack("U*", unpack("C*", $latin1));
+The contents of the string change, but not the nature of the string.
+Perl knows no more after the call than before that the contents of
+the string indicate the affirmative.
-(The same works for EBCDIC.)
+Back to converting data, if you have (or want) data in your system's
+native 8-bit encoding (e.g. Latin-1, EBCDIC, etc.), you can use
+pack/unpack to convert to/from Unicode.
+
+ $native_string = pack("C*", unpack("U*", $Unicode_string));
+ $Unicode_string = pack("U*", unpack("C*", $native_string));
If you have a sequence of bytes you B<know> is valid UTF-8,
but Perl doesn't know it yet, you can make Perl a believer, too:
use Encode 'decode_utf8';
- $utf8 = decode_utf8($bytes);
+ $Unicode = decode_utf8($bytes);
You can convert well-formed UTF-8 to a sequence of bytes, but if
you just want to convert random binary data into UTF-8, you can't.
Any random collection of bytes isn't well-formed UTF-8. You can
use C<unpack("C*", $string)> for the former, and you can create
-well-formed Unicode/UTF-8 data by C<pack("U*", 0xff, ...)>.
+well-formed Unicode data by C<pack("U*", 0xff, ...)>.
=item How Do I Display Unicode? How Do I Input Unicode?
four bits, or half a byte. C<print 0x..., "\n"> will show a
hexadecimal number in decimal, and C<printf "%x\n", $decimal> will
show a decimal number in hexadecimal. If you have just the
-"hexdigits" of a hexadecimal number, you can use the C<hex()>
-function.
+"hexdigits" of a hexadecimal number, you can use the C<hex()> function.
print 0x0009, "\n"; # 9
print 0x000a, "\n"; # 10
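And the other directions, as a sketch:

```perl
printf "%x\n", 10;           # prints "a": decimal shown as hexadecimal
print hex("0a"), "\n";       # prints "10": hexdigits to a number
print hex("0x263a"), "\n";   # prints "9786"; a leading "0x" is allowed
```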
=back
+=head1 UNICODE IN OLDER PERLS
+
+If you cannot upgrade your Perl to 5.8.0 or later, you can still
+do some Unicode processing by using the modules C<Unicode::String>,
+C<Unicode::Map8>, and C<Unicode::Map>, available from CPAN.
+If you have the GNU recode installed, you can also use the
+Perl frontend C<Convert::Recode> for character conversions.
+
+The following are fast conversions between ISO 8859-1 (Latin-1) bytes
+and UTF-8 bytes; the code works even with older Perl 5 versions.
+
+ # ISO 8859-1 to UTF-8
+ s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;
+
+ # UTF-8 to ISO 8859-1
+ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
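As a quick check of the substitutions above, here is a round trip on a small Latin-1 string (a sketch; it assumes the string really contains only ISO 8859-1 bytes):

```perl
local $_ = "caf\xE9";   # "cafe" with E ACUTE, as a Latin-1 byte string

# ISO 8859-1 to UTF-8: the byte 0xE9 becomes the two bytes 0xC3 0xA9
s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

# ... and back again
s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;
print $_ eq "caf\xE9" ? "round trip ok\n" : "mismatch\n";
```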
+
=head1 SEE ALSO
L<perlunicode>, L<Encode>, L<encoding>, L<open>, L<utf8>, L<bytes>,