lib/utf8.pm

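package utf8;

$^U = 1;

sub import {
    # Turn on the utf8 hint bit in the compile-time hints ($^H).
    $^H |= 0x00000008;
    # Remember the optional import argument for the calling package.
    $enc{caller()} = $_[1] if $_[1];
}

sub unimport {
    # Clear the utf8 hint bit again.
    $^H &= ~0x00000008;
}

sub AUTOLOAD {
    # Load the bulky support code on first use, then retry the call.
    require "utf8_heavy.pl";
    goto &$AUTOLOAD;
}

1;
__END__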

=head1 NAME

utf8 - Perl pragma to turn on UTF-8 and Unicode support

=head1 SYNOPSIS

    use utf8;
    no utf8;

=head1 DESCRIPTION

The utf8 pragma tells Perl to use UTF-8 as its internal string
representation for the rest of the enclosing block.  (The "no utf8"
pragma tells Perl to switch back to ordinary byte-oriented processing
for the rest of the enclosing block.)  Under utf8, many operations that
formerly operated on bytes change to operating on characters.  For
ASCII data this makes no difference, because UTF-8 stores ASCII in
single bytes, but any character greater than C<chr(127)> is stored
in a sequence of two or more bytes, all of which have the high bit
set.  By and large, you need not worry about this, because the utf8
pragma hides it.  A character under utf8 is logically just a number
ranging from 0 to 2**32 or so.  Larger characters encode to longer
sequences of bytes, but again, this is hidden.
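
To make the byte-level picture concrete, here is a minimal sketch that
prints the UTF-8 bytes behind three characters.  It assumes a later
perl (5.8 or newer) in which C<utf8::encode()> is available; that
function is not part of the pragma described here and is used only to
expose the underlying bytes.  The variable names are illustrative.

    use strict;
    use warnings;

    # Assumption: utf8::encode() exists (perl 5.8 or later).
    for my $code (0x41, 0xE9, 0x263A) {
        my $bytes = chr($code);
        utf8::encode($bytes);       # characters -> UTF-8 bytes, in place
        printf "U+%04X: %s\n", $code,
            join " ", map { sprintf "%02x", $_ } unpack("C*", $bytes);
    }

ASCII C<A> stays a single byte (C<41>), while the other two characters
become the sequences C<c3 a9> and C<e2 98 ba>, in which every byte has
the high bit set.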

Use of the utf8 pragma has the following effects:

=over 4

=item *

Strings and patterns may contain characters that have an ordinal value
larger than 255.  Presuming you use a Unicode editor to edit your
program, these will typically occur directly within the literal strings
as UTF-8 characters, but you can also specify a particular character
with an extension of the C<\x> notation.  UTF-8 characters are
specified by putting the hexadecimal code within curlies after the
C<\x>.  For instance, a Unicode smiley face is C<\x{263A}>.  A
character in the Latin-1 range (128..255) should be written C<\x{ab}>
rather than C<\xab>, since the former will turn into a two-byte UTF-8
code, while the latter will continue to be interpreted as generating an
8-bit byte rather than a character.  In fact, if C<-w> is turned on, the
latter will produce a warning that you might be generating invalid UTF-8.

=item *

Identifiers within the Perl script may contain Unicode alphanumeric
characters, including ideographs.  (You are currently on your own when
it comes to using the canonical forms of characters--Perl doesn't (yet)
attempt to canonicalize variable names for you.)

=item *

Regular expressions match characters instead of bytes.  For instance,
"." matches a character instead of a byte.  (However, the C<\C> pattern
is provided to force a match of a single byte ("C<char>" in C, hence
C<\C>).)

=item *

Character classes in regular expressions match characters instead of
bytes, and match against the character properties specified in the
Unicode properties database.  So C<\w> can be used to match an ideograph,
for instance.

=item *

Named Unicode properties and block ranges may be used as character
classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
match property) constructs.  For instance, C<\p{Lu}> matches any
character with the Unicode uppercase property, while C<\p{M}> matches
any mark character.  Single-letter properties may omit the braces, so
the latter can also be written C<\pM>.  Many predefined character
classes are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.

=item *

The special pattern C<\X> matches any extended Unicode sequence
(a "combining character sequence" in Standardese), where the first
character is a base character and subsequent characters are mark
characters that apply to the base character.  It is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes.  It can also
be forced to translate between 8-bit codes and UTF-8 regardless of the
surrounding utf8 state.  For instance, if you know your input is in
Latin-1, you can say:

    use utf8;
    while (<>) {
        tr/\0-\xff//CU;         # latin1 char to utf8
        ...
    }

Similarly you could translate your output with

    tr/\0-\x{ff}//UC;           # utf8 to latin1 char

No, C<s///> doesn't take /U or /C (yet?).

=item *

Case translation operators use the Unicode case translation tables.
Note that C<uc()> translates to uppercase, while C<ucfirst> translates
to titlecase (for languages that make the distinction).  Naturally
the corresponding backslash sequences have the same semantics.

=item *

Most operators that deal with positions or lengths in the string will
automatically switch to using character positions, including C<chop()>,
C<substr()>, C<pos()>, C<index()>, C<rindex()>, C<sprintf()>,
C<write()>, and C<length()>.  Operators that specifically don't switch
include C<vec()>, C<pack()>, and C<unpack()>.  Operators that really
don't care include C<chomp()>, as well as any other operator that
treats a string as a bucket of bits, such as C<sort()>, and the
operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change,
since they're often used for byte-oriented formats.  (Again, think
"C<char>" in the C language.)  However, there is a new "C<U>" specifier
that will convert between UTF-8 characters and integers.  (It works
outside of the utf8 pragma too.)

=item *

The C<chr()> and C<ord()> functions work on characters.  This is like
C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and
C<unpack("C")>.  In fact, the latter are how you now emulate
byte-oriented C<chr()> and C<ord()> under utf8.

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back
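
Several of the effects listed above can be seen in one short script.
This is only a sketch: it assumes a perl in which these features
behave as described in the list, and the variable names C<$s> and
C<$e_acute> are purely illustrative.  The values in the comments
follow from the character semantics described above.

    use strict;
    use warnings;
    use utf8;

    my $s = "\x{263A}abc";              # smiley followed by ASCII

    print length($s), "\n";             # 4 characters, not 6 bytes
    print ord(substr($s, 0, 1)), "\n";  # 9786, that is, 0x263A
    print ord(scalar reverse $s), "\n"; # 99, the "c" now comes first

    # pack("U") and chr() agree on characters.
    print "same\n" if pack("U", 0x263A) eq chr(0x263A);

    # \p{Lu} matches an uppercase letter outside ASCII.
    print "upper\n" if "\x{C9}" =~ /^\p{Lu}$/;

    # \X consumes a base character plus its combining marks.
    my $e_acute = "e\x{301}";           # "e" + COMBINING ACUTE ACCENT
    print "one sequence\n" if $e_acute =~ /^\X$/;

None of the strings containing wide characters is printed directly,
so the script does not depend on how output is encoded.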

=head1 CAVEATS

As yet, there is no method for automatically coercing input and
output to some encoding other than UTF-8.  This is planned in the near
future, however.

In any event, you'll need to keep track of whether interfaces to other
modules expect UTF-8 data or something else.  The utf8 pragma does not
magically mark strings for you in order to remember their encoding, nor
will any automatic coercion happen (other than that eventually planned
for I/O).  If you want such automatic coercion, you can build yourself
a set of pretty object-oriented modules.  Expect it to run considerably
slower than this low-level support.

Use of locales with utf8 may lead to odd results.  Currently there is
some attempt to apply 8-bit locale info to characters in the range
0..255, but this is demonstrably incorrect for locales that use
characters above that range (when mapped into Unicode).  It will also
tend to run slower.  Avoidance of locales is strongly encouraged.

=cut