=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveat

WARNING: The implementation of Unicode support in Perl is incomplete.

The following areas need further work.

=over

=item Input and Output Disciplines

There is currently no easy way to mark data read from a file or other
external source as being utf8. This will be one of the major areas of
focus in the near future.

=item Regular Expressions

The existing regular expression compiler does not produce polymorphic
opcodes. This means that the determination of whether to match Unicode
characters is made when the pattern is compiled, based on whether the
pattern contains Unicode characters, and not when the matching happens
at run time. This needs to be changed to adaptively match Unicode if
the string to be matched is Unicode.

=item C<use utf8> still needed to enable a few features

The C<utf8> pragma implements the tables used for Unicode support. These
tables are automatically loaded on demand, so the C<utf8> pragma need not
normally be used.

However, as a compatibility measure, this pragma must be explicitly used
to enable recognition of UTF-8 encoded literals and identifiers in the
source text.

=back

=head2 Byte and Character semantics

Beginning with version 5.6, Perl uses logically wide characters to
represent strings internally. This internal representation of strings
uses the UTF-8 encoding.

In future, Perl-level operations can be expected to work with characters
rather than bytes, in general.

However, as strictly an interim compatibility measure, Perl v5.6 aims to
provide a safe migration path from byte semantics to character semantics
for programs. For operations where Perl can unambiguously decide that the
input data is characters, Perl now switches to character semantics.
For operations where this determination cannot be made without additional
information from the user, Perl decides in favor of compatibility, and
chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations, but only as long as
none of the program's inputs are marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as
C<%ENV>), or from literals and constants in the source text.

If the C<-C> command line switch is used (or the C<${^WIDE_SYSTEM_CALLS}>
global flag is set to C<1>), all system calls will use the
corresponding wide character APIs. This is currently only implemented
on Windows.

Regardless of the above, the C<bytes> pragma can always be used to force
byte semantics in a particular lexical scope. See L<bytes>.
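As a minimal sketch of that scoping (the variable names are purely
illustrative), C<length()> counts characters by default and counts bytes
only inside the block where C<use bytes> is in effect:

```perl
use strict;
use warnings;

# One character above 255: U+263A (WHITE SMILING FACE).
my $smiley = "\x{263A}";

# Character semantics: the string holds a single character.
my $chars = length($smiley);

# Byte semantics, limited to this lexical scope by the bytes pragma.
my $bytes;
{
    use bytes;
    $bytes = length($smiley);   # counts the UTF-8 bytes of the internal form
}

print "characters: $chars, bytes: $bytes\n";   # characters: 1, bytes: 3
```

Outside the braces C<length()> reverts to character semantics, because
the pragma is lexically scoped.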

One effect of the C<utf8> pragma is that the internal UTF-8 decoding
becomes stricter so that the character 0xFFFF (UTF-8 bytes 0xEF 0xBF
0xBF), and the bytes 0xFE and 0xFF, start to cause warnings if they
appear in the data.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-8 in literals encountered by the parser. It may also
be used for enabling some of the more experimental Unicode support features.
Note that this pragma is only required until a future version of Perl
in which character semantics will become the default. This pragma may
then become a no-op. See L<utf8>.
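The effect on literals can be sketched as follows. This example assumes
the source file itself is saved as UTF-8, and the variable names are
illustrative only. The same literal is read as one character where
C<use utf8> is in effect and as three separate bytes where it is not:

```perl
use strict;
use warnings;

my ($with, $without);
{
    use utf8;                 # the literal below is decoded as UTF-8
    $with = length("☺");      # one character, U+263A
}
{
    no utf8;                  # the same literal is seen as raw bytes
    $without = length("☺");   # three bytes: 0xE2 0x98 0xBA
}

print "with utf8: $with, without utf8: $without\n";
```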

Unless mentioned otherwise, Perl operators will use character semantics
when they are dealing with Unicode data, and byte semantics otherwise.
Thus, character semantics for these operations apply transparently; if
the input data came from a Unicode source (for example, by adding a
character encoding discipline to the filehandle whence it came, or a
literal UTF-8 string constant in the program), character semantics
apply; otherwise, byte semantics are in effect. To force byte semantics
on Unicode data, the C<bytes> pragma should be used.

Under character semantics, many operations that formerly operated on
bytes change to operating on characters. For ASCII data this makes
no difference, because UTF-8 stores ASCII in single bytes, but for
any character greater than C<chr(127)>, the character may be stored in
a sequence of two or more bytes, all of which have the high bit set.
But by and large, the user need not worry about this, because Perl
hides it from the user. A character in Perl is logically just a number
ranging from 0 to 2**32 or so. Larger characters encode to longer
sequences of bytes internally, but again, this is just an internal
detail which is hidden at the Perl level.
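To make the byte sequences concrete, here is a small sketch that prints
the UTF-8 encoding of a few code points. C<utf8_bytes()> is a
hypothetical helper written for this illustration, and it relies on
C<utf8::encode()>, which is an assumption here because that function
only appeared in Perl releases later than the one this document
describes:

```perl
use strict;
use warnings;

# Return the UTF-8 bytes of a character as a list of integers.
# utf8_bytes() is a hypothetical helper, not a core function.
sub utf8_bytes {
    my ($code_point) = @_;
    my $copy = chr($code_point);
    utf8::encode($copy);          # $copy now holds the encoded bytes
    return unpack("C*", $copy);
}

for my $cp (0x41, 0xE9, 0x263A, 0x1F600) {
    my @bytes = utf8_bytes($cp);
    printf "U+%04X -> %d byte(s): %s\n",
        $cp, scalar(@bytes), join(" ", map { sprintf "%02X", $_ } @bytes);
}
```

The ASCII letter occupies a single byte, while every byte of the longer
sequences is 0x80 or above, that is, has its high bit set.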

=head2 Effects of character semantics

Character semantics have the following effects:

=over 4

=item *

Strings and patterns may contain characters that have an ordinal value
larger than 255.

Presuming you use a Unicode editor to edit your program, such characters
will typically occur directly within the literal strings as UTF-8
characters, but you can also specify a particular character with an
extension of the C<\x> notation. UTF-8 characters are specified by
putting the hexadecimal code within curlies after the C<\x>. For instance,
a Unicode smiley face is C<\x{263A}>.

=item *

Identifiers within the Perl script may contain Unicode alphanumeric
characters, including ideographs. (You are currently on your own when
it comes to using the canonical forms of characters--Perl doesn't (yet)
attempt to canonicalize variable names for you.)

=item *

Regular expressions match characters instead of bytes. For instance,
"." matches a character instead of a byte. (However, the C<\C> pattern
is provided to force a match of a single byte ("C<char>" in C, hence
C<\C>).)

=item *

Character classes in regular expressions match characters instead of
bytes, and match against the character properties specified in the
Unicode properties database. So C<\w> can be used to match an ideograph,
for instance.

=item *

Named Unicode properties and block ranges may be used as character
classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
match property) constructs. For instance, C<\p{Lu}> matches any
character with the Unicode uppercase property, while C<\p{M}> matches
any mark character. Single-letter properties may omit the braces, so
that can be written C<\pM> also. Many predefined character classes are
available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.

=item *

The special pattern C<\X> matches any extended Unicode sequence
(a "combining character sequence" in Standardese), where the first
character is a base character and subsequent characters are mark
characters that apply to the base character. It is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed, as the interface
was a mistake. For similar functionality see C<pack('U0', ...)> and
C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when provided character input. Note that C<uc()> translates to
uppercase, while C<ucfirst()> translates to titlecase (for languages
that make the distinction). Naturally the corresponding backslash
sequences have the same semantics.

=item *

Most operators that deal with positions or lengths in the string will
automatically switch to using character positions, including C<chop()>,
C<substr()>, C<pos()>, C<index()>, C<rindex()>, C<sprintf()>,
C<write()>, and C<length()>. Operators that specifically don't switch
include C<vec()>, C<pack()>, and C<unpack()>. Operators that really
don't care include C<chomp()>, as well as any other operator that
treats a string as a bucket of bits, such as C<sort()>, and the
operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change,
since they're often used for byte-oriented formats. (Again, think
"C<char>" in the C language.) However, there is a new "C<U>" specifier
that will convert between UTF-8 characters and integers. (It works
outside of the utf8 pragma too.)

=item *

The C<chr()> and C<ord()> functions work on characters. This is like
C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and
C<unpack("C")>. In fact, the latter are how you now emulate
byte-oriented C<chr()> and C<ord()> under utf8.

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back
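Several of the effects listed above can be seen together in a short
sketch. The sample strings and variable names are chosen only for
illustration, and the code uses nothing beyond the operators named in
the list:

```perl
use strict;
use warnings;

# "e" followed by U+0301 (combining acute accent), then U+263A.
my $str = "e\x{301}\x{263A}";

# Positions and lengths count characters, not bytes.
my $len  = length($str);                   # 3
my $last = substr($str, -1);               # the smiley alone

# chr() and ord() round-trip a code point above 255.
my $code = ord(chr(0x263A));               # 0x263A

# "." consumes one character; \X consumes a base character plus its marks.
my ($dot)   = $str =~ /^(.)/;
my ($combo) = $str =~ /^(\X)/;

# \p{Lu} matches uppercase letters beyond ASCII, such as U+0394.
my $is_upper = "\x{394}" =~ /^\p{Lu}$/ ? 1 : 0;

# uc() gives uppercase, ucfirst() gives titlecase: for U+01C6 these
# differ (U+01C4 versus U+01C5).
my $upper = uc("\x{1C6}");
my $title = ucfirst("\x{1C6}");

# Scalar reverse() reverses by character.
my $rev = scalar reverse("a\x{263A}b");

printf "length=%d last=U+%04X code=U+%04X\n", $len, ord($last), $code;
printf "dot=%d char(s), combo=%d char(s), is_upper=%d\n",
    length($dot), length($combo), $is_upper;
printf "uc=U+%04X ucfirst=U+%04X reversed middle=U+%04X\n",
    ord($upper), ord($title), ord(substr($rev, 1, 1));
```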

=head2 Character encodings for input and output

[XXX: This feature is not yet implemented.]

=head1 CAVEATS

As yet, there is no method for automatically coercing input and
output to some encoding other than UTF-8. This is planned in the near
future, however.

Whether an arbitrary piece of data will be treated as "characters" or
"bytes" by internal operations cannot be divined at the current time.

Use of locales with utf8 may lead to odd results. Currently there is
some attempt to apply 8-bit locale info to characters in the range
0..255, but this is demonstrably incorrect for locales that use
characters above that range (when mapped into Unicode). It will also
tend to run slower. Avoidance of locales is strongly encouraged.

=head1 SEE ALSO

L<bytes>, L<utf8>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">

=cut