=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveat

WARNING: The implementation of Unicode support in Perl is incomplete.

The following areas need further work.

=over 4

=item Input and Output Disciplines

There is currently no easy way to mark data read from a file or other
external source as being utf8.  This will be one of the major areas of
focus in the near future.  Unfortunately, it is unlikely that Perl 5.6
and earlier will ever gain this capability.

=item Regular Expressions

The existing regular expression compiler does not produce polymorphic
opcodes.  This means that the decision whether to match Unicode
characters is made when the pattern is compiled, based on whether the
pattern contains Unicode characters, and not when the matching happens
at run time.  This needs to be changed so that matching adapts to
Unicode whenever the string to be matched is Unicode.

=item C<use utf8> still needed to enable a few features

The C<utf8> pragma implements the tables used for Unicode support.  These
tables are automatically loaded on demand, so the C<utf8> pragma need not
normally be used.

However, as a compatibility measure, this pragma must be explicitly used
to enable recognition of UTF-8 encoded literals and identifiers in the
source text.

=back
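
As a rough illustration of the last point, the sketch below compares the
same literal parsed with and without the C<utf8> pragma.  It assumes the
source file itself is saved as UTF-8; the variable names are invented for
the example.

```perl
use strict;
use warnings;

# Without the pragma the parser sees the three UTF-8 bytes of the literal.
my $as_bytes = do { no utf8; length("☺") };

# With the pragma the same literal is read as a single character.
my $as_chars = do { use utf8; length("☺") };

print "without utf8: $as_bytes, with utf8: $as_chars\n";
```

The pragma is lexically scoped, which is why each C<do> block can pick its
own setting.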

=head2 Byte and Character semantics

Beginning with version 5.6, Perl uses logically wide characters to
represent strings internally.  This internal representation of strings
uses the UTF-8 encoding.

In future, Perl-level operations can be expected to work with characters
rather than bytes, in general.

However, as strictly an interim compatibility measure, Perl v5.6 aims to
provide a safe migration path from byte semantics to character semantics
for programs.  For operations where Perl can unambiguously decide that the
input data is characters, Perl now switches to character semantics.
For operations where this determination cannot be made without additional
information from the user, Perl decides in favor of compatibility, and
chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations, but only as long as
none of the program's inputs are marked as being a source of Unicode
character data.  Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

If the C<-C> command line switch is used (or the C<${^WIDE_SYSTEM_CALLS}>
global flag is set to C<1>), all system calls will use the
corresponding wide character APIs.  This is currently implemented only
on Windows, as other platforms do not have a unified way of handling
wide character APIs.

Regardless of the above, the C<bytes> pragma can always be used to force
byte semantics in a particular lexical scope.  See L<bytes>.
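
A minimal sketch of that lexical scoping, assuming a string that holds one
character above 255 (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";

# Outside the pragma's scope, length() counts characters.
my $characters = length $smiley;

my $bytes;
{
    use bytes;                  # byte semantics inside this block only
    $bytes = length $smiley;
}

# Character semantics are back once the block has ended.
my $characters_again = length $smiley;

print "characters: $characters, bytes: $bytes\n";
```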

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-8 in literals encountered by the parser.  It may also
be used for enabling some of the more experimental Unicode support features.
Note that this pragma is only required until a future version of Perl
in which character semantics will become the default.  This pragma may
then become a no-op.  See L<utf8>.

Unless mentioned otherwise, Perl operators will use character semantics
when they are dealing with Unicode data, and byte semantics otherwise.
Thus, character semantics for these operations apply transparently; if
the input data came from a Unicode source (for example, by adding a
character encoding discipline to the filehandle whence it came, or a
literal UTF-8 string constant in the program), character semantics
apply; otherwise, byte semantics are in effect.  To force byte semantics
on Unicode data, the C<bytes> pragma should be used.

Under character semantics, many operations that formerly operated on
bytes change to operating on characters.  For ASCII data this makes
no difference, because UTF-8 stores ASCII in single bytes, but for
any character greater than C<chr(127)>, the character may be stored in
a sequence of two or more bytes, all of which have the high bit set.
But by and large, the user need not worry about this, because Perl
hides it from the user.  A character in Perl is logically just a number
ranging from 0 to 2**32 or so.  Larger characters encode to longer
sequences of bytes internally, but again, this is just an internal
detail which is hidden at the Perl level.
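
Should you be curious about that hidden detail, the following sketch
peeks at the stored bytes through the C<bytes> pragma.  The helper
C<stored_bytes> is a hypothetical name made up for this example and is
not part of Perl.

```perl
use strict;
use warnings;

# Return the ordinal of each byte used to store a string internally.
sub stored_bytes {
    my ($string) = @_;
    use bytes;    # length(), substr() and ord() now work on bytes
    return map { ord(substr($string, $_, 1)) } 0 .. length($string) - 1;
}

my @ascii  = stored_bytes("A");          # a single byte, high bit clear
my @smiley = stored_bytes("\x{263A}");   # three bytes, high bit set on each

# Count the bytes of the smiley that have the high bit set.
my $high_bit_count = grep { $_ & 0x80 } @smiley;

printf "A: %s\n",      join " ", map { sprintf "%02X", $_ } @ascii;
printf "smiley: %s\n", join " ", map { sprintf "%02X", $_ } @smiley;
```

Ordinary code should not need this; it only makes visible what Perl
otherwise hides.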

=head2 Effects of character semantics

Character semantics have the following effects:

=over 4

=item *

Strings and patterns may contain characters that have an ordinal value
larger than 255.

Presuming you use a Unicode editor to edit your program, such characters
will typically occur directly within the literal strings as UTF-8
characters, but you can also specify a particular character with an
extension of the C<\x> notation.  UTF-8 characters are specified by
putting the hexadecimal code within curlies after the C<\x>.  For instance,
a Unicode smiley face is C<\x{263A}>.

=item *

Identifiers within the Perl script may contain Unicode alphanumeric
characters, including ideographs.  (You are currently on your own when
it comes to using the canonical forms of characters--Perl doesn't (yet)
attempt to canonicalize variable names for you.)

=item *

Regular expressions match characters instead of bytes.  For instance,
"." matches a character instead of a byte.  (However, the C<\C> pattern
is available to force a match of a single byte ("C<char>" in C, hence C<\C>).)

=item *

Character classes in regular expressions match characters instead of
bytes, and match against the character properties specified in the
Unicode properties database.  So C<\w> can be used to match an ideograph,
for instance.

=item *

Named Unicode properties and block ranges may be used as character
classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
match property) constructs.  For instance, C<\p{Lu}> matches any
character with the Unicode uppercase property, while C<\p{M}> matches
any mark character.  Single letter properties may omit the braces, so
C<\p{M}> can also be written C<\pM>.  Many predefined character classes
are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.

=item *

The special pattern C<\X> matches any extended Unicode sequence
(a "combining character sequence" in Standardese), where the first
character is a base character and subsequent characters are mark
characters that apply to the base character.  It is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes.  Note
that the C<tr///CU> functionality has been removed, as the interface
was a mistake.  For similar functionality see C<pack('U0', ...)> and
C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when provided character input.  Note that C<uc()> translates to
uppercase, while C<ucfirst()> translates to titlecase (for languages
that make the distinction).  Naturally the corresponding backslash
sequences have the same semantics.

=item *

Most operators that deal with positions or lengths in the string will
automatically switch to using character positions, including C<chop()>,
C<substr()>, C<pos()>, C<index()>, C<rindex()>, C<sprintf()>,
C<write()>, and C<length()>.  Operators that specifically don't switch
include C<vec()>, C<pack()>, and C<unpack()>.  Operators that really
don't care include C<chomp()>, as well as any other operator that
treats a string as a bucket of bits, such as C<sort()>, and the
operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change,
since they're often used for byte-oriented formats.  (Again, think
"C<char>" in the C language.)  However, there is a new "C<U>" specifier
that will convert between UTF-8 characters and integers.  (It works
outside of the utf8 pragma too.)

=item *

The C<chr()> and C<ord()> functions work on characters.  This is like
C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and
C<unpack("C")>.  In fact, the latter are how you now emulate
byte-oriented C<chr()> and C<ord()> under utf8.

=item *

The bit string operators C<& | ^ ~> can operate on character data.
However, for backward compatibility reasons (bit string operations
when the characters are all less than 256 in ordinal value), one cannot
mix C<~> (the bit complement) with characters both less than 256 and
equal to or greater than 256.  Most importantly, De Morgan's laws
(C<~($x|$y) eq ~$x&~$y>, C<~($x&$y) eq ~$x|~$y>) won't hold.
Another way to look at this is that the complement cannot return
B<both> the 8-bit (byte) wide bit complement and the full character
wide bit complement.

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back
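
Several of the effects listed above can be tried out with a short script.
This is only a sketch: the variable names are invented, the sample
characters are arbitrary, and it assumes a Perl in which these features
behave as described.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";          # one character with an ordinal above 255
my $text   = "a" . $smiley . "b";

# Lengths and positions are counted in characters.
my $length   = length $text;
my $middle   = substr($text, 1, 1);
my $b_at     = index($text, "b");
my $reversed = scalar reverse $text;

# chr()/ord() agree with pack("U")/unpack("U").
my $ordinal       = ord $smiley;
my $from_chr      = chr 0x263A;
my $from_pack     = pack "U", 0x263A;
my ($from_unpack) = unpack "U", $smiley;

# uc() gives uppercase and ucfirst() titlecase; U+01C6 tells them apart.
my $upper = uc "\x{01C6}";
my $title = ucfirst "\x{01C6}";

# tr/// translates whole characters.
(my $swapped = $text) =~ tr/\x{263A}/\x{263B}/;

# Patterns match characters and consult the Unicode properties.
my $dot_matches  = $smiley    =~ /^.$/       ? 1 : 0;
my $w_ideograph  = "\x{4E2D}" =~ /^\w$/      ? 1 : 0;
my $is_uppercase = "\x{0394}" =~ /^\p{Lu}$/  ? 1 : 0;
my $is_mark      = "\x{0301}" =~ /^\pM$/     ? 1 : 0;

# "e" plus a combining acute is one sequence, "x" is another.
my @sequences = "e\x{0301}x" =~ /\X/g;
my @by_hand   = "e\x{0301}x" =~ /\PM\pM*/g;

printf "length=%d index=%d ord=0x%X sequences=%d\n",
    $length, $b_at, $ordinal, scalar @sequences;
```

Only numbers are printed, so the script does not depend on how the output
handle treats characters above 255.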

=head2 Character encodings for input and output

This feature is still being implemented.

(For Perl 5.6 and earlier, this support is unlikely to be integrated
into the core language, and some external module will be required.)

=head1 CAVEATS

As yet, there is no method for automatically coercing input and
output to some encoding other than UTF-8.  This is planned for the near
future, however.

Whether an arbitrary piece of data will be treated as "characters" or
"bytes" by internal operations cannot be divined at the current time.

Use of locales with utf8 may lead to odd results.  Currently there is
some attempt to apply 8-bit locale info to characters in the range
0..255, but this is demonstrably incorrect for locales that use
characters above that range (when mapped into Unicode).  Code that
combines locales with utf8 will also tend to run slower.  Avoidance of
locales is strongly encouraged.

=head1 SEE ALSO

L<bytes>, L<utf8>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">

=cut