
=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

WARNING: While the implementation of Unicode support in Perl is now fairly
complete, it is still evolving to some extent.

In particular, the way Unicode is handled on EBCDIC platforms is still
rather experimental.  On such a platform, references to UTF-8 encoding
in this document and elsewhere should be read as meaning UTF-EBCDIC as
specified in Unicode Technical Report 16, unless ASCII vs EBCDIC issues
are specifically discussed.  There is no C<utfebcdic> pragma or
":utfebcdic" layer; rather, "utf8" and ":utf8" are re-used to mean
the platform's "natural" 8-bit encoding of Unicode.  See L<perlebcdic> for
more discussion of the issues.

The following areas are still under development.

=over 4

=item Input and Output Disciplines

A filehandle can be marked as containing perl's internal Unicode
encoding (UTF-8 or UTF-EBCDIC) by opening it with the ":utf8" layer.
Other encodings can be converted to perl's encoding on input, or from
perl's encoding on output by use of the ":encoding()" layer.  There is
not yet a clean way to mark the Perl source itself as being in a
particular encoding.

=item Regular Expressions

The regular expression compiler now attempts to produce
polymorphic opcodes.  That is, the pattern should now adapt to the data
and automatically switch to the Unicode character scheme when
presented with Unicode data, or to a traditional byte scheme when
presented with byte data.  The implementation is still new and
(particularly on EBCDIC platforms) may need further work.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

The C<utf8> pragma implements the tables used for Unicode support.
These tables are automatically loaded on demand, so the C<utf8> pragma
need not normally be used.

However, as a compatibility measure, this pragma must be explicitly
used to enable recognition of UTF-8 in the Perl scripts themselves on
ASCII based machines, or of UTF-EBCDIC on EBCDIC based machines.
B<This should be the only place where an explicit C<use utf8> is needed>.

=back
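
As a rough sketch of the layers described above (assuming an ASCII
platform and a perl built with PerlIO; the in-memory filehandles and the
variable names are used purely for illustration), the following writes
one character through ":utf8" and another through ":encoding()":

    use strict;
    use warnings;

    # Write U+263A through the ":utf8" layer into an in-memory file.
    my $utf8_bytes = '';
    open(my $out, '>:utf8', \$utf8_bytes) or die "open: $!";
    print {$out} "\x{263A}";
    close($out) or die "close: $!";
    # $utf8_bytes now holds the three bytes E2 98 BA.

    # Read it back through the same layer to get one character again.
    open(my $in, '<:utf8', \$utf8_bytes) or die "open: $!";
    my $char = <$in>;
    close($in);

    # Convert to a different encoding on output with ":encoding()".
    my $latin1_bytes = '';
    open(my $lat, '>:encoding(iso-8859-1)', \$latin1_bytes)
        or die "open: $!";
    print {$lat} chr(0xE9);      # LATIN SMALL LETTER E WITH ACUTE
    close($lat) or die "close: $!";
    # $latin1_bytes now holds the single byte E9.

The same layer names can be given when opening ordinary files; the
in-memory handles merely keep the sketch self-contained.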

=head2 Byte and Character semantics

Beginning with version 5.6, Perl uses logically wide characters to
represent strings internally.  This internal representation of strings
uses either the UTF-8 or the UTF-EBCDIC encoding.

In future, Perl-level operations can be expected to work with
characters rather than bytes, in general.

However, as strictly an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs.  For operations where Perl can unambiguously
decide that the input data is characters, Perl now switches to
character semantics.  For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility, and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations, but only as long as
none of the program's inputs are marked as being a source of Unicode
character data.  Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

If the C<-C> command line switch is used (or the
${^WIDE_SYSTEM_CALLS} global flag is set to C<1>), all system calls
will use the corresponding wide character APIs.  Note that this is
currently only implemented on Windows, since other platforms lack an
API standard in this area.

Regardless of the above, the C<bytes> pragma can always be used to
force byte semantics in a particular lexical scope.  See L<bytes>.
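
A minimal sketch of that lexical scoping, assuming an ASCII platform
where the internal encoding is UTF-8 (the variable names are
illustrative only):

    use strict;
    use warnings;

    my $smiley = "\x{263A}";          # one character, U+263A
    my $chars  = length($smiley);     # character semantics: 1

    my $bytes;
    {
        use bytes;                    # byte semantics in this block only
        $bytes = length($smiley);     # length of the internal encoding: 3
    }

    my $after = length($smiley);      # character semantics again: 1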

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
It may also be used for enabling some of the more experimental Unicode
support features.  Note that this pragma is only required until a
future version of Perl in which character semantics will become the
default.  This pragma may then become a no-op.  See L<utf8>.

Unless mentioned otherwise, Perl operators will use character semantics
when they are dealing with Unicode data, and byte semantics otherwise.
Thus, character semantics for these operations apply transparently; if
the input data came from a Unicode source (for example, by adding a
character encoding discipline to the filehandle whence it came, or a
literal UTF-8 string constant in the program), character semantics
apply; otherwise, byte semantics are in effect.  To force byte semantics
on Unicode data, the C<bytes> pragma should be used.

Under character semantics, many operations that formerly operated on
bytes change to operating on characters.  For ASCII data this makes no
difference, because UTF-8 stores ASCII in single bytes, but for any
character greater than C<chr(127)>, the character B<may> be stored in
a sequence of two or more bytes, all of which have the high bit set.

For C1 controls or Latin 1 characters on an EBCDIC platform the
character may be stored in a UTF-EBCDIC multi byte sequence.  But by
and large, the user need not worry about this, because Perl hides it
from the user.  A character in Perl is logically just a number ranging
from 0 to 2**32 or so.  Larger characters encode to longer sequences
of bytes internally, but again, this is just an internal detail which
is hidden at the Perl level.
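
The "B<may>" above can be observed, if one insists on peeking, with a
sketch like the following.  It assumes an ASCII platform and a perl
recent enough to provide C<utf8::upgrade> and C<bytes::length>; neither
is needed in ordinary programs, and the variable names are illustrative.

    use strict;
    use warnings;
    use bytes ();                     # load bytes::length, pragma not enabled

    my $e_acute = chr(0xE9);          # a character above chr(127)
    my $before  = bytes::length($e_acute);   # stored as one byte

    utf8::upgrade($e_acute);          # switch the internal storage to UTF-8
    my $after   = bytes::length($e_acute);   # now stored as two bytes
    my $chars   = length($e_acute);          # still one character

At the Perl level the string is the same single character before and
after the upgrade; only the hidden storage differs.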

=head2 Effects of character semantics

Character semantics have the following effects:

=over 4

=item *

Strings and patterns may contain characters that have an ordinal value
larger than 255.

Presuming you use a Unicode editor to edit your program, such
characters will typically occur directly within the literal strings as
UTF-8 (or UTF-EBCDIC on EBCDIC platforms) characters, but you can also
specify a particular character with an extension of the C<\x>
notation.  UTF-X characters are specified by putting the hexadecimal
code within curlies after the C<\x>.  For instance, a Unicode smiley
face is C<\x{263A}>.
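
For illustration, a short sketch of the C<\x{...}> notation (the
variable names are arbitrary):

    use strict;
    use warnings;

    my $smiley = "\x{263A}";
    my $code   = ord($smiley);               # 0x263A, that is 9786
    my $label  = sprintf("U+%04X", $code);   # "U+263A"
    my $length = length($smiley);            # 1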

=item *

Identifiers within the Perl script may contain Unicode alphanumeric
characters, including ideographs.  (You are currently on your own when
it comes to using the canonical forms of characters--Perl doesn't
(yet) attempt to canonicalize variable names for you.)

=item *

Regular expressions match characters instead of bytes.  For instance,
"." matches a character instead of a byte.  (However, the C<\C> pattern
is provided to force a match of a single byte ("C<char>" in C, hence C<\C>).)

=item *

Character classes in regular expressions match characters instead of
bytes, and match against the character properties specified in the
Unicode properties database.  So C<\w> can be used to match an
ideograph, for instance.

=item *

Named Unicode properties and block ranges may be used as character
classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
match property) constructs.  For instance, C<\p{Lu}> matches any
character with the Unicode uppercase property, while C<\p{M}> matches
any mark character.  Single letter properties may omit the braces,
so C<\p{M}> can also be written C<\pM>.  Many predefined character classes
are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.  The
names of the C<In> classes are the official Unicode script and block
names but with all non-alphanumeric characters removed; for example,
the block name C<"Latin-1 Supplement"> becomes C<\p{InLatin1Supplement}>.

Here is the list as of Unicode 3.1.0 (the two-letter classes) and
as defined by Perl (the one-letter classes) (in Unicode materials
what Perl calls C<L> is often called C<L&>):

   L  Letter
   Lu Letter, Uppercase
   Ll Letter, Lowercase
   Lt Letter, Titlecase
   Lm Letter, Modifier
   Lo Letter, Other
   M  Mark
   Mn Mark, Non-Spacing
   Mc Mark, Spacing Combining
   Me Mark, Enclosing
   N  Number
   Nd Number, Decimal Digit
   Nl Number, Letter
   No Number, Other
   P  Punctuation
   Pc Punctuation, Connector
   Pd Punctuation, Dash
   Ps Punctuation, Open
   Pe Punctuation, Close
   Pi Punctuation, Initial quote
       (may behave like Ps or Pe depending on usage)
   Pf Punctuation, Final quote
       (may behave like Ps or Pe depending on usage)
   Po Punctuation, Other
   S  Symbol
   Sm Symbol, Math
   Sc Symbol, Currency
   Sk Symbol, Modifier
   So Symbol, Other
   Z  Separator
   Zs Separator, Space
   Zl Separator, Line
   Zp Separator, Paragraph
   C  Other
   Cc Other, Control
   Cf Other, Format
   Cs Other, Surrogate
   Co Other, Private Use
   Cn Other, Not Assigned (Unicode defines no Cn characters)
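
As a small sketch of how the general category classes listed above might
be tried out (the sample string and variable names are made up for
illustration):

    use strict;
    use warnings;

    # "E", COMBINING ACUTE ACCENT (a mark), then "cole"
    my $text  = "E\x{301}cole";

    my @upper = $text =~ /(\p{Lu})/g;   # uppercase letters
    my @marks = $text =~ /(\pM)/g;      # marks, short form of \p{M}
    my @other = $text =~ /(\P{L})/g;    # everything that is not a letter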

Additionally, because scripts differ in their directionality
(for example Hebrew is written right to left), all characters
have their directionality defined:

   BidiL   Left-to-Right
   BidiLRE Left-to-Right Embedding
   BidiLRO Left-to-Right Override
   BidiR   Right-to-Left
   BidiAL  Right-to-Left Arabic
   BidiRLE Right-to-Left Embedding
   BidiRLO Right-to-Left Override
   BidiPDF Pop Directional Format
   BidiEN  European Number
   BidiES  European Number Separator
   BidiET  European Number Terminator
   BidiAN  Arabic Number
   BidiCS  Common Number Separator
   BidiNSM Non-Spacing Mark
   BidiBN  Boundary Neutral
   BidiB   Paragraph Separator
   BidiS   Segment Separator
   BidiWS  Whitespace
   BidiON  Other Neutrals

=head2 Scripts

The scripts available for C<\p{In...}> and C<\P{In...}>, for example
C<\p{InLatin}> or C<\P{InHan}>, are as follows:

   Latin
   Greek
   Cyrillic
   Armenian
   Hebrew
   Arabic
   Syriac
   Thaana
   Devanagari
   Bengali
   Gurmukhi
   Gujarati
   Oriya
   Tamil
   Telugu
   Kannada
   Malayalam
   Sinhala
   Thai
   Lao
   Tibetan
   Myanmar
   Georgian
   Hangul
   Ethiopic
   Cherokee
   CanadianAboriginal
   Ogham
   Runic
   Khmer
   Mongolian
   Hiragana
   Katakana
   Bopomofo
   Han
   Yi
   OldItalic
   Gothic
   Deseret
   Inherited

=head2 Blocks

In addition to B<scripts>, Unicode also defines B<blocks> of
characters.  The difference between scripts and blocks is that the
former concept is closer to natural languages, while the latter
concept is more an artificial grouping based on groups of 256 Unicode
characters.  For example, the C<Latin> script contains letters from
many blocks, but it does not contain all the characters from those
blocks; for example, it does not contain digits.

For more about scripts see the UTR #24:
http://www.unicode.org/unicode/reports/tr24/
For more about blocks see
http://www.unicode.org/Public/UNIDATA/Blocks.txt

Because there are overlaps in naming (there are, for example, both
a script called C<Katakana> and a block called C<Katakana>), the block
version has C<Block> appended to its name: C<\p{InKatakanaBlock}>.

Notice that this definition was introduced in Perl 5.8.0.  In Perl
5.6.0 only the blocks were used; in Perl 5.8.0 scripts became the
preferential character class definition.  This meant that the
definitions of some character classes changed (the ones in the
list below that have C<Block> appended).

   BasicLatin
   Latin1Supplement
   LatinExtendedA
   LatinExtendedB
   IPAExtensions
   SpacingModifierLetters
   CombiningDiacriticalMarks
   GreekBlock
   CyrillicBlock
   ArmenianBlock
   HebrewBlock
   ArabicBlock
   SyriacBlock
   ThaanaBlock
   DevanagariBlock
   BengaliBlock
   GurmukhiBlock
   GujaratiBlock
   OriyaBlock
   TamilBlock
   TeluguBlock
   KannadaBlock
   MalayalamBlock
   SinhalaBlock
   ThaiBlock
   LaoBlock
   TibetanBlock
   MyanmarBlock
   GeorgianBlock
   HangulJamo
   EthiopicBlock
   CherokeeBlock
   UnifiedCanadianAboriginalSyllabics
   OghamBlock
   RunicBlock
   KhmerBlock
   MongolianBlock
   LatinExtendedAdditional
   GreekExtended
   GeneralPunctuation
   SuperscriptsandSubscripts
   CurrencySymbols
   CombiningMarksforSymbols
   LetterlikeSymbols
   NumberForms
   Arrows
   MathematicalOperators
   MiscellaneousTechnical
   ControlPictures
   OpticalCharacterRecognition
   EnclosedAlphanumerics
   BoxDrawing
   BlockElements
   GeometricShapes
   MiscellaneousSymbols
   Dingbats
   BraillePatterns
   CJKRadicalsSupplement
   KangxiRadicals
   IdeographicDescriptionCharacters
   CJKSymbolsandPunctuation
   HiraganaBlock
   KatakanaBlock
   BopomofoBlock
   HangulCompatibilityJamo
   Kanbun
   BopomofoExtended
   EnclosedCJKLettersandMonths
   CJKCompatibility
   CJKUnifiedIdeographsExtensionA
   CJKUnifiedIdeographs
   YiSyllables
   YiRadicals
   HangulSyllables
   HighSurrogates
   HighPrivateUseSurrogates
   LowSurrogates
   PrivateUse
   CJKCompatibilityIdeographs
   AlphabeticPresentationForms
   ArabicPresentationFormsA
   CombiningHalfMarks
   CJKCompatibilityForms
   SmallFormVariants
   ArabicPresentationFormsB
   Specials
   HalfwidthandFullwidthForms
   OldItalicBlock
   GothicBlock
   DeseretBlock
   ByzantineMusicalSymbols
   MusicalSymbols
   MathematicalAlphanumericSymbols
   CJKUnifiedIdeographsExtensionB
   CJKCompatibilityIdeographsSupplement
   Tags

=item *

The special pattern C<\X> matches any extended Unicode sequence
(a "combining character sequence" in Standardese), where the first
character is a base character and subsequent characters are mark
characters that apply to the base character.  It is equivalent to
C<(?:\PM\pM*)>.
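
A brief sketch comparing C<.>, C<\X>, and the spelled-out equivalent on
a base character followed by a mark (the sample string and variable
names are illustrative):

    use strict;
    use warnings;

    my $text = "e\x{301}a";       # "e" + COMBINING ACUTE ACCENT, then "a"

    my @chars     = $text =~ /(.)/g;              # individual characters
    my @sequences = $text =~ /(\X)/g;             # combining sequences
    my @manual    = $text =~ /((?:\PM\pM*))/g;    # the equivalent pattern

Here C<@chars> ends up with three elements, while C<@sequences> and
C<@manual> each hold two, the first being the "e" together with its
accent.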

=item *

The C<tr///> operator translates characters instead of bytes.  Note
that the C<tr///CU> functionality has been removed, as the interface
was a mistake.  For similar functionality see C<pack('U0', ...)> and
C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when provided character input.  Note that C<uc()> translates to
uppercase, while C<ucfirst> translates to titlecase (for languages
that make the distinction).  Naturally the corresponding backslash
sequences have the same semantics.
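
To illustrate the uppercase versus titlecase distinction, here is a
sketch using U+01C6, a single character that looks like "dz" with a
caron and has distinct uppercase and titlecase forms in Unicode (the
variable names are illustrative):

    use strict;
    use warnings;

    my $dz     = "\x{01C6}";      # LATIN SMALL LETTER DZ WITH CARON
    my $upper  = uc($dz);         # U+01C4, the uppercase form
    my $title  = ucfirst($dz);    # U+01C5, the titlecase form
    my $escape = "\u$dz";         # backslash sequence, same as ucfirst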

=item *

Most operators that deal with positions or lengths in the string will
automatically switch to using character positions, including
C<chop()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>.  Operators that
specifically don't switch include C<vec()>, C<pack()>, and
C<unpack()>.  Operators that really don't care include C<chomp()>, as
well as any other operator that treats a string as a bucket of bits,
such as C<sort()>, and the operators dealing with filenames.
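
For instance, in a sketch with a string whose first character needs
three bytes internally (assuming UTF-8 as the internal encoding), the
positions and lengths are still counted in characters:

    use strict;
    use warnings;

    my $s     = "\x{263A}abc";
    my $len   = length($s);         # 4 characters
    my $part  = substr($s, 1, 2);   # "ab"
    my $where = index($s, "b");     # 2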

=item *

The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change,
since they're often used for byte-oriented formats.  (Again, think
"C<char>" in the C language.)  However, there is a new "C<U>" specifier
that will convert between UTF-8 characters and integers.  (It works
outside of the utf8 pragma too.)

=item *

The C<chr()> and C<ord()> functions work on characters.  This is like
C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and
C<unpack("C")>.  In fact, the latter are how you now emulate
byte-oriented C<chr()> and C<ord()> under utf8.
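
A short sketch of that correspondence, assuming an ASCII platform (the
variable names are illustrative):

    use strict;
    use warnings;

    my $char     = chr(0x263A);
    my $code     = ord($char);              # 0x263A
    my $packed   = pack("U", 0x263A);       # the same character as $char
    my ($number) = unpack("U", $char);      # 0x263A

    # The byte-oriented counterparts:
    my $byte       = pack("C", 0xE9);       # a single byte
    my ($byte_val) = unpack("C", $byte);    # 0xE9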

=item *

The bit string operators C<& | ^ ~> can operate on character data.
However, for backward compatibility reasons (bit string operations
when the characters all are less than 256 in ordinal value) one should
not mix C<~> (the bit complement) and characters both less than 256 and
equal to or greater than 256.  Most importantly, DeMorgan's laws
(C<~($x|$y) eq ~$x&~$y>, C<~($x&$y) eq ~$x|~$y>) won't hold.
Another way to look at this is that the complement cannot return
B<both> the 8-bit (byte) wide bit complement B<and> the full character
wide bit complement.

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

=head2 Character encodings for input and output

See L<Encode>.

=head1 CAVEATS

As yet, there is no method for automatically coercing input and
output to some encoding other than UTF-8 or UTF-EBCDIC.  This is planned
in the near future, however.

Whether an arbitrary piece of data will be treated as "characters" or
"bytes" by internal operations cannot be divined at the current time.

Use of locales with utf8 may lead to odd results.  Currently there is
some attempt to apply 8-bit locale info to characters in the range
0..255, but this is demonstrably incorrect for locales that use
characters above that range (when mapped into Unicode).  It will also
tend to run slower.  Avoidance of locales is strongly encouraged.

=head1 SEE ALSO

L<bytes>, L<utf8>, L<perlretut>, L<perlvar/"${^WIDE_SYSTEM_CALLS}">

=cut