=head1 NAME
-feature - Perl pragma to enable new syntactic features
+feature - Perl pragma to enable new features
=head1 SYNOPSIS
=head2 the 'unicode_strings' feature
C<use feature 'unicode_strings'> tells the compiler to treat
-strings with codepoints larger than 128 as Unicode. It is available
-starting with Perl 5.11.3.
-
-In greater detail:
-
-This feature modifies the semantics for the 128 characters on ASCII
-systems that have the 8th bit set. (See L</EBCDIC platforms> below for
-EBCDIC systems.) By default, unless C<S<use locale>> is specified, or the
-scalar containing such a character is known by Perl to be encoded in UTF8,
-the semantics are essentially that the characters have an ordinal number,
-and that's it. They are caseless, and aren't anything: they're not
-controls, not letters, not punctuation, ..., not anything.
-
-This behavior stems from when Perl did not support Unicode, and ASCII was the
-only known character set outside of C<S<use locale>>. In order to not
-possibly break pre-Unicode programs, these characters have retained their old
-non-meanings, except when it is clear to Perl that Unicode is what is meant,
-for example by calling utf8::upgrade() on a scalar, or if the scalar also
-contains characters that are only available in Unicode. Then these 128
-characters take on their Unicode meanings.
-
-The problem with this behavior is that a scalar that encodes these characters
-has a different meaning depending on if it is stored as utf8 or not.
-In general, the internal storage method should not affect the
-external behavior.
-
-The behavior is known to have effects on these areas:
+all strings outside of C<use locale> and C<use bytes> as Unicode. It is
+available starting with Perl 5.11.3.
-=over 4
-
-=item *
-
-Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
-and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in regular expression
-substitutions.
-
-=item *
-
-Using caseless (C</i>) regular expression matching
-
-=item *
-
-Matching a number of properties in regular expressions, such as C<\w>
-
-=item *
-
-User-defined case change mappings. You can create a C<ToUpper()> function, for
-example, which overrides Perl's built-in case mappings. The scalar must be
-encoded in utf8 for your function to actually be invoked.
-
-=back
-
-B<This lack of semantics for these characters is currently the default,>
-outside of C<use locale>. See below for EBCDIC.
-
-To turn on B<case changing semantics only> for these characters, use
-C<use feature "unicode_strings">.
-
-The other old (legacy) behaviors regarding these characters are currently
-unaffected by this pragma.
-
-=head4 EBCDIC platforms
-
-On EBCDIC platforms, the situation is somewhat different. The legacy
-semantics are whatever the underlying semantics of the native C language
-library are. Each of the three EBCDIC encodings currently known by Perl is an
-isomorph of the Latin-1 character set. That means every character in Latin-1
-has a corresponding EBCDIC equivalent, and vice-versa. Specifying C<S<no
-legacy>> currently makes sure that all EBCDIC characters have the same
-B<casing only> semantics as their corresponding Latin-1 characters.
+See L<perlunicode/The "Unicode Bug"> for details.
=head1 FEATURE BUNDLES
perluniintro Perl Unicode introduction
perlunicode Perl Unicode support
perlunifaq Perl Unicode FAQ
- perluniprops Complete index of Unicode Version 5.1.0 properties
+ perluniprops Perl Unicode property index
perlunitut Perl Unicode tutorial
perlebcdic Considerations for running Perl on EBCDIC platforms
=item Can't find %s character property "%s"
(F) You used C<\p{}> or C<\P{}> but the character property by that name
-could not be found. Maybe you misspelled the name of the property
-(remember that the names of character properties consist only of
-alphanumeric characters), or maybe you forgot the C<Is> or C<In> prefix?
+could not be found. Maybe you misspelled the name of the property?
+See L<perluniprops/Properties accessible through \p{} and \P{}>
+for a complete list of available properties.
=item Can't find label %s
=item Can't find Unicode property definition "%s"
(F) You may have tried to use C<\p> which means a Unicode property (for
-example C<\p{Lu}> is all uppercase letters). If you did mean to use a
-Unicode property, see L<perlunicode> for the list of known properties.
+example C<\p{Lu}> matches all uppercase letters). If you did mean to use a
+Unicode property, see
+L<perluniprops/Properties accessible through \p{} and \P{}>
+for a complete list of available properties.
If you didn't mean to use a Unicode property, escape the C<\p>, either
by C<\\p> (just the C<\p>) or by C<\Q\p> (the rest of the string, until
possible C<\E>).
0xDFFF (inclusive). That range is reserved exclusively for the use of
UTF-16 encoding (by having two 16-bit UCS-2 characters); but Perl
encodes its characters in UTF-8, so what you got is a very illegal
-character. If you really know what you are doing you can turn off
+character. If you really really know what you are doing you can turn off
this warning by C<no warnings 'utf8';>.
=item Value of %s can be "0"; test with defined()
Portions that are still incomplete are marked with XXX.
+Perl used to work on EBCDIC machines, but there are now areas of the code where
+it doesn't. If you want to use Perl on an EBCDIC machine, please let us know
+by sending mail to perlbug@perl.org.
+
=head1 COMMON CHARACTER CODE SETS
=head2 ASCII
=head2 EBCDIC
The Extended Binary Coded Decimal Interchange Code refers to a
-large collection of slightly different single and multi byte
-coded character sets that are different from ASCII or ISO 8859-1
-and typically run on host computers. The EBCDIC encodings derive
-from 8 bit byte extensions of Hollerith punched card encodings.
-The layout on the cards was such that high bits were set for the
-upper and lower case alphabet characters [a-z] and [A-Z], but there
-were gaps within each Latin alphabet range.
+large collection of single and multi byte coded character sets that are
+different from ASCII or ISO 8859-1 and are all slightly different from each
+other; they typically run on host computers. The EBCDIC encodings derive from
+8 bit byte extensions of Hollerith punched card encodings. The layout on the
+cards was such that high bits were set for the upper and lower case alphabet
+characters [a-z] and [A-Z], but there were gaps within each Latin alphabet
+range.
Some IBM EBCDIC character sets may be known by character code set
identification numbers (CCSID numbers) or code page numbers. Leading
mentioned above.)
For example, the ordinal value of 'A' is 193 in most EBCDIC code pages,
and also is 193 when encoded in UTF-EBCDIC.
-All other code points occupy at least two bytes when encoded.
+All variant code points occupy at least two bytes when encoded.
In UTF-8, the code points corresponding to the lowest 128
ordinal numbers (0 - 127: the ASCII characters) are invariant.
In UTF-EBCDIC, there are 160 invariant characters.
=item lc
Returns a lowercased version of EXPR. This is the internal function
-implementing the C<\L> escape in double-quoted strings. Respects
-current LC_CTYPE locale if C<use locale> in force. See L<perllocale>
-and L<perlunicode> for more details about locale and Unicode support.
+implementing the C<\L> escape in double-quoted strings.
If EXPR is omitted, uses C<$_>.
+What gets returned depends on several factors:
+
+=over
+
+=item If C<use bytes> is in effect:
+
+=over
+
+=item On EBCDIC platforms
+
+The results are what the C language system call C<tolower()> returns.
+
+=item On ASCII platforms
+
+The results follow ASCII semantics. Only characters C<A-Z> change, to C<a-z>
+respectively.
+
+=back
+
+=item Otherwise, if EXPR has the UTF8 flag set
+
+If the current package has a subroutine named C<ToLower>, it will be used to
+change the case. (See L<perlunicode/User-Defined Case Mappings>.)
+Otherwise Unicode semantics are used for the case change.
+
+=item Otherwise, if C<use locale> is in effect
+
+Respects current LC_CTYPE locale. See L<perllocale>.
+
+=item Otherwise, if C<use feature 'unicode_strings'> is in effect:
+
+Unicode semantics are used for the case change. Any subroutine named
+C<ToLower> will not be used.
+
+=item Otherwise:
+
+=over
+
+=item On EBCDIC platforms
+
+The results are what the C language system call C<tolower()> returns.
+
+=item On ASCII platforms
+
+ASCII semantics are used for the case change. The lowercase of any character
+outside the ASCII range is the character itself.
+
+=back
+
+=back
+
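+As a rough illustration of the last two cases, consider the following sketch.
+It assumes an ASCII platform, no C<use locale> or C<use bytes> in scope, and a
+string that does not have the UTF8 flag set; the variable name is only for
+illustration:
+
+    my $s = "\xC9";                   # LATIN CAPITAL LETTER E WITH ACUTE
+    {
+        use feature 'unicode_strings';
+        printf "%X\n", ord lc $s;     # prints E9: Unicode semantics apply
+    }
+    printf "%X\n", ord lc $s;         # prints C9: unchanged, ASCII semantics
+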
=item lcfirst EXPR
X<lcfirst> X<lowercase>
Returns the value of EXPR with the first character lowercased. This
is the internal function implementing the C<\l> escape in
-double-quoted strings. Respects current LC_CTYPE locale if C<use
-locale> in force. See L<perllocale> and L<perlunicode> for more
-details about locale and Unicode support.
+double-quoted strings.
If EXPR is omitted, uses C<$_>.
+This function behaves the same way under various pragmata, such as in a locale,
+as L</lc> does.
+
=item length EXPR
X<length> X<size>
an integer may be represented by a sequence of 4 bytes that will be
converted to a sequence of 4 characters.
+See L<perlpacktut> for an introduction to this function.
+
The TEMPLATE is a sequence of characters that give the order and type
of values, as follows:
=item uc
Returns an uppercased version of EXPR. This is the internal function
-implementing the C<\U> escape in double-quoted strings. Respects
-current LC_CTYPE locale if C<use locale> in force. See L<perllocale>
-and L<perlunicode> for more details about locale and Unicode support.
+implementing the C<\U> escape in double-quoted strings.
It does not attempt to do titlecase mapping on initial letters. See
C<ucfirst> for that.
If EXPR is omitted, uses C<$_>.
+This function behaves the same way under various pragmata, such as in a locale,
+as L</lc> does.
+
=item ucfirst EXPR
X<ucfirst> X<uppercase>
Returns the value of EXPR with the first character in uppercase
(titlecase in Unicode). This is the internal function implementing
-the C<\u> escape in double-quoted strings. Respects current LC_CTYPE
-locale if C<use locale> in force. See L<perllocale> and L<perlunicode>
-for more details about locale and Unicode support.
+the C<\u> escape in double-quoted strings.
If EXPR is omitted, uses C<$_>.
+This function behaves the same way under various pragmata, such as in a locale,
+as L</lc> does.
+
=item umask EXPR
X<umask>
and expands it out into a list of values.
(In scalar context, it returns merely the first value produced.)
-If EXPR is omitted, unpacks the C<$_> string.
+If EXPR is omitted, unpacks the C<$_> string.
+
+See L<perlpacktut> for an introduction to this function.
The string is broken into chunks described by the TEMPLATE. Each chunk
is converted separately to a value. Typically, either the string is a result
character set adequate only for poorly representing English text).
Often used loosely to describe the lowest 128 values of the various
ISO-8859-X character sets, a bunch of mutually incompatible 8-bit
-codes best described as half ASCII. See also L</Unicode>.
+codes sometimes described as half ASCII. See also L</Unicode>.
=item assertion
=item Unicode
A character set comprising all the major character sets of the world,
-more or less. See L<http://www.unicode.org>.
+more or less. See L<perlunicode> and L<http://www.unicode.org>.
=item Unix
slightly differently. A flag in the SV, C<SVf_UTF8>, indicates that the
string is internally encoded as UTF-8. Without it, the byte value is the
codepoint number and vice versa (in other words, the string is encoded
-as iso-8859-1). You can check and manipulate this flag with the
+as iso-8859-1, but C<use feature 'unicode_strings'> is needed to get iso-8859-1
+semantics). You can check and manipulate this flag with the
following macros:
SvUTF8(sv)
characters have different meanings depending on the locale. Absent a locale,
currently these extra characters are generally considered to be unassigned,
and this has presented some problems.
-This is scheduled to be changed in 5.12 so that these characters will
+This is being changed starting in 5.12 so that these characters will
be considered to be Latin-1 (ISO-8859-1).
=item *
\n (Logical) newline character.
\N Any character but newline.
\N{} Named (Unicode) character.
- \p{}, \pP Character with a Unicode property.
- \P{}, \PP Character without a Unicode property.
+ \p{}, \pP Character with the given Unicode property.
+ \P{}, \PP Character without the given Unicode property.
\Q Quotemeta till \E.
\r Return character.
\R Generic new line.
C<\pP> and C<\p{Prop}> are character classes to match characters that
fit given Unicode classes. One letter classes can be used in the C<\pP>
-form, with the class name following the C<\p>, otherwise, the property
-name is enclosed in braces, and follows the C<\p>. For instance, a
-match for a number can be written as C</\pN/> or as C</\p{Number}/>.
-Lowercase letters are matched by the property I<LowercaseLetter> which
-has as short form I<Ll>. They have to be written as C</\p{Ll}/> or
-C</\p{LowercaseLetter}/>. C</\pLl/> is valid, but means something different.
+form, with the class name following the C<\p>; otherwise, braces are required.
+There is a single form, which is just the property name enclosed in the braces,
+and a compound form which looks like C<\p{name=value}>, which means to match
+if the property C<name> for the character has the particular C<value>.
+For instance, a match for a number can be written as C</\pN/> or as
+C</\p{Number}/>, or as C</\p{Number=True}/>.
+Lowercase letters are matched by the property I<Lowercase_Letter> which
+has as short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or
+C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/>
+(the underscores are optional).
+C</\pLl/> is valid, but means something different.
It matches a two character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.
-For a list of possible properties, see
-L<perlunicode/Unicode Character Properties>. It is also possible to
-defined your own properties. This is discussed in
+For more details, see L<perlunicode/Unicode Character Properties>; for a
+complete list of possible properties, see
+L<perluniprops/Properties accessible through \p{} and \P{}>.
+It is also possible to define your own properties. This is discussed in
L<perlunicode/User-Defined Character Properties>.
word IsWord
xdigit IsXDigit
-Some character classes may have a non-obvious name:
+Some of these names may not be obvious:
=over 4
\S A non-whitespace character
\h An horizontal white space
\H A non horizontal white space
- \N A non newline (like . without /s)
+ \N A non newline (when not followed by a '{'; it's like . without /s)
\v A vertical white space
\V A non vertical white space
\R A generic newline (?>\v|\x0D\x0A)
\C Match a byte (with Unicode, '.' matches a character)
\pP Match P-named (Unicode) property
- \p{...} Match Unicode property with long name
+ \p{...} Match Unicode property with name longer than 1 character
\PP Match non-P
- \P{...} Match lack of Unicode property with long name
+ \P{...} Match lack of Unicode property with name longer than 1 char
\X Match Unicode extended grapheme cluster
POSIX character classes and their Unicode and Perl equivalents:
or C<\P{Katakana}>. Other sets are the Unicode blocks, the names
of which begin with "In". One such block is dedicated to mathematical
operators, and its pattern formula is C<\p{InMathematicalOperators}>.
-For the full list see L<perlunicode>.
+For the full list see L<perluniprops>.
+
+What we have described so far is the single form of the C<\p{...}> character
+classes. There is also a compound form which you may run into. These
+look like C<\p{name=value}> or C<\p{name:value}> (the equals sign and colon
+can be used interchangeably). These are more general than the single form,
+and in fact most of the single forms are just Perl-defined shortcuts for common
+compound forms. For example, the script examples in the previous paragraph
+could be written equivalently as C<\p{Script=Latin}>, C<\p{Script:Greek}>, and
+C<\P{script=katakana}> (case is irrelevant between the C<{}> braces). You may
+never have to use the compound forms, but sometimes it is necessary, and their
+use can make your code easier to understand.
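+
+As a small sketch of the compound form (the sample characters here are chosen
+only for illustration), the following tests three characters against the
+script properties mentioned above:
+
+    for my $ch ("A", "\x{3B1}", "\x{30A2}") {   # Latin, Greek, Katakana
+        printf "U+%04X: Latin=%d Greek=%d non-Katakana=%d\n", ord($ch),
+            ($ch =~ /\p{Script=Latin}/    ? 1 : 0),
+            ($ch =~ /\p{Script:Greek}/    ? 1 : 0),
+            ($ch =~ /\P{script=katakana}/ ? 1 : 0);
+    }
+
+The first two characters should match only their own script among Latin and
+Greek, and only the Katakana letter should fail the C<\P{script=katakana}>
+test.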
C<\X> is an abbreviation for a character class that comprises
-the Unicode I<combining character sequences>. A combining character
-sequence is a base character followed by any number of diacritics, i.e.,
-signs like accents used to indicate different sounds of a letter. Using
-the Unicode full names, e.g., S<C<A + COMBINING RING>> is a combining
-character sequence with base character C<A> and combining character
-S<C<COMBINING RING>>, which translates in Danish to A with the circle
-atop it, as in the word Angstrom. C<\X> is equivalent to C<\PM\pM*}>,
-i.e., a non-mark followed by one or more marks.
+a Unicode I<extended grapheme cluster>. This represents a "logical character",
+what appears to be a single character, but may be represented internally by more
+than one. As an example, using the Unicode full names, e.g., S<C<A + COMBINING
+RING>> is a grapheme cluster with base character C<A> and combining character
+S<C<COMBINING RING>>, which translates in Danish to A with the circle atop it,
+as in the word Angstrom.
For the full and latest information about Unicode see the latest
-Unicode standard, or the Unicode Consortium's website http://www.unicode.org/
+Unicode standard, or the Unicode Consortium's website L<http://www.unicode.org>
As if all those classes weren't enough, Perl also defines POSIX style
character classes. These have the form C<[:name:]>, with C<name> the
Currently glob patterns and filenames returned from File::Glob::glob()
are always byte strings. See L</"Virtualize operating system access">.
-=head2 Unicode and lc/uc operators
-
-Some built-in operators (C<lc>, C<uc>, etc.) behave differently, based on
-what the internal encoding of their argument is. That should not be the
-case. Maybe add a pragma to switch behaviour.
-
=head2 use less 'memory'
Investigate trade offs to switch out perl's choices on memory usage.
The handling of Unicode is unclean in many places. For example, the regexp
engine matches in Unicode semantics whenever the string or the pattern is
flagged as UTF-8, but that should not be dependent on an internal storage
-detail of the string. Likewise, case folding behaviour is dependent on the
-UTF8 internal flag being on or off.
+detail of the string.
=head2 Properly Unicode safe tokeniser and pads.
favor of compatibility and chooses to use byte semantics.
Under byte semantics, when C<use locale> is in effect, Perl uses the
-semantics associated with the current locale. Absent a C<use locale>, Perl
-currently uses US-ASCII (or Basic Latin in Unicode terminology) byte semantics,
-meaning that characters whose ordinal numbers are in the range 128 - 255 are
-undefined except for their ordinal numbers. This means that none have case
-(upper and lower), nor are any a member of character classes, like C<[:alpha:]>
-or C<\w>.
-(But all do belong to the C<\W> class or the Perl regular expression extension
-C<[:^alpha:]>.)
+semantics associated with the current locale. Absent a C<use locale>, and
+absent a C<use feature 'unicode_strings'> pragma, Perl currently uses US-ASCII
+(or Basic Latin in Unicode terminology) byte semantics, meaning that characters
+whose ordinal numbers are in the range 128 - 255 are undefined except for their
+ordinal numbers. This means that none have case (upper and lower), nor are any
+a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong
+to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)
This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
-none of the program's inputs were marked as being as source of Unicode
+none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.
The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.
+The C<use feature 'unicode_strings'> pragma is intended to always, regardless
+of platform, force Unicode semantics in a particular lexical scope. In
+release 5.12, it is partially implemented, applying only to case changes.
+See L</The "Unicode Bug"> below.
+
The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
-be used to force byte semantics on Unicode data.
+be used to force byte semantics on Unicode data, and the C<use feature
+'unicode_strings'> pragma to force Unicode semantics on byte data (though in
+5.12 it isn't fully implemented).
If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
-
See L</"Unicode Character Properties"> for more details.
You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
-
See L</"User-Defined Character Properties"> for more details.
=item *
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
-that make the distinction.
+that make the distinction (which is equivalent to uppercase in languages
+without the distinction).
=item *
=item *
-lc(), uc(), lcfirst(), and ucfirst() work for the following cases:
-
-=over 8
-
-=item *
-
-the case mapping is from a single Unicode character to another
-single Unicode character, or
-
-=item *
-
-the case mapping is from a single Unicode character to more
-than one Unicode character.
-
-=back
-
-Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
-since Perl does not understand the concept of Unicode locales.
-
-See the Unicode Technical Report #21, Case Mappings, for more details.
-
-But you can also define your own mappings to be used in the lc(),
+You can define your own mappings to be used in lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
-
See L</"User-Defined Case Mappings"> for more details.
=back
General_Category of "L" (letter) property. Brackets are not
required for single letter properties, so C<\p{L}> is equivalent to C<\pL>.
-More formally, C<\p{Uppercase}> matches any character whose Uppercase property
-value is True, and C<\P{Uppercase}> matches any character whose Uppercase
-property value is False, and they could have been written as
+More formally, C<\p{Uppercase}> matches any character whose Unicode Uppercase
+property value is True, and C<\P{Uppercase}> matches any character whose
+Uppercase property value is False, and they could have been written as
C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively
This formality is needed when properties are not binary, that is if they can
take on more values than just True and False. For example, the Bidi_Class (see
L</"Bidirectional Character Types"> below), can take on a number of different
values, such as Left, Right, Whitespace, and others. To match these, one needs
-to specify the property name (Bidi_Class), and the value being matched with
+to specify the property name (Bidi_Class), and the value being matched against
(Left, Right, etc.). This is done, as in the examples above, by having the two
components separated by an equal sign (or interchangeably, a colon), like
C<\p{Bidi_Class: Left}>.
various synonyms for the values the property can be. For binary properties,
"True" has 3 synonyms: "T", "Yes", and "Y"; and "False" has correspondingly "F",
"No", and "N". But be careful. A short form of a value for one property may
-not mean the same thing as the same name for another. Thus, for the
+not mean the same thing as the same short form for another. Thus, for the
General_Category property, "L" means "Letter", but for the Bidi_Class property,
"L" means "Left". A complete list of properties and synonyms is in
L<perluniprops>.
=head3 B<Scripts>
-The world's languages are written in a number of scripts. This sentence is
-written in Latin, while Russian is written in Cyrllic, and Greek is written in,
-well, Greek; Japanese mainly in Hiragana or Katakana. There are many more.
+The world's languages are written in a number of scripts. This sentence
+(unless you're reading it in translation) is written in Latin, while Russian is
+written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
+Hiragana or Katakana. There are many more.
The Unicode Script property gives what script a given character is in,
and can be matched with the compound form like C<\p{Script=Hebrew}> (short:
rest of the characters in ucfirst()).
The string returned by the subroutines needs to be two hexadecimal numbers
-separated by two tabulators: the source code point and the destination code
-point. For example:
+separated by two tabulators: the two numbers being, respectively, the source
+code point and the destination code point. For example:
sub ToUpper {
return <<END;
[1] \x{...}
[2] \p{...} \P{...}
- [3] supports not only minimal list (general category, scripts,
- Alphabetic, Lowercase, Uppercase, WhiteSpace,
- NoncharacterCodePoint, DefaultIgnorableCodePoint, Any,
- ASCII, Assigned), but also bidirectional types, blocks, etc.
- (see "Unicode Character Properties")
+ [3] supports not only minimal list, but all Unicode character
+ properties (see L</Unicode Character Properties>)
[4] \d \D \s \S \w \W \X [:prop:] [:^prop:]
[5] can use regular expression look-ahead [a] or
user-defined character properties [b] to emulate set operations
[6] \b \B
- [7] note that Perl does Full case-folding in matching, not Simple:
- for example U+1F88 is equivalent to U+1F00 U+03B9,
+ [7] note that Perl does Full case-folding in matching (but with bugs),
+ not Simple: for example U+1F88 is equivalent to U+1F00 U+03B9,
not with 1F80. This difference matters mainly for certain Greek
capital letters with certain modifiers: the Full case-folding
decomposes the letter, while the Simple case-folding would map
[10] see UAX#15 "Unicode Normalization Forms"
[11] have Unicode::Normalize but not integrated to regexes
- [12] have \X but at this level . should equal that
- [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable
- clusters as a single grapheme cluster.
+ [12] have \X but we don't have a "Grapheme Cluster Mode"
[14] see UAX#29, Word Boundaries
[15] see UAX#21 "Case Mappings"
[16] have \N{...} but neither compute names of CJK Ideographs
The following table is from Unicode 3.2.
- Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
+ Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
- U+0000..U+007F 00..7F
- U+0080..U+07FF C2..DF 80..BF
- U+0800..U+0FFF E0 A0..BF 80..BF
+ U+0000..U+007F 00..7F
+ U+0080..U+07FF * C2..DF 80..BF
+ U+0800..U+0FFF E0 * A0..BF 80..BF
U+1000..U+CFFF E1..EC 80..BF 80..BF
U+D000..U+D7FF ED 80..9F 80..BF
- U+D800..U+DFFF ******* ill-formed *******
+ U+D800..U+DFFF +++++++ utf16 surrogates, not legal utf8 +++++++
U+E000..U+FFFF EE..EF 80..BF 80..BF
- U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
- U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
- U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
-
-Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
-C<U+D000...U+D7FF>, the C<90..B>F in C<U+10000..U+3FFFF>, and the
-C<80...8F> in C<U+100000..U+10FFFF>. The "gaps" are caused by legal
-UTF-8 avoiding non-shortest encodings: it is technically possible to
-UTF-8-encode a single code point in different ways, but that is
-explicitly forbidden, and the shortest possible encoding should always
-be used. So that's what Perl does.
+ U+10000..U+3FFFF F0 * 90..BF 80..BF 80..BF
+ U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
+ U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
+
+Note the gaps before several of the byte entries above marked by '*'. These are
+caused by legal UTF-8 avoiding non-shortest encodings: it is technically
+possible to UTF-8-encode a single code point in different ways, but that is
+explicitly forbidden, and the shortest possible encoding should always be used
+(and that is what Perl does).
Another way to look at it is via bits:
00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa
As you can see, the continuation bytes all begin with C<10>, and the
-leading bits of the start byte tell how many bytes the are in the
+leading bits of the start byte tell how many bytes there are in the
encoded character.
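+
+As a rough check of the table and bit patterns above, the following sketch
+prints the UTF-8 bytes of a few code points. It assumes an ASCII platform and
+that the L<Encode> module is available; the code points are arbitrary samples:
+
+    use Encode qw(encode);
+    for my $cp (0x41, 0xE9, 0x20AC, 0x10000) {
+        # encode() returns a byte string; unpack it into byte values
+        my @bytes = unpack "C*", encode("UTF-8", chr $cp);
+        printf "U+%04X => %s\n", $cp,
+            join " ", map { sprintf "%02X", $_ } @bytes;
+    }
+
+The byte sequences shown should be C<41>, C<C3 A9>, C<E2 82 AC>, and
+C<F0 90 80 80> respectively, that is, one to four bytes with every
+continuation byte in the range C<80..BF>.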
=item *
$uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
If you try to generate surrogates (for example by using chr()), you
-will get a warning if warnings are turned on, because those code
+will get a warning, if warnings are turned on, because those code
points are not valid for a Unicode character.
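+
+As a worked instance of the surrogate decoding formula given earlier in this
+item (a sketch only; the sample pair is the one for U+1D11E MUSICAL SYMBOL G
+CLEF):
+
+    my ($hi, $lo) = (0xD834, 0xDD1E);
+    my $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
+    printf "U+%X\n", $uni;    # prints U+1D11E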
Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be C<U+FFFE>, represented in big-endian
-format".
+format". (Actually, C<U+FFFE> is legal for use by your program, even for
+input/output, but better not use it if you need a BOM. But it is "illegal for
+interchange", so that an unsuspecting program won't get confused.)
=item *
=head2 Security Implications of Unicode
+Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
+Also, note the following:
+
=over 4
=item *
possible sequence of UTF-8 bytes should be generated,
because otherwise there is potential for an input buffer overflow at
the receiving end of a UTF-8 connection. Perl always generates the
-shortest length UTF-8, and with warnings on Perl will warn about
+shortest length UTF-8, and with warnings on, Perl will warn about
non-shortest length UTF-8 along with other malformations, such as the
surrogates, which are not real Unicode code points.
encoding or another) could be given as arguments or received as
results, or both, but it is not.
-The following are such interfaces. For all of these interfaces Perl
+The following are such interfaces. Also, see L</The "Unicode Bug">.
+For all of these interfaces Perl
currently (as of 5.8.3) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the C<encoding> pragma has been used.
One reason why Perl does not attempt to resolve the role of Unicode in
-this cases is that the answers are highly dependent on the operating
+these cases is that the answers are highly dependent on the operating
system and the file system(s). For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept. Similarly for the qx and system: how well will the
=back
+=head2 The "Unicode Bug"
+
+The term "Unicode bug" has been applied to an inconsistency with the
+Unicode characters whose code points are in the Latin-1 Supplement block, that
+is, between 128 and 255. Without a locale specified, unlike all other
+characters or code points, these characters have very different semantics in
+byte semantics versus character semantics.
+
+In character semantics they are interpreted as Unicode code points, which means
+they have the same semantics as Latin-1 (ISO-8859-1).
+
+In byte semantics, they are considered to be unassigned characters, meaning
+that the only semantics they have is their ordinal numbers, and that they are
+not members of various character classes. None are considered to match C<\w>
+for example, but all match C<\W>. (On EBCDIC platforms, the behavior may
+be different from this, depending on the underlying C language library
+functions.)
+
+The behavior is known to have effects on these areas:
+
+=over 4
+
+=item *
+
+Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
+and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in regular expression
+substitutions.
+
+=item *
+
+Using caseless (C</i>) regular expression matching
+
+=item *
+
+Matching a number of properties in regular expressions, such as C<\w>
+
+=item *
+
+User-defined case change mappings. You can create a C<ToUpper()> function, for
+example, which overrides Perl's built-in case mappings. The scalar must be
+encoded in utf8 for your function to actually be invoked.
+
+=back
+
+This behavior can lead to unexpected results in which a string's semantics
+suddenly change if a code point above 255 is appended to or removed from it,
+which changes the string's semantics from byte to character or vice versa. As
+an example, consider the following program and its output:
+
+ $ perl -le'
+ $s1 = "\xC2";
+ $s2 = "\x{2660}";
+ for ($s1, $s2, $s1.$s2) {
+ print /\w/ || 0;
+ }
+ '
+ 0
+ 0
+ 1
+
+If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have
+one?
+
+This anomaly stems from Perl's attempt to not disturb older programs that
+didn't use Unicode, and hence had no semantics for characters outside of the
+ASCII range (except in a locale), along with Perl's desire to add Unicode
+support seamlessly. The result wasn't seamless: these characters were
+orphaned.
+
+Work is being done to correct this, but only some of it was complete in time
+for the 5.12 release. What has been finished is the important part of the case
+changing component. Due to concerns, and some evidence, that older code might
+have come to rely on the existing behavior, the new behavior must be explicitly
+enabled by the feature C<unicode_strings> in the L<feature> pragma, even though
+no new syntax is involved.
+
+See L<perlfunc/lc> for details on how this pragma works in combination with
+various others for casing. Even though the pragma only affects casing
+operations in the 5.12 release, it is planned to have it affect all the
+problematic behaviors in later releases: you can't have one without them all.
+
+In the meantime, a workaround is to always call utf8::upgrade($string), or to
+use the standard modules L<Encode> or L<charnames>.
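+
+The C<utf8::upgrade> workaround can be sketched as follows. This is only an
+illustration: the helper C<is_word_char> and the variable names are made up
+for the example, and it assumes an ASCII platform with neither C<use locale>
+nor the C<unicode_strings> feature in effect.
+
+    use strict;
+    use warnings;
+
+    # Hypothetical helper: true if $str is exactly one \w character.
+    sub is_word_char {
+        my ($str) = @_;
+        return $str =~ /\A\w\z/ ? 1 : 0;
+    }
+
+    my $native   = "\xE9";      # LATIN SMALL LETTER E WITH ACUTE, one byte
+    my $upgraded = $native;
+    utf8::upgrade($upgraded);   # same character, now stored as UTF-8
+
+    printf "native:   %d\n", is_word_char($native);
+    printf "upgraded: %d\n", is_word_char($upgraded);
+    printf "equal:    %d\n", $native eq $upgraded ? 1 : 0;
+
+Under those assumptions the two strings compare equal, yet only the upgraded
+copy should match C<\w>, which is the inconsistency described above.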
+
=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
-Sometimes (see L</"When Unicode Does Not Happen">) there are
-situations where you simply need to force a byte
+Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
+there are situations where you simply need to force a byte
string into UTF-8, or vice versa. The low-level calls
utf8::upgrade($bytestring) and utf8::downgrade($utf8string[, FAIL_OK]) are
the answers.
Note that utf8::downgrade() can fail if the string contains characters
that don't fit into a byte.
+Calling either function on a string that already is in the desired state is a
+no-op.
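+
+A minimal sketch of these calls, using made-up variable names, might look
+like the following. It passes a true value as the C<FAIL_OK> argument so that
+a failed downgrade returns false instead of dying.
+
+    use strict;
+    use warnings;
+
+    my $string = "caf\xE9";        # four characters, all below 256
+
+    utf8::upgrade($string);        # force the UTF-8 representation
+    print utf8::is_utf8($string) ? "upgraded\n" : "not upgraded\n";
+
+    utf8::upgrade($string);        # already upgraded: nothing changes
+
+    utf8::downgrade($string);      # back to one byte per character
+    print utf8::is_utf8($string) ? "upgraded\n" : "downgraded\n";
+
+    my $wide = "\x{263A}";         # WHITE SMILING FACE: too big for a byte
+    my $ok   = utf8::downgrade($wide, 1);   # FAIL_OK set
+    print $ok ? "downgraded\n" : "cannot downgrade\n";
+
+C<utf8::is_utf8> is used here only to make the internal state visible;
+ordinary code should rarely need to look at it.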
+
=head2 Using Unicode in XS
If you want to handle Perl Unicode in XS extensions, you may find the
For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.
+=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)
+
+Perl by default comes with the latest supported Unicode version built in, but
+you can change to use any earlier one.
+
+Download the files in the version of Unicode that you want from the Unicode web
+site (L<http://www.unicode.org>). These should replace the existing files in
+C<\$Config{privlib}>/F<unicore>. (C<\%Config> is available from the Config
+module.) Follow the instructions in F<README.perl> in that directory to change
+some of their names, and then run F<make>.
+
+It is even possible to download them to a different directory, and then change
+F<utf8_heavy.pl> in the directory C<\$Config{privlib}> to point to the new
+directory. Alternatively, make a copy of that directory before making the
+change, and use C<@INC> or the C<-I> run-time flag to switch between versions
+at will (but because of caching, not in the middle of a process). All this,
+however, is beyond the scope of these instructions.
+
=head1 BUGS
=head2 Interaction with Locales
Unicode support will also tend to run slower. Use of locales with
Unicode is discouraged.
-=head2 Problems with characters whose ordinal numbers are in the range 128 - 255 with no Locale specified
+=head2 Problems with characters in the C<Latin-1 Supplement> range
-Without a locale specified, unlike all other characters or code points,
-these characters have very different semantics in byte semantics versus
-character semantics.
-In character semantics they are interpreted as Unicode code points, which means
-they are viewed as Latin-1 (ISO-8859-1).
-In byte semantics, they are considered to be unassigned characters,
-meaning that the only semantics they have is their
-ordinal numbers, and that they are not members of various character classes.
-None are considered to match C<\w> for example, but all match C<\W>.
-Besides these class matches,
-the known operations that this affects are those that change the case,
-regular expression matching while ignoring case,
-and B<quotemeta()>.
-This can lead to unexpected results in which a string's semantics suddenly
-change if a code point above 255 is appended to or removed from it,
-which changes the string's semantics from byte to character or vice versa.
-This behavior is scheduled to change in version 5.12, but in the meantime,
-a workaround is to always call utf8::upgrade($string), or to use the
-standard modules L<Encode> or L<charnames>.
+See L</The "Unicode Bug">.
+
+=head2 Problems with case-insensitive regular expression matching
+
+There are problems with case-insensitive matches, including those involving
+character classes (enclosed in [square brackets]), characters whose fold
+is to multiple characters (for example, the single character C<LATIN SMALL
+LIGATURE FFL> matches case-insensitively with the 3-character string C<ffl>),
+and characters in the C<Latin-1 Supplement>.
=head2 Interaction with Extensions
like C<\d> (then again, there 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<d>).
-=head2 Possible problems on EBCDIC platforms
+=head2 Problems on EBCDIC platforms
+
+There are a number of known problems with Perl on EBCDIC platforms. If you
+want to use Perl there, send email to perlbug@perl.org.
In earlier versions, when byte and character data were concatenated,
the new string was sometimes created by
=back
-=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)
-
-Perl by default comes with the latest supported Unicode version built in, but
-you can change to use any earlier one.
-
-Download the files in the version of Unicode that you want from the Unicode web
-site L<http://www.unicode.org>). These should replace the existing files in
-C<\$Config{privlib}>/F<unicore>. (C<\%Config> is available from the Config
-module.) Follow the instructions in F<README.perl> in that directory to change
-some of their names, and then run F<make>.
-
-It is even possible to download them to a different directory, and then change
-F<utf8_heavy.pl> in the directory C<\$Config{privlib}> to point to the new
-directory, or maybe make a copy of that directory before making the change, and
-using C<@INC> or the C<-I> run-time flag to switch between versions at will,
-but all this is beyond the scope of these instructions.
-
=head1 SEE ALSO
L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
No, and this isn't really a Unicode FAQ.
-Perl has an abstracted interface for all supported character encodings, so they
+Perl has an abstracted interface for all supported character encodings, so this
is actually a generic C<Encode> tutorial and C<Encode> FAQ. But many people
think that Unicode is special and magical, and I didn't want to disappoint
them, so I decided to call the document a Unicode tutorial.
=head2 Why do some characters not uppercase or lowercase correctly?
It seemed like a good idea at the time, to keep the semantics the same for
-standard strings, when Perl got Unicode support. While it might be repaired
-in the future, we now have to deal with the fact that Perl treats equal
-strings differently, depending on the internal state.
-
-Affected are C<uc>, C<lc>, C<ucfirst>, C<lcfirst>, C<\U>, C<\L>, C<\u>, C<\l>,
+standard strings, when Perl got Unicode support. The plan is to fix this
+in the future, and the casing component has in fact mostly been fixed, but we
+have to deal with the fact that Perl treats equal strings differently,
+depending on the internal state.
+
+First the casing. Just put a C<use feature 'unicode_strings'> near the
+beginning of your program. Within its lexical scope, C<uc>, C<lc>, C<ucfirst>,
+C<lcfirst>, and the regular expression escapes C<\U>, C<\L>, C<\u>, C<\l> use
+Unicode semantics for changing case regardless of whether the UTF8 flag is on
+or not. However, if you pass strings to subroutines in modules outside the
+pragma's scope, they currently likely won't behave this way, and you have to
+try one of the solutions below. There is another exception as well: if you
+have furnished your own casing functions to override the default, these will
+not be called unless the UTF8 flag is on.
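+
+As a rough illustration of the lexical scoping, consider the sketch below.
+The variable names are invented for the example, and it assumes an ASCII
+platform, no C<use locale>, and a Perl recent enough to know the
+C<unicode_strings> feature.
+
+    use strict;
+    use warnings;
+
+    my $word = "\xE9t\xE9";    # "ete" with acute accents, native bytes
+
+    my ($with_feature, $without_feature);
+    {
+        use feature 'unicode_strings';
+        $with_feature = uc $word;     # Unicode rules: \xE9 becomes \xC9
+    }
+    $without_feature = uc $word;      # outside the scope: only "t" changes
+
+    printf "%vX\n", $with_feature;    # prints C9.54.C9
+    printf "%vX\n", $without_feature; # prints E9.54.E9
+
+The same call to C<uc> on the same string gives different results depending
+only on whether the pragma is in scope where the call is compiled.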
+
+This remains a problem for the regular expression constructs
C<\d>, C<\s>, C<\w>, C<\D>, C<\S>, C<\W>, C</.../i>, C<(?i:...)>,
-C</[[:posix:]]/>, and C<quotemeta> (though this last should not cause any real
-problems).
+and C</[[:posix:]]/>.
To force Unicode semantics, you can upgrade the internal representation to
by doing C<utf8::upgrade($string)>. This can be used
This is a term used both for characters with an ordinal value greater than 127,
characters with an ordinal value greater than 255, or any character occupying
-than one byte, depending on the context.
+more than one byte, depending on the context.
The Perl warning "Wide character in ..." is caused by a character with an
ordinal value greater than 255. With no specified encoding layer, Perl tries to
The UTF8 flag, also called SvUTF8, is an internal flag that indicates that the
current internal representation is UTF-8. Without the flag, it is assumed to be
-ISO-8859-1. Perl converts between these automatically.
+ISO-8859-1. Perl converts between these automatically. (Actually Perl assumes
+the representation is ASCII; see L</Why do regex character classes sometimes
+match only in the ASCII range?> above.)
One of Perl's internal formats happens to be UTF-8. Unfortunately, Perl can't
keep a secret, so everyone knows about this. That is the source of much
A Unicode I<character> is an abstract entity. It is not bound to any
particular integer width, especially not to the C language C<char>.
Unicode is language-neutral and display-neutral: it does not encode the
-language of the text and it does not generally define fonts or other graphical
+language of the text, and it does not generally define fonts or other graphical
layout details. Unicode operates on characters and on text built from
those characters.
otherwise used blocks. Secondly, there are special Unicode control
characters that do not represent true characters.
-A common myth about Unicode is that it would be "16-bit", that is,
+A common myth about Unicode is that it is "16-bit", that is,
Unicode is only represented as C<0x10000> (or 65536) characters from
C<0x0000> to C<0xFFFF>. B<This is untrue.> Since Unicode 2.0 (July
1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
3.1, 17 (yes, seventeen) planes in all were defined--but they are
nowhere near full of defined characters, yet.
-Another myth is that the 256-character blocks have something to
+Another myth is about Unicode blocks--that they have something to
do with languages--that each block would define the characters used
by a language or a set of languages. B<This is also untrue.>
The division into blocks exists, but it is almost completely
accidental--an artifact of how the characters have been and
-still are allocated. Instead, there is a concept called I<scripts>,
-which is more useful: there is C<Latin> script, C<Greek> script, and
-so on. Scripts usually span varied parts of several blocks.
-For further information see L<Unicode::UCD>.
+still are allocated. Instead, there is a concept called I<scripts>, which is
+more useful: there is C<Latin> script, C<Greek> script, and so on. Scripts
+usually span varied parts of several blocks. For more information about
+scripts, see L<perlunicode/Scripts>.
The Unicode code points are just abstract numbers. To input and
output these abstract numbers, the numbers must be I<encoded> or
I<serialised> somehow. Unicode defines several I<character encoding
forms>, of which I<UTF-8> is perhaps the most popular. UTF-8 is a
variable length encoding that encodes Unicode characters as 1 to 6
-bytes (only 4 with the currently defined characters). Other encodings
+bytes. Other encodings
include UTF-16 and UTF-32 and their big- and little-endian variants
(UTF-8 is byte-order independent) The ISO/IEC 10646 defines the UCS-2
and UCS-4 encoding forms.
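+
+To get a feel for the variable length of UTF-8, one can count the encoded
+bytes for a few code points. The sketch below uses the standard L<Encode>
+module; the helper name C<utf8_length> and the sample code points are chosen
+only for illustration.
+
+    use strict;
+    use warnings;
+    use Encode qw(encode);
+
+    # Hypothetical helper: number of bytes in the UTF-8 encoding of
+    # a single code point.
+    sub utf8_length {
+        my ($code_point) = @_;
+        return length(encode('UTF-8', chr($code_point)));
+    }
+
+    printf "U+%04X: %d byte(s)\n", $_, utf8_length($_)
+        for 0x41, 0xE9, 0x20AC, 0x1F600;
+
+For these four samples the lengths are 1, 2, 3 and 4 bytes respectively.
+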
Perl supports both pre-5.6 strings of eight-bit native bytes, and
strings of Unicode characters. The principle is that Perl tries to
keep its data as eight-bit bytes for as long as possible, but as soon
-as Unicodeness cannot be avoided, the data is transparently upgraded
-to Unicode.
+as Unicodeness cannot be avoided, the data is (mostly) transparently upgraded
+to Unicode. There are some problems--see L<perlunicode/The "Unicode Bug">.
Internally, Perl currently uses either whatever the native eight-bit
character set of the platform (for example Latin-1) is, defaulting to
Perl 5.8.0 also supports Unicode on EBCDIC platforms. There,
Unicode support is somewhat more complex to implement since
-additional conversions are needed at every step. Some problems
-remain, see L<perlebcdic> for details.
+additional conversions are needed at every step.
+
+Later Perl releases have added code that will not work on EBCDIC platforms, and
+no one has complained, so the divergence has continued. If you want to run
+Perl on an EBCDIC platform, send email to perlbug@perl.org.
-In any case, the Unicode support on EBCDIC platforms is better than
-in the 5.6 series, which didn't work much at all for EBCDIC platform.
On EBCDIC platforms, the internal Unicode encoding form is UTF-EBCDIC
instead of UTF-8. The difference is that as UTF-8 is "ASCII-safe" in
that ASCII characters encode to UTF-8 as-is, while UTF-EBCDIC is
CAPITAL LETTER As should be considered equal, or even As of any case.
The long answer is that you need to consider character normalization
-and casing issues: see L<Unicode::Normalize>, Unicode Technical
-Reports #15 and #21, I<Unicode Normalization Forms> and I<Case
-Mappings>, L<http://www.unicode.org/unicode/reports/tr15/> and
-L<http://www.unicode.org/unicode/reports/tr21/>
+and casing issues: see L<Unicode::Normalize>, Unicode Technical Report #15,
+L<Unicode Normalization Forms|http://www.unicode.org/unicode/reports/tr15> and
+sections on case mapping in the L<Unicode Standard|http://www.unicode.org>.
As of Perl 5.8.0, the "Full" case-folding of I<Case
-Mappings/SpecialCasing> is implemented.
+Mappings/SpecialCasing> is implemented, but bugs remain in C<qr//i> with them.
=item *
You shouldn't have to care. But you may, because currently the semantics of the
characters whose ordinals are in the range 128 to 255 is different depending on
whether the string they are contained within is in Unicode or not.
-(See L<perlunicode>.)
+(See L<perlunicode/When Unicode Does Not Happen>.)
To determine if a string is in Unicode, use:
=head3 Unicode
B<Unicode> is a character set with room for lots of characters. The ordinal
-value of a character is called a B<code point>.
+value of a character is called a B<code point>. (But in practice, the
+distinction between code point and character is blurred, so the terms often
+are used interchangeably.)
-There are many, many code points, but computers work with bytes, and a byte can
-have only 256 values. Unicode has many more characters, so you need a method
-to make these accessible.
+There are many, many code points, but computers work with bytes, and a byte has
+room for only 256 values. Unicode has many more characters, so you need a
+method to make these accessible.
Unicode is encoded using several competing encodings, of which UTF-8 is the
most used. In a Unicode encoding, multiple subsequent bytes can be used to
Code Points 1st Byte 2nd Byte 3rd Byte 4th Byte
U+0000..U+007F 00..7F
- U+0080..U+07FF C2..DF 80..BF
+ U+0080..U+07FF * C2..DF 80..BF
U+0800..U+0FFF E0 * A0..BF 80..BF
U+1000..U+CFFF E1..EC 80..BF 80..BF
- U+D000..U+D7FF ED * 80..9F 80..BF
+ U+D000..U+D7FF ED 80..9F 80..BF
U+D800..U+DFFF +++++++ utf16 surrogates, not legal utf8 +++++++
U+E000..U+FFFF EE..EF 80..BF 80..BF
U+10000..U+3FFFF F0 * 90..BF 80..BF 80..BF
U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
U+100000..U+10FFFF F4 80..8F 80..BF 80..BF
-Note the gaps before the 2nd Byte entries above marked by '*'. These are
+Note the gaps before several of the byte entries above marked by '*'. These are
caused by legal UTF-8 avoiding non-shortest encodings: it is technically
possible to UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always be used
00000dddccccccbbbbbbaaaaaa 11110ddd 10cccccc 10bbbbbb 10aaaaaa
As you can see, the continuation bytes all begin with C<10>, and the
-leading bits of the start byte tell how many bytes the are in the
+leading bits of the start byte tell how many bytes there are in the
encoded character.
*/