You can also use the C<encoding> pragma to change the default encoding
of the data in your script; see L<encoding>.
+=item C<use encoding> needed to upgrade non-Latin-1 byte strings
+
+By default, there is a fundamental asymmetry in Perl's Unicode model:
+implicit upgrading from byte strings to Unicode strings assumes that
+they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
+downgraded with UTF-8 encoding. This happens because the first 256
+code points in Unicode happen to agree with Latin-1.
+
+If you wish to interpret byte strings as UTF-8 instead, use the
+C<encoding> pragma:
+
+ use encoding 'utf8';
+
+See L</"Byte and Character Semantics"> for more details.
+
=back
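The asymmetry described for C<use encoding> can be seen in a minimal sketch. It assumes the C<Encode> module is available and uses C<Encode::decode> as an explicit alternative to the pragma; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two bytes that happen to be the UTF-8 encoding of U+00E9.
my $bytes = "\xC3\xA9";
# A string containing a code point above 255, so it has character semantics.
my $wide  = "\x{263A}";

# Implicit upgrade: each byte is treated as one Latin-1 character.
my $implicit = $bytes . $wide;

# Explicit decoding: the two bytes become the single character U+00E9.
my $explicit = decode('UTF-8', $bytes) . $wide;

printf "implicit: %d characters, first is U+%04X\n",
    length($implicit), ord(substr($implicit, 0, 1));
printf "explicit: %d characters, first is U+%04X\n",
    length($explicit), ord(substr($explicit, 0, 1));
```

Decoding explicitly at the point where the bytes enter the program keeps the choice of encoding visible in the code.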
=head2 Byte and Character Semantics
be used to force byte semantics on Unicode data.
If strings operating under byte semantics and strings with Unicode
-character data are concatenated, the new string will be upgraded to
-I<ISO 8859-1 (Latin-1)>, even if the old Unicode string used EBCDIC.
-This translation is done without regard to the system's native 8-bit
-encoding, so to change this for systems with non-Latin-1 and
-non-EBCDIC native encodings use the C<encoding> pragma. See
-L<encoding>.
+character data are concatenated, the new string will be created by
+decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
+old Unicode string used EBCDIC. This translation is done without
+regard to the system's native 8-bit encoding. To change this for
+systems with non-Latin-1 and non-EBCDIC native encodings, use the
+C<encoding> pragma. See L<encoding>.
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.
+(However, as a limitation of the current implementation, using
+C<\w> or C<\W> I<inside> a C<[...]> character class will still match
+with byte semantics.)
+
=item *
Named Unicode properties, scripts, and block ranges may be used like
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.
+B<NOTE: the properties, scripts, and blocks listed here are as of
+Unicode 3.2.0, March 2002, or Perl 5.8.0, July 2002. Unicode 4.0.0
+came out in April 2003, and Perl 5.8.1 in September 2003.>
+
Here are the basic Unicode General Category properties, followed by their
long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
for instance, are identical.
Short Long
L Letter
+ LC CasedLetter
Lu UppercaseLetter
Ll LowercaseLetter
Lt TitlecaseLetter
Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
-C<L&> is a special case, which is an alias for C<Ll>, C<Lu>, and C<Lt>.
+C<LC> and C<L&> are special cases, which are aliases for the set of
+C<Ll>, C<Lu>, and C<Lt>.
Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
supported.
Because scripts differ in their directionality--Hebrew is
-written right to left, for example--Unicode supplies these properties:
+written right to left, for example--Unicode supplies these properties in
+the BidiClass class:
Property Meaning
- BidiL Left-to-Right
- BidiLRE Left-to-Right Embedding
- BidiLRO Left-to-Right Override
- BidiR Right-to-Left
- BidiAL Right-to-Left Arabic
- BidiRLE Right-to-Left Embedding
- BidiRLO Right-to-Left Override
- BidiPDF Pop Directional Format
- BidiEN European Number
- BidiES European Number Separator
- BidiET European Number Terminator
- BidiAN Arabic Number
- BidiCS Common Number Separator
- BidiNSM Non-Spacing Mark
- BidiBN Boundary Neutral
- BidiB Paragraph Separator
- BidiS Segment Separator
- BidiWS Whitespace
- BidiON Other Neutrals
-
-For example, C<\p{BidiR}> matches characters that are normally
+ L Left-to-Right
+ LRE Left-to-Right Embedding
+ LRO Left-to-Right Override
+ R Right-to-Left
+ AL Right-to-Left Arabic
+ RLE Right-to-Left Embedding
+ RLO Right-to-Left Override
+ PDF Pop Directional Format
+ EN European Number
+ ES European Number Separator
+ ET European Number Terminator
+ AN Arabic Number
+ CS Common Number Separator
+ NSM Non-Spacing Mark
+ BN Boundary Neutral
+ B Paragraph Separator
+ S Segment Separator
+ WS Whitespace
+ ON Other Neutrals
+
+For example, C<\p{BidiClass:R}> matches characters that are normally
written right to left.
=back
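As a small sketch of the C<BidiClass> properties in use, the following checks one Hebrew and one Latin letter. The sample characters and variable names are chosen here for illustration.

```perl
use strict;
use warnings;

my $alef  = "\x{05D0}";   # HEBREW LETTER ALEF, normally written right to left
my $latin = "A";          # LATIN CAPITAL LETTER A, normally written left to right

my $alef_is_r  = $alef  =~ /\p{BidiClass:R}/ ? 1 : 0;
my $latin_is_r = $latin =~ /\p{BidiClass:R}/ ? 1 : 0;
my $latin_is_l = $latin =~ /\p{BidiClass:L}/ ? 1 : 0;

print "alef  BidiClass:R? $alef_is_r\n";
print "latin BidiClass:R? $latin_is_r\n";
print "latin BidiClass:L? $latin_is_l\n";
```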
Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
-C<chop()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
+C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. Operators that
specifically do not switch include C<vec()>, C<pack()>, and
-C<unpack()>. Operators that really don't care include C<chomp()>,
+C<unpack()>. Operators that really don't care include
operators that treats strings as a bucket of bits such as C<sort()>,
and operators dealing with filenames.
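The switch to character positions can be illustrated with a short sketch. The string and variable names are examples only; C<use bytes> is used just to show the byte count for comparison.

```perl
use strict;
use warnings;

# Three ASCII characters followed by U+263A, which takes three bytes in UTF-8.
my $s = "abc\x{263A}";

my $char_len = length($s);                    # counts characters
my $byte_len = do { use bytes; length($s) };  # counts bytes of the internal form
my $pos      = index($s, "\x{263A}");         # a character position

my $copy = $s;
chop($copy);                                  # removes one whole character

print "characters: $char_len, bytes: $byte_len, index: $pos\n";
print "after chop: $copy\n";
```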
=head2 User-Defined Character Properties
You can define your own character properties by defining subroutines
-whose names begin with "In" or "Is". The subroutines must be defined
-in the C<main> package. The user-defined properties can be used in the
-regular expression C<\p> and C<\P> constructs. Note that the effect
-is compile-time and immutable once defined.
+whose names begin with "In" or "Is". The subroutines can be defined in
+any package. The user-defined properties can be used in the regular
+expression C<\p> and C<\P> constructs; if you are using a user-defined
+property from a package other than the one you are in, you must specify
+its package in the C<\p> or C<\P> construct.
+
+ # assuming property IsForeign defined in Lang::
+ package main; # property package name required
+ if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
+
+ package Lang; # property package name not required
+ if ($txt =~ /\p{IsForeign}+/) { ... }
+
+
+Note that the effect is compile-time and immutable once defined.
The subroutines must return a specially-formatted string, with one
or more newline-separated lines. Each line must be one of the following:
=item *
Something to include, prefixed by "+": a built-in character
-property (prefixed by "utf8::"), to represent all the characters in that
-property; two hexadecimal code points for a range; or a single
-hexadecimal code point.
+property (prefixed by "utf8::") or a user-defined character property,
+to represent all the characters in that property; two hexadecimal code
+points for a range; or a single hexadecimal code point.
=item *
Something to exclude, prefixed by "-": an existing character
-property (prefixed by "utf8::"), for all the characters in that
-property; two hexadecimal code points for a range; or a single
-hexadecimal code point.
+property (prefixed by "utf8::") or a user-defined character property,
+to represent all the characters in that property; two hexadecimal code
+points for a range; or a single hexadecimal code point.
=item *
Something to negate, prefixed "!": an existing character
-property (prefixed by "utf8::") for all the characters except the
-characters in the property; two hexadecimal code points for a range;
-or a single hexadecimal code point.
+property (prefixed by "utf8::") or a user-defined character property,
+for all the characters except the characters in the property; two
+hexadecimal code points for a range; or a single hexadecimal code point.
+
+=item *
+
+Something to intersect with, prefixed by "&": an existing character
+property (prefixed by "utf8::") or a user-defined character property,
+for all the characters in the property; two hexadecimal code points
+for a range; or a single hexadecimal code point.
=back
END
}
+Intersection is useful for getting the common characters matched by
+two (or more) classes.
+
+ sub InFooAndBar {
+ return <<'END';
+ +main::Foo
+ &main::Bar
+ END
+ }
+
+It's important to remember not to use "&" for the first set -- that
+would be intersecting with nothing (resulting in an empty set).
+
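A self-contained sketch of intersection follows. The three property names (C<InLatinLetter>, C<InBelow60>, C<InUpperLatin>) and their ranges are hypothetical, invented only for this illustration; all three are defined in C<main>.

```perl
use strict;
use warnings;

# Hypothetical property: the ASCII letters A-Z and a-z.
sub InLatinLetter { return "0041\t005A\n0061\t007A\n" }

# Hypothetical property: every code point below U+0060.
sub InBelow60 { return "0000\t005F\n" }

# Start from one set with "+", then intersect with the other using "&".
sub InUpperLatin {
    return "+main::InLatinLetter\n&main::InBelow60\n";
}

my %matched;
for my $ch ("A", "a", "1") {
    $matched{$ch} = $ch =~ /\p{InUpperLatin}/ ? 1 : 0;
    print "$ch: $matched{$ch}\n";
}
```

Only "A" is in both sets: "a" lies above U+005F, and "1" is not a letter.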
You can also define your own mappings to be used in the lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
The principle is the same: define subroutines in the C<main> package
The following list of Unicode support for regular expressions describes
all the features currently supported. The references to "Level N"
and the section numbers refer to the Unicode Technical Report 18,
-"Unicode Regular Expression Guidelines".
+"Unicode Regular Expression Guidelines", version 6 (Unicode 3.2.0,
+Perl 5.8.0).
=over 4
[ 1] \x{...}
[ 2] \N{...}
[ 3] . \p{...} \P{...}
- [ 4] now scripts (see UTR#24 Script Names) in addition to blocks
+ [ 4] support for scripts (see UTR#24 Script Names), blocks,
+ binary properties, enumerated non-binary properties, and
+ numeric properties (as listed in UTR#18 Other Properties)
[ 5] have negation
[ 6] can use regular expression look-ahead [a]
or user-defined character properties [b] to emulate subtraction
capital letters with certain modifiers: the Full case-folding
decomposes the letter, while the Simple case-folding would map
it to a single character.
- [ 9] see UTR#13 Unicode Newline Guidelines
+ [ 9] see UTR #13 Unicode Newline Guidelines
[10] should do ^ and $ also on \x{85}, \x{2028} and \x{2029}
(should also affect <>, $., and script line numbers)
(the \x{85}, \x{2028} and \x{2029} do match \s)
[a] You can mimic class subtraction using lookahead.
-For example, what TR18 might write as
+For example, what UTR #18 might write as
[{Greek}-[{UNASSIGNED}]]
which will match assigned characters known to be part of the Greek script.
+Also see the Unicode::Regex::Set module, which implements the full
+UTR #18 grouping, intersection, union, and removal (subtraction) syntax.
+
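The lookahead technique in [a] is not specific to Unicode properties. Here is a minimal sketch using plain ASCII classes, chosen for illustration, that subtracts the vowels from C<[a-z]>:

```perl
use strict;
use warnings;

# "[a-z] minus [aeiou]": reject a vowel with a negative lookahead,
# then accept any lowercase ASCII letter.
my $consonant = qr/(?![aeiou])[a-z]/;

my @hits   = "unicode" =~ /$consonant/g;
my $joined = join '', @hits;

print "consonants: $joined\n";
```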
[b] See L</"User-Defined Character Properties">.
=item *
=item *
-UTF-16, UTF-16BE, UTF16-LE, Surrogates, and BOMs (Byte Order Marks)
+UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
The followings items are mostly for reference and general Unicode
knowledge, Perl doesn't use these constructs internally.
=item *
-UTF-32, UTF-32BE, UTF32-LE
+UTF-32, UTF-32BE, UTF-32LE
The UTF-32 family is pretty much like the UTF-16 family, expect that
the units are 32-bit, and therefore the surrogate scheme is not
=back
+=head2 When Unicode Does Not Happen
+
+While Perl does have extensive ways to input and output in Unicode,
+and a few other 'entry points' like @ARGV that can be interpreted
+as Unicode (UTF-8), there are still many places where Unicode (in some
+encoding or another) could be given as arguments or received as
+results, or both, but is not.
+
+The following are such interfaces. For all of these interfaces Perl
+currently (as of 5.8.3) simply assumes byte strings both as arguments
+and results, or UTF-8 strings if the C<encoding> pragma has been used.
+
+One reason why Perl does not attempt to resolve the role of Unicode in
+these cases is that the answers are highly dependent on the operating
+system and the file system(s). For example, whether filenames can be
+in Unicode, and in exactly what kind of encoding, is not exactly a
+portable concept. Similarly for C<qx> and C<system>: how well will the
+'command line interface' (and which of them?) handle Unicode?
+
+=over 4
+
+=item *
+
+chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
+rename, rmdir, stat, symlink, truncate, unlink, utime, -X
+
+=item *
+
+%ENV
+
+=item *
+
+glob (aka the <*>)
+
+=item *
+
+open, opendir, sysopen
+
+=item *
+
+qx (aka the backtick operator), system
+
+=item *
+
+readdir, readlink
+
+=back
+
+=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)
+
+Sometimes (see L</"When Unicode Does Not Happen">) there are
+situations where you simply need to force Perl to believe that a byte
+string is UTF-8, or vice versa. The low-level calls
+utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
+the answers.
+
+Do not use them without careful thought, though: Perl may easily get
+very confused, angry, or even crash, if you suddenly change the 'nature'
+of a scalar like that. Be especially careful if you use
+C<utf8::upgrade()>: not every byte string is valid UTF-8.
+
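The following sketch shows what the two calls do to a scalar's internal flag. The variable names are illustrative, and C<utf8::is_utf8> is used here only to inspect the flag. In this sketch the characters themselves are unchanged by the upgrade; only the internal representation differs.

```perl
use strict;
use warnings;

my $s = "\xE9";                             # one character, stored as one byte

my $flag_before = utf8::is_utf8($s) ? 1 : 0;
utf8::upgrade($s);                          # switch to the UTF-8 internal form
my $flag_after  = utf8::is_utf8($s) ? 1 : 0;

# The string still holds the same single character.
my $unchanged = ($s eq "\xE9" && length($s) == 1) ? 1 : 0;

# Downgrading works only if every character fits in one byte.
my $down_ok = utf8::downgrade($s) ? 1 : 0;

# With a true second argument, a failed downgrade returns false
# instead of dying.
my $wide    = "\x{263A}";
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "flag before: $flag_before, after: $flag_after\n";
print "unchanged: $unchanged, downgrade ok: $down_ok, wide ok: $wide_ok\n";
```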
=head2 Using Unicode in XS
If you want to handle Perl Unicode in XS extensions, you may find the
=item *
-C<uvuni_to_utf8(buf, chr>) writes a Unicode character code point into
+C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes.
In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
-somewhat less spectacular. Operations with UTF-8 encoded strings are
-still slower, though.
+somewhat less spectacular, at least for some operations. In general,
+operations with UTF-8 encoded strings are still slower. As an example,
+the Unicode properties (character classes) like C<\p{Nd}> are known to
+be quite a bit slower (5-20 times) than their simpler counterparts
+like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
+compared with the 10 ASCII characters matching C<\d>).
=head2 Porting code from perl-5.6.X