=head2 Scripts
The scripts available via C<\p{...}> and C<\P{...}>, for example
-C<\p{Latin}> or \p{Cyrillic>, are as follows:
+C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:
Arabic
Armenian
ID_Continue ID_Start + Mn + Mc + Nd + Pc
Any Any character
- Assigned Any non-Cn character (i.e. synonym for C<\P{Cn}>)
- Unassigned Synonym for C<\p{Cn}>
+ Assigned Any non-Cn character (i.e. synonym for \P{Cn})
+ Unassigned Synonym for \p{Cn}
Common Any character (or unassigned code point)
not explicitly assigned to a script
=back
+=head2 User-defined Character Properties
+
+You can define your own character properties by defining subroutines
+that have names beginning with "In" or "Is". The subroutines must be
+visible in the package that uses the properties. The user-defined
+properties can be used in the regular expression C<\p> and C<\P>
+constructs.
+
+The subroutines must return a specially formatted string: one or more
+newline-separated lines. Each line must be one of the following:
+
+=over 4
+
+=item *
+
+Two hexadecimal numbers separated by horizontal whitespace (space or
+tab characters) denoting a range of Unicode codepoints to include.
+
+=item *
+
+Something to include, prefixed by "+": either a built-in character
+property (prefixed by "utf8::"), for all the characters in that
+property; or two hexadecimal codepoints for a range; or a single
+hexadecimal codepoint.
+
+=item *
+
+Something to exclude, prefixed by "-": either an existing character
+property (prefixed by "utf8::"), for all the characters in that
+property; or two hexadecimal codepoints for a range; or a single
+hexadecimal codepoint.
+
+=item *
+
+Something to negate, prefixed by "!": either an existing character
+property (prefixed by "utf8::") for all the characters except the
+characters in the property; or two hexadecimal codepoints for a range;
+or a single hexadecimal codepoint.
+
+=back
+
+For example, to define a property that covers both the Japanese
+syllabaries (hiragana and katakana), you can define
+
+ sub InKana {
+ return <<END;
+ 3040\t309F
+ 30A0\t30FF
+ END
+ }
+
+Imagine that the here-doc end marker is at the beginning of the line.
+Now you can use C<\p{InKana}> and C<\P{InKana}>.
+
+You could also have used the existing block property names:
+
+ sub InKana {
+ return <<'END';
+ +utf8::InHiragana
+ +utf8::InKatakana
+ END
+ }
+
+Suppose you wanted to match only the allocated characters,
+not the raw block ranges: in other words, you want to remove
+the unassigned code points (general category Cn):
+
+ sub InKana {
+ return <<'END';
+ +utf8::InHiragana
+ +utf8::InKatakana
+ -utf8::IsCn
+ END
+ }
+
+The negation is useful for defining (surprise!) negated classes.
+
+ sub InNotKana {
+ return <<'END';
+ !utf8::InHiragana
+ -utf8::InKatakana
+ +utf8::IsCn
+ END
+ }
+
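The definitions above can be exercised with a short script. This is only
a sketch: the helper names C<is_kana> and C<is_assigned_kana>, the property
name C<InAssignedKana>, and the sample code points are chosen here for
illustration and are not part of any API. It assumes that U+3040 lies
inside the Hiragana block but is unassigned, so it should match the raw
ranges and not the variant that subtracts C<utf8::IsCn>.

```perl
use strict;
use warnings;

# Both kana blocks as raw code point ranges (hex, tab-separated).
# The here-doc is interpolating, so \t becomes a real tab.
sub InKana {
    return <<"END";
3040\t309F
30A0\t30FF
END
}

# The same blocks minus unassigned code points (general category Cn).
# InAssignedKana is an illustrative name, not a built-in property.
sub InAssignedKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

# Hypothetical helpers: return 1 if the single character matches, else 0.
sub is_kana          { return $_[0] =~ /\A\p{InKana}\z/         ? 1 : 0 }
sub is_assigned_kana { return $_[0] =~ /\A\p{InAssignedKana}\z/ ? 1 : 0 }

# HIRAGANA LETTER A, KATAKANA LETTER A, an unassigned code point in the
# Hiragana block, and LATIN SMALL LETTER A.
for my $cp (0x3042, 0x30A2, 0x3040, 0x0061) {
    printf "U+%04X kana=%d assigned_kana=%d\n",
        $cp, is_kana(chr $cp), is_assigned_kana(chr $cp);
}
```

Both subroutines live in the same package as the patterns that use them,
which is what makes C<\p{InKana}> and C<\p{InAssignedKana}> resolvable.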
=head2 Character encodings for input and output
See L<Encode>.
[ 3] . \p{...} \P{...}
[ 4] now scripts (see UTR#24 Script Names) in addition to blocks
[ 5] have negation
- [ 6] can use look-ahead to emulate subtraction (*)
+ [ 6] can use regular expression look-ahead [a]
+ or user-defined character properties [b] to emulate subtraction
[ 7] include Letters in word characters
[ 8] note that perl does Full casefolding in matching, not Simple:
for example U+1F88 is equivalent with U+1F00 U+03B9,
(should also affect <>, $., and script line numbers)
(the \x{85}, \x{2028} and \x{2029} do match \s)
-(*) You can mimic class subtraction using lookahead.
+[a] You can mimic class subtraction using lookahead.
For example, what TR18 might write as
[{Greek}-[{UNASSIGNED}]]
which will match assigned characters known to be part of the Greek script.
+[b] See L</User-defined Character Properties>.
+
=item *
Level 2 - Extended Unicode Support
UCS-2, UCS-4
Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
-encoding, UCS-4 is a 32-bit encoding. Unlike UTF-16, UCS-2
-is not extensible beyond 0xFFFF, because it does not use surrogates.
+encoding. Unlike UTF-16, UCS-2 is not extensible beyond 0xFFFF,
+because it does not use surrogates. UCS-4 is a 32-bit encoding,
+functionally identical to UTF-32.
=item *
not think that LATIN SMALL LETTER ETH is a letter (unless you happen
to speak Icelandic), but Unicode does.
-As discussed elsewhere, Perl tries to stand one leg (two legs, being
-a quadrupled camel?) in two worlds: the old world of byte and the new
+As discussed elsewhere, Perl tries to stand with one leg (two legs, as
+camels are quadrupeds?) in each of two worlds: the old world of bytes and the new
world of characters, upgrading from bytes to characters when necessary.
If your legacy code is not explicitly using Unicode, no automatic
switchover to characters should happen, and characters shouldn't get
So if you're working with Unicode data, consult the documentation of
every module you're using if there are any issues with Unicode data
exchange. If the documentation does not talk about Unicode at all,
-suspect the worst and probably look at the source how the module is
-implemented. Modules written completely in perl shouldn't cause
-problems. Modules that directly or indirectly access code written in
-other programming languages are at risk.
+suspect the worst and probably look at the source to learn how the
+module is implemented. Modules written completely in Perl shouldn't
+cause problems. Modules that directly or indirectly access code written
+in other programming languages are at risk.
For affected functions the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be in a position to use the otherwise
dangerous Encode::_utf8_on() function. Let's say the popular
-<Foo::Bar> extension, written in C, provides a C<param> method that
+C<Foo::Bar> extension, written in C, provides a C<param> method that
lets you store and retrieve data according to these prototypes:
$self->param($name, $value); # set a scalar
}
}
-Some extensions provide filters on data entry/exit points, as e.g.
-DB_File::filter_store_key and family. Watch out for such filters in
-the documentations of your extensions, they can make the transition to
+Some extensions provide filters on data entry/exit points, such as
+DB_File::filter_store_key and family. Look out for such filters in
+the documentation of your extensions; they can make the transition to
Unicode data much easier.
=head2 speed