=back
+=head2 User-defined Character Properties
+
+You can define your own character properties by defining subroutines
+that have names beginning with "In" or "Is". The subroutines must be
+visible in the package that uses the properties. The user-defined
+properties can be used in the regular expression C<\p> and C<\P>
+constructs.
+
+The subroutines must return a specially formatted string: one or more
+newline-separated lines. Each line must be one of the following:
+
+=over 4
+
+=item *
+
+Two hexadecimal numbers separated by horizontal whitespace (space or
+tab characters) denoting a range of Unicode codepoints to include.
+
+=item *
+
+Something to include, prefixed by "+": either a built-in character
+property (prefixed by "utf8::"), for all the characters in that
+property; or two hexadecimal codepoints for a range; or a single
+hexadecimal codepoint.
+
+=item *
+
+Something to exclude, prefixed by "-": either an existing character
+property (prefixed by "utf8::"), for all the characters in that
+property; or two hexadecimal codepoints for a range; or a single
+hexadecimal codepoint.
+
+=item *
+
+Something to negate, prefixed by "!": either an existing character
+property (prefixed by "utf8::") for all the characters except the
+characters in the property; or two hexadecimal codepoints for a range;
+or a single hexadecimal codepoint.
+
+=back
+
+For example, to define a property that covers both the Japanese
+syllabaries (hiragana and katakana), you can define
+
+    sub InKana {
+        return <<END;
+    3040\t309F
+    30A0\t30FF
+    END
+    }
+
+Imagine that the here-doc end marker is at the beginning of the line.
+Now you can use C<\p{InKana}> and C<\P{InKana}>.
+
+You could also have used the existing block property names:
+
+    sub InKana {
+        return <<'END';
+    +utf8::InHiragana
+    +utf8::InKatakana
+    END
+    }
+
+Suppose you wanted to match only the allocated characters,
+not the raw block ranges: in other words, you want to remove
+the unassigned codepoints (general category C<Cn>):
+
+    sub InKana {
+        return <<'END';
+    +utf8::InHiragana
+    +utf8::InKatakana
+    -utf8::IsCn
+    END
+    }
+
+The negation is useful for defining (surprise!) negated classes.
+
+    sub InNotKana {
+        return <<'END';
+    !utf8::InHiragana
+    -utf8::InKatakana
+    +utf8::IsCn
+    END
+    }
+
=head2 Character encodings for input and output
See L<Encode>.
[ 3] . \p{...} \P{...}
[ 4] now scripts (see UTR#24 Script Names) in addition to blocks
[ 5] have negation
- [ 6] can use look-ahead to emulate subtraction (*)
+ [ 6] can use regular expression look-ahead [a]
+ or user-defined character properties [b] to emulate subtraction
[ 7] include Letters in word characters
[ 8] note that perl does Full casefolding in matching, not Simple:
for example U+1F88 is equivalent to U+1F00 U+03B9,
(should also affect <>, $., and script line numbers)
(the \x{85}, \x{2028} and \x{2029} do match \s)
-(*) You can mimic class subtraction using lookahead.
+[a] You can mimic class subtraction using lookahead.
For example, what TR18 might write as
[{Greek}-[{UNASSIGNED}]]
which will match assigned characters known to be part of the Greek script.
+[b] See L</User-defined Character Properties>.
+
=item *
Level 2 - Extended Unicode Support
UCS-2, UCS-4
Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
-encoding, UCS-4 is a 32-bit encoding. Unlike UTF-16, UCS-2
-is not extensible beyond 0xFFFF, because it does not use surrogates.
+encoding. Unlike UTF-16, UCS-2 is not extensible beyond 0xFFFF,
+because it does not use surrogates. UCS-4 is a 32-bit encoding,
+functionally identical to UTF-32.
=item *
to speak Icelandic), but Unicode does.
As discussed elsewhere, Perl tries to stand one leg (two legs, as
-camels are quadrupeds?) in two worlds: the old world of byte and the new
+camels are quadrupeds?) in two worlds: the old world of bytes and the new
world of characters, upgrading from bytes to characters when necessary.
If your legacy code is not explicitly using Unicode, no automatic
switchover to characters should happen, and characters shouldn't get