=back
-=head2 Defining your own character properties
+=head2 User-defined Character Properties
You can define your own character properties by defining subroutines
that have names beginning with "In" or "Is". The subroutines must be
=item *
-Two hexadecimal numbers separated by a tabulator denoting a range
-of Unicode codepoints.
+Two hexadecimal numbers separated by horizontal whitespace (space or
+tabulator characters) denoting a range of Unicode codepoints to include.
=item *
-An existing character property prefixed by "+utf8::" to include
-all the characters in that property.
+Something to include, prefixed by "+": either a built-in character
+property (prefixed by "utf8::"), for all the characters in that
+property; or two hexadecimal codepoints for a range; or a single
+hexadecimal codepoint.
=item *
-An existing character property prefixed by "-utf8::" to exclude
-all the characters in that property.
+Something to exclude, prefixed by "-": either an existing character
+property (prefixed by "utf8::"), for all the characters in that
+property; or two hexadecimal codepoints for a range; or a single
+hexadecimal codepoint.
=item *
-An existing character property prefixed by "!utf8::" to include
-all except the characters in that property.
+Something to negate, prefixed by "!": either an existing character
+property (prefixed by "utf8::"), for all the characters except the
+characters in that property; or two hexadecimal codepoints for a range;
+or a single hexadecimal codepoint.
=back
syllabaries (hiragana and katakana), you can define
sub InKana {
- return <<'END';
- 3040 309F
- 30A0 30FF
+ return <<END;
+ 3040\t309F
+ 30A0\t30FF
END
}
-Imagine that the here-doc end marker is at the beginning of the line,
-and that the hexadecimal numbers are separated by a tabulator.
-Now you can use C<\p{InKana}> and C<\P{IsKana}>.
+Imagine that the here-doc end marker is at the beginning of the line.
+Now you can use C<\p{InKana}> and C<\P{InKana}>.
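As a minimal, self-contained sketch of the above (the test strings and
printed messages are illustrative assumptions, not part of the original
text), the property can be exercised like this:

```perl
use strict;
use warnings;

# User-defined property covering the hiragana and katakana block ranges.
# The here-doc is interpolating, so each \t becomes a real tabulator.
sub InKana {
    return <<"END";
3040\t309F
30A0\t30FF
END
}

my $hiragana_a = "\x{3042}";    # HIRAGANA LETTER A, inside the first range

# \p{InKana} matches characters inside the listed ranges.
print "U+3042 is kana\n" if $hiragana_a =~ /\p{InKana}/;

# \P{InKana} matches characters outside them.
print "'A' is not kana\n" if "A" =~ /\P{InKana}/;
```

The subroutine is looked up by name in the package where the regular
expression is compiled, so it is defined here in the same package as the
matches that use it.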
You could also have used the existing block property names:
}
Suppose you wanted to match only the allocated characters,
-not the by raw block ranges: in other words, you want to remove
+not the raw block ranges: in other words, you want to remove
the non-characters:
sub InKana {
[ 3] . \p{...} \P{...}
[ 4] now scripts (see UTR#24 Script Names) in addition to blocks
[ 5] have negation
- [ 6] can use look-ahead to emulate subtraction (*)
+ [ 6] can use regular expression look-ahead [a]
+ or user-defined character properties [b] to emulate subtraction
[ 7] include Letters in word characters
[ 8] note that perl does Full casefolding in matching, not Simple:
for example U+1F88 is equivalent with U+1F00 U+03B9,
(should also affect <>, $., and script line numbers)
(the \x{85}, \x{2028} and \x{2029} do match \s)
-(*) You can mimic class subtraction using lookahead.
+[a] You can mimic class subtraction using lookahead.
For example, what TR18 might write as
[{Greek}-[{UNASSIGNED}]]
which will match assigned characters known to be part of the Greek script.
+[b] See L</User-defined Character Properties>.
+
=item *
Level 2 - Extended Unicode Support
UCS-2, UCS-4
Encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
-encoding, UCS-4 is a 32-bit encoding. Unlike UTF-16, UCS-2
-is not extensible beyond 0xFFFF, because it does not use surrogates.
+encoding. Unlike UTF-16, UCS-2 is not extensible beyond 0xFFFF,
+because it does not use surrogates. UCS-4 is a 32-bit encoding,
+functionally identical to UTF-32.
=item *